
Generating public transport data based on population distributions for RDF benchmarking

Ruben Taelman (UGent), Pieter Colpaert (UGent), Erik Mannens (UGent) and Ruben Verborgh (UGent)
(2019) SEMANTIC WEB. 10(2). p.305-328
Abstract
When benchmarking RDF data management systems such as public transport route planners, system evaluation needs to happen under various realistic circumstances, which requires a wide range of datasets with different properties. Real-world datasets are almost ideal, as they offer these realistic circumstances, but they are often hard to obtain and inflexible for testing. For these reasons, synthetic dataset generators are typically preferred over real-world datasets due to their intrinsic flexibility. Unfortunately, many synthetic datasets that are generated within benchmarks are insufficiently realistic, raising questions about the generalizability of benchmark results to real-world scenarios. In order to benchmark geospatial and temporal RDF data management systems such as route planners with sufficient external validity and depth, we designed PODiGG, a highly configurable generation algorithm for synthetic public transport datasets with realistic geospatial and temporal characteristics comparable to those of their real-world variants. The algorithm is inspired by real-world public transit network design and scheduling methodologies. This article discusses the design and implementation of PODiGG and validates the properties of its generated datasets. Our findings show that the generator achieves a sufficient level of realism, based on the existing coherence metric and new metrics we introduce specifically for the public transport domain. Thereby, PODiGG provides a flexible foundation for benchmarking RDF data management systems with geospatial and temporal data.
Keywords
Public Transport, dataset generator, benchmarking, RDF, linked data

Citation

MLA
Taelman, Ruben, et al. “Generating Public Transport Data Based on Population Distributions for RDF Benchmarking.” SEMANTIC WEB, vol. 10, no. 2, IOS Press, 2019, pp. 305–28, doi:10.3233/SW-180319.
APA
Taelman, R., Colpaert, P., Mannens, E., & Verborgh, R. (2019). Generating public transport data based on population distributions for RDF benchmarking. SEMANTIC WEB, 10(2), 305–328. https://doi.org/10.3233/SW-180319
Chicago author-date
Taelman, Ruben, Pieter Colpaert, Erik Mannens, and Ruben Verborgh. 2019. “Generating Public Transport Data Based on Population Distributions for RDF Benchmarking.” SEMANTIC WEB 10 (2): 305–28. https://doi.org/10.3233/SW-180319.
Chicago author-date (all authors)
Taelman, Ruben, Pieter Colpaert, Erik Mannens, and Ruben Verborgh. 2019. “Generating Public Transport Data Based on Population Distributions for RDF Benchmarking.” SEMANTIC WEB 10 (2): 305–328. doi:10.3233/SW-180319.
Vancouver
1. Taelman R, Colpaert P, Mannens E, Verborgh R. Generating public transport data based on population distributions for RDF benchmarking. SEMANTIC WEB. 2019;10(2):305–28.
IEEE
[1] R. Taelman, P. Colpaert, E. Mannens, and R. Verborgh, “Generating public transport data based on population distributions for RDF benchmarking,” SEMANTIC WEB, vol. 10, no. 2, pp. 305–328, 2019.
@article{8612888,
  abstract     = {{When benchmarking RDF data management systems such as public transport route planners, system evaluation needs to happen under various realistic circumstances, which requires a wide range of datasets with different properties. Real-world datasets are almost ideal, as they offer these realistic circumstances, but they are often hard to obtain and inflexible for testing. For these reasons, synthetic dataset generators are typically preferred over real-world datasets due to their intrinsic flexibility. Unfortunately, many synthetic datasets that are generated within benchmarks are insufficiently realistic, raising questions about the generalizability of benchmark results to real-world scenarios. In order to benchmark geospatial and temporal RDF data management systems such as route planners with sufficient external validity and depth, we designed PODiGG, a highly configurable generation algorithm for synthetic public transport datasets with realistic geospatial and temporal characteristics comparable to those of their real-world variants. The algorithm is inspired by real-world public transit network design and scheduling methodologies. This article discusses the design and implementation of PODiGG and validates the properties of its generated datasets. Our findings show that the generator achieves a sufficient level of realism, based on the existing coherence metric and new metrics we introduce specifically for the public transport domain. Thereby, PODiGG provides a flexible foundation for benchmarking RDF data management systems with geospatial and temporal data.}},
  author       = {{Taelman, Ruben and Colpaert, Pieter and Mannens, Erik and Verborgh, Ruben}},
  issn         = {{1570-0844}},
  journal      = {{SEMANTIC WEB}},
  keywords     = {{Public Transport,dataset generator,benchmarking,RDF,linked data}},
  language     = {{eng}},
  number       = {{2}},
  pages        = {{305--328}},
  publisher    = {{IOS Press}},
  title        = {{Generating public transport data based on population distributions for RDF benchmarking}},
  url          = {{http://doi.org/10.3233/SW-180319}},
  volume       = {{10}},
  year         = {{2019}},
}
