Flexible RDF generation from RDF and heterogeneous data sources with SPARQL-Generate
Abstract
RDF aims to be the universal abstract data model for structured
data on the Web. While there are efforts to convert data to RDF, the vast majority
of data available on the Web does not conform to RDF. Indeed, exposing data
in RDF, either natively or through wrappers, can be very costly. In this context,
transformation or mapping languages that define how RDF is generated from
non-RDF data offer an efficient solution. Furthermore, the declarative nature of
these solutions makes them easy to adapt to any change in the input data model
or in the output knowledge model. This paper introduces one such transformation
language (SPARQL-Generate), an extension of SPARQL for querying not
only RDF datasets but also documents in arbitrary formats. Its implementation on
top of Apache Jena currently covers use cases from related work and more, and
makes it possible to query and transform web documents in XML, JSON, CSV, HTML,
CBOR, and plain text with regular expressions.