Knowledge Graph Practice Notes, Part 3: Jena + GeoSPARQL

GeoSPARQL

GeoSPARQL: A Geographic Query Language for RDF Data. It supports representing and querying geospatial data on the Semantic Web. See the project homepage.

- Geospatial information is represented using high-level ontologies inspired by GIS terminology.
- Geometries are represented using literals of spatial datatypes.
- Literals are serialized using the OGC standards WKT and GML.
- Families of functions are offered for querying geometries.

It is a specification of the Open Geospatial Consortium (OGC). The relevant namespaces are:

- ogc: http://www.opengis.net/
- geo: http://www.opengis.net/ont/geosparql#
- geof: http://www.opengis.net/def/function/geosparql/
- geor: http://www.opengis.net/def/rule/geosparql/
- sf: http://www.opengis.net/ont/sf#
- gml: http://www.opengis.net/ont/gml#

A GeoSPARQL demo site is available at http://www.geosparql.org/. It appears to be the predecessor of the current OGC homepage.

Querying directly with latitude and longitude:

    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
    PREFIX spatial: <http://jena.apache.org/spatial#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT ?object
    WHERE {
      ?object geo:lat ?lat.
      ?object geo:long ?long.
      FILTER((xsd:double(?lat)>=40.73) && (xsd:double(?long)>=-74) && (xsd:double(?lat)<=41) && (xsd:double(?long)<=-73.98))
    } LIMIT 20

Querying with spatial:nearby:

    PREFIX co: <http://www.geonames.org/countries/#>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
    PREFIX spatial: <http://jena.apache.org/spatial#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX gn: <http://www.geonames.org/ontology#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX loticoowl: <http://www.lotico.com/ontology/>

    SELECT ?object
    WHERE {
      ?object spatial:nearby(40.74 -73.989 1 'mi').
      ?object rdfs:label ?label
    } LIMIT 10

The Six GeoSPARQL Features

Figure: GeoSPARQL Class Dependency

- Core: defines two top-level classes that can be used to organize geospatial data, i.e., geo:SpatialObject and geo:Feature.
- Topology Vocabulary: this extension is used for representing topological information about features. Topological information can be derived from geometric information, or it can be captured by explicitly asserting the topological relations between features. It includes:
  - Simple Features Relation Family (geo:sf***)
  - Egenhofer Relation Family (geo:eh***)
  - RCC8 Relation Family (geo:rcc8***)
  The equivalences between relations of the three families are shown in the figure "Relation equivalence across the topological relation families".
- Geometry Extension: defines a vocabulary for asserting and querying information about geometry data, and defines query functions for operating on geometry data.
  - geo:hasGeometry links a feature with a geometry that represents its spatial extent.
  - geo:dimension is the topological dimension of the geometric object, which must be less than or equal to the coordinate dimension.
  - geo:wktLiteral consists of an optional URI identifying the coordinate reference system, followed by Simple Features Well Known Text (WKT) describing a geometric value; see the figure "Example of a wktLiteral".
  - Functions: geof:intersection, geof:convexHull, geof:distance, geof:envelope (returns the minimum bounding box).
- Geometry Topology: defines a collection of topological query functions that operate on geometry literals. It also covers the three relation families, for example:
  - geof:ehEquals(geom1: ogc:geomLiteral, geom2: ogc:geomLiteral): xsd:boolean
  - geof:sfWithin(geom1: ogc:geomLiteral, geom2: ogc:geomLiteral): xsd:boolean
  - geof:sfOverlaps(geom1: ogc:geomLiteral, geom2: ogc:geomLiteral): xsd:boolean
- RDFS Entailment Extension: matches implicitly derived RDF triples in GeoSPARQL queries.
- Query Rewrite Extension: defines a set of RIF rules that use the topological extension functions defined in Geometry Topology to establish the existence of the direct topological predicates defined in Topology Vocabulary.
Figure: Relation equivalence across the topological relation families

Figure: Example of a wktLiteral

Finally, here is a very good tutorial that I recommend [2].

SRS/CRS

A spatial reference system (SRS) or coordinate reference system (CRS) is a coordinate-based local, regional or global system used to locate geographical entities. Spatial reference systems can be referred to using an SRID integer, including EPSG codes defined by the International Association of Oil and Gas Producers.

The SRID distinguishes between coordinate systems, for example UTM Zone 17N, NAD27 (SRID 2029) and WGS84 (SRID 4326).

The GeoSPARQL standard specifies that WKT Geometry Literals without an SRS URI default to CRS84 (WGS84): http://www.opengis.net/def/crs/OGC/1.3/CRS84.

DE-9IM

The Dimensionally Extended nine-Intersection Model (DE-9IM) is a topological model and a standard used to describe the spatial relations of two regions [1].

Figure: DE-9IM

To express a spatial predicate, the dimension values in the matrix can be mapped to TRUE and FALSE, which then describes the relation between two geometries.

Jena GeoSPARQL

Apache Jena implements support for GeoSPARQL; see https://jena.apache.org/documentation/geosparql/index.html. It implements the six major features, and all three relation families (Simple Features, Egenhofer and RCC8) are supported.

Usage

    // The indexes can be configured by size, retention duration and frequency of clean up
    GeoSPARQLConfig.setupMemoryIndex();

    Model model = .....;
    String query = ....;

    try (QueryExecution qe = QueryExecutionFactory.create(query, model)) {
        ResultSet rs = qe.execSelect();
        ResultSetFormatter.outputAsTSV(rs);
    }

APIs

The main class to handle geometries and their spatial relations is the GeometryWrapper. The GeometryWrapperFactory can be used to directly construct a GeometryWrapper.
- Parse a Geometry Literal:

      GeometryWrapper geometryWrapper = WKTDatatype.INSTANCE.parse("POINT(1 1)");

- Extract from a Jena Literal:

      GeometryWrapper geometryWrapper = GeometryWrapper.extract(geometryLiteral);

- Create from a JTS (Java Topology Suite) Geometry:

      GeometryWrapper geometryWrapper = GeometryWrapperFactory.createGeometry(geometry, srsURI, geometryDatatypeURI);

- Create from a JTS Point Geometry:

      GeometryWrapper geometryWrapper = GeometryWrapperFactory.createPoint(coordinate, srsURI, geometryDatatypeURI);

- Convert CRS/SRS:

      GeometryWrapper otherGeometryWrapper = geometryWrapper.convertCRS("http://www.opengis.net/def/crs/EPSG/0/27700");

- Spatial Relation:

      boolean isCrossing = geometryWrapper.crosses(otherGeometryWrapper);

- DE-9IM Intersection Pattern:

      boolean isRelated = geometryWrapper.relate(otherGeometryWrapper, "TFFFTFFFT");

- Geometry Property:

      boolean isEmpty = geometryWrapper.isEmpty();

Dependencies

- Apache SIS/SIS_DATA Environment Variable: SIS provides data structures for geographic features and associated meta-data along with methods to manipulate those data structures.
- Java Topology Suite: a Java library for creating and manipulating vector geometry.
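The string passed to relate in the API list above (for example "TFFFTFFFT") is a DE-9IM intersection pattern. To make the TRUE/FALSE reading of the matrix concrete, here is a minimal sketch in plain Java of how such a pattern can be checked against an already computed intersection matrix. The class De9imPattern and its matches method are hypothetical names chosen for illustration and are not part of Jena or JTS. The sample matrix is worked out by hand for a point lying strictly inside a polygon.

```java
final class De9imPattern {

    /**
     * Checks a DE-9IM matrix against a pattern.
     * matrix: 9 characters from {F, 0, 1, 2}, row-major over
     *         (interior, boundary, exterior) of A versus B.
     * pattern: 9 characters from {T, F, *, 0, 1, 2}, where
     *          T = any non-empty intersection, F = empty, * = don't care.
     */
    static boolean matches(String matrix, String pattern) {
        if (matrix.length() != 9 || pattern.length() != 9) {
            throw new IllegalArgumentException("matrix and pattern must have 9 characters");
        }
        for (int i = 0; i < 9; i++) {
            char m = matrix.charAt(i);
            char p = pattern.charAt(i);
            switch (p) {
                case '*':
                    break; // any value is accepted
                case 'T':
                    if (m == 'F') return false; // intersection must be non-empty
                    break;
                case 'F':
                case '0':
                case '1':
                case '2':
                    if (m != p) return false; // exact value required
                    break;
                default:
                    throw new IllegalArgumentException("invalid pattern character: " + p);
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hand-derived matrix for a point strictly inside a polygon.
        String pointInPolygon = "0FFFFF212";
        System.out.println(matches(pointInPolygon, "T*F**F***")); // Simple Features "within"
        System.out.println(matches(pointInPolygon, "FF*FF****")); // Simple Features "disjoint"
    }
}
```

In practice the matrix itself comes from the geometry library; the sketch only covers the final comparison step, which is what turns the nine dimension values into a boolean predicate.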
Dataset Conversion

Methods to convert datasets between serialisations and spatial/coordinate reference systems are available in org.apache.jena.geosparql.configuration.GeoSPARQLOperations:

- Load a Jena Model from file:

      Model dataModel = RDFDataMgr.loadModel("data.ttl");

- Convert Feature-GeometryLiteral to the GeoSPARQL Feature-Geometry-GeometryLiteral structure:

      Model geosparqlModel = GeoSPARQLOperations.convertGeometryStructure(dataModel);

- Convert Feature-Lat, Feature-Lon Geo predicates to the GeoSPARQL Feature-Geometry-GeometryLiteral structure, with option to remove Geo predicates:

      Model geosparqlModel = GeoSPARQLOperations.convertGeoPredicates(dataModel, true);

- Convert Geometry Literals to the WGS84 spatial reference system and WKT datatype:

      Model model = GeoSPARQLOperations.convert(geosparqlModel,
              "http://www.opengis.net/def/crs/EPSG/0/4326",
              "http://www.opengis.net/ont/geosparql#wktLiteral");

- Create Spatial Index for a Model within a Dataset for spatial querying:

      Dataset dataset = SpatialIndex.wrapModel(model);

Spatial Index

A Spatial Index is required for the jena-spatial property functions and is optional for the GeoSPARQL spatial relations. Only a single SRS can be used for a Spatial Index, and it is recommended that datasets are converted to a single SRS; see GeoSPARQLOperations.

The jena-spatial module contains several SPARQL functions for querying datasets using the WGS84 Geo predicates for latitude (http://www.w3.org/2003/01/geo/wgs84_pos#lat) and longitude (http://www.w3.org/2003/01/geo/wgs84_pos#long).
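When latitude/longitude values like these are turned into WKT literals by hand, the axis order is easy to get wrong. The following is a minimal sketch in plain Java, assuming the default CRS84 reference system mentioned in the SRS/CRS section, where a literal without an SRS URI uses Lon/Lat order. The class WktSketch and its methods are hypothetical helpers for illustration only; they are not Jena API, and Jena's own conversion routines may emit a different SRS and axis order.

```java
final class WktSketch {

    // Datatype IRI of GeoSPARQL WKT literals (from the geo: namespace).
    static final String WKT_DATATYPE = "http://www.opengis.net/ont/geosparql#wktLiteral";

    /** WKT point in Lon/Lat order (assumes no SRS URI, i.e. default CRS84). */
    static String point(double lat, double lon) {
        return "POINT(" + lon + " " + lat + ")";
    }

    /** Closed rectangular ring in Lon/Lat order covering the given bounds. */
    static String boundingBox(double minLat, double minLon, double maxLat, double maxLon) {
        return "POLYGON(("
                + minLon + " " + minLat + ", "
                + maxLon + " " + minLat + ", "
                + maxLon + " " + maxLat + ", "
                + minLon + " " + maxLat + ", "
                + minLon + " " + minLat + "))";
    }

    /** Wraps a WKT string as a typed literal usable inside a SPARQL query string. */
    static String typedLiteral(String wkt) {
        return "\"" + wkt + "\"^^<" + WKT_DATATYPE + ">";
    }

    public static void main(String[] args) {
        // The point used in the spatial:nearby query earlier in this post.
        System.out.println(typedLiteral(point(40.74, -73.989)));
        // An illustrative bounding box around that point.
        System.out.println(typedLiteral(boundingBox(40.73, -74.0, 41.0, -73.98)));
    }
}
```

Keeping the lat/lon to Lon/Lat swap in one small helper makes it harder to mix up the two orders when building filter geometries for queries.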
Geo predicates can be converted to Geometry Literals in the query and then used with the GeoSPARQL filter functions.

    ?subj wgs:lat ?lat .
    ?subj wgs:long ?lon .
    BIND(spatialF:convertLatLon(?lat, ?lon) as ?point) .
    #Coordinate order is Lon/Lat without stated SRS URI.
    BIND("POLYGON((...))"^^<http://www.opengis.net/ont/geosparql#wktLiteral> AS ?box) .
    FILTER(geof:sfContains(?box, ?point))

Alternatively, more shapes, relations and spatial reference systems can be used by converting the dataset to the GeoSPARQL structure.

    ?subj geo:hasGeometry ?geom .
    ?geom geo:hasSerialization ?geomLit .
    #Coordinate order is Lon/Lat without stated SRS URI.
    BIND("POLYGON((...))"^^<http://www.opengis.net/ont/geosparql#wktLiteral> AS ?box) .
    FILTER(geof:sfContains(?box, ?geomLit))

GeoSPARQL Fuseki

It uses the embedded Fuseki server and provides additional parameters for dataset loading. Currently, there is no GUI like the one provided in the Fuseki distribution.

Command Line

    java -jar jena-fuseki-geosparql-VER.jar ARGS

- Load RDF file into memory, write spatial index to file and run server:

      geosparql-fuseki -rf "test.rdf" -si "spatial.index"

- Load RDF file into persistent TDB and run server:

      geosparql-fuseki -rf "test.rdf" -t "TestTDB"

- Load from persistent TDB and run server:

      geosparql-fuseki -t "TestTDB"

- Load from persistent TDB, change port and run server:

      geosparql-fuseki -t "TestTDB" -p 3030

Usage

    String service = "http://localhost:3030/ds";
    String query = ....;
    try (QueryExecution qe = QueryExecutionFactory.sparqlService(service, query)) {
        ResultSet rs = qe.execSelect();
        ResultSetFormatter.outputAsTSV(rs);
    }

Without GeoSPARQL

GeoSPARQL requires the dataset to be converted to the dedicated wktLiteral, which in practice is not very flexible. For example, in YAGO3 latitude and longitude are attached through two predicates, hasLatitude and hasLongitude, to literals of type yago:Degrees, and converting them is not easy.

In addition, when testing on the complete YAGO dataset, we found that querying spatial information is quite time-consuming. Given the size of YAGO, enough Java heap also has to be reserved.

Below we describe the task of querying the relevant YAGO data and how we implemented it.

Problem Description

Retrieve from YAGO the entities within a given geographic area, together with all one-hop properties related to those entities.

To do this, we first select a set of seed entities based on their latitude and longitude, and then expand each seed entity to obtain the statements related to it.

Setting up a YAGO Endpoint

This step is done by loading the ttl data into persistent storage with Jena and Jena Fuseki. Once the engine is up, the data can be queried.

If you do not want to set up your own endpoint, you can currently use https://linkeddata1.calcul.u-psud.fr/sparql. It is a Virtuoso SPARQL Query Endpoint maintained by Université Paris-Saclay. However, this service is often unstable.

To set up your own YAGO endpoint, first download the datasets from the YAGO website: https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads/.

YAGO splits its data into several parts, so only the parts that are needed have to be loaded. In this example we chose TAXONOMY, SIMPLETAX, CORE, GEONAMES, LINK, and the English-related WIKIPEDIA content.

After that, the dataset needs preprocessing because some of the ttl files have encoding problems. We use the following sed commands:

    sed -i 's/|/-/g' ./* && sed -i 's/\\\\/-/g' ./* && sed -i 's/–/-/g' ./*
    sed -i 's/\/n//g' ./*
    sed -i 's/\\n//g' ./*
    sed -i 's/\\/-/g' ./*

After these substitutions, the data can be loaded persistently into TDB with Jena:

    tdbloader --loc=path\to\tdb_files *.ttl

Note that the dataset is large and this step is prone to errors. Fortunately, tdbloader updates incrementally, so the data can be loaded in batches. Remember to back up the contents of tdb_files to guard against corruption during writes.

Once this is done, start the Fuseki service on your machine or server to verify the result:

    java -Xms1024M -Xmx14g -jar fuseki-server.jar --update --loc path\to\tdb_files //myGraph

The command above raises the memory settings because running the query service is very memory-hungry.

We can also check the data directly with tdbquery:

    tdbquery --loc path\to\tdb_files --query q1.rq

where q1.rq contains a simple SPARQL query:

    SELECT ?s WHERE {?s ?p ?o.} LIMIT 10

The serialized results returned by tdbquery are printed on the command line. Note that tdbquery and tdbloader are both located in Jena's bin folder. Also note that TDB and TDB2 are different. We have not tested TDB2; reads and writes must not be mixed between the two, but TDB data can be converted to TDB2.

If the relevant port is open on your server, for example the default 3030, you can open the web page and run queries at localhost:3030.

The endpoint is now set up. Other things, such as dataset management and querying in the web UI, are easy to explore on your own.

Writing the Jena Program

We create a simple Maven project and add the required dependency:

    <dependencies>
        <dependency>
            <groupId>org.apache.jena</groupId>
            <artifactId>apache-jena-libs</artifactId>
            <type>pom</type>
            <version>3.14.0</version>
        </dependency>
    </dependencies>

Then we write the following code:

    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.List;

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.Query;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QueryFactory;
    import org.apache.jena.query.QuerySolution;
    import org.apache.jena.query.ResultSet;
    import org.apache.jena.query.ResultSetFormatter;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.RDFNode;
    import org.apache.jena.rdf.model.StmtIterator;
    import org.apache.jena.tdb.TDBFactory;

    public class CreateSubgraph {

        // Prefix of YAGO resource URIs; stripped to build output file names
        private static final String YAGO_NS = "http://yago-knowledge.org/resource/";

        private static void myQuery() throws Exception {
            // for windows
            // String dictionaryPath = "C:\\Data";
            // String directory = dictionaryPath + "\\yago_graph_data";
            // String output = dictionaryPath + "\\results\\test\\";
            String directory = "/home/ubuntu/yago3/tdb";
            String output = "/home/ubuntu/sake_preprocessing/output/london/";
            File outputDir = new File(output);
            if (!outputDir.exists()) {
                outputDir.mkdirs();
            }

            // Open TDB Dataset
            Dataset dataset = TDBFactory.createDataset(directory);
            System.out.println("dataset empty:" + dataset.isEmpty());
            // Retrieve Named Graph from Dataset, or use Default Graph.
            Model model = dataset.getDefaultModel();
            System.out.println("model empty:" + model.isEmpty());

            String queryString = "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n" +
                    "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#> \n" +
                    "PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>\n" +
                    "PREFIX spatial: <http://jena.apache.org/spatial#>\n" +
                    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n" +
                    "PREFIX yago: <http://yago-knowledge.org/resource/>\n" +
                    "\n" +
                    "SELECT ?object ?lat ?long \n" +
                    "WHERE {\n" +
    //                "?object rdf:type yago:yagoGeoEntity.\n" +
                    "?object yago:hasLatitude ?lat.\n" +
                    "?object yago:hasLongitude ?long.\n" +
                    // convert the lat and lon directly from yago:degrees to xsd:double
                    // (the bounding box below approximates Greater London)
                    "FILTER((xsd:double(?lat)>=51.286685) && (xsd:double(?long)>=-0.510451) && (xsd:double(?lat)<=51.691515) && (xsd:double(?long)<=0.333228))\n" +
                    "}";
            System.out.println(queryString);

            Query query = QueryFactory.create(queryString);
            List<QuerySolution> solutions;
            // Close the query execution once all results have been copied into a list
            try (QueryExecution qExec = QueryExecutionFactory.create(query, model)) {
                System.out.println("Start execute query");
                ResultSet results = qExec.execSelect();
                solutions = ResultSetFormatter.toList(results);
            }

            long count = 0;
            for (QuerySolution solution : solutions) {
                // get the entity we want
                RDFNode objectNode = solution.get("object");
                String uri = objectNode.asResource().getURI();
                // give it a name, remove the prefix of yago resource
                String localName = uri.substring(YAGO_NS.length()).replace("/", "u0001");
                if (localName.isEmpty()) {
                    count++;
                    localName = "BlankNode_" + count;
                }

                // for each entity, generate a file to store its relevant statements
                // (Triple.toString() is a plain debug format, not valid Turtle)
                File newFile = new File(output + localName + ".ttl");
                try (BufferedWriter bw = new BufferedWriter(new FileWriter(newFile))) {
                    // as head entity
                    writeStatements(model.listStatements(objectNode.asResource(), null, (RDFNode) null), bw);
                    // as tail entity
                    writeStatements(model.listStatements(null, null, objectNode), bw);
                }
            }
            System.out.println("Finish writing to file");
            System.out.println("Final count:" + solutions.size());
            dataset.close();
        }

        // Writes every statement of the iterator as one line
        private static void writeStatements(StmtIterator iter, BufferedWriter bw) throws IOException {
            while (iter.hasNext()) {
                bw.write(iter.nextStatement().asTriple().toString() + "\n");
            }
        }

        public static void main(String[] args) throws Exception {
            System.out.println("Start application");
            myQuery();
            System.out.println("Finish application!");
        }
    }

That's it. Here we queried the entities within the geographic extent of Greater London (approximated by an MBR, a minimum bounding rectangle) together with their related information. Let's pick one entity to look at the result, say Big Ben:

    http://yago-knowledge.org/resource/Big_Ben @http://yago-knowledge.org/resource/wasCreatedOnDate "1870-##-##"^^http://www.w3.org/2001/XMLSchema#date
    http://yago-knowledge.org/resource/Big_Ben @owl:sameAs http://dbpedia.org/resource/Big_Ben
    http://yago-knowledge.org/resource/Big_Ben @owl:sameAs http://sws.geonames.org/6618994
    http://yago-knowledge.org/resource/Big_Ben @http://yago-knowledge.org/resource/isLocatedIn http://yago-knowledge.org/resource/London
    http://yago-knowledge.org/resource/Big_Ben @http://yago-knowledge.org/resource/isLocatedIn http://yago-knowledge.org/resource/England
    http://yago-knowledge.org/resource/Big_Ben @http://yago-knowledge.org/resource/isLocatedIn http://yago-knowledge.org/resource/City_of_Westminster
    http://yago-knowledge.org/resource/Big_Ben @rdfs:label "Big Ben"@eng
    ...
    http://yago-knowledge.org/resource/Monumental_Challenge @http://yago-knowledge.org/resource/linksTo http://yago-knowledge.org/resource/Big_Ben

The query is also quite fast. Therefore, when the task needs neither geospatial functions nor a spatial index to match latitude and longitude, we recommend simply using plain literal matching on latitude and longitude, which is enough to get the job done.

Further Reading

1. https://en.wikipedia.org/wiki/DE-9IM
2. http://www.lirmm.fr/rod/slidesRoD04102018/RoD2018-tutorial.pdf