Knowledge Graph Practice Notes 3 - GeoSPARQL

Author: Steven Date: Mar 20, 2019 Updated On: May 5, 2022
Categories: KG

The third post in the Knowledge Graph Practice Notes series: Jena + GeoSPARQL

GeoSPARQL

GeoSPARQL is a geographic query language for RDF data. It supports representing and querying geospatial data on the Semantic Web. See the project homepage.

  • Representing geospatial information is done using high level ontologies inspired from GIS terminology
  • Geometries are represented using literals of spatial datatypes
  • Literals are serialized using OGC standards WKT and GML
  • Families of functions are offered for querying geometries

It is an Open Geospatial Consortium (OGC) specification. The relevant namespaces are:

  1. ogc: http://www.opengis.net/
  2. geo: http://www.opengis.net/ont/geosparql#
  3. geof: http://www.opengis.net/def/function/geosparql/
  4. geor: http://www.opengis.net/def/rule/geosparql/
  5. sf: http://www.opengis.net/ont/sf#
  6. gml: http://www.opengis.net/ont/gml#

A GeoSPARQL demo site: http://www.geosparql.org/. It appears to be a predecessor of the current OGC homepage.

Querying directly with latitude and longitude:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX spatial: <http://jena.apache.org/spatial#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?object
WHERE {
?object geo:lat ?lat.
?object geo:long ?long.
FILTER((xsd:double(?lat)>=40.73) && (xsd:double(?long)>=-74) && (xsd:double(?lat)<=41) && (xsd:double(?long)<=-73.98))
} LIMIT 20

Querying with spatial:nearby:

PREFIX co: <http://www.geonames.org/countries/#>

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX spatial: <http://jena.apache.org/spatial#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX gn: <http://www.geonames.org/ontology#>
PREFIX foaf:<http://xmlns.com/foaf/0.1/>
PREFIX loticoowl:<http://www.lotico.com/ontology/>


SELECT ?object
WHERE {
?object spatial:nearby(40.74 -73.989 1 'mi').
?object rdfs:label ?label
} LIMIT 10

The Six GeoSPARQL Features

[Figure: GeoSPARQL Class Dependency]

  • Core
    • defines two top level classes that can be used to organize geospatial data, i.e., geo:SpatialObject and geo:Feature;
  • Topology Vocabulary
    • the extension is used for representing topological information about features.
    • topological information can be derived from geometric information or it might be captured by asserting explicitly the topological relations between features.
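    • includes: Simple Features Relation Family (geo:sf***); Egenhofer Relation Family (geo:eh***); RCC8 Relation Family (geo:rcc8***).
    • The equivalences between relations across the three families are shown in the figure "Relation equivalences across topological relation families".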
  • Geometry Extension
    • defines a vocabulary for asserting and querying information about geometry data, and it defines query functions for operating on geometry data.
    • geo:hasGeometry (link a feature with a geometry that represents its spatial extent)
    • geo:dimension (topological dimension of this geometric object, which must be less than or equal to the coordinate dimension)
    • geo:wktLiteral (consists of an optional URI identifying the coordinate reference system, followed by Simple Features Well Known Text (WKT) describing a geometric value); see the figure "Example of a wktLiteral".
    • Functions: geof:intersection; geof:convexHull; geof:distance; geof:envelope (returns the minimum bounding box)
  • Geometry Topology
    • defines a collection of topological query functions that operate on geometry literals
    • likewise covers the three relation families
    • geof:ehEquals(geom1: ogc:geomLiteral, geom2: ogc:geomLiteral): xsd:boolean
    • geof:sfWithin(geom1: ogc:geomLiteral, geom2: ogc:geomLiteral): xsd:boolean
    • geof:sfOverlaps(geom1: ogc:geomLiteral, geom2: ogc:geomLiteral): xsd:boolean
  • RDFS Entailment Extension
    • matching implicitly derived RDF triples in GeoSPARQL queries.
  • Query Rewrite Extension
    • defines a set of RIF rules that use topological extension functions defined in Geometry Topology to establish the existence of direct topological predicates defined in Topology Vocabulary.

[Figure: Relation equivalences across topological relation families]

[Figure: Example of a wktLiteral]

Finally, I recommend a very good tutorial [2].

SRS/CRS

A spatial reference system (SRS) or coordinate reference system (CRS) is a coordinate-based local, regional or global system used to locate geographical entities.

Spatial reference systems can be referred to using a SRID integer, including EPSG codes defined by the International Association of Oil and Gas Producers.

An SRID distinguishes one coordinate system from another, for example UTM Zone 17N, NAD27 (SRID 2029) and WGS84 (SRID 4326).

The GeoSPARQL standard specifies that WKT Geometry Literals without an SRS URI are defaulted to CRS84 (WGS84) http://www.opengis.net/def/crs/OGC/1.3/CRS84.
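The defaulting rule can be illustrated without Jena. The following is only a sketch in plain Java, assuming the lexical form of a wktLiteral is either bare WKT or WKT preceded by an SRS URI in angle brackets. The class and method names (WktLiteralSketch, srsOf, wktOf) are made up for this illustration and are not part of any library.

```java
final class WktLiteralSketch {
    // Default SRS for a wktLiteral that states no SRS URI (axis order: Lon/Lat).
    static final String CRS84 = "http://www.opengis.net/def/crs/OGC/1.3/CRS84";

    /** Returns the SRS URI stated in the lexical form, or CRS84 when none is stated. */
    static String srsOf(String lexical) {
        String s = lexical.trim();
        if (s.startsWith("<")) {
            int end = s.indexOf('>');
            if (end > 0) {
                return s.substring(1, end);
            }
        }
        return CRS84;
    }

    /** Returns the WKT part, with any leading SRS URI removed. */
    static String wktOf(String lexical) {
        String s = lexical.trim();
        if (s.startsWith("<")) {
            int end = s.indexOf('>');
            if (end > 0) {
                return s.substring(end + 1).trim();
            }
        }
        return s;
    }

    public static void main(String[] args) {
        // No SRS URI: CRS84 is assumed, so coordinates are read as Lon/Lat.
        String plain = "POINT(-73.989 40.74)";
        // Explicit EPSG:4326: coordinates are read as Lat/Lon.
        String withSrs = "<http://www.opengis.net/def/crs/EPSG/0/4326> POINT(40.74 -73.989)";
        System.out.println(srsOf(plain) + " | " + wktOf(plain));
        System.out.println(srsOf(withSrs) + " | " + wktOf(withSrs));
    }
}
```

The practical consequence is the axis order: the same place is written Lon/Lat under the CRS84 default and Lat/Lon under EPSG:4326.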

DE-9IM

The Dimensionally Extended nine-Intersection Model (DE-9IM) is a topological model and a standard used to describe the spatial relations of two regions [1].

[Figure: DE-9IM]

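To express spatial predicates, each dim value in the matrix can be mapped to TRUE or FALSE (an empty intersection is FALSE, any non-empty one is TRUE), and the resulting pattern describes the relation between two geometries.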
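A minimal sketch of this mapping in plain Java, assuming the DE-9IM matrix is given as a 9-character string over F, 0, 1, 2 and the predicate as a 9-character pattern over T, F, *, 0, 1, 2. The class and method names are hypothetical. The two patterns in main are the usual Simple Features patterns for "within" and for area/area "overlaps".

```java
final class De9imSketch {
    /** Maps every cell to T (non-empty intersection) or F (empty intersection). */
    static String toBoolean(String matrix) {
        StringBuilder sb = new StringBuilder();
        for (char c : matrix.toCharArray()) {
            sb.append(c == 'F' ? 'F' : 'T');
        }
        return sb.toString();
    }

    /** True if the matrix satisfies the pattern, cell by cell. */
    static boolean matches(String matrix, String pattern) {
        if (matrix.length() != 9 || pattern.length() != 9) {
            throw new IllegalArgumentException("matrix and pattern must have 9 cells");
        }
        for (int i = 0; i < 9; i++) {
            char m = matrix.charAt(i);
            char p = pattern.charAt(i);
            boolean ok;
            switch (p) {
                case '*': ok = true; break;                             // any value
                case 'T': ok = m == '0' || m == '1' || m == '2'; break; // non-empty
                case 'F': ok = m == 'F'; break;                         // empty
                default:  ok = m == p; break;                           // exact dimension
            }
            if (!ok) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String inside = "2FF1FF212";      // polygon A strictly inside polygon B
        String overlapping = "212101212"; // two partially overlapping polygons
        System.out.println(toBoolean(inside));
        System.out.println(matches(inside, "T*F**F***"));      // within
        System.out.println(matches(overlapping, "T*F**F***")); // within
        System.out.println(matches(overlapping, "T*T***T**")); // overlaps (area/area)
    }
}
```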

Jena GeoSPARQL

Apache Jena implements support for GeoSPARQL; see its homepage: https://jena.apache.org/documentation/geosparql/index.html. It implements the six major features and supports all three relation families: Simple Features, Egenhofer and RCC8.

Usage

// The indexes can be configured by size, retention duration and frequency of clean up
GeoSPARQLConfig.setupMemoryIndex();

Model model = .....;
String query = ....;

try (QueryExecution qe = QueryExecutionFactory.create(query, model)) {
ResultSet rs = qe.execSelect();
ResultSetFormatter.outputAsTSV(rs);
}

APIs

The main class to handle geometries and their spatial relations is the GeometryWrapper. The GeometryWrapperFactory can be used to directly construct a GeometryWrapper.

  • Parse a Geometry Literal:

    GeometryWrapper geometryWrapper = WKTDatatype.INSTANCE.parse("POINT(1 1)");
  • Extract from a Jena Literal:

    GeometryWrapper geometryWrapper = GeometryWrapper.extract(geometryLiteral);
  • Create from a JTS (Java Topology Suite) Geometry:

    GeometryWrapper geometryWrapper = GeometryWrapperFactory.createGeometry(geometry, srsURI, geometryDatatypeURI);
  • Create from a JTS Point Geometry:

    GeometryWrapper geometryWrapper = GeometryWrapperFactory.createPoint(coordinate, srsURI, geometryDatatypeURI);
  • Convert CRS/SRS:

    GeometryWrapper otherGeometryWrapper = geometryWrapper.convertCRS("http://www.opengis.net/def/crs/EPSG/0/27700");
  • Spatial Relation:

    boolean isCrossing = geometryWrapper.crosses(otherGeometryWrapper);
  • DE-9IM Intersection Pattern:

    boolean isRelated = geometryWrapper.relate(otherGeometryWrapper, "TFFFTFFFT");
  • Geometry Property:

    boolean isEmpty = geometryWrapper.isEmpty();

Dependencies

Apache SIS/SIS_DATA Environment Variable: SIS provides data structures for geographic features and associated meta-data along with methods to manipulate those data structures.

Java Topology Suite: a Java library for creating and manipulating vector geometry.

Dataset Conversion

Methods to convert datasets between serialisations and spatial/coordinate reference systems are available in: org.apache.jena.geosparql.configuration.GeoSPARQLOperations:

  • Load a Jena Model from file:
    Model dataModel = RDFDataMgr.loadModel("data.ttl");
  • Convert Feature-GeometryLiteral to the GeoSPARQL Feature-Geometry-GeometryLiteral structure:
    Model geosparqlModel = GeoSPARQLOperations.convertGeometryStructure(dataModel);
  • Convert Feature-Lat, Feature-Lon Geo predicates to the GeoSPARQL Feature-Geometry-GeometryLiteral structure, with option to remove Geo predicates:
    Model geosparqlModel = GeoSPARQLOperations.convertGeoPredicates(dataModel, true);
  • Convert Geometry Literals to the WGS84 spatial reference system and WKT datatype:
    Model model = GeoSPARQLOperations.convert(geosparqlModel,
        "http://www.opengis.net/def/crs/EPSG/0/4326", "http://www.opengis.net/ont/geosparql#wktLiteral");
  • Create Spatial Index for a Model within a Dataset for spatial querying:
    Dataset dataset = SpatialIndex.wrapModel(model);

Spatial Index

A Spatial Index is required for the jena-spatial property functions and is optional for the GeoSPARQL spatial relations. Only a single SRS can be used for a Spatial Index and it is recommended that datasets are converted to a single SRS, see GeoSPARQLOperations.

The jena-spatial module contains several SPARQL functions for querying datasets using the WGS84 Geo predicates for latitude (http://www.w3.org/2003/01/geo/wgs84_pos#lat) and longitude (http://www.w3.org/2003/01/geo/wgs84_pos#long).

Geo predicates can be converted to Geometry Literals in query and then used with the GeoSPARQL filter functions.

?subj wgs:lat ?lat .
?subj wgs:long ?lon .
BIND(spatialF:convertLatLon(?lat, ?lon) as ?point) .
#Coordinate order is Lon/Lat without stated SRS URI.
BIND("POLYGON((...))"^^<http://www.opengis.net/ont/geosparql#wktLiteral> AS ?box) .
FILTER(geof:sfContains(?box, ?point))

Alternatively, utilizing more shapes, relations and spatial reference systems can be achieved by converting the dataset to the GeoSPARQL structure.

?subj geo:hasGeometry ?geom .
?geom geo:hasSerialization ?geomLit .
#Coordinate order is Lon/Lat without stated SRS URI.
BIND("POLYGON((...))"^^<http://www.opengis.net/ont/geosparql#wktLiteral> AS ?box) .
FILTER(geof:sfContains(?box, ?geomLit))
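Both snippets above leave the query box as POLYGON((...)). As a rough illustration of the Lon/Lat order mentioned in the comments, the sketch below builds the WKT text for a point and for a rectangular box from latitude/longitude values. It is plain Java with hypothetical names (WktBoxSketch, pointWkt, boxWkt) and does not use or reproduce any Jena API. The resulting strings would still need the wktLiteral datatype attached in a query.

```java
import java.math.BigDecimal;

final class WktBoxSketch {
    /** Renders a double as a plain decimal, avoiding exponent notation. */
    private static String num(double v) {
        return BigDecimal.valueOf(v).toPlainString();
    }

    /** WKT point in Lon/Lat order, as expected when no SRS URI is stated. */
    static String pointWkt(double lat, double lon) {
        return "POINT(" + num(lon) + " " + num(lat) + ")";
    }

    /** Closed rectangular ring in Lon/Lat order, listed counter-clockwise. */
    static String boxWkt(double minLat, double minLon, double maxLat, double maxLon) {
        return "POLYGON(("
            + num(minLon) + " " + num(minLat) + ", "
            + num(maxLon) + " " + num(minLat) + ", "
            + num(maxLon) + " " + num(maxLat) + ", "
            + num(minLon) + " " + num(maxLat) + ", "
            + num(minLon) + " " + num(minLat) + "))";
    }

    public static void main(String[] args) {
        System.out.println(pointWkt(40.74, -73.989));
        System.out.println(boxWkt(40.73, -74.0, 41.0, -73.98));
    }
}
```

Note that the first and last coordinate of the ring are identical, which WKT requires for a polygon ring.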

GeoSPARQL Fuseki

It uses the embedded server Fuseki and provides additional parameters for dataset loading. Currently, there is no GUI interface as provided in the Fuseki distribution.

Command Line

java -jar jena-fuseki-geosparql-VER.jar ARGS

Load RDF file into memory, write spatial index to file and run server:

geosparql-fuseki -rf "test.rdf" -si "spatial.index"

Load RDF file into persistent TDB and run server:

geosparql-fuseki -rf "test.rdf" -t "TestTDB"

Load from persistent TDB and run server:

geosparql-fuseki -t "TestTDB"

Load from persistent TDB, change port and run server:

geosparql-fuseki -t "TestTDB" -p 3030

Usage

String service = "http://localhost:3030/ds";
String query = ....;
try (QueryExecution qe = QueryExecutionFactory.sparqlService(service, query)) {
ResultSet rs = qe.execSelect();
ResultSetFormatter.outputAsTSV(rs);
}

Without GeoSPARQL

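GeoSPARQL requires the dataset to be converted to its dedicated wktLiteral form, which is not very flexible in practice. For example, in YAGO3 latitude and longitude are attached through two predicates, hasLatitude and hasLongitude, to literals of type yago:Degrees, and converting them is not easy.

Moreover, testing on the full YAGO dataset showed that querying the spatial information directly is not particularly time-consuming. Given the size of YAGO, though, the Java heap still needs to be large enough.

Below we describe the task of querying the relevant YAGO data and how we implemented it.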

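Problem Description

Retrieve from YAGO the entities that lie within a given geographic extent, together with all one-hop properties of those entities.

To do this, we first select a set of seed entities by latitude and longitude, then expand each seed entity to collect every statement that involves it.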
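The seed-selection step can be sketched as a small query builder. This is an illustration only, in plain Java: it assembles a SPARQL string with a latitude/longitude bounding-box FILTER over the YAGO predicates yago:hasLatitude and yago:hasLongitude. The class and method names are hypothetical, coordinates are rounded to six decimals, and a box that crosses the antimeridian is not handled.

```java
import java.util.Locale;

final class SeedQuerySketch {
    /** Builds a SELECT query for entities whose lat/long fall inside the given box. */
    static String seedQuery(double minLat, double minLon, double maxLat, double maxLon) {
        if (minLat > maxLat || minLon > maxLon) {
            throw new IllegalArgumentException("min must not exceed max");
        }
        // Locale.ROOT keeps '.' as the decimal separator regardless of system locale.
        return String.format(Locale.ROOT,
            "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>\n"
            + "PREFIX yago: <http://yago-knowledge.org/resource/>\n"
            + "SELECT ?object ?lat ?long\n"
            + "WHERE {\n"
            + "  ?object yago:hasLatitude ?lat .\n"
            + "  ?object yago:hasLongitude ?long .\n"
            + "  FILTER((xsd:double(?lat)>=%.6f) && (xsd:double(?long)>=%.6f)"
            + " && (xsd:double(?lat)<=%.6f) && (xsd:double(?long)<=%.6f))\n"
            + "}",
            minLat, minLon, maxLat, maxLon);
    }

    public static void main(String[] args) {
        // Bounding box approximating Greater London.
        System.out.println(seedQuery(51.286685, -0.510451, 51.691515, 0.333228));
    }
}
```

The expansion step then needs only two lookups per seed entity: statements where it is the subject and statements where it is the object.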

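Setting Up a YAGO Endpoint

This step uses Jena and Jena Fuseki to load the ttl data into persistent storage. Once the engine is set up, the data can be queried.

If you would rather not set up your own endpoint, you can currently use https://linkeddata1.calcul.u-psud.fr/sparql, a Virtuoso SPARQL query endpoint maintained by Université Paris-Saclay. The service is often unstable, though.

To set up your own YAGO endpoint, first download the datasets from the YAGO website: https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads/

YAGO splits its data into several parts, so we can load only the parts we need. In this example we chose TAXONOMY, SIMPLETAX, CORE, GEONAMES, LINK, and the English-related WIKIPEDIA content.

Next, the dataset needs preprocessing, because some of the ttl files have encoding problems. We use sed, as follows:
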
sed -i 's/|/-/g' ./* && sed -i 's/\\\\/-/g' ./* && sed -i 's/–/-/g' ./*
sed -i 's/\/n//g' ./*
sed -i 's/\\n//g' ./*
sed -i 's/\\/-/g' ./*

After these substitutions, we can use Jena to load the data persistently into TDB with the following command:

tdbloader --loc=path\to\tdb_files *.ttl

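Note that the dataset is large, so this step can easily fail. Fortunately tdbloader updates incrementally, so the data can be loaded in batches. Remember to back up tdb_files in case a failed write corrupts it.

Once this step is done, start the Fuseki service on your local machine or server to verify the data:
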
java -Xms1024M -Xmx14g -jar fuseki-server.jar --update --loc path\to\tdb_files //myGraph

In the command above we raised the memory settings, because running the query service is very memory-hungry.

We can also verify the data directly with tdbquery:

tdbquery --loc path\to\tdb_files --query q1.rq

In q1.rq we put a simple SPARQL query:

SELECT ?s WHERE {?s ?p ?o.} LIMIT 10

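The command line then shows the serialized results returned by tdbquery. Note that tdbquery and tdbloader are both in Jena's bin folder. Also note that TDB and TDB2 are different. I have not tested TDB2, and their reads and writes must not be mixed, although TDB data can be converted to TDB2.

If your server exposes the relevant port, for example the default 3030, you can open the web page at localhost:3030 and run queries there.

That completes the endpoint setup.

Everything else, such as dataset management and querying in the web UI, is easy to explore on your own.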

Writing the Jena Program

We create a simple Maven project and add the following dependency:

<dependencies>
<dependency>
<groupId>org.apache.jena</groupId>
<artifactId>apache-jena-libs</artifactId>
<type>pom</type>
<version>3.14.0</version>
</dependency>
</dependencies>

Then we write the following code:

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;

import org.apache.jena.query.Dataset;
import org.apache.jena.query.Query;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;
import org.apache.jena.query.ResultSetFormatter;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.RDFNode;
import org.apache.jena.rdf.model.StmtIterator;
import org.apache.jena.tdb.TDBFactory;

public class CreateSubgraph {

    // Prefix of YAGO resource URIs (35 characters), stripped to derive file names
    private static final String YAGO_PREFIX = "http://yago-knowledge.org/resource/";

    private static void myQuery() throws Exception {
        // for windows
        // String dictionaryPath = "C:\\Data";
        // String directory = dictionaryPath + "\\yago_graph_data";
        // String output = dictionaryPath + "\\results\\test\\";
        String directory = "/home/ubuntu/yago3/tdb";
        String output = "/home/ubuntu/sake_preprocessing/output/london/";
        File outputDir = new File(output);
        if (!outputDir.exists()) {
            outputDir.mkdirs();
        }
        // Open the TDB dataset
        Dataset dataset = TDBFactory.createDataset(directory);
        System.out.println("dataset empty:" + dataset.isEmpty());
        // Retrieve a named graph from the dataset, or use the default graph
        Model model = dataset.getDefaultModel();
        System.out.println("model empty:" + model.isEmpty());

        long count = 0;
        String queryString = "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n" +
                "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#> \n" +
                "PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>\n" +
                "PREFIX spatial: <http://jena.apache.org/spatial#>\n" +
                "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n" +
                "PREFIX yago: <http://yago-knowledge.org/resource/>\n" +
                "\n" +
                "SELECT ?object ?lat ?long \n" +
                "WHERE {\n" +
                // "?object rdf:type yago:yagoGeoEntity.\n" +
                "?object yago:hasLatitude ?lat.\n" +
                "?object yago:hasLongitude ?long.\n" +
                // convert the lat and lon directly from yago:degrees to xsd:double
                "FILTER((xsd:double(?lat)>=51.286685) && (xsd:double(?long)>=-0.510451) && (xsd:double(?lat)<=51.691515) && (xsd:double(?long)<=0.333228))\n" + // Greater London
                "}";
        System.out.println(queryString);
        Query query = QueryFactory.create(queryString);
        List<QuerySolution> solutions;
        // Materialize the results so the QueryExecution can be closed right away
        try (QueryExecution qExec = QueryExecutionFactory.create(query, model)) {
            System.out.println("Start execute query");
            ResultSet results = qExec.execSelect();
            solutions = ResultSetFormatter.toList(results);
        }
        for (QuerySolution solution : solutions) {
            // get the entity we want
            RDFNode objectNode = solution.get("?object");
            String uri = objectNode.asResource().getURI();
            // derive a file name by removing the YAGO resource prefix
            String localName = uri.substring(YAGO_PREFIX.length()).replace("/", "u0001");
            if (localName.isEmpty()) {
                count++;
                localName = "BlankNode_" + count;
            }

            // for each entity, generate a file that stores its relevant statements
            File newFile = new File(output + localName + ".ttl");
            try (BufferedWriter bw = new BufferedWriter(new FileWriter(newFile))) {
                // statements with the entity as head (subject)
                StmtIterator iter = model.listStatements(objectNode.asResource(), null, (RDFNode) null);
                while (iter.hasNext()) {
                    try {
                        bw.write(iter.nextStatement().asTriple().toString() + "\n");
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
                // statements with the entity as tail (object)
                iter = model.listStatements(null, null, objectNode);
                while (iter.hasNext()) {
                    try {
                        bw.write(iter.nextStatement().asTriple().toString() + "\n");
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }
        }
        System.out.println("Finish writing to file");
        System.out.println("Final count:" + solutions.size());
        dataset.close();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Start application");
        myQuery();
        System.out.println("Finish application!");
    }

}

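And we are done. Here we queried the entities within the geographic extent of Greater London (approximated by a minimum bounding rectangle, MBR) together with their related information. Let's pick an entity to check the result, say Big Ben:
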
http://yago-knowledge.org/resource/Big_Ben @http://yago-knowledge.org/resource/wasCreatedOnDate "1870-##-##"^^http://www.w3.org/2001/XMLSchema#date
http://yago-knowledge.org/resource/Big_Ben @owl:sameAs http://dbpedia.org/resource/Big_Ben
http://yago-knowledge.org/resource/Big_Ben @owl:sameAs http://sws.geonames.org/6618994
http://yago-knowledge.org/resource/Big_Ben @http://yago-knowledge.org/resource/isLocatedIn http://yago-knowledge.org/resource/London
http://yago-knowledge.org/resource/Big_Ben @http://yago-knowledge.org/resource/isLocatedIn http://yago-knowledge.org/resource/England
http://yago-knowledge.org/resource/Big_Ben @http://yago-knowledge.org/resource/isLocatedIn http://yago-knowledge.org/resource/City_of_Westminster
http://yago-knowledge.org/resource/Big_Ben @rdfs:label "Big Ben"@eng
...
http://yago-knowledge.org/resource/Monumental_Challenge @http://yago-knowledge.org/resource/linksTo http://yago-knowledge.org/resource/Big_Ben

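The query is also quite fast. So when neither geospatial functions nor a spatial index are needed to match latitude and longitude, we suggest the simplest approach, plain literal matching on latitude and longitude, which is enough for the task.

Further Reading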