Data Indexing and Query Engine
Indexing
Traditionally, RDF graph databases are built with multiple column-oriented indices that allow access to triples by permutations of a subject (S), predicate (P), object (O) and graph (G) dubbed collation orders. E.g., with an GSP index, we can quickly scan, filter or access triples as long as we access it in order G-S-P-O. If we were looking for all quads with a specific O, we would need to do a full index scan. Systems have to compromise between adding more collation orders for query speed at the cost of higher storage and update costs.
Tentris breaks free of this compromise. Tentris relies on a novel data structure, namely the hypertie, for storing and indexing knowledge graphs. It is a monolithic data structure that provides the same access capabilities as indices for all collation orders and thus does not compromise on query speed. The impact on storage and update speed is mitigated by eliminating redundancies that indices for multiple collation orders would have. This way we bring together OLAP and OLTP capabilities in a single graph database instance.
A key characteristic of the hypertie is that it enables the use of worst-case optimal multi-way joins.
Query Engine
The query engine of Tentris evaluates queries using multi-way joins. The class of joins Tentris uses is also called worst-case optimal joins (WCOJs). It has been proven that WCOJs are optimal and especially asymptotically faster than traditional binary joins. One example of queries where they excel are triangle queries like:
SELECT * WHERE {
?r1 :p1 ?r2 .
?r1 :p1 ?r3 .
?r2 :p1 ?r3 .
}
If you review your SPARQL queries, you will find plenty of such (undirected) triangles or larger circular dependencies between triple patterns.
Unlike binary joins that carry out join operations on two query operands (e.g., tables or triple patterns) at a time, multi-way joins carry out join operations by evaluating a single variable a variable at a time. Multi-way joins allow results to be incrementally evaluated and do not require the materialization of intermediate results.
This has multiple practical advantages:
- (Asymptotically) faster queries. Asymptotic means massive gains. It often speeds up queries from days to minutes. Because in SPARQL structures like circular variable dependencies are very common and slow down traditional binary join systems dramatically.
- Query execution is RAM-efficient. In classical binary-join-systems, materializing intermediate join results often bloats the RAM demand. Multi-way joins do not require this. The memory requirement is bound by the actual result size. The server has to collect the final results after all to serve them to the client.
- Streaming allows results larger than RAM. Tentris allows you to consume SPARQL results as an HTTP stream. Under
certain constraints, e.g., no
ORDER BY
clauses, this allows executing queries that produce multi-TB results. Make sure your client is able to handle the result.
References
For information about our index and query engine, please refer to the following peer-reviewed publications.
- Bigerl et al. "Tentris – A Tensor-Based Triple Store" (International Semantic Web Conference 2020)
- Bigerl et al. "Hashing the Hypertrie: Space- and Time-Efficient Indexing for SPARQL in Tensors" (International Semantic Web Conference 2022)
- Karalis et al. "Efficient Evaluation of Conjunctive Regular Path Queries Using Multi-way Joins" (Extended Semantic Web Conference 2024)
- Karalis et al. "Evaluating Negation with Multi-way Joins Accelerates Class Expression Learning" (European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases 2024)