The hitchhiker's guide to RDF2vec.
RDF2vec is a tool for creating vector representations of RDF graphs. In essence, RDF2vec creates a numeric vector for each node in an RDF graph.
RDF2vec was developed by Petar Ristoski as a key contribution of his PhD thesis Exploiting Semantic Web Knowledge Graphs in Data Mining [Ristoski, 2019], which he defended in January 2018 at the Data and Web Science Group at the University of Mannheim, supervised by Heiko Paulheim. In 2019, he was awarded the SWSA Distinguished Dissertation Award for this outstanding contribution to the field.
RDF2vec was inspired by the word2vec approach [Mikolov et al., 2013] for representing words in a numeric vector space. word2vec takes as input a set of sentences, and trains a neural network using one of the following two variants: predict a word given its context words (continuous bag of words, or CBOW), or predict the context words given a word (skip-gram, or SG):
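A minimal sketch of how the two variants frame their training examples may help here. It only builds the (input, output) pairs and does not train a network; the function names, the window size, and the toy sentence are assumptions made for illustration.

```python
from typing import List, Tuple


def cbow_examples(sentence: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """Build (context words, target word) pairs: CBOW predicts the target from its context."""
    examples = []
    for i, target in enumerate(sentence):
        left = sentence[max(0, i - window):i]
        right = sentence[i + 1:i + 1 + window]
        examples.append((left + right, target))
    return examples


def skipgram_examples(sentence: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Build (input word, context word) pairs: skip-gram predicts each context word from the word."""
    pairs = []
    for context, target in cbow_examples(sentence, window):
        pairs.extend((target, c) for c in context)
    return pairs


# Toy sentence, chosen only for illustration.
sentence = ["Tom", "ate", "bread", "yesterday"]
cbow = cbow_examples(sentence, window=1)
sg = skipgram_examples(sentence, window=1)
```

With a window of 1, CBOW sees the neighbors of bread as one joint input, while skip-gram turns the same neighborhood into one training pair per neighbor.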
This approach can be applied to RDF graphs as well. In the original version presented at ISWC 2016 [Ristoski and Paulheim, 2016], random walks on the RDF graph are used to create sequences of RDF nodes, which are then used as input for the word2vec algorithm:
It has been shown that such a representation can be utilized in many application scenarios, such as using knowledge graphs as background knowledge in data mining tasks, or for building content-based recommender systems [Ristoski et al., 2019].
Consider the following example graph:
From this graph, a set of random walks that could be extracted may look as follows:
Hamburg -> country -> Germany -> leader -> Angela_Merkel
Germany -> leader -> Angela_Merkel -> birthPlace -> Hamburg
Hamburg -> leader -> Peter_Tschentscher -> residence -> Hamburg
For those random walks, we consider each element (i.e., an entity or a predicate) as a word when running word2vec. As a result, we obtain vectors for all entities (and all predicates) in the graph.
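The walk extraction step can be sketched in a few lines. This is a simplified illustration and not the code of any particular RDF2vec implementation: the triples are reconstructed from the example walks above (the original graph may contain more edges), and the function names, number of walks, walk depth, and seed are assumptions.

```python
import random
from collections import defaultdict

# Triples reconstructed from the example walks (assumption: no further edges).
TRIPLES = [
    ("Hamburg", "country", "Germany"),
    ("Germany", "leader", "Angela_Merkel"),
    ("Angela_Merkel", "birthPlace", "Hamburg"),
    ("Hamburg", "leader", "Peter_Tschentscher"),
    ("Peter_Tschentscher", "residence", "Hamburg"),
]


def build_index(triples):
    """Map each subject to its outgoing (predicate, object) edges."""
    out_edges = defaultdict(list)
    for s, p, o in triples:
        out_edges[s].append((p, o))
    return out_edges


def random_walk(out_edges, start, hops, rng):
    """Follow up to `hops` random outgoing edges, recording predicates and entities."""
    walk = [start]
    node = start
    for _ in range(hops):
        edges = out_edges.get(node)
        if not edges:  # dead end: stop early
            break
        predicate, node = rng.choice(edges)
        walk += [predicate, node]
    return walk


def generate_walks(triples, walks_per_entity=3, hops=2, seed=42):
    """Start a fixed number of random walks at every entity in the graph."""
    rng = random.Random(seed)
    out_edges = build_index(triples)
    entities = sorted({s for s, _, _ in triples} | {o for _, _, o in triples})
    return [random_walk(out_edges, e, hops, rng)
            for e in entities for _ in range(walks_per_entity)]


walks = generate_walks(TRIPLES)
```

Each walk is a list of tokens, so the resulting list of walks can be handed to a word2vec implementation in place of tokenized sentences.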
The resulting vectors have properties similar to those of word2vec embeddings. In particular, similar entities are closer in the vector space than dissimilar ones (see [Hubert et al., 2024]), which makes those representations ideal for learning patterns about those entities. Typically, cosine similarity is used as a similarity metric to identify related entities; however, some works report better results when using Euclidean distance or a Laplacian kernel [Bakshizadeh et al., 2024].
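The three measures mentioned above can be written down compactly. The sketch below uses made-up three-dimensional toy vectors (real RDF2vec vectors typically have far more dimensions), and the Laplacian kernel is given in its common form exp(-gamma * L1 distance) with an assumed gamma of 1.0.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance (smaller means more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def laplacian_kernel(a, b, gamma=1.0):
    """exp(-gamma * L1 distance); values lie in (0, 1], larger means more similar."""
    return math.exp(-gamma * sum(abs(x - y) for x, y in zip(a, b)))


# Made-up toy vectors, for illustration only.
berlin = [0.9, 0.1, 0.3]
paris = [0.8, 0.2, 0.3]
bangkok = [0.1, 0.9, 0.4]

for name, other in [("Paris", paris), ("Bangkok", bangkok)]:
    print(name,
          round(cosine_similarity(berlin, other), 3),
          round(euclidean_distance(berlin, other), 3),
          round(laplacian_kernel(berlin, other), 3))
```

Note that cosine similarity and the Laplacian kernel grow with similarity, whereas Euclidean distance shrinks, so rankings have to be sorted in opposite directions.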
In the example below, showing embeddings for DBpedia and Wikidata, countries and cities are grouped together, and European and Asian cities and countries form clusters:
The two figures above indicate that classes (in the example: countries and cities) can be separated well in the projected vector space, indicated by the dashed lines. [Zouaq and Martel, 2020] have compared how well different knowledge graph embedding methods separate the classes in a knowledge graph. They have shown that RDF2vec outperforms other embedding methods like TransE, TransH, TransD, ComplEx, and DistMult, in particular on smaller classes. On the task of entity classification, RDF2vec shows results which are competitive with more recent graph convolutional neural networks [Schlichtkrull et al., 2018]. [Meghraoui et al., 2024] have proposed measuring the class separability achieved by embedding methods as a means to evaluate the modeling quality of an ontology.
RDF2vec has been tailored to RDF graphs by respecting the type of edges (i.e., the predicates). Related variants, like node2vec [Grover and Leskovec, 2016] or DeepWalk [Perozzi et al., 2014], are defined for graphs with just one type of edge. They create sequences of nodes, while RDF2vec creates alternating sequences of entities and predicates.
This video by Petar Ristoski introduces the main ideas of RDF2vec:
A lot of approaches have been proposed for link prediction in knowledge graphs, from classic approaches like TransE [Bordes et al., 2013] and RESCAL [Nickel et al., 2011] to countless variants. The key difference is that those approaches are trained to optimize a loss function on link prediction, which, as a by-product, projects similar entities close together in the vector space. In RDF2vec, on the other hand, the capability to predict links is the by-product, in particular in variants like order-aware RDF2vec. A detailed comparison of the commonalities and differences of those families of approaches can be found in [Portisch et al., 2022].
There are a few different implementations of RDF2vec out there:
Training RDF2vec from scratch can take quite a bit of time. Here is a list of pre-trained models we know:
There is also an alternative for downloading and processing an entire knowledge graph embedding (which may consume several GB):
There are quite a few variants of RDF2vec which have been examined in the past.
Like all knowledge graph embedding methods, RDF2vec projects entities into a continuous vector space.
However, recent works have shown that this precision might actually not be needed. Using binary vectors instead of floating point ones can also yield competitive results, while requiring just a fraction of the storage capacity and processing memory. [Faria de Souza and Paulheim, 2024]
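To give a feeling for the storage argument, the sketch below binarizes vectors by naive sign thresholding and compares them with the Hamming distance. This thresholding is an assumption made purely for illustration and not necessarily the procedure used in the cited work; the helper names and the toy vectors are hypothetical as well.

```python
def binarize(vector):
    """Pack the sign pattern of a float vector into an int, one bit per dimension."""
    bits = 0
    for i, x in enumerate(vector):
        if x > 0:
            bits |= 1 << i
    return bits


def hamming_distance(a, b):
    """Number of bit positions in which two packed vectors differ."""
    return bin(a ^ b).count("1")


def storage_bytes(dimensions, bits_per_dimension):
    """Bytes needed for one vector, rounded up to whole bytes."""
    return (dimensions * bits_per_dimension + 7) // 8


# Made-up toy vectors, for illustration only.
a = [0.4, -0.2, 0.1, -0.7]
b = [0.3, -0.1, -0.2, -0.5]
c = [-0.6, 0.5, -0.3, 0.2]

print(hamming_distance(binarize(a), binarize(b)))    # a and b differ in one sign
print(hamming_distance(binarize(a), binarize(c)))    # a and c differ in every sign
print(storage_bytes(200, 32), storage_bytes(200, 1)) # 32-bit floats vs. single bits
```

For an assumed 200-dimensional vector, 32-bit floats take 800 bytes while one bit per dimension takes 25 bytes, which is where the reduction in storage comes from.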
Natively, RDF2vec does not incorporate literals. However, they can be incorporated with a few tricks:
RDF2vec supports Knowledge Graph Updates.
Most knowledge graph embedding methods do not support knowledge graph updates and require re-training a model from scratch when a knowledge graph changes. Since word2vec provides a mechanism for updating word vectors and also learning vectors for new words, RDF2vec is capable of adapting its vectors upon updates in the knowledge graph without a full retraining. The adaptation can be performed in a fraction of the time a full retraining would take. [Hahn and Paulheim, 2024]
Initially, word2vec was created for natural language, which allows a certain variety in word ordering without changing the meaning. Walks extracted from graphs are different: there, the order of the elements carries meaning.
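The difference can be made concrete with a small sketch on the example walks from the beginning of this page. It compares an order-agnostic context (a plain set of neighbors) with a context in which every neighbor is tagged with its relative position. The position tagging is a simplified illustration and not the mechanism of any specific order-aware variant; all function names are assumptions.

```python
# The three example walks from the beginning of this page.
WALKS = [
    ["Hamburg", "country", "Germany", "leader", "Angela_Merkel"],
    ["Germany", "leader", "Angela_Merkel", "birthPlace", "Hamburg"],
    ["Hamburg", "leader", "Peter_Tschentscher", "residence", "Hamburg"],
]


def bag_context(walk, index, window=2):
    """Order-agnostic context: the set of neighbors within the window."""
    left = walk[max(0, index - window):index]
    right = walk[index + 1:index + 1 + window]
    return set(left + right)


def positional_context(walk, index, window=2):
    """Order-aware context: each neighbor is paired with its relative offset."""
    context = set()
    for j in range(max(0, index - window), min(len(walk), index + window + 1)):
        if j != index:
            context.add((j - index, walk[j]))
    return context


# Germany and Angela_Merkel are the middle elements of the first two walks.
shared_bag = bag_context(WALKS[0], 2) & bag_context(WALKS[1], 2)
germany_pos = positional_context(WALKS[0], 2)
merkel_pos = positional_context(WALKS[1], 2)
shared_pos = germany_pos & merkel_pos

print(shared_bag)  # Hamburg and leader are shared when order is ignored
print(shared_pos)  # nothing is shared once positions are taken into account
```

Ignoring order, Germany and Angela_Merkel share the context items Hamburg and leader; with positions, leader follows Germany but precedes Angela_Merkel, so the two contexts no longer overlap.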
Consider, for example, the case of creating embedding vectors for bread in sentences such as "Tom ate bread yesterday morning" and "Yesterday morning, Tom ate bread": the word order differs, but the meaning does not. For walks extracted from graphs, however, it makes a difference whether a predicate appears before or after the entity at hand. Consider the example above, where all three entities in the middle (Angela_Merkel, Peter_Tschentscher, and Germany) share the same context items (i.e., Hamburg and leader). However, for the semantics of an entity, it makes a difference whether that entity is or has a leader.
RDF2vec always generates embedding vectors for an entire knowledge graph. In many practical cases, however, we only need vectors for a small set of target entities. In such cases, generating vectors for an entire large graph like DBpedia would not be a practical solution.
RDF2vec can explicitly trade off similarity and relatedness.
One of the key findings of the comparison of RDF2vec to embedding approaches for link prediction, such as TransE, is that while embedding approaches for link prediction create an embedding space in which the distance metric encodes similarity of entities, the distance metric in the RDF2vec embedding space mixes similarity and relatedness [Portisch et al., 2022]. This behavior can be influenced by changing the walk strategy, thereby creating embedding spaces which explicitly emphasize similarity or relatedness. The corresponding walk strategies are called p-walks and e-walks. In the above example, a set of p-walks would be:
birthPlace -> country -> Germany -> leader -> birthPlace
country -> leader -> Angela_Merkel -> birthPlace -> leader
residence -> leader -> Peter_Tschentscher -> residence -> leader
Likewise, the set of e-walks would be:
Peter_Tschentscher -> Hamburg -> Germany -> Angela_Merkel -> Hamburg
Hamburg -> Germany -> Angela_Merkel -> Hamburg -> Peter_Tschentscher
Angela_Merkel -> Hamburg -> Peter_Tschentscher -> Hamburg -> Germany
It has been shown that embedding vectors computed based on e-walks create a vector space encoding relatedness, while embedding vectors computed based on p-walks create a vector space encoding similarity. [Portisch and Paulheim, 2022]
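The relation between ordinary walks and the two walk types above can be sketched as follows, assuming that a walk is stored as a list alternating between entities (even positions) and predicates (odd positions). The function names and the two full walks, which were reconstructed from the example graph, are assumptions for illustration.

```python
def p_walk(walk, focus_index):
    """Keep the focus entity and all predicates; drop every other entity."""
    if focus_index % 2 != 0:
        raise ValueError("focus_index must point at an entity (even position)")
    return [x for i, x in enumerate(walk) if i % 2 == 1 or i == focus_index]


def e_walk(walk):
    """Keep only the entities; drop all predicates."""
    return walk[::2]


# Full walks reconstructed from the example graph (assumption).
walk_a = ["Angela_Merkel", "birthPlace", "Hamburg", "country", "Germany",
          "leader", "Angela_Merkel", "birthPlace", "Hamburg"]
walk_b = ["Peter_Tschentscher", "residence", "Hamburg", "country", "Germany",
          "leader", "Angela_Merkel", "birthPlace", "Hamburg"]

print(" -> ".join(p_walk(walk_a, focus_index=4)))
print(" -> ".join(e_walk(walk_b)))
```

The first call reproduces the first p-walk listed above, with Germany as the focus entity, and the second call reproduces the first e-walk.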
Besides e-walks and p-walks, the creation of walks is the aspect of RDF2vec which has undergone the most extensive research so far. While the original implementation uses random walks, alternatives that have been explored include:
Besides changing the walk creation itself, there are also approaches for incorporating additional information in the walks:
RDF2vec relies on the word2vec embedding mechanism. However, other word embedding approaches have also been discussed.
While the original RDF2vec approach is agnostic to the type of knowledge encoded in RDF, it is also possible to extend the approach to specific types of datasets.
To materialize or not to materialize? While it might look like a good idea to enrich the knowledge graph with implicit knowledge before training the embeddings, experimental results show that materializing implicit knowledge actually makes the resulting embedding worse, not better.
RDF2vec can only learn that two entities are similar based on signals that can co-appear in a graph walk. For that reason, it is, for example, impossible to learn that two entities are similar because they have an ingoing edge from an entity of the same type (see also the results on the DLCC node classification benchmark [Portisch and Paulheim, 2022]). Looking at the following triples:
:Germany rdf:type :EuropeanCountry .
:Germany :capital :Berlin .
:France rdf:type :EuropeanCountry .
:France :capital :Paris .
:Thailand rdf:type :AsianCountry .
:Thailand :capital :Bangkok .
In this example, it is impossible for RDF2vec to learn that Berlin is more similar to Paris than to Bangkok, since the entities EuropeanCountry and AsianCountry never co-occur in any walk with the city entities. Therefore, injecting structural information into RDF2vec may improve the results.
Knowledge Graphs usually do not contain negative statements. However, in cases where negative statements are present, there are different ways of handling them in the embedding creation.
Knowledge Graph Embedding methods are usually black box models, i.e., the predictions they make are not inherently explainable.
Other useful resources for working with RDF2vec:
RDF2vec has been used in a variety of applications. In the following, we list a number of those, organized by different fields of applications.
These are the core publications of RDF2vec:
Further references used above:
The original development of RDF2vec was funded in the project Mine@LOD by the Deutsche Forschungsgemeinschaft (DFG) under grant number PA 2373/1-1 from 2013 to 2018. Additional developments and extensive experiments have been performed by Jan Portisch, funded by SAP SE.
If you are aware of any implementations, extensions, pre-trained models, or applications of RDF2vec not listed on this Web page, please get in touch with Heiko Paulheim.