RDF2vec is a tool for creating vector representations of RDF graphs. In essence, RDF2vec creates a numeric vector for each node in an RDF graph.
RDF2vec was developed by Petar Ristoski as a key contribution of his PhD thesis Exploiting Semantic Web Knowledge Graphs in Data Mining [Ristoski, 2019], which he defended in January 2018 at the Data and Web Science Group at the University of Mannheim, supervised by Heiko Paulheim. In 2019, he was awarded the SWSA Distinguished Dissertation Award for this outstanding contribution to the field.
RDF2vec was inspired by the word2vec approach [Mikolov et al., 2013] for representing words in a numeric vector space. word2vec takes a set of sentences as input and trains a neural network using one of the following two variants: predicting a word given its context words (continuous bag of words, or CBOW), or predicting the context words given a word (skip-gram, or SG).
This approach can be applied to RDF graphs as well. In the original version presented at ISWC 2016 [Ristoski and Paulheim, 2016], random walks on the RDF graph are used to create sequences of RDF nodes, which are then used as input for the word2vec algorithm. It has been shown that such a representation can be utilized in many application scenarios, such as using knowledge graphs as background knowledge in data mining tasks, or for building content-based recommender systems [Ristoski et al., 2019].
The resulting vectors have similar properties to word2vec embeddings. In particular, similar entities are closer in the vector space than dissimilar ones, which makes these representations well suited for learning patterns about those entities. In the example below, showing embeddings for DBpedia and Wikidata, countries and cities are grouped together, and European and Asian cities and countries form clusters.
The two figures above indicate that classes (in the example: countries and cities) can be separated well in the projected vector space, as indicated by the dashed lines. [Zouaq and Martel, 2020] compared different knowledge graph embedding methods with respect to how well they separate the classes of a knowledge graph. They showed that RDF2vec outperforms other embedding methods such as TransE, TransH, TransD, ComplEx, and DistMult, in particular on smaller classes. On the task of entity classification, RDF2vec shows results competitive with more recent graph convolutional neural networks [Schlichtkrull et al., 2018].
RDF2vec has been tailored to RDF graphs by respecting the type of each edge (i.e., its predicate). Related variants, like node2vec [Grover and Leskovec, 2016] or DeepWalk [Perozzi et al., 2014], are defined for graphs with just one type of edge. They create sequences of nodes only, while RDF2vec creates alternating sequences of entities and predicates.
This video by Petar Ristoski introduces the main ideas of RDF2vec:
There are a few different implementations of RDF2vec out there:
Training RDF2vec from scratch can take quite a bit of time. Here is a list of pre-trained models we know:
There is also an alternative for downloading and processing an entire knowledge graph embedding (which may consume several GB):
Quite a few variants of RDF2vec have been examined in the past.
RDF2vec always generates embedding vectors for an entire knowledge graph. In many practical cases, however, we only need vectors for a small set of target entities. In such cases, generating vectors for an entire large graph like DBpedia would not be a practical solution.
One area that has undergone extensive research is the creation of the walks for the RDF2vec algorithm. While the original implementation uses random walks, explored alternatives include:
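One family of such alternatives replaces the uniform choice of the next edge with a biased one, where edges carry weights derived, for example, from predicate frequencies or PageRank scores. The sketch below illustrates only the weighted step itself; the graph and the weights are made-up placeholders, not a specific published weighting scheme:

```python
import random

# Toy weighted graph: subject -> [(predicate, object, weight), ...].
# The weights here are illustrative; biased-walk variants derive them
# from graph statistics such as predicate frequency or PageRank.
graph = {
    "dbr:Berlin": [("dbo:capitalOf", "dbr:Germany", 3.0),
                   ("dbo:mayor", "dbr:Some_Person", 0.5)],
}

def biased_step(edges, rng):
    """Pick the next edge with probability proportional to its weight."""
    weights = [w for _, _, w in edges]
    predicate, node, _ = rng.choices(edges, weights=weights, k=1)[0]
    return predicate, node

rng = random.Random(1)
counts = {"dbr:Germany": 0, "dbr:Some_Person": 0}
for _ in range(1000):
    _, node = biased_step(graph["dbr:Berlin"], rng)
    counts[node] += 1
# The heavier edge is chosen far more often than the lighter one,
# so "important" parts of the graph dominate the walk corpus.
```

Plugging `biased_step` into the walk loop in place of a uniform `rng.choice` is all that changes; the rest of the pipeline stays as before.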
RDF2vec relies on the word2vec embedding mechanism once the sequences are created. This is not the only choice:
While the original RDF2vec approach is agnostic to the type of knowledge encoded in RDF, it is also possible to extend the approach to specific types of datasets.
To materialize or not to materialize? While it might look like a good idea to enrich the knowledge graph with implicit knowledge before training the embeddings, experimental results show that materializing implicit knowledge actually makes the resulting embeddings worse, not better.
Other useful resources for working with RDF2vec:
RDF2vec has been used in a variety of applications. In the following, we list a number of those, organized by different fields of applications.
These are the core publications of RDF2vec:
Further references used above: