
The hitchhiker's guide to RDF2vec.

About RDF2vec

RDF2vec is a tool for creating vector representations of RDF graphs. In essence, RDF2vec creates a numeric vector for each node in an RDF graph.

RDF2vec was developed by Petar Ristoski as a key contribution of his PhD thesis Exploiting Semantic Web Knowledge Graphs in Data Mining [Ristoski, 2019], which he defended in January 2018 at the Data and Web Science Group at the University of Mannheim, supervised by Heiko Paulheim. In 2019, he was awarded the SWSA Distinguished Dissertation Award for this outstanding contribution to the field.

RDF2vec was inspired by the word2vec approach [Mikolov et al., 2013] for representing words in a numeric vector space. word2vec takes as input a set of sentences, and trains a neural network using one of the following two variants: predict a word given its context words (continuous bag of words, or CBOW), or predict the context words given a word (skip-gram, or SG).
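To make the two training objectives concrete, the following minimal sketch lists the training examples each variant would derive from a single token sequence. The function names and the fixed window size are illustrative assumptions; real word2vec implementations additionally use techniques such as negative sampling, which are omitted here.

```python
def context_window(tokens, i, window):
    """Return the tokens at most `window` positions left and right of position i."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: one (context words, target word) example per position."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens, window=2):
    """Skip-gram: one (target word, context word) example per pair."""
    return [
        (tokens[i], ctx)
        for i in range(len(tokens))
        for ctx in context_window(tokens, i, window)
    ]


sentence = ["Tom", "ate", "bread", "yesterday", "morning"]
# CBOW example for the word in the middle of the sentence
print(cbow_examples(sentence)[2])
# Skip-gram examples that have "bread" as the input word
print([pair for pair in skipgram_examples(sentence) if pair[0] == "bread"])
```

In both variants, the vectors are the weights learned while solving these prediction tasks; the predictions themselves are usually discarded.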

This approach can be applied to RDF graphs as well. In the original version presented at ISWC 2016 [Ristoski and Paulheim, 2016], random walks on the RDF graph are used to create sequences of RDF nodes, which are then used as input for the word2vec algorithm. It has been shown that such a representation can be utilized in many application scenarios, such as using knowledge graphs as background knowledge in data mining tasks, or for building content-based recommender systems [Ristoski et al., 2019].

Consider the following example graph:

From this graph, a set of random walks that could be extracted may look as follows:

Hamburg -> country -> Germany            -> leader     -> Angela_Merkel
Germany -> leader  -> Angela_Merkel      -> birthPlace -> Hamburg
Hamburg -> leader  -> Peter_Tschentscher -> residence  -> Hamburg

For those random walks, we consider each element (i.e., an entity or a predicate) as a word when running word2vec. As a result, we obtain vectors for all entities (and all predicates) in the graph.
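The walk extraction step can be sketched in a few lines of plain Python. The triples below are reconstructed from the three example walks and are therefore an assumption (the example graph may contain further edges); the function names, the number of walks per entity, and the walk depth are illustrative choices, not those of any particular RDF2vec implementation.

```python
import random
from collections import defaultdict

# Triples reconstructed from the example walks above (an assumption).
TRIPLES = [
    ("Hamburg", "country", "Germany"),
    ("Germany", "leader", "Angela_Merkel"),
    ("Angela_Merkel", "birthPlace", "Hamburg"),
    ("Hamburg", "leader", "Peter_Tschentscher"),
    ("Peter_Tschentscher", "residence", "Hamburg"),
]


def build_index(triples):
    """Map each subject to its outgoing (predicate, object) pairs."""
    outgoing = defaultdict(list)
    for s, p, o in triples:
        outgoing[s].append((p, o))
    return outgoing


def random_walk(outgoing, start, hops, rng):
    """Follow up to `hops` outgoing edges, emitting entities and predicates alternately."""
    walk = [start]
    node = start
    for _ in range(hops):
        edges = outgoing.get(node)
        if not edges:  # dead end: stop early
            break
        predicate, node = rng.choice(edges)
        walk += [predicate, node]
    return walk


def extract_walks(triples, walks_per_entity=3, hops=2, seed=42):
    """Generate `walks_per_entity` random walks starting from every subject entity."""
    rng = random.Random(seed)
    outgoing = build_index(triples)
    return [
        random_walk(outgoing, start, hops, rng)
        for start in sorted(outgoing)
        for _ in range(walks_per_entity)
    ]


walks = extract_walks(TRIPLES)
for walk in walks:
    print(" -> ".join(walk))
```

Each walk is a list of tokens, so the list of walks can be passed as the sentence corpus to a word2vec implementation, for example to the Word2Vec class in gensim.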

The resulting vectors have properties similar to those of word2vec embeddings. In particular, similar entities are closer in the vector space than dissimilar ones (see [Hubert et al., 2024]), which makes those representations ideal for learning patterns about those entities. In the example below, showing embeddings for DBpedia and Wikidata, countries and cities are grouped together, and European and Asian cities and countries form clusters:

The two figures above indicate that classes (in the example: countries and cities) can be separated well in the projected vector space, as indicated by the dashed lines. [Zouaq and Martel, 2020] have compared the suitability for separating classes in a knowledge graph for different knowledge graph embedding methods. They have shown that RDF2vec outperforms other embedding methods like TransE, TransH, TransD, ComplEx, and DistMult, in particular on smaller classes. On the task of entity classification, RDF2vec shows results which are competitive with more recent graph convolutional neural networks [Schlichtkrull et al., 2018]. [Meghraoui et al., 2024] have proposed measuring the class separability of embedding methods as a means to evaluate the modeling quality of an ontology.

RDF2vec has been tailored to RDF graphs by respecting the type of edges (i.e., the predicates). Related variants, like node2vec [Grover and Leskovec, 2016] or DeepWalk [Perozzi et al., 2014], are defined for graphs with just one type of edge. They create sequences of nodes, while RDF2vec creates alternating sequences of entities and predicates.

This video by Petar Ristoski introduces the main ideas of RDF2vec:

Trans* etc. vs. RDF2vec, Similarity vs. Relatedness

A lot of approaches have been proposed for link prediction in knowledge graphs, from classic approaches like TransE [Bordes et al., 2013] and RESCAL [Nickel et al., 2011] to countless variants. The key difference is that those approaches are trained to optimize a loss function on link prediction, which yields a projection of similar entities close together in the vector space as a by-product. On the other hand, the capability to predict links is a by-product in RDF2vec, in particular in variants like order-aware RDF2vec. A detailed comparison of the commonalities and differences of those families of approaches can be found in [Portisch et al., 2022].


There are a few different implementations of RDF2vec out there:

  • The original implementation from the 2016 paper. Not well documented. Uses Java for walk generation, and Python/gensim for the embedding training.
  • jRDF2vec is a more versatile and better performing Java-based implementation. Like the original one, it uses Java to generate the walks, and Python/gensim for training the embedding. There is also a Docker image available here. jRDF2vec is the best performing end-to-end implementation for RDF2vec. It also implements many variants, such as RDF2vec Light, as well as p-walks and e-walks (see below).
  • pyRDF2vec [Vandewiele et al., 2022] is a pure Python-based implementation. It implements multiple strategies to generate the walks, not only random walks, and also has an implementation of RDF2vec Light (see below).
  • ataweel55's implementation is another pure Python-based implementation. It includes all strategies for biasing the walks described in [Cochez et al., 2017a] and [Al Taweel and Paulheim, 2020].
  • There is a high performance C++ based implementation for creating walks (also with different weighting mechanisms [Cochez et al., 2017]), which can be considered the fastest implementation for walk extraction from RDF files.
  • While all of those approaches use the word2vec implementation in gensim, there is also a PyTorch-based implementation, which also implements the word2vec part in pure Python.

Models and Services

Training RDF2vec from scratch can take quite a bit of time. Here is a list of pre-trained models we know:

There is also an alternative to downloading and processing an entire knowledge graph embedding (which may consume several GB):

  • KGvec2go provides a REST API for retrieving pre-computed embedding vectors for selected entities one by one, as well as further functions, such as computing the vector space similarity of two concepts, and retrieving the n closest concepts. There is also a service for RDF2vec Light (see below) [Portisch et al., 2020].
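The two vector operations mentioned above, similarity of two concepts and retrieval of the n closest concepts, can be sketched locally as follows. This is not the KGvec2go API; the function names are hypothetical, and the three-dimensional vectors are made up for illustration (real models typically have a few hundred dimensions).

```python
import math

# Made-up toy vectors, for illustration only.
VECTORS = {
    "Germany": [0.9, 0.1, 0.2],
    "France": [0.8, 0.2, 0.1],
    "Berlin": [0.2, 0.9, 0.1],
    "Paris": [0.1, 0.8, 0.2],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_concepts(vectors, concept, n):
    """Return the n concepts most similar to `concept`, best match first."""
    query = vectors[concept]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != concept
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:n]


print(cosine_similarity(VECTORS["Germany"], VECTORS["France"]))
print(closest_concepts(VECTORS, "Germany", 2))
```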

Extensions and Variants

There are quite a few variants of RDF2vec which have been examined in the past.

  • Walking RDF and OWL pursues exactly the same idea as RDF2vec, and the two can be considered identical. It uses random walks and Skip Gram embeddings. The approach has been developed at the same time as RDF2vec. [Alsharani et al., 2017]
  • KG2vec pursues a similar idea as RDF2vec by first transforming the directed, labeled RDF graph into an undirected, unlabeled graph (using nodes for the relations) and then extracting walks from that transformed graph. [Wang et al., 2021] Although no direct comparison is available, we assume that the embeddings are comparable.
  • Wembedder is a simplified version of RDF2vec which uses the raw triples of a knowledge graph as input to the word2vec implementation, instead of random walks. It serves pre-computed vectors for Wikidata. [Nielsen, 2017]
  • KG2vec (not to be confused with the aforementioned approach also named KG2vec) follows the same idea of using triples as input to a Skip-Gram algorithm. [Soru et al., 2018]
  • Triple2Vec follows a similar idea of walk-based embedding generation, but embeds entire triples instead of nodes. [Fionda and Pirrò, 2020]
  • [Van and Lee, 2023] propose different extensions to creating walks for RDF2vec, including the usage of text literals by means of creating new graph nodes for similar text literals, as well as the introduction of latent walks which capture relations which are not explicit in the knowledge graph.
  • RDFstar2vec is an extension of RDF2vec which works on RDF-star graphs. It defines additional walk strategies for quoted triples. [Egami et al., 2023]

Natively, RDF2vec does not incorporate literals. However, they can be incorporated with a few tricks:

  • pyRDF2vec (see above) has an option which adds literals of entities as direct features, creating a heterogeneous feature vector consisting of the embedding dimensions and additional features from the literal values.
  • There are quite a few graph preprocessing operators which can be utilized to incorporate literals by representing their information in the form of entities and relations, so that they are processed by RDF2vec (and other embedding methods). Even simple baselines which are efficient and do not increase the graph size can boost the performance of RDF2vec. [Preisner and Paulheim, 2023]
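As a rough illustration of the second idea, the sketch below replaces numeric literals by bin entities, so that a walk can pass through them like through any other node. Equal-width binning, the naming scheme for the bin entities, and the sample values are assumptions made for this example; they are not taken from [Preisner and Paulheim, 2023].

```python
def bin_numeric_literals(literal_triples, num_bins=3):
    """Replace (subject, predicate, number) by (subject, predicate, bin entity).

    Bins are computed per predicate with equal width between min and max.
    """
    values_by_predicate = {}
    for _, p, value in literal_triples:
        values_by_predicate.setdefault(p, []).append(value)

    result = []
    for s, p, value in literal_triples:
        lo, hi = min(values_by_predicate[p]), max(values_by_predicate[p])
        width = (hi - lo) / num_bins or 1.0  # fall back to 1.0 if all values are equal
        index = min(int((value - lo) / width), num_bins - 1)
        result.append((s, p, f"{p}_bin_{index}"))
    return result


# Hypothetical literal statements with made-up values
literals = [
    ("CityA", "population", 100),
    ("CityB", "population", 200),
    ("CityC", "population", 900),
]
for triple in bin_numeric_literals(literals):
    print(triple)
```

After this step, CityA and CityB point to the same bin entity and can therefore co-occur with it in walks, which a raw literal value would not allow.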

RDF2vec supports Knowledge Graph Updates.

Most knowledge graph embedding methods do not support knowledge graph updates and require re-training a model from scratch when a knowledge graph changes. Since word2vec provides a mechanism for updating word vectors and also learning vectors for new words, RDF2vec is capable of adapting its vectors upon updates in the knowledge graph without a full retraining. The adaptation can be performed in a fraction of the time a full retraining would take. [Hahn and Paulheim, 2024]

Initially, word2vec was created for natural language, which shows a bit of variety with respect to word ordering. In walks extracted from graphs, in contrast, the order of the elements carries meaning.

Consider, for example, the case of creating embedding vectors for bread in sentences such as "Tom ate bread yesterday morning" and "Yesterday morning, Tom ate bread". Both sentences express the same meaning, so the position of the context words matters little. For walks extracted from graphs, however, it makes a difference whether a predicate appears before or after the entity at hand. Consider the example above, where all three entities in the middle (Angela_Merkel, Peter_Tschentscher, and Germany) share the same context items (i.e., Hamburg and leader). However, for the semantics of an entity, it makes a difference whether that entity is or has a leader.
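The effect can be checked directly on the three example walks. The sketch below compares an unordered context (as used by classic word2vec) with a context in which every token is tagged with its offset to the entity at hand. The position tagging is only meant to illustrate the idea of order awareness; it is an assumption of this example and not a description of how order-aware RDF2vec is trained.

```python
WALKS = [
    ["Hamburg", "country", "Germany", "leader", "Angela_Merkel"],
    ["Germany", "leader", "Angela_Merkel", "birthPlace", "Hamburg"],
    ["Hamburg", "leader", "Peter_Tschentscher", "residence", "Hamburg"],
]


def bag_context(walk, i, window=2):
    """Unordered view: the set of tokens around position i."""
    return set(walk[max(0, i - window):i] + walk[i + 1:i + 1 + window])


def positional_context(walk, i, window=2):
    """Order-aware view: each context token is tagged with its offset to position i."""
    return {
        (walk[j], j - i)
        for j in range(max(0, i - window), min(len(walk), i + window + 1))
        if j != i
    }


MIDDLE = 2  # index of the entity in the middle of each walk

# Context items shared by all three middle entities when order is ignored
shared = set.intersection(*(bag_context(w, MIDDLE) for w in WALKS))
print(shared)

# With offsets, "leader" appears after Germany but before the two persons
for w in WALKS:
    offsets = [off for tok, off in positional_context(w, MIDDLE) if tok == "leader"]
    print(w[MIDDLE], offsets)
```

Ignoring order, the three entities look alike with respect to Hamburg and leader; with offsets, Germany (which has a leader) is distinguished from the two persons (who are leaders).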

RDF2vec always generates embedding vectors for an entire knowledge graph. In many practical cases, however, we only need vectors for a small set of target entities. In such cases, generating vectors for an entire large graph like DBpedia would not be a practical solution.

  • RDF2vec Light is an alternative which can be used in such scenarios. It only creates random walks on a subset of the knowledge graph and can produce embedding vectors for a target subset of entities fast. In many cases, the results are competitive with those achieved with embeddings of the full graph. [Portisch et al., 2020] Details about the implementation are found here.
  • LODVec uses the same mechanism as RDF2vec Light, but creates sequences across different datasets by exploiting owl:sameAs links, and unifying classes and predicates by exploiting owl:equivalentClass and owl:equivalentProperty definitions. [Mountantonakis and Tzitzikas, 2021]
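The core idea of restricting walk generation to a set of target entities can be sketched as follows. This is a simplification under stated assumptions: walks only start at the target entities and follow outgoing edges, whereas the actual RDF2vec Light implementation may place the target entity at other positions of a walk. The toy triples and function names are hypothetical.

```python
from collections import defaultdict

# Toy graph; in practice this would be a large graph such as DBpedia.
TRIPLES = [
    ("Germany", "capital", "Berlin"),
    ("Germany", "leader", "Angela_Merkel"),
    ("Angela_Merkel", "birthPlace", "Hamburg"),
    ("France", "capital", "Paris"),
    ("Thailand", "capital", "Bangkok"),
]


def walks_from(outgoing, start, hops):
    """Enumerate all forward walks of up to `hops` edges starting at `start`."""
    walks = [[start]]
    for _ in range(hops):
        extended = []
        for walk in walks:
            edges = outgoing.get(walk[-1], [])
            if not edges:
                extended.append(walk)  # keep dead-end walks as they are
            for predicate, obj in edges:
                extended.append(walk + [predicate, obj])
        walks = extended
    return walks


def light_walks(triples, targets, hops=2):
    """Generate walks for the target entities only, not for the whole graph."""
    outgoing = defaultdict(list)
    for s, p, o in triples:
        outgoing[s].append((p, o))
    return [walk for target in targets for walk in walks_from(outgoing, target, hops)]


walks = light_walks(TRIPLES, targets=["Germany"])
vocabulary = {token for walk in walks for token in walk}
print(walks)
print(sorted(vocabulary))
```

Only the part of the graph reachable from the targets ends up in the training corpus, which is why the training is fast but yields no vectors for the remaining entities.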

RDF2vec can explicitly trade off similarity and relatedness.

One of the key findings of the comparison of RDF2vec to embedding approaches for link prediction, such as TransE, is that while embedding approaches for link prediction create an embedding space in which the distance metric encodes similarity of entities, the distance metric in the RDF2vec embedding space mixes similarity and relatedness [Portisch et al., 2022]. This behavior can be influenced by changing the walk strategy, thereby creating embedding spaces which explicitly emphasize similarity or relatedness. The corresponding walk strategies are called p-walks and e-walks. In the above example, a set of p-walks would be:

birthPlace -> country -> Germany            -> leader     -> birthPlace
country    -> leader  -> Angela_Merkel      -> birthPlace -> leader
residence  -> leader  -> Peter_Tschentscher -> residence  -> leader

Likewise, the set of e-walks would be

Peter_Tschentscher -> Hamburg -> Germany            -> Angela_Merkel -> Hamburg
Hamburg            -> Germany -> Angela_Merkel      -> Hamburg       -> Peter_Tschentscher
Angela_Merkel      -> Hamburg -> Peter_Tschentscher -> Hamburg       -> Germany

It has been shown that embedding vectors computed based on e-walks create a vector space encoding relatedness, while embedding vectors computed based on p-walks create a vector space encoding similarity. [Portisch and Paulheim, 2022]
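Both walk types can be derived from ordinary walks by dropping elements. The following sketch assumes walks are lists with entities at even and predicates at odd positions; the two longer input walks are reconstructed from the example graph so that the outputs match the first p-walk and the first e-walk listed above. The function names are hypothetical.

```python
def p_walk(walk, focus_index):
    """Keep the focus entity and all predicates; drop every other entity."""
    return [tok for i, tok in enumerate(walk) if i % 2 == 1 or i == focus_index]


def e_walk(walk):
    """Keep all entities; drop every predicate."""
    return walk[::2]


# Ordinary walk through the example graph with Germany in the middle (index 4)
walk_around_germany = [
    "Angela_Merkel", "birthPlace", "Hamburg", "country", "Germany",
    "leader", "Angela_Merkel", "birthPlace", "Hamburg",
]
print(" -> ".join(p_walk(walk_around_germany, focus_index=4)))

# Ordinary walk starting at Peter_Tschentscher
walk_from_peter = [
    "Peter_Tschentscher", "residence", "Hamburg", "country", "Germany",
    "leader", "Angela_Merkel", "birthPlace", "Hamburg",
]
print(" -> ".join(e_walk(walk_from_peter)))
```

In a p-walk, the focus entity only co-occurs with predicates, i.e., with the structure around it; in an e-walk, it only co-occurs with the entities in its neighborhood.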

Besides e-walks and p-walks, the creation of walks is the aspect of RDF2vec which has undergone the most extensive research so far. While the original implementation uses random walks, alternatives that have been explored include:

  • The use of different heuristics for biasing the walks, e.g., preferring edges with more/less frequent predicates, preferring links to nodes with higher/lower PageRank, etc. An extensive study is available in [Cochez et al., 2017a].
  • Zhang et al. also propose a different weighting scheme based on Metropolis-Hastings random walks, which reduces the probability of transitioning to a node with high degree and aims at a more balanced distribution of nodes in the walks. [Zhang et al., 2022]
  • A similar approach is analyzed in [Al Taweel and Paulheim, 2020], where embeddings for DBpedia are trained with external edge weights derived from page transition probabilities in Wikipedia.
  • In [Vandewiele et al., 2020], we have analyzed different alternatives to using random walks, such as walk strategies with teleportation within communities. While random walks are usually a good choice, there are scenarios in which other walking strategies are superior.
  • In [Saeed and Prasanna, 2018], the identification of specific properties for groups of entities is discussed as a means to find task-specific edge weights.
  • Similarly, NESP computes semantic similarities between relations in order to create semantically coherent walks. Moreover, the approach foresees refining an existing embedding space by bringing more closely related entities closer together. [Chekol and Pirrò, 2020]
  • Mukherjee et al. [Mukherjee et al., 2019] also observe that biasing the walks with prior knowledge on relevant properties and classes for a domain can improve the results obtained with RDF2vec.
  • The ontowalk2vec approach [Gkotse, 2020] combines the random walk strategies of RDF2vec and node2vec, and trains a language model on the union of both walk sets.
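As an illustration of the first strategy in the list, the sketch below biases the choice of the next edge by predicate frequency. The weighting formulas (raw count and its inverse) are simple stand-ins chosen for this example and are not the exact definitions from [Cochez et al., 2017a]; the triples are the ones reconstructed from the example walks, and the function names are hypothetical.

```python
import random
from collections import Counter, defaultdict

TRIPLES = [
    ("Hamburg", "country", "Germany"),
    ("Germany", "leader", "Angela_Merkel"),
    ("Angela_Merkel", "birthPlace", "Hamburg"),
    ("Hamburg", "leader", "Peter_Tschentscher"),
    ("Peter_Tschentscher", "residence", "Hamburg"),
]


def predicate_frequency_weights(triples, prefer_frequent=True):
    """Weight each predicate by its frequency, or by the inverse frequency."""
    counts = Counter(p for _, p, _ in triples)
    if prefer_frequent:
        return dict(counts)
    return {p: 1.0 / c for p, c in counts.items()}


def biased_walk(triples, start, hops, weights, rng):
    """Random walk in which the next edge is drawn proportionally to its predicate weight."""
    outgoing = defaultdict(list)
    for s, p, o in triples:
        outgoing[s].append((p, o))
    walk, node = [start], start
    for _ in range(hops):
        edges = outgoing.get(node)
        if not edges:
            break
        edge_weights = [weights[p] for p, _ in edges]
        predicate, node = rng.choices(edges, weights=edge_weights, k=1)[0]
        walk += [predicate, node]
    return walk


weights = predicate_frequency_weights(TRIPLES)
print(weights)
print(biased_walk(TRIPLES, "Hamburg", 2, weights, random.Random(0)))
```

With these weights, a walk leaving Hamburg follows the more frequent predicate leader twice as often as country; passing prefer_frequent=False reverses that preference.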

Besides changing the walk creation itself, there are also approaches for incorporating additional information in the walks:

  • [Bachhofner et al., 2021] discuss the inclusion of metadata, such as provenance information, in the walks in order to improve the resulting embeddings.
  • [Pietrasik and Reformat, 2023] introduce a heuristic reduction based on probabilistic properties of the knowledge graph as a preprocessing step, so that a first version of the embedding can be computed on a reduced knowledge graph.

RDF2vec relies on the word2vec embedding mechanism. However, other word embedding approaches have also been discussed.

  • In his master's thesis, Agozzino discusses the usage of FastText and BERT in RDF2vec as an alternative to word2vec. His preliminary experiments suggest that FastText might be a superior alternative to word2vec. [Agozzino, 2021] The FastText variant is also available in the pyRDF2vec implementation.
  • KGlove adapts the GloVe algorithm [Pennington et al., 2014] for creating the embedding vectors [Cochez et al., 2017b]. However, KGlove does not use random walks, but derives the co-occurrence matrix directly from the knowledge graph.

While the original RDF2vec approach is agnostic to the type of knowledge encoded in RDF, it is also possible to extend the approach to specific types of datasets.

To materialize or not to materialize? While it might look like a good idea to enrich the knowledge graph with implicit knowledge before training the embeddings, experimental results show that materializing implicit knowledge actually makes the resulting embedding worse, not better.

  • In [Iana and Paulheim, 2020], we have conducted a series of experiments training embeddings on DBpedia as is, vs. training embeddings on DBpedia with implicit knowledge materialized. In most settings, the results on downstream tasks get worse when adding implicit knowledge. Our hypothesis is that missing information in many knowledge graphs is not missing at random, but a signal of lesser importance, and that signal is canceled out by materialization. A similar observation was made by [Alsharani et al., 2017].

RDF2vec can only learn that two entities are similar based on signals that can co-appear in a graph walk. For that reason, it is, for example, impossible to learn that two entities are similar because they have an ingoing edge from an entity of the same type (see also the results on the DLCC node classification benchmark [Portisch and Paulheim, 2022]). Consider the following triples:

:Germany   rdf:type  :EuropeanCountry .
:Germany   :capital  :Berlin .
:France    rdf:type  :EuropeanCountry .
:France    :capital  :Paris .
:Thailand  rdf:type  :AsianCountry .
:Thailand  :capital  :Bangkok .

In this example, it is impossible for RDF2vec to learn that Berlin is more similar to Paris than to Bangkok, since the entities EuropeanCountry and AsianCountry never co-occur in any walk with the city entities. Therefore, injecting structural information into RDF2vec may improve the results.
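This limitation can be verified by enumerating every possible walk over the six triples. The sketch assumes that walks follow edges in their forward direction only; the function names are hypothetical.

```python
from collections import defaultdict

TRIPLES = [
    ("Germany", "rdf:type", "EuropeanCountry"),
    ("Germany", "capital", "Berlin"),
    ("France", "rdf:type", "EuropeanCountry"),
    ("France", "capital", "Paris"),
    ("Thailand", "rdf:type", "AsianCountry"),
    ("Thailand", "capital", "Bangkok"),
]


def all_walks(triples, max_hops):
    """Enumerate all forward walks with 1 to `max_hops` edges, from every node."""
    outgoing = defaultdict(list)
    nodes = set()
    for s, p, o in triples:
        outgoing[s].append((p, o))
        nodes.update((s, o))
    walks = []
    frontier = [[node] for node in sorted(nodes)]
    for _ in range(max_hops):
        next_frontier = [
            walk + [p, o]
            for walk in frontier
            for p, o in outgoing.get(walk[-1], [])
        ]
        walks.extend(next_frontier)
        frontier = next_frontier
    return walks


def co_occur(walks, a, b):
    """True if at least one walk contains both a and b."""
    return any(a in walk and b in walk for walk in walks)


walks = all_walks(TRIPLES, max_hops=4)
print(co_occur(walks, "Berlin", "Germany"))          # a city and its country share a walk
print(co_occur(walks, "Berlin", "EuropeanCountry"))  # a city and the country's class never do
```

Since both the type edge and the capital edge leave the country node, no forward walk can pass through a class and a city at the same time, regardless of the walk depth.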

  • Liang et al. have proposed an approach for using such structural information by injecting it into the loss function of the downstream task (not the one used for training the embeddings per se). Their results show that the performance of entity classification with RDF2vec can be improved by adding a loss term based on structural similarities.

Knowledge Graphs usually do not contain negative statements. However, in cases where negative statements are present, there are different ways of handling them in the embedding creation.

  • One variant is the encoding of negative statements with specific relations, an approach which can be used with arbitrary embedding methods. When dealing with walk-based methods on large hierarchies, it is possible to encode the negative statements in the direction of walks along the hierarchy, as demonstrated in the TrueWalks approach in [Sousa et al., 2023].

Other Resources

Other useful resources for working with RDF2vec:

  • GEval is a Python-based framework to run evaluations of RDF2vec in the manner of the above-mentioned papers [Pellegrino et al., 2019, Pellegrino et al., 2020].
  • Concept2vec provides a test benchmark for analyzing how well RDF2vec embeddings encode ontological (i.e., schema-level) properties of a knowledge graph [Alshargi et al., 2019].
  • DLCC is another benchmark for analyzing which schema constructs can be learned by embedding models. It comes in two flavours, one based on the real-world knowledge graph DBpedia, another one based on synthetic data [Portisch and Paulheim, 2022].


RDF2vec has been used in a variety of applications. In the following, we list a number of those, organized by different fields of applications.

Knowledge Graph Refinement

Knowledge Graph Refinement subsumes the usage of embeddings for adding additional information to a knowledge graph (e.g., link/relation or type prediction), to extend its schema/ontology, or the identification (and potentially: correction) of existing facts in the graph [Paulheim, 2017]. In most of the applications, RDF2vec embedding vectors are used as representations for training a machine learning classifier for the task at hand, e.g., a predictive model for entity types. Applications in this area include:
  • TIEmb is an approach for learning subsumption relations using RDF2vec embeddings. [Ristoski et al., 2017] The use of RDF2vec for learning subsumptions is also discussed in [Gosselin and Zouaq, 2023], [Shiraishi and Kaneiwa, 2024], and [Pietrasik et al., 2024].
  • Kejriwal and Szekely discuss the use of RDF2vec embeddings for entity type prediction in knowledge graphs. [Kejriwal and Szekely, 2017] Another approach in that direction is proposed by Sofronova et al., who contrast supervised and unsupervised methods for exploiting RDF2vec embeddings for type prediction. [Sofronova et al., 2020] Furthermore, the usage of RDF2vec for type prediction in knowledge graphs is discussed in [Weller, 2021], [Jain et al., 2021], and [Ugai, 2023]. [Cutrona et al., 2021] report that using RDF2vec embeddings for type prediction yields similarly scoring results as using BERT embeddings [Devlin et al., 2018] trained on the entities' textual abstracts. The combination of textual entity information, encoded with BERT, and graph information, encoded with RDF2vec, is discussed in [Biswas et al., 2022].
  • Daga and Groth also use RDF2vec to classify nodes in a knowledge graph extracted from Python notebooks on Kaggle. They show that the classification using RDF2vec significantly outperforms the usage of the pre-trained CodeBERTa model. [Daga and Groth, 2022]
  • Shahinmoghadam et al. discuss the use of RDF2vec embeddings for node classification in the building information modeling field. They show that the combination of dimensionality reduction of the embedding space using Kernel PCA and a downstream classifier yields the best results. [Shahinmoghadam et al., 2022]
  • GraphEmbeddings4DDI utilizes RDF2vec for predicting drug-drug interactions [Çelebi et al., 2018]. A similar system is introduced by Karim et al., using a complex LSTM on top of the entity embeddings generated with RDF2vec [Karim et al., 2019]. Since the drug-drug interactions are modeled as relations in the knowledge graphs used for the experiments, this task is essentially a relation prediction task. [Zhang et al., 2022] also target the prediction of drug-drug interactions and drug-target interactions, using a combination of CNN and BiLSTM as a downstream prediction model.
  • Ammar and Celebi showcase the use of RDF2vec embeddings for the fact validation task at the 2019 edition of the Semantic Web Challenge. [Ammar and Celebi, 2019] A similar approach is pursued by Pister and Atemezing [Pister and Atemezing, 2019]. Qudus et al. discuss a hybrid fact checking approach using text and knowledge graph embeddings. They show that hybrid approaches built with RDF2vec outperform those built on most other embedding techniques. [Qudus et al., 2022]
  • Chen et al. show that RDF2vec embeddings can be used for relation prediction and yield results competitive with TransE and DistMult [Chen et al., 2020].
  • Yao and Barbosa combine RDF2vec and outlier detection for detecting wrong type assertions in knowledge graphs [Yao and Barbosa, 2021].
  • Egami et al. utilize RDF2vec for clustering activities in a knowledge graph of daily living activities, and discuss the use of those clusters for refining the underlying ontology. [Egami et al., 2021] Two further clustering case studies in the fields of public procurement and drugs are discussed in [Donini et al., 2024].
  • [Heilig et al., 2022] use RDF2vec embeddings on a biomedical knowledge graph for refining rules for medical diagnosis. This gives an interesting example for combining embeddings with explainable artificial intelligence: the embeddings are not used directly for prediction, but rather to refine interpretable rules, which are reviewed by medical experts.
  • [Gonzalez-Hevia and Gayo-Avello] propose the extraction of knowledge graphs containing not only the information about the entity itself, but also its edit history to improve type prediction. One of their approaches uses RDF2vec embeddings on those extracted knowledge graphs.
  • [Silva Neto, 2023] uses RDF2vec, among other representations, to cluster entities for class learning in knowledge graphs.
  • [Potoniec, 2020] uses RDF2vec representations of triples (by concatenating subject, predicate, and object vectors), together with an RNN, to predict characteristics of object properties, such as symmetry and transitivity.
  • [Zhai et al., 2024] use RDF2vec representations of entities to approximate embeddings for relations (as averages of subject and object embeddings) and paths (as averages of all path element embeddings) in order to find replacements for relations in wrong triples.
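The last two items in the list build composite representations from entity and predicate vectors. A minimal sketch of both operations, concatenation for triples and averaging for relations, is given below; the two-dimensional vectors are made up for illustration, and the function names are hypothetical rather than taken from the cited papers.

```python
def concatenate(subject_vec, predicate_vec, object_vec):
    """Represent a triple by concatenating its subject, predicate, and object vectors."""
    return list(subject_vec) + list(predicate_vec) + list(object_vec)


def average(vectors):
    """Component-wise average of a non-empty list of equally sized vectors."""
    vectors = list(vectors)
    return [sum(components) / len(vectors) for components in zip(*vectors)]


# Made-up toy vectors
germany = [1.0, 2.0]
capital = [0.5, 0.5]
berlin = [3.0, 4.0]

# Triple representation with three times the embedding dimensionality
print(concatenate(germany, capital, berlin))

# Approximate relation vector as the average of a subject and an object vector
print(average([germany, berlin]))
```

The concatenation keeps the three roles apart and can be fed to a downstream model, while the average stays in the original embedding space, so it can be compared to other vectors with the same distance metric.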

Knowledge Matching and Integration

In knowledge matching and integration, entity embedding vectors are mostly utilized to determine whether two entities in two datasets are similar enough to each other to merge them into one. Different approaches have been proposed using RDF2vec for matching and integration on both the schema and the instance level:
  • MERGILO is a tool for merging structured knowledge extracted from text. A refinement of MERGILO using RDF2vec embeddings on FrameNet is discussed in [Alam et al., 2017].
  • EARL is a named entity linking tool which uses pre-trained RDF2vec embeddings. [Dubey et al., 2018]
  • ALOD2vec Matcher is an ontology matching system which uses pre-trained embeddings on the WebIsALOD knowledge graph to determine the similarity of two concepts. [Portisch and Paulheim, 2018] The approach has later been extended to DBpedia, WordNet, Wikidata, Wiktionary, and BabelNet in [Portisch et al., 2021]. A similar approach is pursued by the DESKMatcher system, which uses domain specific embeddings from the business domain, e.g., the FIBO ontology [Monych et al., 2020].
  • AnyGraphMatcher is another ontology matching system which leverages RDF2vec embeddings trained on the two input ontologies to match [Lütke, 2019].
  • [Kardos and Farkos, 2022] use RDF2vec for knowledge graph matching, exploiting a linear transformation learned between the embedding spaces of the source and the target knowledge graph. [Happi et al., 2024] propose modeling the entity matching problem as a binary classification problem, concatenating two entity embeddings and predicting match/non-match as a target. Similarly, Azmy et al. use RDF2vec for entity matching across knowledge graphs, and show a large-scale study for matching DBpedia and Wikidata [Azmy et al., 2019]. A similar approach is introduced by Aghaei and Fensel, who combine RDF2vec embeddings with clustering and BERT sentence embeddings to identify related entities in two knowledge graphs [Aghaei and Fensel, 2021]. The combination of textual embeddings with BERT and RDF2vec embeddings is also discussed for ontology matching by [Mijalcheva et al., 2022].
  • While relying on RDF2vec solely for matching might not yield optimal results, [Soeiro, 2024] shows that refining an alignment based on RDF2vec similarity can improve the results of an already found alignment.
  • DELV is an entity matching approach for matching multiple knowledge graphs built on top of RDF2vec. It first embeds a central knowledge graph using RDF2vec, and then performs an RDF2vec embedding of satellite knowledge graphs with a slightly modified word2vec loss function, taking the minimization of the distance of already matched anchors into account. [Ruppen, 2018]
  • In a showcase for the MELT ontology matching framework, Hertling et al. show that by learning a non-linear mapping between RDF2vec embeddings of different ontologies, ontology matching can be performed at least for structurally similar ontologies [Hertling et al., 2020]. [Portisch et al., 2022] show that this can also be achieved by rotation of embedding spaces.
  • [Cvetkov et al., 2022] use knowledge graph embeddings to perform table augmentation. They represent a set of tables as a knowledge graph and perform embeddings on top; RDF2vec is one of the methods tested for this task.
  • In medical data analysis, trial datasets often consist of relatively few data points. [Sousa and Paulheim, 2024] demonstrate that different trial datasets, which are incompatible because they record different gene expressions, can be combined by using embeddings of gene expressions, thereby improving the prediction of diabetes.

Applications in NLP

In natural language processing, knowledge graph embeddings are particularly handy in setups that already exploit knowledge graphs, for example, for linking entities in text to a knowledge graph using named entity linking and named entity disambiguation. Applications of RDF2vec in the NLP field include:
  • TREC CAR is a benchmark for complex answer retrieval. The authors use pre-trained RDF2vec embeddings as one means to represent queries and answers, and for matching them onto each other. [Nanni et al., 2017a]
  • Inan and Dikenelli demonstrate the usage of RDF2vec embeddings in named entity disambiguation in the entity disambiguation frameworks DoSeR and AGDISTIS. [Inan and Dikenelli, 2017]
  • In a later work, Inan and Dikenelli propose the use of RDF2vec embeddings together with a BiLSTM and a CRF layer for entity disambiguation. [Inan and Dikenelli, 2018]
  • Wang et al. have used RDF2vec embeddings for analyzing entity co-occurrence in tweets [Wang et al., 2017].
  • [Benitez-Andrades et al., 2022] consider the case of tweet classification, and show that linking entities to Wikidata and using RDF2vec embeddings for those entities leads to better classification results than pure text-based approaches based on different BERT variants.
  • Nanni et al. showcase the use of RDF2vec embeddings for entity aspect linking in [Nanni et al., 2018].
  • Nizzoli et al. use RDF2vec, among other features, to perform named entity linking of geographic entities, in particular for scoring candidates. [Nizzoli et al., 2020]
  • KGA-CGM is a system for describing images with captions. It uses RDF2vec embeddings for handling out-of-training entities [Mogadala et al., 2018].
  • Türker discusses the use of RDF2vec for text categorization by embedding both texts and categories [Türker, 2019].
  • Vakulenko demonstrates the use of RDF2vec in dialogue systems [Vakulenko, 2019].
  • G-Rex is a tool for relation extraction from text which leverages RDF2vec entity embeddings [Ristoski et al., 2020].
  • El Vaigh et al. show that using cosine similarity in the RDF2vec space creates a strong baseline for collective entity linking [El Vaigh et al., 2020]. This is particularly remarkable since that metric measures similarity, not relatedness, which is actually needed for the task at hand.
  • Yamada et al. also use RDF2vec for measuring entity relatedness, and contrast the results of RDF2vec trained on DBpedia to their model Wikipedia2vec. The results are close, with Wikipedia2vec yielding slightly better results, but also based on a model which is significantly larger than RDF2vec. [Yamada et al., 2020]
  • FinMatcher is a tool for named entity classification in the financial domain, developed for the FinSim-2 shared task. It uses pre-trained RDF2vec embeddings on WebIsALOD [Portisch et al., 2021].
  • [Eingleitner et al., 2021] use RDF2vec embeddings to provide semantic tags for news articles.
  • LamAPI is a service for entity retrieval as one step in the entity linking process. [Avogadro et al., 2022] use RDF2vec in this service to enhance the set of types for seed entities, and they show that an expansion based on RDF2vec helps in particular with entities with no or just one type assigned.
  • [Bagherzadeh and Bergler, 2022] combine pre-trained RDF2vec embeddings on various general purpose knowledge graphs (DBpedia, WordNet, and ConceptNet) with embeddings on biomedical knowledge graphs and BERT text embeddings on a variety of NLP tasks in the biomedical domain, including text classification and relation extraction. They show that a combination of BERT and knowledge graph embeddings outperforms a pure BERT based approach. Moreover, the paper interestingly demonstrates that embeddings on different knowledge graphs created with different embedding approaches can be combined.
  • [Chen, 2023] proposes the use of RDF2vec embeddings on a general purpose knowledge graph (here: YAGO2) as a signal for entity relatedness in text summarization.
  • [Setty, 2023] uses RDF2vec embeddings for clustering types in knowledge graphs as a preparatory step for answer type prediction in question answering.

Information Retrieval

In information retrieval, similarity and relatedness of entities can be utilized to retrieve and/or rank results for queries for a given entity. Examples for the use of RDF2vec in the information retrieval field include:
  • Nanni et al. describe a system for harvesting event collections from Wikipedia, where RDF2vec is used internally for entity ranking. [Nanni et al., 2017b]
  • Ad Hoc Table Retrieval using Semantic Similarity describes the use of pre-trained RDF2vec embeddings for retrieving Wikipedia tables. [Zhang and Balog, 2018] In a later extension, they distinguish two kinds of retrieval tasks (using either keywords or tables as queries), and show that entity embeddings with RDF2vec can be used for both scenarios. [Zhang and Balog, 2021] Table annotation with RDF2vec is also discussed in [Cutrona et al., 2021], [Shigarov et al., 2021], [Dorodnykh and Yurin, 2023], [Avogadro, 2024], and [Leventidis et al., 2024].
  • Cyber-all-intel is an application in the computer security domain. It uses RDF2vec vectors for retrieving information on security alerts [Mittal et al., 2019]. A similar approach is shown in [Saint-Hilaire et al., 2024], where two ontologies on cybersecurity threats and countermeasures are combined using RDF2vec for identifying countermeasures for threats by combining SPARQL queries and similarity search in the vector space.
  • The COVID-19 literature knowledge graph is a large citation network of COVID-19 related scientific publications, derived from the CORD-19 dataset. In [Steenwinckel et al., 2020], the authors exploit RDF2vec embeddings on that graph for facilitating the retrieval of related articles, as well as for clustering the large body of literature.
  • In the context of the Data Set Knowledge Graph, the retrieval of similar datasets has been discussed as a use case for RDF2vec. [Färber and Lamprecht, 2021]
  • Kim et al. discuss the use of RDF2vec on top of knowledge graphs created using open information extraction from text, in particular for retrieving similar entities to support situation awareness in combat situations. [Kim et al., 2021]
  • [Loesch et al., 2022] discuss the use of RDF2vec for retrieving substitutes for food ingredients in a food knowledge graph. They show that RDF2vec embeddings outperform TransE and ComplEx on this task.
  • eBay uses RDF2vec embeddings on product graphs for determining product similarity [Ristoski et al., 2023], as well as for retrieving products with similar colors referred to by different names (e.g., "graphite" for "grey"). [Liang et al., 2022]
  • Nordsieck et al. use RDF2vec to retrieve similar processes [Nordsieck et al., 2022] and similar quality characteristics [Nordsieck et al., 2023] from a knowledge graph encoding procedural knowledge in the industrial manufacturing domain.
  • [Schwabe and Acosta, 2023] combine RDF2vec embeddings with a graph neural network approach to estimate the cardinality of queries on a knowledge graph.
  • [Ekaputra et al., 2023] use RDF2vec on a knowledge graph of scientific papers and systems to identify related systems, datasets, or algorithms.
  • [Farzana et al., 2023] discuss the use case of query rewriting for product retrieval, using RDF2vec embeddings on a product knowledge graph, among other building blocks.
  • [Eschauzier et al., 2023] use RDF2vec embeddings to represent predicates in learning to optimize join orderings for SPARQL query execution.
  • [Luo et al., 2023] use RDF2vec embeddings of Wikidata for reranking search results in data set search.
  • Web API composition deals with the complex task of finding a set of APIs that fulfill a goal. In order to combine them, one needs to find matching APIs. [Boustil and Tabel, 2023] use RDF2vec embeddings of different knowledge graphs in order to find APIs with synonymous parameter names.
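
Most of the retrieval scenarios above boil down to ranking candidate entities by their similarity to a query entity in the embedding space. The following is a minimal sketch of that idea, not the implementation of any of the cited systems. The entity names, the three-dimensional toy vectors, and the helper names cosine and rank_by_similarity are all invented for illustration; real RDF2vec vectors typically have a few hundred dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, embeddings, top_k=3):
    """Return the top_k (entity, score) pairs most similar to `query`, best first."""
    query_vec = embeddings[query]
    scored = [
        (entity, cosine(query_vec, vec))
        for entity, vec in embeddings.items()
        if entity != query  # do not return the query entity itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy vectors; in practice these would be loaded from a trained model.
toy_embeddings = {
    "Hamburg": [0.9, 0.1, 0.0],
    "Berlin": [0.8, 0.2, 0.1],
    "Germany": [0.5, 0.5, 0.1],
    "Angela_Merkel": [0.0, 0.2, 0.9],
}

ranking = rank_by_similarity("Hamburg", toy_embeddings, top_k=2)
for entity, score in ranking:
    print(f"{entity}\t{score:.3f}")
```

For larger graphs, the exhaustive scan in rank_by_similarity would usually be replaced by an approximate nearest neighbor index; the ranking principle stays the same.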

Predictive Modeling

Predictive modeling was the original use case for which RDF2vec was developed. Here, external variables (which might be continuous or categorical) are predicted for a set of entities. By linking these entities to a knowledge graph, entity embeddings have been shown to be suitable representations for the downstream predictive modeling tools. Examples in this field include:
  • Hascoet et al. show how to use RDF2vec for image classification, especially for classes of images for which no training data is available, i.e., zero-shot learning. [Hascoet et al., 2017]
  • evoKGsim* combines similarity metrics and genetic programming for predicting out-of-KG relations. The framework implements RDF2vec as one source of similarity metrics. [Sousa et al., 2021]
  • Biswas et al. discuss the use of RDF2vec as a signal for predicting infobox types in Wikipedia articles [Biswas et al., 2018].
  • Egami et al. show the use case of geospatial data analytics in urban spaces by constructing a geospatial knowledge graph and computing RDF2vec embeddings thereon [Egami et al., 2018]. Another example of predictive modeling using geospatial knowledge graphs is given by [Böckling et al., 2023], where wildfires are predicted using a geospatial knowledge graph integrating various sources, and computing RDF2vec embeddings thereon.
  • Hees discusses the use of pre-trained RDF2vec models for predicting human associations of terms [Hees, 2018].
  • The utilization of RDF2vec for content-based recommender systems is discussed in [Saeed and Prasanna, 2018], [Ristoski et al., 2019], [Voit and Paulheim, 2021], [Hubert, 2023], and [Alhaj and Qawasmeh, 2024]. [Palumbo et al., 2019] report that RDF2vec performs better in terms of recommending novel items than other competitors. [Nguyen, 2023] uses RDF2vec for recommending data items and visualizations for creating dashboards. The work is remarkable insofar as different embedding methods (RDF2vec and TransH) are combined for the recommendation. A similar approach is taken by [Moens et al., 2024], who show that RDF2vec outperforms standard recommendation techniques like matrix factorization, collaborative filtering, and personalized PageRank.
  • Jurgovsky demonstrates the use of RDF2vec for data augmentation on the task of credit card fraud detection [Jurgovsky, 2019].
  • Hoppe et al. demonstrate the use of RDF2vec embeddings on DBpedia for improving the classification of scientific articles [Hoppe et al., 2021]. The approach was later also applied to classifying Wikipedia abstracts [Hoppe, 2022]. In particular, the authors suggest representing texts as sequences of entities, which are then processed by a BiLSTM.
  • [Nunes et al., 2021] show how graph embeddings on biomedical ontologies can be utilized for predicting drug-gene interactions. They train classifiers such as random forests over the concatenated embedding vectors of the drugs and genes. In a follow-up work, they explore different mechanisms of combination beyond concatenation. [Nunes et al., 2023]
  • [Sousa et al., 2021] use embeddings on the Gene ontology for various predictive modeling tasks in the biomedical domain, including the prediction of proteins and the interaction of diseases and genes, as well as the analysis of protein-protein interactions [Sousa et al., 2024]. In later work [Nunes et al., 2023], they show how using aggregates of embeddings of ancestor nodes can help produce explanations for the embedding-based predictions, and show how pre-trained embeddings of the Gene ontology can be leveraged in Graph Neural Networks [Balbi et al., 2024].
  • Wang et al. use embeddings, including RDF2vec, to assess the similarity of proteins in the Gene Ontology [Wang et al., 2022].
  • Ramezani et al. represent essays by knowledge graphs, and use embeddings of the concepts in those graphs to predict the author's personality in the Big Five model from the essay. [Ramezani et al., 2022]
  • [Carvalho et al., 2022] use RDF2vec embeddings on an ontology-enriched variant of the MIMIC III dataset, a database of hospital patient data, to predict patient readmission to intensive care units.
  • [Vliestra et al., 2022] apply RDF2vec on a biomedical knowledge graph to identify genetic markers associated with diseases. They show that RDF2vec outperforms not only other graph embedding methods, but also state of the art reference methods in the field.
  • [Pellegrini, 2021] uses RDF2vec embeddings for classifying nodes in different knowledge graphs, mostly for predicting the gender of humans.
  • [Chiatta and Dagi, 2022] show how RDF2vec based embeddings can be used as an additional signal in predicting artwork subjects. They combine image, text, and knowledge graph embeddings and show that those combinations often outperform purely visual classification.
  • [Lazzari, 2022] uses RDF2vec to classify chords in music, which are arranged in a knowledge graph of music chords. The work shows that RDF2vec on that graph outperforms tailored models like chord2vec, intervals2vec, and pitchclass2vec.
  • [Van der Weerdt et al., 2023] discuss the use of RDF2vec for node classification in IoT settings. Since IoT knowledge graphs contain lots of numerical measurements, they also demonstrate an effective way of preprocessing and enriching the graph before embedding.
  • [Tailhardat et al., 2023] use RDF2vec and random forests to classify incidents in ICT knowledge graphs. The classification is similar to a node classification task.
  • Ugai et al. build a knowledge graph of daily living activities and propose the use of RDF2vec for detecting hazardous situations. [Ugai et al., 2024]
  • [Katili et al., 2024] integrate various data sources about insects and plants into a knowledge graph, and use RDF2vec embeddings on that graph to predict the transmission of plant viruses by insects.
  • [Llugiqi et al., 2024a] enrich a dataset for predicting heart diseases with different knowledge graphs and show that a combination of RDF2vec with features existing in the dataset can improve the predictive performance. In later work, they extend the approach to a second use case for predicting kidney diseases. [Llugiqi et al., 2024b]
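
Several of the works above predict a label for a pair of entities by concatenating the two embedding vectors and feeding the result to a standard classifier. The following is a minimal sketch of that feature construction under stated assumptions: the drug and gene names, the two-dimensional toy vectors, and the labels are invented, and a simple 1-nearest-neighbor rule stands in for the random forests mentioned above so that the example needs no external libraries.

```python
import math


def pair_features(vec_a, vec_b):
    """Represent an entity pair by concatenating the two embedding vectors."""
    return list(vec_a) + list(vec_b)


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def predict_1nn(train_x, train_y, x):
    """Label x with the label of its closest training example (1-nearest neighbor)."""
    best_index = min(range(len(train_x)), key=lambda i: euclidean(train_x[i], x))
    return train_y[best_index]


# Hypothetical toy embeddings (real vectors would come from a trained RDF2vec model).
drug_vecs = {"drug_A": [1.0, 0.0], "drug_B": [0.0, 1.0]}
gene_vecs = {"gene_X": [1.0, 0.1], "gene_Y": [0.1, 1.0]}

# Invented training pairs: 1 = interaction, 0 = no interaction.
train_pairs = [
    ("drug_A", "gene_X", 1),
    ("drug_A", "gene_Y", 0),
    ("drug_B", "gene_Y", 1),
    ("drug_B", "gene_X", 0),
]
train_x = [pair_features(drug_vecs[d], gene_vecs[g]) for d, g, _ in train_pairs]
train_y = [label for _, _, label in train_pairs]

# An unseen drug whose (invented) vector lies close to drug_A.
new_drug_vec = [0.9, 0.1]
prediction = predict_1nn(train_x, train_y, pair_features(new_drug_vec, gene_vecs["gene_X"]))
print(prediction)
```

Concatenation keeps the two entities in separate parts of the feature vector, so the downstream classifier can weight them differently; element-wise combinations such as products or differences are alternatives to concatenation.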

Other Applications

No matter how sophisticated your categorization schema is, you always end up with a category called "other" or "misc.". Here are examples for applications of RDF2vec in that category:
  • REMES is an entity summarization approach which uses RDF2vec to select a suitable subset of statements for describing an entity. [Gunaratna et al., 2017] Other approaches proposing the usage of RDF2vec for entity summarization are discussed in [Li et al., 2020] and [Horlyk, 2023].
  • Similar to that, Shi et al. propose an approach for extracting semantically coherent subgraphs from a knowledge graph, which uses RDF2vec as a measure for semantic distance to guarantee semantic coherence. [Shi et al., 2021] A similar approach is discussed in [Wang and Cheng, 2024].
  • Jurisch and Igler demonstrate the utilization of RDF2vec embeddings for detecting changes in ontologies [Jurisch and Igler, 2018].
  • Niazmand et al. use RDF2vec embeddings for identifying similar predicates for summarizing knowledge graphs [Niazmand et al., 2022]. That approach for identifying similar predicates is also discussed by the authors for improving query processing over Wikidata [Niazmand et al., 2023], knowledge integration [Niazmand and Vidal, 2024a], and knowledge graph completion [Niazmand and Vidal, 2024b].
  • Sultana et al. combine RDF2vec with Graph Convolutional Neural Networks to achieve knowledge graph compression. Interestingly, their results indicate that an encoder solely built on RDF2vec (without the convolutional layer) can already achieve state of the art results in knowledge graph compression. [Sultana et al., 2024]
  • Similarly, Trouli et al. discuss the use of RDF2vec for knowledge graph compression. They utilize RDF2vec to learn a function for predicting node and edge importance used for summarization. [Trouli et al., 2024]
  • Abe et al. propose the use of RDF2vec vectors to identify devices in the physical neighborhood of a user in an IoT scenario [Abe et al., 2022].
  • Wang et al. use RDF2vec to identify semantically similar statements (obtaining a statement vector by concatenating subject, predicate, and object vectors) for creating semantically coherent subgraphs of inconsistent ontologies. [Wang et al., 2023]
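
The statement vectors mentioned in the last item can be sketched as follows. Since RDF2vec yields vectors for entities and predicates alike, a triple can be represented by concatenating its three vectors, and two statements can then be compared like any other pair of vectors. This is an illustrative sketch only, not the cited authors' code: the two-dimensional toy vectors and the helper names statement_vector and cosine are assumptions made for the example.

```python
import math


def statement_vector(subject, predicate, obj, embeddings):
    """Concatenate the subject, predicate, and object vectors of a triple."""
    return embeddings[subject] + embeddings[predicate] + embeddings[obj]


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Hypothetical toy vectors for entities and predicates.
toy = {
    "Hamburg": [0.9, 0.1],
    "Berlin": [0.8, 0.2],
    "Germany": [0.5, 0.5],
    "Angela_Merkel": [0.1, 0.9],
    "country": [0.7, 0.3],
    "leader": [0.2, 0.8],
}

s1 = statement_vector("Hamburg", "country", "Germany", toy)
s2 = statement_vector("Berlin", "country", "Germany", toy)
s3 = statement_vector("Germany", "leader", "Angela_Merkel", toy)

# With these toy vectors, the two "country" statements are closer to each other
# than either is to the "leader" statement.
print(cosine(s1, s2) > cosine(s1, s3))
```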


These are the core publications of RDF2vec:

  1. Heiko Paulheim, Petar Ristoski, Jan Portisch: Embedding Knowledge Graphs with RDF2vec. Springer, 2023.
  2. Jan Portisch, Heiko Paulheim: The RDF2vec Family of Knowledge Graph Embedding Methods. Semantic Web Journal, 2023.
  3. Petar Ristoski, Jessica Rosati, Tommaso Di Noia, Renato De Leone, Heiko Paulheim: RDF2Vec: RDF Graph Embeddings and Their Applications. Semantic Web Journal 10(4), 2019.
  4. Petar Ristoski, Heiko Paulheim: RDF2Vec: RDF Graph Embeddings for Data Mining. International Semantic Web Conference, 2016.

