Simple Science

Cutting edge science explained simply


Innovative Techniques for Knowledge Graph Enhancement

New methods improve knowledge graph embeddings using literal information.




Knowledge graphs are tools used to share information across different fields. They consist of entities, which can be anything like people or places, and the relations between them. Many knowledge graphs also include literal information, such as descriptions, numbers, or images attached to these entities. For instance, a knowledge graph might contain a description of the city Mannheim along with its population and an image of a famous landmark.

Most methods for creating numeric representations of these entities focus only on the relationships between them. They gather data from these relationships but often ignore valuable information found in the literal descriptions or images. Including this additional information could help create more accurate representations of the entities.

Though there are some methods that consider literal information, they often have limitations. They tend to focus on only one type of literal, such as text or numbers, and cannot work with various embedding methods effectively.

This paper introduces a set of universal operators for preprocessing knowledge graphs that contain literal information, making them suitable for various types of data. These operators can handle text, numbers, dates, and images, transforming the original knowledge graph into one that can work with any embedding method. Tests have shown that this preprocessing can improve the results of knowledge graph embeddings.

Background

Knowledge graphs are popular for representing information from diverse areas. They contain a mix of entities and their relationships, along with literal information like text descriptions and numerical data. For example, a knowledge graph could represent the city Mannheim by its name, population, and an image of a historical site.

Many existing methods for embedding knowledge graphs focus solely on the relationships among entities. In the example of Mannheim, most methods would use only its connections to other entities, ignoring the population figure, the description, and the image. This oversight means missing details that could improve the quality of the resulting representations.

While some new approaches have started to consider literal information, they often focus on only one type and might not be adaptable to different embedding methods.

Our Approach

This paper presents a set of preprocessing operators that can effectively transform knowledge graphs containing various types of literal information into graphs that only include relationships. By doing this, the modified graphs can then be used with any embedding method. We examined different preprocessing techniques for text, numerical, and image literals that can work together with any embedding model.

Related Work

Many common benchmarks for knowledge graph embeddings do not include literal information. Therefore, this topic has received less attention compared to methods focusing solely on relational knowledge graphs.

A 2021 survey discusses various approaches that mainly build upon established knowledge graph embedding models. Most of these models are adaptations of classic models such as TransE. While these adaptations tweak the underlying model's loss function, they remain tied to their specific models. An exception is a method called LiteralE, which has been applied across multiple embedding models. However, most of these methods still focus on just one type of literal.

A more recent survey confirms these findings. In contrast, the work presented in this paper proposes a method to preprocess knowledge graphs that include literal data. This approach aims to create a graph that only contains relationships while still representing the information found in the literals.

Some implementations, like pyRDF2vec, can extract literals directly as features. This results in a combination of an embedding and additional literal data, but it lacks a uniform representation. Instead, our aim is to modify the graph first, transforming the literal data into relational statements.

Although preprocessing methods are still uncommon, some researchers have used strategies such as binning numerical values. We incorporate some of these ideas in our work, along with additional strategies for preprocessing literals.

Preprocessing Operators

We focus on augmenting the knowledge graph rather than changing the embedding method itself. This approach involves adding extra nodes and edges that represent some of the information contained in the literals. In our framework, the embedding and augmentation steps are separate from the classification and evaluation phases.
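As a minimal sketch of this augmentation idea (the triple format and node names here are illustrative, not the authors' actual tooling), each literal-valued statement can be rewritten into a purely relational one by introducing a node that stands in for the value:

```python
# Sketch: rewrite literal-valued triples into relational ones by introducing
# a fresh node per distinct literal value (hypothetical triple format).
def augment(triples):
    relational = []
    for subj, pred, obj in triples:
        if isinstance(obj, (int, float)):      # a numeric literal
            node = f"{pred}_value_{obj}"       # new entity standing in for the value
            relational.append((subj, pred, node))
        else:                                  # already an entity reference
            relational.append((subj, pred, obj))
    return relational

triples = [("Mannheim", "population", 309_000),
           ("Mannheim", "locatedIn", "Germany")]
print(augment(triples))
```

One node per distinct value is the naive variant; the strategies in the following sections refine it so that similar values end up sharing nodes.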

For our experiments, we considered problems related to node classification, but other tasks, like link prediction or clustering, could also benefit from our preprocessing strategies.

Handling Numeric Literals

Creating one unique entity for each numeric literal can pose challenges when it comes to understanding the meaning of those values. For example, it becomes difficult to differentiate between two very similar numerical values and two values that are quite different. To overcome this, we use several techniques for representing numeric literals, including a method known as binning.

The basis of our binning strategy is to group numeric values into bins based on their range. Another method allows users to specify a percentage of unique values when creating the bins. Binning helps to summarize the information and improve the overall understanding of the data.

Additionally, we detect and remove outliers before applying the binning strategy. If a value is significantly different from the rest, it could negatively impact how we categorize the other values.
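A compact sketch of both steps (the bin count, the equal-width scheme, and the IQR outlier rule are illustrative choices, not necessarily the paper's exact variants):

```python
import numpy as np

# Sketch: drop outliers via the interquartile-range rule, then assign the
# remaining values to equal-width bin entities such as "bin_0", "bin_1", ...
def bin_values(values, n_bins=3):
    v = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    keep = (v >= q1 - 1.5 * iqr) & (v <= q3 + 1.5 * iqr)   # remove outliers first
    lo, hi = v[keep].min(), v[keep].max()
    edges = np.linspace(lo, hi, n_bins + 1)
    # map each kept value to its bin index (interior edges only)
    bins = np.clip(np.digitize(v[keep], edges[1:-1], right=True), 0, n_bins - 1)
    return {val: f"bin_{b}" for val, b in zip(v[keep], bins)}

heights = [1.6, 1.7, 1.8, 1.9, 170.0]   # 170.0 is an obvious outlier
print(bin_values(heights))
```

Without the outlier step, the single value 170.0 would stretch the bin edges so far that all the human heights collapse into one bin.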

We also account for different types of objects that may share the same property. For instance, a property like height might apply to both people and buildings, which have different height ranges. Therefore, we consider the different sets of relations tied to each type of object when creating our bins.

Handling Temporal Literals

For dealing with dates, we adopt a different tactic. We convert the date into a timestamp and then apply the binning approach. However, this method does not fully capture all the nuances of a date. For instance, two people might share the same birthday but be from different years. To address this, we extract additional features from the date, allowing us to build a more detailed representation.

The new entities created from these features can be interconnected, helping to show relationships between different date aspects, such as days, months, and quarters.
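A sketch of this feature extraction (the node and relation names, such as `partOfMonth`, are hypothetical labels for illustration):

```python
from datetime import date

# Sketch: derive extra nodes from a date literal and link them hierarchically,
# so two people born on the same day of different years still share nodes.
def date_nodes(d: date):
    quarter = (d.month - 1) // 3 + 1
    nodes = {
        "year": f"year_{d.year}",
        "quarter": f"quarter_{quarter}",
        "month": f"month_{d.month}",
        "day": f"day_{d.day}",
    }
    # interconnections between the derived date aspects
    links = [(nodes["day"], "partOfMonth", nodes["month"]),
             (nodes["month"], "partOfQuarter", nodes["quarter"])]
    return nodes, links

nodes, links = date_nodes(date(1990, 7, 14))
print(nodes)
```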

Handling Text Literals

Textual information often appears in knowledge graphs but is challenging to represent effectively. To address this, we employ topic modeling, which identifies the main topics within text literals. Each topic is then represented as a node in the graph, allowing for a better understanding of the content.

For this process, we analyze all values of a text literal through a topic modeling algorithm. We connect each entity to the topics that exceed a set threshold, helping to illustrate the relationship between the text and the topics identified.
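The thresholding step can be sketched as follows (toy corpus, two topics, and a 0.4 threshold are illustrative parameters; the paper's topic modeling setup may differ):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Sketch: run LDA over all values of a text literal, then connect each entity
# to every topic node whose weight exceeds a threshold.
descriptions = {
    "Mannheim": "city river industry university city",
    "Heidelberg": "city castle river tourism university",
    "BASF": "chemical industry company plant",
}
vec = CountVectorizer()
X = vec.fit_transform(descriptions.values())
lda = LatentDirichletAllocation(n_components=2, random_state=0)
weights = lda.fit_transform(X)          # each row is an entity's topic mixture

threshold = 0.4
edges = [(entity, f"topic_{t}")
         for entity, row in zip(descriptions, weights)
         for t, w in enumerate(row) if w > threshold]
print(edges)
```

Because each row of topic weights sums to one, every entity is guaranteed at least one topic edge under this threshold.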

Handling Image Literals

Images in knowledge graphs can also be tough to represent well. We utilize a pre-trained neural network that can classify images based on their content. By predicting tags for each image, we transform the information into understandable nodes that describe what the images are showing.

In our experiments, we use a recognized image classification model to categorize images. Each image can then be represented by the most likely class, providing a clearer description in the knowledge graph.
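In sketch form, the wiring looks like this; the stub `classify` function stands in for a real pretrained classifier (e.g. an ImageNet model), and the class scores and the `depicts` relation are invented for illustration:

```python
# Sketch: represent each image by its most likely predicted class. A real
# pipeline would call a pretrained image classifier here; this stub returns
# fixed scores so the graph-building step is visible.
def classify(image_path):
    fake_scores = {"water_tower": 0.83, "fountain": 0.11, "castle": 0.06}
    return fake_scores

def image_node(entity, image_path):
    scores = classify(image_path)
    best = max(scores, key=scores.get)      # most likely class becomes a node
    return (entity, "depicts", f"class_{best}")

print(image_node("Mannheim", "mannheim_landmark.jpg"))
```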

Size Changes in Knowledge Graphs

When we apply these preprocessing techniques, the size of the resulting knowledge graph can change. We examine the number of entities and statements after transforming the data. The findings show that while the number of statements remains similar to the original knowledge graph, the number of entities may vary.

Experiments and Results

For our experiments, we tested all preprocessing approaches on a benchmark dataset, which contains various datasets. We utilized different embedding methods and classifiers to evaluate the outcomes.

We trained the embedding methods based on set parameters, which allowed us to observe how effectively our preprocessing strategies improved the results. The experiments aimed to identify which strategies provided the best overall performance.

Our results indicate that different strategies performed well for each type of literal. In many cases, the preprocessing techniques allowed us to enhance the results when compared to the baselines.

Interestingly, there was no evident connection between the number of literals of a specific type and the improvements achieved through including that information. This suggests that the effectiveness of the literal data depends on its quality rather than quantity.

We also noted that some baseline approaches were strong contenders, indicating that the simple presence of a literal can be a useful signal, even without considering its specific value.

Conclusion and Future Work

In summary, we have demonstrated that using preprocessing techniques to represent literal information in knowledge graphs can improve embedding outcomes significantly. Our approach allows flexibility, as the set of preprocessing operators can be expanded or refined in the future.

Moving forward, we can continue improving our textual and image representations by integrating more advanced models, or by adding further processing stages for text and images to improve overall quality.

Additionally, our methods not only create new entities but also provide scores for the representations. This means that we can consider these scores as weights when embedding the data. Overall, our findings point towards exciting opportunities for further development and research in this field.
