Simple Science

Cutting-edge science explained simply

# Computer Science · Computation and Language

Improving Geolocation Tools for Humanitarian Efforts

Advancements in geolocation tools enhance humanitarian aid accuracy and reduce biases.

― 7 min read


Boosting Geolocation in Humanitarian Work: Enhancing tools to support effective humanitarian aid globally.

Geolocation is the process of determining the physical location of a person or object. In humanitarian work, knowing where help is needed is vital. This includes identifying vulnerable groups, understanding ongoing issues, and knowing where resources are available. Humanitarian organizations create many documents and reports, resulting in a huge amount of text that needs to be analyzed.

Recent advancements in Natural Language Processing (NLP) technology can aid in extracting key information from these reports. However, the performance of current information extraction tools is not well understood, nor are the biases that may exist in them.

This work aims to create better resources for processing humanitarian texts. It focuses on improving tools that identify specific location names in texts, known as Named Entity Recognition (NER) tools. The two popular NER tools used are Spacy and roBERTa. We introduce a method called FeatureRank that connects identified locations to a comprehensive database of geographical names known as GeoNames.
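To make the geotagging step concrete, below is a minimal sketch of how an off-the-shelf NER pipeline such as Spacy flags candidate location mentions in a sentence. The model name `en_core_web_sm` and the example sentence are illustrative assumptions; the study fine-tunes its own models on humanitarian data.

```python
# Minimal sketch: flagging candidate location mentions with an off-the-shelf
# Spacy pipeline. The model name is illustrative; the study fine-tunes its own
# models on annotated humanitarian reports.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

text = "Flooding displaced thousands of families near Benghazi, Libya."
doc = nlp(text)

# GPE (countries, cities, states) and LOC (other locations) are Spacy's location labels.
locations = [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in {"GPE", "LOC"}]
print(locations)  # e.g. [('Benghazi', 'GPE'), ('Libya', 'GPE')]
```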

We found that training these tools with data from humanitarian documents not only improves their accuracy but also helps reduce the bias favoring locations in Western countries. Our study shows that we need more resources from non-Western documents to ensure that these tools work well in various settings.

Understanding the Problem

Humanitarian efforts generate vast amounts of data and reports from a wide range of organizations working around the world. For example, the International Federation of Red Cross and Red Crescent Societies operates in 192 countries, with nearly 14 million volunteers.

To manage the information produced, tools like the Data Entry and Exploration Platform (DEEP) have been created. This platform helps organizations compile and organize their documentation.

In a world overflowing with information, automated information extraction can make it easier to find useful insights. Recent progress in Deep Learning and NLP allows for identifying significant details in texts and categorizing them, which can help in sharing knowledge effectively.

Geolocation is an important aspect of humanitarian work. It spans wide areas from entire countries to small locations like villages or refugee camps. Accurate location information is crucial, especially in light of the Sustainable Development Goals, which seek to ensure that no one is overlooked when it comes to support.

Unfortunately, many data sources used for training models show a bias toward Western locations. Many location databases favor the US and other Western nations, and alternative sources such as Twitter and Wikipedia cover countries in the Global South less thoroughly.

To address this issue, we aim to create tools that accurately process diverse humanitarian data, ensuring that all countries are treated fairly in information gathering.

Creating a Geolocation Extraction Tool

In this study, we collaborate with humanitarian partners to produce a specialized geolocation extraction tool aimed at processing documents from humanitarian projects. This tool performs two key tasks:

  1. Geotagging - Identifying text segments that refer to geographical locations.
  2. Geocoding - Associating these identified locations with exact geographical coordinates.

We contribute two datasets for these tasks, with one focusing on geotagging and the other on geocoding. Humanitarian reports are annotated by specialists to identify potential location names, which are then linked to entries in GeoNames, a vast geographical database.

Using these annotated datasets, we improve the performance of existing NER tools, achieving higher accuracy rates on our target datasets. The new geocoding method, FeatureRank, is evaluated against other baseline approaches in the literature.

Related Literature

Named Entity Recognition (NER) identifies important entities in texts, typically focusing on persons, organizations, and locations. Early models relied on traditional machine learning methods; since around 2011, neural networks have made it possible to build more adaptable models.

Recent large pre-trained models like BERT have enhanced the capabilities of NLP systems, allowing text to be represented effectively without requiring vast amounts of task-specific training data.

However, very few studies have specifically addressed geographical NER in humanitarian contexts. Most approaches have focused on general text processing, with limited application to the unique challenges of humanitarian data.

This lack of attention to geographical NER is significant, especially when considering the biases that may emerge from relying solely on Western-focused data.

Data Collection and Annotation

To build our datasets, we use information from the HumSet database, which is part of the DEEP platform. Each document in this database includes relevant excerpts that have been annotated according to humanitarian analysis frameworks. These documents come from various sources, including reports from humanitarian organizations and media articles.

The dataset is multilingual, with the majority being in English, Spanish, and French. The documents include various types of content, from text to images and tables. We use a parser to extract and clean the text while discarding non-textual elements.

We carry out two main annotation tasks: geotagging and geocoding.

Annotation: Geotagging

For geotagging, we selected 500 English-language documents from the HumSet database. This selection aims to include as many different locations as possible while keeping track of the distribution of countries in the dataset.

We use pre-annotations to ease the labeling process. This involves running baseline models to suggest potential locations in the text, which annotators can then review and correct.
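As a rough illustration of this pre-annotation step, the sketch below runs a baseline model over a document and exports suggested location spans for annotators to review. The JSONL layout and field names are assumptions made for illustration; the actual annotation tool and schema used in the study are not described here.

```python
# Hedged sketch: generating pre-annotations with a baseline model so annotators
# only need to review and correct suggested spans. The JSONL layout is
# illustrative, not the study's actual schema.
import json
import spacy

nlp = spacy.load("en_core_web_sm")  # baseline model; the study uses its own setup

def pre_annotate(doc_id: str, text: str) -> dict:
    doc = nlp(text)
    spans = [
        {"start": ent.start_char, "end": ent.end_char, "text": ent.text, "label": "LOCATION"}
        for ent in doc.ents if ent.label_ in {"GPE", "LOC"}
    ]
    return {"id": doc_id, "text": text, "spans": spans}

with open("preannotations.jsonl", "w", encoding="utf-8") as f:
    record = pre_annotate("doc-001", "Aid convoys reached Aleppo in northern Syria.")
    f.write(json.dumps(record) + "\n")
```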

The annotators categorize location terms as either literal (directly referring to a place) or associative (indicating a relationship with a place without directly naming it); for instance, "floods near Benghazi" uses the name literally, while "the Libya talks" uses it associatively.

Annotated Geotagging Dataset

The resulting annotated dataset includes over 11,000 location names extracted from the 500 selected documents.

The most frequently mentioned locations in our dataset include Libya, Syria, and Afghanistan, highlighting areas of ongoing humanitarian concern.

Annotation: Geocoding

The second annotated dataset supports the geocoding task, where identified location names are linked to their geographical coordinates. For this, we use the GeoNames database, which contains millions of geographical entries.

We prepare toponyms for analysis through careful cleaning and matching processes. Our annotation team, led by experts, maps these toponyms to the corresponding entries in GeoNames.
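For readers curious what a GeoNames lookup involves, here is a hedged sketch that retrieves candidate entries for a toponym from the public GeoNames search web service. A registered GeoNames username is required ("demo" is only a placeholder), and the selection of response fields is ours.

```python
# Hedged sketch: retrieving candidate entries for a toponym from the public
# GeoNames search web service. Replace "demo" with a registered username.
import requests

def geonames_candidates(toponym: str, username: str = "demo", max_rows: int = 5) -> list[dict]:
    resp = requests.get(
        "http://api.geonames.org/searchJSON",
        params={"q": toponym, "maxRows": max_rows, "username": username},
        timeout=10,
    )
    resp.raise_for_status()
    return [
        {
            "name": g.get("name"),
            "country": g.get("countryName"),
            "lat": float(g["lat"]),
            "lng": float(g["lng"]),
            "population": int(g.get("population", 0)),
            "feature_code": g.get("fcode"),
        }
        for g in resp.json().get("geonames", [])
    ]

print(geonames_candidates("Benghazi"))
```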

Customizing Geolocation for Humanitarian Texts

Next, we evaluate the geotagging methods and optimize them with our annotated data. We assess the performance of the Spacy and roBERTa NER models using both exact and partial match scoring.
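As a rough illustration of the difference between the two scoring schemes, the sketch below computes span-level F1 under exact matching (identical character offsets) and partial matching (any overlap counts). The study's exact scoring details may differ.

```python
# Hedged sketch: exact vs. partial matching of predicted location spans against
# gold annotations. Exact requires identical character offsets; partial counts
# any overlap as a hit.
def overlaps(a: tuple[int, int], b: tuple[int, int]) -> bool:
    return a[0] < b[1] and b[0] < a[1]

def span_f1(pred: list[tuple[int, int]], gold: list[tuple[int, int]], partial: bool = False) -> float:
    match = (lambda p, g: overlaps(p, g)) if partial else (lambda p, g: p == g)
    tp = sum(any(match(p, g) for g in gold) for p in pred)
    precision = tp / len(pred) if pred else 0.0
    recall = sum(any(match(p, g) for p in pred) for g in gold) / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Example: one exact hit and one merely overlapping hit out of two gold spans.
gold = [(10, 18), (30, 35)]
pred = [(10, 18), (29, 35)]
print(span_f1(pred, gold), span_f1(pred, gold, partial=True))  # 0.5 1.0
```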

We find that training these models with additional humanitarian data significantly enhances their performance. Furthermore, we see the models become less biased as they are tuned.

Our findings indicate that combining the output from both models can lead to even better results, particularly when it comes to finding a higher number of correct matches.

Approaches to Geocoding

We evaluate existing geocoding methods from the literature, which focus on resolving toponyms to specific locations. One method favors unambiguous reference points from the text, while another clusters candidate locations based on proximity.

We then propose a custom feature-based geocoding approach that considers not only geographical distance but also population and geopolitical features. This method, called FeatureRank, evaluates candidates against these criteria and ranks them accordingly.
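To give a flavor of feature-based ranking, the sketch below scores GeoNames candidates by a weighted mix of population, feature type, and whether the candidate's country already appears in the document. The specific features, weights, and example values are illustrative assumptions, not the published FeatureRank configuration.

```python
# Hedged sketch in the spirit of feature-based candidate ranking. The features,
# weights, and example values are illustrative, not the published FeatureRank
# configuration.
import math

def rank_candidates(candidates: list[dict], context_countries: set[str]) -> list[dict]:
    def score(c: dict) -> float:
        s = math.log10(c.get("population", 0) + 1)        # larger places score higher
        if str(c.get("feature_code", "")).startswith("PPL"):
            s += 1.0                                       # prefer populated places
        if c.get("country") in context_countries:
            s += 2.0                                       # prefer countries seen in the document
        return s

    return sorted(candidates, key=score, reverse=True)

# Two hypothetical candidates for the ambiguous toponym "Tripoli".
candidates = [
    {"name": "Tripoli", "country": "Libya", "population": 1_150_000, "feature_code": "PPLC"},
    {"name": "Tripoli", "country": "Lebanon", "population": 230_000, "feature_code": "PPLA"},
]
print(rank_candidates(candidates, context_countries={"Libya"})[0]["country"])  # Libya
```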

During our evaluation, we compare the performance of FeatureRank against baseline methods and observe that our custom method yields superior results.

Application Study

Finally, we apply our tuned toponym extraction and custom geocoding algorithm to a large dataset of humanitarian documents. While we lack ground truth for precise validation, we analyze biases in the locations identified by both the baseline models and our tuned versions.

We note that the baseline models tend to highlight more locations in the US and Europe, reflecting a Western bias. In contrast, our tuned models indicate a more balanced distribution of identified locations across various regions, including areas not covered in the training data.
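One simple way to surface such a bias is to compare the share of resolved locations per region between the baseline and the tuned models, as in the sketch below. The country-to-region mapping and the example outputs are invented purely for illustration.

```python
# Hedged sketch: comparing the regional distribution of resolved locations for
# two model outputs. The mapping and the example lists are invented for
# illustration only.
from collections import Counter

def region_share(countries: list[str], country_to_region: dict[str, str]) -> dict[str, float]:
    counts = Counter(country_to_region.get(c, "Other") for c in countries)
    total = sum(counts.values())
    return {region: round(n / total, 2) for region, n in counts.items()}

country_to_region = {
    "United States": "North America", "France": "Europe",
    "Syria": "Middle East", "Libya": "Africa",
}
baseline_output = ["United States", "United States", "France", "Syria"]
tuned_output = ["Syria", "Libya", "Syria", "France"]

print(region_share(baseline_output, country_to_region))
print(region_share(tuned_output, country_to_region))
```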

Conclusion

Throughout our work, we have shown that training data from the humanitarian sector can enhance the performance of NER tools for geolocation. This not only improves accuracy but also appears to reduce biases favoring Western locations.

Our findings stress the importance of systematic evaluations to detect biases in data extraction tools. As we continue to refine these tools, it is essential to address the needs of vulnerable populations effectively.

More work is warranted to enhance the capabilities of these models and ensure they can adapt to the evolving landscape of humanitarian needs. We hope that the resources and guidelines provided in this study will encourage further advancements in this field.

Original Source

Title: Leave no Place Behind: Improved Geolocation in Humanitarian Documents

Abstract: Geographical location is a crucial element of humanitarian response, outlining vulnerable populations, ongoing events, and available resources. Latest developments in Natural Language Processing may help in extracting vital information from the deluge of reports and documents produced by the humanitarian sector. However, the performance and biases of existing state-of-the-art information extraction tools are unknown. In this work, we develop annotated resources to fine-tune the popular Named Entity Recognition (NER) tools Spacy and roBERTa to perform geotagging of humanitarian texts. We then propose a geocoding method FeatureRank which links the candidate locations to the GeoNames database. We find that not only does the humanitarian-domain data improve the performance of the classifiers (up to F1 = 0.92), but it also alleviates some of the bias of the existing tools, which erroneously favor locations in the Western countries. Thus, we conclude that more resources from non-Western documents are necessary to ensure that off-the-shelf NER systems are suitable for deployment in the humanitarian sector.

Authors: Enrico M. Belliardo, Kyriaki Kalimeri, Yelena Mejova

Last Update: 2023-09-06

Language: English

Source URL: https://arxiv.org/abs/2309.02914

Source PDF: https://arxiv.org/pdf/2309.02914

Licence: https://creativecommons.org/licenses/by/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arXiv for use of its open access interoperability.
