Organizing Job Ads for Better Clarity
A new method for classifying job ads improves understanding of the job market.
Maciej Beręsewicz, Marek Wydmuch, Herman Cherniaiev, Robert Pater
Table of Contents
- The Need for Classification
- What is a Classifier?
- The Magic of Data Sources
- The Hierarchical Structure
- The Role of Language
- The Challenge of Long-tail Distribution
- The Power of Transformers
- Training the Classifier
- Performance Evaluation
- Results and Findings
- The Importance of Open Data
- Conclusion
- Original Source
- Reference Links
Have you ever tried to find a job online? If so, you may have noticed that job ads are all over the place, and not all of them are easy to understand. This paper is all about how to make sense of these job ads by putting them into categories. Imagine trying to find a specific type of pizza among a sea of options. Wouldn’t it be easier if they were neatly organized by toppings and styles? That's what we want to do with job ads!
The Need for Classification
The job market is like a giant puzzle, but sometimes it feels like you’re missing half the pieces. We need to know what kinds of jobs are out there, how many there are, and what skills are in demand. That’s where our classifier comes in. By organizing job ads into categories, we can better understand what’s happening in the job market.
What is a Classifier?
A classifier is like a smart assistant that helps sort things out. Imagine a helpful robot that takes a look at different job ads and then says, “Ah, this one is for a software developer, and this one is for a baker.” Our classifier does just that, but it needs a little guidance to get it right.
The Magic of Data Sources
Now, how do we train this classifier? We feed it data: lots and lots of job ads! We gathered information from various places, including the Central Job Offers Database, an official register of all vacancies submitted to Public Employment Offices, whose staff code each vacancy by occupation. Think of it as a treasure chest filled with job opportunities just waiting to be discovered.
The Hierarchical Structure
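Jobs can be grouped in a hierarchy, much like a family tree. At the top, we have broad categories, like “Healthcare” or “Technology.” Then, below them, we have more specific jobs, like “Nurse” or “Software Engineer.” In the paper, the hierarchy follows the International Standard Classification of Occupations (ISCO), extended with the Polish Classification of Occupations and Specializations (KZiS): the classifier first assigns the broadest 1-digit occupational group and then narrows the assignment down to a 6-digit occupation code. This organization helps our classifier give more precise predictions.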
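To make the family-tree idea concrete, here is a minimal Python sketch of walking down an occupation hierarchy one level at a time. It is not the authors' implementation: the level lengths in `LEVELS`, the codes, and the probabilities in `EXAMPLE_PROBS` are all made up for illustration, and the greedy walk is just one simple way to use conditional probabilities.

```python
from typing import Dict, List, Tuple

# Assumed digit lengths of the hierarchy levels, from the broad 1-digit group
# down to the 6-digit occupation code. The intermediate levels are an assumption.
LEVELS: Tuple[int, ...] = (1, 2, 3, 4, 6)


def code_prefixes(code: str, levels: Tuple[int, ...] = LEVELS) -> List[str]:
    """Return the ancestors of an occupation code, broadest first."""
    return [code[:n] for n in levels if n <= len(code)]


def top_down_predict(cond_probs: Dict[str, Dict[str, float]]) -> Tuple[str, float]:
    """Greedily follow the most probable child from the root ("") to a leaf.

    Returns the final code and its probability, computed as the product of
    the conditional probabilities along the chosen path (chain rule).
    """
    node, prob = "", 1.0
    while node in cond_probs:
        child, p = max(cond_probs[node].items(), key=lambda kv: kv[1])
        node, prob = child, prob * p
    return node, prob


# Made-up conditional probabilities P(child | parent) for a single job ad.
# The codes are placeholders, not real ISCO or KZiS entries.
EXAMPLE_PROBS: Dict[str, Dict[str, float]] = {
    "": {"2": 0.7, "7": 0.3},
    "2": {"25": 0.8, "22": 0.2},
    "25": {"251": 0.9, "252": 0.1},
    "251": {"2514": 0.6, "2512": 0.4},
    "2514": {"251401": 0.75, "251490": 0.25},
}

if __name__ == "__main__":
    print(code_prefixes("251401"))
    print(top_down_predict(EXAMPLE_PROBS))
```

A greedy walk like this is cheap, but it can miss the most probable leaf when an early branch is only slightly ahead, so treat it as a starting point rather than a recipe.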
The Role of Language
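Our classifier is multilingual, which means it can understand job ads in various languages. It’s like having a translator who makes sure everyone understands what’s being said. The paper describes two versions: a bilingual model (Polish and English) and a multilingual model covering 24 languages, both built on data translated with closed and open-source software. In this way, we can include job ads from different countries, making our findings relevant to a wider audience.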
The Challenge of Long-tail Distribution
Here’s a funny thing: in the job world, a handful of occupations account for most of the ads, while a great many others show up only rarely. It’s like a show where the lead actor gets all the applause, but the supporting cast is just happy to be there. This unevenness is called a long-tail distribution, and it can make things tricky for our classifier, because it has very few examples to learn the rare occupations from.
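A quick way to see a long tail is to count how many examples each label has. The Python sketch below does that on a toy list of labels; the labels in `TOY_LABELS` and the helper names are invented for illustration and do not come from the paper's data.

```python
from collections import Counter
from typing import List, Sequence


def head_share(labels: Sequence[str], top_k: int) -> float:
    """Fraction of all examples that belong to the top_k most frequent labels."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    head = sum(n for _, n in counts.most_common(top_k))
    return head / len(labels)


def rare_labels(labels: Sequence[str], min_count: int) -> List[str]:
    """Labels seen fewer than min_count times, sorted alphabetically."""
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n < min_count)


# Toy labels (invented): one very common occupation and several rare ones.
TOY_LABELS = ["driver"] * 6 + ["nurse"] * 2 + ["baker", "luthier"]

if __name__ == "__main__":
    print(head_share(TOY_LABELS, top_k=1))   # share of ads in the most common label
    print(rare_labels(TOY_LABELS, min_count=2))
```

If the head share is high and the list of rare labels is long, plain accuracy can look good even when the classifier ignores the tail, which is why it helps to check rare classes separately.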
The Power of Transformers
To help our classifier become super smart, we use a type of technology called transformers. No, we’re not talking about robots that turn into cars! In machine learning, transformers are neural network models that read a whole text at once and weigh how each word relates to the others, which lets them pick up context and meaning. They’re like the wise old sages of language.
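For the curious, the core trick inside a transformer is called self-attention: every word looks at every other word and blends in information from the ones it finds most relevant. The Python sketch below is a heavily simplified version, assuming tiny hand-written word vectors and no learned weights (real models learn separate query, key, and value projections and stack many such layers). It is not the model used in the paper.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: List[float]) -> List[float]:
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention with identity projections.

    Simplification: queries, keys, and values are the token vectors
    themselves; a real transformer applies learned matrices first.
    """
    d = len(tokens[0])
    outputs: List[Vector] = []
    for query in tokens:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in tokens]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors.
        outputs.append([sum(w * value[i] for w, value in zip(weights, tokens))
                        for i in range(d)])
    return outputs


if __name__ == "__main__":
    # Two made-up 2-dimensional "word" vectors.
    print(self_attention([[1.0, 0.0], [0.0, 1.0]]))
```

Each output vector is a mix of all the input vectors, with the largest share coming from the inputs most similar to the word in question.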
Training the Classifier
We put our classifier through rigorous training, feeding it thousands of job ads to learn from. Think of it as a student cramming for exams, with lots of late nights and coffee! By the end of the training, our classifier can identify job categories with impressive accuracy.
Performance Evaluation
Just like a school report card, we evaluated how well our classifier did. We looked at how often it put a job ad in the right category and how often it got it wrong. This information helps us understand where it shines and where it needs improvement.
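With hierarchical codes, one natural report card is accuracy at each level of the hierarchy: a prediction can get the broad group right even if the exact occupation is wrong. The Python sketch below shows one way to compute that, assuming codes are strings whose leading digits identify the broader groups. The codes in `TRUE_CODES` and `PRED_CODES` are invented, and this is not necessarily the set of metrics reported in the paper.

```python
from typing import Sequence


def accuracy_at_level(true_codes: Sequence[str],
                      pred_codes: Sequence[str],
                      digits: int) -> float:
    """Share of predictions whose first `digits` characters match the truth."""
    if len(true_codes) != len(pred_codes) or not true_codes:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(t[:digits] == p[:digits]
               for t, p in zip(true_codes, pred_codes))
    return hits / len(true_codes)


# Invented 6-digit codes for four job ads (placeholders, not real entries).
TRUE_CODES = ["251401", "222101", "751201", "251401"]
PRED_CODES = ["251401", "222190", "711101", "251201"]

if __name__ == "__main__":
    for digits in (1, 2, 4, 6):
        print(digits, accuracy_at_level(TRUE_CODES, PRED_CODES, digits))
```

Accuracy can only stay the same or drop as the level gets more detailed, so a big gap between levels shows where the classifier starts to lose its way.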
Results and Findings
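After all the hard work, we found some interesting things! Our classifier did pretty well overall, especially with job ads in Polish and English. Taking the hierarchical structure of occupations into account improved prediction accuracy by 1-2 percentage points, particularly for hand-coded online job ads. The classifier struggled a bit more with languages that it didn’t see as often, similar to trying to learn a dialect you’ve never heard before.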
The Importance of Open Data
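In our quest for job ad knowledge, we realized that open data is crucial. By sharing our findings and methods, we enable others to learn from our work. The software is released as open source for the official statistics community, with a particular focus on international comparability. This is like a chef sharing their secret recipe, allowing everyone to enjoy a slice of the pie!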
Conclusion
Our work shows that job ads can be organized in a way that makes them easier to understand. This not only helps job seekers but also provides valuable information for policymakers. Who knew job ads could be so powerful? With our classifier, we’re taking a big step toward making the job market clearer for everyone. So let’s keep sorting and classifying, one job ad at a time!
Original Source
Title: Multilingual hierarchical classification of job advertisements for job vacancy statistics
Abstract: The goal of this paper is to develop a multilingual classifier and conditional probability estimator of occupation codes for online job advertisements in accordance with the International Standard Classification of Occupations (ISCO) extended with the Polish Classification of Occupations and Specializations (KZiS), which is analogous to the European Classification of Occupations. In this paper, we utilise a range of data sources, including a novel one, namely the Central Job Offers Database, which is a register of all vacancies submitted to Public Employment Offices. Their staff members code the vacancies according to the ISCO and KZiS. A hierarchical multi-class classifier has been developed based on the transformer architecture. The classifier begins by encoding the jobs found in advertisements to the widest 1-digit occupational group, and then narrows the assignment to a 6-digit occupation code. We show that incorporation of the hierarchical structure of occupations improves prediction accuracy by 1-2 percentage points, particularly for the hand-coded online job advertisements. Finally, a bilingual (Polish and English) and multilingual (24 languages) model is developed based on data translated using closed and open-source software. The open-source software is provided for the benefit of the official statistics community, with a particular focus on international comparability.
Authors: Maciej Beręsewicz, Marek Wydmuch, Herman Cherniaiev, Robert Pater
Last Update: Nov 6, 2024
Language: English
Source URL: https://arxiv.org/abs/2411.03779
Source PDF: https://arxiv.org/pdf/2411.03779
Licence: https://creativecommons.org/licenses/by-sa/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arXiv for use of its open access interoperability.
Reference Links
- https://lightcast.io/about/data
- https://www.cedefop.europa.eu/en/tools/skills-online-vacancies/occupations/
- https://statistics-awards.eu/
- https://www.gov.pl/web/edukacja/zawody-szkolnictwa-branzowego
- https://psz.praca.gov.pl/rynek-pracy/bazy-danych/klasyfikacja-zawodow-i-specjalnosci/wyszukiwarka-opisow-zawodow
- https://psz.praca.gov.pl/rynek-pracy/bazy-danych/infodoradca
- https://stat.gov.pl/Klasyfikacje/doc/kzs/slownik.html
- https://esco.ec.europa.eu/en/classification/occupation_main
- https://nabory.kprm.gov.pl
- https://warszawa.praca.gov.pl/zgloszenie-oferty-pracy
- https://www.gov.pl/web/edukacja/prognoza-zapotrzebowania-na-pracownikow-w-zawodach-szkolnictwa-branzowego-na-krajowym-i-wojewodzkim-rynku-pracy-2024
- https://oferty.praca.gov.pl/portal/index.cbop
- https://github.com/OJALAB/CBOP-datasets
- https://github.com/argosopentech/argos-translate
- https://github.com/OJALAB/job-ads-datasets/blob/main/data/codes-not-coveted.csv
- https://huggingface.co/allegro/herbert-base-cased
- https://huggingface.co/allegro/herbert-large-cased
- https://huggingface.co/FacebookAI/XLM-roberta-base
- https://huggingface.co/FacebookAI/XLM-roberta-large
- https://esco.ec.europa.eu/en/about-esco/data-science-and-esco/crosswalk-between-esco-and-onet
- https://github.com/OJALAB/job-ads-classifier
- https://repod.icm.edu.pl/dataset.xhtml?persistentId=doi:10.18150/OCUTSI
- https://colab.research.google.com/drive/1a425aagT0lczRxXPWoUlf5aFxUII37nh?usp=sharing