Simple Science

Cutting edge science explained simply

Improving Attribute-Value Extraction in E-commerce

A new model enhances the identification of product attributes and values in online listings.


E-commerce has grown rapidly, leading to a vast number of products available online. Each product typically has various features, often known as attributes, and each attribute has specific values. For instance, a smartphone may have attributes like Brand, Color, and Model Name with values such as Samsung, Phantom Gray, and Galaxy S21. These attributes and values help customers find products they want.

However, product listings from sellers often have incomplete information, which can be improved by using details from the product title. The task of automatically identifying these attribute-value pairs is important in e-commerce but can be complicated due to the variety of product categories and the limited amount of labeled training data available.

The Challenge

Extracting attribute-value pairs from product names is not straightforward. Vendors sometimes provide details that are incomplete or inconsistent, making it hard for automated systems to perform well. Moreover, many attributes exist for various products, often numbering in the thousands, making the task even more complex.

Furthermore, some attribute names overlap or are used interchangeably, such as Model No. and Model Number. These inconsistencies pose a challenge for any system designed to classify or extract this information.

Additionally, such extraction systems often need to work in real time, especially in high-traffic environments, which adds another layer of difficulty.

Our Solution

To tackle these problems, we developed a two-stage model that extracts attribute-value pairs from product titles. The model is designed to learn from partially labeled data, meaning it can work with incomplete attribute-value pairs, reducing the need for fully annotated datasets.

Stage One: Attribute Extraction

The first stage of the model uses a generative model to predict potential attributes present in the product title. In other words, it takes a product name and outputs a list of possible attributes associated with that name.

Stage Two: Value Extraction

Once attributes are identified, the second stage kicks in. This stage uses a classification model to determine the corresponding values for each identified attribute.

By using these two stages, the model can effectively handle the complexities involved with various attributes while also being trained on partially labeled data.
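The flow of the two stages can be sketched in a few lines of Python. This is only an illustrative sketch: `extract_pairs`, `toy_generator`, and `toy_tagger` are hypothetical names, and the two toy functions are hard-coded stand-ins for the trained models, whose real interfaces are not described here.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces (assumptions, not taken from the source):
# stage one maps a title to candidate attribute names,
# stage two maps (title, attribute) to one 0/1 flag per whitespace token.
AttributeGenerator = Callable[[str], List[str]]
ValueTagger = Callable[[str, str], List[int]]


def extract_pairs(title: str,
                  generate_attributes: AttributeGenerator,
                  tag_values: ValueTagger) -> Dict[str, str]:
    """Run stage one, then stage two once per predicted attribute."""
    tokens = title.split()
    pairs = {}
    for attribute in generate_attributes(title):
        flags = tag_values(title, attribute)
        # Keep the tokens that stage two flagged as part of the value.
        pairs[attribute] = " ".join(t for t, f in zip(tokens, flags) if f)
    return pairs


# Toy stand-ins so the sketch runs without any trained model.
def toy_generator(title: str) -> List[str]:
    return ["Brand", "Color"]


def toy_tagger(title: str, attribute: str) -> List[int]:
    lookup = {"Brand": {"Samsung"}, "Color": {"Phantom", "Gray"}}
    return [1 if tok in lookup.get(attribute, set()) else 0
            for tok in title.split()]


result = extract_pairs("Samsung Galaxy S21 Phantom Gray",
                       toy_generator, toy_tagger)
print(result)
```

Running stage two once per attribute keeps each classification problem small: the tagger only has to answer "which words are the value of this one attribute?" rather than choosing among thousands of attribute labels at once.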

Model Performance

Our model shows significant improvement over existing systems. It increases the number of correctly identified attribute-value pairs by up to 56.3% compared to previous approaches. Additionally, we use the trained model to regenerate the training dataset with expanded attribute-value annotations, a process called "bootstrapping." Faster standard NER models trained on this improved data can then reach performance comparable to our model.
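One round of bootstrapping might look like the sketch below. The function name `bootstrap_labels`, the `toy_predict` stand-in, and the rule that original annotations win over model predictions are all assumptions made for illustration; the source only states that the model is used to expand the training annotations.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, Dict[str, str]]  # (product title, attribute -> value)


def bootstrap_labels(dataset: List[Example],
                     predict: Callable[[str], Dict[str, str]]) -> List[Example]:
    """Return a new dataset whose labels are expanded with model predictions."""
    expanded = []
    for title, labels in dataset:
        merged = dict(predict(title))  # start from the model's predictions
        merged.update(labels)          # assumption: original labels take priority
        expanded.append((title, merged))
    return expanded


# Toy stand-in for a trained extractor (hypothetical).
def toy_predict(title: str) -> Dict[str, str]:
    return {"Brand": "samsung", "Color": "Phantom Gray"}


partial = [("Samsung Galaxy S21 Phantom Gray", {"Brand": "Samsung"})]
expanded = bootstrap_labels(partial, toy_predict)
print(expanded)
```

The expanded dataset can then be used to train a faster model, and the round can be repeated.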

Integration in Real-World Applications

We successfully integrated this model into IndiaMART, India’s largest B2B e-commerce platform, achieving a 20.2% increase in the number of correctly identified attribute-value pairs over the existing deployed system while maintaining a high precision of 89.5%.

Importance of Attributes and Values

In e-commerce, attributes and values play an essential role in helping customers refine their searches. Common attributes such as Brand, Model, and Color help consumers make informed choices quickly.

For instance, if a buyer is looking for a particular product, knowing its Brand and Model can narrow down the search results significantly. However, if the attribute-value information is lacking or incorrect, it could lead to confusion or frustration for customers.

Methodology for Attribute-Value Extraction

The model employs a two-stage approach:

  1. Attribute Extraction via Generative Model: This step identifies all relevant attributes associated with a product name.
  2. Value Extraction via Classification Model: This step classifies each word in the product title to ascertain if it represents a value for the identified attributes.
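To make the second step concrete, the sketch below turns per-word labels into value strings by grouping consecutive flagged words. It assumes a simple 0/1 label per whitespace token for one attribute; the actual label scheme and tokenization used by the model are not given here, and `decode_value_spans` is a hypothetical helper.

```python
from typing import List


def decode_value_spans(tokens: List[str], labels: List[int]) -> List[str]:
    """Group consecutive tokens labeled 1 into value strings."""
    spans, current = [], []
    for token, label in zip(tokens, labels):
        if label == 1:
            current.append(token)
        elif current:
            # A 0 label closes the span that was being built.
            spans.append(" ".join(current))
            current = []
    if current:  # flush a span that runs to the end of the title
        spans.append(" ".join(current))
    return spans


tokens = "Samsung Galaxy S21 Phantom Gray".split()
color_spans = decode_value_spans(tokens, [0, 0, 0, 1, 1])
print(color_spans)
```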

Training with Partially Labeled Data

A unique aspect of our method is its ability to learn effectively from partially labeled data. By incorporating markers during the training process, the model can better grasp which words in the product title correspond to values for various attributes.

These markers help the model focus on the relevant parts of the input, enabling more accurate predictions during extraction.
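As a rough illustration of the idea, the sketch below wraps the words already known to be values in marker tokens before the title is passed to a model. The marker strings `[V]` and `[/V]` and the helper `add_markers` are invented for this example; the source does not describe the actual marker format.

```python
from typing import List


def add_markers(title: str, known_values: List[str],
                open_marker: str = "[V]", close_marker: str = "[/V]") -> str:
    """Wrap the first occurrence of each known value in marker tokens."""
    marked = title
    for value in known_values:
        wrapped = f"{open_marker} {value} {close_marker}"
        marked = marked.replace(value, wrapped, 1)
    return marked


marked_title = add_markers("Samsung Galaxy S21 Phantom Gray",
                           ["Samsung", "Phantom Gray"])
print(marked_title)  # [V] Samsung [/V] Galaxy S21 [V] Phantom Gray [/V]
```

With partially labeled data, only some values are known, so only those words would be marked; unmarked words carry no claim about whether they are values.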

Value Pruning

In addition to the above techniques, we have introduced a concept called "Value Pruning." It lets the value-extraction stage output a null value for any attribute that the first stage predicted incorrectly. Discarding those attributes filters out irrelevant predictions, which improves the overall accuracy of attribute-value pair extraction and leads to a cleaner output.
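The filtering step can be pictured as follows. This is a minimal sketch that assumes the second stage reports a null value as `None` or an empty string; `prune_null_values` is a hypothetical helper name.

```python
from typing import Dict, Optional


def prune_null_values(candidates: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Drop attributes for which the value stage produced no value."""
    return {attr: value for attr, value in candidates.items() if value}


# "Material" stands for an attribute wrongly proposed by the first stage.
candidates = {"Brand": "Samsung", "Material": None, "Color": ""}
pruned = prune_null_values(candidates)
print(pruned)  # {'Brand': 'Samsung'}
```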

Comparison with Existing Models

When compared to existing models, our system shows superior performance in both automated and manual evaluations. Both precision (the share of the model's predictions that are correct) and recall (the share of the true attribute-value pairs that the model finds) are often higher for our model.
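Precision is the share of predicted pairs that are correct, and recall is the share of reference pairs that the model finds. The sketch below computes both over sets of (attribute, value) pairs using exact matching; the example pairs are made up, and the evaluation protocol actually used may differ.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (attribute, value)


def precision_recall(predicted: Set[Pair], gold: Set[Pair]) -> Tuple[float, float]:
    """Exact-match precision and recall over attribute-value pairs."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


predicted = {("Brand", "Samsung"), ("Color", "Gray"),
             ("Model Name", "Galaxy S21")}
gold = {("Brand", "Samsung"), ("Color", "Phantom Gray"),
        ("Model Name", "Galaxy S21"), ("Storage", "128GB")}

precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

Here two of the three predictions are correct (precision 2/3), and two of the four reference pairs are found (recall 1/2).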

Using different variations of our model, we assessed how various components like markers and value pruning affect overall performance. The results indicated that both are crucial for enhancing the model’s ability to extract attributes and values accurately.

Experimental Setup

To verify our model's effectiveness, we conducted experiments using real-world data. We pulled product listings from a popular B2B e-commerce platform, ensuring we had a diverse set of attributes and products for thorough testing.

By using a dataset with thousands of unique attribute-value pairs, we could train the model effectively and evaluate its performance on a substantial number of examples.

Results

The results of our experiments reveal that the two-stage model consistently outperforms existing systems, particularly in tasks that involve incomplete data. The use of markers and value pruning significantly improves the balance between precision and recall.

Handling Long Product Names

To further evaluate model performance, we examined how well it handles long product names, as these are common in e-commerce. Our model maintained high accuracy even with product names that contain many words, which demonstrates its robustness and adaptability.

Conclusion

In conclusion, our two-stage model effectively addresses the challenges of extracting attribute-value pairs from product titles in e-commerce. By combining training on partially labeled data, marker embeddings, and value pruning, our approach offers a substantial improvement over traditional methods.

The success of our model when applied to a large online platform highlights its practical value and potential for broader application in the e-commerce sector.

Future work could involve further iterations of bootstrapping to continue improving data quality. As the e-commerce landscape evolves, the need for accurate, real-time attribute extraction will remain critical, and our model is well-positioned to meet it.

Original Source

Title: A Framework for Leveraging Partially-Labeled Data for Product Attribute-Value Identification

Abstract: In the e-commerce domain, the accurate extraction of attribute-value pairs (e.g., Brand: Apple) from product titles and user search queries is crucial for enhancing search and recommendation systems. A major challenge with neural models for this task is the lack of high-quality training data, as the annotations for attribute-value pairs in the available datasets are often incomplete. To address this, we introduce GenToC, a model designed for training directly with partially-labeled data, eliminating the necessity for a fully annotated dataset. GenToC employs a marker-augmented generative model to identify potential attributes, followed by a token classification model that determines the associated values for each attribute. GenToC outperforms existing state-of-the-art models, exhibiting upto 56.3% increase in the number of accurate extractions. Furthermore, we utilize GenToC to regenerate the training dataset to expand attribute-value annotations. This bootstrapping substantially improves the data quality for training other standard NER models, which are typically faster but less capable in handling partially-labeled data, enabling them to achieve comparable performance to GenToC. Our results demonstrate GenToC's unique ability to learn from a limited set of partially-labeled data and improve the training of more efficient models, advancing the automated extraction of attribute-value pairs. Finally, our model has been successfully integrated into IndiaMART, India's largest B2B e-commerce platform, achieving a significant increase of 20.2% in the number of correctly identified attribute-value pairs over the existing deployed system while achieving a high precision of 89.5%.

Authors: D. Subhalingam, Keshav Kolluru, Mausam, Saurabh Singal

Last Update: 2024-11-18 00:00:00

Language: English

Source URL: https://arxiv.org/abs/2405.10918

Source PDF: https://arxiv.org/pdf/2405.10918

Licence: https://creativecommons.org/licenses/by/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arXiv for use of its open access interoperability.
