Simple Science

Cutting-edge science explained simply

# Computer Science # Machine Learning # Computation and Language # Information Retrieval

Improving Product Categorization in E-Commerce

A new approach to enhance consistency in online product categorization.

― 6 min read


E-Commerce Categorization Revamped: a robust approach to consistent product labeling in online retail.

In the busy world of online shopping, organizing products into the right categories is very important. A leading web company uses a product categorization model that helps sort millions of items every day. This model takes the name of a product and decides which category it belongs to from a vast list of options. However, issues can arise with this model, especially when product names are changed only slightly.

For example, if two items are similar but one is a "blue shirt" and the other is a "large blue shirt," the model might categorize them differently. This inconsistency can lead to problems in how items are recommended or searched for, which can frustrate users. To fix this, we developed a new way of training the model so that its categorization is more consistent.

We want to improve this model without slowing it down, as it has to manage a huge amount of data. One effective approach is semi-supervised learning, which lets us make better use of both labeled data (where the category is known) and unlabeled data (where it is not). We have two main methods for enhancing the categorization.

The first method uses available product catalogs to help create new training data. This involves looking at groups of similar items and using them to help the model learn better. The second method uses a generative model to create new examples that look like the actual products but differ in some minor ways, without changing their core meaning.

The rise of e-commerce platforms such as Amazon and eBay over the last twenty years has significantly increased the number of products available online. These platforms depend on both clear product descriptions and inferred categories for an enjoyable shopping experience. The category assigned to a product can heavily influence how well it sells, as it affects search results and recommendations.

Our focus is on improving a machine learning model known as 'the categorizer.' This model quickly classifies billions of products daily, assigning each one the appropriate category from an established hierarchy called the Google Product Taxonomy. However, recent evaluations have shown that while the model is generally effective, it struggles to label products consistently, especially when product titles change slightly, such as in color or size.
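To make this concrete, here is a minimal sketch of what a title-to-category classifier can look like. The production categorizer is not described at this level of detail, so the TF-IDF features, the linear model, and the tiny example catalog below are stand-ins rather than the actual system.

```python
# A minimal stand-in for a title-to-category classifier (not the production
# model): TF-IDF features over product titles feeding a simple linear classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled data; the real taxonomy has thousands of categories.
titles = ["blue cotton shirt", "large blue shirt", "red summer dress", "blue denim jeans"]
categories = ["Shirts & Tops", "Shirts & Tops", "Dresses", "Jeans"]

categorizer = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # word and bigram features from the title
    LogisticRegression(max_iter=1000),     # simple linear classifier
)
categorizer.fit(titles, categories)

# Ideally both of these should land in the same category.
print(categorizer.predict(["blue shirt", "small blue shirt"]))
```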

Numerous studies in natural language processing (NLP) have looked into consistency in classification tasks. These studies highlight how certain surface features can mislead models, causing inconsistencies when product details change just a little.

Even though the model might perform well on average, this inconsistency can create significant problems for users who depend on accurate recommendations and search results. For instance, the model might label a "red dress" and a "blue dress" differently, even though they belong to the same category.

To tackle this inconsistency, we apply various data augmentation techniques to enhance the model's training. By adding more varied examples of similar items, we help the model learn that small changes should not lead to different categories.

Using data augmentation to improve machine learning models is widely recognized and has been shown to increase the reliability of such systems. We continue to use the existing model structure to ensure that it can still process millions of items effectively.

Our new framework is called Consistent Semi-Supervised Learning (Consistent-SSL). We gather data from product catalogs and create clusters of items that are similar but have slight differences. With this setup, we can apply two methods that take advantage of the unlabeled data: a self-training method and a generative approach.
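How exactly the catalog data is grouped is not spelled out in this summary, so the sketch below only illustrates the idea with a simple heuristic of our own: titles that match after removing color and size words are treated as one cluster of similar items.

```python
# Illustrative clustering of catalog titles into groups of similar items.
# The attribute list and normalization rule are assumptions for this sketch,
# not the paper's actual procedure.
from collections import defaultdict

ATTRIBUTES = {"blue", "red", "black", "small", "medium", "large", "xl"}

def normalize(title: str) -> str:
    """Drop color/size words so near-duplicate titles collapse to one key."""
    return " ".join(t for t in title.lower().split() if t not in ATTRIBUTES)

def cluster_titles(titles):
    clusters = defaultdict(list)
    for title in titles:
        clusters[normalize(title)].append(title)
    # Keep only groups that actually contain several variants of an item.
    return {key: group for key, group in clusters.items() if len(group) > 1}

catalog = ["blue shirt", "large blue shirt", "red shirt", "leather wallet"]
print(cluster_titles(catalog))   # {'shirt': ['blue shirt', 'large blue shirt', 'red shirt']}
```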

The self-training method first creates pseudo-labels for the unlabeled data. We train a base model with the labeled data and use it to assign these pseudo-labels. Every time we look at a group of similar items, we ensure that they all get the same pseudo-label. This can help improve the consistency of the model.
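As a rough sketch, this cluster-level pseudo-labeling step could look like the following. The aggregation by majority vote is our simplification; the actual method may combine per-item predictions differently.

```python
# Cluster-consistent pseudo-labeling: every title in a cluster receives the
# same pseudo-label. Majority voting here is an assumption for illustration.
from collections import Counter

def cluster_pseudo_labels(base_model, clusters):
    """clusters: dict mapping a cluster key to a list of unlabeled titles."""
    pseudo_labeled = []
    for titles in clusters.values():
        predictions = base_model.predict(titles)                 # per-item guesses
        consensus = Counter(predictions).most_common(1)[0][0]    # one label per cluster
        pseudo_labeled.extend((title, consensus) for title in titles)
    return pseudo_labeled

# The resulting (title, pseudo-label) pairs are added to the labeled data
# and the categorizer is retrained on the combined set.
```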

In our generative method, we train a model to understand how items can vary. For a pair of items, the model learns to create new variations of the first item while keeping its original label. This allows us to generate multiple examples from a single item, enhancing the amount of training data.

We then filter the generated examples so that they closely resemble real-world products. This helps create a training set that is both diverse and consistent.
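The exact filtering criterion is not given in this summary, so the sketch below uses a simple stand-in: a generated title is kept only if it is sufficiently similar to at least one real catalog title under TF-IDF cosine similarity.

```python
# Filtering generated titles so they stay close to real product titles.
# The similarity measure and threshold are assumptions for this sketch.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def filter_generated(generated, real_titles, threshold=0.3):
    vectorizer = TfidfVectorizer().fit(real_titles + generated)
    real_vecs = vectorizer.transform(real_titles)
    gen_vecs = vectorizer.transform(generated)
    best_match = cosine_similarity(gen_vecs, real_vecs).max(axis=1)  # closest real title
    return [title for title, score in zip(generated, best_match) if score >= threshold]

real = ["blue cotton shirt", "large blue shirt", "red summer dress"]
generated = ["small blue shirt", "blue shirt xxl", "quantum blender"]  # output of the generative model
print(filter_generated(generated, real))   # the off-distribution title is dropped
```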

We put our methods to the test using a dataset of commercial products with labels representing their categories. The dataset consisted of pre-labeled samples and a vast collection of unlabeled products from various retailers. Each sample contained details about products, including their title and category.

Our experiments focused on two aspects: accuracy and consistency. Accurate models produce correct predictions, while consistent models produce the same prediction for similar items. To measure performance, we created two different test sets. The accuracy test used labeled samples to compute a score, while the consistency test used pairs of similar product titles to see if they received the same label.
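For illustration, a pair-based consistency score could be computed as below; the actual construction of the test pairs and the exact metric used in the paper may differ.

```python
# A minimal consistency score: the fraction of similar-title pairs that the
# model assigns to the same category. The example pairs are invented.
def consistency_rate(model, pairs):
    same = sum(
        model.predict([title_a])[0] == model.predict([title_b])[0]
        for title_a, title_b in pairs
    )
    return same / len(pairs)

pairs = [("blue shirt", "large blue shirt"), ("red dress", "blue dress")]
# consistency_rate(categorizer, pairs) -> 1.0 means every pair got one label
```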

Through experimentation, we compared our methods against existing models. Our self-training method improved consistency rates while slightly reducing overall accuracy. The generative method likewise improved consistency without significantly impacting accuracy.

These findings highlight how much data quality influences model performance. The more high-quality examples we can provide during training, the better the model becomes at categorizing similar items consistently.

Overall, our work shows that it is essential to consider not only the amount of data when training models but also the quality and distribution of that data. We learned that using real-world samples is generally better than generated ones when it comes to achieving good performance.

While our methods have shown promising results, some limitations remain. Our study was focused on just one particular model and dataset, so results may differ in other contexts. Moreover, our approaches centered on data augmentation instead of altering the model's core design. Future efforts could explore integrating consistency directly into the model's design or objectives.

Finally, it’s important to maintain ethical principles while conducting research like this. Our study adhered to ethical guidelines and aimed to ensure that the impact on users is positive.

In conclusion, we introduced a new way to improve e-commerce product categorization by ensuring that similar items are consistently labeled. By using semi-supervised learning techniques, we showed that it’s possible to enhance the training of the model while maintaining its efficiency. This can lead to a better shopping experience for users by improving the accuracy of recommendations and search results. We hope these advancements will pave the way for further improvements in product categorization in a rapidly evolving e-commerce landscape.

Original Source

Title: Consistent Text Categorization using Data Augmentation in e-Commerce

Abstract: The categorization of massive e-Commerce data is a crucial, well-studied task, which is prevalent in industrial settings. In this work, we aim to improve an existing product categorization model that is already in use by a major web company, serving multiple applications. At its core, the product categorization model is a text classification model that takes a product title as an input and outputs the most suitable category out of thousands of available candidates. Upon a closer inspection, we found inconsistencies in the labeling of similar items. For example, minor modifications of the product title pertaining to colors or measurements majorly impacted the model's output. This phenomenon can negatively affect downstream recommendation or search applications, leading to a sub-optimal user experience. To address this issue, we propose a new framework for consistent text categorization. Our goal is to improve the model's consistency while maintaining its production-level performance. We use a semi-supervised approach for data augmentation and present two different methods for utilizing unlabeled samples. One method relies directly on existing catalogs, while the other uses a generative model. We compare the pros and cons of each approach and present our experimental results.

Authors: Guy Horowitz, Stav Yanovsky Daye, Noa Avigdor-Elgrabli, Ariel Raviv

Last Update: 2023-05-30 00:00:00

Language: English

Source URL: https://arxiv.org/abs/2305.05402

Source PDF: https://arxiv.org/pdf/2305.05402

Licence: https://creativecommons.org/licenses/by/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arXiv for use of its open access interoperability.
