Chapter 1
Automatic classification method of e-commerce commodity raw materials through the introduction of self-supervised concepts and the construction of domain ontology
Because e-commerce platforms classify items by function, items made from the same raw material can be placed in different functional categories, and items made from different raw materials can be placed in the same one. This poses a challenge to marketing staff who compile item sales statistics by raw material. Furthermore, existing item classification methods are difficult to deploy in engineering applications because they require a large amount of manual labelling. This paper therefore builds an item conceptual model that specifies the categories and attributes of item raw materials. The model is used to screen item specification samples and add category labels automatically, to generate a domain-specific lexicon for extracting item raw material features, and finally to train a machine learning classifier that completes the classification. The proposed classification model is verified on flour data from a Chinese e-commerce platform. The experimental results show that the self-supervised learning-based method proposed in this article for classifying the raw materials of e-commerce items can achieve an accuracy of 91%.
Large-scale e-commerce platforms offer a vast array of goods, typically numbering in the millions or more. E-commerce platforms organize items under a range of classification schemes so that customers with varying buying goals can easily find the items they require. When an item is released, the merchant chooses its category based on the prompts provided by the e-commerce platform. As a result, items made from the same raw material may be placed in several different categories, while items made from different raw materials or ingredients may be placed in the same category because they serve the same use. This phenomenon is especially common in the food industry. For instance, cake mixes that contain the same raw material (such as wheat core flour) can be categorized as "flour > wheat core flour" or "grain and oil seasoning > Baking ingredients" in China's Jingdong Mall. Conversely, whole wheat flour and wheat core flour can both be used as bread flour, and companies that focus on the functionality of their items typically classify both as "grain and oil seasoning > Baking materials" when they release them. While this flexible classification clearly helps consumers with their searches, it is exceedingly inconvenient for market workers or merchants who need to track sales of goods by raw material. In practice, it is therefore necessary to categorize items by their raw materials rather than by their listing in the e-commerce platform's catalogue.
The majority of current studies on the classification of e-commerce items rely on machine learning techniques. On the one hand, these methods take the e-commerce platform's item catalogue as the primary basis for classification and the platform's labels as the primary basis for feature construction. Feature vocabulary describing item raw materials is rarely used. Consequently, accuracy drops substantially when such methods are transferred to the classification of item raw materials. On the other hand, many of the machine learning-based methods now in use require labor- and time-intensive human labelling, which is frequently impractical in engineering practice.
To address these issues, this research proposes a self-supervised learning-based approach for classifying the raw materials of e-commerce items. First, domain specialists create an item domain ontology and establish classification rules for specific items based on raw material characteristics. Second, following the concept of self-supervised learning, the samples are separated into normative samples and non-standard samples. Normative samples are samples that satisfy the domain experts' classification criteria; their raw material classification labels can be extracted with regular expressions, and they serve as the training data for machine learning. Non-standard samples are samples whose raw material classification labels cannot be extracted with regular expressions; the machine learning approach presented in this research predicts their labels. Third, based on the item domain ontology, feature keywords are extracted and a feature matrix for machine learning is constructed using BERT and regular expressions. This allows for the automatic classification of item raw materials. The term "self-supervised" as used in this study denotes an application and adaptation of the self-supervised learning concept rather than strict adherence to established self-supervised learning paradigms. We adopt the fundamental idea of self-supervised learning, which is to generate supervisory signals from the intrinsic information within the data. Using a rule system based on the domain ontology, we automatically extract the inherent relationships between attributes and categories from commodity text data to produce label information for normative samples, thereby replacing the conventional manual labelling process.
This approach, while not employing conventional self-supervised techniques like contrastive learning and pre-training tasks, fundamentally aims to diminish dependence on external manual labeling by deriving supervisory signals from intrinsic domain knowledge associations within the data, aligning with the primary objective of self-supervised learning to reduce human intervention.
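The split into normative and non-standard samples described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the rule table `RULES`, the helper names `label_sample` and `split_samples`, the example texts, and the choice to treat a sample as normative only when exactly one rule matches are all assumptions made for illustration.

```python
import re

# Hypothetical fragment of an ontology-derived rule table:
# raw material category -> regular expression that identifies it.
RULES = {
    "wheat core flour": re.compile(r"wheat\s+core\s+flour", re.IGNORECASE),
    "whole wheat flour": re.compile(r"whole\s+wheat\s+flour", re.IGNORECASE),
}


def label_sample(spec_text):
    """Return a category label if exactly one rule matches, else None.

    Assumption: ambiguous texts (several rules match) are treated the same
    as texts with no match and left for the classifier to predict.
    """
    matches = [label for label, pattern in RULES.items()
               if pattern.search(spec_text)]
    return matches[0] if len(matches) == 1 else None


def split_samples(samples):
    """Split texts into auto-labelled normative samples and the rest."""
    normative, non_standard = [], []
    for text in samples:
        label = label_sample(text)
        if label is None:
            non_standard.append(text)
        else:
            normative.append((text, label))
    return normative, non_standard


samples = [
    "Ingredients: wheat core flour, 2.5 kg",
    "Ingredients: whole wheat flour",
    "Premium bread baking mix, 1 kg",
]
normative, non_standard = split_samples(samples)
```

In this sketch the first two texts receive labels automatically and would form the training set, while the third has no extractable label and would be passed to the trained classifier for prediction.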
Integrating domain ontology rules with self-supervised concepts is a potential innovation of our work. Current self-supervised learning techniques predominantly depend on the general structural characteristics of data. Commodity classification in e-commerce, however, has pronounced domain attributes, and incorporating domain ontology knowledge can identify domain-specific relationships within the data more precisely. This approach reduces the need for manual labelling while improving the domain adaptability of label generation. This implementation strategy is intended to offer a new solution for semi-automated categorization tasks in specific fields. The innovations of this paper are as follows: first, an item domain ontology is constructed to realize the classification of item raw materials on e-commerce platforms; second, based on the concept of self-supervised learning, identification rules for normative samples are designed and normative samples are labelled automatically.
The rest of this paper is structured as follows: section "Related work" discusses related work, section "Model construction" introduces the proposed self-supervised learning-based model for classifying items based on their raw materials in e-commerce, section "Experimental validation" evaluates our method using real data obtained from Chinese e-commerce platforms, section "Discussion" is devoted to the discussion of this research, and section "Conclusion" concludes with future research directions.
Related work
The primary goal of e-commerce item classification research in academia is to develop automatic techniques for classifying large-scale item texts on e-commerce platforms. Compared to traditional text classification, e-commerce item classification is characterized by a huge variety of categories, short and noisy texts, and imbalanced samples across categories. Scholars have suggested improvements to feature vectors, data sources, and classification models in order to increase classification accuracy. Shen et al. addressed the issue of data sparsity using statistical smoothing techniques and created a two-stage learning strategy based on the naive Bayes algorithm to increase classification accuracy. Gupta et al. devised the graded weighted bag-of-words vector, a distributed semantic representation technique, to overcome the high dimensionality and sparsity of bag-of-words or term frequency-inverse document frequency feature vectors. To tackle the issue of sellers on e-commerce platforms entering item information in an irregular manner, Das et al. introduced a noise detection technique utilizing the Corr-LDA model. Using item titles and descriptions as the classification texts, Cevahir and Murakami extracted item features such as model numbers, sizes, and quantities from data from the Japanese e-commerce platform Rakuten. They then developed a classification model based on a deep belief network and a deep autoencoder. Chen and colleagues introduced a neural item classification model designed to address the issues of unstable category vocabulary and idea fuzziness in fine-grained item classification. Neural item classification creates item categories based on item information, including titles, attributes, and descriptions.
Deep learning has been used in certain research to classify large-scale e-commerce items, with promising results. To address sparsity and scalability concerns, Ha et al. introduced an item classification model based on multiple recurrent neural networks. This model allows many attributes to be incorporated into a common representation. Kim et al. presented a deep learning model for item classification based on a CNN and a Bi-LSTM in 2021. Islam and Alauddin presented a deep convolutional neural network-based categorization technique that relied on item photos rather than item titles and details. Xu et al. proposed an OWL (Open-world Learning) model based on meta-learning. It allows categories to be added or deleted without retraining the model, and it only keeps a set of dynamically observable categories. Qiao et al. addressed the distinct requirements of end users and the uniqueness of raw data, assessed a synthetic data set, and enhanced clustering accuracy using the paradigm of federated learning, achieving a 9% reduction in convergence time. Tan et al. emphasized the need to recognize BGP community properties for modeling and developed a graph neural network model incorporating convolutional residual networks and fully connected layers, achieving an accuracy rate of 90%.
A common goal of the above research is to help consumers quickly find the items they want. Nevertheless, this goal can lead to erroneous categorization from the raw material perspective: items with comparable functions or characteristics but distinct raw materials are grouped together, and items that share the same raw materials but serve different consumer purposes are separated. Merchants and market workers, however, need to categorize and evaluate sales according to the raw materials used. The current methods are not suitable for classifying item raw materials and yield low classification accuracy on that task. Hence, it is necessary to develop an item categorization model based on the classification of raw materials. This entails developing an item domain ontology to determine the classification labels for the raw materials of items. Subsequently, features relevant to the raw materials of the items are derived from the textual data. Finally, machine learning techniques are used to categorize e-commerce items by their raw materials.
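As a rough illustration of how raw material features might be derived from item text, the following Python sketch counts occurrences of domain lexicon terms to form a feature matrix. The lexicon entries, the function name `build_feature_matrix`, and the example texts are hypothetical, and the sketch uses regular expressions only; it does not reproduce any BERT-based keyword extraction.

```python
import re

# Hypothetical domain lexicon; a real one would be generated from the
# item domain ontology rather than written by hand.
LEXICON = ["wheat core", "whole wheat", "buckwheat", "high gluten"]


def build_feature_matrix(texts, lexicon=LEXICON):
    """Return one row per text and one column per lexicon term.

    Each cell holds the number of case-insensitive occurrences of the
    term in the text.
    """
    patterns = [re.compile(re.escape(term), re.IGNORECASE)
                for term in lexicon]
    return [[len(pattern.findall(text)) for pattern in patterns]
            for text in texts]


texts = [
    "Whole wheat bread flour, whole wheat 100%",
    "High gluten wheat core flour",
]
matrix = build_feature_matrix(texts)
# matrix == [[0, 2, 0, 0], [1, 0, 0, 1]]
```

Rows of such a matrix, paired with automatically extracted labels, could then be passed to any standard classifier; the choice of classifier is left open here.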
On the other hand, the training sets of the above e-commerce item classification models rely on a multitude of annotated samples or require time-consuming and labor-intensive manual annotation. For example, Shen et al. utilized a substantial number of samples that were categorized by sellers on the eBay site as the training set. Cevahir and Murakami employed a labeled dataset obtained from the Japanese Rakuten platform. Several scholars have employed various techniques for extracting labels. For instance, Das et al. introduced a methodology that utilizes topic models and a simplified manual labeling process to acquire item category labels. Similarly, Chen et al. developed a system that generates detailed item labels by analyzing user search logs.
Self-supervised learning leverages the inherent characteristics of data to extract meaningful features, using auxiliary tasks instead of relying on manual labels or external supervision signals for categorization. Self-supervised learning has been utilized in both the medical domain and the field of sentiment analysis. Chaves et al. employed self-supervised pre-training models to train and derive pseudo labels for the purpose of classifying skin lesions. Su et al. introduced a progressive self-supervised attention technique for identifying and extracting the most influential words in sentiment prediction. They iteratively extracted feature words using this method and incorporated them into neural network models to enhance the accuracy of sentiment classification. This paper presents a novel approach for classifying the raw materials of e-commerce items using self-supervised learning. The method divides samples into standardized and non-standardized categories by constructing an item domain ontology. Labels are then extracted from the standardized samples to serve as training data for machine learning.