Creating an E-Commerce Product Category Classifier using Deep Learning — Part 1

Problem Description :

We aim to create a product category API that uses machine learning and deep learning to predict the possible categories/classes for any given product name and description. The problem is set in the e-commerce domain, and the dataset used to train our models contains products and their labeled categories.

Fig 1. High-level solution overview of API which takes product name and its description and predicts possible categories.

This will help the organization automatically predict categories for any new product added to the inventory, reducing the time and effort spent on manual tagging of product categories. I have tried to follow the CRISP-DM standard for model development.

Fig 2. The cross-industry standard process for data mining, known as CRISP-DM, is an open standard process model that describes common approaches used by data mining experts.

Understanding Dataset :

We will consider the BestBuyAPIs open dataset, which contains three JSON files: categories.json, products.json, and stores.json. The dataset can be found here:

The stores.json file contains information like store id, type, name, address, hours, etc., which is not relevant for this task at this point. The categories.json file lists the different categories our predictions will belong to, but it will not be used in this work. The file we use is products.json, which contains the product names, descriptions, and their categories, from which our model will learn to tag categories for new products.

Below is a single instance from the products.json file, which represents the information for a single product in the dataset.

{'category': [{'id': 'abcat0300000', 'name': 'Car Electronics & GPS'},
{'id': 'pcmcat165900050023', 'name': 'Car Installation Parts & Accessories'},
{'id': 'pcmcat331600050007', 'name': 'Car Audio Installation Parts'},
{'id': 'pcmcat165900050031', 'name': 'Deck Installation Parts'},
{'id': 'pcmcat165900050033', 'name': 'Dash Installation Kits'}],
'description': 'From our expanded online assortment; compatible with select GM vehicles; plastic material',
'image': '',
'manufacturer': 'Metra',
'model': '99-4500',
'name': 'Metra - Radio Dash Multikit for Select GM Vehicles - Black',
'price': 16.99,
'shipping': 0,
'sku': 346646,
'type': 'HardGood',
'upc': '086429003273',
'url': ''}

From the above example, we can see that the product named "Metra - Radio Dash Multikit for Select GM Vehicles - Black", with the description "From our expanded online assortment; compatible with select GM vehicles; plastic material", belongs to the categories Car Electronics & GPS, Car Installation Parts & Accessories, Car Audio Installation Parts, Deck Installation Parts, and Dash Installation Kits.

For now, information like images is out of scope for predicting the product category. We can also see other useful attributes, such as price and model, which could be used to solve other problems for an e-commerce organization.

Data Understanding and Preparation :

The only relevant attributes for our problem are name, description, and category, which are extracted and converted to a suitable data frame using the code below.

The output data frame containing the product name, description, and category looks like the one below. The categories for the first instance include Connected Home and Houseware, Houseware, etc.
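The original extraction script is not reproduced here, so the following is only a minimal sketch of how the three attributes could be pulled into a pandas data frame. The helper name `build_frame` and the choice to keep category names (rather than ids) are assumptions made for illustration; the sample record is a shortened copy of the instance shown earlier.

```python
import pandas as pd

def build_frame(products):
    """Keep only name, description and category names for each product."""
    rows = []
    for product in products:
        rows.append({
            "name": product.get("name", ""),
            "description": product.get("description", ""),
            # Assumption: only the category names are kept, not the ids.
            "category": [c["name"] for c in product.get("category", [])],
        })
    return pd.DataFrame(rows, columns=["name", "description", "category"])

# In practice the records would be loaded from the file, for example:
#   import json
#   with open("products.json") as f:
#       products = json.load(f)
sample = [{
    "name": "Metra - Radio Dash Multikit for Select GM Vehicles - Black",
    "description": "From our expanded online assortment; compatible with "
                   "select GM vehicles; plastic material",
    "category": [
        {"id": "abcat0300000", "name": "Car Electronics & GPS"},
        {"id": "pcmcat165900050033", "name": "Dash Installation Kits"},
    ],
    "price": 16.99,
}]

df = build_frame(sample)
print(df.shape)
```

Attributes such as price or sku are simply ignored by this helper, which matches the scope described above.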

Fig 3. For the first instance, the product name is Duracell-AAA Batteries with its description and labeled categories.

There are a total of 51646 products listed in the products.json file. Since each instance can belong to multiple categories, this is a multi-label classification problem, where each instance has a set of target labels. If there are multiple categories but each instance is assigned exactly one, the problem is instead a multi-class classification problem.

In machine learning, multi-label classification and the strongly related problem of multi-output classification are variants of the classification problem where multiple labels may be assigned to each instance. Multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of more than two classes; in the multi-label problem, there is no constraint on how many of the classes the instance can be assigned to.

Formally, multi-label classification is the problem of finding a model that maps inputs x to binary vectors y (assigning a value of 0 or 1 to each element (label) in y). Multi-class problems assume that all the classes are mutually exclusive, but that is not the case here, as a product can belong to multiple classes/categories.

Data Transformation :

We need to transform our data into a format that can be fed to a machine learning model. First, we will list the categories separately:

Fig 4. Listing categories for each product separately
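One way to list the categories separately is to spread each product's category list over one column per position. This is a sketch on toy data; the column names `category_1`, `category_2`, ... are hypothetical and not taken from the original code.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Product A", "Product B"],
    "category": [
        ["Car Electronics & GPS", "Dash Installation Kits"],
        ["Appliances"],
    ],
})

# One column per category position; shorter lists are padded with missing values.
cats = pd.DataFrame(df["category"].tolist(), index=df.index)
cats.columns = [f"category_{i + 1}" for i in range(cats.shape[1])]

wide = pd.concat([df[["name"]], cats], axis=1)
print(wide)
```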

Let us find the number of unique categories/classes, since our predictions will be drawn from this set.
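Counting the unique categories can be done with a set over all category lists. The toy lists below are made up for illustration.

```python
category_lists = [
    ["Car Electronics & GPS", "Dash Installation Kits"],
    ["Appliances"],
    ["Appliances", "Car Electronics & GPS"],
]

# Flatten all lists into one set so that each category is counted once.
unique_categories = sorted({name for cats in category_lists for name in cats})
print(len(unique_categories))
```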

The total number of unique classes/categories comes out to be 1802, so there are 1802 classes a prediction can fall into. Now comes the major data transformation step, where we unfold the dataset so that every prediction class becomes its own column.

Fig 5. Unfolded dataset with all prediction classes to view all at once. The width of the data frame is 1809, as the first 7 columns are just category label names and the rest are the 1802 classes.

Now we need to set the cell to 1 for each row/product under its respective categories; that is, cell[name, category] must be 1 if that product belongs to that category. The script for that transformation is below, and the resulting dataset is a sparse 0-1 matrix.

The shape of the data frame is (51646, 1804): there are 51646 products and 1802 class/category columns, plus 2 non-label columns (presumably the product name and description).
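The transformation script itself is not shown here, so this is a minimal sketch of the idea on toy data: start from an all-zero frame with one column per class and set cell[product, category] to 1.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Product A", "Product B"],
    "category": [
        ["Car Electronics & GPS", "Dash Installation Kits"],
        ["Appliances"],
    ],
})

classes = sorted({c for cats in df["category"] for c in cats})

# All-zero indicator frame: one row per product, one column per class.
labels = pd.DataFrame(0, index=df.index, columns=classes, dtype="int8")
for row, cats in df["category"].items():
    labels.loc[row, cats] = 1

print(labels)
```

scikit-learn's `MultiLabelBinarizer` produces the same kind of indicator matrix and may be more convenient on the full dataset; whether the original work used it is not stated.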

Data Analysis :

Let us count the occurrences of each category to find the most frequent ones.
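With a 0-1 indicator frame, the per-category counts are just column sums. The small frame below is a made-up stand-in for the real label matrix.

```python
import pandas as pd

labels = pd.DataFrame({
    "Appliances": [1, 1, 0, 1],
    "Car Audio": [0, 1, 1, 0],
    "Batteries": [0, 0, 1, 0],
})

# Column sums give the number of products tagged with each category.
counts = labels.sum(axis=0).sort_values(ascending=False)
print(counts.head(5))
```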

Fig 6. Top 5 categories existing in the dataset as per their frequency of occurrence for all products.

Let us visualize the category counts with a threshold boundary, marked by a red line, that separates the categories with a count of more than 500 from those with fewer.

Fig 7. Bar plot of categories of product and their counts. The most common category is Appliances.

We will reduce the number of categories by using a threshold value. All categories with an occurrence count of less than 100 will be treated as an ‘other’ category; they are merged into this single category to reduce the complexity of the dataset.
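The merging step could look roughly like the sketch below. The helper name `merge_rare` is hypothetical, and the toy example uses a threshold of 2 instead of 100 so that the effect is visible on four rows.

```python
import pandas as pd

def merge_rare(labels, min_count, other="other"):
    """Collapse all label columns with fewer than min_count positives into one."""
    counts = labels.sum(axis=0)
    rare = counts[counts < min_count].index
    merged = labels.drop(columns=rare)
    # A product gets the 'other' label if it had at least one rare category.
    merged[other] = (labels[rare].sum(axis=1) > 0).astype(int)
    return merged

labels = pd.DataFrame({
    "Appliances": [1, 1, 0, 1],
    "Car Audio": [0, 1, 1, 0],
    "Batteries": [0, 0, 1, 0],
})

merged = merge_rare(labels, min_count=2)
print(merged)
```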

Now our data frame is of dimension (51646, 271), so our number of categories has been reduced by a significant number.

We also run other analyses, such as plotting the number of products against the number of categories assigned to each. Most products have 3 categories assigned by the manual labelers.

Fig 8. There are around 22000 products with 3 categories assigned. This plot basically shows the distribution of categories.
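The distribution behind such a plot can be computed from row sums of the indicator frame. Again, the frame below is a toy stand-in.

```python
import pandas as pd

labels = pd.DataFrame({
    "Appliances": [1, 1, 0, 1],
    "Car Audio": [0, 1, 1, 0],
    "Batteries": [0, 0, 1, 0],
})

# Row sums give the number of categories per product;
# value_counts then tells how many products have 1, 2, 3, ... categories.
per_product = labels.sum(axis=1)
distribution = per_product.value_counts().sort_index()
print(distribution)
```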

We also analyze the product descriptions, for example how long they typically are. A boxplot is used for this, showing that most descriptions have a length of around 150.

Fig 9. There are few outliers having description lengths greater than 300.
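A boxplot conventionally flags points beyond 1.5 times the interquartile range as outliers, and the same check can be done numerically. The sketch assumes length is measured in characters, which the text does not state explicitly, and uses made-up descriptions.

```python
import pandas as pd

# Toy descriptions with lengths 10, 12, 11, 13 and 100 characters.
descriptions = pd.Series(["x" * n for n in (10, 12, 11, 13, 100)])

lengths = descriptions.str.len()
q1, q3 = lengths.quantile(0.25), lengths.quantile(0.75)

# Upper whisker limit under the usual 1.5 * IQR convention.
upper = q3 + 1.5 * (q3 - q1)
outliers = lengths[lengths > upper]
print(outliers.tolist())
```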

A word cloud is used to better understand the nature of the product descriptions; it shows the most common words occurring in them.

Fig 10. Most occurring words in the descriptions of products. The most occurring one is ‘Compatible’.

Data Cleaning :

It is important to clean the product names and descriptions using NLP techniques like stemming and stopword removal. This reduces the complexity and dimensionality of the data, which helps produce models that overfit less.

These functions are applied to both product names and descriptions and finally, a new column ‘information’ is created which is generated by appending the cleaned product name and cleaned product description.

Fig 11. Data frame after performing the data cleaning operations.
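The cleaning functions are not reproduced here, so the following is only a rough sketch of the steps described above: lowercasing, punctuation removal, stopword filtering, and stemming. The stopword set is a tiny illustrative list and `naive_stem` is a crude suffix stripper standing in for a real stemmer (such as a Porter stemmer); both are assumptions, not the original implementation.

```python
import re

# Tiny illustrative stopword list; a real run would use a full list.
STOPWORDS = {"a", "an", "and", "for", "from", "in", "of", "our", "the", "to", "with"}

def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def clean_text(text):
    # Lowercase, keep alphanumeric tokens only, drop stopwords, then stem.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(naive_stem(t) for t in tokens if t not in STOPWORDS)

name = "Metra - Radio Dash Multikit for Select GM Vehicles - Black"
description = ("From our expanded online assortment; compatible with "
               "select GM vehicles; plastic material")

# 'information' is the cleaned name followed by the cleaned description.
information = clean_text(name) + " " + clean_text(description)
print(information)
```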

Now the column ‘information’ will act as the independent feature and the category classes as the dependent features when we fit them to a machine learning model.

Machine Learning Pipeline :

Fig 12. The machine learning pipeline for the category prediction task.

To solve multi-label problems, we mainly have two approaches:

  1. Binary classification: This strategy divides the problem into several independent binary classification tasks, one per label. It resembles the one-vs-rest method, but each classifier deals with a single label and the labels are not mutually exclusive, so any number of classifiers can predict 1 for the same product. The approach treats the labels as independent of one another.
  2. Multi-class classification: Each distinct combination of labels is treated as a single class of one big multi-class classifier, an approach called label powerset. For instance, having the targets A, B, and C, with 0 or 1 as outputs, the powerset transformation treats A B C -> [0 1 0] as one class, while the binary classification transformation treats it as three separate targets, A B C -> [0] [1] [0].
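The difference between the two target transformations can be seen on a small made-up label matrix. This sketch only builds the targets each approach would train on; it does not fit any model.

```python
# Multi-hot targets for four products over the labels A, B and C.
label_names = ["A", "B", "C"]
Y = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
]

# Binary classification approach: one independent 0/1 target vector per label.
binary_targets = {
    label: [row[j] for row in Y] for j, label in enumerate(label_names)
}

# Label powerset approach: each distinct label combination becomes one class id.
combo_to_class = {}
powerset_targets = []
for row in Y:
    key = tuple(row)
    combo_to_class.setdefault(key, len(combo_to_class))
    powerset_targets.append(combo_to_class[key])

print(binary_targets)
print(powerset_targets)
```

With many labels the number of distinct combinations can grow quickly, which is one reason to reduce the number of categories before trying the powerset approach.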

We will work on the machine learning modeling in the next part of this blog. You can get the code for this work here: Code



Prakhar Gurawa


Data Scientist | Learner | Caricaturist | Omnivorous | DC Fanboy