ML and NLP for E-Commerce

Machine Learning and NLP for E-Commerce product description personalisation

2021 marks the second year of The Innovation Sandbox and will see us embark on several new projects, as well as some existing projects, with new students, new clients and new partnerships.  

The Innovation Sandbox is a collaborative initiative at the confluence of industry and academia, enabling both academic research on real-world datasets and the transfer of cutting-edge machine learning research into industry applications.

Project Overview:
What exactly are you researching? Why is it exciting?

According to Salsify, a leading market surveyor, more than 87% of online shoppers place a high value on product descriptions as a key factor when making a purchase decision. Additionally, it found that product descriptions were considered more important than product images and product reviews. (Source)

Currently, the problem with product descriptions on e-commerce platforms is that there is only one for a given product. Although the application of machine learning and big data has brought huge advances in personalization over the years, these advances have not reached product descriptions.

By employing information about a user, this project aims to apply Natural Language Processing (NLP) and Deep Learning techniques to personalize product descriptions, ultimately providing users with more relevant information and delivering a better overall user experience.

Natural Language Processing is the branch of Artificial Intelligence and Machine Learning that aims to allow machines to understand human language. Within NLP, this project will use text generation, in which structured data is converted into human language, to generate these personalised descriptions.

The Specific Problem

Different users will look for different pieces of information in products. When purchasing a laptop, for instance, someone interested in computers might be more interested in the technical details, such as the screen specification or what the processor is like. A businessperson, on the other hand, might be more interested in the ease of use of the laptop and its interface.  

The aim of this project is to personalise product descriptions on an e-commerce platform according to the type of user that is looking at the product at that specific moment, showing them what is relevant and omitting some of the information that is not relevant to them.  
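As a rough illustration of the idea, the sketch below filters a product's attributes by user type using a simple lookup table. It is not the deep learning approach the project plans to use; the product record, the segment names and the `personalise` function are all hypothetical and exist only to show what "showing what is relevant and omitting the rest" could look like.

```python
# Hypothetical product record, for illustration only.
PRODUCT = {
    "name": "Example laptop",
    "attributes": {
        "processor": "8-core CPU",
        "screen": "14-inch, 2560x1600 display",
        "ease_of_use": "simple setup and an intuitive interface",
        "battery": "up to 12 hours of battery life",
    },
}

# Assumed user segments and the attributes each one cares about, most relevant first.
SEGMENT_PREFERENCES = {
    "tech_enthusiast": ["processor", "screen"],
    "business": ["ease_of_use", "battery"],
}


def personalise(product, segment, max_attributes=2):
    """Return a short description containing only attributes relevant to the segment."""
    attributes = product["attributes"]
    # Unknown segments fall back to the product's own attribute order.
    wanted = SEGMENT_PREFERENCES.get(segment, list(attributes))
    chosen = [attributes[key] for key in wanted if key in attributes][:max_attributes]
    return f"{product['name']}: " + ", ".join(chosen) + "."


print(personalise(PRODUCT, "tech_enthusiast"))
print(personalise(PRODUCT, "business"))
```

A learned model would replace the hand-written lookup table, but the input (a product plus a user type) and the output (a shorter, more relevant description) would keep the same shape.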

Project Challenges

The project will present several different challenges that we see from our initial observations. 

First of all, vast amounts of data will be fed into the deep learning models, and we expect this to demand a lot of computing power, since these models can take hours or days to train. This is why, initially, we will use only a sample of the data to test and validate the model.

Additionally, data will represent a significant challenge of the project.  

Previous research on the topic of product description personalisation has been conducted by Alibaba, the Chinese e-commerce giant. Alibaba, as an e-commerce platform itself, had vast amounts of data available to develop their model, which proved to be essential. For this project, we will try to gain as much information about users as we can, in an anonymous fashion, and infer what we can from it.

The way this project aims to solve this challenge, which is quite a popular approach among e-commerce platforms, is to segment users into different groups, such as users interested in music, businesspersons or users interested in sports, and then personalize the product descriptions according to the user's group.

Another data challenge that we foresee is that, although we will use vast volumes of public data for the project, we will not have access to private information such as previous purchases (except those that customers have reviewed), browsing history, demographics and customers’ background.

To solve this issue, we aim to build a clustering model with the information that is available and extract user categories, which are needed as an input for the NLP model. This clustering model will be based on the behaviour of the user (the subject of a review, what kind of purchase they made, etc.), which will ultimately be translated into a data point.

By cross-correlating some of these product review IDs and behaviours, we will be able to work with data in a completely anonymous fashion, ultimately making a prediction and tailoring the product description.
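To make the clustering idea more concrete, here is a minimal sketch under several assumptions: each review is reduced to a hypothetical (reviewer ID, product category) pair, each reviewer becomes a data point holding the share of their reviews per category, and the points are grouped with a plain k-means written from scratch. The review data, the category list and the choice of k-means are illustrative only, and hashing the reviewer ID is a simple stand-in rather than a complete anonymisation scheme.

```python
import hashlib
import random
from collections import Counter, defaultdict

# Hypothetical (reviewer ID, product category) pairs taken from public reviews.
REVIEWS = [
    ("r1", "music"), ("r1", "music"), ("r1", "music"), ("r1", "sports"),
    ("r2", "music"), ("r2", "music"),
    ("r3", "sports"), ("r3", "sports"), ("r3", "sports"),
    ("r4", "sports"), ("r4", "sports"), ("r4", "sports"), ("r4", "music"),
]
CATEGORIES = ["music", "sports"]


def pseudonymise(reviewer_id):
    """Replace a reviewer ID with a short hash (illustrative stand-in only)."""
    return hashlib.sha256(reviewer_id.encode("utf-8")).hexdigest()[:12]


def build_data_points(reviews, categories):
    """Turn each reviewer's history into the share of their reviews per category."""
    counts = defaultdict(Counter)
    for reviewer_id, category in reviews:
        counts[pseudonymise(reviewer_id)][category] += 1
    points = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        points[key] = [counter[c] / total for c in categories]
    return points


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, k, iterations=20, seed=0):
    """Plain k-means: assign each vector to its nearest centroid, then re-average."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        assignments = [
            min(range(k), key=lambda j: squared_distance(v, centroids[j]))
            for v in vectors
        ]
        for j in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignments


points = build_data_points(REVIEWS, CATEGORIES)
labels = k_means(list(points.values()), k=2)
segments = dict(zip(points, labels))  # pseudonymised reviewer -> cluster label
print(segments)
```

The cluster labels themselves carry no meaning; a later step would still have to interpret each cluster (for example as "mostly reviews music products") before it could serve as a user category for the NLP model.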

However, as the project has only just started, we are still in a conceptual phase, but we aim to start testing the data that is available soon.

Data Sources 

Part of the project is going to be analysing some currently existing datasets and trying to understand what will be useful and what will not be. This makes the volume of data available for analysis quite big. 

Alibaba’s previous research contained approximately 2 million data items that were applied to the model. Our model will be trained using a large publicly available dataset from Amazon consisting of approximately 18 million products and 250 million reviews, each attached to a unique reviewer ID. This will pose certain challenges in terms of processing the data and training the models.

Initially, to reduce the compute requirements, we will not use all the data. Instead, we will take some samples of the data to test our hypotheses and intuitions about this problem and observe how the models perform on them.
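One common way to draw such a sample from a dataset too large to hold in memory is reservoir sampling, sketched below. This is an assumption about how the sampling could be done, not a description of the project's actual pipeline; the `reservoir_sample` function and the generated review identifiers are hypothetical.

```python
import random


def reservoir_sample(records, sample_size, seed=0):
    """Draw a uniform random sample from a stream without loading it all into memory."""
    rng = random.Random(seed)
    sample = []
    for index, record in enumerate(records):
        if index < sample_size:
            # Fill the reservoir with the first sample_size records.
            sample.append(record)
        else:
            # Keep the new record with probability sample_size / (index + 1).
            slot = rng.randint(0, index)
            if slot < sample_size:
                sample[slot] = record
    return sample


# Hypothetical stand-in for a very large review file read one record at a time.
reviews = (f"review-{i}" for i in range(100_000))
subset = reservoir_sample(reviews, sample_size=1_000)
print(len(subset))
```

Because the stream is read only once and only `sample_size` records are kept at any time, the same approach would apply whether the source holds thousands or hundreds of millions of reviews.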

For the success of the project, we will try to use as much of that information as possible, on top of other sources such as Wikipedia and Wikidata, amongst others.

Future Obstacles

The biggest obstacles that we foresee are limited access to information about users and the computing issues that come with working with such volumes of data.

Because we will be using public and anonymised data, we can only look at reviews and try to infer information about the users. This data constraint leads us to think that clustering will be a bit of a challenge in this project.

In addition, once we are able to find a suitable model, the computing power issues will arise, so we need to be aware of this and find a solution.

Project Ethics – Handling Users’ Data

Handling users’ data appropriately should be one of the most important objectives for any e-commerce platform. Being transparent about the data collected about the user is fundamental for a brand to build trustworthy relationships, attract new customers and keep them engaged so that they come back to the platform.

For this project, we will not collect any more information than is already publicly available. As mentioned above, we will have to create a solution to overcome this lack of data without compromising the privacy of the user. This is a very delicate issue as it might result in not building any model at all.

On the other hand, there is the possibility of a privacy trade-off between the e-commerce platform and the user in order to deliver them the personalised experience they are looking for.

Research Impact & Next Steps

From a customer point of view, the biggest impact is that they will see more relevant information when reading product descriptions. This is crucial, as they will not have to “waste” time seeing irrelevant information. This will keep them more informed, which in turn improves the overall customer experience.

From an ecommerce point of view, customers are being kept more engaged, which encourages them to come back to the platform. Additionally, there is also an increased likelihood of customers making a purchase. 

More generally, optimizing messages so that you’re showing the reader only information that’s relevant to them can be a hugely important and powerful area of development, not only in the retail industry but in any industry, such as healthcare and law, for instance.

Project Timeline 

At the moment we are still in the planning phase, and the project will kick off shortly. 

The first two months are going to be focused on collecting all the data, bringing it in from all of the sources, and analysing current datasets to see what information will be useful.

After the first two months, there’ll be three months of developing the model and the system that learns from this data, giving it the ability not only to learn from users and how to personalize information, but also to learn from an external source, such as Wikipedia, so that it can provide more context for a product.

What to expect next? 

Collecting the data is the most important priority right now and that’s one of the main tasks we are working on. Working out how to collect this information about users in a fully anonymized fashion and segment them into different groups, pulling information from external sources like Wikipedia into a big data set, and then allowing the model to learn directly from that will take most of our time in the coming months.

Do you want to find out more?