The key idea is to use the opinions and behaviors of users to suggest and personalize relevant and interesting content for them.
E.g., let's say there are two users: U1 with a history of buying Samsung products, and U2 with a history of buying Apple products. When they type "phone" into search, U1 should see Samsung phones at the top of the results and U2 should see iPhones.
What we want is to get user embeddings and the product embeddings of the products in the search results, work out which product embeddings are closest to a particular user, and then boost each product's popularity score based on its similarity score.
What do I mean by embeddings? We put the product/user data through a model that projects it into an N-dimensional space, so a product is represented by an N-dimensional vector: the coordinates of that point in the space. The model is trained such that two products which are similar to each other end up closer together; the embeddings of Nike shoes will be closer to those of Adidas shoes than to those of cell phones.
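As a toy illustration (the 3-dimensional vectors below are made up purely for this example; real embeddings typically have tens to hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-dimensional embeddings, purely for illustration.
nike_shoes   = np.array([0.9, 0.2, 0.1])
adidas_shoes = np.array([0.8, 0.3, 0.1])
cell_phone   = np.array([0.1, 0.9, 0.5])

print(cosine_similarity(nike_shoes, adidas_shoes))  # high: similar products
print(cosine_similarity(nike_shoes, cell_phone))    # low: dissimilar products
```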
- The data used here is: Amazon Product Data
- Here’s an example of what a JSON entry for a single product looks like (we’re interested in the `related` field); see the sketch after this list.
- Relations between different products are extracted in the following form (also sketched below).
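The original examples aren't reproduced here; below is a representative entry with hypothetical ASINs and values (field names follow the public Amazon Product Data dumps), plus the relations extracted from it as (product, product, relationship) triples:

```python
# A representative Amazon Product Data entry (hypothetical ASINs and values).
product = {
    "asin": "B000XXXXX1",
    "title": "Example Running Shoe",
    "price": 59.99,
    "brand": "Nike",
    "categories": [["Sports & Outdoors", "Shoes", "Running", "Road Running"]],
    "related": {
        "also_bought": ["B000XXXXX2", "B000XXXXX3"],
        "also_viewed": ["B000XXXXX4"],
        "bought_together": ["B000XXXXX5"],
    },
}

# Relations extracted from the `related` field as
# (product, product, relationship) triples:
relations = [
    ("B000XXXXX1", "B000XXXXX5", "bought_together"),
    ("B000XXXXX1", "B000XXXXX2", "also_bought"),
    ("B000XXXXX1", "B000XXXXX3", "also_bought"),
    ("B000XXXXX1", "B000XXXXX4", "also_viewed"),
]
```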
Converting relationships into a single score: weights are assigned based on relationship type (bought together = 1.2, also bought = 1.0, also viewed = 0.5) and summed across all unique product pairs.
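A minimal sketch of this scoring step, reusing the `relations` triples from above:

```python
from collections import defaultdict

# Relationship weights from the write-up above.
WEIGHTS = {"bought_together": 1.2, "also_bought": 1.0, "also_viewed": 0.5}

def score_pairs(relations):
    """Sum relationship weights across all unique product pairs."""
    scores = defaultdict(float)
    for a, b, rel in relations:
        pair = tuple(sorted((a, b)))  # (a, b) and (b, a) count as the same pair
        scores[pair] += WEIGHTS[rel]
    return scores
```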
To handle cold start for products with no user interactions, we create edges between products with the same l4-category (weight 0.2) or the same brand (weight 0.1).
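A sketch of the cold-start edges; the `l4_category` helper is hypothetical, assuming the l4-category is the fourth level of the category path:

```python
def l4_category(product):
    """Hypothetical helper: take the 4th level of the category path as the l4-category."""
    path = product["categories"][0]
    return path[3] if len(path) >= 4 else None

def add_cold_start_edges(products, scores):
    """Weak edges between unconnected products sharing an l4-category (0.2) or brand (0.1)."""
    for i, p in enumerate(products):
        for q in products[i + 1:]:
            pair = tuple(sorted((p["asin"], q["asin"])))
            if pair in scores:
                continue  # already connected through user interactions
            if l4_category(p) and l4_category(p) == l4_category(q):
                scores[pair] += 0.2
            if p.get("brand") and p.get("brand") == q.get("brand"):
                scores[pair] += 0.1
    return scores
```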
Create a graph (networkx), remove duplicate edges, add negative samples, and create a train-validation split.
4.1. Create an adjacency matrix.
4.2. Convert the adjacency matrix into a transition matrix, where each row sums to 1.0. The transition matrix holds the probability of each vertex transitioning to the other vertices.
4.3. Convert the transition matrix into dictionary form for lookup: each key is a node, and the value is another dictionary of the adjacent nodes and their transition probabilities.
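A sketch of steps 4.1-4.3 (deduplication, negative sampling, and the train-val split are omitted for brevity):

```python
import networkx as nx
import numpy as np

def build_transition_dict(scores):
    """Pair scores -> weighted graph -> adjacency matrix -> transition dict."""
    g = nx.Graph()
    for (a, b), w in scores.items():
        g.add_edge(a, b, weight=w)

    nodes = list(g.nodes)
    adj = nx.to_numpy_array(g, nodelist=nodes, weight="weight")  # 4.1

    # 4.2: row-normalize so each row sums to 1.0 (transition probabilities).
    trans = adj / adj.sum(axis=1, keepdims=True)

    # 4.3: dictionary form for lookup: {node: {neighbor: probability}}.
    return {
        nodes[i]: {nodes[j]: p for j, p in enumerate(trans[i]) if p > 0}
        for i in range(len(nodes))
    }
```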
Create random-walk sequences from the formed graph.
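A sketch of weighted random walks over the transition dictionary; the walk length and walks per node are illustrative choices:

```python
import random

def random_walk(trans_dict, start, length=10):
    """One weighted random walk, stepping according to transition probabilities."""
    walk = [start]
    for _ in range(length - 1):
        neighbors = trans_dict.get(walk[-1])
        if not neighbors:
            break
        nxt = random.choices(list(neighbors), weights=list(neighbors.values()))[0]
        walk.append(nxt)
    return walk

def generate_walks(trans_dict, walks_per_node=10, length=10):
    """Multiple walks starting from every node in the graph."""
    return [
        random_walk(trans_dict, node, length)
        for node in trans_dict
        for _ in range(walks_per_node)
    ]
```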
Product embeddings are then learned via representation learning (i.e., word2vec skip-gram, via gensim), doing away with the need for labels.
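With gensim this amounts to feeding the walks to `Word2Vec` in skip-gram mode; the hyperparameter values below are illustrative, not necessarily the ones used here:

```python
from gensim.models import Word2Vec

# Each random-walk sequence of product ids is treated as a "sentence".
model = Word2Vec(
    sentences=walks,   # output of generate_walks() above
    vector_size=128,   # embedding dimension (illustrative)
    window=5,
    sg=1,              # skip-gram
    negative=5,        # negative sampling
    min_count=1,
    workers=4,
)

vector = model.wv["B000XXXXX1"]  # embedding of a single product
```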
For a given user U1 who interacted with products P1, P2, …, Pn, we calculate the user embedding by aggregating the product embeddings of P1, P2, …, Pn.
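The write-up doesn't pin down the aggregation function; a simple mean over the interacted products' embeddings is one sketch:

```python
import numpy as np

def user_embedding(model, interacted_product_ids):
    """Mean-pool the embeddings of the products a user interacted with."""
    vectors = [model.wv[p] for p in interacted_product_ids if p in model.wv]
    return np.mean(vectors, axis=0)
```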
Now that we have the user embedding and some search results (product ids with popularity scores), we update the popularity score of products whose cosine similarity with the user embedding is high, giving them higher preference based on user interaction.
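A sketch of this boosting step; the similarity threshold and boost factor are illustrative assumptions:

```python
import numpy as np

def rerank(search_results, user_vec, model, threshold=0.5, boost=1.5):
    """Boost the popularity score of results whose embedding is close to the user's.

    search_results: list of (product_id, popularity_score) tuples.
    """
    boosted = []
    for product_id, popularity in search_results:
        if product_id in model.wv:
            vec = model.wv[product_id]
            sim = np.dot(user_vec, vec) / (np.linalg.norm(user_vec) * np.linalg.norm(vec))
            if sim > threshold:
                popularity *= boost
        boosted.append((product_id, popularity))
    return sorted(boosted, key=lambda x: x[1], reverse=True)
```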
- We get logs of user-interaction events, such as Add to Cart, Product View, Transaction, etc., each with a product id and user id.
- Try adding more side information (SI): description, image, price, etc.
- Try a directed graph: product-pair relationships can be asymmetric (people who buy phones would also buy a phone case, but not vice versa).
- Evaluation of Product Embeddings.
- Evaluation of Personalization results.
Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba (Wang et al., KDD 2018)
Overall
- Use the product-pairs and associated relationships to create a graph
- Generate sequences from the graph (via random walk)
- Learn product embeddings based on the sequences (via word2vec)
- Recommend products based on embedding similarity (e.g., cosine similarity, dot product)