Identifying New Product Groupings in E-Commerce
This is a post by Padmini Jaikumar, engineer from the Organic Search Team at Bloomreach.
It is based on work by Apurva Gupta, Antariksh Bothale, Soubhik Bhattacharya, Warren Mar, Mohammad Salim Ahmed, Ailian Gan, Ramkumar Rajendran, Prateek Gupta and Charlie Luo.
Despite online shopping’s increasing popularity, finding products that you’re interested in on an e-commerce website can be a tiresome experience. Product groupings exactly matching a user’s intent may not exist; search results can be limited and category pages may be too broad.
This blog post discusses algorithms developed at Bloomreach that automatically create new product groupings to better aid users in their discovery process. Our algorithms use a combination of site content and user behavior to identify new, interesting product groupings.
In fact, the algorithms have been used to identify thousands of new groupings for merchants spanning many verticals. These new groupings have led to significant revenue impact for these merchants, leading us to conclude that these new groupings have had a measurable impact on improving users’ product discovery process.
Problem Statement ⚡
Quickly finding products of interest on an e-commerce site can be a tedious process. There usually aren’t pages that display product groupings that exactly match the user’s intent. Category pages containing the products the user is interested in are usually pretty broad and have limited navigation facets for filtering. Searching on the website can return limited or irrelevant results.
Fig.1 shows such an example, live from a merchant’s site. In this case, a user is interested in finding cream-colored men’s sweaters. There is no category page exactly matching the user’s intent. Searching on the website using site search for “men’s cream sweaters,” returns only one product (Fig.1a), while there are clearly many more relevant products in the retailer’s inventory (Fig.1b). If a user, dissatisfied with this search result, broadened the search query to “men’s sweaters,” there would be 151 products returned, with the only filtering options being “Price,” Customer Rating” and “Best Selling.”
Fig 1. The image on the left shows the search results for “men’s cream sweaters”. The image on the right shows products on the site that fit the description, but were not returned in the search results.
Unfortunately, today this experience may be more the rule than the exception. From the merchant’s standpoint, however, this is not an easy problem to fix. With the huge flux in online inventory, thousands of items are being added and removed every day. These items may be insufficiently or incorrectly tagged by suppliers. Search in the presence of such noisy or missing attributes leads to poor quality, as seen in Fig.1. Products also go in and out of season and fashion, and in and out of stock. Finally, customers express their intent in a myriad of different ways.
As the number of different intents is practically infinite, manually creating the right content for every intent of every user is next to impossible. To overcome these problems, we have developed novel algorithms that automatically identify product groupings, thereby improving the user navigation experience. These algorithms are discussed in detail in the subsequent sections.
Creating New Product Groupings
At Bloomreach, we have developed algorithms to automatically create new focused, user-friendly product groupings to aid navigation. These groupings are refreshed every day to keep up with the flux in online inventory, taking the pain out of manually curating such pages.
Our algorithm for creating these new groupings has the following novel characteristics:
Understanding Synonyms - We determine contextual synonym correlations and use this to tag a product with the different ways users describe it. This enables products to be part of groupings they would not have been in the past, based on the attributes provided by the merchant. We determine these correlations by analyzing the distribution of the products viewed based on particular user queries.
Content Based Clustering- We have developed a rich product-attribute dictionary so that pages can be tagged more richly. We rank attributes in this dictionary based on user demand; and we cluster pages with popular attributes to create new groupings.
User Behavior Based Clustering - We determine sets of products that are often bought together by analyzing co-bought products from user buying patterns. We do this by clustering user sessions and by extracting maximum cliques in a product association graph.
Fig. 2 shows examples of product groupings automatically created by our algorithms.
Fig 2. The image on the left shows the grouping created for “Prom sleeveless party dress”. The grouping on the right was created for “Decorative lumbar pillow cover”.
These algorithms have been used to identify thousands of new product groupings for merchants spanning many verticals, which has had significant revenue impact for these merchants. These groupings increase revenue by improving users’ experience in different parts of their buying journey. A grouping like “Blue lace dress,” helps users browse a gallery of similar products to easily compare them and perhaps buy one. Other groupings of related products, like “Canon DSLR camera accessories,” help someone starting out on the buying process quickly access the range of accessories to complement a product they might have bought already.
These groupings are linked from top-level category pages so users can access popular, specific combinations easily. They can also be linked from a specific product page so users can quickly access a gallery of similar products. These pages are also excellent landing pages for targeted campaigns or marketing emails.
The subsequent sections discuss specific aspects of our algorithms in greater detail.
Users describe products in a myriad of different ways. Fig. 3 illustrates this point, where 500 respondents described the fit of the dress in the figure in 148 unique ways.
Fig 3. Results of describing the fit of the dress shown in the figure.
Our goal is to build contextual intent synonym mapping, so that we can automatically tag a product with the different ways users describe it. This allows the product to be a member of new groupings that it could not have been a part of in the past, based on the limited attribute set provided by the merchant. This helps users view synonymous products in the same grouping without having to navigate or search again.
We built a synonym dictionary as opposed to using resources like WordNet, since filtering based on context was hard. For example, we wanted to learn that “heels” is a synonym of “pumps” in the context of shoes, but not in the context of electrical equipment.
User queries for which similar sets of products were viewed are considered synonymous queries. From synonymous queries, we extract phrases that have been surrounded by the same set of words across many queries. These phrases are then considered contextual, intent-based synonyms.
Our algorithm is summarized below:
for each distinct query q: v: set of products viewed c: number of times each product was viewed p(q): distribution over v by normalizing using sum of c for q: QuerySet for q1: QuerySet if substrings(q,q1) || plurals(q,q1) q is not a synonym of q1 else if JSD(p(q), p(q1)) < 0.1 q is synonym of q1 else q is not a synonym of q1
Here JSD is the Jensen-Shannon divergence which is a measure of the similarity between two probability distributions (Eq.1). Small values of the Jensen-Shannon divergence indicate that the two distributions are very similar, while large values indicate that they are dissimilar. The Jensen-Shannon divergence is a symmetric and smoothed version of the Kullback-Liebler divergence (represented as D( || ) below), which is a non-symmetric measure of the difference between two probability distributions.
Equation 1: Jensen-Shannon divergence
The Kullback-Liebler divergence is the expectation of the logarithmic difference between the probabilities P and Q, which for discrete probability distributions is computed as:
Equation 2: Kullback-Liebler divergence.
With the algorithm above, we identify synonymous queries such as:
- toddler holiday pajamas → kids christmas pajamas
- winter coats → snow jackets, outerwear, parka
From this query mapping, we then identify unigram synonyms. For this we look for words that appear in the same context, meaning terms that have been surrounded by the same words many times. This helps us identify synonyms like:
- toddler → kids, children, babies
- pajamas → onesies, sleepwear, nightgown, pjs
There are some interesting benefits we receive based on our synonym-extraction algorithm:
- We automatically capture new terms as synonyms, for instance, “gladiator sandals” were identified as synonyms for “strappy upper flats.”
- We identify common misspelling of words.
- We can understand brands better – for example, we learned that “mudd” makes junior clothing, since “mudd clothes” had a product distribution similar to “junior clothes.”
These algorithms have helped us build a rich synonym dictionary. Using this dictionary, we enhance the attributes supplied for a product by the merchant with synonyms. As mentioned before, this helps capture the numerous ways users might describe this product and allows the product to be a part of new content groupings.
As far as we are aware, such an approach for synonym extraction has not been tried before at this scale to help retailers provide better navigation.
The next section discusses our algorithms to identify new product groupings.
Creating New Product Groupings
Clustering by Page Content
Say we want to create pages grouping similar products (an example would be “Blue lace dress”), that facilitate easy comparison. There are two challenges here: making sure products are annotated richly and identifying which new groupings to create, since there are many similar product grouping pages we can create, but all of them may not be useful.
For expanding the set of attributes a product is annotated with, we have created a rich-product-attribute dictionary, which elaborates on the set of attributes that can be associated with a product. This dictionary is built by:
- Capturing commonly co-occurring words used to describe the product in the product description corpus, using techniques such as Latent Semantic Analysis .
- Capturing common co-occurring terms used along with the product in user queries.
Additionally, synonym expansion is used to annotate products more richly.
Once products have been annotated with a wider set of attributes, we rank attribute and product combinations based on user demand, which is estimated using search queries on the website. For attribute product combinations that are popular, and have many matching products on the merchant’s site, we create new product groupings clustering these pages.
Clustering Co-Bought Products
This section discusses the process to identify complementary products like “Coffee Supplies,” that might group a “Coffee Maker,” “Coffee,” “Coffee Filter,” “Mugs” etc. We create such product groupings based on user buying behavior.
We define a user session to be the the set of products a user bought over the span of a few days. For each user session, we compute a locality sensitive min hash. A locality sensitive min hash has the property that similar items map to the same “buckets” with high probability. Therefore user sessions with a lot of products in common have a high probability of hashing to the same bucket, while sessions with none or very few products in common will have a low probability. The number of user sessions in a bucket gives an idea of the popularity of that product set among all user sessions.
Subsequently, we compute the most popular product per user-session group, as the pivot product for that group. Once we have the pivot product, we determine frequently bought products from co-bought information. Finally, ensuring that attributes are shared across this set ensures better quality. All thresholds are manually fine-tuned based on performance.
Our algorithm is summarized here:
Identifying frequently bought sets of products
There are some products that are rarely bought alone. For example, a parent buying a children’s coloring book is more likely to buy a couple of different ones as opposed to a single one. Another example is someone looking to buy a sugar jar. Quite often they are also looking for a flour jar, oil dispenser, salt and pepper shakers etc. These products are not similar products and tend to have a tighter correlation than products in the co-bought category. The co-bought clustering algorithm tended to return a broader set of products than these tighter sets, so we tried something different:
- For each pair of products in the user-session data, compute an association score as Eq. 2:
Eq 2: Association score between A and B
NA and NAB are the number of times product A was bought, and products A and B were bought together, respectively. A high association score indicates that the products are often bought together, while a low score indicates that they are not frequently bought together.
- Build an undirected graph linking every two products with an association score higher than a threshold (manually fine-tuned).
- Identify maximum cliques in this graph. A clique is a subset of vertices of an undirected graph, such that there is an edge between every set of vertices. A maximum clique is the largest of all cliques in the graph. Since edges are based on high association scores between products, a clique in the graph represents a set of products bought frequently together.
Using this algorithm, we were able to identify the examples discussed earlier as frequent sets. For clique computation we use the NetworkX Python library implementation of determining maximum cliques .
In this blog post we presented work we have done in identifying new, focused and interesting groupings that aid user navigation. Our algorithms create groupings that account for contextual synonyms and group similar and related products. Our work here serves to show that creating a compelling user experience in terms of ease of navigation, while being a hard problem, can lead to significant benefits.
Scott Deerwester and Susan T. Dumais and George W. Furnas and Thomas K. Landauer and Richard Harshman, Indexing by latent semantic analysis, Journal of the American Society for Information Science, 1990