Solving Search Relevancy Problems of E-Grocery App's Search Engine: Elasticsearch Signal Modeling

This is the second article in a series on my journey to solve the search relevancy problem of an e-grocery app's search engine. It discusses one of the fundamental parts of that work: signal modeling. As I mentioned in the previous article, this series is based on my practical experience of solving a real-world problem at my current company, using knowledge from the book Relevant Search by Doug Turnbull and John Berryman.

Signal modeling in Elasticsearch is the process of creating and using signals to improve the relevance of search results. Signals are pieces of information that can be used to measure the relevance of a document to a query. For example, a signal could be the number of times a word appears in a document, or the proximity of two words in a document.
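To make the two example signals concrete, here is a minimal pure-Python sketch that computes a term count and the distance between two words. Elasticsearch computes its own, more sophisticated versions of these signals internally; the function names and the sample description below are only illustrative assumptions.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_frequency(term, text):
    """Signal 1: how many times the term appears in the text."""
    return tokenize(text).count(term.lower())


def min_distance(term_a, term_b, text):
    """Signal 2: smallest gap (in token positions) between two terms.

    Returns None when either term is missing from the text.
    """
    tokens = tokenize(text)
    positions_a = [i for i, tok in enumerate(tokens) if tok == term_a.lower()]
    positions_b = [i for i, tok in enumerate(tokens) if tok == term_b.lower()]
    if not positions_a or not positions_b:
        return None
    return min(abs(a - b) for a in positions_a for b in positions_b)


# Made-up product description, used only for illustration
description = "Fresh red apple, a crisp apple picked from red soil orchards"
print(term_frequency("apple", description))       # 2
print(min_distance("red", "apple", description))  # 1
```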

Based on the definition of signal modeling, to solve our search relevancy problems, we need to carefully craft the signals to create a ranking function that returns the most relevant documents to our users. In this case, I will share my own experience of doing signal modeling using the use case of an e-grocery app's search engine.

Signal Modeling

Usually, without any understanding of signal modeling, we could just index the source data model's fields and search the query on those fields. However, the problem is that the source data model is usually different from the model that intuitively pops up in the user's mind. Why do we need to care about this intuitive model that exists in the user's mind? Because any of the user's search queries will be based on that model, not our source data model.

Before we do the signal modeling, we might want to try to answer these questions first to understand our users better:

What are the forms of product properties that users care about in an e-grocery platform?
How do users intend to search for the products they want, with those product properties in mind?

What are the forms of product properties that users care about in an e-grocery platform?

An e-grocery platform is just the digital form of a grocery store. A physical store gives us clues, such as signs above the racks and racks grouped by item category, to help us find the items we need. So, in an offline grocery store, we might find an item by following steps like these:

Go to the category group of racks (Frozen Food, Snacks, etc.).
Find some specific part of the rack that groups the items we want (could be grouped by brand, name, variant, size, etc.).
Look at the price.
Put it in our basket if the price is okay for us.

From the offline grocery store illustration above, we can intuitively see that users care about a few things: category, brand, name, variant, size, and price. Here, name is what the product is (apple, grape, etc.), while variant is the specific kind of that product (red apple, green apple, etc.).

We can carry these properties over from the offline grocery store to the e-grocery platform as product properties, because they are what users care about when they go to a grocery store to find the products they want to buy.

So, to answer the question, we can just define the forms of product properties that users care about in an e-grocery platform as:

Category
Name
Brand
Variant
Size (1kg, 500ml, 3pcs, etc.)
Price

Of course, it could be more complex than this, but we can just start with these properties and enrich them later to give a better user experience when searching for products they want.

How do users intend to search for products with product properties in mind?

Based on my experience, users tend to be straightforward in writing search queries. In e-grocery platforms, the top keywords are dominated by single-keyword queries such as:

apple
banana
milk
meat
sausage
etc.

Other than that, there are two- or three-keyword queries that explain the product in a more specific way, or when the product name is described using two terms, such as:

red apple
red onion
long beans

So, to answer the question, as I explained in my previous article, we can formulate the user's search query pattern as:

[Product Name][Brand Name][Variant Name][Category Name]

Sample queries that combine several of these parts, sometimes together with a size, are:

"red apple juice 500ml"
"garlic powder 150g"

Content Indexing

We can now start thinking about the indexing process. Indexing in Elasticsearch is the process of creating a searchable index of documents. An index is a collection of documents that are organized and stored in a way that makes them easy to search.

Let's get into practice. We're going to use the official Elasticsearch Python client because I think it is the easiest way to understand the Elasticsearch API, and Python is such an easy-to-read language.

We can assume that the data from the database will be quite complex and might contain a lot of fields. Do we need to index all of them? It's tempting to just copy every field from the source data and search for the user's query across all of them. But it's not that simple: doing so could lead to a lot of problems, as we'll see in the next sections.

Let's just extract some of those fields based on our understanding of what fields matter in the user's mental model when they're searching for products. So, we can start our content indexing process by doing the extraction process first.

Extraction

# `data` is a single record from the source data
product_document = {
  "name": data["name"],
  "category": data["category"],
  "sub_category": data["sub_category"],
  "variant": data["variant"],
  "sub_variant": data["sub_variant"],
  "brand": data["brand"],
  "size": data["size"],
  "price": data["price"],
}

Extraction is a very simple process. It simply extracts the value of fields from the source data and puts it into our data that will be indexed on Elasticsearch. As you can see, we also have sub_category and sub_variant fields, which turn out to be needed to align with the user's mental model.
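The same extraction step can be wrapped in a small reusable function. This is only a sketch: the function name, the field tuple, and the extra `warehouse_code` field in the sample record are illustrative assumptions, not part of the real source data model.

```python
# Fields we believe match the user's mental model (same fields as the snippet above)
SEARCHABLE_FIELDS = (
    "name", "category", "sub_category", "variant",
    "sub_variant", "brand", "size", "price",
)


def extract_product_document(data):
    """Copy only the whitelisted fields from a source record.

    Fields missing from the record are skipped instead of raising KeyError,
    and any other source fields are ignored.
    """
    return {field: data[field] for field in SEARCHABLE_FIELDS if field in data}


# Hypothetical record with one field we do not want to index
record = {"id": 2, "name": "Red Apples", "brand": "Fresh Nation", "warehouse_code": "W-01"}
print(extract_product_document(record))
# {'name': 'Red Apples', 'brand': 'Fresh Nation'}
```

Keeping the field list in one place makes it easy to add or remove a field later without touching the indexing code.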

To see what the data looks like, let's index this single document and search for it!

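import elasticsearch

# Recent versions (8.x) of the Python client expect a full URL, including the scheme
es = elasticsearch.Elasticsearch("http://localhost:9200")

# Mapping types (doc_type) are gone in recent versions, and the document
# is passed with the `document` argument instead of `body`
es.index(index='products', id=data['id'], document=product_document)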

Let's say one of the records in the source data looks like this:

{
  "id": 2,
  "category": "Fruits",
  "sub_category": "Fresh Ingredients",
  "brand": "Fresh Nation",
  "name": "Red Apples",
  "variant": "Apple",
  "sub_variant": "Red Apple",
  "size": "1 pound",
  "price": "$5.99"
}

And by that, we should be able to find that data by using the query "apple", right? Let's try:

result = es.search(
    index="products",
    query={
        "match": {
            "name": "apple"
        }
    }
)

It will return this kind of response:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 0,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  }
}

Hmm, why is the result empty? Does the match query not work? It does. The problem is that we haven't configured an analyzer that normalizes word forms before the data is indexed. By default, Elasticsearch analyzes text fields with the standard analyzer, which splits the text into words and lowercases them but does not stem them. So the name "Red Apples" is indexed as the terms "red" and "apples", and the query "apple" matches neither of them, because "apples" and "apple" are different terms without stemming. We will create our custom analyzer soon in the Analysis section.
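To make the mismatch concrete, here is a small pure-Python sketch of term matching without stemming. It only mimics word splitting and lowercasing; it is not Elasticsearch's actual analysis code, and the function name is made up for this illustration.

```python
import re


def split_and_lowercase(text):
    """Rough stand-in for analysis without stemming: lowercase and split into words."""
    return re.findall(r"\w+", text.lower())


indexed_terms = split_and_lowercase("Red Apples")
query_terms = split_and_lowercase("apple")

print(indexed_terms)  # ['red', 'apples']
print(query_terms)    # ['apple']

# A match needs at least one query term to be identical to an indexed term
print(any(term in indexed_terms for term in query_terms))  # False
```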

Enrichment

Enrichment is a process of adding more information that can help calculate a document's relevancy score. In this e-grocery case, we could add fields such as:

weekly_order_count
weekly_add_to_cart_count
weekly_click_count

Let's assume that the data is already provided by the Data Team, and we just have to use it to enrich our product document:

product_performance_data = get_product_performance_data_from_data_team_api(data['id'])

product_document['weekly_order_count'] = product_performance_data['weekly_order_count']
product_document['weekly_add_to_cart_count'] = product_performance_data['weekly_add_to_cart_count']
product_document['weekly_click_count'] = product_performance_data['weekly_click_count']

We will use these fields later, maybe in the next article! We will ignore them for now.
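In practice, some products may not have performance data yet. Below is a hedged sketch of the same enrichment step as a function that falls back to zero for missing counters. The function name and the zero default are assumptions on my side; the right default depends on how the Data Team reports missing data.

```python
PERFORMANCE_FIELDS = (
    "weekly_order_count",
    "weekly_add_to_cart_count",
    "weekly_click_count",
)


def enrich_product_document(product_document, performance_data):
    """Return a copy of the document with the weekly counters added.

    Counters missing from performance_data default to 0 (an assumption:
    a new product may simply have no stats yet).
    """
    enriched = dict(product_document)  # do not mutate the caller's document
    for field in PERFORMANCE_FIELDS:
        enriched[field] = int(performance_data.get(field, 0))
    return enriched


print(enrich_product_document({"name": "Red Apples"}, {"weekly_order_count": 12}))
```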

Analysis

To solve the previous problem of searching for "apple" in the Extraction section, we need to do the analysis step. We need to create an analyzer to preprocess the text before indexing it to Elasticsearch. Let's create the analyzers!

settings = {
  "analysis": {
    "filter": {
      "english_stemmer": { #1
          "type": "stemmer",
          "language": "english"
      }
    },
    "analyzer": {
      "e_grocery_text_analyzer": {
          "tokenizer": "standard", #2
          "filter": [ #3
              "lowercase",
              "english_stemmer",
          ]
      },
    },
  },
}

We have created an analyzer called e_grocery_text_analyzer that will be used to analyze any text properties in the document. I will explain all of them one by one.

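#1 english_stemmer filter -- Reduces words to their stem so that different forms of a word match. For instance, "running" and "runs" are both stemmed to "run", and "apples" ends up as the same term as "apple". It is an algorithmic stemmer, so irregular forms such as "ran" are not mapped to "run".
#2 standard tokenizer -- Splits the text into words on word boundaries and drops most punctuation (it does a good job of tokenizing most European languages).
#3 filter -- The list of token filters applied, in order, after tokenizing: the built-in lowercase filter, which simply lowercases all the characters, followed by the english_stemmer filter we defined.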
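The order of the steps matters: the tokenizer runs first, then each filter runs on the resulting tokens. The toy chain below imitates that flow in plain Python. Note that `toy_stem` only strips a trailing "s" and is NOT the real English stemmer; it is a made-up stand-in to show why "Red Apples" and "apple" end up sharing a term.

```python
import re


def tokenize(text):
    """Stand-in for the standard tokenizer: split on non-word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Stand-in for the built-in lowercase filter."""
    return [token.lower() for token in tokens]


def toy_stem(tokens):
    """Toy stemmer: strips a trailing "s" only. NOT the real English stemmer."""
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]


def analyze(text):
    """Apply the chain in the same order as the analyzer: tokenizer, then filters."""
    return toy_stem(lowercase(tokenize(text)))


print(analyze("Red Apples"))  # ['red', 'apple']
print(analyze("apple"))       # ['apple']
```

Because the same chain is applied to both the indexed text and the query text, both sides now produce the term "apple" and the document can match.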

Okay, now we could use the analyzer to analyze our document's properties (we will ignore the properties from the enrichment step for now):

mappings = {
    "properties": {
      "category": {
        "type": "text",
        "analyzer": "e_grocery_text_analyzer", #1
      },
      "sub_category": {
        "type": "text",
        "analyzer": "e_grocery_text_analyzer",
      },
      "brand": {
        "type": "text",
        "analyzer": "e_grocery_text_analyzer",
      },
      "name": {
        "type": "text",
        "analyzer": "e_grocery_text_analyzer",
      },
      "variant": {
        "type": "text",
        "analyzer": "e_grocery_text_analyzer",
      },
      "sub_variant": {
        "type": "text",
        "analyzer": "e_grocery_text_analyzer",
      },
      "size": {
        "type": "text",
        "analyzer": "e_grocery_text_analyzer",
      },
      "price": {
        "type": "float", #2
      },
    }
}

#1 We just need to put the name of our analyzer in the analyzer property of each text field.
#2 The price property is the exception: it uses the float type instead of text, because we might need to sort the search results by price, and sorting works on numeric values rather than analyzed text.

Indexing

And finally, we can create our index:

from elasticsearch import Elasticsearch, helpers


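# Recent versions (8.x) of the Python client expect a full URL, including the scheme
es = Elasticsearch("http://localhost:9200")

# Drop the index that was auto-created by the earlier single-document test,
# if it exists, so it can be recreated with our analyzer and mappings
es.indices.delete(index="products", ignore_unavailable=True)
es.indices.create(index="products", mappings=mappings, settings=settings)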

actions = []

for data in source_data:
    actions.append({
      "_op_type": "create",
      "_index": "products",
      "_id": data["id"],
      "category": data["category"],
      "sub_category": data["sub_category"],
      "brand": data["brand"],
      "name": data["name"],
      "variant": data["variant"],
      "sub_variant": data["sub_variant"],
      "size": data["size"],
      "price": float(data["price"].replace("$", "")),
    })

helpers.bulk(es, actions=actions)

We use the bulk helper to index multiple documents in batches instead of sending them to Elasticsearch one at a time.
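The action-building loop above can also be written as a generator, so the full list of actions never has to sit in memory. This is a sketch under a few assumptions: the function names are mine, and `parse_price` assumes prices always look like "$5.99", optionally with thousands separators.

```python
TEXT_FIELDS = ("category", "sub_category", "brand", "name", "variant", "sub_variant", "size")


def parse_price(raw_price):
    """Turn a price string such as "$5.99" or "$1,299.00" into a float."""
    return float(str(raw_price).replace("$", "").replace(",", "").strip())


def build_bulk_actions(source_data, index_name="products"):
    """Yield one bulk action per source record, lazily."""
    for data in source_data:
        action = {"_op_type": "create", "_index": index_name, "_id": data["id"]}
        action.update({field: data[field] for field in TEXT_FIELDS})
        action["price"] = parse_price(data["price"])
        yield action


sample_record = {
    "id": 2, "category": "Fruits", "sub_category": "Fresh Ingredients",
    "brand": "Fresh Nation", "name": "Red Apples", "variant": "Apple",
    "sub_variant": "Red Apple", "size": "1 pound", "price": "$5.99",
}
print(next(build_bulk_actions([sample_record]))["price"])  # 5.99
```

With a running cluster, the generator can be passed straight to the helper: helpers.bulk(es, build_bulk_actions(source_data)).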

Crafting The Query

When we talk about signal modeling, it's not just about building the properties of a document and indexing them. It's also about crafting the query.

Crafting the query is a part of signal modeling itself. We're going to define how we use the user's query terms to turn on the signals in an e-grocery platform.

We can imagine that we're using a tool like sonar. Documents are the objects we're going to detect with our sonar. Each object has different characteristics. How well we detect the objects doesn't depend on the object characteristics alone, but also on how good our sonar is at detecting object characteristics. The query is just like the sonar beam in this story.

Let's start with a very simple match query:

result = es.search(
    index="products",
    query={
        "match": {
            "name": {
               "query": "red apple",
               "analyzer": "e_grocery_text_analyzer",
            }
        }
    },
)

It's common to start with a very simple approach to solve a problem. In this case, we will try to search with the query: "red apple".

Here is what we got:

1. [2.668399]	Red Apples
2. [2.0531998]	Green Apples
3. [0.61519927]	Red Grapes
4. [0.61519927]	Red Potatoes
5. [0.55260456]	Red Bell Peppers
6. [0.501571]	Locally Grown Red Onions
7. [0.39275682]	Whole Foods Market Organic Red Wine Vinaigrette
8. [0.24881268]	Prego Traditional Italian Style Pasta Sauce, Made with San Marzano Tomatoes, Red Wine, and Herbs

At first glance, there seems to be no problem here. The expected product, "Red Apples", is in the first position, and "Green Apples" in the second position is still relevant because the query is about apples. But the rest of the search results are simply products whose names contain the word "red".

Red Apple Issue

From the user's perspective, it seems wrong to search for "red apple" and get any product that has the color red in its name. Let's fix this counter-intuitive search result by modifying our query:

query = "red apple"

result = es.search(
  index="products",
  query={
    "bool": {
      "must": [
        {
          "match": {
            "variant": {
              "query": query,
              "analyzer": "e_grocery_text_analyzer",
            },
          }
        }
      ],
      "should": [
        {
          "match": {
            "name": {
              "query": query,
              "analyzer": "e_grocery_text_analyzer",
            }
          }
        }
      ]
    }
  },
)

The query is returning this search result:

1. [4.3877363]		Red Apples
2. [3.7725368]		Green Apples

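Now we have a better result! Here, we're using a Boolean query: a document has to match every clause under must to be returned at all, while the clauses under should are optional and only add to the score of the documents that match them. We put the "variant" field under must because it stores information about what kind of product this is.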

...
{
  "id": 2,
  "category": "Fruits",
  "sub_category": "Fresh Ingredients",
  "brand": "Fresh Nation",
  "name": "Red Apples",
  "variant": "Apple", // #1
  "sub_variant": "Red Apple", // #2
  "size": "1 pound",
  "price": "$5.99"
},
...
{
  "id": 5,
  "category": "Canned Goods",
  "sub_category": "Pasta Sauce",
  "brand": "Prego",
  "name": "Prego Traditional Italian Style Pasta Sauce, Made with San Marzano Tomatoes, Red Wine, and Herbs", // #3
  "variant": "Pasta Sauce",
  "sub_variant": "Red Wine Pasta Sauce",
  "size": "24 ounces",
  "price": "$3.99"
},
...

The "variant" field describes "what kind of product is this?". (e.g. Apple, regardless of color, size, shape, or brand.)
The "sub_variant" field is similar to the "variant" field, but it tends to be more specific. It describes "okay, this is an apple, what kind of apple is it?" (e.g. Red Apple.)
The "name" field is less predictable. Marketing teams sometimes need to craft eye-catching names, which can result in very long names, and it's even possible that the name does not mention what the product actually is.
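To see why requiring a match on "variant" removes "Red Grapes" and the other "red" products, here is a toy simulation of the Boolean query in plain Python. The scoring is a simple count of shared terms, not Elasticsearch's real scoring, and the "Grape" variant for "Red Grapes" is my assumption since that record is not shown in this article.

```python
def analyze(text):
    """Toy analysis: lowercase, split on spaces, strip a trailing "s"."""
    return [t[:-1] if t.endswith("s") else t for t in text.lower().split()]


def overlap(query, field_value):
    """Toy score: number of query terms found in the field."""
    field_terms = set(analyze(field_value))
    return sum(1 for term in analyze(query) if term in field_terms)


def bool_search(query, products):
    """The "variant" clause is required; the "name" clause only adds score."""
    results = []
    for product in products:
        must_score = overlap(query, product["variant"])
        if must_score == 0:
            continue  # a failed must clause excludes the document entirely
        should_score = overlap(query, product["name"])
        results.append((must_score + should_score, product["name"]))
    return sorted(results, reverse=True)


products = [
    {"name": "Red Apples", "variant": "Apple"},
    {"name": "Green Apples", "variant": "Apple"},
    {"name": "Red Grapes", "variant": "Grape"},  # assumed variant value
]
print(bool_search("red apple", products))
# [(3, 'Red Apples'), (2, 'Green Apples')]
```

"Red Grapes" shares the term "red" with the query in its name, but that is only an optional clause, so it cannot bring the product back once the required "variant" clause has failed.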

Based on the third explanation above, we could add a fallback in case the "name" field is not helpful. For example, if the product "red apple" has the name "The Most Delicious Apple in the World," we don't want to lose the "red" signal, right? So, let's add it to our query:

result = es.search(
  index="products",
  query={
    "bool": {
      "must": [
        {
          "match": {
            "variant": {
              "query": query,
              "analyzer": "e_grocery_text_analyzer",
            },
          }
        }
      ],
      "should": [
        {
          "match": {
            "name": {
              "query": query,
              "analyzer": "e_grocery_text_analyzer",
            }
          }
        },
        {
          "match": {
            "sub_variant": {
              "query": query,
              "analyzer": "e_grocery_text_analyzer",
            }
          }
        }
      ]
    }
  },
)

The query is returning this result:

1. [6.6497173]		Red Apples
2. [5.5130186]		Green Apples

Nothing has changed but the scores, which makes sense because we added one more matching clause. The score is not a normalized number or a ratio on a scale such as 0-1 or 1-100. In a Boolean query, the scores of all the matching must and should clauses are added together, so the more clauses a document matches, the higher its score.
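Since the clauses in the query above differ only in the field name, the query body can be built by a small helper instead of being written out by hand. This is a sketch; the function name and its default arguments are my own choices, and it produces the same dictionary structure as the query above.

```python
def build_product_query(query_text,
                        must_field="variant",
                        should_fields=("name", "sub_variant"),
                        analyzer="e_grocery_text_analyzer"):
    """Build a bool query: one required match clause plus optional boosting clauses."""

    def match_clause(field):
        return {"match": {field: {"query": query_text, "analyzer": analyzer}}}

    return {
        "bool": {
            "must": [match_clause(must_field)],
            "should": [match_clause(field) for field in should_fields],
        }
    }


query_body = build_product_query("red apple")
print([list(clause["match"]) for clause in query_body["bool"]["should"]])
# [['name'], ['sub_variant']]
```

With a running cluster, it would be used as es.search(index="products", query=build_product_query("red apple")). Adding another optional signal later then only means adding a field name to should_fields.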

Business Perspective

Our users seem to be happy, but our business users are not, because displaying only the most relevant items reduces the chance for users to discover and buy other items that they might be interested in.

Instead, our business users want the items to be ordered like this:

1. [Primary Items]
2. [Secondary Items]
3. [Tertiary Items]

Primary items are the most relevant items (e.g. red apple, green apple).
Secondary items are derivative items (e.g. apple juice, apple snacks).
Tertiary items are recommended items (e.g. best-selling items, most-viewed items).
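As a rough sketch of that ordering, assume each hit has already been labelled with a tier. How to assign the tiers is not covered here, and the "tier" key, the scores, and the sample hits below are all made up for illustration.

```python
TIER_ORDER = {"primary": 0, "secondary": 1, "tertiary": 2}


def order_by_tier(hits):
    """Sort by tier first, then by descending relevance score within a tier."""
    return sorted(hits, key=lambda hit: (TIER_ORDER[hit["tier"]], -hit["score"]))


# Hypothetical, already-labelled hits
hits = [
    {"name": "Apple Juice", "tier": "secondary", "score": 3.1},
    {"name": "Green Apples", "tier": "primary", "score": 5.5},
    {"name": "Best Seller Bananas", "tier": "tertiary", "score": 0.0},
    {"name": "Red Apples", "tier": "primary", "score": 6.6},
]
print([hit["name"] for hit in order_by_tier(hits)])
# ['Red Apples', 'Green Apples', 'Apple Juice', 'Best Seller Bananas']
```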

We could use the data from the Enrichment step to rank the tertiary items. But I think we could do that in the next article!

Conclusion

We've learned how to do Signal Modeling for an e-grocery platform. This is only the beginning, and the most fundamental thing to do before we can optimize our search results to give our users a better search experience when using our search engine.

And I'm sorry if this article is quite technical. This is my very first time fighting a search relevancy problem, and as an engineer, the first thing that pops into my mind when solving any problem is to translate it into code. I promise that in the next article I will try to focus more on the problems at a conceptual level. See you!