@JulesBelveze
Last active January 6, 2022 08:41

ElasticSearch Entities Data Structure

The underlying intent is to let a user narrow down a search to only the relevant documents. The “Apple” case is representative: the way we search today returns many documents about the fruit. With NER (Named Entity Recognition), however, we can extract only the occurrences of “Apple” that refer to the brand. NER also provides additional information about each extracted entity: its category. We currently support 5 categories: organisation, person, event, product, location.

In the short run we want the user to be able to retrieve mentions containing a given entity, optionally restricted to a given category. In the long run we want the user to be able to disambiguate the entity they are searching for. For example, there are several people named “Michael Jackson”: the singer, the soccer player, and so on. Ideally the user should be able to retrieve only the mentions referring to one of them.

The proposed mapping is the following:

{
    "properties": {
        "entities": {
            "properties": {
                "ends": {
                    "index": false,
                    "type": "long"
                },
                "entity_id": {
                    "index": false,
                    "type": "long"
                },
                "labels": {
                    "index": false,
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "reputation": {
                    "index": false,
                    "type": "float"
                },
                "salience": {
                    "index": false,
                    "type": "float"
                },
                "starts": {
                    "index": false,
                    "type": "long"
                },
                "target": {
                    "index": false,
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                }
            }
        },
        "entities_ALL": {
            "type": "text",
            "fields": {
                "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                }
            }
        },
        "entities_ALL_id": {
            "type": "long"
        },
        "entities_EVENT": {
            "type": "text",
            "fields": {
                "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                }
            }
        },
        "entities_EVENT_id": {
            "type": "long"
        },
        "entities_LOC": {
            "type": "text",
            "fields": {
                "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                }
            }
        },
        "entities_LOC_id": {
            "type": "long"
        },
        "entities_ORG": {
            "type": "text",
            "fields": {
                "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                }
            }
        },
        "entities_ORG_id": {
            "type": "long"
        },
        "entities_PERS": {
            "type": "text",
            "fields": {
                "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                }
            }
        },
        "entities_PERS_id": {
            "type": "long"
        },
        "entities_PROD": {
            "type": "text",
            "fields": {
                "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                }
            }
        },
        "entities_PROD_id": {
            "type": "long"
        }
    }
}
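
To make the mapping more concrete, here is a minimal sketch of how the NER output for one mention could be flattened into the `entities_ALL` and `entities_<CATEGORY>` fields. The input format (a list of `(text, category)` pairs) and the function name `flatten_entities` are assumptions made for illustration; only the field names come from the mapping above. The `_id` fields are left out.

```python
from collections import defaultdict

# Category suffixes taken from the field names in the mapping above.
CATEGORIES = ("ORG", "PERS", "EVENT", "PROD", "LOC")


def flatten_entities(entities):
    """Build the entities_ALL / entities_<CATEGORY> fields for one mention.

    `entities` is a list of (surface_text, category) pairs, a hypothetical
    stand-in for whatever the NER step actually returns.
    """
    fields = defaultdict(list)
    for text, category in entities:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        for name in ("entities_ALL", f"entities_{category}"):
            if text not in fields[name]:  # keep each name once per field
                fields[name].append(text)
    return dict(fields)


doc_fields = flatten_entities(
    [("Apple", "ORG"), ("iPhone", "PROD"), ("Apple", "ORG")]
)
print(doc_fields)
# {'entities_ALL': ['Apple', 'iPhone'], 'entities_ORG': ['Apple'], 'entities_PROD': ['iPhone']}
```

Every entity lands in `entities_ALL` and in the field of its own category, which is what makes both the category-agnostic and the category-specific searches possible.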

Such a mapping allows us to perform several kinds of requests:

  1. Retrieve all mentions containing ‘Michael Jackson’. With the following query we retrieve all mentions containing the entity “Michael Jackson”, regardless of its type and of whether it is the soccer player or the singer.
{
    "query": {
        "bool": {
            "must": [
                {
                    "query_string": {
                        "query": "entities_ALL:\"Michael Jackson\""
                    }
                }
            ]
        }
    }
}
  2. Retrieve all mentions containing the person ‘Michael Jackson’. In contrast to the previous query, we can refine the search to return only mentions where the entity is a person.
{
    "query": {
        "bool": {
            "must": [
                {
                    "query_string": {
                        "query": "entities_PERS:\"Michael Jackson\""
                    }
                }
            ]
        }
    }
}
  3. Retrieve all mentions containing the soccer player ‘Michael Jackson’. Last but not least, we can explicitly search for the soccer player through his entity id.
{
    "query": {
        "bool": {
            "must": [
                {
                    "query_string": {
                        "query": "entities_PERS_id: 2"
                    }
                }
            ]
        }
    }
}
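
The three requests above differ only in the field they target, so they can be generated by one small helper. The sketch below is an illustration, not part of the proposal: the name `entity_query` is hypothetical, and it uses `match_phrase` and `term` clauses in place of `query_string`, so that a multi-word name is matched as a phrase and an id is matched exactly.

```python
import json


def entity_query(name=None, category="ALL", entity_id=None):
    """Return a search body for an entity name or, when known, its id.

    `category` is one of ALL, ORG, PERS, EVENT, PROD, LOC (the suffixes
    used in the mapping).
    """
    if entity_id is not None:
        # Disambiguated search: match the numeric id exactly.
        clause = {"term": {f"entities_{category}_id": entity_id}}
    elif name is not None:
        # Phrase match: the words must appear together and in order.
        clause = {"match_phrase": {f"entities_{category}": name}}
    else:
        raise ValueError("either name or entity_id is required")
    return {"query": {"bool": {"must": [clause]}}}


# Any category, persons only, then one specific person by id.
print(json.dumps(entity_query("Michael Jackson"), indent=4))
print(json.dumps(entity_query("Michael Jackson", category="PERS"), indent=4))
print(json.dumps(entity_query(category="PERS", entity_id=2), indent=4))
```

Falling back from the id to the name is a deliberate choice here: when no id is available, the caller can still search by name.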

The last query is not always applicable, because the lack of context sometimes makes it impossible to know for certain which entity is mentioned. For example, in “Michael Jackson is dead.” there is no way to tell whether it is the soccer player or the singer.

Also, organisations with eponymous products, like “Facebook” or “Instagram”, might want to search for mentions about either the organisation itself or the product.

@nguyenvietyen

Here are additional use cases I can imagine:

  • The user wants to monitor Samsung, and that also includes Samsung Corp., Samsung Global, Samsung EMEA. So that means the user wants to search whether a substring matches the entity. This complements the exact match search.
  • How would the aggregation query work? E.g. what are the most named products when Samsung is mentioned as a substring in any entity?

Question as part of the process:

  • Should we make a detailed design for id generation/giving to entities? Or can we do without in the first phase and simply loosely consider them so we don't create too much tech debt?

@JulesBelveze

JulesBelveze commented Jan 5, 2022

> The user wants to monitor Samsung, and that also includes Samsung Corp., Samsung Global, Samsung EMEA. So that means the user wants to search whether a substring matches the entity. This complements the exact match search.

I forgot to mention that one. For such a case one could use a wildcard, which matches any entity token starting with the given prefix:

{
    "query_string": {
        "query": "entities_PERS: Michael*"
    }
}
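
Since the search term will come from user input, it probably needs escaping before being placed in a `query_string` query. The following is a rough sketch under that assumption: `prefix_query` is a hypothetical helper, and `default_field` and `default_operator` are used so that every word of a multi-word input is searched in the same field. Note that a trailing `*` gives a prefix match on tokens; a true substring match would need a leading wildcard as well, which is typically much more expensive.

```python
import re

# Characters with a special meaning in query_string syntax.
# "<" and ">" cannot be escaped, so they are removed instead.
_RESERVED = re.compile(r'([+\-=&|!(){}\[\]^"~*?:\\/])')


def prefix_query(field, user_text):
    """Build a query_string clause matching tokens that start with user_text."""
    cleaned = user_text.replace("<", "").replace(">", "").strip()
    if not cleaned:
        raise ValueError("empty search text")
    escaped = _RESERVED.sub(r"\\\1", cleaned)  # prefix each reserved char with "\"
    return {
        "query_string": {
            "default_field": field,     # apply every word to this field
            "default_operator": "AND",  # all words must match
            "query": f"{escaped}*",     # wildcard applies to the last word
        }
    }


print(prefix_query("entities_ORG", "Samsung"))
```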

> How would the aggregation query work? E.g. what are the most named products when Samsung is mentioned as a substring in any entity?

Aggregation queries are possible as well. One can, for example, run the following:

{
    "query": {
        "bool": {
            "must": [
                {
                    "query_string": {
                        "query": "entities_ALL: Samsung*"
                    }
                }
            ]
        }
    },
    "aggs": {
        "prod-agg": {
            "terms": {
                "field": "entities_PROD.keyword"
            }
        }
    }
}
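
For completeness, here is a small sketch of how such an aggregation request could be built and how its result could be read back. The helper names are hypothetical; the response layout (`aggregations` → aggregation name → `buckets`, each with `key` and `doc_count`) is the standard shape of a `terms` aggregation result. Setting `size` to 0 at the top level is an addition: it skips the matching hits when only the counts are of interest.

```python
def top_products_body(entity_prefix, size=10):
    """Request body: most frequent products among mentions matching the prefix."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "query": {
            "query_string": {
                "default_field": "entities_ALL",
                "query": f"{entity_prefix}*",
            }
        },
        "aggs": {
            "prod-agg": {
                "terms": {"field": "entities_PROD.keyword", "size": size}
            }
        },
    }


def product_counts(response):
    """Extract (product, number of mentions) pairs from a search response."""
    buckets = response["aggregations"]["prod-agg"]["buckets"]
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]


body = top_products_body("Samsung", size=5)
```

The aggregation runs on the `keyword` sub-field, so multi-word product names are counted as a whole instead of word by word.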

> Should we make a detailed design for id generation/giving to entities? Or can we do without in the first phase and simply loosely consider them so we don't create too much tech debt?

I would proceed without the entity ids at first. In the meantime I will start thinking about how to create such a table.

@nguyenvietyen

LGTM. I also agree on proceeding without entity ids. It seems like we will not create too much tech debt by omitting this now.

Next step: could you create a seed database using a small sample of our mentions (e.g. 10K to 100K of them) where NER was applied and stored in that way?

Then we get more real-world exposure to the actual data cases, and can try out the queries and see whether they yield the intended result.

@JulesBelveze

On it.
I guess that the reputation_searchable will become irrelevant, right?

@nguyenvietyen

> I guess that the reputation_searchable will become irrelevant, right?

Long-term: Yeap.
