@JulesBelveze
Last active January 6, 2022 08:41

ElasticSearch Entities Data Structure

The underlying intent here is to let a user narrow their search down to relevant documents only. The “Apple” case is representative: the way we search now also returns a bunch of documents about the fruit. Thanks to NER (Named Entity Recognition), we will be able to keep only the occurrences of “Apple” that refer to the brand. NER also provides one more piece of information about each extracted entity: its category. We currently support 5 categories: organisation, person, event, product, location.

In the short run we want the user to be able to retrieve mentions containing a given entity and, optionally, to restrict them to a given category. In the long run we want the user to be able to disambiguate the entity they are searching for. For example, there are several people named “Michael Jackson”: the singer, the soccer player, … Ideally the user should be able to retrieve only the mentions referring to one of them.

The proposed mapping is the following:

{
    "properties": {
        "entities": {
            "properties": {
                "ends": {
                    "index": false,
                    "type": "long"
                },
                "entity_id": {
                    "index": false,
                    "type": "long"
                },
                "labels": {
                    "index": false,
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "reputation": {
                    "index": false,
                    "type": "float"
                },
                "salience": {
                    "index": false,
                    "type": "float"
                },
                "starts": {
                    "index": false,
                    "type": "long"
                },
                "target": {
                    "index": false,
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                }
            }
        },
        "entities_ALL": {
            "type": "text",
            "fields": {
                "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                }
            }
        },
        "entities_ALL_id": {
            "type": "long"
        },
        "entities_EVENT": {
            "type": "text",
            "fields": {
                "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                }
            }
        },
        "entities_EVENT_id": {
            "type": "long"
        },
        "entities_LOC": {
            "type": "text",
            "fields": {
                "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                }
            }
        },
        "entities_LOC_id": {
            "type": "long"
        },
        "entities_ORG": {
            "type": "text",
            "fields": {
                "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                }
            }
        },
        "entities_ORG_id": {
            "type": "long"
        },
        "entities_PERS": {
            "type": "text",
            "fields": {
                "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                }
            }
        },
        "entities_PERS_id": {
            "type": "long"
        },
        "entities_PROD": {
            "type": "text",
            "fields": {
                "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                }
            }
        },
        "entities_PROD_id": {
            "type": "long"
        }
    }
}
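
The note does not say how the `entities_*` fields get populated, so the following is only a minimal sketch of one way to do it. It assumes a hypothetical NER output shape (a list of dicts with `name`, `category` and an optional `entity_id`) and a hypothetical helper `build_entity_fields`; only the field names and category codes come from the mapping above.

```python
from collections import defaultdict

# Category codes taken from the field names in the mapping above.
CATEGORIES = {"ORG", "PERS", "EVENT", "PROD", "LOC"}


def build_entity_fields(entities):
    """Flatten NER output into the searchable entities_* fields.

    `entities` is an assumed input shape: dicts with "name", "category"
    (one of CATEGORIES) and "entity_id" (int, or None when the entity
    could not be disambiguated).
    """
    fields = defaultdict(list)
    for ent in entities:
        category = ent["category"]
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        # Every entity is indexed twice: under ALL and under its category.
        for scope in ("ALL", category):
            fields[f"entities_{scope}"].append(ent["name"])
            if ent.get("entity_id") is not None:
                fields[f"entities_{scope}_id"].append(ent["entity_id"])
    return dict(fields)


doc_fields = build_entity_fields([
    {"name": "Michael Jackson", "category": "PERS", "entity_id": 2},
    {"name": "Apple", "category": "ORG", "entity_id": None},
])
```

In this sketch an entity without an id still lands in the name fields, so it stays reachable by name even when it cannot be disambiguated.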

Such a mapping allows us to perform a bunch of different requests:

  1. Retrieve all mentions containing ‘Michael Jackson’. With the following query we retrieve all mentions containing the entity “Michael Jackson”, regardless of its type and of whether it is the soccer player or the singer.
{
    "query": {
        "bool": {
            "must": [
                {
                    "query_string": {
                        "query": "entities_ALL:\"Michael Jackson\""
                    }
                }
            ]
        }
    }
}
  2. Retrieve all mentions containing the person ‘Michael Jackson’. In contrast to the previous query, we can refine the search to return only mentions where the entity is a person.
{
    "query": {
        "bool": {
            "must": [
                {
                    "query_string": {
                        "query": "entities_PERS:\"Michael Jackson\""
                    }
                }
            ]
        }
    }
}
  3. Retrieve all mentions containing the soccer player ‘Michael Jackson’. Last but not least, we can explicitly search for the soccer player through his entity id (2 in this example).
{
    "query": {
        "bool": {
            "must": [
                {
                    "query_string": {
                        "query": "entities_PERS_id: 2"
                    }
                }
            ]
        }
    }
}
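
The three requests differ only in which field they target and in whether they match on a name or on an id, so they can be generated by a single function. The sketch below uses a hypothetical helper `entity_query`, which is not part of the proposal. It wraps the name in double quotes so that `query_string` treats it as a phrase; as far as I know, an unquoted multi-word name is split on whitespace and only its first word is matched against the named field.

```python
def entity_query(name=None, category="ALL", entity_id=None):
    """Build a query_string request against the entities_* fields.

    category is "ALL" or one of the codes in the mapping (ORG, PERS, ...).
    When entity_id is given it takes precedence over name.
    """
    field = f"entities_{category}"
    if entity_id is not None:
        clause = f"{field}_id:{entity_id}"
    elif name is not None:
        # Escape backslashes and quotes, then search the name as a phrase.
        escaped = name.replace("\\", "\\\\").replace('"', '\\"')
        clause = f'{field}:"{escaped}"'
    else:
        raise ValueError("need a name or an entity_id")
    return {"query": {"bool": {"must": [{"query_string": {"query": clause}}]}}}


any_type = entity_query(name="Michael Jackson")
person_only = entity_query(name="Michael Jackson", category="PERS")
soccer_player = entity_query(category="PERS", entity_id=2)
```

Each returned dict can be serialised with `json.dumps` and sent as the body of a search request.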

One will not always want to perform the last query, because the lack of context sometimes makes it impossible to know for certain which entity is mentioned. For example, in “Michael Jackson is dead.” there is no way to tell whether it is the soccer player or the singer.

Also, organisations with eponymous products, like “Facebook” or “Instagram”, might want to search separately for mentions of the organisation itself and for mentions of the product.
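
One possible way to separate the two readings is to query the per-category fields. The sketch below relies on the `keyword` sub-fields declared in the mapping and on a `term` query, which is an alternative to the `query_string` form used above and not something the proposal prescribes; `exact_entity_term` is a hypothetical helper.

```python
def exact_entity_term(name, category):
    """Match an entity name exactly within one category.

    Uses the un-analyzed `keyword` sub-field, so the match is
    case-sensitive, and names longer than 256 characters are not
    indexed there (ignore_above in the mapping).
    """
    return {"query": {"term": {f"entities_{category}.keyword": name}}}


facebook_org = exact_entity_term("Facebook", "ORG")    # the company
facebook_prod = exact_entity_term("Facebook", "PROD")  # the product
```

Whether a given mention ends up under `ORG` or `PROD` still depends entirely on the category the NER model assigns to it.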

@nguyenvietyen

I guess that the reputation_searchable will become irrelevant, right?

Long-term: Yeap.
