Currently we are storing our tag information in JSON files in S3. This is a good idea because we get versioning and replayability out of the box.
This is what the current structure looks like:
ci
├── amenity
├── geo
└── hotels
We are using CloudSearch mainly for autocompletion and to get the IDs linked to the suggestions. To make this work for multiple markets and multiple languages per market (and even synonyms), we only need three fields.
The first field we need is a context that defines the market and the language:

market:language

E.g.: uk:en
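As a sketch, a document for such a suggestion could look like the following (the field names `context`, `suggestion`, and `ids` are assumptions for illustration, not the actual schema):

{
  "context": "uk:en",
  "suggestion": "london",
  "ids": [516, 1650]
}

Filtering on `context` at query time then scopes the autocompletion to one market/language pair.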
{
  "nodes": [
    {
      "id": 516,
      "label": "geo",
      "title": "London"
    },
    {
      "id": 1650,
      "label": "geo",
[
  {
    "product": { "name": "testing shit", "product_number": "asdasdasd" },
    "retailer": { "id": 1, "name": "shit" },
    "branding": {
      "manufacturer": { "id": 1, "name": "supermanufacturer" },
      "brand": { "id": 1, "name": "superbrand" },
      "sub_brand": { "id": 1, "name": "supersubrand" }
    }
  },
The problem is easy to understand: we have 'duplicate' nodes in our database, based on the 'id' field in the node properties.
This sounds easy enough, until you actually have to do it.
My first attempt was to figure out which nodes are actually duplicates (based on a property on the node). This seems pretty straightforward.
Cypher:
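A query along these lines will do it (a minimal sketch, assuming the duplicates share the same application-level `id` property, not Neo4j's internal node ID):

// Group nodes by their application-level id property
MATCH (n)
WITH n.id AS id, collect(n) AS nodes
// Keep only the groups that contain more than one node
WHERE size(nodes) > 1
RETURN id, size(nodes) AS occurrences

The `collect()` aggregation gathers every node with the same `id` into one list, so the `WHERE` filter afterwards only lets the duplicated groups through.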