jruts/s3tagstorage.md

## s3tagstorage.md

      
PROPOSALS: Tag Storage in S3

Currently we store our tag information as JSON files in S3. The idea is good because we get versioning and replayability out of the box.
This is what the current structure looks like:
ci
├── amenity
├── geo
├── hotels
│   ├── master
│   ├── nordics

Geographical tags are stored in the geo > geonames folder (the geonames subfolder is not expanded in the tree above).
A JSON file in the geonames folder can look like this:
{
  "_id": "geo:geonames.6252001",
  "displayName": "United States",
  "location": {
    "lat": "39.76",
    "lon": "-98.5"
  },
  "tags": [
    ...
  ],
  "metadata": [
    {
      "key": "label:da",
      "values": [
        "USA"
      ]
    },
    {
      "key": "search:da",
      "values": [
        "USA"
      ]
    },
    ...
  ]
}
As you can see, the language-dependent values are stored in the same file, as entries in the metadata array.
At the hotel level, however, we differentiate between master (UK market?) and Nordics content based on folders.
This leaves us with two different ways of handling market- and language-dependent tags in the same tree structure.
As a result, the lambda that inserts the data into CloudSearch and Neo4j has to deduce the language- and market-dependent values differently depending on whether the file is an amenity, a geo tag or a hotel.
It would make our lives easier if we generalised the way the market and the language are detected.
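To illustrate the geo/amenity side of this, here is a minimal sketch of how language-dependent values could be pulled out of the metadata array. It assumes every key has the form `<field>:<language>` (as in `label:da` above); the function name `group_metadata_by_language` is hypothetical and not taken from the existing lambda. Hotels would need a separate, folder-based rule, which is not sketched here because the mapping from `master`/`nordics` to markets is not specified.

```python
from collections import defaultdict


def group_metadata_by_language(tag: dict) -> dict:
    """Group metadata values by the language suffix of each key.

    Assumption: keys look like "<field>:<language>", e.g. "label:da".
    Keys without a suffix are collected under None.
    """
    grouped = defaultdict(dict)
    for entry in tag.get("metadata", []):
        field, _, language = entry["key"].partition(":")
        grouped[language or None][field] = entry["values"]
    return dict(grouped)


# Trimmed version of the geonames sample shown above.
tag = {
    "_id": "geo:geonames.6252001",
    "displayName": "United States",
    "metadata": [
        {"key": "label:da", "values": ["USA"]},
        {"key": "search:da", "values": ["USA"]},
    ],
}

print(group_metadata_by_language(tag))
# {'da': {'label': ['USA'], 'search': ['USA']}}
```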
Idea

So instead of putting all the language-dependent data in the metadata array for geography and amenity tags, we could create a separate file for each market and language.
Example:
We want to add a geography tag for Spain.
Folder structure:
ci
├── geo
│   ├── uk
│   │   ├── en
│   │   │   ├── geo:geonames.10062607.json

Now we also want to add the Danish version:
ci
├── geo
│   ├── uk
│   │   ├── en
│   │   │   ├── geo:geonames.10062607.json
│   ├── dk
│   │   ├── da
│   │   │   ├── geo:geonames.10062607.json
├── hotels
│   ├── uk
│   │   ├── en
│   │   │   ├── hotel:mhid.00oLnm6.json
│   ├── dk
│   │   ├── da
│   │   │   ├── hotel:ne.wvid.10002.json

The same structure applies to amenities and hotels.
Do note that this structure is only meant to make it a lot easier to create records for CloudSearch. The logic for creating nodes in Neo4j will remain the same.
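With this layout the market and language can be read straight from the object key, regardless of the tag type. A minimal sketch, assuming keys always follow `ci/<category>/<market>/<language>/<tag id>.json` as in the trees above; `TagLocation` and `parse_tag_key` are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class TagLocation(NamedTuple):
    category: str  # e.g. "geo", "hotels", "amenity"
    market: str    # e.g. "uk", "dk"
    language: str  # e.g. "en", "da"
    tag_id: str    # file name without the ".json" extension


def parse_tag_key(key: str, root: str = "ci") -> TagLocation:
    """Split "<root>/<category>/<market>/<language>/<tag id>.json"."""
    parts = key.split("/")
    if len(parts) != 5 or parts[0] != root or not parts[4].endswith(".json"):
        raise ValueError(f"unexpected tag key layout: {key!r}")
    _, category, market, language, filename = parts
    return TagLocation(category, market, language, filename[: -len(".json")])


print(parse_tag_key("ci/geo/dk/da/geo:geonames.10062607.json"))
# TagLocation(category='geo', market='dk', language='da', tag_id='geo:geonames.10062607')
```

Rejecting keys that do not match the layout is a design choice in this sketch; it keeps files from the old structure from being indexed under the wrong market by accident.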
Pros and cons

Pros


With a structure like this, our lambda can insert the records into CloudSearch using the same algorithm for every file.
New markets/languages are easy to add: you drop the needed files into a new folder without having to append to existing files.
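The first point can be sketched as a single function that treats every file the same way. The record shape, the field names and the `build_search_record` name are all assumptions; the actual CloudSearch document format and index fields are not described in this proposal. The hotel content in the usage example is made-up sample data.

```python
def build_search_record(key: str, tag: dict) -> dict:
    """Build a flat search record from an object key and its JSON content.

    Assumption: keys look like "ci/<category>/<market>/<language>/<file>.json".
    The output fields are hypothetical and would have to match the real index.
    """
    _, category, market, language, _ = key.split("/")
    return {
        "id": tag["_id"],
        "category": category,
        "market": market,
        "language": language,
        "display_name": tag["displayName"],
    }


# The same code path handles a geo tag and a hotel.
files = {
    "ci/geo/dk/da/geo:geonames.6252001.json": {
        "_id": "geo:geonames.6252001",
        "displayName": "USA",
    },
    "ci/hotels/dk/da/hotel:ne.wvid.10002.json": {
        "_id": "hotel:ne.wvid.10002",
        "displayName": "Example Hotel",  # made-up sample value
    },
}

records = [build_search_record(key, tag) for key, tag in files.items()]
for record in records:
    print(record["category"], record["market"], record["language"], record["id"])
```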

Cons


More markets/languages means more folders/subfolders.