@jruts
Last active April 27, 2016 15:14

PROPOSALS: Tag Storage in S3

Currently we store our tag information in JSON files in S3. The idea is good because we get versioning and replayability out of the box.

This is what the current structure looks like:

ci
├── amenity
├── geo
├── hotels
│   ├── master
│   ├── nordics

The geographical tags are stored in the geo > geonames folder (the geonames subfolder is not expanded in the tree above). A JSON file in the geonames folder can look like this:

{
  "_id": "geo:geonames.6252001",
  "displayName": "United States",
  "location": {
    "lat": "39.76",
    "lon": "-98.5"
  },
  "tags": [
    ...
  ],
  "metadata": [
    {
      "key": "label:da",
      "values": [
        "USA"
      ]
    },
    {
      "key": "search:da",
      "values": [
        "USA"
      ]
    },
    ...
  ]
}

As you can see, the language-dependent values (here Danish, da) are stored in the same file.
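To illustrate how the language is embedded in each metadata key, here is a minimal Python sketch that groups the metadata entries by their language suffix. It assumes every language-dependent key has the form `<field>:<language>`, as in the example above. The function name `group_metadata_by_language` is hypothetical and not part of the existing lambda.

```python
from collections import defaultdict


def group_metadata_by_language(tag):
    """Group metadata values by language code.

    Assumes keys look like "<field>:<language>" (e.g. "label:da").
    Keys without a colon are collected under the language None.
    """
    grouped = defaultdict(dict)
    for entry in tag.get("metadata", []):
        field, sep, language = entry["key"].partition(":")
        grouped[language if sep else None][field] = entry["values"]
    return dict(grouped)


# Trimmed-down version of the example file above.
tag = {
    "_id": "geo:geonames.6252001",
    "displayName": "United States",
    "metadata": [
        {"key": "label:da", "values": ["USA"]},
        {"key": "search:da", "values": ["USA"]},
    ],
}

by_language = group_metadata_by_language(tag)
print(by_language)  # {'da': {'label': ['USA'], 'search': ['USA']}}
```

This kind of unpacking is only needed for files that keep all languages in one metadata array.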

At the hotel level, however, we differentiate between master (UK market?) and Nordics content by folder. This leaves us with two different ways of handling market- and language-dependent tags in the same tree structure.

This means that the lambda that inserts the data into CloudSearch and Neo4j has to deduce the language- and market-dependent values differently depending on whether the file is an amenity, a geo tag, or a hotel.

It would make our lives easier if we generalised the way we detect the market and the language.

Idea

So instead of throwing all the language dependent data in the metadata array for geography and amenity, we could create a separate file for each market and language.

Example:

We want to add a geography tag for Spain.

Folder structure:

ci
├── geo
│   ├── uk
│   │   ├── en
│   │   │   ├── geo:geonames.10062607.json

Now we also want to add the Danish version:

ci
├── geo
│   ├── uk
│   │   ├── en
│   │   │   ├── geo:geonames.10062607.json
│   ├── dk
│   │   ├── da
│   │   │   ├── geo:geonames.10062607.json
├── hotels
│   ├── uk
│   │   ├── en
│   │   │   ├── hotel:mhid.00oLnm6.json
│   ├── dk
│   │   ├── da
│   │   │   ├── hotel:ne.wvid.10002.json

The same structure applies to amenity and hotels.

Note that this structure is only meant to make it much easier to create records for CloudSearch. The logic for creating nodes in Neo4j remains the same.
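With the proposed layout, the market and language can be read straight from the S3 key, whatever the tag type. The following Python sketch shows one way this could look. It assumes keys always have the form `<root>/<category>/<market>/<language>/<tag id>.json`, as in the trees above. `TagKey` and `parse_tag_key` are hypothetical names for illustration only.

```python
from typing import NamedTuple


class TagKey(NamedTuple):
    category: str
    market: str
    language: str
    tag_id: str


def parse_tag_key(key):
    """Split "<root>/<category>/<market>/<language>/<tag id>.json" into parts."""
    parts = key.split("/")
    if len(parts) != 5 or not parts[4].endswith(".json"):
        raise ValueError(f"unexpected key layout: {key!r}")
    _root, category, market, language, filename = parts
    # Tag ids contain dots themselves, so strip only the ".json" suffix.
    return TagKey(category, market, language, filename[: -len(".json")])


geo = parse_tag_key("ci/geo/dk/da/geo:geonames.10062607.json")
hotel = parse_tag_key("ci/hotels/uk/en/hotel:mhid.00oLnm6.json")
print(geo.market, geo.language, geo.tag_id)
print(hotel.market, hotel.language, hotel.tag_id)
```

The same function handles geo and hotel keys, which is the point of the proposal: no per-type branching to find the market or language.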

Pros and cons

Pros

  • By using a structure like this, our lambda can insert the records into CloudSearch with the same algorithm for each file.
  • You can easily add new markets/languages by dropping the needed files into a new folder, without having to append to existing files.
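As a rough sketch of the second point, adding a market or language comes down to writing one more file under a new key. The helpers below are hypothetical and assume the `<root>/<category>/<market>/<language>/<tag id>.json` layout from the proposal, with `ci` as the root folder.

```python
def tag_key(category, market, language, tag_id, root="ci"):
    """Build the S3 key for one tag file in the proposed layout."""
    return f"{root}/{category}/{market}/{language}/{tag_id}.json"


def keys_for_markets(category, tag_id, markets):
    """Return one key per (market, language) pair for the same tag."""
    return [tag_key(category, market, language, tag_id)
            for market, language in markets]


keys = keys_for_markets(
    "geo",
    "geo:geonames.10062607",
    [("uk", "en"), ("dk", "da")],
)
for key in keys:
    print(key)
```

Supporting another market would only mean adding another pair to the list. The existing files stay untouched.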

Cons

  • More markets/languages means more folders/subfolders.