Backblaze Hard Drive Test Data in Elasticsearch

Instructions

These are some quick notes for importing the Backblaze Hard Drive Test Data into Elasticsearch. Of the archives that Backblaze provides, you only need to download the 2013 and 2014 data sets and unpack them to a temporary location.
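A rough sketch of that step, assuming the archives are zip files (the file names below are placeholders; use whatever the Backblaze download page actually serves):

mkdir -p /tmp/backblaze && cd /tmp/backblaze
unzip ~/Downloads/data_2013.zip   # placeholder archive name
unzip ~/Downloads/data_2014.zip   # placeholder archive name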

After you've unpacked the data, you'll need to convert the CSV to JSON. I use the csvjson tool from Csvkit for this. In the directory containing the CSV files, run this bash loop:

for csv in *.csv; do name=$(basename "$csv" .csv); csvjson "${name}.csv" > "${name}.json"; done
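csvjson writes each file as one JSON array on a single line, along the lines of the following (field names abbreviated and illustrative; the sed step below only depends on each record starting with the "date" field):

[{"date": "2014-01-01", "serial_number": "XXXX", "model": "XXXX", "failure": 0, ...}, {"date": "2014-01-01", ...}]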

In order to import into Elasticsearch using the bulk import API, the JSON created by csvjson needs some more formatting applied. The following sed command will:

  • Strip the enclosing array brackets from the JSON file.
  • Put a single record on each line, properly terminated with '\n'.
  • Add an action line before each record so it can be inserted using the Elasticsearch bulk API.

All records are added to a single backblaze_smart_data index:

sed -i -e 's|\[{|{|' -e 's|}\]|}|' -e 's|},|}\n|g' -e 's|{"date|{"index":{"_index":"backblaze_smart_data","_type":"smartdata"}}\n{"date|g' *.json
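After the sed pass, each file alternates action lines and document lines, one record per line, which is exactly the shape the bulk API expects (document bodies abbreviated here):

{"index":{"_index":"backblaze_smart_data","_type":"smartdata"}}
{"date": "2014-01-01", "serial_number": "XXXX", ...}
{"index":{"_index":"backblaze_smart_data","_type":"smartdata"}}
{"date": "2014-01-01", "serial_number": "XXXX", ...}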

Once you've reformatted the JSON files, you can insert them into your local Elasticsearch instance with the following bash loop:

for j in *.json; do curl -XPUT localhost:9200/_bulk --data-binary @"${j}"; done
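To sanity-check the import, you can ask Elasticsearch how many documents landed in the index, using the standard _count API:

curl 'localhost:9200/backblaze_smart_data/_count?pretty'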
