Backblaze Hard Drive Test Data in Elasticsearch

Instructions

These are some quick notes for importing the Backblaze Hard Drive Test Data into Elasticsearch. Of the archives that Backblaze provides, you only need to download the 2013 and 2014 data sets and unpack them to a temporary location.
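A rough sketch of that step, assuming the archives are zip files (the file names below are placeholders; use whatever the Backblaze download page actually serves):

mkdir -p /tmp/backblaze && cd /tmp/backblaze
unzip ~/Downloads/data_2013.zip   # placeholder archive name
unzip ~/Downloads/data_2014.zip   # placeholder archive name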

After you've unpacked the data, you'll need to convert the CSV to JSON. I use the csvjson tool from Csvkit for this. In the directory containing the CSV files, run this bash loop:

for csv in *.csv; do name=$(basename "$csv" .csv); csvjson "${name}.csv" > "${name}.json"; done
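csvjson writes each file as one JSON array on a single line, along the lines of the following (field names abbreviated and illustrative; the sed step below only depends on each record starting with the "date" field):

[{"date": "2014-01-01", "serial_number": "XXXX", "model": "XXXX", "failure": 0, ...}, {"date": "2014-01-01", ...}]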

In order to import into Elasticsearch using the bulk import API, the JSON created by csvjson needs some more formatting applied. The following sed command will:

  • Strip the enclosing array brackets from the JSON file.
  • Put a single record on each line, properly terminated with '\n'.
  • Add an action line before each record so it can be inserted using the Elasticsearch bulk API.

All records are added to a single backblaze_smart_data index:

sed -i -e 's|\[{|{|' -e 's|}\]|}|' -e 's|},|}\n|g' -e 's|{"date|{"index":{"_index":"backblaze_smart_data","_type":"smartdata"}}\n{"date|g' *.json
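After the sed pass, each file alternates action lines and document lines, one record per line, which is exactly the shape the bulk API expects (document bodies abbreviated here):

{"index":{"_index":"backblaze_smart_data","_type":"smartdata"}}
{"date": "2014-01-01", "serial_number": "XXXX", ...}
{"index":{"_index":"backblaze_smart_data","_type":"smartdata"}}
{"date": "2014-01-01", "serial_number": "XXXX", ...}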

Once you've reformatted the JSON files, you can insert them into your local Elasticsearch instance with the following bash loop:

for j in *.json; do curl -XPUT localhost:9200/_bulk --data-binary @"${j}"; done
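To sanity-check the import, you can ask Elasticsearch how many documents landed in the index, using the standard _count API:

curl 'localhost:9200/backblaze_smart_data/_count?pretty'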
