@bearloga
Last active June 25, 2019 05:48
Druid ingestion spec for gzipped CSV data
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "paths": "hdfs://analytics-hadoop/tmp/gsc-all.csv.gz",
        "type": "static"
      }
    },
    "dataSchema": {
      "dataSource": "test_gsc_all",
      "granularitySpec": {
        "type": "uniform",
        "queryGranularity": "day",
        "segmentGranularity": "year",
        "intervals": [
          "2017-01-01T00:00:00Z/2018-11-13T00:00:00Z"
        ]
      },
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "csv",
          "columns": [
            "dt", "url", "protocol", "site", "subdomain", "project", "site_version",
            "country_code", "country_name", "language_code", "language_name",
            "economic_region", "maxmind_continent",
            "impressions", "clicks", "position"
          ],
          "dimensionsSpec": {
            "dimensions": [
              "url",
              "protocol",
              "site",
              "subdomain",
              "project",
              "site_version",
              "country_code",
              "country_name",
              "language_code",
              "language_name",
              "economic_region",
              "maxmind_continent"
            ]
          },
          "timestampSpec": {
            "column": "dt"
          }
        }
      },
      "metricsSpec": [
        {
          "name": "impressions",
          "type": "doubleSum",
          "fieldName": "impressions"
        },
        {
          "name": "clicks",
          "type": "doubleSum",
          "fieldName": "clicks"
        },
        {
          "name": "best_position",
          "type": "doubleMin",
          "fieldName": "position"
        },
        {
          "name": "worst_position",
          "type": "doubleMax",
          "fieldName": "position"
        }
      ]
    },
    "tuningConfig": {
      "type": "hadoop",
      "overwriteFiles": true,
      "ignoreInvalidRows": false,
      "partitionsSpec": {
        "type": "hashed",
        "numShards": 1
      },
      "io.compression.codecs": "org.apache.hadoop.io.compress.GzipCodec"
    }
  }
}
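
A quick sanity check before submitting: the timestamp column, every entry under dimensions, and every metric fieldName should also appear in the parseSpec columns list, because that list is what maps CSV fields to names. The sketch below is not part of the original gist. check_columns is a hypothetical helper, and it assumes the layout used above (dataSchema.parser.parseSpec, with dimensions given as plain strings).

```python
def check_columns(spec: dict) -> list:
    """Return names the spec refers to that are missing from parseSpec.columns."""
    data_schema = spec["spec"]["dataSchema"]
    parse_spec = data_schema["parser"]["parseSpec"]
    columns = set(parse_spec["columns"])

    # Everything that is looked up by column name during ingestion.
    referenced = [parse_spec["timestampSpec"]["column"]]
    referenced += parse_spec["dimensionsSpec"]["dimensions"]
    referenced += [m["fieldName"] for m in data_schema["metricsSpec"]]

    # Keep first-seen order and drop repeats ("position" feeds two metrics).
    missing = []
    for name in referenced:
        if name not in columns and name not in missing:
            missing.append(name)
    return missing


# Trimmed-down stand-in for the spec above, with one deliberate typo ("click").
example = {
    "spec": {
        "dataSchema": {
            "parser": {
                "parseSpec": {
                    "columns": ["dt", "url", "impressions", "clicks", "position"],
                    "dimensionsSpec": {"dimensions": ["url"]},
                    "timestampSpec": {"column": "dt"},
                }
            },
            "metricsSpec": [
                {"name": "clicks", "type": "doubleSum", "fieldName": "click"},
                {"name": "best_position", "type": "doubleMin", "fieldName": "position"},
            ],
        }
    }
}

print(check_columns(example))  # ['click']
```

To run it against the real file, load the spec with json.load and pass the resulting dict; an empty list means every referenced name has a matching column.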
bearloga commented Nov 19, 2018

To start the indexing job, POST the spec file to the task endpoint (the response contains a task ID):

    unset http_proxy && curl -v -L -X 'POST' -H 'Content-Type:application/json' -d@druid-csv-spec_country-all.json http://druid1001.eqiad.wmnet:8090/druid/indexer/v1/task

To check the status of the job, put that task ID into the status URL:

    unset http_proxy && curl -L -H 'Content-Type:application/json' http://druid1001.eqiad.wmnet:8090/druid/indexer/v1/task/***task ID***/status
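
The same two requests can be scripted so that the status is polled until the task stops running. This is only a sketch using the Python standard library. All function names are hypothetical, and the response shapes are assumptions on my part: I am assuming the submit call returns {"task": "<id>"} and the status call returns {"task": "<id>", "status": {"status": "RUNNING" | "SUCCESS" | "FAILED", ...}}, so check them against what your Druid version actually returns.

```python
import json
import time
import urllib.request

OVERLORD = "http://druid1001.eqiad.wmnet:8090"  # host and port from the curl commands

# Ignore proxy settings from the environment, mirroring `unset http_proxy`.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def task_url(base: str) -> str:
    return base.rstrip("/") + "/druid/indexer/v1/task"


def status_url(base: str, task_id: str) -> str:
    return f"{task_url(base)}/{task_id}/status"


def extract_state(status_response: dict) -> str:
    # Assumed shape: {"task": "<id>", "status": {"status": "RUNNING", ...}}
    return status_response["status"]["status"]


def submit_task(base: str, spec_path: str) -> str:
    """POST the spec file and return the task ID from the response."""
    with open(spec_path, "rb") as f:
        body = f.read()
    request = urllib.request.Request(
        task_url(base),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with _opener.open(request) as response:
        return json.load(response)["task"]  # assumed shape: {"task": "<id>"}


def wait_for_task(base: str, task_id: str, poll_seconds: int = 30) -> str:
    """Poll the status endpoint until the task is no longer RUNNING."""
    while True:
        with _opener.open(status_url(base, task_id)) as response:
            state = extract_state(json.load(response))
        if state != "RUNNING":
            return state
        time.sleep(poll_seconds)


# Usage (needs network access to the Druid host, so it is left commented out):
# task_id = submit_task(OVERLORD, "druid-csv-spec_country-all.json")
# print(task_id, wait_for_task(OVERLORD, task_id))
```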

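Note: if the CSV is not compressed, remove the "io.compression.codecs" : "org.apache.hadoop.io.compress.GzipCodec" entry from tuningConfig. It is the last entry in that object, so also remove the comma before it to keep the JSON valid.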
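
If you switch between gzipped and plain CSV often, editing the file by hand gets error-prone. A minimal sketch of doing it programmatically follows. without_gzip_codec is a hypothetical helper, not something from the gist, and it looks for tuningConfig both at the top level of the file and under spec so it does not depend on where that object sits.

```python
import copy
import json

CODEC_KEY = "io.compression.codecs"


def without_gzip_codec(spec: dict) -> dict:
    """Return a copy of the spec with the codec entry removed from tuningConfig."""
    result = copy.deepcopy(spec)  # leave the caller's dict untouched
    for holder in (result, result.get("spec", {})):
        tuning = holder.get("tuningConfig")
        if isinstance(tuning, dict):
            tuning.pop(CODEC_KEY, None)
    return result


# Trimmed-down stand-in for the real spec.
example = {
    "type": "index_hadoop",
    "spec": {"ioConfig": {"type": "hadoop"}},
    "tuningConfig": {
        "type": "hadoop",
        CODEC_KEY: "org.apache.hadoop.io.compress.GzipCodec",
    },
}

print(json.dumps(without_gzip_codec(example)["tuningConfig"]))
# {"type": "hadoop"}
```

Writing the result back with json.dump also avoids the dangling-comma problem, since the serializer emits valid JSON regardless of which entry was removed. Remember to point inputSpec paths at the uncompressed file as well.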
