Skip to content

Instantly share code, notes, and snippets.

@evidanary
Last active August 29, 2015 13:56
Show Gist options
  • Save evidanary/9148018 to your computer and use it in GitHub Desktop.
Save evidanary/9148018 to your computer and use it in GitHub Desktop.
Extract Schema from a JSON row. Generates a hive table. Flattens based on {"event" => ....., "properties" = > ....}
# This utilityi extracts schema of a JSON object
# It returns the keys of the JSON object separated by commas
# It also returns a create table statement in Hive
require '/var/lib/jenkins/scripts/lib/hive_utils.rb'
require 'json'
# This function should probably be in hive_utils.rb
def create_table_string(params)
header = "CREATE TABLE #{params[:hive_schema]}.#{params[:hive_table]}("
footer = ")" + "\n" +
"PARTITIONED BY (dt string)\n" +
"ROW FORMAT DELIMITED" + "\n" +
" FIELDS TERMINATED BY '\\t'" + "\n" +
" LINES TERMINATED BY '\\n';"
header + params[:fields].join(",\n") + footer
end
def drop_table_string(params)
"DROP TABLE IF EXISTS #{params[:hive_schema]}.#{params[:hive_table]};"
end
def flatten_event(jevent)
new_event = {}
new_event["event"] = jevent["event"]
new_event.merge!(jevent["properties"])
new_event
end
def sanitize_to_hive_columns(field_name)
#replacce $ with '' and space with _, : with _
field_name.gsub(/\$+/, '').gsub(' ', '_').gsub(':', '_')
end
json_payload = flatten_event(JSON.load(gets.split("\n").first))
keys = json_payload.keys
3.times {puts}
puts "JSON_KEYS: "
puts keys.join(',')
3.times {puts}
#sanitize keys to column_names
keys.map! {|x| " " + sanitize_to_hive_columns(x) + " string"}
params = {
:hive_schema => 'sandbox_yranadive',
:hive_table => 'mixpanel_AirplaneModeChanged',
:fields => keys
}
puts "HIVE CREATE TABLE STMT: "
puts drop_table_string(params)
puts create_table_string(params)
@derwin12
Copy link

derwin12 commented Apr 1, 2014

Where do I find the hive_utils.rb file?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment