@jbenninghoff
Last active July 21, 2021 17:37

How to generate synthetic data in Hive table format

CSV and HQL Generation


Use the included genHiveTableFromSchema.py Python script to generate the structured CSV data and the associated Hive script locally. The script requires options to specify the schema file, row count, and partition sizes. If the script is run without options, full usage is provided.

The schema file should be the output of a SHOW CREATE TABLE command.
-s mySchema.hql (schema file)
-n (row count)
-p X,Y,Z... (partition sizes, e.g. 1,5,40)

The only map column type supported for now is map<string,string>.
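
For reference, here is a minimal sketch of what such a schema file might look like. The table, column, and partition names are hypothetical (loosely modeled on the example table used later in this doc), and three partition columns are shown only on the assumption that -p takes one size per partition column; the real file should simply be the verbatim SHOW CREATE TABLE output for your table.

-- Hypothetical schema file contents (trimmed SHOW CREATE TABLE output)
CREATE TABLE `fact_applogs_example` (
  `dynamicfields` map<string,string>,
  `organizationid` string,
  `userid` string,
  `host` string,
  `rawtimestamp` bigint)
PARTITIONED BY (
  `logrecordtype` string,
  `ts_date` string,
  `batchid` string)
STORED AS PARQUET;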

Once the script has run, two files will be generated in the current folder:

  • HiveRandom.csv
  • HiveRandom.hql

Review the HiveRandom.hql script and note that the last step is commented out.

Hive and Presto Commands


On a host that can run the hive shell, use it to run:

hive -f HiveRandom.hql

The last HQL command in HiveRandom.hql is commented out because the data insert executes much faster in Trino/Presto. To run the statement in Trino/Presto SQL, its syntax must be modified to remove TABLE and PARTITION(...). Run it once manually to ensure it completes successfully without errors.
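
For illustration, the commented-out statement presumably takes a form like the sketch below; the PARTITION column list and the dynamic-partition SET line are assumptions, and the authoritative version is whatever the generator wrote at the end of HiveRandom.hql.

-- Hive form of the final insert, left commented out because it is slow
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE fact_applogs_ailtn_current PARTITION (logrecordtype, ts_date, batchid)
SELECT dynamicfields, organizationid, userid, datacenter, pod, appversion,
       type, host, uuid, rawtimestamp, agenttimestamp, superpod, approle,
       substrate, region, account, logrecordtype, ts_date, batchid
FROM fact_applogs_ailtn_current_random;

Dropping TABLE and the PARTITION(...) clause from that statement gives the Trino/Presto version.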

That Trino/Presto-compatible insert statement is also included in the insert-csv.sql script like this:

INSERT INTO fact_applogs_ailtn_current SELECT dynamicfields, organizationid,
userid, datacenter, pod, appversion, type, host, uuid, rawtimestamp,
agenttimestamp, superpod, approle, substrate, region, account,
logrecordtype,ts_date,batchid FROM fact_applogs_ailtn_current_random;

Then insert-csv.sql can be run with presto-cli like this:
presto-cli --catalog hive --schema default -f insert-csv.sql

Repeat Process


The Presto SQL script, insert-csv.sql, can then be run repeatedly from a bash script like the one below, adding 4M or more rows per run:

cat addrows.sh

#!/bin/bash
set -o nounset; set -o errexit; set -o pipefail

# Remove previous run data assuming fact_applogs_ailtn_current table name
hive -e 'drop table fact_applogs_ailtn_current_random;'
hdfs dfs -rm -R /user/hive/warehouse/fact_applogs_ailtn_current_random/

# Generate the data in csv format and create table schemas
./genHiveTableFromSchema.py -s fact_applogs_ailtn_current-schema-jb.hql \
   -n 4000000 -p1,1,40
hive -f HiveRandom.hql

# Time the actual data insert. I saw ~50min for 4M rows with 30 nodes @ r5a.24xl
presto-cli --catalog hive --schema default -f insert-csv.sql

The addrows.sh script can then be run repeatedly to add 4M rows at a time to the Parquet table. Each run will generate a new ts_date, assuming the -p1,1,40 option is used.
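
To confirm that each run added a new ts_date partition with the expected row count, a quick check like the one below (a sketch, using the same example table name) can be run from presto-cli or the hive shell.

-- Count rows per ts_date partition to verify each run landed as expected
SELECT ts_date, count(*) AS row_count
FROM fact_applogs_ailtn_current
GROUP BY ts_date
ORDER BY ts_date;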

In order to insert millions of rows in Presto, the following hive.properties value must be set on all nodes:
hive.max-partitions-per-writers = 5000

/jb
