@jbenninghoff
Last active July 21, 2021 17:37

How to generate synthetic data in Hive table format

CSV and HQL Generation


Use the included genHiveTableFromSchema.py Python script to generate the structured CSV data and the associated Hive script locally. The script requires options to specify the schema file, row count, and partition sizes. If the script is run without options, full usage is provided.

The schema file should be the output of a SHOW CREATE TABLE command.
-s mySchema.hql (schema file)
-n (row count)
-p X,Y,Z... (partition sizes, e.g. 1,5,40)

The only map column type supported for now is map<string,string>.
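
For reference, here is a minimal sketch of what such a schema file might look like. The table, column, and partition names are hypothetical (loosely modeled on the example table used later in this doc), and three partition columns are shown only on the assumption that -p takes one size per partition column; the real file should simply be the verbatim SHOW CREATE TABLE output for your table.

-- Hypothetical schema file contents (trimmed SHOW CREATE TABLE output)
CREATE TABLE `fact_applogs_example` (
  `dynamicfields` map<string,string>,
  `organizationid` string,
  `userid` string,
  `host` string,
  `rawtimestamp` bigint)
PARTITIONED BY (
  `logrecordtype` string,
  `ts_date` string,
  `batchid` string)
STORED AS PARQUET;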

Once the script has run, two files will be generated in the current folder:

  • HiveRandom.csv
  • HiveRandom.hql

Review the HiveRandom.hql script and note that the last step is commented out.

Hive and Presto Commands


On a host that can run the hive shell, use it to run:

hive -f HiveRandom.hql

The last HQL command in HiveRandom.hql is commented out because the data insert executes much faster in Trino/Presto. To run the statement in Trino/Presto SQL, its syntax must be modified to remove TABLE and PARTITION(...). Run it once manually to ensure it completes successfully without errors.
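
For illustration, the commented-out statement presumably takes a form like the sketch below; the PARTITION column list and the dynamic-partition SET line are assumptions, and the authoritative version is whatever the generator wrote at the end of HiveRandom.hql.

-- Hive form of the final insert, left commented out because it is slow
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE fact_applogs_ailtn_current PARTITION (logrecordtype, ts_date, batchid)
SELECT dynamicfields, organizationid, userid, datacenter, pod, appversion,
       type, host, uuid, rawtimestamp, agenttimestamp, superpod, approle,
       substrate, region, account, logrecordtype, ts_date, batchid
FROM fact_applogs_ailtn_current_random;

Dropping TABLE and the PARTITION(...) clause from that statement gives the Trino/Presto version.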

That Trino/Presto-compatible insert statement is also included in the insert-csv.sql script like this:

INSERT INTO fact_applogs_ailtn_current SELECT dynamicfields, organizationid,
userid, datacenter, pod, appversion, type, host, uuid, rawtimestamp,
agenttimestamp, superpod, approle, substrate, region, account,
logrecordtype,ts_date,batchid FROM fact_applogs_ailtn_current_random;

Then insert-csv.sql can be run with presto-cli like this:
presto-cli --catalog hive --schema default -f insert-csv.sql

Repeat Process


The Presto SQL script, insert-csv.sql, can then be run repeatedly from a bash script like the one below, adding 4M or more rows per run:

cat addrows.sh

#!/bin/bash
set -o nounset; set -o errexit; set -o pipefail

# Remove previous run data assuming fact_applogs_ailtn_current table name
hive -e 'drop table fact_applogs_ailtn_current_random;'
hdfs dfs -rm -R /user/hive/warehouse/fact_applogs_ailtn_current_random/

# Generate the data in csv format and create table schemas
./genHiveTableFromSchema.py -s fact_applogs_ailtn_current-schema-jb.hql \
   -n 4000000 -p1,1,40
hive -f HiveRandom.hql

# Time the actual data insert. I saw ~50min for 4M rows with 30 nodes @ r5a.24xl
presto-cli --catalog hive --schema default -f insert-csv.sql

The addrows.sh script can then be run repeatedly to add 4M rows at a time to the Parquet table. Each run will generate a new ts_date, assuming the -p1,1,40 option is used.
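
To confirm that each run added a new ts_date partition with the expected row count, a quick check like the one below (a sketch, using the same example table name) can be run from presto-cli or the hive shell.

-- Count rows per ts_date partition to verify each run landed as expected
SELECT ts_date, count(*) AS row_count
FROM fact_applogs_ailtn_current
GROUP BY ts_date
ORDER BY ts_date;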

In order to insert millions of rows in Presto, the following hive.properties value must be set on all nodes:
hive.max-partitions-per-writers = 5000

/jb
