
td-spark usage notes

What You Can Do With td-spark

  • Reading and writing tables in TD through Spark DataFrames (see the overview sketch after this list).
  • Running Spark SQL queries against DataFrames.
  • Submitting Presto SQL queries to TD and reading the query results as DataFrames.
  • If you use PySpark, you can use Spark DataFrames and Pandas DataFrames interchangeably.
  • Using hosted Spark services, such as Amazon EMR and Databricks Cloud.
    • For the best performance, we recommend using the us-east region in AWS.
    • It's also possible to run PySpark on Google Colaboratory. See the demos for details.
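
A minimal overview sketch of these capabilities in the Spark shell. The my_db database used in the write step is a hypothetical example; each step is covered in detail in the Usage sections below.

import com.treasuredata.spark._
val td = spark.td

// Read a TD table as a Spark DataFrame
val df = td.table("sample_datasets.www_access").df

// Run a Spark SQL query against the DataFrame
df.createOrReplaceTempView("www_access")
spark.sql("select count(*) cnt from www_access").show

// Submit a Presto SQL query to TD and read the result as a DataFrame
td.presto("select method, count(*) cnt from sample_datasets.www_access group by 1").show

// Write the DataFrame back to TD (my_db is a hypothetical database)
df.createOrReplaceTD("my_db.www_access_copy")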

Downloads

  • td-spark 1.1.0 (For Spark 2.4.0 + Scala 2.11): td-spark-assembly_2.11-1.1.0.jar

    • Reduced the td-spark assembly jar file size from 151MB to 22MB.
    • Added new methods (see the usage sketch after this list):
      • td.presto(sql)
        • Improved query performance by using the api-presto gateway.
        • For queries with large results, use td.prestoJob(sql).df instead.
      • td.executePresto(sql)
        • Runs non-query Presto statements, e.g., CREATE TABLE and DROP TABLE.
      • td.prestoJob(sql)
        • Runs a Presto query as a regular TD job.
      • td.hiveJob(sql)
        • Runs a Hive query as a regular TD job.
    • New Database/Table methods:
      • td.table(...).exists
      • td.table(...).dropIfExists
      • td.table(...).createIfNotExists
      • td.database(...).exists
      • td.database(...).dropIfExists
      • td.database(...).createIfNotExists
    • Add methods for creating new UDP tables:
      • td.createLongPartitionedTable(table_name, long type column name)
      • td.createStringPartitionedTable(table_name, string type column name)
    • Support df.createOrReplaceTD(...) for UDP tables 
    • Added the spark.td.site configuration for multiple regions:
      • spark.td.site=us (default)
        • For US customers. Using the us-east region provides the best performance.
      • spark.td.site=jp
        • For Tokyo region customers. Using the ap-northeast region provides the best performance.
      • spark.td.site=eu01
        • For EU region customers. Using the eu-central region provides the best performance.
    • Enabled predicate pushdown for UDP table queries
      • Queries with exact match conditions for UDP keys can be accelerated.
    • Bug fixes:
      • Fixed a bug when using createOrReplaceTempView in multiple notebooks on Databricks Cloud.
      • Fixed a NoSuchMethod error when using the td.presto command.
  • td-spark 1.0.0

  • td-spark 0.4.2
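
A minimal usage sketch of the 1.1.0 additions listed above. The database and table names (my_db, events, etc.) are hypothetical examples; td.hiveJob is assumed to return the same job handle as td.prestoJob.

import com.treasuredata.spark._
val td = spark.td

// Database/table helpers
td.database("my_db").createIfNotExists
td.table("my_db.old_events").dropIfExists
val eventsTableExists = td.table("my_db.events").exists

// Non-query Presto statements
td.executePresto("CREATE TABLE IF NOT EXISTS my_db.events (time bigint, event_id bigint, name varchar)")

// Presto/Hive queries submitted as regular TD jobs
val prestoResult = td.prestoJob("select count(*) cnt from sample_datasets.www_access").df
val hiveResult   = td.hiveJob("select count(*) cnt from sample_datasets.www_access").df

// Create UDP tables partitioned by a long or a string column
td.createLongPartitionedTable("my_db.events_by_id", "event_id")
td.createStringPartitionedTable("my_db.events_by_name", "name")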

Demos

Usage

Launching a local Spark Shell Using Docker

If you already have a Docker daemon installed on your machine, using the td-spark Docker images is the most convenient way to try td-spark.

  • If you don't have Docker, install Docker on your machine.

  • Running the Spark shell using your TD API key:

$ export TD_API_KEY=(your TD API key)
$ docker run -it -e TD_API_KEY=$TD_API_KEY armtd/td-spark-shell:latest

  • Running the PySpark shell:

$ docker run -it -e TD_API_KEY=$TD_API_KEY armtd/td-spark-pyspark:latest

Testing td-spark with your local Spark

It is also possible to use Spark binaries to test td-spark:

  • Download the Spark 2.4.0 (or higher) package and extract it to any folder you like.
  • Download td-spark-assembly_2.11-1.1.0.jar (see Downloads above).
  • Prepare a td-spark.conf configuration file:
spark.td.apikey=(YOUR TD API KEY)
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.sql.execution.arrow.enabled=true
  • Launch Spark Shell:
$ ./bin/spark-shell --jars (path to td-spark-assembly_2.11-1.1.0.jar) --properties-file (path to td-spark.conf)

Import td-spark package

To use td-spark, import the com.treasuredata.spark._ package:

import com.treasuredata.spark._; val td = spark.td

Selecting Tables

To select a time range of a table, use td.table("table name") with the within filtering method. After applying the filters, you can create a DataFrame with the .df method:

val df = 
  td.table("(database).(table)")
    .within("-1d")
    .df

Here are examples of selecting time ranges:

val t = td.table("sample_datasets.www_access")
  
// Using duration strings as in TD_INTERVAL  
t.within("-1d") // last day
t.within("-1m/2014-10-03 09:13:00") // last 1 minute from a given offset
t.within("-1d PST") // specifying a time zone. (e.g., JST, PST, UTC etc.)

// Specifying unix time ranges
t.withinUnixTimeRange(from = 1412320845, until = 1412321000) // [from, until) unix time range

t.fromUnixTime(1412320845) // [from, ...)
t.untilUnixTime(1412321000) // [... , until)

// Using time strings in yyyy-MM-dd HH:mm:ss format (default timezone: UTC)
t.fromTime("2014-10-03 09:12:00") // [from, ...)
t.untilTime("2014-10-03 09:13:00") // [..., until)

t.withinTimeRange(from = "2014-10-03 09:12:00", until = "2014-10-03 09:13:00") // [from, until)
t.withinTimeRange(from = "2014-10-03 09:12:00", until = "2014-10-03 09:13:00", ZoneId.of("Asia/Tokyo")) // [from, until) in Asia/Tokyo timezone

// Specifying timezone
t.fromTime("2014-10-03 09:12:00", ZoneId.of("Asia/Tokyo")) // [from, ...) in Asia/Tokyo timezone
t.untilTime("2014-10-03 09:13:00", ZoneId.of("Asia/Tokyo")) // [..., until) in Asia/Tokyo timezone

Run Presto Queries

val df = td.presto("select * from www_access limit 10")
df.show
+----+---------------+--------------------+--------------------+----+--------------------+----+------+----------+
|user|           host|                path|             referer|code|               agent|size|method|      time|
+----+---------------+--------------------+--------------------+----+--------------------+----+------+----------+
|null|  76.45.175.151|   /item/sports/4642|/search/?c=Sports...| 200|Mozilla/5.0 (Maci...| 137|   GET|1412380793|
|null|184.162.105.153|   /category/finance|                   -| 200|Mozilla/4.0 (comp...|  68|   GET|1412380784|
|null|  144.30.45.112|/item/electronics...| /item/software/4777| 200|Mozilla/5.0 (Maci...| 136|   GET|1412380775|
|null|  68.42.225.106|/category/networking|/category/electro...| 200|Mozilla/4.0 (comp...|  98|   GET|1412380766|
|null| 104.66.194.210|     /category/books|                   -| 200|Mozilla/4.0 (comp...|  43|   GET|1412380757|
|null|    64.99.74.69|  /item/finance/3775|/category/electro...| 200|Mozilla/5.0 (Wind...|  86|   GET|1412380748|
|null| 136.135.51.168|/item/networking/540|                   -| 200|Mozilla/5.0 (Wind...|  89|   GET|1412380739|
|null|   52.99.134.55|   /item/health/1326|/category/electro...| 200|Mozilla/5.0 (Maci...|  51|   GET|1412380730|
|null|  136.51.116.68|/category/finance...|                   -| 200|Mozilla/5.0 (comp...|  99|   GET|1412380721|
|null|136.141.218.177| /item/computers/959|                   -| 200|Mozilla/5.0 (Wind...| 124|   GET|1412380712|
+----+---------------+--------------------+--------------------+----+--------------------+----+------+----------+

Using SparkSQL

td.table("sample_datasets.www_access").df.createOrReplaceTempView("www_access")
val q = spark.sql("select agent, count(*) cnt from www_access group by 1")
q.show
+--------------------+---+
|               agent|cnt|
+--------------------+---+
|Mozilla/4.0 (comp...|159|
|Mozilla/5.0 (comp...|341|
|Mozilla/4.0 (comp...|140|
|Mozilla/5.0 (comp...|497|
|Mozilla/5.0 (Wind...|630|
|Mozilla/5.0 (iPad...|158|
|Mozilla/5.0 (Maci...|445|
|Mozilla/5.0 (comp...|261|
|Mozilla/5.0 (Wind...|304|
|Mozilla/4.0 (comp...|187|
|Mozilla/4.0 (comp...|150|
|Mozilla/5.0 (Wind...|291|
|Mozilla/5.0 (Wind...|297|
|Mozilla/4.0 (comp...|132|
|Mozilla/5.0 (Wind...|147|
|Mozilla/4.0 (comp...|123|
|Mozilla/5.0 (iPho...|139|
|Mozilla/5.0 (Wind...|439|
|Mozilla/4.0 (comp...|160|
+--------------------+---+

Uploading DataFrame to TD

To upload a DataFrame to TD, use df.write.td, df.createOrReplaceTD, or df.insertIntoTD:

val df: DataFrame = ... // Prepare some DataFrame

// Writes DataFrame to a new TD table. If the table already exists, this will fail.
df.write.td("(database name).(table name)")

// Create or replace the target TD table with the contents of DataFrame
df.createOrReplaceTD("(database name).(table name)")

// Append the contents of DataFrame to the target TD table.
// If the table doesn't exist, a new table will be created
df.insertIntoTD("(database name).(table name)")

You can also use the full DataFrame.write and save syntax. Specify the format as com.treasuredata.spark, then select the target table with .option("table", "(target database).(table name)"):

df.write // df is an arbitrary DataFrame
  .mode("overwrite") // append, overwrite, error, ignore
  .format("com.treasuredata.spark")
  .option("table", "(database name).(table name)") // Specify an upload taget table
  .save

Available write modes (SaveMode)

mode             behavior
append           Append to the target table. Throws an error if the table doesn't exist.
overwrite        Performs a two-step update: first deletes the target table if it exists, then creates a new table with the new data.
error (default)  Throws an exception if the target table already exists.
ignore           Ignores the save operation if the target table already exists.

Writing Data Using SQL

Creating a new table with CREATE TABLE AS (SELECT ...):

spark.sql("""
CREATE TABLE my_table
USING com.treasuredata.spark
OPTIONS(table '(database name).(table name)')
AS SELECT ...
""")

Limitations

  • overwrite mode for UDP tables is not supported yet. You need to create a UDP table using Presto first, then use append mode to insert data into the table, as shown in the sketch below.
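
A minimal sketch of this workaround, assuming a DataFrame df with a long event_id column and a hypothetical my_db database. The UDP table is created here with the documented td.createLongPartitionedTable helper; an equivalent Presto CREATE TABLE also works.

// Create the UDP table first (overwrite mode cannot do this for UDP tables)
td.createLongPartitionedTable("my_db.events_udp", "event_id")

// Then append the DataFrame contents into the existing UDP table
df.write
  .mode("append")
  .format("com.treasuredata.spark")
  .option("table", "my_db.events_udp")
  .save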