[superset][churn] validating presto failures
# coding: utf-8
# # Validating Presto Errors
#
# Presto fails on a simple query that scans the entire churn table. The table was registered in the Hive metastore from the data in the `telemetry-parquet` bucket:
#
# ```
# parquet2hive -ulv 1 s3://telemetry-parquet/churn | bash
# ```
#
# The query fails with a generic internal error caused by a null pointer exception:
#
# ```
# Error Type: INTERNAL_ERROR
# Error Code: GENERIC_INTERNAL_ERROR (65536)
# Stack Trace:
# java.lang.NullPointerException
# at com.facebook.presto.spi.type.VarcharType.writeSlice(VarcharType.java:160)
# at com_facebook_presto_$gen_CursorProcessor_15.project_9(Unknown Source)
# at com_facebook_presto_$gen_CursorProcessor_15.process(Unknown Source)
# at com.facebook.presto.operator.ScanFilterAndProjectOperator.getOutput(ScanFilterAndProjectOperator.java:232)
# at com.facebook.presto.operator.Driver.processInternal(Driver.java:378)
# at com.facebook.presto.operator.Driver.processFor(Driver.java:301)
# at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:622)
# at com.facebook.presto.execution.TaskExecutor$PrioritizedSplitRunner.process(TaskExecutor.java:555)
# at com.facebook.presto.execution.TaskExecutor$Runner.run(TaskExecutor.java:691)
# at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
# at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
# at java.lang.Thread.run(Thread.java:745)
# ```
#
# This might be related to how Presto handles Parquet data. A similar null pointer exception was previously caused by Presto's default behavior of resolving columns by relative offset rather than by name; that one was solved by setting `hive.parquet.use-column-names=true`.
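#
# To make the offset-versus-name distinction concrete, here is a minimal toy sketch in plain Python. It is not Presto's implementation: the column lists, the sample row, and the helpers `read_by_position` and `read_by_name` are all hypothetical. It only shows how positional lookup can return swapped values when a file's column order differs from the table's. Both lookups return `None` for a column the file lacks.

```python
# Toy model only: these names are hypothetical and do not come from Presto.
table_columns = ["source", "medium", "campaign"]

# A file written with a different column order and with one column missing.
file_columns = ["medium", "source"]
file_row = ["organic", "google"]


def read_by_position(row, table_columns):
    """Resolve each table column by its ordinal position in the file row."""
    return {
        name: row[i] if i < len(row) else None
        for i, name in enumerate(table_columns)
    }


def read_by_name(row, file_columns, table_columns):
    """Resolve each table column by looking up its name in the file schema."""
    by_name = dict(zip(file_columns, row))
    return {name: by_name.get(name) for name in table_columns}


by_position = read_by_position(file_row, table_columns)
by_name = read_by_name(file_row, file_columns, table_columns)
print(by_position)
print(by_name)
```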
# In[7]:
df = spark.read.parquet('s3://telemetry-parquet/churn/v2')
df.createOrReplaceTempView('churn')
# In[14]:
query = """
SELECT COUNT(*) FROM churn
WHERE source<>'unknown' or medium<>'unknown' or campaign<>'unknown' or content<>'unknown'
"""
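# Run and time the query in Spark (equivalent to the `%time` line magic in the notebook).
get_ipython().run_line_magic('time', 'spark.sql(query).show()')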