Created
March 14, 2017 22:36
[superset][churn] validating presto failures
# coding: utf-8
# # Validating Presto Errors
#
# Presto fails on a very simple query against the churn dataset that scans the entire table. The table was added to the Hive metastore from the data in `telemetry-parquet`:
#
# ```
# parquet2hive -ulv 1 s3://telemetry-parquet/churn | bash
# ```
#
# The query fails with a generic internal error caused by a null pointer exception:
#
# ```
# Error Type: INTERNAL_ERROR
# Error Code: GENERIC_INTERNAL_ERROR (65536)
# Stack Trace:
# java.lang.NullPointerException
# at com.facebook.presto.spi.type.VarcharType.writeSlice(VarcharType.java:160)
# at com_facebook_presto_$gen_CursorProcessor_15.project_9(Unknown Source)
# at com_facebook_presto_$gen_CursorProcessor_15.process(Unknown Source)
# at com.facebook.presto.operator.ScanFilterAndProjectOperator.getOutput(ScanFilterAndProjectOperator.java:232)
# at com.facebook.presto.operator.Driver.processInternal(Driver.java:378)
# at com.facebook.presto.operator.Driver.processFor(Driver.java:301)
# at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:622)
# at com.facebook.presto.execution.TaskExecutor$PrioritizedSplitRunner.process(TaskExecutor.java:555)
# at com.facebook.presto.execution.TaskExecutor$Runner.run(TaskExecutor.java:691)
# at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
# at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
# at java.lang.Thread.run(Thread.java:745)
# ```
#
# This might be related to how Presto handles Parquet data. A similar null pointer exception was caused by the default behavior of resolving columns by relative offset, and was solved by setting `hive.parquet.use-column-names=true`.
# In[7]: | |
df = spark.read.parquet('s3://telemetry-parquet/churn/v2') | |
df.createOrReplaceTempView('churn') | |
# In[14]: | |
query = """ | |
SELECT COUNT(*) FROM churn | |
WHERE source<>'unknown' or medium<>'unknown' or campaign<>'unknown' or content<>'unknown' | |
""" | |
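# Time the same query in Spark to check that the data itself can be scanned.
# `run_line_magic` replaces the deprecated `magic` call emitted by nbconvert.
get_ipython().run_line_magic('time', 'spark.sql(query).show()')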
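# A possible follow-up check, sketched under assumptions. In SQL, a comparison such as
# `source <> 'unknown'` evaluates to NULL when `source` is NULL, so rows where all four
# columns are NULL are excluded from the count above. Counting NULLs per column may help
# interpret that count. Whether NULL values are related to the Presto failure is only a
# hypothesis here, not something the trace confirms. The helper `build_null_count_query`
# is hypothetical and only builds the SQL text; it does not touch Spark or Presto.

```python
def build_null_count_query(table, columns):
    """Return a SQL query that counts NULL values per column of `table`.

    Hypothetical helper for illustration; it only assembles a string.
    """
    # SUM over a CASE expression counts the rows where the column is NULL.
    counts = ",\n    ".join(
        "SUM(CASE WHEN {0} IS NULL THEN 1 ELSE 0 END) AS {0}_nulls".format(col)
        for col in columns
    )
    return "SELECT\n    COUNT(*) AS total_rows,\n    {}\nFROM {}".format(counts, table)


# The four columns used in the WHERE clause of the original query.
null_query = build_null_count_query(
    'churn', ['source', 'medium', 'campaign', 'content'])
print(null_query)
```

# In the notebook session, the generated text could be passed to `spark.sql(null_query).show()`
# against the `churn` temp view registered above.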