Created November 12, 2015 22:57
#%pylab inline
In [2]:
import dataiku
import dataiku.spark as dkuspark
import pyspark
from pyspark.sql import SQLContext
In [3]:
# Load PySpark
sc = pyspark.SparkContext()
sqlContext = SQLContext(sc)
In [4]:
# Example: Read the descriptor of a Dataiku dataset
mydataset = dataiku.Dataset("csvisit_99k_prepared")
# And read it as a Spark dataframe
df = dkuspark.get_dataframe(sqlContext, mydataset)
# df = mydataset.get_dataframe()
In [5]:
# Example: Get the count of records in the dataframe
df.count()
Out[5]:
96218
In [6]:
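import pandas as pd
# Pull the two columns of interest into a pandas DataFrame on the driver
z = df.select("visit_duration", "page_views_num").toPandas()
# Other plot kinds to try:
#z.plot(kind='line')
#z.plot(kind='bar')
#z.plot(kind='hist')
# Scatter plot: `c` colors each point by visit_duration (`c` is a scatter option, not a line-plot one)
z.plot(kind='scatter', x='visit_duration', y='page_views_num', c='visit_duration');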
In [7]:
# Default plot: one line per column, drawn against the row index
z = df.select("visit_duration", "page_views_num").toPandas()
z.plot()
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x58f76d0>
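The plots above put page_views_num against visit_duration. A numeric summary of the same relationship could look like the sketch below: mean page views per equal-width visit-duration bucket. It runs on plain pandas with synthetic stand-in data, so the helper name `summarize_by_duration`, the bucket count, and the generated values are all assumptions for illustration and say nothing about the real csvisit_99k_prepared dataset.

```python
import numpy as np
import pandas as pd


def summarize_by_duration(frame, bins=5):
    """Mean page views per equal-width visit-duration bucket (hypothetical helper)."""
    buckets = pd.cut(frame["visit_duration"], bins=bins)
    return frame.groupby(buckets, observed=True)["page_views_num"].mean()


# Synthetic stand-in data, NOT values from the real dataset
rng = np.random.default_rng(0)
demo = pd.DataFrame({"visit_duration": rng.integers(0, 600, size=1000)})
demo["page_views_num"] = 1 + demo["visit_duration"] // 60

summary = summarize_by_duration(demo)
print(summary)
```

In the notebook itself, the same helper could presumably be called on `z`, the frame returned by `toPandas()`.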