Dejan Simic (simicd) · GitHub Gists
@simicd
simicd / cloudSettings
Last active August 2, 2020 11:29
Visual Studio Code Settings Sync Gist
{"lastUpload":"2020-08-02T11:29:35.550Z","extensionVersion":"v3.4.3"}
@simicd
simicd / Dockerfile
Last active November 17, 2019 20:34
PySpark dockerfile
# Source: https://hub.docker.com/r/jupyter/pyspark-notebook
# Copyright (c) Jupyter Development Team.
# Distributed under the terms of the Modified BSD License.
ARG BASE_CONTAINER=jupyter/scipy-notebook
FROM $BASE_CONTAINER
LABEL maintainer="Jupyter Project <jupyter@googlegroups.com>"
USER root
@simicd
simicd / custom.css
Last active December 29, 2019 15:19
Jupyter Notebook styling sheet
/* Main section */
#notebook-container{
box-shadow: none !important; /* Remove box shadows */
max-width: 1000px;
}
.container {
width: 80% !important;
}
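Classic Jupyter Notebook picks this stylesheet up from custom/custom.css inside the Jupyter config directory. A small Python sketch to locate that path (the ~/.jupyter location is the typical default, not something stated in the gist):

# Locate where custom.css should live for the classic Jupyter Notebook
from pathlib import Path
from jupyter_core.paths import jupyter_config_dir

target = Path(jupyter_config_dir()) / "custom" / "custom.css"
print(target)  # typically ~/.jupyter/custom/custom.css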
@simicd
simicd / spark_tips_and_tricks.md
Created February 14, 2020 20:58 — forked from dusenberrymw/spark_tips_and_tricks.md
Tips and tricks for Apache Spark.

Spark Tips & Tricks

Misc. Tips & Tricks

  • If values are integers in [0, 255], Parquet will automatically compress them as 1-byte unsigned integers, decreasing the size of the saved DataFrame by a factor of 8.
  • Partition DataFrames to have evenly-distributed, ~128MB partition sizes (empirical finding). Always err on the higher side with respect to the number of partitions.
  • Pay particular attention to the number of partitions when using flatMap, especially if the following operation will result in high memory usage. The flatMap op usually results in a DataFrame with a [much] larger number of rows, yet the number of partitions will remain the same. Thus, if a subsequent op causes a large expansion of memory usage (e.g. converting a DataFrame of indices to a DataFrame of large Vectors), the memory usage per partition may become too high. In this case, it is beneficial to repartition the output of flatMap to a number of partitions that will safely allow for appropriate partition memory sizes, based upon the expected memory expansion (see the sketch after this list).
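A minimal PySpark sketch of the repartitioning tip above; the 10x expansion factor and the column names are illustrative assumptions, not from the gist:

# Sketch: repartition after an expanding flatMap (expansion factor is illustrative)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("flatmap-repartition").getOrCreate()

df = spark.range(1_000_000)  # a DataFrame of indices
rdd = df.rdd.flatMap(lambda row: [(row.id, j) for j in range(10)])  # ~10x more rows

# Same partition count as before flatMap, but ~10x the data per partition;
# scale the partition count by the expansion factor to stay near the ~128MB target.
expanded = rdd.repartition(df.rdd.getNumPartitions() * 10).toDF(["id", "j"])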
## Read Palmer Station Penguin dataset from GitHub
import os

import pandas as pd
import psutil
import pyarrow as pa
import pyarrow.feather as feather

df = pd.read_csv("https://raw.githubusercontent.com/allisonhorst/"
                 "palmerpenguins/47a3476d2147080e7ceccef4cf70105c808f2cbf/"
                 "data-raw/penguins_raw.csv")

# Increase dataset to 1m rows and reset index (sample numbers 0 to 999'999)
df = df.sample(1_000_000, replace=True).reset_index(drop=True)
# Write to csv
df.to_csv("penguin-dataset.csv")
# Write to parquet
df.to_parquet("penguin-dataset.parquet")
# Write to Arrow (assumption: Feather v2, i.e. the Arrow IPC file format, uncompressed
# so the file can later be memory-mapped without decompression)
# Convert from pandas to Arrow
table = pa.Table.from_pandas(df)
# Write out to file
feather.write_feather(table, "penguin-dataset.arrow", compression="uncompressed")
# Read csv and calculate mean
%%timeit
pd.read_csv("penguin-dataset.csv")["Flipper Length (mm)"].mean()
# Read parquet and calculate mean
%%timeit
pd.read_parquet("penguin-dataset.parquet", columns=["Flipper Length (mm)"]).mean()
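The page's rendering breaks off before an Arrow timing cell; a sketch of what the matching cell plausibly looks like, assuming the Feather file written above:

# Read Arrow and calculate mean
%%timeit
feather.read_feather("penguin-dataset.arrow", columns=["Flipper Length (mm)"]).mean()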
# Measure initial memory consumption (resident set size in MiB)
memory_init = psutil.Process(os.getpid()).memory_info().rss >> 20
# Read csv
col_csv = pd.read_csv("penguin-dataset.csv")["Flipper Length (mm)"]
memory_post_csv = psutil.Process(os.getpid()).memory_info().rss >> 20
# Read parquet
col_parquet = pd.read_parquet("penguin-dataset.parquet", columns=["Flipper Length (mm)"])
memory_post_parquet = psutil.Process(os.getpid()).memory_info().rss >> 20
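The rest of this file fails to render on the page; a sketch of how the comparison plausibly concludes, with a memory-mapped Arrow read (variable names here are assumptions):

# Read Arrow via memory mapping (buffers stay on disk, so heap growth is near zero)
with pa.memory_map("penguin-dataset.arrow") as source:
    col_arrow = pa.ipc.open_file(source).read_all().column("Flipper Length (mm)")
memory_post_arrow = psutil.Process(os.getpid()).memory_info().rss >> 20

# Memory growth per format (MiB)
print(f"csv:     {memory_post_csv - memory_init}")
print(f"parquet: {memory_post_parquet - memory_init}")
print(f"arrow:   {memory_post_arrow - memory_init}")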
@simicd
simicd / Dog Image API.postman_collection.json
Last active August 14, 2020 10:59
Postman array iteration
{
  "info": {
    "_postman_id": "c18ab42d-2677-4ede-b043-99535f4da9f6",
    "name": "Dog Image API",
    "schema": "https://schema.getpostman.com/json/collection/v2.1.0/collection.json"
  },
  "item": [
    {
      "name": "Dog API - Loop through breeds",
      "event": [