Bryan Cutler BryanCutler
import io.netty.buffer.ArrowBuf;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.file.ArrowWriter;
import org.apache.arrow.vector.schema.ArrowFieldNode;
import org.apache.arrow.vector.schema.ArrowRecordBatch;
import org.apache.arrow.vector.types.pojo.Field;
@BryanCutler
BryanCutler / pandas_rdd.py
Last active March 14, 2018 05:47
Vectorized UDFs in Python SPARK-21190
class DataFrame(object):
    ...
    def asPandas(self):
        return ArrowDataFrame(self)

class ArrowDataFrame(object):
    """
    Wraps a Python DataFrame to group/window then apply using ``pandas.DataFrame``
    """
@BryanCutler
BryanCutler / PySpark_to_Pandas_with_Arrow.ipynb
Last active January 24, 2019 11:12
Spark to Pandas Conversion with Arrow Example
@BryanCutler
BryanCutler / PySpark_Vectorized_UDFs.ipynb
Last active February 17, 2022 13:57
PySpark vectorized UDFs with Arrow
@BryanCutler
BryanCutler / PySpark_createDataFrame_with_Arrow.ipynb
Last active September 16, 2020 02:30
How to create a Spark DataFrame from Pandas or NumPy with Arrow
@BryanCutler
BryanCutler / start_jupyter_pyspark.sh
Last active July 29, 2022 01:06
How to start a Jupyter Notebook with PySpark Kernel
#!/usr/bin/env bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
@BryanCutler
BryanCutler / tf_arrow_model_training.py
Last active June 28, 2021 16:13
TensorFlow Keras Model Training Example with Apache Arrow Dataset
from functools import partial
import multiprocessing
import os
import socket
import sys
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd
@BryanCutler
BryanCutler / tf_arrow_blog_pt1.py
Last active August 5, 2019 19:15
TensorFlow Arrow Blog Part 1 - Create Sample DataFrame
import numpy as np
import pandas as pd
data = {'label': np.random.binomial(1, 0.5, 10)}
data['x0'] = np.random.randn(10) + 5 * data['label']
data['x1'] = np.random.randn(10) + 5 * data['label']
df = pd.DataFrame(data)
print(df.head())
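The snippet above draws fresh random values on every run, so a seeded variant may be easier to check. This is a minimal sketch: the seed value and the `n_rows` name are assumptions added for illustration and are not part of the gist.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumed seed, only here to make runs repeatable
n_rows = 10        # hypothetical name for the row count used in the gist

# Binary label, plus two features shifted by +5 whenever the label is 1
data = {'label': np.random.binomial(1, 0.5, n_rows)}
data['x0'] = np.random.randn(n_rows) + 5 * data['label']
data['x1'] = np.random.randn(n_rows) + 5 * data['label']
df = pd.DataFrame(data)

# Per-label feature means; the +5 shift should separate the two classes
print(df.groupby('label')[['x0', 'x1']].mean())
```

Grouping by `label` is a quick way to confirm that the two classes are separated before the frame is handed to a dataset.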
@BryanCutler
BryanCutler / tf_arrow_blog_pt2.py
Last active August 5, 2019 19:34
TensorFlow Arrow Blog Part 2 - ArrowDataset
import tensorflow_io.arrow as arrow_io
ds = arrow_io.ArrowDataset.from_pandas(
    df,
    batch_size=2,
    preserve_index=False)
# Make an iterator to the dataset
ds_iter = iter(ds)
@BryanCutler
BryanCutler / tf_arrow_blog_pt3.py
Last active August 5, 2019 17:38
TensorFlow Arrow Blog Part 3 - ArrowFeatherDataset
import tensorflow_io.arrow as arrow_io
from pyarrow.feather import write_feather
# Write the Pandas DataFrame to a Feather file
write_feather(df, '/path/to/df.feather')
# Create the dataset with one or more filenames
ds = arrow_io.ArrowFeatherDataset(
    ['/path/to/df.feather'],
    columns=(0, 1, 2),