- If values are integers in [0, 255], Parquet will automatically compress to use 1 byte unsigned integers, thus decreasing the size of saved DataFrame by a factor of 8.
- Partition DataFrames to have evenly-distributed, ~128MB partition sizes (empirical finding). Always err on the higher side w.r.t. number of partitions.
- Pay particular attention to the number of partitions when using
flatMap
, especially if the following operation will result in high memory usage. TheflatMap
op usually results in a DataFrame with a [much] larger number of rows, yet the number of partitions will remain the same. Thus, if a subsequent op causes a large expansion of memory usage (i.e. converting a DataFrame of indices to a DataFrame of large Vectors), the memory usage per partition may become too high. In this case, it is beneficial to repartition the output offlatMap
to a number of partitions that will safely allow for appropriate partition memory sizes, based upon the
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
from pyspark.sql.types import * | |
# Auxiliar functions | |
# Pandas Types -> Sparks Types | |
def equivalent_type(f): | |
if f == 'datetime64[ns]': return DateType() | |
elif f == 'int64': return LongType() | |
elif f == 'int32': return IntegerType() | |
elif f == 'float64': return FloatType() | |
else: return StringType() |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
version: '2' | |
services: | |
zookeeper: | |
image: "confluentinc/cp-zookeeper:4.1.0" | |
hostname: zookeeper | |
ports: | |
- "2181:2181" | |
environment: | |
ZOOKEEPER_CLIENT_PORT: 2181 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#!/usr/bin/env bash | |
# https://docs.docker.com/install/linux/docker-ce/ubuntu/ | |
sudo apt-get install apt-transport-https ca-certificates curl software-properties-common | |
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add - | |
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu xenial stable" | |
sudo apt-get update | |
sudo apt-get install docker-ce | |
# https://docs.docker.com/compose/install/ |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
import os | |
from keras import backend as K | |
from keras import callbacks | |
from keras import layers | |
from keras import models | |
from keras.wrappers.scikit_learn import KerasClassifier | |
import pandas as pd | |
import tensorflow as tf | |
from sklearn import metrics |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
// Spark 2.1 | |
val spark = SparkSession.builder().master("local").getOrCreate() | |
// Given a list of mixture of strings in integers | |
val values = List("20030100013280", 1.0) | |
// Create `Row` from `Seq` | |
val row = Row.fromSeq(values) | |
// Create `RDD` from `Row` |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
from __future__ import print_function | |
import json | |
import os | |
import numpy as np | |
from gensim.models import Word2Vec | |
from gensim.utils import simple_preprocess | |
from keras.engine import Input | |
from keras.layers import Embedding, merge |
- act2vec, trace2vec, log2vec, model2vec https://link.springer.com/chapter/10.1007/978-3-319-98648-7_18
- apk2vec https://arxiv.org/abs/1809.05693
- app2vec http://paul.rutgers.edu/~qma/research/ma_app2vec.pdf
- ast2vec https://arxiv.org/abs/2103.11614
- attribute2vec https://arxiv.org/abs/2004.01375
- author2vec http://dl.acm.org/citation.cfm?id=2889382
- baller2vec https://arxiv.org/abs/2102.03291
- bb2vec https://arxiv.org/abs/1809.09621
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Below are the Big O performance of common functions of different Java Collections. | |
List | Add | Remove | Get | Contains | Next | Data Structure | |
---------------------|------|--------|------|----------|------|--------------- | |
ArrayList | O(1) | O(n) | O(1) | O(n) | O(1) | Array | |
LinkedList | O(1) | O(1) | O(n) | O(n) | O(1) | Linked List | |
CopyOnWriteArrayList | O(n) | O(n) | O(1) | O(n) | O(1) | Array |
NewerOlder