30lm32

## spark_tips_and_tricks.md

      
Created April 12, 2020 17:59, forked from dusenberrymw/spark_tips_and_tricks.md

Tips and tricks for Apache Spark.
### Spark Tips & Tricks

#### Misc. Tips & Tricks


- If values are integers in [0, 255], Parquet will automatically store them as 1-byte unsigned integers, decreasing the size of the saved DataFrame by a factor of 8 (relative to 8-byte values).
- Partition DataFrames into evenly distributed partitions of ~128 MB each (an empirical finding). When in doubt, err on the side of more partitions.
- Pay particular attention to the number of partitions when using flatMap, especially if the following operation will result in high memory usage. The flatMap op usually produces a DataFrame with a much larger number of rows, yet the number of partitions remains the same. Thus, if a subsequent op greatly expands memory usage (e.g. converting a DataFrame of indices to a DataFrame of large Vectors), the memory usage per partition may become too high. In this case, it is beneficial to repartition the output of flatMap to a number of partitions that will safely allow for appropriate partition memory sizes, based upon the
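
The partition-sizing advice above comes down to simple arithmetic, sketched here in plain Python. The helper `estimate_num_partitions` and the example row counts and sizes are hypothetical, chosen only for illustration. The only figure taken from the tips is the ~128 MB target.

```python
# ~128 MB per partition, the empirical target from the tips above.
TARGET_PARTITION_BYTES = 128 * 1024 * 1024


def estimate_num_partitions(num_rows, bytes_per_row,
                            target_bytes=TARGET_PARTITION_BYTES):
    """Hypothetical helper: partition count that keeps each partition
    at or below target_bytes, given rough integer size estimates."""
    if num_rows < 0 or bytes_per_row < 0 or target_bytes <= 0:
        raise ValueError("sizes must be non-negative and target positive")
    total_bytes = num_rows * bytes_per_row
    # Integer ceiling division: rounding up errs on the side of
    # more partitions, as the tips recommend.
    return max(1, (total_bytes + target_bytes - 1) // target_bytes)


# Assumed scenario: flatMap yields 10 million rows, and the next op
# expands each row into a vector of 1,000 float64 values (8,000 bytes).
rows_after_flatmap = 10_000_000
bytes_per_expanded_row = 1_000 * 8
num_partitions = estimate_num_partitions(rows_after_flatmap,
                                         bytes_per_expanded_row)
print(num_partitions)
```

In a Spark job, a count like this could be passed to `repartition(num_partitions)` on the output of flatMap, before the memory-heavy step runs. The per-row size is an estimate, so it is worth checking actual partition sizes on real data.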


## something2vec.md

      
Created March 12, 2019 06:07, forked from nzw0301/something2vec.md
### *2vec papers


- act2vec, trace2vec, log2vec, model2vec: https://link.springer.com/chapter/10.1007/978-3-319-98648-7_18
- apk2vec: https://arxiv.org/abs/1809.05693
- app2vec: http://paul.rutgers.edu/~qma/research/ma_app2vec.pdf
- author2vec: http://dl.acm.org/citation.cfm?id=2889382
- bb2vec: https://arxiv.org/abs/1809.09621
- behavior2vec: https://dl.acm.org/citation.cfm?id=3184454
- care2vec: https://arxiv.org/abs/1812.00715
- cat2vec: http://104.155.136.4:3000/forum?id=HyNxRZ9xg


## docker-compose.yml

version: '2'
services:
  zookeeper:
    image: "confluentinc/cp-zookeeper:4.1.0"
    hostname: zookeeper
    ports:
      - "2181:2181"
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
