30lm32

## pandas_to_spark.py
from pyspark.sql.types import *

# Auxiliary functions
# pandas dtypes -> Spark types
def equivalent_type(f):
  if f == 'datetime64[ns]': return TimestampType()  # keeps the time of day; DateType would drop it
  elif f == 'int64': return LongType()
  elif f == 'int32': return IntegerType()
  elif f == 'float64': return DoubleType()  # float64 is double precision; FloatType is 32-bit
  else: return StringType()

## docker-compose.yml

version: '2'
services:
  zookeeper:
    image: "confluentinc/cp-zookeeper:4.1.0"
    hostname: zookeeper
    ports:
      - "2181:2181"
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

## install-docker.sh
#!/usr/bin/env bash

# https://docs.docker.com/install/linux/docker-ce/ubuntu/
sudo apt-get install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu xenial stable"
sudo apt-get update
sudo apt-get install docker-ce

# https://docs.docker.com/compose/install/

## fit.py
import os

from keras import backend as K
from keras import callbacks
from keras import layers
from keras import models
from keras.wrappers.scikit_learn import KerasClassifier
import pandas as pd
import tensorflow as tf
from sklearn import metrics

## gist:f81e929e5810271292bd08856e2f4512
    // Spark 2.1
    import org.apache.spark.sql.{Row, SparkSession}

    val spark = SparkSession.builder().master("local").getOrCreate()

    // Given a list mixing a string and a double
    val values = List("20030100013280", 1.0)

    // Create `Row` from `Seq`
    val row = Row.fromSeq(values)

    // Create `RDD` from `Row`

## spark_tips_and_tricks.md

      
dusenberrymw / spark_tips_and_tricks.md
Last active January 10, 2025 07:36
Tips and tricks for Apache Spark.

Spark Tips & Tricks

Misc. Tips & Tricks


- If values are integers in [0, 255], Parquet will automatically compress them to 1-byte unsigned integers, decreasing the size of the saved DataFrame by a factor of 8 (compared with 8-byte integers).
- Partition DataFrames into evenly distributed partitions of roughly 128 MB each (an empirical finding). When in doubt, err on the side of more partitions.
- Pay particular attention to the number of partitions when using flatMap, especially if the following operation will result in high memory usage. The flatMap op usually produces a DataFrame with a much larger number of rows, yet the number of partitions stays the same. If a subsequent op then causes a large expansion of memory usage (e.g., converting a DataFrame of indices to a DataFrame of large Vectors), the memory usage per partition may become too high. In this case, it is beneficial to repartition the output of flatMap to a number of partitions that will safely allow for appropriate partition memory sizes, based upon the
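
The partition-sizing tips above can be turned into a small calculation. This is a minimal sketch, not part of the original gist: the helper name `suggested_num_partitions` and the 10 GiB size estimate are hypothetical, and the only Spark call mentioned (in a comment) is `DataFrame.repartition`. It assumes you already have a rough estimate of the data size after the expanding operation.

```python
import math

# ~128 MB per partition, the empirical target from the tip above.
TARGET_PARTITION_BYTES = 128 * 1024 * 1024


def suggested_num_partitions(estimated_total_bytes, target_bytes=TARGET_PARTITION_BYTES):
    """Return a partition count that keeps each partition at or below target_bytes.

    Rounding up errs on the side of more partitions.
    """
    if estimated_total_bytes <= 0:
        return 1
    return max(1, math.ceil(estimated_total_bytes / target_bytes))


# Hypothetical example: the output of a flatMap is estimated at 10 GiB.
n = suggested_num_partitions(10 * 1024**3)
print(n)  # 80

# With PySpark, the result could then be applied as: df = df.repartition(n)
```

Rounding up rather than down follows the advice to err on the higher side for the number of partitions.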


## keras_gensim_embeddings.py
from __future__ import print_function

import json
import os
import numpy as np

from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from keras.engine import Input
from keras.layers import Embedding, merge

## something2vec.md

      
nzw0301 / something2vec.md
Last active April 24, 2025 03:32

*2vec papers


act2vec, trace2vec, log2vec, model2vec https://link.springer.com/chapter/10.1007/978-3-319-98648-7_18
apk2vec https://arxiv.org/abs/1809.05693
app2vec http://paul.rutgers.edu/~qma/research/ma_app2vec.pdf
ast2vec https://arxiv.org/abs/2103.11614
attribute2vec https://arxiv.org/abs/2004.01375
author2vec http://dl.acm.org/citation.cfm?id=2889382
baller2vec https://arxiv.org/abs/2102.03291
bb2vec https://arxiv.org/abs/1809.09621


## curl.md

      
subfuzion / curl.md
Last active October 11, 2025 00:58
curl POST examples

Common Options

-#, --progress-bar
Make curl display a simple progress bar instead of the more informational standard meter.
-b, --cookie <name=data>
Supply cookie with request. If no =, then specifies the cookie file to use (see -c).
-c, --cookie-jar <file name>
File to save response cookies to.
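
To show how the options above combine, here is a minimal sketch that assembles a curl command line from Python. It is an illustration only: the helper `build_curl_command`, the URL, the cookie value, and the file name are all hypothetical, and the command is only printed, not executed.

```python
import shlex


def build_curl_command(url, cookie=None, cookie_jar=None, progress_bar=False):
    """Assemble a curl argument list from the options listed above."""
    args = ["curl"]
    if progress_bar:
        args.append("--progress-bar")  # same as -#
    if cookie is not None:
        # name=data sends that cookie; a value without '=' names a cookie file
        args += ["--cookie", cookie]
    if cookie_jar is not None:
        # file that response cookies are saved to
        args += ["--cookie-jar", cookie_jar]
    args.append(url)
    return args


# Hypothetical values, for illustration only.
cmd = build_curl_command(
    "https://example.com/login",
    cookie="session=abc123",
    cookie_jar="cookies.txt",
    progress_bar=True,
)
print(shlex.join(cmd))
```

Keeping the arguments as a list avoids shell-quoting mistakes if the command is later passed to `subprocess.run`.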

  
## gist:c30a821239f4818b0709
Below is the Big O performance of common operations on different Java Collections.


List                 | Add  | Remove | Get  | Contains | Next | Data Structure
---------------------|------|--------|------|----------|------|---------------
ArrayList            | O(1) |  O(n)  | O(1) |   O(n)   | O(1) | Array
LinkedList           | O(1) |  O(1)  | O(n) |   O(n)   | O(1) | Linked List
CopyOnWriteArrayList | O(n) |  O(n)  | O(1) |   O(n)   | O(1) | Array