@P7h
Forked from jreuben11/SparkML-QuickRef.md
Created March 24, 2016 10:12
Spark.ml Pipelines QuickRef

in a nutshell: fit trainingData (train a model), transform testData (predict with model)

  • Transformer: DataFrame => DataFrame
  • Estimator: DataFrame => Transformer
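The two contracts above can be sketched without Spark at all. The following is a minimal plain-Python illustration, assuming a "DataFrame" is just a list of row dicts; the class names mirror Spark's `StringIndexer` but the code is a stand-in, not the Spark API, and the `label_index` column name and alphabetical tie-breaking are assumptions made for the example.

```python
from collections import Counter


class StringIndexerModel:
    """Transformer: DataFrame => DataFrame (here: list of dicts => list of dicts)."""

    def __init__(self, index_of):
        self.index_of = index_of

    def transform(self, rows):
        # Return new rows with an added column; the input rows are left untouched.
        return [{**row, "label_index": self.index_of[row["label"]]} for row in rows]


class StringIndexer:
    """Estimator: DataFrame => Transformer."""

    def fit(self, rows):
        counts = Counter(row["label"] for row in rows)
        # Most frequent label gets index 0.
        # Assumption: ties are broken alphabetically to keep the sketch deterministic.
        ordered = sorted(counts, key=lambda label: (-counts[label], label))
        return StringIndexerModel({label: i for i, label in enumerate(ordered)})


training_data = [{"label": "spam"}, {"label": "ham"}, {"label": "ham"}]
test_data = [{"label": "spam"}, {"label": "ham"}]

model = StringIndexer().fit(training_data)   # fit trainingData (train a model)
predictions = model.transform(test_data)     # transform testData (apply the model)
```

The fitted model is itself a Transformer, which is why a pipeline can chain Estimators and Transformers as uniform stages.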

# Transformers

  • Tokenizer: sentence => words
  • RegexTokenizer: sentence => words - setPattern
  • HashingTF: terms => feature vectors based on frequency - setNumFeatures
  • StopWordsRemover: filter - setStopWords
  • NGram: tokens => sequence of n-grams - setN
  • Binarizer: number => 0/1 threshold - setThreshold
  • PCA: statistical dimensionality reduction (projects the feature set onto its top k principal components) - setK
  • PolynomialExpansion: feature set dimensionality expansion (~ Taylor Series) - setDegree
  • DCT: time series => frequencies (via cosine waves) - setInverse
  • StringIndexer: strings => label indices ordered by frequency (most frequent = 0)
  • IndexToString: dual of StringIndexer
  • OneHotEncoder: category feature => 1-hot bitset
  • VectorIndexer: automatically detect and index categorical features in the featureset - setMaxCategories
  • Normalizer: scale each feature vector to unit p-norm - setP
  • StandardScaler: features to z-scores - setWithStd, setWithMean
  • MinMaxScaler: scale feature to range [0, 1]
  • Bucketizer: continuous to discrete - setSplits
  • ElementwiseProduct: apply weights to vector features - setScalingVec
  • SQLTransformer: SQL over featureset ! - setStatement
  • VectorAssembler: combine multi-columns into a single vector column
  • QuantileDiscretizer: continuous to discrete - setNumBuckets
  • VectorSlicer: select subset of featureset - setIndices, setNames
  • RFormula: specify labelled point dependent / independent variables - setFormula("y ~ x1 + x2"), setFeaturesCol, setLabelCol
  • ChiSqSelector: select features with most predictive power - setNumTopFeatures, setFeaturesCol, setLabelCol
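To make a few of the one-line descriptions above concrete, here is a hedged plain-Python sketch of what HashingTF, Binarizer and Bucketizer compute on a single value. The function names are hypothetical, and `zlib.crc32` is an assumption chosen for reproducibility; Spark uses its own hash function, so the bucket indices will differ from Spark's.

```python
import bisect
import zlib


def hashing_tf(terms, num_features):
    """terms => fixed-length term-frequency vector via the hashing trick."""
    vector = [0.0] * num_features
    for term in terms:
        # Assumption: crc32 stands in for Spark's hash function.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector


def binarize(value, threshold):
    """number => 1.0 if strictly greater than the threshold, else 0.0."""
    return 1.0 if value > threshold else 0.0


def bucketize(value, splits):
    """continuous => bucket index; buckets are [lo, hi), the last one also includes hi."""
    if value < splits[0] or value > splits[-1]:
        raise ValueError("value outside the range covered by splits")
    if value == splits[-1]:
        return len(splits) - 2
    return bisect.bisect_right(splits, value) - 1


tf_vector = hashing_tf(["spark", "ml", "spark"], num_features=8)
splits = [float("-inf"), 0.0, 10.0, float("inf")]
bucket = bucketize(3.5, splits)
```

Because the hashing trick maps terms to indices without a vocabulary, distinct terms can collide in the same slot; a larger `num_features` makes that less likely.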

# Estimators

  • IDF: down-weights high frequency terms
  • Word2Vec: document (sequence of words) => fixed-size vector (average of its word embeddings) - setVectorSize, setMinCount
  • CountVectorizer: document => token count - setVocabSize, setMinDF
  • LogisticRegression - setMaxIter, setRegParam, setElasticNetParam, setTol, setFitIntercept
  • DecisionTreeClassifier
  • RandomForestClassifier - setNumTrees
  • GBTClassifier - setMaxIter
  • MultilayerPerceptronClassifier - setLayers, setBlockSize, setSeed, setMaxIter
  • OneVsRest - setClassifier
  • DecisionTreeRegressor
  • RandomForestRegressor
  • GBTRegressor
  • AFTSurvivalRegression - setQuantileProbabilities, setQuantilesCol
  • KMeans - setK
  • LDA - setK, setMaxIter
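As an illustration of why IDF is an Estimator rather than a plain Transformer, the sketch below fits document frequencies on a corpus and then reweights term-frequency vectors. It is plain Python, not the Spark API; the smoothed formula `log((m + 1) / (df + 1))` is the one described in the Spark MLlib feature extraction guide, and representing each document as a dense list of term frequencies is an assumption for brevity.

```python
import math


class IDFModel:
    """Transformer produced by IDF.fit: rescales each term frequency by its IDF weight."""

    def __init__(self, idf):
        self.idf = idf

    def transform(self, tf_vectors):
        return [[tf * w for tf, w in zip(vec, self.idf)] for vec in tf_vectors]


class IDF:
    """Estimator: needs a pass over the whole corpus to count document frequencies."""

    def fit(self, tf_vectors):
        m = len(tf_vectors)          # number of documents
        n = len(tf_vectors[0])       # number of features (terms)
        # Document frequency: in how many documents does each term appear?
        df = [sum(1 for vec in tf_vectors if vec[j] > 0) for j in range(n)]
        return IDFModel([math.log((m + 1) / (d + 1)) for d in df])


tf = [
    [1.0, 2.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 3.0],
]
idf_model = IDF().fit(tf)
weighted = idf_model.transform(tf)
```

The first term occurs in every document, so its weight is zero and it drops out of the weighted vectors, which is the down-weighting of high-frequency terms that the IDF entry refers to.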

# Models

  • CountVectorizerModel
  • LogisticRegressionModel - coefficients, intercept, setThreshold, summary
  • DecisionTreeClassificationModel
  • RandomForestClassificationModel
  • GBTClassificationModel
  • DecisionTreeRegressionModel
  • RandomForestRegressionModel
  • GBTRegressionModel
  • LDAModel - logLikelihood, logPerplexity

# Evaluators

  • BinaryLogisticRegressionSummary - fMeasureByThreshold, areaUnderROC, roc
  • BinaryClassificationEvaluator - default metric name: "areaUnderROC"
  • MulticlassClassificationEvaluator - default metric name: "f1"
  • MulticlassMetrics - confusionMatrix, falsePositiveRate
  • RegressionEvaluator - default metric name: "rmse"
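For readers unsure what the metric names above measure, here is a small plain-Python sketch of "rmse" and "areaUnderROC". The function names are hypothetical, and the pairwise-ranking definition of the ROC area is one common formulation, assumed here for brevity; Spark derives the value from the ROC curve itself, so results on large data may differ slightly.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between true labels and predictions."""
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))


def area_under_roc(labels, scores):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


error = rmse([0.0, 0.0], [3.0, 4.0])
auc = area_under_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An evaluator returns a single number like these, which is what a model-selection loop compares across candidate parameter settings.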