Skip to content

Instantly share code, notes, and snippets.

@ihainan
Created June 13, 2019 10:26
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save ihainan/828343315ab965fe1e6cd0988bfa69b1 to your computer and use it in GitHub Desktop.
Save ihainan/828343315ab965fe1e6cd0988bfa69b1 to your computer and use it in GitHub Desktop.

Spark / MLeap Upgradation

Spark 2.2

  • New algorithms in DataFrame-based API
    • SPARK-14709: LinearSVC (Linear SVM Classifier) (Scala/Java/Python/R)
    • SPARK-19635: ChiSquare test in DataFrame-based API (Scala/Java/Python)
    • SPARK-19636: Correlation in DataFrame-based API (Scala/Java/Python)
    • SPARK-13568: Imputer feature transformer for imputing missing values (Scala/Java/Python)
    • SPARK-18929: Add Tweedie distribution for GLMs (Scala/Java/Python/R)
    • SPARK-14503: FPGrowth frequent pattern mining and AssociationRules (Scala/Java/Python/R)
  • Existing algorithms added to Python and R APIs
    • SPARK-18239: Gradient Boosted Trees ®
    • SPARK-18821: Bisecting K-Means ®
    • SPARK-18080: Locality Sensitive Hashing (LSH) (Python)
    • SPARK-6227: Distributed PCA and SVD for PySpark (in RDD-based API)
  • Major bug fixes
    • SPARK-19110: DistributedLDAModel.logPrior correctness fix
    • SPARK-17975: EMLDAOptimizer fails with ClassCastException (caused by GraphX checkpointing bug)
    • SPARK-18715: Fix wrong AIC calculation in Binomial GLM
    • SPARK-16473: BisectingKMeans failing during training with “java.util.NoSuchElementException: key not found” for certain inputs
    • SPARK-19348: pyspark.ml.Pipeline gets corrupted under multi-threaded use
    • SPARK-20047: Box-constrained Logistic Regression

Spark 2.3

  • Highlights
    • ML Prediction now works with Structured Streaming, using updated APIs. Details below.
  • New/Improved APIs
    • [SPARK-21866]: Built-in support for reading images into a DataFrame (Scala/Java/Python)
    • [SPARK-19634]: DataFrame functions for descriptive summary statistics over vector columns (Scala/Java)
    • [SPARK-14516]: ClusteringEvaluator for tuning clustering algorithms, supporting Cosine silhouette and squared Euclidean silhouette metrics (Scala/Java/Python)
    • [SPARK-3181]: Robust linear regression with Huber loss (Scala/Java/Python)
    • [SPARK-13969]: FeatureHasher transformer (Scala/Java/Python)
    • Multiple column support for several feature transformers:
    • [SPARK-21633] and [SPARK-21542]: Improved support for custom pipeline components in Python.
  • New Features
    • [SPARK-21087]: CrossValidator and TrainValidationSplit can collect all models when fitting (Scala/Java). This allows you to inspect or save all fitted models.
    • [SPARK-19357]: Meta-algorithms CrossValidator, TrainValidationSplit, OneVsRest` support a parallelism Param for fitting multiple sub-models in parallel Spark jobs
    • [SPARK-17139]: Model summary for multinomial logistic regression (Scala/Java/Python)
    • [SPARK-18710]: Add offset in GLM
    • [SPARK-20199]: Added featureSubsetStrategy Param to GBTClassifier and GBTRegressor. Using this to subsample features can significantly improve training speed; this option has been a key strength of xgboost.
  • Other Notable Changes
    • [SPARK-22156] Fixed Word2Vec learning rate scaling with num iterations. The new learning rate is set to match the original Word2Vec C code and should give better results from training.
    • [SPARK-22289] Add JSON support for Matrix parameters (This fixed a bug for ML persistence with LogisticRegressionModel when using bounds on coefficients.)
    • [SPARK-22700] Bucketizer.transform incorrectly drops row containing NaN. When Param handleInvalid was set to “skip,” Bucketizer would drop a row with a valid value in the input column if another (irrelevant) column had a NaN value.
    • [SPARK-22446] Catalyst optimizer sometimes caused StringIndexerModel to throw an incorrect “Unseen label” exception when handleInvalid was set to “error.” This could happen for filtered data, due to predicate push-down, causing errors even after invalid rows had already been filtered from the input dataset.
    • [SPARK-21681] Fixed an edge case bug in multinomial logistic regression that resulted in incorrect coefficients when some features had zero variance.
    • Major optimizations:
      • [SPARK-22707] Reduced memory consumption for CrossValidator
      • [SPARK-22949] Reduced memory consumption for TrainValidationSplit
      • [SPARK-21690] Imputer should train using a single pass over the data
      • [SPARK-14371] OnlineLDAOptimizer avoids collecting statistics to the driver for each mini-batch.

MLeap 0.8.0

Fix Existing Spark Integration Issues

Sync with Spark 2.2

  • b323879 sync GBT Classifier with Spark 2.2 (#236)
    • add rawPrediction and probability columns to GBT Classifier
    • add loss functions for probabilistic classifiers
    • add reference config for spark 2.2
  • 89d6ca5 add "keep" as a invalid handling mode to StringIndexerModel (#226)
  • 9196e54 Handle null values like Spark does (#262)
  • 185cdf4 Feature/core types - overhaul of typing system (#250)
    • Extend Model for ElementwiseProductModel and PipelineModel
    • Add implicit context to some Ops
    • Remove remaining type from Socket
    • add numFeatures to LogisticRegressionModel based on coefficients/coefficientsMatrix
    • add numFeatures to SupportVectorMachineModel based on coefficients
    • fix SparkTransformBuilderSpec
    • adding input/output schema and fixing or adding tests for outstanding models
    • scikit integration fixes
    • Makes it much easier to implement both MLeap transformer/Spark transformer serializers
    • Offers full test coverage of input/output schemas for transformers
    • Fully-typed pipelines mean that every input/output of a transformer is known by MLeap
  • d076532 Allow OneHotEncoder etc. to be copied (#219)
    • Allow OneHotEncoder to be copied.
    • Same fix for sibling types.

Serialization / Deserialization

  • 0c3329f Fix LogisticRegressionOpV21 for binary classification (#264)

  • 7be8bdd Automatic casting of data types in leap frame and Spark DataFrame <--> LeapFrame conversion (#260)

  • c6c4b4d Fix LR threshold/thresholds when loading model into spark pipeline (#265)

  • 3de5d79 Fix null string mleap -> spark type conversion. (#261)

  • 5b82afe Fix up some Spark bundle ops and make parity specs more flexible (#263)

  • 8d7f713 Automatically set spark registry version based on version string. (#248)

  • 6181ead Use default registry for Spark bundle registry if it's defined. (#251)

Others

  • a0d2218 XGBoost Integration (#259)
  • 7be8bdd Automatic casting of data types in leap frame and Spark DataFrame <--> LeapFrame conversion (#260)

MLeap 0.9.0

Fix Existing Spark Integration Issues

  • 1c36b9c Fix MaxAbsScaler. (#273)
  • 0c3a4d5 Fix issue with MaxAbsScaler (#274)
  • 33ee871 Mark input to RegexTokenizer and Tokenizer as nonnullable to prevent NPEs (#278)
  • 8d64f66 Fix typedExec copying inputs from the model for the UDF and change spark parity base to check for typedExec for Simple and Multi Transformers (#280)
  • 6f681f7 Add empty strings to null numeric casting (#317)

Sync with Spark 2.2 / 2.3

  • 06d66db Implement skip logic for string indexer. (#289)

  • e3070b5 issue 231 - support sparse data in standard scaler (#300)

    • issue 231 - support sparse data in standard scaler
    • issue 229 - update std scaler code to match Spark
  • 0130150 fix keep_invalid being lost when a model gets deserialized back in Spark (#312)

    • fix keep_invalid being lost when a model gets deserialized back in Spark
    • re-trigger the build
  • 85e00af remove custom mleap one hot encoder

  • 7dce151 Improve reverse string indexer with list inputs/outputs. (#350)

  • 8e00ee8 Issue #79: add ALS support - part 1 (#347)

    • add support for ALS predictions in MLeap - part 1

      • note: some precision issues with parity test to be fixed
    • simplify to remove need for vectors

    • same logic as spark 2.2.0

    • fix parity test

    • Merge remote-tracking branch 'remotes/upstream/master' into issue_79_als

    • fix copy paste error

  • 8bde434 [Ready For Merge] Spark 2.3 Support (#364)

    • Upgrade spark dependency to 2.3 and ensure all existing unit tests pass.

    • Remove cross compile for 2.10, as spark 2.3 isn't prebuilt for scala 2.10 anymore.

    • Initial work on support for FeatureHasher transformer

    • Few cleanup items and more/better tests for FeatureHasher

    • Update README.md to remove references to Scala 2.10 and cross compilation.

Serialization / Deserialization

  • 4acb81f Refactor LeapFrame to make a consistent interface (#286)
  • 08cde5f Feature/pypi version (#287)

Others

  • b8c38f6 Asynchronous interface for transformers (#288)

    • Add asynchronous interface for MLeap transformers.
    • Update with clean leap frame interface.
    • Fix async transform for scala 2.10

    MLeap 0.10

    Fix Existing Spark Integration Issues

    • 682679f fix StringIndexer serialization and deserialization after changes for Spark 2.3 (#383)

Serialization / Deserialization

  • **682679f fix StringIndexer serialization and deserialization after changes for Spark 2.3 (#383)

MLeap 0.11

Fix Existing Spark Integration Issues

  • 7e3ca9d Mleap-386 fixing BinaryLogisticRegression predictions (#387)

Sync with Spark 2.3

  • 8a8d69e Feature/regex indexer (#403)

    • Add regex indexer model.

    • Add regex indexer transformer.

    • Add in RegexIndexerOp.

  • 81872e5 Feature/sentence to vec (#402)

    • Add in alternative kernels for the Word2Vec model.
    • Include kernel in bundle for word to vector.
    • Use sentence size for divisor in default kernel.
    • Make indices optional in Word2Vec transformer.
    • Word2Vec op can use tensor or list for word vectors.
  • 300f990 MLeap Support for Gensim's Word2Vec model (#404)

    • initial commit of serializing gensims' word2vec model along with word2sentence transformer for parity with mleap/spark's sent2vec sqrt kernel

    • initial commit

    • cleaning up test

    • updating test

  • 15e8dd3 Feature/drilldown (#405)

    • Add in Categorical drilldown transformer.
    • Add in bundle op for categorical drilldown model.
  • 2924b4a Add support for cross validator and train validation split. (#406)

  • 3ee879f [Ready to Merge] Support for Spark 2.3's OneHotEncoderModel (#408)

  • c7860af Make 1HE runtime op compatible with both legacy 1HE and 1HEModel. (#417)

Serialization / Deserialization

  • e177e58 Feature/bundle file system (#415)
    • Add in support for bundle file systems.
    • Tighter integration with bundle file systems.
    • Add in bundle HDFS configurations for runtime/spark.
    • Fix up configuration of HDFS for runtime.
    • Make hadoop bundle file system configurable.
    • Add in tests for hadoop file system.
    • Finish testing for BundleFileSystem.
    • Fix spec.
    • Use URI as primary way of saving/loading bundles.
    • Add in README and make slightly more configurable.
    • Add in spec for URI saving/loading.
    • Delete HadoopBundleFileSystem.scala
    • Fix python requirements file.

MLeap 0.12

Fix Existing Spark Integration Issues

  • 0ff7c16 Update MinMaxScalerModel.scala

    • Fix apply bug when use sparse vector

Sync with Spark 2.3

  • 297e0a2 1HE NodeShape backwards compatibility (#433)

MLeap 0.13

Fix Existing Spark Integration Issues

  • ec9e775 fix custom mleap OneVsRest transformer (#492)

Sync with Spark 2.3

  • 5d231f9 added absolute value (abs) function to the MathUnaryModel class
  • 8c07788 ensure backwards compatibility for old version of 1HE

API Changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment