ihainan/MLeap_Changes.md

## MLeap_Changes.md

      
Spark / MLeap Upgrade

Spark 2.2


New algorithms in DataFrame-based API

SPARK-14709: LinearSVC (Linear SVM Classifier) (Scala/Java/Python/R)
SPARK-19635: ChiSquare test in DataFrame-based API (Scala/Java/Python)
SPARK-19636: Correlation in DataFrame-based API (Scala/Java/Python)
SPARK-13568: Imputer feature transformer for imputing missing values (Scala/Java/Python)
SPARK-18929: Add Tweedie distribution for GLMs (Scala/Java/Python/R)
SPARK-14503: FPGrowth frequent pattern mining and AssociationRules (Scala/Java/Python/R)


Existing algorithms added to Python and R APIs

SPARK-18239: Gradient Boosted Trees (R)
SPARK-18821: Bisecting K-Means (R)
SPARK-18080: Locality Sensitive Hashing (LSH) (Python)
SPARK-6227: Distributed PCA and SVD for PySpark (in RDD-based API)


Major bug fixes

SPARK-19110: DistributedLDAModel.logPrior correctness fix
SPARK-17975: EMLDAOptimizer fails with ClassCastException (caused by GraphX checkpointing bug)
SPARK-18715: Fix wrong AIC calculation in Binomial GLM
SPARK-16473: BisectingKMeans failing during training with “java.util.NoSuchElementException: key not found” for certain inputs
SPARK-19348: pyspark.ml.Pipeline gets corrupted under multi-threaded use
SPARK-20047: Box-constrained Logistic Regression


Spark 2.3


Highlights

ML Prediction now works with Structured Streaming, using updated APIs. Details below.


New/Improved APIs

[SPARK-21866]: Built-in support for reading images into a DataFrame (Scala/Java/Python)
[SPARK-19634]: DataFrame functions for descriptive summary statistics over vector columns (Scala/Java)
[SPARK-14516]: ClusteringEvaluator for tuning clustering algorithms, supporting Cosine silhouette and squared Euclidean silhouette metrics (Scala/Java/Python)
[SPARK-3181]: Robust linear regression with Huber loss (Scala/Java/Python)
[SPARK-13969]: FeatureHasher transformer (Scala/Java/Python)
Multiple column support for several feature transformers:

[SPARK-13030]: OneHotEncoderEstimator (Scala/Java/Python)
[SPARK-22397]: QuantileDiscretizer (Scala/Java)
[SPARK-20542]: Bucketizer (Scala/Java/Python)


[SPARK-21633] and [SPARK-21542]: Improved support for custom pipeline components in Python.
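For reference on SPARK-3181 in the list above, here is a minimal plain-Python sketch of the textbook Huber loss that robust regression is built on. It is not Spark's code: the function name and the `delta` default are illustrative assumptions, and Spark's own formulation may differ in details such as how residuals are scaled.

```python
def huber_loss(residual: float, delta: float = 1.35) -> float:
    """Textbook Huber loss (illustrative; not Spark's implementation).

    Quadratic for small residuals, linear for large ones, so outliers
    contribute less than they would under squared error.
    """
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual * residual
    return delta * (a - 0.5 * delta)


# Small residuals behave like squared error, large ones grow linearly.
small = huber_loss(0.5)
large = huber_loss(3.0, delta=1.0)
```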


New Features

[SPARK-21087]: CrossValidator and TrainValidationSplit can collect all models when fitting (Scala/Java). This allows you to inspect or save all fitted models.
[SPARK-19357]: Meta-algorithms CrossValidator, TrainValidationSplit, and OneVsRest support a parallelism Param for fitting multiple sub-models in parallel Spark jobs
[SPARK-17139]: Model summary for multinomial logistic regression (Scala/Java/Python)
[SPARK-18710]: Add offset in GLM
[SPARK-20199]: Added featureSubsetStrategy Param to GBTClassifier and GBTRegressor. Using this to subsample features can significantly improve training speed; this option has been a key strength of xgboost.
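The idea behind the parallelism Param in SPARK-19357 can be sketched without Spark: each candidate in a parameter grid is fitted independently, so the fits can run concurrently and the scores are collected in grid order. Everything below is a hypothetical stand-in (`fit_and_score`, `grid_search`, the scoring formula); it only illustrates the pattern, not Spark's API.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_and_score(reg_param: float) -> float:
    # Hypothetical stand-in for "fit a sub-model and evaluate it".
    # The score is made up so that reg_param = 0.1 is the best candidate.
    return -((reg_param - 0.1) ** 2)


def grid_search(param_grid, parallelism: int = 1):
    """Score every candidate, running up to `parallelism` fits at once."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # pool.map preserves the order of param_grid in its results.
        scores = list(pool.map(fit_and_score, param_grid))
    best = max(range(len(param_grid)), key=lambda i: scores[i])
    return param_grid[best], scores


best_param, all_scores = grid_search([0.01, 0.1, 1.0], parallelism=3)
```

The result does not depend on the parallelism level; only wall-clock time does.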


Other Notable Changes

[SPARK-22156] Fixed Word2Vec learning rate scaling with num iterations. The new learning rate is set to match the original Word2Vec C code and should give better results from training.
[SPARK-22289] Add JSON support for Matrix parameters (This fixed a bug for ML persistence with LogisticRegressionModel when using bounds on coefficients.)
[SPARK-22700] Bucketizer.transform incorrectly drops row containing NaN. When Param handleInvalid was set to “skip,” Bucketizer would drop a row with a valid value in the input column if another (irrelevant) column had a NaN value.
[SPARK-22446] Catalyst optimizer sometimes caused StringIndexerModel to throw an incorrect “Unseen label” exception when handleInvalid was set to “error.” This could happen for filtered data, due to predicate push-down, causing errors even after invalid rows had already been filtered from the input dataset.
[SPARK-21681] Fixed an edge case bug in multinomial logistic regression that resulted in incorrect coefficients when some features had zero variance.
Major optimizations:

[SPARK-22707] Reduced memory consumption for CrossValidator
[SPARK-22949] Reduced memory consumption for TrainValidationSplit
[SPARK-21690] Imputer should train using a single pass over the data
[SPARK-14371] OnlineLDAOptimizer avoids collecting statistics to the driver for each mini-batch.
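The intended "skip" behaviour described under SPARK-22700 above can be sketched in plain Python: a row is dropped only when the input column itself is NaN, never because some other column is. The function name, the dict-based rows and the `bucket` output key are assumptions made for illustration; this is not Spark's implementation.

```python
import math
from bisect import bisect_right


def bucketize(rows, input_col, splits, handle_invalid="error"):
    """Assign each row a bucket index based on rows[i][input_col]."""
    last_bucket = len(splits) - 2
    out = []
    for row in rows:
        x = row[input_col]
        if isinstance(x, float) and math.isnan(x):
            if handle_invalid == "skip":
                continue  # drop only rows whose *input* value is NaN
            raise ValueError("NaN in input column")
        if x < splits[0] or x > splits[-1]:
            raise ValueError(f"value {x} outside of splits")
        # Buckets are [s_i, s_{i+1}); the last bucket also includes its upper bound.
        idx = min(bisect_right(splits, x) - 1, last_bucket)
        out.append({**row, "bucket": float(idx)})
    return out


rows = [
    {"x": 0.5, "other": float("nan")},  # NaN in an unrelated column: kept
    {"x": float("nan"), "other": 1.0},  # NaN in the input column: skipped
    {"x": 2.0, "other": 3.0},
]
kept = bucketize(rows, "x", [0.0, 1.0, 2.0], handle_invalid="skip")
```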


MLeap 0.8.0

Fix Existing Spark Integration Issues


ee76cc7 Fix binary logistic regression predict method. (#211)

BinaryLogisticRegressionModel.
Fix issue: LogisticRegressionModel returns wrong predictions.


Sync with Spark 2.2


b323879 sync GBT Classifier with Spark 2.2 (#236)

add rawPrediction and probability columns to GBT Classifier
add loss functions for probabilistic classifiers
add reference config for spark 2.2


89d6ca5 add "keep" as a invalid handling mode to StringIndexerModel (#226)
9196e54 Handle null values like Spark does (#262)
185cdf4 Feature/core types - overhaul of typing system (#250)

Extend Model for ElementwiseProductModel and PipelineModel
Add implicit context to some Ops
Remove remaining type from Socket
add numFeatures to LogisticRegressionModel based on coefficients/coefficientsMatrix
add numFeatures to SupportVectorMachineModel based on coefficients
fix SparkTransformBuilderSpec
adding input/output schema and fixing or adding tests for outstanding models
scikit integration fixes
Makes it much easier to implement both MLeap transformer/Spark transformer serializers
Offers full test coverage of input/output schemas for transformers
Fully-typed pipelines mean that every input/output of a transformer is known by MLeap


d076532 Allow OneHotEncoder etc. to be copied (#219)

Allow OneHotEncoder to be copied.
Same fix for sibling types.
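As a reminder of what the "keep" mode from 89d6ca5 above means for StringIndexerModel, here is a minimal plain-Python sketch of the three handleInvalid modes. The function name and signature are hypothetical; the semantics follow Spark's documented behaviour, where "keep" maps unseen labels to an extra index equal to the number of known labels.

```python
def index_labels(values, labels, handle_invalid="error"):
    """Map string values to indices using a fixed, ordered label list."""
    lookup = {label: float(i) for i, label in enumerate(labels)}
    out = []
    for v in values:
        if v in lookup:
            out.append(lookup[v])
        elif handle_invalid == "keep":
            # Unseen labels share one extra index: len(labels).
            out.append(float(len(labels)))
        elif handle_invalid == "skip":
            continue  # drop the row
        else:
            raise ValueError(f"Unseen label: {v}")
    return out


labels = ["a", "b", "c"]
kept = index_labels(["a", "z", "c"], labels, handle_invalid="keep")
skipped = index_labels(["a", "z", "c"], labels, handle_invalid="skip")
```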


Serialization / Deserialization


0c3329f Fix LogisticRegressionOpV21 for binary classification (#264)


7be8bdd Automatic casting of data types in leap frame and Spark DataFrame <--> LeapFrame conversion (#260)


c6c4b4d Fix LR threshold/thresholds when loading model into spark pipeline (#265)


3de5d79 Fix null string mleap -> spark type conversion. (#261)


5b82afe Fix up some Spark bundle ops and make parity specs more flexible (#263)


8d7f713 Automatically set spark registry version based on version string. (#248)


6181ead Use default registry for Spark bundle registry if it's defined. (#251)


Others


a0d2218 XGBoost Integration (#259)
7be8bdd Automatic casting of data types in leap frame and Spark DataFrame <--> LeapFrame conversion (#260)

MLeap 0.9.0

Fix Existing Spark Integration Issues


1c36b9c Fix MaxAbsScaler. (#273)
0c3a4d5 Fix issue with MaxAbsScaler (#274)
33ee871 Mark input to RegexTokenizer and Tokenizer as nonnullable to prevent NPEs (#278)
8d64f66 Fix typedExec copying inputs from the model for the UDF and change spark parity base to check for typedExec for Simple and Multi Transformers (#280)
6f681f7 Add empty strings to null numeric casting (#317)
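The MaxAbsScaler commits above (1c36b9c, 0c3a4d5) do not say what was wrong, so the following is only a reference sketch of what the transformer is expected to compute: each feature divided by its largest absolute value seen during fitting. Function names are hypothetical, and leaving a feature untouched when its maximum is zero is an assumption of this sketch.

```python
def fit_max_abs(rows):
    """Per-feature maximum absolute value over all rows."""
    n = len(rows[0])
    return [max(abs(r[j]) for r in rows) for j in range(n)]


def max_abs_scale(row, max_abs):
    """Scale each feature into [-1, 1] by dividing by its max abs value."""
    # A feature whose max abs value is 0 is all zeros; leave it unchanged.
    return [x / m if m != 0.0 else x for x, m in zip(row, max_abs)]


data = [[1.0, -8.0], [2.0, 4.0], [-4.0, 2.0]]
max_abs = fit_max_abs(data)
scaled = [max_abs_scale(r, max_abs) for r in data]
```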

Sync with Spark 2.2 / 2.3


06d66db Implement skip logic for string indexer. (#289)


e3070b5 issue 231 - support sparse data in standard scaler (#300)

issue 231 - support sparse data in standard scaler
issue 229 - update std scaler code to match Spark
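Why sparse input is cheap to support in a standard scaler, as a rough sketch: when only scaling by the standard deviation (no mean-centering), zeros stay zero, so only the stored entries need to be touched. The function below is hypothetical and represents a sparse vector as parallel index/value lists; mapping a zero standard deviation to 0.0 is an assumption of this sketch.

```python
def scale_sparse(indices, values, std):
    """Divide the stored entries of a sparse vector by per-feature std.

    Assumes no mean-centering, so implicit zeros remain zero and the
    sparsity pattern is preserved.
    """
    scaled = [
        v / std[i] if std[i] != 0.0 else 0.0  # assumption: zero std maps to 0.0
        for i, v in zip(indices, values)
    ]
    return list(indices), scaled


idx, vals = scale_sparse([0, 3], [2.0, 9.0], [2.0, 1.0, 1.0, 3.0])
```

With mean-centering enabled this shortcut no longer holds, because subtracting a non-zero mean turns the implicit zeros into non-zero values.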


0130150 fix keep_invalid being lost when a model gets deserialized back in Spark (#312)

fix keep_invalid being lost when a model gets deserialized back in Spark
re-trigger the build


85e00af remove custom mleap one hot encoder


7dce151 Improve reverse string indexer with list inputs/outputs. (#350)


8e00ee8 Issue #79: add ALS support - part 1 (#347)


add support for ALS predictions in MLeap - part 1

note: some precision issues with parity test to be fixed


simplify to remove need for vectors


same logic as spark 2.2.0


fix parity test


Merge remote-tracking branch 'remotes/upstream/master' into issue_79_als


fix copy paste error


8bde434 [Ready For Merge] Spark 2.3 Support (#364)


Upgrade spark dependency to 2.3 and ensure all existing unit tests pass.


Remove cross compile for 2.10, as spark 2.3 isn't prebuilt for scala 2.10 anymore.


Initial work on support for FeatureHasher transformer


Few cleanup items and more/better tests for FeatureHasher


Update README.md to remove references to Scala 2.10 and cross compilation.


Serialization / Deserialization


4acb81f Refactor LeapFrame to make a consistent interface (#286)
08cde5f Feature/pypi version (#287)

Others


b8c38f6 Asynchronous interface for transformers (#288)

Add asynchronous interface for MLeap transformers.
Update with clean leap frame interface.
Fix async transform for scala 2.10

MLeap 0.10

Fix Existing Spark Integration Issues


682679f fix StringIndexer serialization and deserialization after changes for Spark 2.3 (#383)


Serialization / Deserialization


682679f fix StringIndexer serialization and deserialization after changes for Spark 2.3 (#383)

MLeap 0.11

Fix Existing Spark Integration Issues


7e3ca9d Mleap-386 fixing BinaryLogisticRegression predictions (#387)

Sync with Spark 2.3


8a8d69e Feature/regex indexer (#403)


Add regex indexer model.


Add regex indexer transformer.


Add in RegexIndexerOp.


81872e5 Feature/sentence to vec (#402)

Add in alternative kernels for the Word2Vec model.
Include kernel in bundle for word to vector.
Use sentence size for divisor in default kernel.
Make indices optional in Word2Vec transformer.
Word2Vec op can use tensor or list for word vectors.
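The "sentence size for divisor" item above can be read as follows: the default kernel sums the vectors of the known words and divides by the number of tokens in the sentence, not by the number of words that were found. A minimal plain-Python sketch under that reading is below; the function name and the dict-based vocabulary are assumptions, and the alternative kernels are not sketched here.

```python
def sentence_vector(tokens, word_vectors, size):
    """Average word vectors over a sentence (default-kernel sketch).

    Unknown tokens contribute nothing to the sum but still count
    towards the divisor, which is the sentence size.
    """
    total = [0.0] * size
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is not None:
            for i, x in enumerate(vec):
                total[i] += x
    if not tokens:
        return total  # empty sentence: zero vector
    return [x / len(tokens) for x in total]


vectors = {"a": [1.0, 2.0], "b": [3.0, 4.0]}
sent = sentence_vector(["a", "b", "zzz"], vectors, size=2)
```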


300f990 MLeap Support for Gensim's Word2Vec model (#404)


initial commit of serializing gensims' word2vec model along with word2sentence transformer for parity with mleap/spark's sent2vec sqrt kernel


initial commit


cleaning up test


updating test


15e8dd3 Feature/drilldown (#405)

Add in Categorical drilldown transformer.
Add in bundle op for categorical drilldown model.


2924b4a Add support for cross validator and train validation split. (#406)


3ee879f [Ready to Merge] Support for Spark 2.3's OneHotEncoderModel (#408)


c7860af Make 1HE runtime op compatible with both legacy 1HE and 1HEModel. (#417)


Serialization / Deserialization


e177e58 Feature/bundle file system (#415)

Add in support for bundle file systems.
Tighter integration with bundle file systems.
Add in bundle HDFS configurations for runtime/spark.
Fix up configuration of HDFS for runtime.
Make hadoop bundle file system configurable.
Add in tests for hadoop file system.
Finish testing for BundleFileSystem.
Fix spec.
Use URI as primary way of saving/loading bundles.
Add in README and make slightly more configurable.
Add in spec for URI saving/loading.
Delete HadoopBundleFileSystem.scala
Fix python requirements file.


MLeap 0.12

Fix Existing Spark Integration Issues


0ff7c16 Update MinMaxScalerModel.scala

Fix apply bug when use sparse vector
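The commit above does not describe the bug, so this is only a sketch of why sparse input needs care in min-max scaling: unlike pure scaling, the transform shifts values, so the implicit zeros of a sparse vector also change and the output is generally dense. The function name and argument layout are hypothetical; mapping a constant feature to the midpoint of the target range follows Spark's documented MinMaxScaler behaviour.

```python
def min_max_scale(indices, values, size, orig_min, orig_max, lo=0.0, hi=1.0):
    """Rescale a sparse vector to [lo, hi] per feature, returning a dense list."""
    # Densify first: implicit zeros must be rescaled too.
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    out = []
    for x, mn, mx in zip(dense, orig_min, orig_max):
        rng = mx - mn
        # Constant feature (max == min): use the midpoint of the target range.
        raw = (x - mn) / rng if rng != 0.0 else 0.5
        out.append(raw * (hi - lo) + lo)
    return out


scaled = min_max_scale([1], [4.0], 3, [-1.0, 0.0, 2.0], [3.0, 8.0, 2.0])
```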


Sync with Spark 2.3


297e0a2 1HE NodeShape backwards compatibility (#433)

MLeap 0.13

Fix Existing Spark Integration Issues


ec9e775 fix custom mleap OneVsRest transformer (#492)

Sync with Spark 2.3


5d231f9 added absolute value (abs) function to the MathUnaryModel class
8c07788 ensure backwards compatibility for old version of 1HE

API Changes