- New algorithms in DataFrame-based API
- SPARK-14709: LinearSVC (Linear SVM Classifier) (Scala/Java/Python/R)
- SPARK-19635: ChiSquare test in DataFrame-based API (Scala/Java/Python)
- SPARK-19636: Correlation in DataFrame-based API (Scala/Java/Python)
- SPARK-13568: Imputer feature transformer for imputing missing values (Scala/Java/Python)
- SPARK-18929: Add Tweedie distribution for GLMs (Scala/Java/Python/R)
- SPARK-14503: FPGrowth frequent pattern mining and AssociationRules (Scala/Java/Python/R)
- Existing algorithms added to Python and R APIs
- SPARK-18239: Gradient Boosted Trees ®
- SPARK-18821: Bisecting K-Means ®
- SPARK-18080: Locality Sensitive Hashing (LSH) (Python)
- SPARK-6227: Distributed PCA and SVD for PySpark (in RDD-based API)
- Major bug fixes
- SPARK-19110: DistributedLDAModel.logPrior correctness fix
- SPARK-17975: EMLDAOptimizer fails with ClassCastException (caused by GraphX checkpointing bug)
- SPARK-18715: Fix wrong AIC calculation in Binomial GLM
- SPARK-16473: BisectingKMeans failing during training with “java.util.NoSuchElementException: key not found” for certain inputs
- SPARK-19348: pyspark.ml.Pipeline gets corrupted under multi-threaded use
- SPARK-20047: Box-constrained Logistic Regression
- Highlights
- ML Prediction now works with Structured Streaming, using updated APIs. Details below.
- New/Improved APIs
- [SPARK-21866]: Built-in support for reading images into a DataFrame (Scala/Java/Python)
- [SPARK-19634]: DataFrame functions for descriptive summary statistics over vector columns (Scala/Java)
- [SPARK-14516]:
ClusteringEvaluator
for tuning clustering algorithms, supporting Cosine silhouette and squared Euclidean silhouette metrics (Scala/Java/Python) - [SPARK-3181]: Robust linear regression with Huber loss (Scala/Java/Python)
- [SPARK-13969]:
FeatureHasher
transformer (Scala/Java/Python) - Multiple column support for several feature transformers:
- [SPARK-13030]:
OneHotEncoderEstimator
(Scala/Java/Python) - [SPARK-22397]:
QuantileDiscretizer
(Scala/Java) - [SPARK-20542]:
Bucketizer
(Scala/Java/Python)
- [SPARK-13030]:
- [SPARK-21633] and [SPARK-21542]: Improved support for custom pipeline components in Python.
- New Features
- [SPARK-21087]:
CrossValidator
andTrainValidationSplit
can collect all models when fitting (Scala/Java). This allows you to inspect or save all fitted models. - [SPARK-19357]: Meta-algorithms
CrossValidator
,TrainValidationSplit,
OneVsRest` support a parallelism Param for fitting multiple sub-models in parallel Spark jobs - [SPARK-17139]: Model summary for multinomial logistic regression (Scala/Java/Python)
- [SPARK-18710]: Add offset in GLM
- [SPARK-20199]: Added
featureSubsetStrategy
Param toGBTClassifier
andGBTRegressor
. Using this to subsample features can significantly improve training speed; this option has been a key strength ofxgboost
.
- [SPARK-21087]:
- Other Notable Changes
- [SPARK-22156] Fixed
Word2Vec
learning rate scaling withnum
iterations. The new learning rate is set to match the originalWord2Vec
C code and should give better results from training. - [SPARK-22289] Add
JSON
support for Matrix parameters (This fixed a bug for ML persistence withLogisticRegressionModel
when using bounds on coefficients.) - [SPARK-22700]
Bucketizer.transform
incorrectly drops row containingNaN
. When ParamhandleInvalid
was set to “skip,”Bucketizer
would drop a row with a valid value in the input column if another (irrelevant) column had aNaN
value. - [SPARK-22446] Catalyst optimizer sometimes caused
StringIndexerModel
to throw an incorrect “Unseen label” exception whenhandleInvalid
was set to “error.” This could happen for filtered data, due to predicate push-down, causing errors even after invalid rows had already been filtered from the input dataset. - [SPARK-21681] Fixed an edge case bug in multinomial logistic regression that resulted in incorrect coefficients when some features had zero variance.
- Major optimizations:
- [SPARK-22707] Reduced memory consumption for
CrossValidator
- [SPARK-22949] Reduced memory consumption for
TrainValidationSplit
- [SPARK-21690]
Imputer
should train using a single pass over the data - [SPARK-14371]
OnlineLDAOptimizer
avoids collecting statistics to the driver for each mini-batch.
- [SPARK-22707] Reduced memory consumption for
- [SPARK-22156] Fixed
- ee76cc7 Fix binary logistic regression predict method. (#211)
BinaryLogisticRegressionModel
.- Fix issue: LogisticRegressionModel returns wrong predictions.
- b323879 sync GBT Classifier with Spark 2.2 (#236)
- add rawPrediction and probability columns to GBT Classifier
- add loss functions for probabilistic classifiers
- add reference config for spark 2.2
- 89d6ca5 add "keep" as a invalid handling mode to StringIndexerModel (#226)
- 9196e54 Handle null values like Spark does (#262)
- 185cdf4 Feature/core types - overhaul of typing system (#250)
- Extend Model for ElementwiseProductModel and PipelineModel
- Add implicit context to some Ops
- Remove remaining type from Socket
- add numFeatures to LogisticRegressionModel based on coefficients/coefficientsMatrix
- add numFeatures to SupportVectorMachineModel based on coefficients
- fix SparkTransformBuilderSpec
- adding input/output schema and fixing or adding tests for outstanding models
- scikit integration fixes
- Makes it much easier to implement both MLeap transformer/Spark transformer serializers
- Offers full test coverage of input/output schemas for transformers
- Fully-typed pipelines mean that every input/output of a transformer is known by MLeap
- d076532 Allow OneHotEncoder etc. to be copied (#219)
- Allow OneHotEncoder to be copied.
- Same fix for sibling types.
-
0c3329f Fix LogisticRegressionOpV21 for binary classification (#264)
-
7be8bdd Automatic casting of data types in leap frame and Spark DataFrame <--> LeapFrame conversion (#260)
-
c6c4b4d Fix LR threshold/thresholds when loading model into spark pipeline (#265)
-
3de5d79 Fix null string mleap -> spark type conversion. (#261)
-
5b82afe Fix up some Spark bundle ops and make parity specs more flexible (#263)
-
8d7f713 Automatically set spark registry version based on version string. (#248)
-
6181ead Use default registry for Spark bundle registry if it's defined. (#251)
- a0d2218 XGBoost Integration (#259)
- 7be8bdd Automatic casting of data types in leap frame and Spark DataFrame <--> LeapFrame conversion (#260)
- 1c36b9c Fix MaxAbsScaler. (#273)
- 0c3a4d5 Fix issue with MaxAbsScaler (#274)
- 33ee871 Mark input to RegexTokenizer and Tokenizer as nonnullable to prevent NPEs (#278)
- 8d64f66 Fix typedExec copying inputs from the model for the UDF and change spark parity base to check for typedExec for Simple and Multi Transformers (#280)
- 6f681f7 Add empty strings to null numeric casting (#317)
-
06d66db Implement skip logic for string indexer. (#289)
-
e3070b5 issue 231 - support sparse data in standard scaler (#300)
- issue 231 - support sparse data in standard scaler
- issue 229 - update std scaler code to match Spark
-
0130150 fix keep_invalid being lost when a model gets deserialized back in Spark (#312)
- fix keep_invalid being lost when a model gets deserialized back in Spark
- re-trigger the build
-
85e00af remove custom mleap one hot encoder
-
7dce151 Improve reverse string indexer with list inputs/outputs. (#350)
-
8e00ee8 Issue #79: add ALS support - part 1 (#347)
-
add support for ALS predictions in MLeap - part 1
- note: some precision issues with parity test to be fixed
-
simplify to remove need for vectors
-
same logic as spark 2.2.0
-
fix parity test
-
Merge remote-tracking branch 'remotes/upstream/master' into issue_79_als
-
fix copy paste error
-
-
8bde434 [Ready For Merge] Spark 2.3 Support (#364)
-
Upgrade spark dependency to 2.3 and ensure all existing unit tests pass.
-
Remove cross compile for 2.10, as spark 2.3 isn't prebuilt for scala 2.10 anymore.
-
Initial work on support for FeatureHasher transformer
-
Few cleanup items and more/better tests for FeatureHasher
-
Update README.md to remove references to Scala 2.10 and cross compilation.
-
- 4acb81f Refactor LeapFrame to make a consistent interface (#286)
- 08cde5f Feature/pypi version (#287)
-
b8c38f6 Asynchronous interface for transformers (#288)
- Add asynchronous interface for MLeap transformers.
- Update with clean leap frame interface.
- Fix async transform for scala 2.10
- 682679f fix StringIndexer serialization and deserialization after changes for Spark 2.3 (#383)
- **682679f fix StringIndexer serialization and deserialization after changes for Spark 2.3 (#383)
- 7e3ca9d Mleap-386 fixing BinaryLogisticRegression predictions (#387)
-
8a8d69e Feature/regex indexer (#403)
-
Add regex indexer model.
-
Add regex indexer transformer.
-
Add in RegexIndexerOp.
-
-
81872e5 Feature/sentence to vec (#402)
- Add in alternative kernels for the Word2Vec model.
- Include kernel in bundle for word to vector.
- Use sentence size for divisor in default kernel.
- Make indices optional in Word2Vec transformer.
- Word2Vec op can use tensor or list for word vectors.
-
300f990 MLeap Support for Gensim's Word2Vec model (#404)
-
initial commit of serializing gensims' word2vec model along with word2sentence transformer for parity with mleap/spark's sent2vec sqrt kernel
-
initial commit
-
cleaning up test
-
updating test
-
-
15e8dd3 Feature/drilldown (#405)
- Add in Categorical drilldown transformer.
- Add in bundle op for categorical drilldown model.
-
2924b4a Add support for cross validator and train validation split. (#406)
-
3ee879f [Ready to Merge] Support for Spark 2.3's OneHotEncoderModel (#408)
-
c7860af Make 1HE runtime op compatible with both legacy 1HE and 1HEModel. (#417)
- e177e58 Feature/bundle file system (#415)
- Add in support for bundle file systems.
- Tighter integration with bundle file systems.
- Add in bundle HDFS configurations for runtime/spark.
- Fix up configuration of HDFS for runtime.
- Make hadoop bundle file system configurable.
- Add in tests for hadoop file system.
- Finish testing for BundleFileSystem.
- Fix spec.
- Use URI as primary way of saving/loading bundles.
- Add in README and make slightly more configurable.
- Add in spec for URI saving/loading.
- Delete HadoopBundleFileSystem.scala
- Fix python requirements file.
-
0ff7c16 Update MinMaxScalerModel.scala
- Fix apply bug when use sparse vector
- 297e0a2 1HE NodeShape backwards compatibility (#433)
- ec9e775 fix custom mleap OneVsRest transformer (#492)
- 5d231f9 added absolute value (abs) function to the MathUnaryModel class
- 8c07788 ensure backwards compatibility for old version of 1HE