The goals of this assignment are to:
- Get familiar with the H2O stack
- Make an improvement in H2O
H2O provides an implementation of the PCA algorithm that depends on the Jama library. The library is used for several tasks, including Singular Value Decomposition (SVD). However, it also introduces sub-optimal performance.
The idea is to replace the Jama SVD computation with the netlib-java library and measure the performance impact.
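One practical detail when swapping the two libraries is the matrix layout: Jama works with `double[][]` arrays (one inner array per row), while LAPACK routines such as `dgesvd`, which netlib-java exposes, expect a flat `double[]` in column-major order. The sketch below shows one way to convert between the two. The class and method names (`ColumnMajor`, `flatten`, `unflatten`) are hypothetical and are not part of H2O, Jama, or netlib-java.

```java
import java.util.Arrays;

// Hypothetical helper, not part of H2O: converts between the double[][]
// layout used by Jama and the flat column-major layout expected by LAPACK.
final class ColumnMajor {
    private ColumnMajor() {}

    /** Copies a rectangular rows x cols matrix into a flat column-major array. */
    static double[] flatten(double[][] a) {
        int rows = a.length;
        int cols = rows == 0 ? 0 : a[0].length;
        double[] flat = new double[rows * cols];
        for (int i = 0; i < rows; i++) {
            if (a[i].length != cols) {
                throw new IllegalArgumentException("ragged matrix at row " + i);
            }
            for (int j = 0; j < cols; j++) {
                // Element (i, j) lives at offset j * rows + i in column-major order.
                flat[j * rows + i] = a[i][j];
            }
        }
        return flat;
    }

    /** Rebuilds a rows x cols double[][] from a flat column-major array. */
    static double[][] unflatten(double[] flat, int rows, int cols) {
        if (flat.length != rows * cols) {
            throw new IllegalArgumentException("length does not match rows * cols");
        }
        double[][] a = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                a[i][j] = flat[j * rows + i];
            }
        }
        return a;
    }

    public static void main(String[] args) {
        double[] flat = flatten(new double[][] {{1, 2, 3}, {4, 5, 6}});
        System.out.println(Arrays.toString(flat));
    }
}
```

Keeping the conversion in one place makes it easier to check that the new SVD returns the same factors as the Jama one before comparing their speed.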
Your task is to:
- Clone the development version of H2O from GitHub: https://github.com/h2oai/h2o-3
- Build H2O
- Explore the PCA implementation and how it is used, for example in JUnit tests
- Replace the use of Jama SVD with netlib-java
- Measure the impact of the change with one or more single-node micro-benchmarks
- Create a pull request with your change
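The measurement step could start from a very small timing harness like the one below. This is only a sketch under stated assumptions: `MicroBench` and `medianNanos` are hypothetical names, the harness relies on plain `System.nanoTime()` with a few warm-up runs, and a dedicated benchmarking tool such as JMH would give more reliable numbers.

```java
import java.util.Arrays;
import java.util.function.Supplier;

// Hypothetical harness, not part of H2O: times a task after warm-up runs.
final class MicroBench {
    // Results are written here so the JIT cannot discard the work as dead code.
    static volatile Object sink;

    private MicroBench() {}

    /**
     * Runs the task `warmups` times untimed, then `runs` times timed, and
     * returns the (upper) median wall-clock time of the timed runs in nanoseconds.
     */
    static long medianNanos(Supplier<?> task, int warmups, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        for (int i = 0; i < warmups; i++) {
            sink = task.get();
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            sink = task.get();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples[runs / 2];
    }

    public static void main(String[] args) {
        // Stand-in workload; replace with the Jama and netlib-java SVD calls.
        long nanos = medianNanos(() -> {
            double sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += Math.sqrt(i);
            }
            return sum;
        }, 5, 11);
        System.out.println("median: " + nanos + " ns");
    }
}
```

To compare the two implementations, call `medianNanos` once with a supplier that runs the Jama-based SVD and once with one that runs the netlib-java version, using the same input matrix, warm-up count, and run count for both.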
Hints:
- IntelliJ IDEA is a great tool for Java development
- You can build only the Java part of H2O by invoking
./gradlew :h2o-assemblies:main:build
- JUnit tests are a great source of information
- If you are stuck, please do not be afraid to contact us
Evaluation criteria:
- Does it work? Can I launch H2O and compute PCA?
- Was performance measured?
- How good is the implementation?
Enjoy!
Actually, h2o-2 does not contain all the necessary CSV files. The following ones are missing, which causes tests to fail:
smalldata/prostate/prostate_cat.csv
smalldata/pca_test/iris_PCAscore.csv
smalldata/pca_test/USArrests_PCAscore.csv
smalldata/pca_test/decathlon.csv
Any ideas where I can get these missing files?