The goal of this assignment is to:
- Get familiar with the H2O stack
- Make an improvement in H2O
H2O provides an implementation of the PCA algorithm that depends on the Jama library. The library is used for several tasks, including Singular Value Decomposition (SVD). However, the library's performance is sub-optimal.
The idea is to replace the Jama SVD computation with the netlib-java library and measure the performance impact.
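One practical detail of such a swap is data layout. Jama's Matrix is built from a double[][] of row arrays, while LAPACK-style routines (which netlib-java wraps) work on a flat, column-major double[]. The sketch below shows one way to convert between the two, assuming the replacement goes through a LAPACK-style call. The class ColumnMajor and its methods are hypothetical names used for illustration; they are not part of H2O, Jama, or netlib-java.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of H2O, Jama, or netlib-java. */
final class ColumnMajor {
    private ColumnMajor() {}

    /** Flattens a rectangular rows x cols matrix (array of rows) into column-major storage. */
    static double[] fromRows(double[][] a) {
        int rows = a.length;
        int cols = rows == 0 ? 0 : a[0].length;
        double[] out = new double[rows * cols];
        for (int i = 0; i < rows; i++) {
            if (a[i].length != cols) {
                throw new IllegalArgumentException("ragged row " + i);
            }
            for (int j = 0; j < cols; j++) {
                // Element (i, j) is stored at index j * rows + i.
                out[j * rows + i] = a[i][j];
            }
        }
        return out;
    }

    /** Inverse of fromRows: rebuilds an array of rows from column-major storage. */
    static double[][] toRows(double[] colMajor, int rows, int cols) {
        if (colMajor.length != rows * cols) {
            throw new IllegalArgumentException("length does not match rows * cols");
        }
        double[][] out = new double[rows][cols];
        for (int j = 0; j < cols; j++) {
            for (int i = 0; i < rows; i++) {
                out[i][j] = colMajor[j * rows + i];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2, 3}, {4, 5, 6}};
        // Prints [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]
        System.out.println(Arrays.toString(fromRows(a)));
    }
}
```

The conversion copies the matrix once, so for a fair comparison it is worth deciding whether that copy counts as part of the measured SVD time.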
Your task is to:
- Clone the development version of H2O from GitHub: https://github.com/h2oai/h2o-3
- Build H2O
- Explore the PCA implementation and how it is used, for example, in JUnit tests.
- Replace the use of Jama SVD with netlib-java
- Measure the impact of the change with one or more single-node micro-benchmarks.
- Create a pull request with your change
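For the measurement step, a minimal timing harness could look like the sketch below. It is only an illustration under stated assumptions: the class MicroBench and the placeholder workload are hypothetical, and the workload would be replaced by the Jama-based and netlib-java-based SVD calls being compared. It runs warm-up iterations first so that JIT compilation is less likely to distort the timed runs, and it reports the median because that is less sensitive to outliers than the mean. A dedicated tool such as JMH is a more rigorous option if a new dependency is acceptable.

```java
import java.util.Arrays;

/** Hypothetical timing harness for illustration only; not part of H2O. */
final class MicroBench {
    private MicroBench() {}

    /** Sink that keeps the placeholder workload's result observable. */
    static volatile double sink;

    /** Runs the task warmup times untimed, then returns the duration of each timed run in ns. */
    static long[] time(Runnable task, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) {
            task.run();
        }
        long[] nanos = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    /** Median of the samples; for an even count, the integer mean of the two middle values. */
    static long median(long[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
    }

    public static void main(String[] args) {
        // Placeholder workload; substitute the SVD implementation under test here.
        Runnable workload = () -> {
            double acc = 0;
            for (int i = 1; i <= 1_000_000; i++) {
                acc += Math.sqrt(i);
            }
            sink = acc;
        };
        long[] samples = time(workload, 5, 20);
        System.out.println("median ns: " + median(samples));
    }
}
```

Running both implementations on the same input matrix, in the same JVM settings, and over several matrix sizes makes the resulting numbers easier to compare.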
Hints:
- IntelliJ IDEA is a great tool for Java development
- You can build only the Java part of H2O by invoking:
./gradlew :h2o-assemblies:main:build
- JUnit tests are a great source of information
- If you are stuck, please do not be afraid to contact us
Evaluation criteria:
- Does it work? Can I launch H2O and compute PCA?
- Was performance measured?
- How good is the implementation?
Enjoy!
Missing datasets for PCA JUnit tests
The PCA test fails because of missing CSV datasets:
However, a Google search turns up these CSVs in version 2 of H2O. Should I include the dataset files in the repo? Or is there another way to find and load these files (perhaps via Flow)?