@mmalohlava
Last active March 31, 2017 09:23
Assignment: Improve H2O PCA

The goal of this assignment is to:

  1. Get familiar with the H2O stack
  2. Make an improvement in H2O

Details

H2O provides an implementation of the PCA algorithm that depends on the Jama library. The library is used for several tasks, including Singular Value Decomposition (SVD). However, the library also leads to sub-optimal performance.

The idea is to replace the Jama SVD computation with the netlib-java library and measure the performance impact.

Your task is to:

  1. Clone the development version of H2O from GitHub https://github.com/h2oai/h2o-3
  2. Build H2O
  3. Explore the PCA implementation and how it is used, for example, in JUnit tests.
  4. Replace the use of Jama SVD with netlib-java
  5. Measure the impact of the change with single-node micro benchmark(s).
  6. Create a pull request with your change

Hints

  • IntelliJ IDEA is a great tool for Java development
  • You can build only the Java part of H2O by invoking ./gradlew :h2o-assemblies:main:build
  • JUnit tests are a great source of information
  • If you are stuck, please, do not be afraid to contact us
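
Steps 1 and 2 of the task, combined with the build hint above, could be scripted roughly as follows. This is only a sketch: the `run` helper, the `DRY_RUN` variable, the `clone_and_build` function name, and the default checkout directory `h2o-3` are assumptions made for illustration, not part of H2O.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Clone h2o-3 (step 1) and build only the Java part (step 2).
clone_and_build() {
  target_dir="${1:-h2o-3}"   # assumed checkout directory
  # Skip the clone when the checkout already exists.
  [ -d "$target_dir" ] || run git clone https://github.com/h2oai/h2o-3 "$target_dir" || return 1
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ (cd $target_dir && ./gradlew :h2o-assemblies:main:build)"
  else
    # Subshell keeps the caller's working directory unchanged.
    (cd "$target_dir" && ./gradlew :h2o-assemblies:main:build)
  fi
}
```

Calling `clone_and_build` performs both steps, while `DRY_RUN=1 clone_and_build` only prints the commands it would run.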

Evaluation criteria

  • Does it work? Can I launch H2O and compute PCA?
  • Was performance measured?
  • How good is the implementation?

Enjoy!

@mathemage

Missing datasets for PCA JUnit tests

The PCA test fails because of missing CSV datasets:

java.lang.AssertionError: File not found: smalldata/prostate/prostate_cat.csv
	at org.junit.Assert.fail(Assert.java:88)
	at water.TestUtil.makeNfsFileVec(TestUtil.java:272)
	at water.TestUtil.parse_test_file(TestUtil.java:278)
	at hex.pca.PCATest.testCatOnlyPUBDEV3988(PCATest.java:233)

Google can find these CSVs in version 2 of h2o, though. Should I add the dataset files to the repo? Or is there another way to find and load these files (perhaps via Flow)?

@mathemage

Actually, h2o-2 does not contain all the necessary CSVs. These missing ones cause tests to fail:

  • smalldata/prostate/prostate_cat.csv
  • smalldata/pca_test/iris_PCAscore.csv
  • smalldata/pca_test/USArrests_PCAscore.csv
  • smalldata/pca_test/decathlon.csv

Any ideas where I can get these missing files?
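
A quick way to see which of these files are absent from a checkout is a small existence check. This is a minimal sketch; the function name `missing_files` is hypothetical, and it assumes the paths are relative to the repository root passed as the first argument.

```shell
# Print each listed path that does not exist under the given root directory.
missing_files() {
  root="$1"; shift
  for path in "$@"; do
    [ -e "$root/$path" ] || echo "$path"
  done
}
```

For example, `missing_files . smalldata/prostate/prostate_cat.csv smalldata/pca_test/decathlon.csv` run from the repository root lists only the files that still need to be fetched.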

@mathemage

I went to Gitter for help. @michalkurka gave me a tip to sync the smalldata first with ./gradlew syncSmalldata. I am trying that now, but the download is so slow that the sync fails. In particular, the connection to Amazon AWS is rather bad, and ./gradlew syncSmalldata fails with

Failed to download file https://h2o-public-test-data.s3.amazonaws.com/smalldata/arcene/arcene_test.data

and when I download manually with wget, the download speed drops down to ~5 KB/s. Is there a mirror or some other source of smalldata?
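
One possible workaround for a flaky connection is to retry each download and let wget resume partial files. The sketch below is an assumption-laden illustration, not something H2O ships: `retry`, `fetch_smalldata`, and `RETRY_DELAY` are hypothetical names, and the attempt count and timeout are arbitrary.

```shell
# retry N CMD...: run CMD up to N times, pausing RETRY_DELAY seconds between attempts.
retry() {
  attempts="$1"; shift
  n=1
  while :; do
    "$@" && return 0
    [ "$n" -ge "$attempts" ] && return 1
    n=$((n + 1))
    sleep "${RETRY_DELAY:-5}"
  done
}

# Download one smalldata file into its relative path, resuming partial downloads.
fetch_smalldata() {
  url_prefix="https://h2o-public-test-data.s3.amazonaws.com/"
  mkdir -p "$(dirname "$1")"
  # -c resumes a partial file; -P saves into the file's own directory.
  retry 5 wget -c --timeout=60 -P "$(dirname "$1")" "$url_prefix$1"
}
```

With this in place, `fetch_smalldata smalldata/arcene/arcene_test.data` would try the failing file up to five times instead of giving up on the first error.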

@mmalohlava
Author

Ahh, they are hosted in S3, but the region is bound to us-east. I sent you an email with a direct link to the file, but this is something to think about.

@mathemage

@mmalohlava Great, thanks! I got your email. I am still downloading the other dataset files; hopefully they will finish downloading overnight 😅

@mathemage

The sync failed to finish overnight. I skipped downloading the entire smalldata/ and downloaded just the files needed for the PCA tests:

url_prefix="https://h2o-public-test-data.s3.amazonaws.com/"
filepaths="smalldata/iris/iris_wheader.csv
smalldata/pca_test/USArrests.csv
smalldata/pca_test/decathlon.csv
smalldata/pca_test/iris_PCAscore.csv
smalldata/pca_test/USArrests_PCAscore.csv
"

# Download each file that is not already present locally.
for filepath in $filepaths
do
  if [ ! -e "$filepath" ]
  then
    mkdir -p "$(dirname "$filepath")"
    # Remove a partial file on failure so that a rerun retries it.
    wget -O "$filepath" "$url_prefix$filepath" || rm -f "$filepath"
  fi
done

For now, the PCA JUnit tests are all passing 😀

@mathemage

mathemage commented Mar 29, 2017

@mmalohlava By netlib-java, do you mean MTJ specifically?

@mathemage

mathemage commented Mar 29, 2017

Okay, I started the migration to MTJ (based on netlib-java). You can see my progress in the branch mathemage-pca-using-netlib-java-svd. Benchmarks are still to be run...

@mathemage

@mmalohlava For single node micro benchmarks, is JMH used for h2o-3? Or what do you guys use?

@mathemage

mathemage commented Mar 30, 2017

@mmalohlava How are JMH benchmarks done in h2o? I checked the jmh-gradle-plugin tutorial, but I am still struggling. Do the Groovy commands from the tutorial go into h2o-3/build.gradle, or into a different Gradle file?

@mathemage

@mmalohlava OIC, one needs to specify a particular subproject, as in ./gradlew :h2o-core:jmh. OK, my bad, it works now 😄
