glossary:
- "p": "parallelism" setting in Circle CI
- "i": "instance type" setting in Circle CI
- "r": number of junit runners, configured via -Dtest.runners
- p=100 doesn't work for JVM dtests, and is not necessary. The failures we get at p=100 seem to be caused by a problem in the Circle CI setup / infrastructure. However, there are still some flaky tests in the JVM dtests at lower p.
- JVM dtests don't work when setting r > 1
- current config: (p=1, r=1, i=medium): 20mn
- best: (p=25, r=1, i=xlarge): 4mn
- good compromise time config: (p=10, r=1, i=large): 6mn
- The r value needs to be adjusted manually: the number of runners is not automatically adjusted to the instance type. Using r=nb-of-cores is optimal.
- current config: (p=4, r=1, i=medium): 20mn
- best time config: (p=25, r=4, i=xlarge): 3mn45s
- good compromise config: (p=10, r=4, i=large): 5mn30s
- is equivalent in time to: (p=25, r=2, i=medium)
- i=medium causes a lot of failures (due to the resources needed to run the tests)
- running at i=large or i=xlarge causes far fewer failures. The only remaining failures are flaky tests (sometimes pass, sometimes don't)
- HEAP sizes need to be increased when increasing instance type
- but not too much, otherwise more failures occur. At i=xlarge, MAX_HEAP and HEAP_NEW should not be higher than 2048 and 512 respectively, otherwise timeout errors occur.
- one account cannot run many python dtests at p=100 simultaneously. Only 2 to 3 p=100 runs ever execute at the same time.
- current config: (p=4, i=medium): hours? - lots of failures
- best time config: (p=100, i=xlarge): 19mn
- good compromise config: (p=50, i=large): 22mn
This job will re-run the unit test suite with commit log compaction enabled. Same recommendation applies as for the Unit test suite.
These tests are not parallelized at the moment, and run fast enough on i=medium.
- Long tests are not parallelized at the moment, and run slowly.
- (i=xlarge): 30mn
- (i=large): 30mn
- (i=medium): 28mn
- medium will be enough
- Need config update (need JAVA8_HOME set)
- run very long: each container downloads all versions of C*, which can be optimized
- current config: (p=4, i=medium): 3hours+
- best time config: (p=100, i=xlarge): 52mn (lots of failures)
- would suggest looking into parallelizing the tests better (not having each container download all C* versions separately) before giving a recommendation for these
Remaining data points gathered during experiments, used to draw conclusions on best compromise configs
- (p=10, r=1, i=xlarge): 5mn35 (not worth the xlarge instance as opposed to large)
- (p=25, r=8, i=xlarge): 5mn (not good, for some reason r=8 seemed to cause more failures)
- (p=25, i=large): 32mn (could be acceptable, not as good as p=50 but uses half as many containers)
- (p=50, i=xlarge): 25mn (as good as with large instance, xlarge not necessary)
- (p=25, i=xlarge): 35mn (not as good as p=50, i=large is better)
- (p=25, i=large): 2h46mn
- (p=50, i=large): 1h28mn
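To read data points like these, it can help to check how close a run gets to linear scaling when parallelism is doubled. The following is a minimal sketch, not part of the original experiments: the helper names `to_minutes` and `scaling_efficiency` are hypothetical, and "efficiency" here simply means measured speedup divided by the increase in p.

```python
# Sketch: how well does run time scale when parallelism (p) increases?
# Helper names are illustrative; timings are the ones listed on this page.


def to_minutes(hours: int = 0, minutes: int = 0, seconds: int = 0) -> float:
    """Convert a duration such as 2h46mn into minutes."""
    return hours * 60 + minutes + seconds / 60


def scaling_efficiency(p_low: int, t_low: float, p_high: int, t_high: float) -> float:
    """Measured speedup divided by the ideal (linear) speedup.

    1.0 means run time dropped in proportion to the extra containers;
    lower values mean the extra containers were partly wasted.
    """
    speedup = t_low / t_high
    ideal = p_high / p_low
    return speedup / ideal


if __name__ == "__main__":
    # (p=25, i=large): 32mn versus (p=50, i=large): 22mn
    print(round(scaling_efficiency(25, 32, 50, 22), 2))
    # (p=25, i=large): 2h46mn versus (p=50, i=large): 1h28mn
    print(round(scaling_efficiency(25, to_minutes(2, 46), 50, to_minutes(1, 28)), 2))
```

On the timings above, the first pair gives 0.73 and the second 0.94, so the longer-running suite benefits almost linearly from going to p=50 while the shorter one does not.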
The current configuration (in trunk) is mostly set to the minimal configuration that can make the test suites run, except for the python dtests, where the instance setup seems to lack resources.
To improve build times, in most cases we don't actually seem to need xlarge instances: we have seen very similar improvements by upgrading only to large instances. With some reasonable parallelism, for example with the python dtests, we can have the full python dtest run in under 30 min if we choose p=50 and i=large.
It turns out that at p=100 we see only a minor improvement, which is not worth using twice the resources.
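One rough way to check this trade-off is to multiply parallelism by wall-clock time, giving container-minutes per run. The sketch below uses the python dtest timings reported on this page. The container-minutes metric and the function name are illustrative assumptions, and the figure ignores the price difference between instance types as well as container startup overhead.

```python
# Rough resource comparison for the python dtest runs measured on this page.
# "container-minutes" (parallelism * wall-clock minutes) is an illustrative
# cost proxy, not something Circle CI reports; it ignores instance pricing.

# (parallelism, instance type) -> wall-clock minutes
PYTHON_DTEST_RUNS = {
    (100, "xlarge"): 19,
    (50, "large"): 22,
    (50, "xlarge"): 25,
    (25, "large"): 32,
    (25, "xlarge"): 35,
}


def container_minutes(parallelism: int, minutes: int) -> int:
    """Total container time consumed by one run (hypothetical cost proxy)."""
    return parallelism * minutes


if __name__ == "__main__":
    for (p, instance), minutes in sorted(PYTHON_DTEST_RUNS.items()):
        cost = container_minutes(p, minutes)
        print(f"p={p:<3} i={instance:<6} {minutes}mn -> {cost} container-minutes")
```

By this measure, (p=100, i=xlarge) uses 1900 container-minutes to finish 3 minutes sooner than (p=50, i=large), which uses 1100.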
For the Unit and JVM dtests, current runtimes can be vastly improved by using (p=10, i=large) if we think it is necessary.
In the details above, we list for each test suite a "best compromise" config that brings a significant reduction in test suite time without resorting to the unnecessary "bulldozer" config.
For the remainder of the tests (excluding the upgrade tests), using the minimal configuration gives reasonable run times.