Author: Anirudh Joshi
email: janirudhg@gmail.com
Github profile: ani5rudh
Mentors:
The goal of this project was to design a new and updated summary statistics API based on the summary statistic implementations in the Commons Math stat.descriptive package. The project aimed to be a library of common statistics functions in line with the latest developments in the Java language, in particular Java's functional programming syntax.
Commons Math is a library of lightweight, self-contained mathematics and statistics components addressing the most common problems not available in the Java programming language or Commons Lang. Over time it had grown too large, hierarchically complex, and interdependent for the Commons mission. The new summary statistics API uses Java's functional programming features to benefit from stream pipeline optimization, parallel processing, and lazy evaluation, and to improve code readability.
The project included:
- Providing individual implementations of the descriptive stats with extensive documentation and testing.
- Providing a combine functionality to compute the combined statistic of two partially computed stats.
- Improving the architectural design of the descriptive package.
- Providing a builder/aggregator class which would compute multiple stats.
- The following stats were implemented, documented and tested:
- Minimum
- Maximum
- Sum
- Mean
- Variance
- The combine functionality was provided for the above stats.
- The module now makes use of the java.util.stream package and its various methods such as map and reduce.
- During the computation of the sum statistic, a bug was encountered in the commons-numbers module when adding 2 special values (see NUMBERS-200). This was fixed in PR-136.
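To illustrate how a combine operation fits the java.util.stream model, here is a minimal sketch. It is not the code from the PRs: the class name MeanVariance and its methods are hypothetical, and the update and merge formulae are the well-known Welford update and the pairwise merge for the sum of squared deviations, which may differ from what the project finally adopted. Returning NaN for fewer than two values is also an assumption of this sketch.

```java
import java.util.stream.DoubleStream;

public class CombineSketch {

    /** Hypothetical accumulator for illustration; not the project's API. */
    static final class MeanVariance {
        private long n;       // number of values seen
        private double mean;  // running mean
        private double m2;    // running sum of squared deviations from the mean

        /** Adds one value using Welford's online update. */
        void accept(double x) {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        }

        /** Merges another partially computed statistic into this one. */
        void combine(MeanVariance other) {
            if (other.n == 0) {
                return;
            }
            if (n == 0) {
                n = other.n;
                mean = other.mean;
                m2 = other.m2;
                return;
            }
            long total = n + other.n;
            double delta = other.mean - mean;
            // Pairwise merge of the sums of squared deviations.
            m2 += other.m2 + delta * delta * ((double) n * other.n / total);
            mean += delta * other.n / total;
            n = total;
        }

        double getMean() {
            return n == 0 ? Double.NaN : mean;
        }

        /** Unbiased sample variance; NaN when fewer than two values (an assumption). */
        double getVariance() {
            return n < 2 ? Double.NaN : m2 / (n - 1);
        }
    }

    /** Builds the statistic from a stream; the combiner is what allows parallel use. */
    static MeanVariance of(double... values) {
        return DoubleStream.of(values)
                .collect(MeanVariance::new, MeanVariance::accept, MeanVariance::combine);
    }

    public static void main(String[] args) {
        MeanVariance a = of(1, 2, 3, 4);
        MeanVariance b = of(5, 6, 7, 8);
        a.combine(b);
        System.out.println(a.getMean() + " " + a.getVariance());
    }
}
```

The same three-argument collect call works unchanged on a parallel stream, because each worker fills its own accumulator and the partial results are merged with combine.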
- The builder class is yet to be implemented based on the individual stats. This should be done in an efficient way so that the stats computed can be reused by other stats if possible. For example, the variance can reuse the mean and sum-of-squares statistics if they are co-computed.
- The sum of squares and sum of logs stats, and their related stats (quadratic mean and geometric mean), have yet to be implemented. These are very similar to the sum and mean stats respectively, and can follow the same design.
- The combine functionality was not present in the commons-math module. Working out how to combine two instances of a statistic took a significant amount of time, particularly for mean and variance, because I had to consult research papers and online articles for the relevant formulae.
- The test cases from the commons-math module were not sufficient to identify issues with the original implementation. Adding more test cases with extreme edge values highlighted problems with the implementation that needed to be addressed. I encountered many issues when computing stats for data that included special values (Double.MAX_VALUE, Double.MIN_VALUE, Double.POSITIVE_INFINITY and Double.NEGATIVE_INFINITY). Resolving these issues helped me to redesign and improve the efficiency of the code.
- I encountered complications when dealing with overflow for large values. I learnt a lot about double value overflows and techniques to prevent them, such as scaling. It was also challenging to find the right balance between efficiency and accuracy while computing the statistics for large input values.
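The overflow problem and the accuracy trade-off of scaling can be shown with a small sketch. This is only an illustration under my own assumptions: the class and method names are hypothetical, and dividing every value by the count is just one simple form of scaling, not necessarily the approach used in the final implementation.

```java
public class OverflowSketch {

    /** Sums first, divides once: fast, but the intermediate sum can overflow. */
    static double naiveMean(double... values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Scales each value by the count before adding, so the running total stays in range. */
    static double scaledMean(double... values) {
        if (values.length == 0) {
            return Double.NaN; // matches naiveMean, where 0.0 / 0 is NaN
        }
        double mean = 0;
        for (double v : values) {
            mean += v / values.length;
        }
        return mean;
    }

    public static void main(String[] args) {
        double max = Double.MAX_VALUE;
        double min = Double.MIN_VALUE;

        // Large values: the naive sum overflows, the scaled version does not.
        System.out.println(naiveMean(max, max));
        System.out.println(scaledMean(max, max));

        // Tiny values: scaling underflows each term, the naive version does not.
        System.out.println(naiveMean(min, min));
        System.out.println(scaledMean(min, min));
    }
}
```

For two copies of Double.MAX_VALUE the naive mean overflows to infinity while the scaled mean stays finite. For two copies of Double.MIN_VALUE the situation reverses: each scaled term underflows to zero, so the scaled mean loses the value entirely, and it also costs one division per element. This is the kind of efficiency versus accuracy balance described above.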
The initial discussion before submitting the proposal was held in the above JIRA ticket. A proof of concept was also provided as part of the proposal.
- PR-46: Added the base interfaces and implemented the minimum statistic.
- PR-49: Maximum statistic implementation.
- PR-50: Sum statistic implementation.
- PR-51: Mean implementation.
- PR-52: Variance implementation.
This ticket was created for the bug report when adding two Sum instances containing special values. Here's the associated PR submitted to fix this bug.
- How to raise PRs, along with a better understanding of git.
- Learnt the declarative style of programming and Java's functional programming features.
- How to debug and fix code issues.
- Learnt JUnit5 and how to meet the coverage checks.
- Studied various algorithms and their advantages and disadvantages while computing the stats Sum, Mean and Variance.
- Improved my understanding of floating-point numbers and their representation and also learnt about unit in the last place (ulp).
- The importance of documentation.
- How to work and contribute to the open-source community.
I am incredibly grateful for the chance to take part in GSoC and contribute to The Apache Software Foundation. This was a great opportunity to develop my programming skills.
I am incredibly appreciative of my mentors for their continual support, thorough code reviews, and participation in regular discussions to answer my queries. Special mention to my mentor Alex, who helped me a great deal in understanding the various overflow issues and guided me in resolving them. He also provided a great deal of information on the various algorithms for computing the stats, which helped me understand them much better.
The Apache community has been very helpful, and their efforts to improve the developer experience have made me very grateful to be a part of their organization.