@ani5rudh
Last active January 11, 2024 09:18
Final Report of "New Summary Statistics API for Java 8 streams (GSoC 2023)"

Author: Anirudh Joshi

email: janirudhg@gmail.com

Github profile: ani5rudh

Mentors:

Project Description

The goal of this project was to design a new and updated summary statistics API based on the summary statistic implementations in the Commons Math stat.descriptive package. The project aimed to be a library of common statistics functions in line with the latest developments in the Java language, in particular Java's functional programming syntax.

Commons Math is a library of lightweight, self-contained mathematics and statistics components addressing the most common problems not available in the Java programming language or Commons Lang. It had grown too large, and had become too hierarchically complex and interdependent, for the Commons mission. The new summary statistics API uses Java's functional programming features to take advantage of stream pipeline optimization, parallel processing, and lazy evaluation, and to improve code readability.

The project included:

  • Providing individual implementations of the descriptive stats with extensive documentation and testing.
  • Providing a combine functionality to compute the combined statistic of two partially computed stats.
  • Improving the architectural design of the descriptive package.
  • Providing a builder/aggregator class which would compute multiple stats.
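
As an illustration of how a statistic with a combine step fits into a stream pipeline, here is a minimal sketch. The class `RunningMean` and its method names are hypothetical and are not taken from the project's code; the sketch only assumes the standard `DoubleStream.collect` method, which needs a supplier, an accumulator and a combiner.

```java
import java.util.stream.DoubleStream;

/** Hypothetical running mean used only for illustration; not the project's class. */
final class RunningMean {
    private long n;
    private double mean;

    /** Adds one value using the updating formula mean += (x - mean) / n. */
    void accept(double x) {
        n++;
        mean += (x - mean) / n;
    }

    /** Merges another partially computed mean into this one. */
    void combine(RunningMean other) {
        if (other.n == 0) {
            return;
        }
        if (n == 0) {
            n = other.n;
            mean = other.mean;
            return;
        }
        long total = n + other.n;
        // Weighted shift of this mean towards the other mean.
        mean += (other.mean - mean) * other.n / total;
        n = total;
    }

    /** Returns the mean, or NaN when no values were added (a choice made for this sketch). */
    double getAsDouble() {
        return n == 0 ? Double.NaN : mean;
    }

    /** Computes the mean of a stream; works for sequential and parallel streams. */
    static double of(DoubleStream values) {
        return values
            .collect(RunningMean::new, RunningMean::accept, RunningMean::combine)
            .getAsDouble();
    }
}
```

With this shape, `RunningMean.of(DoubleStream.of(1, 2, 3, 4).parallel())` lets the stream framework split the input, compute partial means and merge them through `combine`.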

Project Progress

  • The following stats were implemented, documented and tested:
    1. Minimum
    2. Maximum
    3. Sum
    4. Mean
    5. Variance
  • The combine functionality was provided for the above stats.
  • The module now makes use of the java.util.stream package and its various methods such as map and reduce.
  • During the computation of the sum statistic, a bug was encountered in the commons-numbers module when adding two special values (see NUMBERS-200). This was fixed in PR-136.
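
To show why special values need dedicated tests, the sketch below sums a few of them with a plain `reduce` over a `DoubleStream`. It only demonstrates standard IEEE 754 arithmetic in Java; it does not reproduce the NUMBERS-200 bug, and the class name `SpecialValueSums` is hypothetical.

```java
import java.util.stream.DoubleStream;

/** Hypothetical demo of plain double addition with special values. */
final class SpecialValueSums {
    /** Sums the values with a simple reduce; no compensation is applied. */
    static double sum(double... values) {
        return DoubleStream.of(values).reduce(0.0, Double::sum);
    }

    public static void main(String[] args) {
        // Adding a finite value to infinity leaves infinity unchanged.
        System.out.println(sum(Double.POSITIVE_INFINITY, 1.0)); // Infinity
        // Opposite infinities have no defined sum.
        System.out.println(sum(Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY)); // NaN
        // Two large finite values overflow to infinity.
        System.out.println(sum(Double.MAX_VALUE, Double.MAX_VALUE)); // Infinity
        // NaN propagates through every later addition.
        System.out.println(sum(Double.NaN, 1.0)); // NaN
    }
}
```

A summation algorithm that tracks a rounding correction term has to return these same results, which is one place where an implementation can go wrong.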

Future Work

  • The builder class is yet to be implemented based on the individual stats. This should be done in an efficient way so that the stats computed can be reused by other stats if possible. For example, the variance can reuse the mean and sum-of-squares statistics if they are co-computed.

  • The sum of squares and sum of logs statistics, and the related quadratic mean and geometric mean, have yet to be implemented. They are very similar to the sum and mean statistics respectively, and can follow the same design.
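
The reuse described in the first item can be sketched as follows. This is a minimal illustration, assuming a Welford-style update and the well-known pairwise merge formula of Chan, Golub and LeVeque; the class `RunningVariance` is hypothetical, and the report does not state which algorithm the project adopted. Here "sum of squares" is taken to mean the sum of squared deviations from the mean.

```java
/** Hypothetical statistic that co-computes the mean and the sum of squared deviations. */
final class RunningVariance {
    private long n;
    private double mean;
    private double m2; // sum of squared deviations from the current mean

    /** Welford-style update for one value. */
    void accept(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    /** Merges another partial result using the pairwise formula. */
    void combine(RunningVariance other) {
        if (other.n == 0) {
            return;
        }
        if (n == 0) {
            n = other.n;
            mean = other.mean;
            m2 = other.m2;
            return;
        }
        long total = n + other.n;
        double delta = other.mean - mean;
        m2 += other.m2 + delta * delta * n * other.n / total;
        mean += delta * other.n / total;
        n = total;
    }

    /** The mean comes for free because the variance already tracks it. */
    double getMean() {
        return n == 0 ? Double.NaN : mean;
    }

    /** Sample variance; NaN for fewer than two values (a choice made for this sketch). */
    double getVariance() {
        return n < 2 ? Double.NaN : m2 / (n - 1);
    }

    static RunningVariance of(double... values) {
        RunningVariance v = new RunningVariance();
        for (double x : values) {
            v.accept(x);
        }
        return v;
    }
}
```

A builder that is asked for both the mean and the variance could hold a single object of this kind instead of two independent accumulators.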

Project Challenges

  • The combine functionality was not present in the commons-math module. Implementing the combination of two instances of a statistic took a significant amount of time, particularly for mean and variance, because the relevant formulae had to be found in research papers and online articles.

  • The test cases from the commons-math module were not sufficient to identify issues with the original implementation. Adding more test cases with extreme edge values highlighted problems with the implementation that needed to be addressed. I encountered many issues related to the computation of stats for data that included special values (Double.MAX_VALUE, Double.MIN_VALUE, Double.POSITIVE_INFINITY and Double.NEGATIVE_INFINITY). Resolving these issues helped me to redesign and improve the efficiency of the code.

  • I encountered complications when dealing with overflow in the context of large values. I learnt a lot about double value overflows and techniques such as scaling to prevent them. It was also challenging to find the right balance between efficiency and accuracy while computing the statistics for large input values.
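
The overflow problem and the scaling idea from the last item can be illustrated with a small sketch. The class `ScaledMean` and both methods are hypothetical and are not the project's implementation; they only contrast dividing once at the end with dividing each value before it is added.

```java
/** Hypothetical comparison of a naive mean and a scaled mean. */
final class ScaledMean {
    /** Sums first and divides once; the intermediate sum can overflow. */
    static double naiveMean(double... values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Divides each value by the count before adding; assumes at least one value. */
    static double scaledMean(double... values) {
        double mean = 0;
        for (double v : values) {
            mean += v / values.length;
        }
        return mean;
    }

    public static void main(String[] args) {
        double max = Double.MAX_VALUE;
        System.out.println(naiveMean(max, max));  // Infinity
        System.out.println(scaledMean(max, max) == max); // true
    }
}
```

The scaled version costs one division per value and can lose precision for very small inputs, which is the kind of efficiency versus accuracy trade-off mentioned above.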

Project Issues and Merged Pull Requests

The initial discussion before submitting the proposal was held in the above JIRA ticket. A proof of concept was also provided as part of the proposal.

Implementation of univariate statistics:

  • PR-46: Added the base interfaces and implemented the minimum statistic.
  • PR-49: Maximum statistic implementation.
  • PR-50: Sum statistic implementation.
  • PR-51: Mean implementation.
  • PR-52: Variance implementation.

NUMBERS-200 was created to report the bug when adding two Sum instances containing special values, and PR-136 was submitted to fix it.

Key Takeaways:

  • How to raise PRs, along with a better understanding of git.
  • Learnt the declarative style of programming and Java's functional programming features.
  • How to debug and fix code issues.
  • Learnt JUnit5 and how to meet the coverage checks.
  • Studied various algorithms and their advantages and disadvantages while computing the stats Sum, Mean and Variance.
  • Improved my understanding of floating-point numbers and their representation, and learnt about the unit in the last place (ulp).
  • The importance of documentation.
  • How to work and contribute to the open-source community.
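
For readers unfamiliar with the ulp mentioned above, here is a minimal sketch of an ulp-based comparison built on the standard `Math.ulp` method. The helper `withinUlps` and the class `UlpDemo` are hypothetical and are not part of the project or of JUnit.

```java
/** Hypothetical helper for comparing doubles by units in the last place. */
final class UlpDemo {
    /** True if a and b differ by at most maxUlps ulps of the larger magnitude. */
    static boolean withinUlps(double a, double b, int maxUlps) {
        double ulp = Math.ulp(Math.max(Math.abs(a), Math.abs(b)));
        return Math.abs(a - b) <= maxUlps * ulp;
    }

    public static void main(String[] args) {
        double x = 0.1 + 0.2;
        System.out.println(x == 0.3);               // false
        System.out.println(withinUlps(x, 0.3, 1));  // true
        System.out.println(Math.ulp(1.0) == 0x1.0p-52); // true
    }
}
```

Tests for numerical code often assert agreement within a small number of ulps rather than exact equality, because results can legitimately differ by a rounding step.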

Acknowledgements:

I am incredibly grateful for the chance to take part in GSoC and contribute to The Apache Software Foundation. This was a great opportunity to develop my programming skills.

I am incredibly appreciative of my mentors for their continual support, thorough code reviews, and participation in regular discussions to answer my queries. Special mention goes to my mentor Alex, who helped me a great deal in understanding the various overflow issues and guided me in resolving them. He also provided a great deal of information about the various algorithms for computing the stats, which really helped me understand them better.

The Apache community has been very helpful, and their efforts to improve the developer experience have made me very grateful to be a part of their organization.
