@Kriyszig
Last active Nov 13, 2019
Google Summer of Code 2019 Workproduct

Google Summer of Code 2019
D Programming Language - DataFrame Project

Complete Details

Community Bonding Period

This was the time I tried to understand D and its ecosystem thoroughly. I started working on the fundamental parts of the DataFrame, mainly the internal structure and the I/O operations: CSV parsing and terminal display.

The initial effort was to create a homogeneous DataFrame, with all cells having the same data type. My mentors, Mr. Nicholas Wilson and Mr. Ilya Yaroshenko, helped me dig deeper into D's standard libraries.

There was a brief period of misinterpretation and miscommunication, but once everything was resolved I resumed work on the DataFrame project, now named Magpie.

The initial developments during the Community Bonding Period are listed in the separate dev branch here.

The major developments during the period:

  • Created the internal representation for Homogeneous DataFrame
  • Display function to print the DataFrame in terminal
  • Initial CSV parser
  • Initial CSV writer to format DataFrame to CSV strings

First Coding Phase

The first comment by Mr. John Hall in the forum thread pointed out the glaring flaw in the DataFrame implementation: what if you need to manipulate the data in each column differently? With a completely homogeneous approach, the data type conversion would have led either to a massive processing overhead or to a massive waste of space.

It was time to go back to the drawing board and design a DataFrame that could hold both homogeneous and heterogeneous data. This led to an early redesign:

  • Instead of using Slice from mir-algorithm as the DataType, we use a TypeTuple of arrays of the DataType of each column element.
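
To make the idea concrete, here is a minimal sketch of per-column typed storage. It is written in Python rather than D and is not Magpie's actual layout; the column names, the `columns` dictionary and the `row` helper are assumptions made purely for illustration.

```python
from array import array

# Hypothetical layout: one typed array per column, so each column keeps its
# own element type instead of forcing the whole table into a single type.
columns = {
    "id": array("q", [1, 2, 3]),             # 64-bit signed integers
    "score": array("d", [9.5, 7.25, 8.0]),   # double-precision floats
    "name": ["ada", "bob", "cy"],            # strings stay in a plain list
}

def row(columns, i):
    """Collect the i-th cell of every column into one row tuple."""
    return tuple(col[i] for col in columns.values())

first_row = row(columns, 0)
```

Each column can then be processed with operations suited to its own type, without converting the rest of the table.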

This redesign was brought about by this Pull Request - #3, which consisted of:

  • Using TypeTuples for DataFrame data
  • Updating the parser, display function and CSV writer to work after the changes.
  • Using std.algorithm functions wherever possible.
  • Added unittests to check consistency of read, write and display.

The next big feature request mentioned in the original project statement was binary operations on DataFrame.

Binary operations on DataFrame use an intermediate structure called Axis, which is used to transfer data. Implementing binary operations on the Axis structure indirectly provided binary operations for DataFrame.

The binary operations were introduced in this Pull Request - #4
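
The role of the intermediate structure can be sketched as follows. This is a Python illustration under assumptions, not the D implementation: the `Axis` class here, its `data` field and the `_combine` helper are hypothetical, and only the general idea (extract a row or column, operate on it element-wise, write the result back) comes from the description above.

```python
import operator
from dataclasses import dataclass

@dataclass
class Axis:
    """Hypothetical stand-in for the intermediate structure: one row or column."""
    data: list

    def _combine(self, other, op):
        # Element-wise operation; both axes must line up cell for cell.
        if len(self.data) != len(other.data):
            raise ValueError("axes must have the same length")
        return Axis([op(a, b) for a, b in zip(self.data, other.data)])

    def __add__(self, other):
        return self._combine(other, operator.add)

    def __mul__(self, other):
        return self._combine(other, operator.mul)

frame = {"a": [1, 2, 3], "b": [10, 20, 30]}
# Pull two columns out as Axis values, combine them, and store the result back.
frame["sum"] = (Axis(frame["a"]) + Axis(frame["b"])).data
```

Because the arithmetic lives on the axis type, the table itself only needs to know how to hand out and accept axes.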

End of First Evaluation : Everything was on track and running smoothly. I had started to use more of D's features and to leave my strong urge to write C-like code behind.

Second Coding Phase

The beginning of the second phase brought some useful features to the DataFrame: apply, to manipulate the values in a column or row based on the result of the alias; an option to convert a level of the data index to a data column; and the much-needed fastCSV, a faster parser for CSV files.

These changes were included in this Pull Request - #5.
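
As a rough picture of what apply does, the sketch below maps a caller-supplied function over one column. It is Python, not D: in D the function is passed as an alias, whereas here it is an ordinary callable, and the `apply` signature and the `prices` table are assumptions for illustration only.

```python
def apply(column, fn):
    """Return a new column with fn applied to every cell."""
    return [fn(value) for value in column]

prices = {"price": [10.0, 20.0, 30.0]}
# Double every value in the "price" column.
prices["price"] = apply(prices["price"], lambda p: p * 2)
```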

The biggest feature of all was the implementation of groupBy. Grouping is one of the most fundamental parts of a DataFrame. It helps us quickly cluster useful information from a large pile of highly descriptive data.

With groupBy came the Group structure, which stores data in a way similar to the internal structure of a DataFrame. Group implemented similar array-like properties to access and modify the values of cells in the Group. [Also added the shorthand operations which I had forgotten to implement earlier 😅]. After this came the merge operation, to merge two or more groups into one DataFrame.

Items addressed here were:

  • Group and grouping based on arbitrary number of columns
  • Retrieval of each group as a separate DataFrame
  • Index operation on Group
  • Binary Operation on Group
  • Shorthand operation for both DataFrame and Group

These changes were added in this Pull Request - #6
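
Grouping on an arbitrary number of columns can be sketched like this. The code is a Python illustration, not the Group structure itself; the `group_by` function, the row dictionaries and the sample data are all hypothetical.

```python
from collections import defaultdict

def group_by(rows, keys):
    """Map each distinct combination of key values to the rows that share it."""
    groups = defaultdict(list)
    for r in rows:
        groups[tuple(r[k] for k in keys)].append(r)
    return dict(groups)

rows = [
    {"city": "Pune", "year": 2018, "sales": 5},
    {"city": "Pune", "year": 2019, "sales": 7},
    {"city": "Agra", "year": 2018, "sales": 3},
    {"city": "Pune", "year": 2018, "sales": 4},
]
# Group on two columns at once; each group can be pulled out on its own.
groups = group_by(rows, ["city", "year"])
pune_2018 = groups[("Pune", 2018)]
```

Each value in `groups` plays the role of one group retrieved as a separate table.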

Given that Mir is a popular library in the D ecosystem for efficient and seamless mathematical calculation, a way to fetch data from a DataFrame as a Slice seemed essential. Hence, in this Pull Request - #8, a way to retrieve DataFrame data as a Slice was added.

What this PR brought:

  • Getting entire arithmetic data in DataFrame or Group as Slice
  • Getting a row of data as Slice
  • Getting a column of data as Slice

End of Phase 2 Evaluation : groupBy, merge and some other additional features were added to DataFrame and the newly formed structure Group. Some places could be optimized - especially the TypeTuple implementation for a homogeneous DataFrame, which could be replaced with a simple array to avoid static traversal to reach a particular column. The next step was to both add features and improve the existing ones.

Third Coding Phase

Optimizations and Aggregate operations were handled in the same Pull Request - #11.

The optimizations replaced the TypeTuple implementation with an array implementation for a homogeneous DataFrame. This reduced the traversal overhead when a particular column was targeted. The overhead existed because Tuple elements can only be accessed using static indexes. Supporting runtime indexing on top of static element access meant traversing statically over the possible column index values and accessing the element when the static value became equal to the runtime value.

Aggregate operations were introduced to do mathematical calculations on a DataFrame without the need to convert it to a Slice. In the preliminary implementation, the aggregate functions were predefined, but it was soon realized that this might not be the best approach.

After some brief tinkering and a read of the documentation, Aggregate was modified to take a function as a parameter and use it to perform the calculation.

A few more optimizations involved using std.algorithm functions instead of the hand-written approaches used previously. A simpler way to assign row and column indexes was also implemented.
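
The difference between predefined aggregates and a function parameter can be illustrated with a short Python sketch. It does not reproduce Magpie's Aggregate signature; the `aggregate` function and the `table` data are assumptions.

```python
from statistics import mean

def aggregate(frame, fn):
    """Reduce every column to a single value using the caller-supplied function."""
    return {name: fn(values) for name, values in frame.items()}

table = {"a": [1, 2, 3], "b": [10, 20, 30]}
# The same entry point serves any reduction the caller passes in.
totals = aggregate(table, sum)
averages = aggregate(table, mean)
largest = aggregate(table, max)
```

With this shape, supporting a new reduction needs no change to the aggregate code itself.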

The final PR of the final stage was this Pull Request - #12, which brings support for filtering operations on DataFrame and Group: one can specify a condition, and rows are dropped automatically depending on whether they pass or fail it.
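
Row filtering on a column-oriented table can be sketched as below. This is a Python illustration under assumptions, not the D implementation; `filter_rows`, the predicate convention (a row passed as a dictionary) and the sample data are hypothetical.

```python
def filter_rows(frame, predicate):
    """Keep only the rows for which predicate(row) is true."""
    names = list(frame)
    # Walk the table row by row, keeping rows that satisfy the condition.
    kept = [r for r in zip(*frame.values()) if predicate(dict(zip(names, r)))]
    # Rebuild the column-oriented layout from the surviving rows.
    return {name: [r[i] for r in kept] for i, name in enumerate(names)}

scores = {"name": ["ada", "bob", "cy"], "score": [9.5, 4.0, 7.0]}
passed = filter_rows(scores, lambda row: row["score"] >= 5)
```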

Technical Overview

  • Work Done in Community Bonding Period - dev
  • Redesign DataFrame to allow heterogeneous types for each column - #3
  • Binary Operations on DataFrame - #4
  • Apply operation, conversion of column to Index, faster parser - #5
  • groupBy, merge, binary operation, shorthand operations - #6
  • Slice integration with DataFrame and Group - #8
  • Optimizations and Aggregate operations - #11
  • Filter Operation for DataFrame - #12

Documentation is for now restricted to README.md, but once the doc strings are modified to fit the correct pattern, a documentation website can be generated automatically.

Thank You

I would first like to thank the D Programming Language Foundation and the numerous people behind it for proposing this wonderful project.

I would like to thank my mentors Mr. Nicholas Wilson and Mr. Ilya Yaroshenko for their support and help. Without them this project would never have been possible. They have helped me all the way, from discovering the great parts of the D Programming Language to debugging long errors. They have helped me at every step of this journey and I'm grateful to them.

I would like to extend a special thanks to Mr. John Hall for sharing his previous work on a similar DataFrame project.

Last but not least, I would like to thank the community around D for their active interest and valuable feedback on this project.

Thank you all for making this a great journey.


knakul853 commented Nov 13, 2019

Nice work
