@Joshua-Dias-Barreto
Last active August 23, 2023 13:04

Final Report

Contributor: Joshua Jose Dias Barreto

Mentors: Cyril Ferlicot, Larisa Safina and Oleksandr Zaitsev

Project: DataFrame Improvements

Organization: Pharo Consortium

Description: The aim of my project is to enhance the functionality and usability of Pharo’s DataFrame library, a powerful tool for data analysis and manipulation. During the first coding period, my focus was on implementing key features and addressing some of the existing limitations in the library. I spent the second coding period enhancing the Pharo AI DataFrame inspector, making it more interactive and user-friendly than ever before.

Important links

DataFrame Repository: https://github.com/PolyMathOrg/DataFrame

Pharo-AI DataFrame Inspector Repository: https://github.com/pharo-ai/data-inspector

Summary of my Pull Requests: https://github.com/orgs/PolyMathOrg/projects/2

Video presentation of my work at the ESUG 2023 conference in Lyon, France: https://youtu.be/hnM0VYGBKl4

Community Bonding Period Blog: https://medium.com/latinxinai/gsoc-community-bonding-period-with-pharo-consortium-3ca1d59b716e

First Coding Period Blog: https://medium.com/latinxinai/gsoc-first-coding-period-with-pharo-consortium-7cd1ab768ea5

Second Period Blog: https://medium.com/@joshuadiasbarreto/gsoc-second-coding-period-with-pharo-consortium-dd5b4f201470

Work done

  • Made the AI DataFrame Inspector interactive 🕵️
    • Editing Capability ✏️ : Users can now modify the contents of individual cells directly from the inspector, for example to correct a typo or update a value. Click the ‘edit’ button, change a cell, and press Enter to confirm each change. Once the changes are made, users can switch back to ‘read’ mode.
    • Search Functionality 🔍 : Searching for specific data within a large data frame can be like finding a needle in a haystack. To address this, I integrated a search bar into the inspector. As the user types, the inspector dynamically displays only the rows that contain the search text.
    • User-Friendly Sorting 📂 : Clicking the ‘sort’ button opens a new window where users can choose the column to sort by (from a drop-down list) and the sort order (ascending, descending, or other).
    • Resizable columns ↔️ : Column widths can be changed by dragging the column borders. This lets users widen the columns they want to read and shrink the ones they do not need.
  • Enhanced Data Representation 👓
    • To convert DataFrame objects to other formats, I implemented the #toMarkdown, #toLatex, #toHtml, and #toString methods. These methods allow users to convert DataFrame objects into various formats for easier sharing, visualization, and integration into different workflows.
  • Improved the Sorting API 📂
    • I implemented a set of methods for chain sorting in the DataFrame library such as #sortByAll: and #sortDescendingByAll:. Users can now define a sequence of columns and their corresponding sort orders, enabling them to create complex sorting rules tailored to their specific requirements.
    • Added methods in the DataFrame library to enable sorting based on row names such as #sortByRowNames, #sortByRowNamesUsing: and #sortDescendingByRowNames.
  • Bug Fixes 🐛
    • #addRow originally didn’t consider key ordering. It now does, so users can add a dataseries to a dataframe with its keys in any order, as long as the keys match the column names.
    • #removeNils and #withoutNils previously did the same thing on a dataseries: both returned a copy of the dataseries without nils. I changed the implementation so that #removeNils removes nils from the original dataseries, while #withoutNils returns a copy without nils.
    • #columns originally returned a collection of arrays. It now returns a collection of DataSeries, because a DataSeries carries more information and offers more methods than an array. This also makes the API more consistent.
    • Statistical methods such as variance, quartiles, standard deviation, etc. would signal errors if nils were present. These methods can now handle nil values.
    • The DataFrame Inspector didn't display row names. It now displays the row names if they are set, and the row indices otherwise.
  • Miscellaneous Methods Added 💻
    • #describe: statistically describes a data frame, listing the mean, quartiles, variance, etc.
    • #encodeOneHot: encodes data into one-hot vectors. It works on all kinds of data: integers, decimals, strings, Roman numerals, etc.
    • #asDataFrame: converts a collection, or a collection of collections, into a data frame.
    • #removeDuplicatedRows: removes duplicate rows from a data frame, keeping only the first occurrence.
    • #numericalColumns: returns only the columns of the data frame that hold numerical data.
    • #numericalColumnNames: returns only the names of the columns of the data frame that hold numerical data.
    • #countNils returns the number of nil values in a dataseries.
    • #countNonNils returns the number of non-nil values in a dataseries.
    • #replaceNilsWithNextRow replaces nils in a data frame with values of the next non-nil row.
  • Documentation 📃
    • Updated the outdated methods in the DataFrame booklet.
    • Added a section on ‘Handling Nil Values’ in DataFrames and DataSeries in the DataFrame booklet.
    • Added runnable examples and comments for over 150 methods in the DataFrame and DataSeries classes.
  • Tutorials 🎓
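
To make the chain sorting idea from ‘Improved the Sorting API’ concrete, here is a minimal sketch in Python rather than Pharo. It is not the DataFrame library's implementation and does not mirror the signatures of #sortByAll: or #sortDescendingByAll:. The function name `sort_by_all`, the `(column, ascending)` pair format and the sample data are all assumptions made for illustration. It shows one common way to sort by a sequence of columns with a per-column order: apply a stable sort once per column, starting from the least significant one.

```python
from typing import Any

Row = dict[str, Any]


def sort_by_all(rows: list[Row], keys: list[tuple[str, bool]]) -> list[Row]:
    """Return a sorted copy of rows.

    keys lists (column, ascending) pairs, most significant first.
    Hypothetical helper for illustration; not part of any library.
    """
    result = list(rows)  # copy, so the caller's list is left untouched
    # list.sort is stable, so sorting by the least significant column first
    # and the most significant column last yields a multi-column ordering.
    for column, ascending in reversed(keys):
        result.sort(key=lambda row: row[column], reverse=not ascending)
    return result


# Made-up sample data
people = [
    {"name": "Ana", "city": "Lyon", "age": 31},
    {"name": "Bo", "city": "Goa", "age": 25},
    {"name": "Cy", "city": "Lyon", "age": 22},
    {"name": "Di", "city": "Goa", "age": 40},
]

# City ascending, then age descending within each city
by_city_then_age = sort_by_all(people, [("city", True), ("age", False)])
print([row["name"] for row in by_city_then_age])  # ['Di', 'Bo', 'Ana', 'Cy']
```

Ties on the first column are broken by the second one, which is the behaviour a user would expect when defining a sequence of columns and sort orders.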

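The difference between #removeNils and #withoutNils described under ‘Bug Fixes’ is the usual distinction between mutating the receiver and returning a modified copy. The sketch below illustrates that distinction in Python, with `None` standing in for nil. The `Series` class and its method names are hypothetical stand-ins, not the Pharo DataSeries API.

```python
class Series:
    """Tiny stand-in for a data series; hypothetical, for illustration only."""

    def __init__(self, values):
        self.values = list(values)

    def without_nils(self):
        """Return a new Series without None values; the receiver is unchanged."""
        return Series(v for v in self.values if v is not None)

    def remove_nils(self):
        """Remove None values from this Series in place."""
        self.values[:] = [v for v in self.values if v is not None]


series = Series([1, None, 3])

cleaned = series.without_nils()
values_after_copy = list(series.values)  # snapshot: original still holds None

series.remove_nils()  # now the original itself is modified

print(cleaned.values)      # [1, 3]
print(values_after_copy)   # [1, None, 3]
print(series.values)       # [1, 3]
```

Keeping both variants lets users choose between preserving the original data and cleaning it in place.
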
Acknowledgements

I would like to extend my heartfelt gratitude to my dedicated mentors throughout my Google Summer of Code journey with the Pharo Consortium. Your guidance and continuous support have been instrumental in shaping my growth as a developer. Your insightful feedback on my PRs, patient explanations, and willingness to share your knowledge have been invaluable resources that enriched my learning experience. It has been an honour to work under your mentorship, and I am truly thankful for the opportunity to learn from the best in the field. This GSoC experience would not have been as fulfilling without you, and I am excited to carry the lessons and skills I've gained into my future endeavours. Thank you for being exceptional mentors and for making this journey truly memorable.
