Ruby Association Grant 2022 Final Report [RedAmber - A simple dataframe library for Ruby]

Hirokazu SUZUKI (@heronshoes) March 20th, 2023
I am developing a data frame library named RedAmber, based on Red Arrow, which is a Ruby implementation of Apache Arrow (an in-memory columnar data processing framework).
This is the final report on RedAmber development for the Ruby Association Grant 2022, covering October 2022 through March 2023.
In this period I released four versions. The changelogs and release notes are listed below.

v0.2.3 (2022-11-16)
v0.3.0 (2022-12-18)
v0.4.0 (2023-02-25)
v0.4.1 (2023-03-11)

I will report the activities in these sections.

New features
Feedback to the Red Arrow project
Improvement in performance
Improvement in code quality
Improvement in documentation
Promotional activities

1. New features

1.1 Implementation of methods to join dataframes

I implemented operations for joining a dataframe with another dataframe in the DataFrameCombinable module.

Introduced in v0.2.3.
Added the ability to specify join keys with a Hash in v0.3.0.
Added sorting after join, controlled by the :force_order option, in v0.4.0.

Mutating join and Filtering join


I built R-style join functions on top of Red Arrow's Table#join by implementing methods such as #left_join with the :type option preset.
Since Red Arrow keeps both the left and right columns, I used the :left_outputs and :right_outputs options to keep only the columns needed, and merged them into a single column where necessary.
Red Arrow allows duplicate column names, but duplicate column names (keys) are generally not allowed in data frames or RDB tables, so RedAmber also renames duplicated keys with a suffix. The default is suffix: '.1'. Only the column names of the other data frame are renamed, and if a renamed key still collides, succ is used to find an unused name. Renaming both self and other would be excessive, and in Ruby the join is a method call on self, so it is natural for self to keep its names.
If join_key is omitted, the join automatically uses the column names common to both tables (natural join). This behavior was proposed to Red Arrow's Table and merged (GH-15088).
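The renaming rule for duplicated keys can be sketched in plain Ruby. This is a minimal illustration under assumptions, not RedAmber's actual implementation: the helper name rename_duplicated_keys is hypothetical, keys are Symbols, and succ is applied to the suffixed name until it no longer collides.

```ruby
# Hypothetical helper (not RedAmber's API): decide the key names of `other`
# after a join. Keys of self are never renamed; join keys are shared.
def rename_duplicated_keys(self_keys, other_keys, join_keys, suffix: '.1')
  taken = self_keys.dup
  other_keys.map do |key|
    next key if join_keys.include?(key) # merged into a single column

    candidate = key
    if taken.include?(candidate)
      candidate = :"#{key}#{suffix}"
      # Assumption: advance with succ while the name is still in use.
      candidate = candidate.succ while taken.include?(candidate)
    end
    taken << candidate
    candidate
  end
end

p rename_duplicated_keys(%i[id name], %i[id name score], %i[id])
# => [:id, :"name.1", :score]
```

Only the keys of the other table change, so code that refers to the columns of self keeps working after the join.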
Set operation and Binding

Set operations and binding operations (vertical and horizontal) are also built on Arrow's Table#join.

Although the vocabulary is generally similar to R's, the difference operation uses #difference as the primary method name instead of #setdiff, matching Ruby's Array. #concatenate is assigned to vertical concatenation, which makes the columns longer, and #merge to horizontal concatenation, which makes the rows longer.
Sorting after join

In Arrow's implementation, the order of records (rows) after a join is undefined. This becomes more apparent as the number of records grows (the result differs on each execution).
This is advantageous for execution speed, but conventional data frame libraries and database systems seem to give priority to the left table and maintain its row order, so an undefined order may not be intuitive. It is also inconvenient when writing tests.
Polars, a Rust data frame library based on Arrow2 (another implementation that can handle the Arrow format), has an option to keep the original row order. RedAmber adopted the same feature in v0.4.0. Index columns are added to the left and right dataframes before the join, the result is sorted by them, and the index columns are then removed. This is not very fast, but I made sorting the default behavior in RedAmber because it gives clear results.
In v0.4.1, sorting is enabled for all the join-based methods among the mutating joins and filtering joins.
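The index-column technique described above can be sketched in plain Ruby. This is an illustration only, not RedAmber's code: a table is modeled as an Array of row Hashes, the names ordered_left_join, _left_index and _right_index are hypothetical, and shuffle stands in for a join engine whose output order is undefined.

```ruby
# Plain-Ruby sketch: rows are Hashes, a table is an Array of rows.
def ordered_left_join(left, right, key)
  # 1. Tag every row with its original position.
  left_rows  = left.each_with_index.map  { |row, i| row.merge(_left_index: i) }
  right_rows = right.each_with_index.map { |row, i| row.merge(_right_index: i) }

  # 2. Join; shuffling stands in for an engine with undefined output order.
  lookup = right_rows.group_by { |row| row[key] }
  joined = left_rows.flat_map do |l|
    lookup.fetch(l[key], [{}]).map { |r| l.merge(r) }
  end.shuffle

  # 3. Sort by the index columns, then drop them.
  joined
    .sort_by { |row| [row[:_left_index], row[:_right_index] || -1] }
    .map { |row| row.reject { |k, _| %i[_left_index _right_index].include?(k) } }
end

left  = [{ id: 3, x: 'c' }, { id: 1, x: 'a' }, { id: 2, x: 'b' }]
right = [{ id: 1, y: 10 }, { id: 3, y: 30 }]
p ordered_left_join(left, right, :id)
```

The result always follows the row order of the left table (ids 3, 1, 2), at the cost of an extra sort.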
1.2 Separating/Merging columns

The Vector#split_* methods are implemented to split each element of a vector by whitespace or an arbitrary separator, either dividing it into multiple vectors or generating a single vector extended in the length direction (implemented in v0.3.0). Vector#merge is also implemented to generate a vector whose elements are strings concatenated element-wise with another vector or a scalar (implemented in v0.3.0).
split_to_columns(sep = nil, limit = 0)


Split each element of a vector by whitespace or a separator and return multiple vectors.

vector = RedAmber::Vector.new(['a b', 'c d', 'e f'])
vector
#=>
#<RedAmber::Vector(:string, size=3):0x0000000000050014>
["a b", "c d", "e f"]

vector.split_to_columns
#=>
[#<RedAmber::Vector(:string, size=3):0x0000000000058cc8>
["a", "c", "e"]
,
 #<RedAmber::Vector(:string, size=3):0x0000000000058cdc>
["b", "d", "f"]
]
This method can be used to split a data frame column by a specific character.
RedAmber::DataFrame.new(year_month: %w[2023-01 2023-02 2023-03])
  .assign(:year, :month) { year_month.split_to_columns('-') }
#=>
#<RedAmber::DataFrame : 3 x 3 Vectors, 0x0000000000078ed8>
  year_month year     month
  <string>   <string> <string>
0 2023-01    2023     01
1 2023-02    2023     02
2 2023-03    2023     03
If sep is omitted, this method uses Arrow's ascii_split_whitespace() to split the elements of an Arrow::StringArray on whitespace at high speed. If sep is specified, Ruby's String#split is used instead, which allows flexible splitting, for example by a regular expression.
RedAmber::DataFrame.new(yearmonth: %w[202301 202302 202303])
  .assign(:year, :month) { yearmonth.split_to_columns(/(?=..$)/) }
#=>
#<RedAmber::DataFrame : 3 x 3 Vectors, 0x0000000000078eec>
  yearmonth year     month
  <string>  <string> <string>
0 202301    2023     01
1 202302    2023     02
2 202303    2023     03
split_to_rows(sep = nil, limit = 0)


Each element of a vector is split by whitespace or an arbitrary separator, and the pieces are returned as a single vector extended in the length direction.

vector
#=>
#<RedAmber::Vector(:string, size=3):0x0000000000050014>
["a b", "c d", "e f"]

vector.split_to_rows
#=>
#<RedAmber::Vector(:string, size=6):0x00000000000809d0>
["a", "b", "c", "d", "e", "f"]
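For readers who want to see what the two split methods compute, here is a plain-Ruby equivalent on Arrays. It is only a sketch: the method names mirror the Vector methods but are stand-ins defined here, and the Arrow-based fast path for whitespace is not modeled.

```ruby
# Stand-in for Vector#split_to_columns on an Array of Strings.
# Every element must split into the same number of pieces,
# otherwise Array#transpose raises IndexError.
def split_to_columns(strings, sep = nil)
  strings.map { |s| s.split(sep) }.transpose
end

# Stand-in for Vector#split_to_rows: all pieces in one flat Array.
def split_to_rows(strings, sep = nil)
  strings.flat_map { |s| s.split(sep) }
end

p split_to_columns(['a b', 'c d', 'e f'])
# => [["a", "c", "e"], ["b", "d", "f"]]
p split_to_columns(%w[202301 202302 202303], /(?=..$)/)
# => [["2023", "2023", "2023"], ["01", "02", "03"]]
p split_to_rows(['a b', 'c d', 'e f'])
# => ["a", "b", "c", "d", "e", "f"]
```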
merge(other, sep: ' ')


Generate a vector whose elements are the elements of self concatenated, element by element, with a string or another vector.

vector = RedAmber::Vector.new(%w[a c e])
other = RedAmber::Vector.new(%w[b d f])
vector.merge(other)
#=>
#<RedAmber::Vector(:string, size=3):0x00000000000a530c>
["a b", "c d", "e f"]

vector.merge('x', sep: '')
#=>
#<RedAmber::Vector(:string, size=3):0x00000000000b1008>
["ax", "cx", "ex"]
1.3 SubFrames - Ordered sub sets of a DataFrame

As a new concept not found in conventional data frames, the SubFrames class was designed and implemented as an experimental feature.

SubFrames is an object that holds subsets of a DataFrame and has an iterator that returns the elements as DataFrames.
SubFrames unifies the functions of existing dataframe/database groups (e.g., group_by or groupby), rolling window functions (e.g., moving average), and window functions for element-by-element processing. SubFrames are characterized by a notation that can be applied as is to ordinary DataFrames through the use of Ruby-like iterators, as well as the unified handling of window functions that process each element.
The figure above shows an example of SubFrames for the grouping of dataframes. The left half is essentially the same as the existing group_by operation, which groups data frames according to the value of column y from the base data frame. The SubFrames object returns each of the separated data frames. The right half is an aggregation of each data frame into a single row. An operation on a single data frame element can be written as an aggregate operation for the entire data frame in SubFrames.
Grouping by class Group

The conventional grouping operation is based on the idea of aggregating from the original data frame via a "grouped state".
The same is true for RedAmber's Group class, which originates from the Group of Red Arrow's Table.

The figure above shows an example of aggregation by RedAmber's Group class, which maintains a dataframe-like state separated by column values, but with Group-specific behavior.
The following is a comparison of the conventional Group class and SubFrames.

Group

Information on the key columns that were the source of the group partitioning is available.
Grouping and aggregation are integrated, and there is an engine (Acero) that performs them efficiently.


SubFrames

SubFrames is a more generalized concept: the method of separation is not limited to grouping by value, so it has a wider range of applications.
While it is possible to generate SubFrames from SubFrames, it is not possible to preserve information on the source key columns of the grouping.
The source key of the grouping must be written explicitly when outputting, but it is simple to write the desired key name and column Vector to get the result. The notation is common with DataFrame's rename and assign.


(As a reference) Examples of distinctive groupings in other languages

R's tidyr introduces the concept of "nest" as shown below.

Quoted from [Data tidying with tidyr : : CHEAT SHEET](https://raw.githubusercontent.com/rstudio/cheatsheets/main/tidyr.pdf) (CC BY SA Posit Software, PBC)
A nested dataframe incorporates the original data frames as nested rows of an aggregated dataframe.
Simpler aggregate functions

In Arrow's Compute Function, aggregate functions for normal tables (e.g., sum) are different from those for groups (e.g., hash_sum).
In RedAmber, however, aggregate methods for Vectors in DataFrames can be used in SubFrames without modification.
Also, if you write a new aggregation function for a Vector, it can be applied directly to SubFrames, making it easy to create your own aggregation process.
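The point that one aggregate function serves both the whole column and each subset can be illustrated in plain Ruby. This is a sketch with Arrays and Hashes, not the RedAmber API; value_range and the sample rows are made up for the illustration.

```ruby
# A user-defined aggregation written once for a single column (an Array here).
def value_range(values)
  values.max - values.min
end

rows = [
  { y: 'A', x: 1 }, { y: 'A', x: 4 },
  { y: 'B', x: 2 }, { y: 'B', x: 9 }, { y: 'B', x: 5 }
]

# Applied to the whole data set:
whole = value_range(rows.map { |row| row[:x] })

# Applied, unchanged, to each subset grouped by the value of y:
by_group = rows.group_by { |row| row[:y] }
               .transform_values { |sub| value_range(sub.map { |row| row[:x] }) }

p whole     # => 8
p by_group  # => {"A"=>3, "B"=>7}
```

No group-specific variant of the function (such as a hash_ version) is needed, because each subset has the same shape as the whole.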
Rolling Window

The same concept can be applied to a rolling window where operations are performed by sequentially acquiring rows of the original data frame.

In the example above, a window of size 3 is applied to the original DataFrame to obtain SubFrames consisting of 4 DataFrames.
Next #aggregate is used to aggregate one DataFrame into a single row.
x.mean is therefore a moving average operation. Obtaining a moving average consists of separating the data by the window and calculating the average with the Vector method #mean.
Therefore, the function that calculates the mean does not need to be special. Unlike the grouping described above, the contents of these SubFrames overlap each other.
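The decomposition of a moving average into "separate by window" and "ordinary mean" can be sketched in plain Ruby with Enumerable#each_cons. The six input values are hypothetical, chosen so that a window of size 3 yields four windows as in the description above; this is not RedAmber code.

```ruby
# An ordinary mean; nothing in it knows about windows.
def mean(values)
  values.sum.fdiv(values.size)
end

# Separate the data into overlapping windows, then aggregate each one.
def moving_average(values, size)
  values.each_cons(size).map { |window| mean(window) }
end

x = [1, 2, 3, 4, 5, 6]
p x.each_cons(3).to_a
# => [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
p moving_average(x, 3)
# => [2.0, 3.0, 4.0, 5.0]
```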
Element-wise operations (Window)

An example of an operation to create new SubFrames from SubFrames is shown below.

In this example, sequential numbers starting from 1 and the cumulative sum are generated for each group in the SubFrames created by the grouping by value introduced above.
Since #indeces and #cumsum are not aggregate functions of Vector but return a value for each element, the result is again SubFrames.
The bottom left of the figure above is an example of the same operation on the entire data frame by DataFrame#assign.
Compared to SubFrames#assign on the right, the operation is performed in exactly the same way, not only in the method but also in the contents of the block. This consistency is the greatest feature of SubFrames.
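The consistency between the whole-table operation and the per-subset operation can be sketched in plain Ruby. The data and the method name number_and_accumulate are hypothetical, and rows are modeled as Hashes rather than RedAmber objects.

```ruby
# Element-wise operation written once for a table (an Array of row Hashes):
# adds a 1-based sequential number and a cumulative sum of x.
def number_and_accumulate(rows)
  total = 0
  rows.each_with_index.map do |row, i|
    total += row[:x]
    row.merge(number: i + 1, cumsum: total)
  end
end

rows = [
  { y: 'A', x: 1 }, { y: 'B', x: 2 }, { y: 'A', x: 3 },
  { y: 'B', x: 4 }, { y: 'A', x: 5 }
]

# On the entire table:
whole = number_and_accumulate(rows)
p whole.map { |row| row[:cumsum] }
# => [1, 3, 6, 10, 15]

# On each subset grouped by the value of y, with the very same method:
subsets = rows.group_by { |row| row[:y] }.values
              .map { |sub| number_and_accumulate(sub) }
p subsets.map { |sub| sub.map { |row| row[:cumsum] } }
# => [[1, 4, 9], [2, 6]]
```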
Other examples to get SubFrames

SubFrames can also be generated from Ruby's Enumerator (left). They can also be generated from a masked window (called a kernel) (right).
The kernel can be used, for example, to obtain pairs of data from daily data for "the same weekday of the previous week and today".
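A masked window can be pictured in plain Ruby as follows. This is a sketch under assumptions: the kernel is an Array of booleans of length 8, so that the first and last positions of each window are seven days apart, and the daily values are made up. It does not use the SubFrames API.

```ruby
# Slide a window of kernel.size over the values and keep only the
# positions where the kernel is true.
def apply_kernel(values, kernel)
  values.each_cons(kernel.size).map do |window|
    window.select.with_index { |_, i| kernel[i] }
  end
end

daily = (1..10).to_a # hypothetical daily values, oldest first
same_weekday = [true, false, false, false, false, false, false, true]

p apply_kernel(daily, same_weekday)
# => [[1, 8], [2, 9], [3, 10]]
```

Each resulting pair holds the value from the same weekday of the previous week and the value of the current day.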

Generation from arrays of indices

SubFrames can also be generated based on arrays of indices starting from 0.

The example above shows how to separate even-numbered rows from odd-numbered rows.
And the example below shows how to randomly extract two adjacent rows.
In this way, SubFrames can be applied to a large number of use cases.
(As a reference:) The example above can also be written as follows using Array#partition.
df.build_subframes do
  (0...size).partition(&:odd?)
end
Changing &:odd? to &:even? will reverse the order in SubFrames.
Current Implementation

The current implementation of SubFrames is written in pure Ruby, and it actually generates the sub data frames internally, even though it partially uses lazy generation.
Therefore, it is more than 70 times slower than the Group class, which calls Acero in C++ for the same operation.
In the future, I would like to work with upstream to speed up SubFrames. The idea is to make it look to the user as if there are internal data frames, while in reality they are represented by Record Batches or by a set of index arrays.
1.4 Add Vector's methods

1.4.1 Vector#modulo, #remainder

Although modulo and remainder are relatively common functions, Arrow C++'s compute function divmod() has been in a draft state for a long time, so they have not been available in RedAmber (one had to use #map with Numeric#modulo and Numeric#remainder in pure Ruby).
So I combined the existing #divide, #floor, #trunc, #multiply, and #subtract to create #modulo and #remainder.
integer = RedAmber::Vector.new(1, 2, 3)
divisor = RedAmber::Vector.new(2, 2, 2)
integer.modulo(divisor)

=>
#<RedAmber::Vector(:uint8, size=3):0x00000000002c46d8>
[1, 0, 1]
Languages differ in how they handle negative numbers. RedAmber's behavior is the same as Ruby's: the results of #modulo and #% have the same sign as the divisor, and the result of #remainder has the same sign as the dividend. This differs from Numo/NArray's #%, whose result has the same sign as the dividend.
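The composition from divide, floor or trunc, multiply and subtract can be checked in plain Ruby on Arrays. This is a sketch of the arithmetic only, assuming element-wise operation on equal-length Arrays of Integers; it does not call Arrow's compute functions.

```ruby
# modulo: x - floor(x / d) * d  (result takes the sign of the divisor)
def modulo(xs, ds)
  xs.zip(ds).map { |x, d| x - (x.fdiv(d).floor * d) }
end

# remainder: x - trunc(x / d) * d  (result takes the sign of the dividend)
def remainder(xs, ds)
  xs.zip(ds).map { |x, d| x - (x.fdiv(d).truncate * d) }
end

xs = [7, -7, 7, -7]
ds = [3, 3, -3, -3]

p modulo(xs, ds)     # => [1, 2, -2, -1]
p remainder(xs, ds)  # => [1, -1, 1, -1]

# Both agree with Ruby's built-in Integer methods.
p xs.zip(ds).map { |x, d| x.modulo(d) }     # => [1, 2, -2, -1]
p xs.zip(ds).map { |x, d| x.remainder(d) }  # => [1, -1, 1, -1]
```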
1.4.2 Vector#propagate

The #propagate method was introduced to use aggregate functions as element-wise functions: the aggregated value is repeated for every element.
This is useful in combination with SubFrames.
import_cars

=>
#<RedAmber::DataFrame : 5 x 6 Vectors, 0x0000000000390af8>
     Year    Audi     BMW BMW_MINI Mercedes-Benz      VW
  <int64> <int64> <int64>  <int64>       <int64> <int64>
0    2017   28336   52527    25427         68221   49040
1    2018   26473   50982    25984         67554   51961
2    2019   24222   46814    23813         66553   46794
3    2020   22304   35712    20196         57041   36576
4    2021   22535   35905    18211         51722   35215

import_cars
  .to_long(:Year, name: :Manufacturer, value: :n_of_imported)
  .sub_by_value(keys: :Year)
  .assign do
    { sum_by_Year: n_of_imported.propagate(:sum) }
  end

=>
#<RedAmber::SubFrames : 0x000000000067ddd8>
@baseframe=#<Enumerator::Lazy:size=5>
5 SubFrames: [5, 5, 5, 5, 5] in sizes.
---
#<RedAmber::DataFrame : 5 x 4 Vectors, 0x000000000067ddec>
      Year Manufacturer  n_of_imported sum_by_Year
  <uint16> <string>           <uint32>    <uint32>
0     2017 Audi                  28336      223551
1     2017 BMW                   52527      223551
2     2017 BMW_MINI              25427      223551
3     2017 Mercedes-Benz         68221      223551
4     2017 VW                    49040      223551
---
#<RedAmber::DataFrame : 5 x 4 Vectors, 0x000000000067de00>
      Year Manufacturer  n_of_imported sum_by_Year
  <uint16> <string>           <uint32>    <uint32>
0     2018 Audi                  26473      222954
1     2018 BMW                   50982      222954
2     2018 BMW_MINI              25984      222954
3     2018 Mercedes-Benz         67554      222954
4     2018 VW                    51961      222954
---
#<RedAmber::DataFrame : 5 x 4 Vectors, 0x000000000067de14>
      Year Manufacturer  n_of_imported sum_by_Year
  <uint16> <string>           <uint32>    <uint32>
0     2019 Audi                  24222      208196
1     2019 BMW                   46814      208196
2     2019 BMW_MINI              23813      208196
3     2019 Mercedes-Benz         66553      208196
4     2019 VW                    46794      208196
---
#<RedAmber::DataFrame : 5 x 4 Vectors, 0x000000000067de28>
      Year Manufacturer  n_of_imported sum_by_Year
  <uint16> <string>           <uint32>    <uint32>
0     2020 Audi                  22304      171829
1     2020 BMW                   35712      171829
2     2020 BMW_MINI              20196      171829
3     2020 Mercedes-Benz         57041      171829
4     2020 VW                    36576      171829
---
#<RedAmber::DataFrame : 5 x 4 Vectors, 0x000000000067de3c>
      Year Manufacturer  n_of_imported sum_by_Year
  <uint16> <string>           <uint32>    <uint32>
0     2021 Audi                  22535      163588
1     2021 BMW                   35905      163588
2     2021 BMW_MINI              18211      163588
3     2021 Mercedes-Benz         51722      163588
4     2021 VW                    35215      163588
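What #propagate does to a single column can be shown with a plain Array. This is a minimal sketch, not the Vector implementation: the free function propagate is a stand-in defined here, and the values are the 2017 figures from the table above.

```ruby
# Stand-in for Vector#propagate: aggregate once, then repeat the result
# so that it has the same length as the input.
def propagate(values)
  aggregated = yield(values)
  Array.new(values.size, aggregated)
end

n_of_imported_2017 = [28_336, 52_527, 25_427, 68_221, 49_040]
p propagate(n_of_imported_2017, &:sum)
# => [223551, 223551, 223551, 223551, 223551]
```

Because the result has one value per row, it can be assigned as a new column next to the original data, as in the sum_by_Year column above.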
1.4.3 Vector#sample

Vector#sample is a method that randomly samples the elements of a Vector and returns a new Vector.
#sample takes an Integer or Float argument as the sampling size.

Integer (n) specifies the number of elements to sample.
Float (prop) specifies the number of elements to sample as a ratio to the size of the Vector.
If n <= self.size or prop <= 1.0, sampling is done without repetition.

v = RedAmber::Vector.new('A'..'H')
v

=>
#<RedAmber::Vector(:string, size=8):0x0000000000011b20>
["A", "B", "C", "D", "E", "F", "G", "H"]
When n == size, n elements are randomly arranged without repetition.
v.sample(8)

=>
#<RedAmber::Vector(:string, size=8):0x000000000001bda0>
["H", "D", "B", "F", "E", "A", "G", "C"]
If n > size or prop > 1.0, sampling is done with repetition.
v.sample(2.0)
# =>
#<RedAmber::Vector(:string, size=16):0x00000000000233e8>
["H", "B", "C", "B", "C", "A", "F", "A", "E", "C", "H", "F", "F", "A", ... ]

When called without arguments, a single element is returned as a scalar.

v.sample # => "C"
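The sampling rules above can be summarized in a plain-Ruby sketch on Arrays. It is an illustration under assumptions, not the Vector implementation: the free function sample is defined here, and rounding the Float ratio to the nearest integer is a guess about how prop is converted to a count.

```ruby
# Stand-in for Vector#sample on an Array.
def sample(values, n_or_prop = nil, random: Random.new)
  # No argument: return one element (a scalar).
  return values.sample(random: random) if n_or_prop.nil?

  # Assumption: a Float ratio is converted to a count by rounding.
  n = n_or_prop.is_a?(Float) ? (values.size * n_or_prop).round : n_or_prop

  if n <= values.size
    values.sample(n, random: random)               # without repetition
  else
    Array.new(n) { values.sample(random: random) } # with repetition
  end
end

v = ('A'..'H').to_a
p sample(v, 8).sort == v  # => true (a permutation of v)
p sample(v, 2.0).size     # => 16
p sample(v, 0.5).size     # => 4
```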
2. Feedback to the Red Arrow project

Bugs encountered while developing RedAmber and suggestions for functional improvements are fed back to Red Arrow as needed. I would like to actively transfer basic functionality to Red Arrow.

Bug Reports

Bug report and fix for Homebrew in CI (GH-15093): Merged


Suggestions for improvements to features used in RedAmber

Table#column_names (GH-15089): Merged
Ability to omit join_key in Table#join (GH-15088): Merged
Ability to merge and rename columns in Table#join (GH-15287): @kou's implementation using Acero has been merged.


Feature/Improvement Suggestions

Reduce waiting time in REPL environment by returning self when saving csv with Table#save (GH-15289): Merged
[GLib] Proposed support for 'MatchSubstringOptions' (GH-15285): Suggested improvements
[GLib] Proposed support for 'IndexOptions' (GH-15286): Suggested improvements
[GLib] Proposed support for 'RankOptions' (GH-34425): Implemented by @kou and merged into 12.0.0.
[C++] 'rank()' does not support ChunkedArray (GH-34426): Issue reported


Review

RecordBatch{File,Stream}Reader#each supports the case where there is no block (GH-34440)
Suggested an additional feature so that Enumerator#size can return a non-nil value when the number of iterations is known: Merged


I am especially grateful to Sutou Kouhei (@kou) for his generous support in this activity.
3. Improvement in performance

As the first step, benchmarks were created in v0.2.3 to allow performance comparisons of the main methods between versions.
The benchmarks were written with benchmark_driver, and the data came mainly from the nycflights13 dataset in Rdatasets, which is relatively large.
In the second step, the code was thoroughly revised to improve processing speed by refactoring, such as replacing operations with faster ones, changing the order of processing, and removing unnecessary processing. The following is a version-by-version comparison: v0.3.0 is the version after refactoring, v0.2.3 is the previous version with almost the same functionality, and v0.2.0 is the reference version from before the grant period.
Measurements were taken in the following environment.

distro: Ubuntu 20.04.5 LTS on Windows 11 x86_64
kernel: 5.15.79.1-microsoft-standard-WSL2
cpu: Intel i7-8700K (12) @ 3.695GHz
memory: 30085MiB
Ruby: ruby 3.2.0 (2022-12-25 revision a528908271) +YJIT [x86_64-linux]
Arrow: 10.0.0

3.1 Basic benchmark: Tests for basic data frame operations

Iterations per second (i/s); higher is faster.

| # | Benchmark name | 0.3.0 | 0.2.3 | 0.2.0 | 0.1.5 |
|---|---|---:|---:|---:|---:|
| B01 | Pick([]) by a key name | 434,783 | 8,759 | 9,357 | 202,703 |
| B02a | Pick([]) by key names | 2,530 | 897 | 1,898 | 2,276 |
| B03 | Pick by key names | 2,783 | 653 | 4,374 | 2,311 |
| B04 | Drop by key names | 694 | 352 | 761 | 675 |
| B05 | Pick by booleans | 792 | 383 | 1,094 | 1,005 |
| B06 | Pick by a block | 920 | 386 | 1,346 | 1,091 |
| B07 | Slice([]) by an index | 597 | 445 | 798 | 1,934 |
| B08 | Slice by indices | 51.4 | 47.1 | 51.7 | 56.2 |
| B09 | Slice([]) by booleans | 54.7 | 2.3 | 2.3 | 0.3 |
| B10 | Slice by booleans | 103.3 | 2.3 | 2.2 | 3.0 |
| B11 | Remove by booleans | 78.6 | 2.2 | 2.4 | 2.7 |
| B12 | Slice by a block | 100.9 | 2.4 | 2.3 | 3.0 |
| B13 | Rename by Hash | 804 | 508 | 853 | 737 |
| B14 | Assign an existing variable | 3.2 | 3.2 | 3.3 | 3.4 |
| B15 | Assign a new variable | 3.3 | 3.4 | 3.3 | 3.5 |
| B16 | Sort by a key | 18.5 | 19.3 | 20.0 | 18.4 |
| B17 | Sort by keys | 11.8 | 11.6 | 12.0 | 12.1 |
| B18 | Convert to a Hash | 2.8 | 2.3 | 2.4 | 2.3 |
| B19 | Output in TDR style | 1.3 | 1.3 | 1.3 | 1.3 |
| B20 | Inspect | 17.0 | 14.7 | 16.6 | 1.7 |


There are several cases where the latest version is slower. I believe this is due to the following:

The early versions were not balanced in terms of processing speed.
(e.g., indexing was fast, but filtering was extremely slow)
Branching inside operations takes more time as more features are added.
Example: Pick can now accept indices, column names, or a boolean filter.
(Indices and column names can be mixed, e.g. penguins.pick(0..2, -5, :year))
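To make the cost of this branching concrete, here is a plain-Ruby sketch of resolving mixed pickers into column keys. It is not RedAmber's implementation: resolve_picker is a hypothetical helper, boolean filters are left out, and the key list of the penguins dataset is written out by hand.

```ruby
# Hypothetical helper: turn a mix of ranges, indices and names into keys.
# Every argument goes through a type dispatch, which is the kind of
# branching that costs time compared with a single-purpose method.
def resolve_picker(keys, *args)
  args.flat_map do |arg|
    case arg
    when Range
      keys[arg]
    when Integer
      [keys.fetch(arg)]
    when Symbol, String
      key = arg.to_sym
      raise KeyError, "unknown key: #{key}" unless keys.include?(key)

      [key]
    else
      raise ArgumentError, "unsupported picker: #{arg.inspect}"
    end
  end
end

penguin_keys = %i[
  species island bill_length_mm bill_depth_mm
  flipper_length_mm body_mass_g sex year
]

p resolve_picker(penguin_keys, 0..2, -5, :year)
# => [:species, :island, :bill_length_mm, :bill_depth_mm, :year]
```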

3.2 Combine benchmark: Tests for joining operations

Iterations per second (i/s); higher is faster.

| # | Benchmark name | 0.3.0 | 0.2.3 |
|---|---|---:|---:|
| C01 | Inner join on flights_Q1 by carrier | 106.3 | 0.9 |
| C02 | Full join on flights_Q1 by planes | 0.9 | 0.6 |
| C03 | Left join on flights_Q1 by planes | 70.6 | 0.6 |
| C04 | Semi join on flights_Q1 by planes | 103.9 | 100.5 |
| C05 | Anti join on flights_Q1 by planes | 244.2 | 230.4 |
| C06 | Intersection of flights_1_2 and flights_1_3 | 46.8 | 0.2 |
| C07 | Union of flights_1_2 and flights_1_3 | 0.07 | 0.07 |
| C08 | Difference between flights_1_2 and flights_1_3 | 51.5 | 53.1 |
| C09 | Concatenate flight_Q1 on flight_Q2 | 7,393 | 2,903 |
| C10 | Merge flights_Q1_right on flights_Q1_left | 0.6 | 0.6 |


3.3 Group benchmark: Tests for grouping operations

Iterations per second (i/s); higher is faster.

| # | Benchmark name | 0.3.0 | 0.2.3 | 0.2.2 |
|---|---|---:|---:|---:|
| G01 | sum distance by destination | 119.9 | 122.5 | 120.3 |
| G02 | sum arr_delay by month and day | 168.4 | 155.8 | 140.8 |
| G03 | sum arr_delay, mean distance by flight | 29.6 | 25.6 | 27.8 |
| G04 | mean air_time, distance by flight | 110.5 | 102.0 | 102.9 |
| G05 | sum dep_delay, arr_delay by carrier | 123.6 | 121.3 | 111.0 |


3.4 Reshape benchmark: Tests for the reshaping operations

Iterations per second (i/s); higher is faster.

| # | Benchmark name | 0.3.0 | 0.2.3 | 0.2.2 |
|---|---|---:|---:|---:|
| R01 | Transpose a DataFrame | 3.8 | 3.4 | 3.7 |
| R02 | Reshape to longer DataFrame | 1.5 | 1.6 | 1.6 |
| R03 | Reshape to wider DataFrame | 0.7 | 0.6 | 0.7 |


3.5 Vector benchmark: Tests for Vector functions

Iterations per second (i/s); higher is faster.

| # | Benchmark name | 0.3.0 | 0.2.3 | 0.2.0 |
|---|---|---:|---:|---:|
| V01 | Vector.new from integer Array | 7.2 | 6.0 | 6.4 |
| V02 | Vector.new from string Array | 1.6 | 1.7 | 1.7 |
| V03 | Vector.new from boolean Vector | 1,220,000 | 6.6 | 6.7 |
| V04 | Vector#sum | 11,256 | 11,624 | 10,823 |
| V05 | Vector#* | 1,397 | 1,527 | 1,466 |
| V06 | Vector#[booleans] | 4.8 | 6.8 | 6.8 |
| V07 | Vector#[boolean_vector] | 22.2 | 6.6 | 6.7 |
| V08 | Vector#[index_vector] | 22.0 | 28.0 | 27.6 |
| V09 | Vector#replace | 0.4 | 0.4 | 0.4 |
| V10 | Vector#replace with broadcasting | 0.4 | 0.4 | 0.4 |


3.6 DataFrame benchmark: Tests for series of data frame operation

Iterations per second (i/s); higher is faster.

| # | Benchmark name | 0.3.0 | 0.2.3 | 0.2.0 |
|---|---|---:|---:|---:|
| D01 | Diamonds test | 189.8 | 14.5 | 14.5 |
| D02 | Starwars test | 143.6 | 78.8 | 107.0 |
| D03 | Import cars test | 141.4 | 141.9 | 125.6 |
| D04 | Simpsons paradox test | 45.4 | 3.1 | 3.1 |


Diamonds test: used in RedAmber's README
Starwars test: used in RedAmber's README
Import cars test: used in RedAmber's documentation (DataFrame.md, with some rearrangement)
Simpsons paradox test: introduced in the Qiita post "RedAmber - Ruby's data frame library" (in Japanese)

The iterations per second of these four comprehensive tests are converted to execution times and summed to obtain the total execution time for each version. The rate of change in execution speed is then calculated from these totals, using RedAmber itself for the calculation.
require 'red_amber'

df = RedAmber::DataFrame.load(Arrow::Buffer.new(<<CSV), format: :csv)
  test_name,0.3.0,0.2.3,0.2.0
  D01: Diamonds test,189.817,14.531,14.540
  D02: Starwars test,143.570,78.772,107.044
  D03: Import cars test,141.395,141.861,125.560
  D04: Simpsons paradox test,45.353,3.105,3.133
CSV

df
#=>
#<RedAmber::DataFrame : 4 x 4 Vectors, 0x000000000007e8d8>
  test_name                     0.3.0    0.2.3    0.2.0
  <string>                   <double> <double> <double>
0 D01: Diamonds test           189.82    14.53    14.54
1 D02: Starwars test           143.57    78.77   107.04
2 D03: Import cars test         141.4   141.86   125.56
3 D04: Simpsons paradox test    45.35     3.11     3.13

versions = df.keys[1..]
#=> [:"0.3.0", :"0.2.3", :"0.2.0"]

versions.map { |ver|  (1 / df[ver]).sum } => a
#=> [0.04135511938110967, 0.41062359984495833, 0.4052649554075024]

a[2] / a[0]
#=>
9.799632100508957
Based on the above, v0.3.0 is about 9.8 times as fast as v0.2.0 (980% of its speed) for a basic set of data frame operations in the benchmark, far exceeding my initial target of a 20% performance improvement over v0.2.0.
On a relatively slow machine, such as:

OS: macOS 11.7.2 20G1020 x86_64
Machine: MacBookPro11,1 (Retina, 13-inch, Late 2013)
CPU: Intel i5-4258U (4) @ 2.40GHz
Memory: 5554MiB / 8192MiB

the improvement was even higher, at a ratio of about 11.75 times (1175%) over v0.2.0.
3.7 [Comparison with other languages/libraries] RedAmber/pandas(Python)/tidyverse(R)

Processes equivalent to the DataFrame benchmark above were written in other languages and the execution times were compared.
The comparison used Python's pandas and R's tidyverse (dplyr or tidyr) as commonly used libraries.
The dataframe operations were written to yield the same results as in the DataFrame benchmark above, but the measurements differ from the tests in Section 3.6 because they include deserialization, that is, reading locally available csv or tsv files.
If a library has no corresponding operation, an equivalent operation is used instead (e.g., R has no transpose, so a long-to-wide operation on a different axis is performed instead).
Test execution time (msec), including deserialization; smaller is faster.

| test_name | red_amber | pandas | tidyverse | (Ref) Dataset size |
|---|---:|---:|---:|---|
| Diamonds_test | 28.2 | 80.9 | 243.0 | 53940 x 10 |
| Starwars_test | 19.1 | 19.5 | 43.3 | 87 x 12 |
| Import_cars_test | 17.4 | 18.3 | 48.4 | 5 x 6 |
| Simpsons_paradox_test | 58.0 | 201.0 | 653.5 | 268166 x 4 |


At the very least, RedAmber has an advantage over these conventional data frame libraries in these tests.
The use of Apache Arrow makes data deserialization particularly fast, and the larger the data set, the more noticeable the difference is likely to be. Future comparisons should also use the Arrow extensions for both R and Python.
Future of Benchmarking

As a large, general dataset that can also be used for scalability evaluation, I am experimenting with the machine-synthesized dataset of the "Wisconsin Benchmark", which has been used for database evaluation in the past, but RedAmber is still insufficient for handling large datasets.
The demand for well-designed benchmarks for data frames is expected to increase in the future for the following reasons.

It is more important to compare the entire dataframe processing workflow, not just individual processing nodes.
Unified evaluation of common dataframe and database processing is needed.
Comparing the scalability of processing systems is in growing demand as data volumes increase.
Benchmarks that automatically generate workloads are useful for testing and identifying bottlenecks.

(Reference: FuzzyData: A Scalable Workload Generator for Testing Dataframe Workflow Systems)
4. Improvement in code quality

To measure test coverage, simplecov was introduced in v0.2.3. The coverage was 98.54%, with 43 uncovered lines, at the time of introduction.
I worked on improving the coverage while refactoring the code, and achieved 100% coverage in v0.3.0. I will maintain this coverage in the future.
RuboCop was introduced to ensure code quality, and rubocop-performance and rubocop-rubycw are also enabled to ensure consistency.
Although many of the metrics exceed the default values, I do not turn the cops off globally; instead I disable them for each method or file so that I can see which parts are targets for refactoring.
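As an illustration of disabling a metric locally rather than globally, RuboCop's inline directive comments can wrap a single method. The method below is a made-up example, not code from RedAmber, and the choice of cops is only an assumption about which metrics might be exceeded.

```ruby
# The cops stay enabled for the rest of the project; only this method is
# exempted, so the directive doubles as a marker for future refactoring.
# rubocop:disable Metrics/AbcSize, Metrics/MethodLength
def describe(values)
  sorted = values.sort
  size = sorted.size
  middle = size / 2
  median = size.odd? ? sorted[middle] : (sorted[middle - 1] + sorted[middle]).fdiv(2)
  {
    size: size,
    min: sorted.first,
    max: sorted.last,
    mean: sorted.sum.fdiv(size),
    median: median
  }
end
# rubocop:enable Metrics/AbcSize, Metrics/MethodLength

p describe([3, 1, 2])[:median]  # => 2
```

Searching the code base for rubocop:disable then lists the places that still exceed the default metrics.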
5. Improvement in documentation

YARD documentation coverage was 73.1% when I started measuring at the beginning of the project.
In the latter half of the project period, special effort was dedicated to documentation, achieving 100% YARD documentation coverage. The latest release continues to achieve 100% coverage, and most of the methods have @example tags.
DataFrame.md and Vector.md have been completed as markdown documentation covering all methods, and new methods continue to be added.
5.1 How to add documentation for dynamically generated methods

Functional methods of Vector, such as #mean, #abs, #>, are generated dynamically by define_method using Arrow's Compute function.
At first I did not know how to efficiently add documentation in such cases, but finally I arrived at the following method.
Original code

Method generators are created as class methods, and individual method definitions are written in DSL style.
class Vector
  class << self
    private

    def define_unary_aggregation(function)
      define_method(function) do |**options|
        datum = exec_func_unary(function, options)
        get_scalar(datum)
      end
    end
  end

  define_unary_aggregation :approximate_median
end
Attach documentation


Common documents are attached to class methods with @!macro[attach]. (1).


Macros that are not common to all but used as needed should be defined above instance methods (2).


Method-specific documentation should be written right above the method definition (3).


Write arguments and options with @!method (4).


Alias names of methods are written with alias_method (5). Methods can also be written as class methods, which will be correctly displayed in the documentation as Also known as:.
class Vector
  class << self
    private

    # @!macro [attach] define_unary_aggregation         # (1)
    #   [Unary aggregation function] Returns a scalar.
    #
    def define_unary_aggregation(function)
      define_method(function) do |**options|
        datum = exec_func_unary(function, options)
        get_scalar(datum)
      end
    end
  end

  # @!macro scalar_aggregate_options                    # (2)
  #   @param skip_nulls [true, false]
  #     If true, nil values are ignored.
  #     Otherwise, if any value is nil, emit nil.
  #   @param min_count [Integer]
  #     if less than this many non-nil values are observed, emit nil.
  #     If skip_nulls is false, this option is not respected.

  # Approximate median of a numeric Vector with T-Digest algorithm.   # (3)
  #
  # @!method approximate_median(skip_nulls: true, min_count: 1)       # (4)
  # @macro scalar_aggregate_options                    # (2)
  # @return [Float]
  #   median of self.
  #   A nil is returned if there is no valid data point.
  #
  define_unary_aggregation :approximate_median
  alias_method :median, :approximate_median            # (5)
end


The generated documentation can be seen at Vector#approximate_median in the RedAmber YARD documentation.
The YARD features used here are described in the YARD documentation.
5.2 How to specify monospace font for code sections in YARD documentation

The RedAmber documentation uses @example to display many code examples, but by default they are displayed in a proportional font.
I arrived at the following method to work around this.
Define custom template

Add custom template path in .yardopts
--template-path doc/yard-templates

Put the customized css below in doc/yard-templates/default/fulldoc/html/css/common.css.
/* Use monospace font for code */
code {
  font-family: "Courier New", Consolas, monospace;
}

Result:

When customizing a template, the file must be placed in the same directory structure as the YARD standard templates (see the YARD document).
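The two steps above can also be scripted. The following is a minimal sketch in Ruby, run against a scratch directory. The relative paths and the CSS rule come from this report; the helper name `install_monospace_override` and the idea of appending to `.yardopts` automatically are assumptions for illustration.

```ruby
require 'fileutils'
require 'tmpdir'

# Relative paths and CSS taken from the report.
TEMPLATE_ROOT = 'doc/yard-templates'
CSS_PATH = File.join(TEMPLATE_ROOT, 'default/fulldoc/html/css/common.css')
CSS_RULE = <<~CSS
  /* Use monospace font for code */
  code {
    font-family: "Courier New", Consolas, monospace;
  }
CSS

# Hypothetical helper: writes the CSS override and registers the
# template path in .yardopts under +project_root+.
def install_monospace_override(project_root)
  css_file = File.join(project_root, CSS_PATH)
  FileUtils.mkdir_p(File.dirname(css_file))
  File.write(css_file, CSS_RULE)

  yardopts = File.join(project_root, '.yardopts')
  option = "--template-path #{TEMPLATE_ROOT}"
  existing = File.exist?(yardopts) ? File.read(yardopts) : ''
  # Append the option only once so the helper can be rerun safely.
  unless existing.lines.map(&:chomp).include?(option)
    separator = existing.empty? || existing.end_with?("\n") ? '' : "\n"
    File.write(yardopts, "#{existing}#{separator}#{option}\n")
  end
  css_file
end

# Try it in a scratch directory rather than in a real project.
Dir.mktmpdir do |dir|
  puts install_monospace_override(dir)
  puts File.read(File.join(dir, '.yardopts'))
end
```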
5.3 Examples of Jupyter Notebook Operations

I have been gradually increasing the number of examples of RedAmber operations in Jupyter Notebook. The goal at the start of the RA Grant was 100 or more examples, and the notebook contains 106 as of v0.4.0.
https://github.com/heronshoes/docker-stacks/blob/RedAmber-binder/binder/examples_of_red_amber.ipynb
5.4 Functional Comparison with Other Libraries

We have created a table comparing the functionality of RedAmber with dataframe libraries in other languages. The column about Julia data frames was contributed by Benson Muite.
https://github.com/heronshoes/red_amber/blob/main/doc/DataFrame_Comparison.md
6. Promotional activities


Article posted to the web

RedAmber - Ruby's new data frame library (November 28, 2022, note)
RedAmber - Ruby's new data frame library (December 04, 2022, Qiita)
[What we are trying to do with RedAmber](https://qiita.com/heronshoes/items/b0ae1f4c23decd8c261b) (December 18, 2022, Qiita)


YouTube streaming

I have been participating in the monthly Red Data Tools streaming as a questioner since November.


Summary


We were able to develop a data frame library that can directly handle Arrow data in Ruby, a field in which Ruby has lagged behind.
We achieved "data frame processing that can be written like Ruby" through the use of blocks and Ruby's collection classes.
Almost all major functions of data frames were covered.
Documentation with rich examples of operations was prepared.
Between October 4 (v0.2.2) and March 11 (v0.4.1), 77 files were changed, with 14,702 lines added and 2,637 lines deleted (documentation accounts for a significant part of this).
The code has been revised to improve speed since the start of the grant.
Test coverage was increased to 100% to ensure a certain level of code quality.
I introduced a new concept, "SubFrames", which is not found in other data frame libraries, and implemented it as an experimental feature. I showed that groups and windows can be handled in a simple and consistent manner in Ruby. Improving its speed is an issue for the future.
I made suggestions to enhance necessary functions for upstream Red Arrow and contributed to the improvement of table join operations.
Andrew Kane, the author of Rover (the library that inspired the development of RedAmber), created Polars-ruby. Polars-ruby is a library based on Rust's Polars, which can read and write the Arrow format. It is characterized by its high speed and has a very promising future. I hope to introduce its advanced features, such as LazyFrame, in RedAmber.

Acknowledgements

I would like to thank my mentor for the RA Grant, Kenta Murata (@mrkn), for his constant advice and warm support throughout this project.
I would also like to thank Sutou Kouhei (@kou) for his wide-ranging advice on Red Arrow commits and RedAmber bugs, as well as his support in the monthly streaming of Red Data Tools.
Benson Muite (@bkmgit) added the Fedora testing workflow and the Julia section of the comparison table with other data frames.
@kojix2 contributed to the code by adding the YARD documentation generation workflow and modifying the documentation.
I appreciate their contributions.
I would also like to thank the members of Red Data Tools Gitter for their valuable comments and suggestions on method name discussions and other issues.
Finally, I would like to express my deepest gratitude to the Ruby Association for giving me this opportunity and to Matz and everyone in the Ruby community for creating and growing Ruby.
Appendix1 Data processing in benchmark

To create the bar chart for the benchmark comparison, the data was read into a RedAmber DataFrame using the following procedure, converted to long format, and plotted with Charty.
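The wide-to-long conversion mentioned here can be illustrated in plain Ruby. This is only a sketch of the reshaping idea, not the procedure actually used: RedAmber has its own reshaping feature for DataFrames, while the helper `to_longer` and its keyword names below are hypothetical. The sample rows are copied from the comparison table in this report, and plotting with Charty is left out.

```ruby
# Wide benchmark results, copied from the comparison table in this report.
WIDE_ROWS = [
  { test_name: 'Diamonds_test', red_amber: 28.2, pandas: 80.9, tidyverse: 243.0 },
  { test_name: 'Starwars_test', red_amber: 19.1, pandas: 19.5, tidyverse: 43.3 }
].freeze

# Hypothetical helper: every column not listed in +keep+ becomes one
# (name, value) row, and the +keep+ columns are repeated on each row.
def to_longer(rows, keep:, name: :name, value: :value)
  rows.flat_map do |row|
    kept = row.slice(*keep)
    measured = row.reject { |column, _| keep.include?(column) }
    measured.map { |column, cell| kept.merge(name => column, value => cell) }
  end
end

long_rows = to_longer(WIDE_ROWS, keep: [:test_name], name: :library)
long_rows.each { |row| puts row }
```

Each wide row with three library columns becomes three long rows, which is the shape a grouped bar chart usually expects: one categorical column for the x axis, one for the color, and one numeric column for the bar height.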


Appendix2 Comparison with other language libraries

Below is a summary of the Jupyter Notebook used for benchmark comparisons with other language libraries.
https://gist.github.com/heronshoes/eb3c4dbb8ec5825eefe67a09abe5dd53#file-_appendix-2-rubygrant2022-redamber-final-report-md

| # | Benchmark name | 0.3.0 | 0.2.3 | 0.2.0 | 0.1.5 |
| --- | --- | ---: | ---: | ---: | ---: |
| B01 | Pick([]) by a key name | 434,783 | 8,759 | 9,357 | 202,703 |
| B02a | Pick([]) by key names | 2,530 | 897 | 1,898 | 2,276 |
| B03 | Pick by key names | 2,783 | 653 | 4,374 | 2,311 |
| B04 | Drop by key names | 694 | 352 | 761 | 675 |
| B05 | Pick by booleans | 792 | 383 | 1,094 | 1,005 |
| B06 | Pick by a block | 920 | 386 | 1,346 | 1,091 |
| B07 | Slice([]) by an index | 597 | 445 | 798 | 1,934 |
| B08 | Slice by indices | 51.4 | 47.1 | 51.7 | 56.2 |
| B09 | Slice([]) by booleans | 54.7 | 2.3 | 2.3 | 0.3 |
| B10 | Slice by booleans | 103.3 | 2.3 | 2.2 | 3.0 |
| B11 | Remove by booleans | 78.6 | 2.2 | 2.4 | 2.7 |
| B12 | Slice by a block | 100.9 | 2.4 | 2.3 | 3.0 |
| B13 | Rename by Hash | 804 | 508 | 853 | 737 |
| B14 | Assign an existing variable | 3.2 | 3.2 | 3.3 | 3.4 |
| B15 | Assign a new variable | 3.3 | 3.4 | 3.3 | 3.5 |
| B16 | Sort by a key | 18.5 | 19.3 | 20.0 | 18.4 |
| B17 | Sort by keys | 11.8 | 11.6 | 12.0 | 12.1 |
| B18 | Convert to a Hash | 2.8 | 2.3 | 2.4 | 2.3 |
| B19 | Output in TDR style | 1.3 | 1.3 | 1.3 | 1.3 |
| B20 | Inspect | 17.0 | 14.7 | 16.6 | 1.7 |


| # | Benchmark name | 0.3.0 | 0.2.3 |
| --- | --- | ---: | ---: |
| C01 | Inner join on flights_Q1 by carrier | 106.3 | 0.9 |
| C02 | Full join on flights_Q1 by planes | 0.9 | 0.6 |
| C03 | Left join on flights_Q1 by planes | 70.6 | 0.6 |
| C04 | Semi join on flights_Q1 by planes | 103.9 | 100.5 |
| C05 | Anti join on flights_Q1 by planes | 244.2 | 230.4 |
| C06 | Intersection of flights_1_2 and flights_1_3 | 46.8 | 0.2 |
| C07 | Union of flights_1_2 and flights_1_3 | 0.07 | 0.07 |
| C08 | Difference between flights_1_2 and flights_1_3 | 51.5 | 53.1 |
| C09 | Concatenate flight_Q1 on flight_Q2 | 7,393 | 2,903 |
| C10 | Merge flights_Q1_right on flights_Q1_left | 0.6 | 0.6 |


| # | Benchmark name | 0.3.0 | 0.2.3 | 0.2.2 |
| --- | --- | ---: | ---: | ---: |
| G01 | sum distance by destination | 119.9 | 122.5 | 120.3 |
| G02 | sum arr_delay by month and day | 168.4 | 155.8 | 140.8 |
| G03 | sum arr_delay, mean distance by flight | 29.6 | 25.6 | 27.8 |
| G04 | mean air_time, distance by flight | 110.5 | 102.0 | 102.9 |
| G05 | sum dep_delay, arr_delay by carrier | 123.6 | 121.3 | 111.0 |


| # | Benchmark name | 0.3.0 | 0.2.3 | 0.2.2 |
| --- | --- | ---: | ---: | ---: |
| R01 | Transpose a DataFrame | 3.8 | 3.4 | 3.7 |
| R02 | Reshape to longer DataFrame | 1.5 | 1.6 | 1.6 |
| R03 | Reshape to wider DataFrame | 0.7 | 0.6 | 0.7 |


| # | Benchmark name | 0.3.0 | 0.2.3 | 0.2.0 |
| --- | --- | ---: | ---: | ---: |
| V01 | Vector.new from integer Array | 7.2 | 6.0 | 6.4 |
| V02 | Vector.new from string Array | 1.6 | 1.7 | 1.7 |
| V03 | Vector.new from boolean Vector | 1,220,000 | 6.6 | 6.7 |
| V04 | Vector#sum | 11,256 | 11,624 | 10,823 |
| V05 | Vector#* | 1,397 | 1,527 | 1,466 |
| V06 | Vector#[booleans] | 4.8 | 6.8 | 6.8 |
| V07 | Vector#[boolean_vector] | 22.2 | 6.6 | 6.7 |
| V08 | Vector#[index_vector] | 22.0 | 28.0 | 27.6 |
| V09 | Vector#replace | 0.4 | 0.4 | 0.4 |
| V10 | Vector#replace with broadcasting | 0.4 | 0.4 | 0.4 |


| # | Benchmark name | 0.3.0 | 0.2.3 | 0.2.0 |
| --- | --- | ---: | ---: | ---: |
| D01 | Diamonds test | 189.8 | 14.5 | 14.5 |
| D02 | Starwars test | 143.6 | 78.8 | 107.0 |
| D03 | Import cars test | 141.4 | 141.9 | 125.6 |
| D04 | Simpsons paradox test | 45.4 | 3.1 | 3.1 |


| test_name | red_amber | pandas | tidyverse | (Ref) Dataset size |
| --- | ---: | ---: | ---: | --- |
| Diamonds_test | 28.2 | 80.9 | 243.0 | 53940 x 10 |
| Starwars_test | 19.1 | 19.5 | 43.3 | 87 x 12 |
| Import_cars_test | 17.4 | 18.3 | 48.4 | 5 x 6 |
| Simpsons_paradox_test | 58.0 | 201.0 | 653.5 | 268166 x 4 |