@heronshoes
Last active September 23, 2022 01:09
Answering "On DataFrame datatype in Ruby" by Victor Shepelev aka @zverok.

Answering @zverok's amazing expectations for DataFrame

These are my comments in answer to @zverok's amazing expectations for DataFrame. I was very impressed to find how closely my DataFrame library matches his expectations.

Foreword: My RedAmber's DataFrame

The class RedAmber::DataFrame represents 2D data.

A DataFrame consists of:

  • A series of data that all share the same data type. We call it a Vector.
  • A label attached to the Vector. We call it a key.
  • A Vector and its associated key, grouped together as a variable or a column.
  • Variables with the same vector length, aligned and arranged to form a DataFrame.
  • Sets of related data held at the same position in each Vector of a DataFrame. We call them observations or rows.
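The structure listed above can be pictured with plain Ruby objects. The following is only an analogy (a Hash of Symbol keys to equal-length Arrays), not RedAmber's implementation, which holds Arrow data. The names `variables` and `observation_at` are hypothetical and exist only for this illustration.

```ruby
# Plain-Ruby analogy of the structure described above (not RedAmber itself).
# Each key (a Symbol) labels one "Vector" (here: an Array of one data type).
variables = {
  name:   %w[Alice Bob Carol],   # Vector of Strings
  age:    [31, 45, 27],          # Vector of Integers
  salary: [5200.0, nil, 4100.0]  # Vector of Floats; nil marks a missing value
}

# All variables must have the same length to line up as a DataFrame.
lengths = variables.values.map(&:size).uniq
raise ArgumentError, "vectors differ in length" unless lengths.size == 1

# An observation (row) is the set of values found at the same position.
# Hypothetical helper for illustration only.
def observation_at(variables, index)
  variables.transform_values { |vector| vector[index] }
end

first_row = observation_at(variables, 0)
```

Here `first_row` is itself keyed by the same Symbols, which matches the idea that a row is a slice across all columns rather than an object of its own.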

What is DataFrame?

columns are named

Yes. The column name is called key. Keys are Symbols.

rows are “indexed”: each row has corresponding label

No. Rows are indexed only by an implicit integer index (0...size). If you need another index, such as a timestamp, you should create a new column for it.

Row-based access is available through #[](indices), #take(indices) or #[](booleans), #filter(booleans). <= This was the old idea.

The current implementation is #[](indices), #slice(indices), #[](booleans), #slice(booleans).
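The two kinds of row selector (integer indices of any count, or booleans as long as the columns) can be sketched in plain Ruby. This is an analogy on a Hash of Arrays, not RedAmber's code, and `slice_rows` is a hypothetical helper.

```ruby
# Plain-Ruby sketch of row selection (an analogy, not RedAmber's code).
# `slice_rows` is hypothetical; columns are a Hash of equal-length Arrays.
def slice_rows(columns, selector)
  size = columns.values.first.size
  booleans = !selector.empty? && selector.all? { |s| s == true || s == false }
  indices =
    if booleans
      # Boolean selector: one flag per row, so it must match the row count.
      raise ArgumentError, "expected #{size} booleans" unless selector.size == size
      (0...size).select { |i| selector[i] }
    else
      # Index selector: any number of Integer positions.
      selector
    end
  # Build a new Hash; the input columns are left untouched.
  columns.transform_values { |values| values.values_at(*indices) }
end

columns = { x: [10, 20, 30, 40], y: %w[a b c d] }
by_index   = slice_rows(columns, [0, 2])
by_boolean = slice_rows(columns, [false, true, false, true])
```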

each column has data of only one type inside it (different columns can have different data types);

Yes. The types are Red Arrow's data types.

any column can have empty places (nil values) instead of data;

Yes. Columns of every type can hold nil (null in Apache Arrow), thanks to Arrow. I think the treatment of abnormal data is the most important role of a DataFrame library. Methods to skip nil, fill nil, and replace nil are supported. Even replacing any value with nil is supported.

typical easy/cheap operations on DataFrame:

  • change data values (without changing row count);

Yes. DataFrame#assign( key_to_array_in_a_Hash ) will update values. It returns a new object because Arrow's data is immutable.

  • including: create a “view” on some part of DataFrame and modify data inside it in one clean line of code (like “replace all nil values in column Salary with 0.0”);

Yes/No. A view model sounds good (like Numo::NArray), but I am not sure it is useful for immutable data. Columns are sliced by indices (of any size) or by booleans (of the same size as the column).

A Vector in a column has a #replace method. dataframe.assign(:salary) { replace(vec.is_nil?, 0.0) } does the work.
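The "return a new object instead of mutating" behaviour can be illustrated in plain Ruby. This is a sketch of the idea only: `assign_column` is a hypothetical helper working on a frozen Hash of Arrays, and it does not show RedAmber's actual #assign or #replace signatures.

```ruby
# Plain-Ruby sketch of "replace nil in :salary with 0.0" without mutation.
# `assign_column` is a hypothetical helper, not a RedAmber method.
def assign_column(columns, key)
  new_values = yield(columns[key])
  unless new_values.size == columns.values.first.size
    raise ArgumentError, "new column must keep the row count"
  end
  columns.merge(key => new_values) # merge returns a new Hash
end

original = { name: %w[Alice Bob], salary: [5200.0, nil] }.freeze
updated = assign_column(original, :salary) do |salary|
  salary.map { |value| value.nil? ? 0.0 : value }
end
```

The original Hash stays as it was, so both the old and the new data remain available, which is the trade-off an immutable design makes in place of a mutable view.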

  • add/remove/switch columns;

Yes. We can add new columns by #assign, remove columns by #drop, switch columns by #[].

  • calculate new column on base of existing ones;

Yes. We can use #assign with a block to create a new Vector with vector calculation.

  • select some rows/columns to another DataFrame;

Yes. I don't use the word 'select' because it reminds Rubyists of the primitive select method, and it can't distinguish between the column axis and the row axis.

For column-wise select/reject we use pick/drop. For row-wise select/reject we use slice/remove.
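The four verbs can be mirrored on a plain Hash of Arrays to show which axis each one works on. The lambdas below only borrow the names for illustration; they are an assumption-level sketch, not RedAmber's implementation.

```ruby
# Plain-Ruby sketch of the four verbs on a Hash of equal-length Arrays.
# Column axis: pick keeps the named columns, drop discards them.
pick = ->(cols, *keys) { cols.slice(*keys) }
drop = ->(cols, *keys) { cols.reject { |key, _| keys.include?(key) } }

# Row axis: slice keeps the rows at the given indices, remove discards them.
slice = ->(cols, *indices) { cols.transform_values { |v| v.values_at(*indices) } }
remove = lambda do |cols, *indices|
  cols.transform_values { |v| v.reject.with_index { |_, i| indices.include?(i) } }
end

cols = { id: [1, 2, 3], name: %w[a b c], score: [0.5, 0.7, 0.9] }

picked  = pick.call(cols, :id, :score)
dropped = drop.call(cols, :name)
sliced  = slice.call(cols, 0, 2)
removed = remove.call(cols, 1)
```

In this example `picked` equals `dropped`, and `sliced` equals `removed`, because each pair is the same selection expressed positively or negatively.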

typically, DataFrame provides methods, or supplied with libraries, for performing stats, summaries, and groupings on data;

Yes. We will provide basic stats (mean, max, min, std, var etc.) and a summary of them. We will also provide grouping with grouped stats.

often, DataFrame supports complex columns and complex indexes, like results of “pivotal table” operation (monthly income, grouped by department AND by manager inside department).

Yes/No. I have no plan to provide complex columns/indices in my DataFrame. I think a pivoted table is a kind of 'wider table' in R. I implemented wide <=> long conversion.
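To make the wide <=> long idea concrete, here is a plain-Ruby sketch on an Array of row Hashes. The helpers `to_long` and `to_wide`, their keyword arguments, and the sample data are all hypothetical; they are not RedAmber methods and say nothing about its real reshaping API.

```ruby
# Plain-Ruby sketch of wide -> long -> wide reshaping on an Array of row Hashes.
# `to_long` and `to_wide` are hypothetical helpers, not RedAmber methods.
def to_long(rows, id_key, name: :name, value: :value)
  rows.flat_map do |row|
    # Every non-id column becomes one (name, value) observation.
    row.reject { |key, _| key == id_key }
       .map { |key, v| { id_key => row[id_key], name => key, value => v } }
  end
end

def to_wide(rows, id_key, name: :name, value: :value)
  rows.group_by { |row| row[id_key] }.map do |id, group|
    # Spread the (name, value) pairs of one id back into columns.
    group.each_with_object({ id_key => id }) { |row, wide| wide[row[name]] = row[value] }
  end
end

wide = [
  { department: "Sales", jan: 100, feb: 120 },
  { department: "R&D",   jan: 80,  feb: 95 }
]
long = to_long(wide, :department)
round_trip = to_wide(long, :department)
```

The long form has one row per department and month, and converting it back gives the original wide rows, so grouped results can stay in ordinary columns without a complex index.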

What does Ruby DataFrame need to have from the beginning?

Good Ruby DataFrame class should, of course, correspond to all expectations of modern data processing, coming from other languages and tools (look above for the list of expectations);

Some points are not met, but for almost all of them, Yes!!

initialization of DataFrame should look as close to “literal”, as possible (for futher integration in libraries and practices);

Yes. It can be initialized in key: array style, for example. Arguments for an Arrow Table are accepted too.

public interface should be clear, terse and unambiguous: for ex., access to rows should be clearly distinguished from access to columns;

Yes. Access to rows and access to columns are distinguished. #pick/#drop are for columns, and #slice/#remove are for rows. #[] is overloaded: #[key] and #[keys] select column(s) by keys given as Strings or Symbols, while #[index] and #[indices] select row(s) by numeric indices. If you need indexed access to columns, you can use #[keys[index]].

public interface should be as close as possible to Ruby’s best practices: see Array, and Hash, and Enumerable; it should not resemble DataFrames of other languages and tools;

Yes. RedAmber has only the DataFrame, Vector and Group classes. Most methods return Ruby's primitive data holders.

For example, DataFrame#vectors is an Array, so we can use Ruby's power for vector calculation.

# Only numeric Vectors are compared, so the reduction never mixes a plain `false` with a Vector.
boolean = dataframe.vectors.select(&:numeric?).map { |vector| vector > 100 }.reduce(&:|)
# => Returns a boolean Vector that is true for each row where any numeric column exceeds 100.

dataframe.slice(boolean)
# => Returns the sub-DataFrame sliced by boolean.

column is an object, row is not (it is rather slice across all the columns)—column object even have a name in other DataFrame-y solutions: either Series or Vector—and this difference should be clearly visible from interface;

Yes! This point surprised me the most. My idea is the same as yours. To be precise, the data of a column is a Vector object, but rows are not specific objects. Because a sliced row is always associated with keys and data types, it is the same as a DataFrame with a single row.

A column consists of a Vector and its associated key, but a column is not a specific class. A Vector knows its corresponding key when it is in a DataFrame, but a lone Vector that does not belong to any DataFrame is headless (without a key).

complex columns and complex indexes should ideally be supported;

No. I don't think so. If complex columns are needed, they should be represented by normal columns. Some cases may be covered by group aggregation.

as DataFrame is frequently used for experiments and prototypes, it should be pretty-printed by design (and this pretty-printing should by design consider “very large” data sets).

Yes! I think it is important to know the shape, types, abnormalities and unique counts in exploratory data processing. We have the #tdr method for this purpose. It is not pretty-printing, but I think it shows useful information about a DataFrame in a compact style.

It resembles the str() function in R.

#tdr shows 10 variables (columns) by default, with large data in mind. You can increase the number of variables shown, like #tdr(100), if you want. To show more rows, use #tdr(elements: 10); the default value is 5.

I think this 'transposed' style is one of the important features of RedAmber. In this style, Vectors are always row vectors (not transposed into column vectors), slicing cuts vertically (as when you use a knife), and the data looks just as it does when initialized by a Hash. This may save us some brain power when coding.

Whenever you want to see the data in the familiar 'Table' style, you can use #table or #to_iruby (which automatically shows the table in Jupyter).
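The idea of a transposed summary, with one line per variable and a cap on both the variables and the elements shown, can be sketched in plain Ruby. This does not reproduce #tdr's real output format; `transposed_summary` and its layout are hypothetical, and only the two limits (10 variables, 5 elements) are taken from the description above.

```ruby
# Plain-Ruby sketch of a transposed summary: one line per variable.
# It does not reproduce #tdr's real output; `transposed_summary` is hypothetical.
def transposed_summary(columns, limit = 10, elements: 5)
  columns.first(limit).map do |key, values|
    type  = values.compact.map(&:class).uniq.join("|")
    nils  = values.count(&:nil?)
    shown = values.first(elements).inspect
    # Mark truncation when the Vector is longer than what is shown.
    shown = shown.sub(/\]\z/, ", ... ]") if values.size > elements
    format("%-8s %-8s %s%s", ":#{key}", type, shown, nils.zero? ? "" : " (#{nils} nil)")
  end
end

columns = {
  id:    (1..8).to_a,
  name:  %w[a b c d e f g h],
  score: [0.5, nil, 0.9, 1.0, nil, 0.2, 0.3, 0.4]
}
lines = transposed_summary(columns, elements: 3)
puts lines
```

Because each variable takes one line, a wide dataset grows downward instead of sideways, and the nil count per variable is visible at a glance.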

What Ruby DataFrame need not be at the beginning?

Import and export from tons of file formats and datasources: whether we’ll have good usable data structure, it will not be hard to add this functionality;

Thanks to Arrow's capabilities, RedAmber can read many data sources, including csv, tsv, parquet, arrow, arrows etc., from a file, a URI or a stream. In the future, Apache Arrow is expected to be a language-independent common data format.

One advantage of Red Arrow for general users is that loading a csv file is extremely fast.

Plotting: the same as above;

I agree. I will rely on other plotting libraries. Plotting is not the territory of a DataFrame library; a DataFrame library should focus on preparing a clean dataset to serve to the plotting libraries.

All stats methods and algorithms somebody may ever need: they may be in mixins, some (or most) of them in different gems; main DataFrame responsibility is to hold data, you see? and make it easy to process;

I totally agree. I will prepare basic stats only. I respect R's statistical resources, and I hope I can try to calculate them using RedAmber.

You’ll laugh, but even performance topics are NOT as important as good API and universal aknowledgement of new DataFrame datatype; of course, performance does matter, but fast-and-ugly library is destined to have very limited usage, and pretty-yet-slow one can at least become popular for moderate-sized, “toy” tasks. After that, it becames widespread and somebody optimizes it, and—voila!—it is performant AND pretty.

I agree. RedAmber is currently an experimental, proof-of-concept library. Performance will come in the next stage. It is supposed to get faster by calling Red Arrow's raw functions, which go through the corresponding C GLib => C++ implementation.
