eyalroz/column-2019-03-27.md

## column-2019-03-27.md

      
    Raw
  

              column-2019-03-27.md
            
          
    Rejoinder to the initial design document / proposal

_{These comments and questions follow this gist by Jacob Hemstad. They appear in order of the text in Jacob's gist. Please excuse the somewhat argumentative/critical tone - I'm actually very pleased that we are hammering this out.}


We want to write an algorithm once and allow it to work on a variety of columns.

Well, yes, but aren't we doing that now already?

The goal is ... [for]... the same algorithm will work on diverse containers given the container adheres to a minimal interface.


I'm not sure this should be a goal at all. Do we really want to write an algorithm once for all data types? See a concrete example of why this is not a good idea, below. Note that using the same code for a variety of columns is not the same as using the same code for all types of columns.  Also remember that thrust's generic and generalized nature means that it does many things rather slowly.
More specifically, do we really want to write an algorithm once both for fixed-width and variable-width columns? Will a sort() work the same for a column of 10^6 8-byte integers as for 8 strings of length around 1 million?
Even if a single generic algorithm implementation is a goal, it is only one goal. There could be other goals (at least one of which I'll put forward in a later section of this document.)


The kinds of columns include:

The fact that compressed columns are mostly left out is due to a fundamental weakness in the goal. Once we consider compression schemes seriously, it becomes even more apparentl - IMO - that one size does not fit all.
But even if we limit ourselves to just the column types here, we'll notice this occuring already. Consider DICTIONARY compression. The compressed form of a column under this scheme is in fact two columns: A column of the dictionary entires (which is simply a column of the uncompressed element type), and a column of the dictionary indices. So a column can be made up of two columns.
Moreover, even without compression - we have the valid indicators. We really must admit that these are, in fact, a column - a non-nullable column of bits. So we already use multiple columns to represent a single column: A nullable column is a pair of non-nullable (or basic) columns, of the same length, and with the second having a boolean (or bit) data type.

The goal is to achieve what the STL or Thrust algorithms have achieved

Those are two different and somewhat contradictory things: The C++ standard library does not do any execution marshalling (ignoring parallel STL), but rather just runs things in a naive and staightforward manner. Thrust is something you call on the CPU and schedules execution on the GPU. So which is it?

void sort( std::vector<cudf::column> & columns ){ … }

Nitpick: It is "faux pas" to casually take references to vectors of things. Because it is almost never the case that you actually require a vector. That should probably be:
void sort(std::span<cudf::column> columns){ … }

or:
template <typename Container>
void sort( const Container& columns){ … }

(and note that the templating is not over the columns' element types.)

libcudf’s problem is further complicated by the fact that the type of the columns is not known at compile time,

For large columns of data (and I'm not talking about cudf::columns, which are really not even columns but rather chunks of columns which have no more than 2^31 elements), the ultimate solution is JIT-compiling kernel code; that allows for the types to be known at compile-time (as we compile at run-time). But doing that is not realistic right now, when we can only compile things from string source code.

Ideally, we would have a single sort algorithm that could handle performing the sort operation given each column satisfies a minimal interface.

I disagree. this is not ideal. For example, if your column is known to have small support (i.e. a few number of distinct values), you would bucket-sort it, but in the general case you would bitonic-sort it. So without any contextual/statistical information, you would bucket-sort an int8_t column but bitonic-sort a int32_t column.

What would the minimal interface for cudf::column be?

Three orthogonal points:


This is a good question, but it is also easy to mislead ourselves with it. Because this question has an implicit part not spelled out, and it really asks "What would the minimal interface for cudf::column be, given our assumptions about what a column is and what it should provide us with?"
For example, currently, a column has a null count. We would obviously need at least some way of accessing it, right? But the more fundamental question is: "Should a column always have a null count? Is that inherently part of a column?" and note that even if we need to produce columns with null counts because Arrow requires it, or the Python code requires it etc. - this still doesn't mean we need all of our columns, internally, to have it.


Are we talking about the interface for device-side code, for host-side code or for both?


Perhaps it should do nothing, or nothing much, beyond allowing us to get a proxy to a typed version of itself, in which we don't have to worry about type erasure?


Given any two elements i and j within a column, we need to know if i < j

If we're given the elements, we don't need the column; we're given indices and a column. Or... maybe once we work with indices, we don't work with the column any more? It could be a typed column perhaps, like in point (3.) above; or it could be iterators.

This could be satisfied in at least two ways:

... only if we go directly through the column.

Operators

Provide an operator[] to access elements i and j

For the return type of operator[], provide an operator<


This underscores my question about device-side and host-side. On the host side - would it really make sense to cudaMemcpy single values? (The answer is not obvious.) And on the device side - are we really working with full-blown cudf::columns, or can kernels not use something simpler?
It doesn't seem reasonable to provide an operator< for the result of an operator[] if a column's elements are not comparable in the general case.


Member function: 

bool column::elements_less(int i, int j) 

Returns true/false if element i is less than element j

That is not part of what a column needs to allow us to do. Such comparisons should only involve the elements (or proxies to those elements the column provides us with). This also goes against the spirt of how modern, well-regard C++ code is written; and it makes me flinch.

Note, that given the type-erasure requirement, both options 1 and 2 would require virtual functions.

Which means they are out of the question as far as implementing anything that's supposed to have decent performance.

I.e., cudf::column is a base class that would have distinct, concrete derived classes for fixed-width types, strings, nested, dictionary, etc.

Inheritance relations are very strong ones. Even regardless of my points above, I do not see how a typed-column IS-A type-erased column (or the other way around).

Virtual Device Functions

Virtual Function Examples


Skipping these sections due to my categorical objection to using virtual functions in this context.

Technical Requirements


Allows expressing algorithms with no/minimal special casing for different kinds of columns


Again, I'm against this being a requirement for the design of a cudf::column class. However, I think a good design will make it relatively easy to express certain aspects of many algorithms uniformly.


External library dependencies should be kept to a minimum


Depends on what you mean by "minimum". I would actually oppose this requirement. There is no problem IMHO in depending on a well-regarded, publicly available third-party library - especially if it's small and is itself self-contained.
For example, I can well see the potential of using a C++14 implementation of a variant, like Michael Park's; and there are vocabulary classes like optional and any, and of course span, which we might want to use.


Must be usable within device code


I'm not convinced that it is a good idea to use the same class in both device-side and host-side code.
An alternative suggestion for a cudf::column design

TO BE WRITTEN IN!
And note that this proposal will not be a synthesis of everything I've said above (it will just be inspired by some/most of it.)