Skip to content

Instantly share code, notes, and snippets.

@jrhemstad
Last active July 23, 2019 08:37
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save jrhemstad/581eb1a7b56673bad7da64dab7b5a724 to your computer and use it in GitHub Desktop.
Save jrhemstad/581eb1a7b56673bad7da64dab7b5a724 to your computer and use it in GitHub Desktop.
Requirements and design for a better column abstraction in libcudf

Problem Statement:

We want to write an algorithm once and allow it to work on a variety of columns.

The kinds of columns include:

  • Elements are fixed-width types (int8, int32, float, double, date32, timestamp, etc.)
  • Elements are variable length (strings, …)
  • “Dictionary” columns
  • Elements in the column are indices into a dictionary
  • Elements are fixed-size structs
  • Compressed column (This is a lower priority than the above)

This goal is to achieve what the STL or Thrust algorithms have achieved in that the same algorithm will work on diverse containers given the container adheres to a minimal interface.

However, libcudf’s problem is further complicated by the fact that the type of the columns is not known at compile time, i.e., type-erased. In contrast, STL or Thrust algorithms benefit from knowing the type of the inputs at compile time.

This requires libcudf’s public APIs to be agnostic to the kind of column, e.g., a multi-column sort function:

void sort( std::vector<cudf::column> & columns ){ … }

This function would perform a lexicographic sort of the rows. One column may be of strings, one may be fixed-width, one may be a column of structs, etc.

Ideally, we would have a single sort algorithm that could handle performing the sort operation given each column satisfies a minimal interface.

What would the minimal interface for cudf::column be?

Consider the interface that would be required for sorting:

  • Given any two elements i and j within a column, we need to know if i < j
  • This could be satisfied in at least two ways:
  1. Operators
  • Provide an operator[] to access elements i and j
  • For the return type of operator[], provide an operator<
  1. Member function
  • bool column::elements_less(int i, int j)
  • Returns true/false if element i is less than element j

Note, that given the type-erasure requirement, both options 1 and 2 would require virtual functions. I.e., cudf::column is a base class that would have distinct, concrete derived classes for fixed-width types, strings, nested, dictionary, etc.

Virtual Device Functions

The main problems with invoking virtual functions in device code are:

  • The object must be allocated w/in device code (i.e., you must call new within a kernel)
  • A virtual function call is the same as an indirect function call, which has implications on performance and optimization specifically in device code. If one needs to envoke a virtual function for every element in a column of hundreds of millions of elements, it could be quite expensive.
  • Putting virtual functions within a base class increases the size of the base class (need to store a pointer to the vtable)
    • This is only problematic if every element in a column were to be a virtual class.

Virtual Function Examples

Given the two approaches outline above, here is a skeleton of what those might look like:

Operators

struct element{
  virtual bool operator<( element const& rhs) = 0;
};

template <typename T>
struct typed_element{
 bool operator< (typed_element<T> const& rhs) override{
  return this.value < rhs.value;
 }
private:
 T value;
};

struct column{
  virtual element& operator[](int i) = 0;
};

template <typename T>
struct typed_column{

 element& operator[](int i) override{
  return elements[i];
 }

private:
 thrust::device_vector< typed_element<T> > elements;
}

The main problem with an operator[] approach for element-wise access is that in order for operator[] to be a virtual function, it has to return a virtual class (e.g., element above).

Member Function

struct column{
  virtual bool elements_less(int i, int j) = 0;
};

// Fixed-width column for primitive C types
template <typename T>
struct typed_column : column {

  bool elements_less(int i, int j) override{
    return data[i] < data[j];
  }
private:
  T * data;
  int size;
  
};

So long as every type of column provided an implementation of elements_less, we could perform a lexicographic sort of the rows of an arbtirary number of columns.

This avoids the need for an element virtual class.

However, this provides a much less generic access to the elements of the column. You'd have to provide other kinds of functions like checking if two elements are equal, checking elements between two distinct columns, etc.

Technical Requirements

  • Allows expressing algorithms with no/minimal special casing for different kinds of columns
  • Must be C++14 compatible
  • External library dependencies should be kept to a minimum
  • Must be usable within device code
  • Support the desired column types (fixed width, variable length, dictionary, etc.)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment