Skip to content

Instantly share code, notes, and snippets.

@GregRos
Last active December 15, 2015 13:59
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save GregRos/5271345 to your computer and use it in GitHub Desktop.
Save GregRos/5271345 to your computer and use it in GitHub Desktop.

Take(n) Implementation

Take(n) returns a new vector containing the first n elements. It performance as well or better than a single call to Update.

It is not possible, using the current implementation, to have a method that Skip or Splits.

The implementation is separate for leaf nodes and parent nodes.

This method is implemented in my Solid.Vector class. See the implementation here.

Leaf Nodes

When Take(n) is called on a leaf node, n ≤ node.Length. We allocate a new array the length of n, and array-copy the first n items of the current array into the new array.

Parent Nodes

When Take(n) is called on a parent node, n ≤ total count. When this happens, we can divide the range specified by n into that section of the node array that can be copied, and the last node, with the index n-1, that we only want some part of.

For example, assuming our parent node has block offset of 2, that is, its bitmask is 11111 00000 00000 and there is a call to Take(01111 00100 10000) which translates to the index 01111 00100 01111, it means that all the indexes ranging from 00000 00000 000000 to 01110 11111 11111 are definitely part of the new vector.

This corresponds to all the child nodes at indices ranging from 00000 to 01110.

The last group of indexes, ranging from 01111 00000 00000 to 01111 00100 01111 belongs to the final child node, but this may be a subset of its indexes. As such, we need to recursively invoke Take(n) on the node at the last index -- 01110. n can remain as it is.

We then replace the last node of the newly created, partial array with the one returned by the recursive call.

The recursion terminates at a call to a leaf node's Take(n) method, which does not recurse.

As such, there number of calls to Take(n) is at most the height of the tree. Each call performs one array copy.

BulkLoad Implementation

Bulk loading is more complicated. It allows significantly more efficient loading of many elements to the end of the vector because it modifies arrays in place.

A call to BulkLoad requires

  • Data, in the form of an array: data
  • An index of the data array from which loading is to begin data_start
  • A count that denotes the number of elements to load data_length

When the externally visible Vector class wants to BulkLoad, it calls the implementation method with data_start = 0 and data_count = data.Length. This ensures all the data is added.

Leaf Nodes

For leaf nodes,

Allocate a new array of values the size of min(32, CurrentCount + data_length). Copy the current value array into it, and fill it with all the values contained in the parameter data_array from the index data_start to data_start + data_length or the remaining capacity of the node. This can be achieved with efficient array copy.

If the current node cannot hold all the items specified by data_length it needs to create a parent node that contains itself, and call BulkLoad on it, with data_length being the remaining items that were not added in this call to BulkLoad.

Parent Nodes

For parent nodes,

We first find the size of the array we need to allocate to fit all the values specified by data_length. We can do this, for example, by:

  let needed_size = (data_length + MyTotalCount) >>> 5·H

Where H is the level of the parent node (the size of its 5-bit offset) needed_size may be bigger than 32.

We create a new array the size of needed_size or 32, whichever is smaller. This array is initialized with empty nodes.

We copy the current array of nodes into the new array, overwriting the first empty nodes but possibly retaining those at the end.

In a parent, all the initial nodes are going to be filled. Only the last node can possibly accept more elements.

We iterate over the nodes in the new array, starting from what was the last node of the original vector, which may have some additional capacity.

Every iteration, we call BulkLoad on the current child node with data_start being equal to the index we stopped at in the last iteration, and data_length being equal to the total number of values the child node can possibly hold, in addition to its current total count.

Every time, we modify the new array in place so that it contains the node returned by the call to BulkLoad.

If we screw up the data_length, there will be a structural issue where child nodes have an offset equal to the offset of the parent node, or there will be 'gaps' in the indexing structure.

After this is done, we create a new parent around the array we just filled.

Now, if the needed_size was bigger than 32, there are still elements remaining.

We need to create a new higher level parent (its array will contain the parent we just bulk loaded) and call BulkLoad on it, with the remaining elements.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment