Skip to content

Instantly share code, notes, and snippets.

@DavidVorick DavidVorick/db.md
Last active Oct 22, 2017

Embed
What would you like to do?
DB Spec (wip)

ccdb

ccdb is a database that:

  • has buckets
  • is transactional
  • is ACID
  • has random access values (can read and write partial values)
  • has appendable values (can append to a value already stored in the database)
  • uses a hash table implementation for speed
  • is NOT sorted
  • requires O(1) disk accesses for all worst case single-element operations

Implementation:

ccdb is a hash table based database. When you create a bucket, you are actually creating a hash table. As you add keys to the bucket, you add keys to the hash table.

The hash table will directly store small elements, and contain pointers to the data for large elements. If the key is large, the table will only store the hash of the key instead of the full key, and the full key will be stored alongside the value elsewhere in the file.

The hash table itself has 32 bytes per element, and a sparseness of 50%. This means inserting small elements (less than 29 bytes) into the database has a cost of about 64 bytes per element. Larger elements will have less overhead.

Hash Table Details

The hash table itself is broken into bins. Each bin is 1 page, or 4096 bytes. Each bin is broken down into 128 elements that are 32 bytes each. The hash table aims for a sparseness of 50%, meaning in expectation about half of the elements will have data in them, and the other half will be empty.

Bin Collisions

If there is a collision when adding a key to a bin, the key is put into the next slot over in the bin. When looking for new slots, you wrap around to the beginning of the bin. In the rare case that the bin is full, the next bin is used to place the element. If that bin is full, keep going to the next bin, and wrap around to the first bin if needed. Because of the sparseness, you are guaranteed to find space eventually. It should be uncommon that you need to scroll through bins for space.

Key Collisions

Because only 23 bytes are used for the key, it is possible that in large tables you end up with collisions. When you load a value from a key, you need to do a comparison to verify that there was not a key collision. If there was a key collision, you need to return to the bin and check the next element.

Resizing

The hash table aims for 50% sparseness. To minimize the impact of resizing, we perform frequent resizes that only need to touch part of the hash table. When resizing, we add one new bin at a time, and we pull elements from 4 bins. (this means that the minimum size of a hash table is 4 bins). That means the worst case time for resizing the table only requires 4 bin reads and 5 bin writes. Bins are a single disk page, which means this can be quite fast, especially if the 4 bins are grouped together.

A new bin is added to the hash table every time that the total number of elements reaches 64 * numBins (50% sparseness). Elements are pulled out of bins in order, meaning initially you pull elements out of bins 0-3, then 4-7, and you wrap back 0-3 once you reach the end of the table and there are not enough bins to pull from 4 consecutive bins. We always pull from the 0-3 alignment (and never from 1-4, for example) because we will try to make sure bins 0-3 have consecutive positions on-disk.

When pulling elements out of a bin to insert in to the new bin, we use the hash of the key as a source of randomness. The first byte is used to determine which index within the bin the key is stored. Each of the next bytes are used to determine which bin the key is stored in. Each time the key is considered for migration, another byte is used for rng. If the key runs out of entropy, you can hash it again to get more entropy.

The algorithm for determining which bucket a key belongs in is as follows:

randBytes := hashObject(key) // Use key hash for randomness
binIndex := randBytes[0]/2 // Where within the bin the element is placed
currentBin := randBytes[1]/64 // Which of the 4 starting bins the element starts in
entropyOffset := 2 // We use entropy one byte at a time.

// We use entropy 1 byte at a time, though actually we only need 2 bits, because
// all of our rngs are a 1/4 probability. That complicates the code however.

// Perform key moving until the final bin for the key is found.
workingBin := 4 // The bin that we are potentially being inserted into.
for {
	// Determine if we are in the final bin.
	if totalBins < workingBin {
		break
	}
	
	// Determine if we are moving to the working bin.
	if randBytes[entropyOffset] < 32 {
		currentBin = workingBin
	}
	
	// Update the entropy pointer.
	entropyOffset++
	if entropyOffset >= len(randBytes) {
		// We've run out of entropy. Hash the key again to get new
		// entropy. This is acceptable because all that matters is
		// that our new entropy is different from the entropy of any
		// other element that gets added.
		randBytes = hashObject(randBytes)
		entropyOffset = 0
	}

	// Determine the number of working groups based on the working bin.
	// The working groups just indicate how many bin groups (bins release elements
	// in groups of 4) are currently being used to create new bins. The number of
	// working groups is always kept at an even power of 2.
	workingGroups := 1
	for workingGroups*2 < workingBin/4 {
		workingGroups *= 2
	}
	
	// Determine the next bin we may be inserted into.
	binGroup := currentBin/4
	if binGroup < workingGroups {
		// If we are not inserted into the workingBin, then the
		// next bin will be 'workingGroups' bins from now.
		workingBin = workingBin+workingGroups 
	} else {
		// We are not part of the working groups. We will be a part
		// of the working groups at bin workingGroups*2*4 (when the
		// working groups size doubles), and then our first offset will
		// be our current bin group.
		workingBin = workingGroups*2*4 + binGroup
	}
}

While there are a lot of operations, they are all performed in-memory. Even when the table is large, the cost of computing the location of a key is much cheaper than making even a single disk access.

Key + Value Management

If the key+value is smaller than 29 bytes, then they key+value get stored in the hash table directly. The first byte is set equal to '1' to indicate that the full key+value is stored directly in the remaining bytes. The second byte indicates the length of the key, the third byte indicates the length of the value. The remaining bytes store the key and value.

0:   [1]      // The value '1'.
1:   [keyLen] // The length of the key.
2:   [vLen]   // The length of the value.
3-m: [key]    // The key itself.
m-n: [value]  // The value itself.

If the key+value is larger than 29 bytes, then we will store a pointer to the data instead of storing the data directly

0:     [2]          // The value '2'.
1-24:  [key]        // The key if less than 24 bytes, otherwise the hash of the key.
25-31: [value page] // The offset of the data in the file.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.