DavidVorick/db.md

## db.md

      
    Raw
  

              db.md
            
          
    ccdb

ccdb is a database that:

has buckets
is transactional
is ACID
has random access values (can read and write partial values)
has appendable values (can append to a value already stored in the database)
uses a hash table implementation for speed
is NOT sorted
requires O(1) disk accesses for all worst case single-element operations

Implementation:

ccdb is a hash table based database. When you create a bucket, you are actually
creating a hash table. As you add keys to the bucket, you add keys to the hash
table.
The hash table will directly store small elements, and contain pointers to the
data for large elements. If the key is large, the table will only store the hash
of the key instead of the full key, and the full key will be stored alongside
the value elsewhere in the file.
The hash table itself has 32 bytes per element, and a sparseness of 50%. This
means inserting small elements (less than 29 bytes) into the database has a cost
of about 64 bytes per element. Larger elements will have less overhead.
Hash Table Details

The hash table itself is broken into bins. Each bin is 1 page, or 4096
bytes. Each bin is broken down into 128 elements that are 32 bytes each. The
hash table aims for a sparseness of 50%, meaning in expectation about half of
the elements will have data in them, and the other half will be empty.
Bin Collisions

If there is a collision when adding a key to a bin, the key is put into the
next slot over in the bin. When looking for new slots, you wrap around to the
beginning of the bin. In the rare case that the bin is full, the next
bin is used to place the element. If that bin is full, keep going to the
next bin, and wrap around to the first bin if needed. Because of the
sparseness, you are guaranteed to find space eventually. It should be uncommon
that you need to scroll through bins for space.
Key Collisions

Because only 23 bytes are used for the key, it is possible that in large tables
you end up with collisions. When you load a value from a key, you need to do a
comparison to verify that there was not a key collision. If there was a key
collision, you need to return to the bin and check the next element.
Resizing

The hash table aims for 50% sparseness. To minimize the impact of resizing, we
perform frequent resizes that only need to touch part of the hash table. When
resizing, we add one new bin at a time, and we pull elements from 4 bins.
(this means that the minimum size of a hash table is 4 bins). That means the
worst case time for resizing the table only requires 4 bin reads and 5 bin
writes. Bins are a single disk page, which means this can be quite fast,
especially if the 4 bins are grouped together.
A new bin is added to the hash table every time that the total number of elements
reaches 64 * numBins (50% sparseness). Elements are pulled out of bins in order,
meaning initially you pull elements out of bins 0-3, then 4-7, and you wrap back
0-3 once you reach the end of the table and there are not enough bins to pull
from 4 consecutive bins. We always pull from the 0-3 alignment (and never from 1-4,
for example) because we will try to make sure bins 0-3 have consecutive positions
on-disk.
When pulling elements out of a bin to insert in to the new bin, we use the hash of
the key as a source of randomness. The first byte is used to determine which index
within the bin the key is stored. Each of the next bytes are used to determine which
bin the key is stored in. Each time the key is considered for migration, another byte
is used for rng. If the key runs out of entropy, you can hash it again to get more
entropy.
The algorithm for determining which bucket a key belongs in is as follows:
randBytes := hashObject(key) // Use key hash for randomness
binIndex := randBytes[0]/2 // Where within the bin the element is placed
currentBin := randBytes[1]/64 // Which of the 4 starting bins the element starts in
entropyOffset := 2 // We use entropy one byte at a time.

// We use entropy 1 byte at a time, though actually we only need 2 bits, because
// all of our rngs are a 1/4 probability. That complicates the code however.

// Perform key moving until the final bin for the key is found.
workingBin := 4 // The bin that we are potentially being inserted into.
for {
	// Determine if we are in the final bin.
	if totalBins < workingBin {
		break
	}
	
	// Determine if we are moving to the working bin.
	if randBytes[entropyOffset] < 32 {
		currentBin = workingBin
	}
	
	// Update the entropy pointer.
	entropyOffset++
	if entropyOffset >= len(randBytes) {
		// We've run out of entropy. Hash the key again to get new
		// entropy. This is acceptable because all that matters is
		// that our new entropy is different from the entropy of any
		// other element that gets added.
		randBytes = hashObject(randBytes)
		entropyOffset = 0
	}

	// Determine the number of working groups based on the working bin.
	// The working groups just indicate how many bin groups (bins release elements
	// in groups of 4) are currently being used to create new bins. The number of
	// working groups is always kept at an even power of 2.
	workingGroups := 1
	for workingGroups*2 < workingBin/4 {
		workingGroups *= 2
	}
	
	// Determine the next bin we may be inserted into.
	binGroup := currentBin/4
	if binGroup < workingGroups {
		// If we are not inserted into the workingBin, then the
		// next bin will be 'workingGroups' bins from now.
		workingBin = workingBin+workingGroups 
	} else {
		// We are not part of the working groups. We will be a part
		// of the working groups at bin workingGroups*2*4 (when the
		// working groups size doubles), and then our first offset will
		// be our current bin group.
		workingBin = workingGroups*2*4 + binGroup
	}
}
While there are a lot of operations, they are all performed in-memory. Even
when the table is large, the cost of computing the location of a key is much
cheaper than making even a single disk access.
Key + Value Management

If the key+value is smaller than 29 bytes, then they key+value get stored in the
hash table directly. The first byte is set equal to '1' to indicate that the
full key+value is stored directly in the remaining bytes. The second byte
indicates the length of the key, the third byte indicates the length of the
value. The remaining bytes store the key and value.
0:   [1]      // The value '1'.
1:   [keyLen] // The length of the key.
2:   [vLen]   // The length of the value.
3-m: [key]    // The key itself.
m-n: [value]  // The value itself.
If the key+value is larger than 29 bytes, then we will store a pointer to the
data instead of storing the data directly
0:     [2]          // The value '2'.
1-24:  [key]        // The key if less than 24 bytes, otherwise the hash of the key.
25-31: [value page] // The offset of the data in the file.