@rofr
Created December 15, 2019 18:44
Conversation about OrigoDB
I had an email conversation about OrigoDB with Laura Bressler, a Computer Science 2020 student at Carnegie Mellon University. I'm sharing the conversation (with permission) so we can have a public link to cite, but also in case anyone else finds the information useful.
December 2019
Robert Friberg
----
My name is Laura, and I'm a student at Carnegie Mellon University. As part of a project for a databases course, I'm updating the dbdb.io page for OrigoDB (dbdb.io/db/origodb). However, I wasn't able to find some information in the online documentation. Would you be willing to answer a few questions to ensure that the information in the article is as accurate as possible?
----
Laura:
[...] I was wondering about the following things:
- When the OrigoDB server is being used, do the nodes share any disk or memory?
- Are indexes supported, and if so, what data structure are they implemented with?
- What type of checkpoint is used?
- From my understanding, operations such as joins are implemented on a model-by-model basis, and there are no commands that are required in any model. Is this correct?
----
Robert:
Nodes do not share disk or memory. The replicas connect to the primary over TCP/IP, and the primary sends each command to every replica.
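The replication scheme described above can be sketched in a few lines of C#. This is a minimal illustration with hypothetical types (ICommand, Node, Primary are not the actual OrigoDB classes), and the real server forwards commands over TCP/IP rather than in-process calls:

```csharp
using System.Collections.Generic;

// Hypothetical command abstraction: a command mutates the in-memory model.
public interface ICommand { void Execute(Dictionary<string, string> model); }

public class SetCommand : ICommand
{
    public string Key, Value;
    public void Execute(Dictionary<string, string> model) => model[Key] = Value;
}

// A node keeps its model entirely in RAM; nothing is shared between nodes.
public class Node
{
    public readonly Dictionary<string, string> Model = new Dictionary<string, string>();
    public void Apply(ICommand cmd) => cmd.Execute(Model);
}

public class Primary : Node
{
    private readonly List<Node> replicas = new List<Node>();
    public void AddReplica(Node replica) => replicas.Add(replica);

    // The primary applies the command locally, then forwards the same
    // command to each replica (over TCP/IP in the real server).
    public void Execute(ICommand cmd)
    {
        Apply(cmd);
        foreach (var replica in replicas) replica.Apply(cmd);
    }
}
```

Because every node applies the same deterministic commands in the same order, all replicas converge on the same in-memory state without sharing disk or memory.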
Data and indexes are never written to disk; they stay in RAM. You are free to choose the C# data structures that are optimal for your access patterns, such as List<T> (array list), Dictionary<K,V> (hash table), SortedDictionary<K,V> (binary tree), HashSet<T>, and so on.
So if I have a binary tree of customers ordered by customer id, SortedDictionary<int, Customer>, then locating a customer in the tree is an O(log N) operation. If we want to find all the customers from a given city, we can iterate over the values with a LINQ query: customers.Values.Where(c => c.City == "Chicago"), but this is an O(N) operation. If we want O(log N) lookup of customers by city, we would create a SortedDictionary<string, List<Customer>> structure named customersByCity. Using the index is as simple as:
var customersFromChicago = customersByCity["Chicago"];
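Putting the pieces above together, the primary structure and the secondary city index might look like this (the Customer class and the method names are illustrative, not part of the OrigoDB API):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Customer
{
    public int Id;
    public string City;
}

public class CustomerModel
{
    // Primary structure: customers ordered by id, O(log N) lookup by id.
    private readonly SortedDictionary<int, Customer> customers =
        new SortedDictionary<int, Customer>();

    // Secondary index: customers grouped by city, O(log N) lookup by city.
    private readonly SortedDictionary<string, List<Customer>> customersByCity =
        new SortedDictionary<string, List<Customer>>();

    public void Add(Customer customer)
    {
        customers[customer.Id] = customer;
        // Keep the secondary index in sync with the primary structure.
        if (!customersByCity.TryGetValue(customer.City, out var list))
            customersByCity[customer.City] = list = new List<Customer>();
        list.Add(customer);
    }

    // O(N): scans every customer with LINQ, no index required.
    public IEnumerable<Customer> ByCityScan(string city) =>
        customers.Values.Where(c => c.City == city);

    // O(log N): uses the secondary index.
    public IReadOnlyList<Customer> ByCityIndex(string city) =>
        customersByCity.TryGetValue(city, out var list)
            ? (IReadOnlyList<Customer>)list
            : Array.Empty<Customer>();
}
```

The trade-off is the usual one for indexes: the secondary structure costs extra RAM and must be maintained on every write, in exchange for the faster lookup.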
The checkpoint question doesn’t really make sense given how OrigoDB operates. A traditional relational database writes all the dirty data pages from the buffer cache to disk during a checkpoint, but OrigoDB does not write the actual data to disk at all; it’s always just in RAM!
Joins are also somewhat specific to relational databases, which store data in separate tables where entities reference each other by ID (foreign key). An SQL join pulls data from two tables and stitches the rows together. In OrigoDB we *could* do something similar in memory, for example by looping over all the orders and, for each order, looking up the customer in the customers binary tree. The code would look something like:
foreach (var order in Orders)
{
    var customer = Customers[order.CustomerId];
    yield return (order, customer); // return each pair as a tuple
}
An alternative approach would be to let each order have a direct reference to the customer that placed the order. This is very easy when all the data is in memory. In this case there is no need to join because each order is already connected to the customer object.
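The direct-reference alternative can be sketched as follows (again with illustrative classes, not OrigoDB types). Each Order holds an object reference to its Customer, so there is no CustomerId lookup at query time:

```csharp
using System.Collections.Generic;
using System.Linq;

public class Customer
{
    public int Id;
    public string Name;
}

// The order references the customer object directly instead of
// storing a foreign key, so no join is needed at query time.
public class Order
{
    public int Id;
    public Customer Customer;
}

public class Store
{
    public readonly List<Order> Orders = new List<Order>();

    // The "join" is just following the in-memory reference.
    public IEnumerable<(int OrderId, string CustomerName)> OrdersWithNames() =>
        Orders.Select(o => (o.Id, o.Customer.Name));
}
```

Since all objects live in the same process, following the reference is a pointer dereference, which is exactly why joins become unnecessary here.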
Hope this all makes sense to you! If you have any more questions, feel free to ask.
----
Laura:
...
I was understanding snapshot logging to be similar to (optional) checkpoints...Is that a reasonable way to think about it, or am I misunderstanding it?
----
Robert:
A snapshot captures the entire in-memory model at a given point in time and writes it to disk. This can take from seconds to hours depending on the size of the model. Snapshots are optional and you would use them to:
* Load faster at system startup by replaying fewer commands. In practice though, it’s often faster to replay the entire command journal. So this is something you should measure before deciding upon.
* Truncate the log entries older than the snapshot. So yes, this is very similar to RDBMS checkpoints. But I wouldn’t recommend deleting old log entries because they contain valuable information. The journal is a complete history of each change leading up to the current state and you would need a good reason to not keep it forever.
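The snapshot-plus-replay idea above can be sketched like this. It is a toy in-process model (real OrigoDB persists the journal and snapshots to disk, and the type names here are hypothetical):

```csharp
using System.Collections.Generic;
using System.Linq;

// Restore = load the latest snapshot (if any), then replay only the
// journal entries written after the snapshot was taken.
public class Engine
{
    public List<string> Model = new List<string>();
    public readonly List<string> Journal = new List<string>();

    private List<string> snapshot;       // copy of the model at snapshot time
    private int snapshotPosition;        // journal length at snapshot time

    public void Execute(string command)
    {
        Journal.Add(command);            // write-ahead: journal first
        Model.Add(command);              // then apply to the in-memory model
    }

    public void TakeSnapshot()
    {
        snapshot = new List<string>(Model);
        snapshotPosition = Journal.Count;
    }

    public List<string> Restore()
    {
        // Start from the snapshot (or empty), then replay the journal tail.
        var model = snapshot != null ? new List<string>(snapshot) : new List<string>();
        foreach (var cmd in Journal.Skip(snapshotPosition))
            model.Add(cmd);
        return model;
    }
}
```

Replaying fewer commands after a snapshot is the same win an RDBMS gets from a checkpoint, but, as noted above, the full journal can still be kept as a complete history.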