@jankotek
Created March 9, 2014 16:47
MapDB overview video transcript
1 - Hello
--------------
Hello, I am Jan Kotek, and in this podcast I will give you a quick overview of MapDB,
an embedded Java database engine.
2 - MapDB
---------
Traditional Java collections are limited to a few gigabytes by the garbage collector and heap size.
MapDB provides collections backed by an off-heap or on-disk store.
Now you can have maps with billions of items.
MapDB is an embedded database engine. It is written in pure Java,
but has performance comparable to native engines written in C.
MapDB is a very fast and fully concurrent database engine.
It is also a low-level engine, but very easy to use.
MapDB was developed as a faster and simpler alternative to SQL databases.
It offers several features, modes and configuration options.
3 - History
-----------
MapDB was originally named JDBM, which means Java Database Manager.
The project started 14 years ago; the original Database Manager goes back to the seventies.
I joined JDBM 5 years ago and have since released two major versions.
About 2 years ago I decided to do a complete rewrite to make the architecture cleaner and introduce
fine-grained concurrency. I also renamed the project to MapDB.
One year ago I quit my day job and started working on MapDB full time.
And a few months ago a company called CodeFutures offered to sponsor MapDB development.
So right now MapDB is an open-source project with full-time developers and good backing.
This project has features usually found in paid products.
But MapDB is free as in beer, under the Apache license.
There are no strings attached, everything including documentation and unit tests is published.
We hope that MapDB will become the de facto database engine for Java.
MapDB was a hobby project for most of its lifetime.
So it has a cleaner and more enjoyable design when compared to similar projects.
Most parts were redesigned several times until they were perfect.
4 - Features
------------
MapDB has several features.
Most importantly, MapDB is fully concurrent, with record-level locks.
Two parallel threads can actually write into the underlying files at the same time.
Secondly, MapDB offers drop-in replacements for most Java collections.
We think it is better to use an existing API rather than invent yet another way to access B-trees.
If your application has problems with memory,
you only have to change a few constructors to move your data off-heap or on-disk.
MapDB was designed to run on low-power devices and as such has almost zero internal overhead.
It has more than a decade of optimization.
There are only 3 abstraction layers, we minimize data copying, and we
use primitive variables for everything to avoid garbage collector thrashing.
Also exciting about MapDB is its flexibility.
Because of its architecture it can be used in several modes,
from a fast alternative to Java collections
to an in-memory cache or a durable database.
MapDB offers a Data Pump. It lets you stream and filter data from one location to another.
A typical use is to import large text files into a store.
It can create a 1 terabyte BTree with 10 to 12 items overnight.
And best of all MapDB is very small and easy to deploy.
It is a single jar file of 400 kilobytes,
with no dependencies except Java 6.
MapDB runs on all Java versions including Android.
The entire MapDB code-base has about 12 000 lines of code.
5 - Usages
----------
MapDB is a universal database engine and can be used for almost anything.
But four uses stand out.
Firstly, MapDB is often used as an off-heap cache.
HashMap has expiration based on time-to-live, time since last access, maximum collection size and maximum memory size.
There is a most-recently-used queue for entry expiration.
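To make the expiration criteria above more concrete, here is a minimal on-heap sketch of a cache that expires entries by time-to-live and by maximum size. It is not MapDB code: the class name `ExpiringCache` is hypothetical, it is built on `java.util.LinkedHashMap`, and it assumes that the least recently accessed entry is evicted when the size limit is exceeded, which may differ from MapDB's own queue. Time is passed in explicitly so the behaviour is easy to follow.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative cache with time-to-live and maximum-size expiration; not MapDB code. */
final class ExpiringCache<K, V> {
    /** A cached value together with the time at which it stops being valid. */
    private static final class Timed<T> {
        final T value;
        final long expiresAtMillis;

        Timed(T value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final long ttlMillis;
    private final Map<K, Timed<V>> entries;

    ExpiringCache(final int maxSize, long ttlMillis) {
        this.ttlMillis = ttlMillis;
        // accessOrder=true keeps the least recently accessed entry first,
        // so removeEldestEntry evicts it once the size limit is exceeded.
        this.entries = new LinkedHashMap<K, Timed<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Timed<V>> eldest) {
                return size() > maxSize;
            }
        };
    }

    synchronized void put(K key, V value, long nowMillis) {
        entries.put(key, new Timed<V>(value, nowMillis + ttlMillis));
    }

    /** Returns the cached value, or null if it is missing or its time-to-live has passed. */
    synchronized V get(K key, long nowMillis) {
        Timed<V> entry = entries.get(key);
        if (entry == null) {
            return null;
        }
        if (nowMillis >= entry.expiresAtMillis) {
            entries.remove(key);
            return null;
        }
        return entry.value;
    }

    synchronized int size() {
        return entries.size();
    }
}
```

A caller would create the cache with, say, `new ExpiringCache<String, Integer>(2, 1000L)` and pass `System.currentTimeMillis()` as the time argument.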
Another use is as a sort of forklift for big data.
Let's say you have a huge file you need to feed into Hadoop for fast processing.
With MapDB you could filter and pre-process this data file on a single machine.
You could also do prototyping before writing a full-scale algorithm on a cluster.
MapDB can also be used as a traditional database.
It has ACID transactions with proper isolation and write conflict resolution.
However, MapDB is still a young project and people tend to be conservative in this case.
And last but best is MapDB as an alternative memory model.
You can switch your data model completely into MapDB if you are limited by garbage collector.
We have maps, lists, atomic variables, references and so on.
You can even use the usual Java concurrency utilities.
Our tests show that, compared to the heap, MapDB is two times slower but fits three times more data.
6 - Architecture
----------------
MapDB has a highly modular architecture.
Every component can be completely disabled, so there is no overhead for unused features.
Each component has several implementations,
for example there are five different caches and three different storage formats.
Also, data types such as maps and queues are separated from the rest of the system.
They have swappable back ends.
And best of all, the MapDB architecture is very developer friendly and it is easy to develop new components.
For example, the default cache has only 50 lines of code, but is fully concurrent.
Collections are also easily portable from the traditional Java heap;
you only have to replace allocation and reference calls to use MapDB.
For example, queues were ported from the Java Collections Framework in 1 day.
7 - BTreeMap
------------
The most advanced collection offered by MapDB is BTreeMap, along with its set counterpart.
It is optimized for a large number of small keys.
It fully implements ConcurrentNavigableMap and NavigableSet.
It is based on a B-plus-linked-tree.
This structure is almost lock-free and offers great concurrent scalability.
It is used in Postgres and other databases.
MapDB offers a data pump for this collection.
It is a fast way to import data from an iterator.
It builds leaf nodes first and then builds directory nodes on top.
Import time is linear, so a 1 terabyte BTree can be created overnight.
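The bottom-up build described above can be sketched roughly as follows. This is an illustration only, not the MapDB data pump: the names `BottomUpTree` and `Node` are hypothetical, nodes live on the heap, and the input iterator is assumed to deliver keys in ascending order. Leaves are filled first, then each directory level is built from the level below it, so every key is touched a bounded number of times.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Illustrative bottom-up tree build from a sorted iterator; not the MapDB data pump. */
final class BottomUpTree {
    /** A leaf holds keys; a directory node holds the first key of each child. */
    static final class Node {
        final List<Long> keys = new ArrayList<Long>();
        final List<Node> children = new ArrayList<Node>(); // empty for leaves
    }

    static Node build(Iterator<Long> sortedKeys, int nodeSize) {
        if (nodeSize < 2) {
            throw new IllegalArgumentException("nodeSize must be at least 2");
        }
        // Pass 1: stream the sorted input into fixed-size leaf nodes.
        List<Node> level = new ArrayList<Node>();
        Node leaf = new Node();
        while (sortedKeys.hasNext()) {
            if (leaf.keys.size() == nodeSize) {
                level.add(leaf);
                leaf = new Node();
            }
            leaf.keys.add(sortedKeys.next());
        }
        level.add(leaf);

        // Pass 2: stack directory levels on top until a single root remains.
        while (level.size() > 1) {
            List<Node> parents = new ArrayList<Node>();
            Node parent = new Node();
            for (Node child : level) {
                if (parent.children.size() == nodeSize) {
                    parents.add(parent);
                    parent = new Node();
                }
                parent.keys.add(child.keys.get(0));
                parent.children.add(child);
            }
            parents.add(parent);
            level = parents;
        }
        return level.get(0);
    }

    static boolean contains(Node node, long key) {
        while (!node.children.isEmpty()) {
            // Descend into the last child whose first key is <= key.
            int i = 0;
            while (i + 1 < node.keys.size() && node.keys.get(i + 1) <= key) {
                i++;
            }
            node = node.children.get(i);
        }
        return node.keys.contains(key);
    }

    static int height(Node node) {
        int levels = 1;
        while (!node.children.isEmpty()) {
            node = node.children.get(0);
            levels++;
        }
        return levels;
    }
}
```

Because no node is ever split or rebalanced, the work per level shrinks geometrically, which is one way to see why import time can stay linear in the number of keys.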
To minimize space consumption we use delta key compression.
For numbers we store only the first number and then the difference for each following one.
For strings the common prefix is stored, and so on.
Typically delta key compression has zero
Also MapDB offers tuples and composite hierarchical keys.
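The delta key compression mentioned in this section can be illustrated with a small sketch. It is an assumption about the general technique, not MapDB's actual on-disk format: the class name `DeltaKeys` is hypothetical, numeric keys are assumed to be sorted, and the common prefix of sorted strings is found by comparing only the first and last key.

```java
/** Illustrative delta encoding for sorted keys; not MapDB's storage format. */
final class DeltaKeys {
    /** Keeps the first key as-is and stores every later key as the difference from its predecessor. */
    static long[] encode(long[] sortedKeys) {
        long[] deltas = new long[sortedKeys.length];
        long previous = 0;
        for (int i = 0; i < sortedKeys.length; i++) {
            deltas[i] = sortedKeys[i] - previous;
            previous = sortedKeys[i];
        }
        return deltas;
    }

    /** Restores the original keys by summing the differences. */
    static long[] decode(long[] deltas) {
        long[] keys = new long[deltas.length];
        long running = 0;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            keys[i] = running;
        }
        return keys;
    }

    /** Returns the prefix shared by all keys of a sorted array. */
    static String commonPrefix(String[] sortedKeys) {
        if (sortedKeys.length == 0) {
            return "";
        }
        // In sorted order the first and last keys differ the most,
        // so the prefix they share is shared by every key in between.
        String first = sortedKeys[0];
        String last = sortedKeys[sortedKeys.length - 1];
        int n = 0;
        while (n < first.length() && n < last.length() && first.charAt(n) == last.charAt(n)) {
            n++;
        }
        return first.substring(0, n);
    }
}
```

The differences between neighbouring keys are usually much smaller than the keys themselves, so they can be written with fewer bytes when combined with a variable-length number encoding.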
8 - Data types
--------------
The other Map and Set provided by MapDB is HashMap. It is optimized for large keys.
It has optional entry expiration based on several criteria, so it is very easy to use as a cache.
It also uses a hash tree table, so it does not need rehashing as other similar collections do.
There are several queue implementations: FIFO, LIFO and circular queues.
Some of them are lock-free, some are optimized for a single producer and multiple consumers, and vice versa.
MapDB also has atomic variables.
Their state is stored in the database and persisted between JVM shutdowns.
9 - Serialization life-cycle
----------------------------
Most databases expect the user to perform serialization and pass
already serialized data into the database.
Even SQL has its own format.
MapDB, on the other hand, takes data in the user's format and also takes a serializer,
which turns the data into binary form.
So MapDB decides when to perform serialization of the data.
It is a small detail, but it makes MapDB highly modular and flexible.
For example, data with a serializer can be forwarded to a background thread for asynchronous serialization.
Data can also be stored in a cache to minimize deserialization overhead.
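As a rough illustration of this life-cycle, the sketch below shows a store that receives live objects together with a serializer and decides for itself when to turn them into bytes. All names here (`ValueSerializer`, `LazyStore`, `Utf8Serializer`) are hypothetical and are not the MapDB API. The `flush` method is called explicitly in this sketch; a real engine could run the same step on a background thread.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical serializer contract: the store, not the caller, invokes it. */
interface ValueSerializer<T> {
    byte[] serialize(T value);
    T deserialize(byte[] data);
}

/** Illustrative store that accepts objects and serializes them lazily; not MapDB code. */
final class LazyStore<T> {
    private final ValueSerializer<T> serializer;
    private final Map<Long, T> pending = new HashMap<Long, T>();          // not yet serialized
    private final Map<Long, byte[]> stored = new HashMap<Long, byte[]>(); // binary form
    private long nextId = 1;

    LazyStore(ValueSerializer<T> serializer) {
        this.serializer = serializer;
    }

    /** Accepts the object as-is; no serialization happens here. */
    synchronized long put(T value) {
        long id = nextId++;
        pending.put(id, value);
        return id;
    }

    synchronized T get(long id) {
        T live = pending.get(id);
        if (live != null) {
            return live; // served as an instance, no deserialization needed
        }
        byte[] data = stored.get(id);
        return data == null ? null : serializer.deserialize(data);
    }

    /** Serializes everything queued so far and returns the number of records written. */
    synchronized int flush() {
        int count = pending.size();
        for (Map.Entry<Long, T> entry : pending.entrySet()) {
            stored.put(entry.getKey(), serializer.serialize(entry.getValue()));
        }
        pending.clear();
        return count;
    }
}

/** Example serializer for strings. */
final class Utf8Serializer implements ValueSerializer<String> {
    public byte[] serialize(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    public String deserialize(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }
}
```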
10 - Engine Wrappers
--------------------
Engine is a primitive key-value store. It has operations for creating, reading,
updating and deleting records.
It is the centrepiece of the MapDB design: a data model such as BTree sits on top,
with low-level storage beneath it.
Most features in MapDB are implemented as Engine wrappers.
For example, the read-only wrapper blocks calls which would modify the underlying
store and throws an exception instead.
Caches, asynchronous writes or even encryption are implemented in a similar way.
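The wrapper idea can be sketched with a simplified interface. The names below (`RecordEngine`, `MemoryEngine`, `ReadOnlyEngine`) and the method signatures are assumptions made for illustration and do not reproduce MapDB's actual Engine interface. The point is only that a wrapper implements the same interface as the store it wraps, so wrappers can be stacked freely.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical, simplified record store interface (not MapDB's Engine). */
interface RecordEngine {
    long put(byte[] value);
    byte[] get(long recid);
    void update(long recid, byte[] value);
    void delete(long recid);
}

/** Minimal in-memory implementation used as the innermost store. */
final class MemoryEngine implements RecordEngine {
    private final ConcurrentHashMap<Long, byte[]> records = new ConcurrentHashMap<Long, byte[]>();
    private final AtomicLong nextRecid = new AtomicLong(1);

    public long put(byte[] value) {
        long recid = nextRecid.getAndIncrement();
        records.put(recid, value);
        return recid;
    }

    public byte[] get(long recid) {
        return records.get(recid);
    }

    public void update(long recid, byte[] value) {
        records.put(recid, value);
    }

    public void delete(long recid) {
        records.remove(recid);
    }
}

/** Wrapper that forwards reads and rejects every modification. */
final class ReadOnlyEngine implements RecordEngine {
    private final RecordEngine delegate;

    ReadOnlyEngine(RecordEngine delegate) {
        this.delegate = delegate;
    }

    public long put(byte[] value) {
        throw new UnsupportedOperationException("read-only");
    }

    public byte[] get(long recid) {
        return delegate.get(recid);
    }

    public void update(long recid, byte[] value) {
        throw new UnsupportedOperationException("read-only");
    }

    public void delete(long recid) {
        throw new UnsupportedOperationException("read-only");
    }
}
```

A cache or an encryption layer would follow the same shape: hold a delegate, do its own work in each method, then forward the call.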
11 - Backends
-------------
Volume is an abstraction over a device such as a memory block, a file or even a raw disk.
There are several implementations, including byte arrays, off-heap byte buffers,
random access files and memory-mapped files.
We are also working on support for accessing raw disks to avoid file-system overhead.
Volume can also be used to implement software RAID, data duplication,
sharding across multiple disks and so on.
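A minimal sketch of such an abstraction, assuming a much smaller interface than the real one, could look like this. The `Volume` interface below is a hypothetical two-method version, and `ByteBufferVolume` and `MirroredVolume` are illustrative names. The sketch uses `java.nio.ByteBuffer` for both the byte-array and the off-heap variant, and shows data duplication by writing to two volumes at once.

```java
import java.nio.ByteBuffer;

/** Hypothetical, simplified storage abstraction (not MapDB's Volume class). */
interface Volume {
    void putLong(int offset, long value);
    long getLong(int offset);
    int capacity();
}

/** Volume backed by a ByteBuffer, either wrapping a byte array or allocated off-heap. */
final class ByteBufferVolume implements Volume {
    private final ByteBuffer buffer;

    private ByteBufferVolume(ByteBuffer buffer) {
        this.buffer = buffer;
    }

    static Volume onHeap(int size) {
        return new ByteBufferVolume(ByteBuffer.wrap(new byte[size]));
    }

    static Volume offHeap(int size) {
        // Direct buffers live outside the Java heap.
        return new ByteBufferVolume(ByteBuffer.allocateDirect(size));
    }

    public void putLong(int offset, long value) {
        buffer.putLong(offset, value);
    }

    public long getLong(int offset) {
        return buffer.getLong(offset);
    }

    public int capacity() {
        return buffer.capacity();
    }
}

/** Volume that duplicates every write to two underlying volumes and reads from the first. */
final class MirroredVolume implements Volume {
    private final Volume primary;
    private final Volume mirror;

    MirroredVolume(Volume primary, Volume mirror) {
        this.primary = primary;
        this.mirror = mirror;
    }

    public void putLong(int offset, long value) {
        primary.putLong(offset, value);
        mirror.putLong(offset, value);
    }

    public long getLong(int offset) {
        return primary.getLong(offset);
    }

    public int capacity() {
        return Math.min(primary.capacity(), mirror.capacity());
    }
}
```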
12 - Transactions
-----------------
MapDB evolved from a low-level store, so it supports most transactional modes.
Three modes are most important:
First, there is direct mode, used when transactions are disabled.
In this case the changes are written directly to store.
It is very very fast but there is no protection from data corruption.
This mode is usable for in-memory operations, temporary disk collections and very fast imports.
The second mode is a single global transaction.
In this case the store is protected by a write-ahead log or an append-only file.
Because there is only one transaction at a time,
we do not have to worry about write conflict resolution.
The third mode is fully concurrent transactions.
It has a serializable isolation level with optimistic locking.
This is the strongest isolation level; it even keeps track of whether a record was read by other transactions.
Conflict resolution is done in memory with very small overhead.
So if a transaction is rolled back due to a conflict, it does not even touch the disk.
The concurrent transactions in MapDB are very fast,
yet there are no compromises with isolation levels or durability.
In the near future we are going to improve speed even more.
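The optimistic approach in the third mode can be illustrated with a small in-memory sketch. It is a generic version-checking scheme written for this transcript, not MapDB's implementation: the names `OptimisticStore` and `Tx` are hypothetical, keys and values are plain strings, and durability is left out entirely. Each transaction remembers the version of every record it read and buffers its writes; at commit the versions are re-checked in memory, and the transaction is rolled back if any of them changed.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative optimistic concurrency control with in-memory validation; not MapDB code. */
final class OptimisticStore {
    private static final class Versioned {
        final String value;
        final long version;

        Versioned(String value, long version) {
            this.value = value;
            this.version = version;
        }
    }

    private final Map<String, Versioned> committed = new HashMap<>();

    final class Tx {
        private final Map<String, Long> readVersions = new HashMap<>();
        private final Map<String, String> writes = new HashMap<>();

        String get(String key) {
            if (writes.containsKey(key)) {
                return writes.get(key); // read your own uncommitted write
            }
            synchronized (OptimisticStore.this) {
                Versioned current = committed.get(key);
                // Remember which version was seen the first time this key was read.
                if (!readVersions.containsKey(key)) {
                    readVersions.put(key, current == null ? 0L : current.version);
                }
                return current == null ? null : current.value;
            }
        }

        void put(String key, String value) {
            writes.put(key, value); // buffered until commit
        }

        /** Returns true if the transaction committed, false if it was rolled back. */
        boolean commit() {
            synchronized (OptimisticStore.this) {
                // Validation: every record this transaction read must be unchanged.
                for (Map.Entry<String, Long> read : readVersions.entrySet()) {
                    Versioned current = committed.get(read.getKey());
                    long currentVersion = current == null ? 0L : current.version;
                    if (currentVersion != read.getValue()) {
                        return false; // conflict detected before anything was applied
                    }
                }
                for (Map.Entry<String, String> write : writes.entrySet()) {
                    Versioned current = committed.get(write.getKey());
                    long nextVersion = (current == null ? 0L : current.version) + 1;
                    committed.put(write.getKey(), new Versioned(write.getValue(), nextVersion));
                }
                return true;
            }
        }
    }

    Tx begin() {
        return new Tx();
    }
}
```

If two transactions read the same record and both try to update it, the first commit succeeds and the second is rejected during validation, without any of its buffered writes being applied.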
13 - Future
-----------
Various versions of MapDB have been under development for more than a decade.
Now, after the rewrite, it has the features of a proper database with the speed of a low-level store.
It is very exciting to have the first stable release scheduled in a few weeks.
There is a strong commercial company sponsoring MapDB development.
We will probably offer commercial support, training and consulting soon.
Another goal for the future is to improve MapDB tooling and its internals.
There are several ways to improve speed and add some interesting features.
We also have a lot of ideas on how to improve concurrent transactions and snapshots.
Users were asking for more collections.
So we will add a packed BitSet to handle bloom filters.
We will also add an alternative to ArrayList called Indexed Deque.
We have some plans for networking.
Users were asking for a client-server implementation and replication.
There is a lot of interesting stuff added in Java 8, and MapDB will have first-class support for lambdas and Streams.
The Java 8 stuff will be added as a separate project, so we will keep support for Java 6.
And most importantly, Data Pump will be greatly expanded.
MapDB will be able to perform most database operations in streaming fashion.
This will enable MapDB to handle huge data on a single computer.
14 - Links
----------
That's all for now. For more information, visit our website at mapdb.org,
or visit our sponsor at codefutures.com.
You may contact me directly at jan at kotek.net.
My Twitter ID is JanKotek.
Thank you for listening and have a great day.