kariy/til-database-ed.md

## til-database-ed.md

      
## macOS (OS X) fsync() does not guarantee that data is flushed to permanent storage right away

According to the manpage for fsync in OSX:
Note that while fsync() will flush all data from the host to the drive (i.e. the "permanent storage device"), the drive itself may not physically write the data to the platters for quite some time and it may be written in an out-of-order sequence.

Specifically, if the drive loses power or the OS crashes, the application may find that only some or none of their data was written. The disk drive may also re-order the data so that later writes may be present, while earlier writes are not.

So, when an application writes to a file and then calls the fsync system call, it instructs the operating system to flush the file's dirty pages from the page cache to the disk. The data is sent to the storage drive, but it may sit in the drive's internal buffer for a while before the drive's controller actually writes it to the physical media.
This means that even if we call fsync after writing to a file, there is still no strong guarantee that the data has been physically written to the disk by the time the call returns. For most applications fsync may be enough, but for software such as databases, which must ensure the integrity of their data, Mac OS X provides the F_FULLFSYNC command for fcntl.
The F_FULLFSYNC fcntl asks the drive to flush all buffered data to permanent storage. Applications, such as databases, that require a strict ordering of writes should use F_FULLFSYNC to ensure that their data is written in the order they expect.
From the fcntl man page:
F_FULLFSYNC: Does the same thing as fsync(2) then asks the drive to flush all buffered data to the permanent storage device (arg is ignored). As this drains the entire queue of the device and acts as a barrier, data that had been fsync'd on the same device before is guaranteed to be persisted when this call returns. This is currently implemented on HFS, MS-DOS (FAT), Universal Disk Format (UDF) and APFS file systems. The operation may take quite a while to complete. Certain FireWire drives have also been known to ignore the request to flush their buffered data.
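
As a rough illustration of the difference, here is a minimal Python sketch of a sync helper that prefers F_FULLFSYNC where the platform exposes it and otherwise falls back to plain fsync. The helper names `durable_sync` and `write_durably` are made up for this note, and the fallback policy is an assumption rather than what any particular database does.

```python
import fcntl
import os


def durable_sync(fd: int) -> str:
    """Push data for fd as far toward permanent storage as the platform allows.

    Returns the name of the mechanism that was used (for illustration only).
    """
    # fcntl.F_FULLFSYNC is only defined where the OS provides it (macOS).
    full_fsync = getattr(fcntl, "F_FULLFSYNC", None)
    if full_fsync is not None:
        try:
            fcntl.fcntl(fd, full_fsync)
            return "F_FULLFSYNC"
        except OSError:
            # Assumption: if the file system rejects the request,
            # a plain fsync is better than nothing.
            pass
    os.fsync(fd)
    return "fsync"


def write_durably(path: str, data: bytes) -> str:
    """Hypothetical helper: write data to path, then sync it."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()  # user-space buffer -> OS page cache
        return durable_sync(f.fileno())  # page cache -> drive (and media on macOS)
```

On Linux this always takes the `os.fsync` path. On macOS it should issue the full flush, which the man page warns may take quite a while to complete.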

Extra references:

- A commit with an educational commit message about a related fix in LevelDB: https://github.com/google/leveldb/commit/296de8d5b8e4e57bd1e46c981114dfbe58a8c4fa
- The blog post "How can you trust a disk to write data?" explains very nicely how benchmarks on macOS SSDs are essentially 'cheating'. In short, the benchmarks were not giving the "rate at which it could write data to the disk, but just to transfer the data to its high-speed cache memory, with the write itself occurring some time later, more slowly".
- Disk write buffering and its interactions with write flushes
- How to force a disk write cache flush operation on Linux
- The secret life of fsync
- Good Hacker News discussion on this issue

## Disk buffer

Embedded memory in an HDD or SSD that acts as a buffer/cache between the host and the drive's storage media, improving read/write performance.
The disk buffer is not to be confused with the page cache. The page cache is managed by the operating system and lives in the system's main memory (RAM), whereas the disk buffer is controlled by the drive's own on-board microcontroller.
Uses:

- Read-ahead/read-behind
- Command queuing
- Write acceleration

Once the data to be written has reached the disk buffer, the drive can signal to the operating system that the write operation is done. The drive's microcontroller then performs the actual writes asynchronously, and the CPU can move on to other tasks without waiting for the drive to finish writing the data to the media.
Leaving data in the disk buffer and deferring the write to the platter/flash memory can be dangerous if a system crash or power outage happens before the actual write takes place. The data will still be written if the program that issued the writes crashes, but it may not survive a power outage or, as the fsync man page warns, an OS crash. The filesystem may end up in a corrupted state once the system recovers.
From the operating system's perspective, the data has already been written to permanent storage. This is the same issue discussed above for fsync on OS X, and it is the gap that fcntl(F_FULLFSYNC) is meant to close.
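
To make the layers concrete, here is a toy Python model of the path a write takes: page cache (host RAM), then disk buffer (drive RAM), then the media. It is only a sketch. The class and method names are invented for this note, and it ignores that a real drive drains its buffer on its own in the background and may reorder writes.

```python
class ToyStorageStack:
    """Toy model: page cache (OS RAM) -> disk buffer (drive RAM) -> media."""

    def __init__(self):
        self.page_cache = []   # dirty data still held by the OS
        self.disk_buffer = []  # data the drive has acknowledged but not persisted
        self.media = []        # data physically on the platters/flash

    def write(self, data):
        # A plain write only reaches the OS page cache.
        self.page_cache.append(data)

    def fsync(self):
        # Host -> drive: the drive acknowledges once data is in its buffer.
        self.disk_buffer.extend(self.page_cache)
        self.page_cache.clear()

    def full_fsync(self):
        # Like fsync, then ask the drive to drain its buffer (F_FULLFSYNC).
        self.fsync()
        self.media.extend(self.disk_buffer)
        self.disk_buffer.clear()

    def power_loss(self):
        # Volatile memory on both the host and the drive is wiped.
        self.page_cache.clear()
        self.disk_buffer.clear()
        return list(self.media)


stack = ToyStorageStack()
stack.write("A")
stack.full_fsync()         # "A" reaches the media
stack.write("B")
stack.fsync()              # "B" only reaches the disk buffer
stack.write("C")           # "C" is still in the page cache
print(stack.power_loss())  # ['A']
```

In this model only the write followed by `full_fsync` survives the simulated power loss. The write followed by plain `fsync` is acknowledged but still lost.
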
### What does the disk buffer store?

TBD
### Why are SSDs with DRAM faster at random access than DRAM-less SSDs?

TBD
### Host-Memory Buffer (HMB)

https://www.pcworld.com/article/784380/host-memory-buffer-hmb-the-dram-less-nvme-technology-thats-making-ssds-cheaper.html
Extra references:
The difference between fsync & Write Through, through the OS eyes