GSoC 2024 OpenDAL Project Proposal Draft

[toc]

1 Basic Information

2 Project Information

  • Project Name: ovirtiofs, OpenDAL File System via Virtio
  • Related Issues: apache/opendal#4133
  • Potential Mentors: Xuanwo
  • Project Community: Apache OpenDAL
  • Project Scale: Large, ~350 hours

3 About Me

I am Runjie Yu, a first-year master's student majoring in Computer Science and Technology at Huazhong University of Science and Technology in China. I also earned my undergraduate degree from the same university. I have a strong passion for programming and have experience in various programming languages, including Rust, Go, C++, Swift, and more. My primary interest lies in systems development, particularly in the areas of file systems and databases, and I am eager to achieve meaningful research results. I have completed internships at Tencent and ByteDance, and have participated in several projects with Huawei. Coding is not merely a skill for me; I perceive it as a long-term career.

I believe OpenDAL is an outstanding Apache project. It provides a unified, efficient, and cost-effective data access interface for numerous storage services. It can seamlessly integrate into various systems as a storage layer, holding significant potential and value in the prevailing era of cloud-native technologies. I am determined to contribute my best efforts to its development through continuous work.

4 Project Abstract

Virtio is an open standard designed to enhance I/O performance between virtual machines (VMs) and host systems in virtualized environments. VirtioFS is an extension of the Virtio standard specifically crafted for file system sharing between VMs and the host. This is particularly beneficial in scenarios where seamless access to shared files and data between VMs and the host is essential. VirtioFS has been widely adopted in virtualization technologies such as QEMU and Kata Container.

Apache OpenDAL is a data access layer that allows users to easily and efficiently retrieve data from various storage services in a unified manner. In this project, our goal is to reference virtiofsd (a standard vhost-user backend, a pure Rust implementation of VirtioFS based on the local file system) and implement VirtioFS based on OpenDAL.

This file-system-as-a-service approach conceals the details of the distributed storage system's file system from VMs. This ensures the security of storage services, as VMs do not need to be aware of the information, configuration, or permission credentials of the accessed storage service. Additionally, it enables the use of a new backend storage system without reconfiguring all VMs. Through this project, VMs can access numerous data services through the file system interface with the assistance of the OpenDAL service deployed on the host, all without their awareness. Furthermore, VirtioFS support ensures the efficiency of file system reads and writes in VMs.

5 Project Detailed Description (Draft)

This chapter serves as an introduction to the overall structure of the project, outlining the design ideas and principles of critical components. It covers the ovirtiofs architecture, interaction principles, key-value based metadata management design, cache pool design, configuration support, and the expected POSIX interface support.

5.1 The Architecture of ovirtiofs

ovirtiofs is a file system implementation based on the VirtioFS protocol and Apache OpenDAL. It serves as a bridge for semantic access to file system interfaces between VMs and external storage systems. Leveraging the multiple service access capabilities and unified abstraction provided by OpenDAL, ovirtiofs can conveniently mount shared directories in VMs on various existing distributed storage services.

The complete ovirtiofs architecture consists of three crucial components:

  • VMs FUSE client that supports the VirtioFS protocol and implements the VirtioFS Virtio device specification. An appropriately configured Linux 5.4 or later can be used for ovirtiofs. The VirtioFS protocol is built on FUSE and utilizes the VirtioFS Virtio device to transmit FUSE messages. In contrast to traditional FUSE, where the file system daemon runs in the guest user space, the VirtioFS protocol supports forwarding file system requests from the guest to the host, enabling related processes on the host to function as the guest's local file system.
  • A hypervisor that implements the VirtioFS Virtio device specification, such as QEMU. The hypervisor needs to adhere to the VirtioFS Virtio device specification, supporting devices used during the operation of VMs, managing the filesystem operations of the VMs, and delegating these operations to a specific vhost-user device backend implementation.
  • A vhost-user backend implementation, namely ovirtiofsd. This is the crucial component that requires particular attention in this project. This backend is a file system daemon running on the host side, responsible for handling all file system operations from VMs accessing the shared directory. virtiofsd offers a practical example of a vhost-user backend implementation, written in pure Rust, forwarding the VMs' file system requests to the local file system on the host side.

5.2 How ovirtiofsd Interacts with VMs and the Hypervisor

The Virtio specification defines device emulation and communication between VMs and the hypervisor. The virtio queue is a core component of the communication mechanism in the Virtio specification and the key mechanism for achieving efficient communication between VMs and the hypervisor. A virtio queue is essentially a shared memory area, called a vring, between VMs and the hypervisor, through which the guest exchanges data with the host.

Simultaneously, the Virtio specification provides various forms of Virtio device models and data interaction support. The vhost-user backend implemented by ovirtiofsd achieves information transmission through the vhost-user protocol. The vhost-user protocol enables the sharing of virtio queues through communication over Unix domain sockets. Interaction with VMs and the hypervisor is accomplished by listening on the corresponding sockets provided by the hypervisor.

In terms of specific implementation, the vm-memory crate, virtio-queue crate and vhost-user-backend crate play crucial roles in managing the interaction between ovirtiofsd, VMs, and the hypervisor.

The vm-memory crate provides an encapsulation of VM memory and decouples memory usage from its underlying implementation. Through the vm-memory crate, ovirtiofsd can access the relevant memory without knowing the implementation details of the VMs' memory. Two formats of virtio queue are defined in the Virtio specification: the split virtio queue and the packed virtio queue. The virtio-queue crate provides support for the split virtio queue. Through the DescriptorChain abstraction provided by the virtio-queue crate, ovirtiofsd can parse the corresponding virtio queue structure from the raw vring data. The vhost-user-backend crate provides a way to start and stop the file system daemon, as well as an encapsulation of vring access. ovirtiofsd implements the vhost-user backend service based on the framework provided by the vhost-user-backend crate and implements the event loop through which the file system process handles requests.
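To make the shape of this event loop concrete, below is a minimal, self-contained Rust sketch of how ovirtiofsd could drain a vring and dispatch FUSE requests. The `FuseRequest` and `Vring` types are simplified stand-ins for illustration only, not the actual abstractions exposed by the virtio-queue or vhost-user-backend crates, whose traits and signatures differ across versions.

```rust
use std::collections::VecDeque;

/// A FUSE message decoded from a descriptor chain in the shared vring (illustrative stand-in).
struct FuseRequest {
    opcode: u32,      // e.g. FUSE_LOOKUP, FUSE_READ, FUSE_WRITE
    nodeid: u64,      // inode the request targets
    payload: Vec<u8>, // opcode-specific body
}

/// Simplified stand-in for a virtio queue shared with the guest.
struct Vring {
    pending: VecDeque<FuseRequest>,
}

impl Vring {
    /// Pop the next descriptor chain, already decoded into a FUSE request.
    fn next_request(&mut self) -> Option<FuseRequest> {
        self.pending.pop_front()
    }
}

/// Shape of the backend event loop: drain the vring and dispatch each FUSE request to the
/// file system implementation. The real daemon would then write the reply back into the
/// descriptor chain and update the used ring so the guest driver sees the completion.
fn handle_queue_event(vring: &mut Vring) {
    while let Some(req) = vring.next_request() {
        let _reply: Vec<u8> = match req.opcode {
            // 1 = FUSE_LOOKUP, 15 = FUSE_READ, 16 = FUSE_WRITE in the FUSE ABI.
            1 => lookup(req.nodeid, &req.payload),
            15 => read(req.nodeid, &req.payload),
            16 => write(req.nodeid, &req.payload),
            _ => Vec::new(), // a real implementation replies with ENOSYS here
        };
    }
}

// Placeholder handlers standing in for the OpenDAL-backed file system operations.
fn lookup(_nodeid: u64, _body: &[u8]) -> Vec<u8> { Vec::new() }
fn read(_nodeid: u64, _body: &[u8]) -> Vec<u8> { Vec::new() }
fn write(_nodeid: u64, _body: &[u8]) -> Vec<u8> { Vec::new() }
```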

5.3 Key-Value Based Metadata Management Design

ovirtiofsd implements the file system model based on OpenDAL. A file system model providing POSIX semantics needs to offer access to file data and metadata, maintenance of directory trees (hierarchical relationships between files), and additional POSIX interfaces.

Why Metadata Management Is Needed in ovirtiofsd

OpenDAL provides native support for various storage systems, such as object storage, file storage, key-value storage, and more. However, not all storage systems directly offer a file system abstraction. Take AWS S3, which provides object storage services, as an example. It abstracts the concepts of buckets and objects, allowing users to create multiple buckets and multiple objects within each bucket. This classic two-level relationship in object storage is challenging to represent directly in the nested structure of a file system directory tree.

To enable ovirtiofsd to provide richer support for various types of storage systems, metadata management and data management within the file system daemon are required. Specifically, additional metadata management support is implemented within ovirtiofsd, and data operations are carried out through OpenDAL.

Directory Tree Based on the Key-Value Model

File system metadata is system data used to describe the characteristics of a file. It encompasses information such as the file name, file size, file type, access time, and access permissions. This data is typically stored in the struct stat structure. The inode ID is a unique identifier generated by the file system; it distinguishes a file object within the file system and is associated with the metadata of that file or directory.

ovirtiofsd stores the metadata of files and directories in the file system, along with the file hierarchy organized by directories, using a key-value encoding. The key format is "inode_id:sub_entry_name" (the parent directory's inode ID followed by the child entry's name), and the corresponding value is "<sub_entry_inode_id, sub_entry_metadata>".

| key             | value            |
| --------------- | ---------------- |
| 0:"home"        | <1, struct stat> |
| 1:"test_dir_1"  | <2, struct stat> |
| 2:"test_dir_2"  | <3, struct stat> |
| 2:"test_file_1" | <4, struct stat> |
| 2:"test_file_2" | <5, struct stat> |

The table above provides an encoding example. This is a classic metadata abstraction pattern commonly used to expedite path traversal and metadata operations. Through range queries and point queries, we can perform metadata operations based on the hierarchical directory tree structure built upon this encoding.
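As a concrete illustration, the following is a minimal Rust sketch of this encoding, using a BTreeMap as a stand-in for the key-value database so that prefix range scans are available (as they would be in LevelDB or RocksDB). The `Stat` type and function names are simplified assumptions for illustration, not the final ovirtiofsd data structures.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for `struct stat`; the real implementation stores the full
/// attribute set (size, mode, timestamps, ...).
#[derive(Clone, Debug)]
struct Stat {
    size: u64,
    is_dir: bool,
}

/// Encode the metadata key as "parent_inode_id:entry_name". A real implementation would
/// use a fixed-width binary inode encoding to keep key prefixes unambiguous.
fn entry_key(parent: u64, name: &str) -> String {
    format!("{parent}:{name}")
}

/// A toy metadata store; BTreeMap stands in for the key-value database.
struct MetaStore {
    kv: BTreeMap<String, (u64, Stat)>,
}

impl MetaStore {
    /// Point query: resolve one path component under a directory inode.
    fn lookup(&self, parent: u64, name: &str) -> Option<(u64, Stat)> {
        self.kv.get(&entry_key(parent, name)).cloned()
    }

    /// Walk a path from the root inode (0) one component at a time.
    fn resolve(&self, path: &str) -> Option<(u64, Stat)> {
        let mut cur = (0u64, Stat { size: 0, is_dir: true });
        for comp in path.split('/').filter(|c| !c.is_empty()) {
            cur = self.lookup(cur.0, comp)?;
        }
        Some(cur)
    }

    /// Range query: list all entries of a directory by scanning the "inode:" prefix.
    fn readdir(&self, dir_inode: u64) -> Vec<(String, u64)> {
        let prefix = format!("{dir_inode}:");
        self.kv
            .range(prefix.clone()..)
            .take_while(|(k, _)| k.starts_with(&prefix))
            .map(|(k, (ino, _))| (k[prefix.len()..].to_string(), *ino))
            .collect()
    }
}
```

With the example table above, `resolve("/home/test_dir_1")` would perform two point queries and return inode 2, and `readdir(2)` would perform one range scan returning test_dir_2, test_file_1, and test_file_2.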

The Key-Value model has the following benefits:

  • Metadata storage and access are delegated to a key-value database, inheriting the benefits of mature key-value stores, including read/write optimization, consistency guarantees, and operational stability.
  • When using LSM-tree based key-value databases such as LevelDB and RocksDB, the in-memory sorted write path significantly speeds up metadata writes (metadata entries are usually small), and multiple writes can be merged.

5.4 Multi-Granularity Object Size Cache Pool

Thanks to the separation of metadata and data management, with metadata entrusted to key-value databases, ovirtiofsd itself only needs to focus on data management. To improve data read/write performance and avoid the significant overhead of repeatedly transferring hot data between the storage system and the host, ovirtiofsd needs to build an in-memory data cache on the host side. For metadata, key-value databases already provide a good caching mechanism that ensures high-speed reads and writes, so metadata will not become a bottleneck for ovirtiofsd.

Cache Pool Based on Multiple Linked Lists

ovirtiofsd will create a memory pool to cache file data during file system reads and writes. This large memory pool is divided into blocks of different granularities (such as 4 KB, 16 KB, and 64 KB) to accommodate data blocks of different sizes.

Unused cache blocks of the same size in the memory pool are organized into a linked list. When a cache block needs to be allocated, an unused block can be taken directly from the head of the list. When a cache block that is no longer used needs to be recycled, it is appended to the tail of the list. By using linked lists, not only is the algorithmic complexity of allocation and recycling O(1), but lock-free concurrency can also be achieved using CAS operations.
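The sketch below illustrates the lock-free idea with a Treiber stack over fixed-size blocks. For simplicity it pushes and pops at the head rather than maintaining the head/tail queue described above, and it omits the memory-reclamation (ABA) protection a production version would need (e.g. epoch-based reclamation via crossbeam). All names are illustrative assumptions, not the final ovirtiofsd types.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

/// A node in the free list; each node owns one fixed-size cache block.
struct Node {
    block: Box<[u8]>,
    next: *mut Node,
}

/// A lock-free free list (Treiber stack) of same-sized cache blocks.
pub struct FreeList {
    head: AtomicPtr<Node>,
}

impl FreeList {
    /// Pre-allocate `count` blocks of `block_size` bytes.
    pub fn new(block_size: usize, count: usize) -> Self {
        let list = FreeList { head: AtomicPtr::new(ptr::null_mut()) };
        for _ in 0..count {
            list.push(vec![0u8; block_size].into_boxed_slice());
        }
        list
    }

    /// Recycle a block: push it onto the stack with a CAS loop.
    pub fn push(&self, block: Box<[u8]>) {
        let node = Box::into_raw(Box::new(Node { block, next: ptr::null_mut() }));
        loop {
            let head = self.head.load(Ordering::Acquire);
            unsafe { (*node).next = head };
            if self.head.compare_exchange(head, node, Ordering::Release, Ordering::Relaxed).is_ok() {
                return;
            }
        }
    }

    /// Allocate a block: pop the stack head with a CAS loop.
    /// NOTE: a real implementation needs ABA protection before dereferencing `head`.
    pub fn pop(&self) -> Option<Box<[u8]>> {
        loop {
            let head = self.head.load(Ordering::Acquire);
            if head.is_null() {
                return None;
            }
            let next = unsafe { (*head).next };
            if self.head.compare_exchange(head, next, Ordering::Release, Ordering::Relaxed).is_ok() {
                let node = unsafe { Box::from_raw(head) };
                return Some(node.block);
            }
        }
    }
}
// A Drop impl that frees any remaining nodes is omitted for brevity.
```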

Write Back Strategy

ovirtiofsd manages the data reading and writing process through a write-back strategy. Specifically, when writing data, the data is first written to the cache, and the dirty data is gradually synchronized to the backend storage system asynchronously. When reading file data, the data is requested from the backend storage system after a cache miss or expiration; the new data is then added to the cache and given an expiration time.

ovirtiofsd will write dirty data in the cache back to the storage system in the following cases (see the sketch after this list):

  • When VMs call fsync or fdatasync, or use related flags during data writing.
  • The cache pool is full, and dirty data needs to be written to make space in the cache. This is also known as cache eviction, and the eviction order can be maintained using the LRU algorithm.
  • When background threads that periodically clean dirty or expired data run (experimental).
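A minimal sketch of this write-back split follows, assuming OpenDAL's Rust `Operator::write` API for the flush path; the cache layout (whole files keyed by path) and all names are simplified assumptions rather than the final design, which would track and merge dirty blocks at a finer granularity.

```rust
use std::collections::HashMap;

use opendal::Operator;

/// Dirty data held in the host-side cache, keyed by file path. In this simplified
/// sketch a whole file is flushed at once; the real write-back path would merge
/// adjacent dirty blocks and write ranges where possible.
struct WriteBackCache {
    dirty: HashMap<String, Vec<u8>>,
}

impl WriteBackCache {
    /// Write path: data lands in the cache and is only marked dirty here;
    /// nothing is sent to the storage service yet.
    fn write(&mut self, path: &str, data: Vec<u8>) {
        self.dirty.insert(path.to_string(), data);
    }

    /// Flush path: called on fsync/fdatasync, on cache eviction, or by the periodic
    /// cleaner. Dirty entries are pushed to the backend storage through OpenDAL.
    async fn flush(&mut self, op: &Operator) -> opendal::Result<()> {
        for (path, data) in self.dirty.drain() {
            // Operator::write uploads the buffer to the configured storage service.
            op.write(&path, data).await?;
        }
        Ok(())
    }
}
```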

DAX Window Support (Experimental)

The VirtioFS protocol extends the FUSE protocol with the experimental DAX window feature. This feature allows memory mapping of file contents to be supported in virtualization scenarios. The mapping is set up by issuing a FUSE request to ovirtiofsd, which then communicates with QEMU to establish the VMs' memory map. VMs can remove mappings in a similar manner. The size of the DAX window can be configured based on the available VM address space and memory mapping requirements.

By using the mmap and memfd mechanisms, ovirtiofsd can use the data in the cache to create an anonymous memory-mapped area and share this mapping with VMs to implement the DAX window. The best performance is achieved when the file contents are fully mapped, eliminating the need for file I/O communication with ovirtiofsd. It is possible to use a small DAX window, but this incurs more memory map setup/removal overhead.
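The memfd/mmap part of this mechanism can be sketched as follows. This only shows how the daemon could create and map an anonymous shareable region whose fd could back the guest-visible window; the FUSE setup-mapping negotiation with QEMU that actually exposes it to the guest is omitted, and the function name and error handling are illustrative assumptions.

```rust
use std::ffi::CString;
use std::io;

/// Create an anonymous, shareable memory region with memfd_create and map it into the
/// daemon's address space; the returned fd could then be handed to QEMU to back the
/// DAX window. Linux-only, via the libc crate.
fn create_dax_region(len: usize) -> io::Result<(i32, *mut u8)> {
    let name = CString::new("ovirtiofs-dax").unwrap();
    // SAFETY: plain FFI calls; errors are checked and surfaced as io::Error.
    unsafe {
        let fd = libc::memfd_create(name.as_ptr(), libc::MFD_CLOEXEC);
        if fd < 0 {
            return Err(io::Error::last_os_error());
        }
        if libc::ftruncate(fd, len as libc::off_t) != 0 {
            return Err(io::Error::last_os_error());
        }
        let addr = libc::mmap(
            std::ptr::null_mut(),
            len,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_SHARED,
            fd,
            0,
        );
        if addr == libc::MAP_FAILED {
            return Err(io::Error::last_os_error());
        }
        Ok((fd, addr as *mut u8))
    }
}
```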

5.5 Flexible Configuration Support

Running QEMU With ovirtiofs

As described in the architecture, deploying ovirtiofs involves three parts: a guest kernel with VirtioFS support, QEMU with VirtioFS support, and the VirtioFS daemon (ovirtiofsd). Here is an example of running QEMU with ovirtiofsd:

```sh
host# ovirtiofsd --socket-path=/tmp/vfsd.sock --config-file=./config.toml

host# qemu-system \
        -blockdev file,node-name=hdd,filename=<image file> \
        -device virtio-blk,drive=hdd \
        -chardev socket,id=char0,path=/tmp/vfsd.sock \
        -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=<fs tag> \
        -object memory-backend-memfd,id=mem,size=4G,share=on \
        -numa node,memdev=mem \
        -accel kvm -m 4G

guest# mount -t virtiofs <fs tag> <mount point>
```

The configuration above creates two devices for the VMs in QEMU. The block device named hdd serves as the backend for the virtio-blk device within the VMs; it stores the VMs' disk image files and acts as the primary device within the VMs. The character device named char0 serves as the backend for the vhost-user-fs-pci device using the VirtioFS protocol in the VMs. This character device is of socket type and is connected to the ovirtiofs file system daemon through the socket path, forwarding file system messages and requests to ovirtiofsd.

It is worth noting that this configuration largely follows that of virtiofsd and omits many VM configuration options related to file system access permissions and boundary handling.

Enable Different Distributed Storage Systems

In order for ovirtiofs to utilize the extensive service support provided by OpenDAL, a corresponding service configuration file needs to be provided when running ovirtiofsd. The parameters in the configuration file are used to configure access to the storage system, including the data root address and permission credentials. Below is an example configuration file, using a TOML format similar to that of oli (a command line tool based on OpenDAL):

```toml
[enabled_service]
type = "s3"

[profiles.s3]
type = "s3"
root = "/assets"
bucket = "<bucket>"
region = "<region>"
endpoint = "https://s3.amazonaws.com"
access_key_id = "<access_key_id>"
secret_access_key = "<secret_access_key>"

[profiles.swift]
type = "swift"
endpoint = "https://openstack-controller.example.com:8080/v1/account"
container = "container"
token = "access_token"

[profiles.hdfs]
type = "hdfs"
root = "/tmp"
name_node = "hdfs://127.0.0.1:9000"
```

5.6 Expected POSIX Interface Support

Finally, the table below lists the expected POSIX system call support to be provided by ovirtiofs, along with the corresponding types of distributed storage systems used by OpenDAL.

| System Call | Object Storage | File Storage | Key-Value Storage |
| --- | --- | --- | --- |
| getattr | Support | Support | Not Support |
| mknod/unlink | Support | Support | Not Support |
| mkdir/rmdir | Support | Support | Not Support |
| open/release | Support | Support | Not Support |
| read/write | Support | Support | Not Support |
| truncate | Support | Support | Not Support |
| opendir/releasedir | Support | Support | Not Support |
| readdir | Support | Support | Not Support |
| rename | Support | Support | Not Support |
| flush/fsync | Support | Support | Not Support |
| getxattr/setxattr | Not Support | Not Support | Not Support |
| chmod/chown | Not Support | Not Support | Not Support |

Since the data volume of an individual file may be substantial, which conflicts with the design assumptions of key-value storage, we do not intend to include support for key-value storage backends in this project. Linux's complex permission system is also not within the scope of this project; users can restrict file system access behavior through the storage system access permissions configured in the ovirtiofs configuration file.

6 Proposed Schedule (Draft)

This chapter outlines the planning for the GSoC project. I can ensure that there will be ample time to complete the GSoC project during the development period. I plan to dedicate approximately 30 to 40 hours per week to the development and enhancement of the project. Additionally, I will communicate progress and any challenges encountered during development with my mentor on a weekly basis.

To ensure smooth progress and quality assurance of the project, I have divided the entire project cycle into multiple phases:

Planning Phase

This phase encompasses the first week of the project and the period preceding the official start. During this phase, it is crucial to thoroughly validate the project's feasibility by supplementing knowledge about Virtio and VirtioFS and crafting demo cases for verification testing. This lays a robust foundation for the subsequent phases of project development.

Development Phase

This phase constitutes the central stage of project development and requires advancing the project in accordance with the project plan and program design. The entire development phase will progress based on three key goals:

  • Complete the end-to-end process verification of ovirtiofs based on distributed file systems such as HDFS, and complete the development of the vhost-user backend file system daemon, using OpenDAL to implement data reading and writing.
  • Implement metadata management based on the key-value model; based on this metadata management model, ovirtiofs can use distributed object storage, such as S3, as backend storage to provide shared directory access for VMs.
  • Optimize the ovirtiofs access path, implement asynchronous data reading and writing by adding the cache, and complete the remaining functions that need to be implemented.

In addition, during the development process, attention should be paid to supplementing test cases and ensuring the correct advancement of development through sufficient unit testing.

Feedback and Optimization Phase

In this phase, I need to review the code with my mentor more frequently, look for potential problems introduced during development, optimize the logical organization and data structures, and ensure project quality through code refinement.

Documentation Improvements Phase

At this phase, most of the code work of the project should have been completed. Attention needs to be paid to supplementing the documentation and conducting more detailed testing of the project.

Maintenance Phase

This phase is a long-term phase, not limited to the last few weeks of the project. During this phase, the project needs to be maintained and extended according to the actual needs of the OpenDAL community: developing related features, building the surrounding ecosystem, and continuing to serve the community.

Specific plans and some key time points are listed in the table below:

| Week | Date Range | Tasks |
| --- | --- | --- |
| Week 1 | 05.27~06.02 | Planning Phase |
| Week 2 | 06.03~06.09 | Development Phase 1 |
| Week 3 | 06.10~06.16 | Development Phase 1 |
| Week 4 | 06.17~06.23 | Feedback and Optimization Phase |
| Week 5 | 06.24~06.30 | Development Phase 2 |
| Week 6 | 07.01~07.07 | Development Phase 2 |
| Week 7 | 07.08~07.14 | Submit midterm evaluations |
| Week 8 | 07.15~07.21 | Development Phase 3 |
| Week 9 | 07.22~07.28 | Development Phase 3 |
| Week 10 | 07.29~08.04 | Feedback and Optimization Phase |
| Week 11 | 08.05~08.11 | Documentation Improvements Phase |
| Week 12 | 08.12~08.18 | Maintenance Phase |
| Final Week | 08.19~08.26 | Submit final work product |

7 Contributions to OpenDAL So Far

8 References
