@yuvalif
Last active April 5, 2025 11:02

The More the Merrier

Background

Persistent bucket notifications are a very useful and powerful feature. To learn more about it, you can look at this tech talk and use case example.

Persistent notifications are usually better than synchronous notifications, for several reasons:

  • the queue they use is, in fact, a RADOS object, which gives the queue the reliability level of RADOS
  • they do not add the delay of sending the notification to the broker to the client request's round-trip time
  • they tolerate temporary broker disconnects or broker restarts without affecting the service
  • they have a retry mechanism based on time and number of attempts

However, they can pose a performance issue: the notifications for a specific bucket are written to a single RADOS queue, and are therefore handled by a single OSD.

The actual objects, on the other hand, are written to RADOS objects that are sharded across multiple OSDs. So, even though the notification entries are relatively small (usually under 1KB), they do not enjoy the parallelism of the object writes.

This means that when small objects are written to the bucket, the overhead of the notifications is considerable. In this project, the goal is to create a sharded bucket notification queue, allowing better performance when sending persistent bucket notifications.

Evaluation Stage

Step 1 - Build Ceph and Run Basic Test

First, set up a Linux-based development environment; at a minimum you will need a machine with 4 CPUs, 8GB RAM, and a 50GB disk. Unless you already have a Linux distro you like, I would recommend choosing from:

  • Fedora (40/41) - my favorite!
  • Ubuntu (22.04 LTS)
  • WSL (Windows Subsystem for Linux), though it would probably take much longer...
  • RHEL9/Centos9
  • Other Linux distros - try at your own risk :-)

Once you have that up and running, you should clone the Ceph repo from GitHub (https://github.com/ceph/ceph). If you don't know what GitHub and git are, this is the right time to close these gaps :-) And yes, you should have a GitHub account, so you can later share your work on the project.

To install any missing system dependencies, use:

./install-deps.sh

Note that the first build may take a long time, so the following cmake parameters can be used to minimize build time. With a fresh Ceph clone, use the following:

./do_cmake.sh -DBOOST_J=$(nproc) -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DWITH_MGR_DASHBOARD_FRONTEND=OFF \
  -DWITH_DPDK=OFF -DWITH_SPDK=OFF -DWITH_SEASTAR=OFF -DWITH_CEPHFS=OFF -DWITH_RBD=OFF -DWITH_KRBD=OFF -DWITH_CCACHE=OFF

If the build directory already exists, you can regenerate the ninja files by running (from within build):

cmake -DBOOST_J=$(nproc) -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DWITH_MGR_DASHBOARD_FRONTEND=OFF \
  -DWITH_DPDK=OFF -DWITH_SPDK=OFF -DWITH_SEASTAR=OFF -DWITH_CEPHFS=OFF -DWITH_RBD=OFF -DWITH_KRBD=OFF -DWITH_CCACHE=OFF ..

Then invoke the build process (using ninja) from within the build directory (created by do_cmake.sh). Assuming the build completed successfully, you can run the unit tests (see: https://github.com/ceph/ceph#running-unit-tests).

Now you are ready to run the Ceph processes, as explained here: https://github.com/ceph/ceph#running-a-test-cluster. You will probably also want to check the developer guide (https://docs.ceph.com/docs/master/dev/developer_guide/) and learn more about how to build Ceph and run it locally (https://docs.ceph.com/docs/master/dev/quick_guide/), as well as Ceph's bucket notification documentation.

Run bucket notification tests for persistent notifications using an HTTP endpoint:

  • start the vstart cluster:
$ MON=1 OSD=1 MDS=0 MGR=0 RGW=1 ../src/vstart.sh -n -d
  • in a separate terminal, start an HTTP endpoint:
$ wget https://gist.githubusercontent.com/mdonkers/63e115cc0c79b4f6b8b3a6b797e485c7/raw/a6a1d090ac8549dac8f2bd607bd64925de997d40/server.py
$ python server.py 10900
  • install the AWS CLI tool
  • configure the tool with the access and secret keys shown in the output of the vstart.sh command
  • set the region to default
  • create a persistent topic pointing to the above HTTP endpoint:
$ aws --endpoint-url http://localhost:8000 sns create-topic --name=fishtopic \
  --attributes='{"push-endpoint": "http://localhost:10900", "persistent": "true"}'
  • create a bucket:
$ aws --endpoint-url http://localhost:8000 s3 mb s3://fish
  • create a notification on that bucket, pointing to the above topic:
$ aws --endpoint-url http://localhost:8000 s3api put-bucket-notification-configuration  --bucket fish \
  --notification-configuration='{"TopicConfigurations": [{"Id": "notif1", "TopicArn": "arn:aws:sns:default::fishtopic", "Events": []}]}'

Leaving the event list empty is equivalent to setting it to ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"]

  • create a file, and upload it:
$ head -c 512 </dev/urandom > myfile
$ aws --endpoint-url http://localhost:8000 s3 cp myfile s3://fish
  • on the HTTP terminal, see the JSON output of the notifications

Step 2

Try to address one of these (relatively) small features (see the tracker issues referenced in the comments below):

Please provide a draft PR with your code (does not have to be a complete implementation of the feature).

Project Goals

  • sharded implementation of persistent topic queue
  • stretch goal: performance test proving performance improvement

Design Considerations (for the proposal)

  • shards creation should happen when the topic and queue are created
  • shard ownership should be implemented similarly to how queue ownership is implemented
    • should a single RGW own all shards of a queue, or should we allow split ownership?
  • we should find the right shards when making the reservation
    • we should hash an identifier from the notification into a number and take it modulo the number of shards (a minimal sketch of such a mapping is shown after this list)
    • the hash should distribute uniformly regardless of the values of the identifier
    • we should decide which field(s) to use for the hash; at a minimum we should avoid reordering the notifications of a single object
  • the number of shards should be a config option
    • what should we do when this number is changed?
    • do we want to allow changes to existing queues, or only new ones?
    • should we act differently when the number increases vs. decreases?
  • should we handle migration of existing queues? Or apply sharding only to new queues?
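
To make the hashing idea above concrete, here is a minimal C++ sketch. It is not part of the Ceph code base; the helper name pick_shard and the choice of bucket name + object key as the identifier are assumptions for illustration only:

#include <cstdint>
#include <functional>
#include <string>

// pick_shard is a hypothetical helper: hash an identifier derived from the
// notification and map it to one of the queue shards (num_shards > 0 assumed).
// Hashing bucket name + object key keeps all notifications of a single object
// on the same shard, so their relative order is preserved.
uint32_t pick_shard(const std::string& bucket_name,
                    const std::string& object_key,
                    uint32_t num_shards) {
  const std::size_t h = std::hash<std::string>{}(bucket_name + "/" + object_key);
  return static_cast<uint32_t>(h % num_shards);
}
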
@9401adarsh

9401adarsh commented Feb 23, 2025

Hello there, I have made an attempt at addressing issue #68788.
This is the link to the draft PR.

@yuvalif
Author

yuvalif commented Feb 27, 2025

another simple tracker for the evaluation phase: https://tracker.ceph.com/issues/55790

@9401adarsh

9401adarsh commented Feb 28, 2025

Want to add that until this is resolved - https://tracker.ceph.com/issues/70040 - use the following workaround for testing bucket notifications.

aws --endpoint-url http://localhost:8000 s3 cp myfile s3://fish will throw errors, since RGW support for the default --checksum-algorithm CRC64NVME is currently buggy, as reported in the linked issue. AWS Reference

Use aws --endpoint-url http://localhost:8000 s3 cp myfile s3://fish --checksum-algorithm <checksum-algorithm> instead.
Supported options: CRC32, SHA1, SHA256 and CRC32C

@AydanPirani

Hi @yuvalif ! I'm interested in working on this (and continuing over the summer as part of GSOC), is the overall process still the same?

@yuvalif
Author

yuvalif commented Mar 25, 2025

> Hi @yuvalif ! I'm interested in working on this (and continuing over the summer as part of GSOC), is the overall process still the same?

not sure what you mean by "the same". the official process, timeline etc. is as described in the GSoC site.
the "evaluation stage", which should be done before you submit the proposal, is described above in the gist.

@9401adarsh

Hi @yuvalif , is there any high-level overview doc for 2pc_cls_queue? I wanted to understand how the dequeue works in the context of bucket notifications.

Does it have a mechanism similar to two-phase commit, where we mark something for deletion and only delete it after we reliably know that the operation associated with the entry has been performed?

@yuvalif
Author

yuvalif commented Mar 31, 2025

@9401adarsh there is no explicit "dequeue" operation.
the process you described above is implemented using "read" and "delete".
the bucket notification code reads a bulk of notifications, sends them to the endpoint, and deletes them once we get an ack.

read entries: https://github.com/ceph/ceph/blob/main/src/cls/2pc_queue/cls_2pc_queue_client.h#L74
delete entries: https://github.com/ceph/ceph/blob/main/src/cls/2pc_queue/cls_2pc_queue_client.h#L93

since there is a single owner for the dequeue process, this mechanism works well.
if we wanted to allow multiple dequeuers (similar to the way we have multiple enqueuers), we would need a different mechanism.
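
For illustration, here is a rough C++ sketch of that read/send/delete loop. The helpers read_entries, send_to_endpoint and remove_entries_up_to are hypothetical stand-ins for the cls_2pc_queue client calls linked above (stubbed so the sketch compiles), not the actual API:

#include <string>
#include <vector>

struct Entry {
  std::string marker;   // position in the queue
  std::string payload;  // serialized notification
};

// hypothetical stand-ins for the queue class operations and the endpoint
std::vector<Entry> read_entries(const std::string& /*queue*/, uint32_t /*max*/) { return {}; }
bool send_to_endpoint(const Entry& /*e*/) { return true; }  // true == broker acked
void remove_entries_up_to(const std::string& /*queue*/, const std::string& /*marker*/) {}

void process_queue(const std::string& queue) {
  // the single owner of the queue reads a bulk of notifications, pushes them
  // to the endpoint, and only then deletes the prefix that was acked
  const auto entries = read_entries(queue, 128);
  std::string last_acked;
  for (const auto& e : entries) {
    if (!send_to_endpoint(e)) {
      break;  // on failure, remaining entries stay queued and will be retried
    }
    last_acked = e.marker;
  }
  if (!last_acked.empty()) {
    remove_entries_up_to(queue, last_acked);
  }
}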

@kylehaokunwu

Hi @yuvalif , I just submitted a draft PR on issue 68788. Here is the link to my PR.

@9401adarsh

9401adarsh commented Apr 5, 2025

Hi @yuvalif, I am trying to identify the impact areas for my design.

Where do I register a new cluster-level configuration parameter that I would want RGW to pick up? By this I mean: after registering the new parameter, I should be able to add an entry for it to the ceph.conf file and use that parameter effectively.

@yuvalif
Author

yuvalif commented Apr 5, 2025

@9401adarsh RGW options are registered here: https://github.com/ceph/ceph/blob/main/src/common/options/rgw.yaml.in
later on, from the code you access these options (assuming you have a pointer to CephContext called cct):

cct->_conf->rgw_...
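
For example, assuming a new option has been registered in rgw.yaml.in (the option name rgw_persistent_topic_shards below is hypothetical, just a stand-in for whatever you register):

// hypothetical option name; the real option must first be added to
// src/common/options/rgw.yaml.in
const uint32_t num_shards = cct->_conf->rgw_persistent_topic_shards;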
