@AhmedSa-mir
Created August 9, 2018 15:05
This file describes the modifications I added to my FFLIB fork during my contribution with the Ste||ar Group in GSoC 2018.

The modifications I’ve added in my FFLIB fork:

  • A libfabric component. The main component files are:

    • The connect_libfabric file: It implements all the libfabric calls used by the libfabric component. It is the largest file because it holds all the libfabric logic.
    • The fflibfabric file: It contains the initialization and finalization APIs.
    • The recv and send files: They post the send and receive operations through libfabric.
    • The libfabric progresser file: It contains the methods the progresser uses to track operation completion. The ffop_libfabric_progresser_progress method is where the receive and send completion queues are polled.
  • A libfabric binding file: It binds the FFLIB APIs to the libfabric APIs so that FFLIB runs with the libfabric backend.

  • 3 test files running on 2 nodes:

    • send_recv_libfabric test: One node sends some arbitrary data and the other node receives it.
    • pingpong_sched_libfabric test: The two nodes each post an FFLIB schedule. Each schedule contains a send and a receive, so each node both sends and receives data, like ping-pong.
    • allreduce_libfabric test

Simple test demonstration

The test send_recv_libfabric goes like this:

  • Initializing FFLIB and binding the FFLIB APIs to the libfabric backend.
  • Creating a progresser thread that tracks the completions of the operations.
  • Initializing the libfabric connection between the 2 nodes. Every node allocates the needed resources (fabric, domain, memory region, endpoints, etc.) and then establishes a connection with the other node through sockets.
  • The sender puts the data into a buffer from the memory region that it has allocated with the endpoint. Then the send operation is posted in libfabric and scheduled in FFLIB.
  • The receiver gets a buffer from the memory region allocated with the endpoint and posts a receive operation to reserve the buffer for incoming data. The receive operation is also scheduled in FFLIB.
  • Then each node polls for the completion of its operation. The sender polls the send completion queue while the receiver polls the receive completion queue.
  • When the send operation completes, the sender’s role is finished. The receiver waits for the receive completion and then checks that the received message matches what the sender sent.
  • Finally the 2 nodes free the resources that they have allocated and exit.