@anirudhacharya
Last active July 8, 2019 18:59
ps-lite overview

ps-lite is a communication framework for parameter servers.

Conceptual overview

Three main entities - Scheduler (always exactly one), Servers, and Workers.

Scheduler - The node that manages the cluster. It maintains a list of the nodes in the cluster and their addresses, handshakes with every node, and assigns each node a rank. It also sends control messages to other nodes, monitors whether nodes are alive or dead, and notifies the cluster when to begin work.
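The scheduler's bookkeeping can be pictured with a small sketch. This is not ps-lite code: the `Scheduler` class, its method names, and the heartbeat-timeout rule for deciding that a node is dead are all assumptions made for illustration.

```python
class Scheduler:
    """Toy scheduler: assigns ranks per role and tracks liveness (hypothetical API)."""

    def __init__(self, heartbeat_timeout=5.0):
        self.heartbeat_timeout = heartbeat_timeout  # assumed liveness rule
        self.addresses = {}   # (role, rank) -> address
        self.last_seen = {}   # (role, rank) -> time of last heartbeat
        self._next_rank = {"server": 0, "worker": 0}

    def register(self, role, address, now):
        """Handshake: record the node's address and hand out the next rank for its role."""
        rank = self._next_rank[role]
        self._next_rank[role] += 1
        self.addresses[(role, rank)] = address
        self.last_seen[(role, rank)] = now
        return rank

    def heartbeat(self, role, rank, now):
        """Record that a node reported in at time `now`."""
        self.last_seen[(role, rank)] = now

    def dead_nodes(self, now):
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.heartbeat_timeout
        )
```

Timestamps are passed in explicitly so the sketch stays deterministic; a real scheduler would read a clock.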

Server - Stores parameters as <key, value> pairs. Each server holds a contiguous range of keys. The keys are distributed among servers to prevent server load imbalance.
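As a rough illustration of contiguous key ranges, the sketch below splits a key space evenly across servers and looks up the owner of a key. The uniform split, the `MAX_KEY` constant, and both function names are assumptions for this example, not a statement of how ps-lite partitions keys.

```python
from bisect import bisect_right

MAX_KEY = 2**64 - 1  # assumed upper bound of the key space (hypothetical constant)


def server_key_ranges(num_servers, max_key=MAX_KEY):
    """Split [0, max_key) into num_servers contiguous half-open ranges."""
    step = max_key // num_servers
    ranges = [(i * step, (i + 1) * step) for i in range(num_servers)]
    # Let the last server absorb any remainder so the whole space is covered.
    ranges[-1] = (ranges[-1][0], max_key)
    return ranges


def server_for_key(key, ranges):
    """Return the rank of the server whose range contains `key`."""
    begins = [begin for begin, _ in ranges]
    return bisect_right(begins, key) - 1
```

For instance, `server_key_ranges(4, max_key=100)` gives four ranges of 25 keys each, and key 25 belongs to the server with rank 1.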

Worker - Performs gradient calculations. Each worker is responsible for one partition of the training data. Workers communicate with the servers using the push and pull routines: pull fetches the latest parameter values from the servers, and push sends the updated parameter values back to the servers. Each worker deals with only a subset of all the parameters, but its communication will involve all the servers.
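The push/pull interaction can be sketched with in-memory stand-ins for the servers. Everything here is hypothetical: the `Server` and `Worker` classes are toy models, and treating a push as "add the pushed value to the stored value" is an assumed update rule chosen only to keep the example short.

```python
class Server:
    """Toy server holding <key, value> pairs for one contiguous key range."""

    def __init__(self, key_range):
        self.begin, self.end = key_range
        self.store = {}

    def push(self, kvs):
        # Assumed update rule: accumulate pushed values into the store.
        for key, value in kvs.items():
            self.store[key] = self.store.get(key, 0.0) + value

    def pull(self, keys):
        return {key: self.store.get(key, 0.0) for key in keys}


class Worker:
    """Toy worker that routes each key to the server owning its range."""

    def __init__(self, servers):
        self.servers = servers

    def _owner(self, key):
        for server in self.servers:
            if server.begin <= key < server.end:
                return server
        raise KeyError(key)

    def push(self, kvs):
        for key, value in kvs.items():
            self._owner(key).push({key: value})

    def pull(self, keys):
        result = {}
        for key in keys:
            result.update(self._owner(key).pull([key]))
        return result
```

A worker built over two servers with ranges `(0, 10)` and `(10, 20)` sends key 1 to the first server and key 12 to the second, without the caller needing to know which server owns which key.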

Code overview

Postoffice

A singleton class. Every node contains only 1 instance of this object. Used for configuring info about the current node, such as -

  • node type - worker, server, scheduler
  • node id or rank of a node
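A minimal sketch of the singleton idea follows. It is a Python stand-in, not the C++ class: the `get` and `configure` names are hypothetical, and the environment variable names (`DMLC_ROLE`, `DMLC_NUM_WORKER`, `DMLC_NUM_SERVER`) are the ones ps-lite is commonly launched with, used here as an assumption.

```python
import os


class Postoffice:
    """Toy per-process singleton holding this node's configuration."""

    _instance = None

    @classmethod
    def get(cls):
        # Create the single instance lazily and reuse it afterwards.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def __init__(self):
        self.role = None
        self.rank = None  # stays unset until the scheduler assigns it
        self.num_workers = 0
        self.num_servers = 0

    def configure(self, env=None):
        """Read the node type and cluster size from an environment mapping."""
        env = os.environ if env is None else env
        self.role = env["DMLC_ROLE"]
        self.num_workers = int(env["DMLC_NUM_WORKER"])
        self.num_servers = int(env["DMLC_NUM_SERVER"])

    @property
    def is_worker(self):
        return self.role == "worker"
```

Passing a plain dict to `configure` instead of the real environment makes the sketch easy to try out.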

New dynamic training (DT) tasks

  • Adding new worker nodes and synchronizing them with the worker group.
  • Updating environment variables.

Van

It is a member of the Postoffice singleton. It holds a mapping of node_ids to addresses and manages the communication in the cluster. It starts the 'receiving thread' on each node, which receives messages from other nodes. The Van class only provides the interface for communication between nodes; the actual transport is handled by the ZeroMQ library.

It also handles tasks such as removing node_ids from the server and worker lists during dynamic training.

Note - the logic for adding and removing a node is currently split between the Postoffice and Van classes. This can be refactored.
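The shape of the Van can be sketched as an address table plus a background receiving loop. In this hypothetical version a `queue.Queue` stands in for the ZeroMQ socket, and the method names (`start`, `stop`, `remove_node`) are assumptions for illustration.

```python
import queue
import threading


class Van:
    """Toy Van: node_id -> address table and a receiving thread."""

    def __init__(self):
        self.addresses = {}         # node_id -> "host:port"
        self.inbox = queue.Queue()  # stand-in for a ZeroMQ socket
        self._on_message = None
        self._thread = threading.Thread(target=self._receiving, daemon=True)

    def start(self, on_message):
        """Start the receiving thread; each message is handed to on_message."""
        self._on_message = on_message
        self._thread.start()

    def _receiving(self):
        while True:
            msg = self.inbox.get()
            if msg is None:  # sentinel used by stop()
                break
            self._on_message(msg)

    def stop(self):
        """Ask the receiving thread to exit and wait for it."""
        self.inbox.put(None)
        self._thread.join()

    def remove_node(self, node_id):
        """Forget a node, e.g. when it leaves during dynamic training."""
        self.addresses.pop(node_id, None)
```

Keeping the transport behind `inbox` mirrors the point that the Van is an interface: swapping the queue for a real socket would not change the callers.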

Customer

This is used for communicating with other nodes. Each connection with another node corresponds to an instance of the Customer object. A Customer also has its own receiving thread: once the Van of a node receives a message on its receiving thread, it passes that message to the corresponding Customer's receiving thread.

The Customer object is responsible for keeping track of the status of the messages it has sent and received.
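One way to picture this status tracking is a per-request counter of expected versus received responses. The sketch below is an assumption about how such tracking could look, with hypothetical method names, not a copy of the actual class.

```python
import threading


class Customer:
    """Toy request tracker: index = request timestamp, value = [expected, received]."""

    def __init__(self):
        self._tracker = []
        self._cond = threading.Condition()

    def new_request(self, num_expected):
        """Register a request and return its timestamp (a running index)."""
        with self._cond:
            self._tracker.append([num_expected, 0])
            return len(self._tracker) - 1

    def accept_response(self, timestamp):
        """Called when a response for `timestamp` arrives."""
        with self._cond:
            self._tracker[timestamp][1] += 1
            self._cond.notify_all()

    def wait_request(self, timestamp, timeout=None):
        """Block until all expected responses arrived; returns False on timeout."""
        def done():
            expected, received = self._tracker[timestamp]
            return received >= expected

        with self._cond:
            return self._cond.wait_for(done, timeout)
```

A worker that pushes to two servers would call `new_request(2)`, and `wait_request` would return once both servers have replied.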

Elastic_Training

  • Invokes a membership change in the cluster. The function simply calls Postoffice's updateEnvVariable function, so it could be moved into Postoffice.
  • Has a function called OnSuccessUpdatingEnv, which sets up a new host that has been added to the cluster by calling launchCommandOnNewWorker, which in turn executes a series of bash commands. (Note - There is a problem here: when a node is replaced, rather than just added or removed, the system takes 2 epochs to completely update the cluster. It removes the old node in one epoch and adds the new node in the next epoch. There is no reason why the replacement node should sit idle for one whole epoch.)

Some data structures -

  1. Node - stores role, ip address, port number
  2. Control - command_type (add_node, membership_change, etc.), destination node, barrier_group_id (1, 2, 4)
  3. Message - message to be sent
  4. Meta - metadata about the message like sender, recipient, request, timestamp
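The four structures above might be modelled as follows. This is a Python sketch with hypothetical field names. Reading the barrier group ids 1, 2, 4 as bit flags for the scheduler, the server group, and the worker group is an assumption, as is nesting Meta inside Message.

```python
from dataclasses import dataclass, field
from enum import Enum, IntFlag
from typing import List, Optional


class Role(Enum):
    SERVER = "server"
    WORKER = "worker"
    SCHEDULER = "scheduler"


class Group(IntFlag):
    # Assumed meaning of barrier_group_id values 1, 2, 4.
    SCHEDULER = 1
    SERVERS = 2
    WORKERS = 4


@dataclass
class Node:
    role: Role
    ip: str
    port: int


@dataclass
class Control:
    command_type: str                # e.g. "add_node", "membership_change"
    destination: Optional[Node] = None
    barrier_group: Group = Group.SCHEDULER


@dataclass
class Meta:
    sender: int
    recipient: int
    request: bool                    # True for a request, False for a response
    timestamp: int
    control: Optional[Control] = None


@dataclass
class Message:
    meta: Meta
    data: List[bytes] = field(default_factory=list)
```

With flags, a barrier over servers and workers together would be `Group.SERVERS | Group.WORKERS`, which has the integer value 6.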