ps-lite is a communication framework for parameter servers.
There are 3 main entities - Scheduler (always exactly 1), Servers, and Workers.
Scheduler - The node which manages the cluster. It maintains a list of the nodes in the cluster and their addresses, handshakes with every node, and assigns a rank to each one. It also sends control messages to the other nodes, monitors whether nodes are alive or dead, and notifies the cluster when to begin work.
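The registration and rank assignment described above can be sketched roughly as follows. This is a toy model, not ps-lite's code: the class SchedulerSketch, its methods, the addresses, and the rule that ranks follow registration order per role are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SchedulerSketch:
    """Toy scheduler: collects registrations, then assigns ranks per role."""
    expected_nodes: int
    registered: list = field(default_factory=list)  # (role, address) in arrival order

    def register(self, role: str, address: str) -> None:
        # A node "handshakes" by announcing its role and address.
        self.registered.append((role, address))

    def ready(self) -> bool:
        # Work may begin once every expected node has registered.
        return len(self.registered) == self.expected_nodes

    def assign_ranks(self) -> dict:
        # Assumption: ranks are handed out per role, in registration order.
        counters = {}
        ranks = {}
        for role, address in self.registered:
            rank = counters.get(role, 0)
            counters[role] = rank + 1
            ranks[address] = (role, rank)
        return ranks


scheduler = SchedulerSketch(expected_nodes=3)
scheduler.register("server", "10.0.0.2:9000")
scheduler.register("worker", "10.0.0.3:9000")
scheduler.register("worker", "10.0.0.4:9000")
ranks = scheduler.assign_ranks()
```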
Server - Stores parameters as <key, value> pairs. Each server holds a contiguous range of keys. The keys are distributed among servers to prevent server load imbalance.
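As a minimal sketch of the contiguous key ranges mentioned above, assuming the key space is simply split into near-equal slices (the function names and the key space size of 100 are hypothetical, chosen only for illustration):

```python
def split_key_space(max_key: int, num_servers: int) -> list:
    """Split [0, max_key) into num_servers contiguous, near-equal ranges."""
    return [
        (max_key * i // num_servers, max_key * (i + 1) // num_servers)
        for i in range(num_servers)
    ]


def server_for_key(key: int, ranges: list) -> int:
    """Return the index of the server whose [begin, end) range holds key."""
    for server_index, (begin, end) in enumerate(ranges):
        if begin <= key < end:
            return server_index
    raise KeyError(f"key {key} is outside the key space")


ranges = split_key_space(max_key=100, num_servers=3)
print(ranges)  # [(0, 33), (33, 66), (66, 100)]
```

Equal-width ranges only balance load if keys are spread evenly over the key space, which is an assumption of this sketch.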
Worker - Performs gradient calculations. Each worker is responsible for a certain partition of the data. Workers communicate with the servers using the push and pull routines. pull fetches the latest parameter values from the server. push sends the updated parameter values back to the server. Each worker deals with only a subset of all the parameters, but its communication will involve all the servers.
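The push and pull flow can be illustrated with an in-memory toy. This is a sketch, not ps-lite's API: ToyServer and ToyWorker are hypothetical names, there is no networking, and aggregating pushed values by addition is an assumed update rule.

```python
from collections import defaultdict


class ToyServer:
    """In-memory stand-in for one server holding keys in [begin, end)."""

    def __init__(self, begin, end):
        self.begin, self.end = begin, end
        self.store = defaultdict(float)

    def push(self, updates):
        # Assumed update rule: add each pushed value to the stored value.
        for key, value in updates.items():
            self.store[key] += value

    def pull(self, keys):
        return {key: self.store[key] for key in keys}


class ToyWorker:
    def __init__(self, servers):
        self.servers = servers

    def _slice(self, keys):
        # Group the keys by the index of the server that owns them.
        per_server = defaultdict(list)
        for key in keys:
            for index, server in enumerate(self.servers):
                if server.begin <= key < server.end:
                    per_server[index].append(key)
                    break
            else:
                raise KeyError(f"no server owns key {key}")
        return per_server

    def push(self, updates):
        for index, owned in self._slice(updates).items():
            self.servers[index].push({k: updates[k] for k in owned})

    def pull(self, keys):
        result = {}
        for index, owned in self._slice(keys).items():
            result.update(self.servers[index].pull(owned))
        return result


servers = [ToyServer(0, 50), ToyServer(50, 100)]
worker_a, worker_b = ToyWorker(servers), ToyWorker(servers)
worker_a.push({3: 0.5, 70: 1.0})   # touches both servers
worker_b.push({3: 0.25})           # touches only the first server
values = worker_b.pull([3, 70])
```

Here worker_b sees the contribution pushed by worker_a, because both workers talk to the same servers.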
Postoffice - A singleton class. Every node contains only 1 instance of this object. It is used for configuring info about the current node, such as -
- node type - worker, server, scheduler
- node id or rank of a node
New DT (dynamic training) tasks
- Adding new worker nodes and synchronizing worker nodes with the worker group.
- Updating environment variables.
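A per-node singleton like the one described above could look roughly like this. It is a sketch under assumptions: the class name PostofficeSketch and its methods are hypothetical, and the environment variable in the usage lines is only an example value.

```python
import threading


class PostofficeSketch:
    """Toy per-process singleton holding this node's configuration."""

    _instance = None
    _lock = threading.Lock()

    def __init__(self):
        self.role = None   # "worker", "server" or "scheduler"
        self.rank = None   # rank of this node within its role
        self.env = {}      # environment variables that can be updated later

    @classmethod
    def get(cls):
        # Create the instance on first use; always return the same object.
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance

    def configure(self, role, rank):
        if role not in ("worker", "server", "scheduler"):
            raise ValueError(f"unknown role: {role}")
        self.role, self.rank = role, rank

    def update_env(self, **variables):
        # Stand-in for the dynamic-training task of updating env variables.
        self.env.update(variables)


office = PostofficeSketch.get()
office.configure("worker", 0)
office.update_env(DMLC_NUM_WORKER="3")
```

Every call to PostofficeSketch.get() in the same process returns the same object, so configuration set in one place is visible everywhere on that node.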
Van - It is a member of the Postoffice singleton. It has a mapping of node_ids to addresses. It manages the communication in the cluster. It starts the 'receiving thread' on each node, which receives messages from other nodes. The Van class just provides the interface for communication between nodes; the real communication details are handled by the ZeroMQ library.
It also handles functionality such as removing node_ids from the server and worker lists during dynamic training.
Note - the logic for adding and removing a node is slightly mixed up between the Postoffice and Van classes. This can be refactored.
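To make the bookkeeping concrete, here is a minimal sketch of a node_id to address table with add and remove operations kept in one place. The class VanSketch, its method names, and the node ids and addresses below are hypothetical; no real networking is involved.

```python
class VanSketch:
    """Toy address book: node_id -> address, plus per-role id lists."""

    def __init__(self):
        self.addresses = {}    # node_id -> "host:port"
        self.server_ids = []
        self.worker_ids = []

    def add_node(self, node_id, role, address):
        self.addresses[node_id] = address
        group = self.server_ids if role == "server" else self.worker_ids
        group.append(node_id)

    def remove_node(self, node_id):
        # Drop the address and remove the id from whichever list holds it.
        self.addresses.pop(node_id, None)
        for group in (self.server_ids, self.worker_ids):
            if node_id in group:
                group.remove(node_id)


van = VanSketch()
van.add_node(8, "server", "10.0.0.2:9000")
van.add_node(9, "worker", "10.0.0.3:9000")
van.add_node(11, "worker", "10.0.0.4:9000")
van.remove_node(9)
```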
Customer - This is used for communicating with other nodes. Each connection with another node corresponds to an instance of the Customer object. The Customer also has a receiving thread. Once the 'Van' of a node receives a message on its receiving thread, it passes that message to the corresponding Customer's receiving thread.
The Customer object is responsible for keeping track of the status of the different send and receive messages.
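The hand-off from the Van to a Customer, and the tracking of outstanding requests, can be sketched with a queue and a thread. Everything here is an assumption for illustration: the class names, the dictionary message layout, and counting expected responses per timestamp. The Van's own receiving thread is replaced by direct calls to on_receive.

```python
import queue
import threading


class CustomerSketch:
    """Toy customer: own receiving thread plus per-request bookkeeping."""

    def __init__(self, customer_id):
        self.customer_id = customer_id
        self.inbox = queue.Queue()
        self.pending = {}   # timestamp -> responses still expected
        self.lock = threading.Lock()
        self.thread = threading.Thread(target=self._receive_loop, daemon=True)
        self.thread.start()

    def new_request(self, timestamp, expected_responses):
        with self.lock:
            self.pending[timestamp] = expected_responses

    def is_done(self, timestamp):
        with self.lock:
            return self.pending.get(timestamp, 0) == 0

    def _receive_loop(self):
        while True:
            message = self.inbox.get()
            if message is None:   # sentinel: stop the thread
                break
            with self.lock:
                self.pending[message["timestamp"]] -= 1

    def stop(self):
        self.inbox.put(None)
        self.thread.join()


class VanDispatchSketch:
    """Toy dispatcher: routes each received message to its customer."""

    def __init__(self):
        self.customers = {}

    def register(self, customer):
        self.customers[customer.customer_id] = customer

    def on_receive(self, message):
        self.customers[message["customer_id"]].inbox.put(message)


van = VanDispatchSketch()
customer = CustomerSketch(customer_id=0)
van.register(customer)
customer.new_request(timestamp=7, expected_responses=2)
van.on_receive({"customer_id": 0, "timestamp": 7})
van.on_receive({"customer_id": 0, "timestamp": 7})
customer.stop()   # returns after both queued messages were processed
done = customer.is_done(7)
```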
- Invokes membership change in the cluster. The function basically just calls Postoffice's updateEnvVariable function, so this can be moved to Postoffice.
- Has a function called OnSuccessUpdatingEnv, which sets up the new host that has been added to the cluster by calling launchCommandOnNewWorker, which executes a bunch of bash commands. (Note - There is a problem here: if a node is replaced, rather than just added or removed, the system takes 2 epochs to completely update the cluster. It removes the old node in one epoch and adds the new node to the cluster in the next epoch. There is no reason why the new replacement node should sit idle for one whole epoch.)
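One possible way to avoid the idle epoch noted above is to compute removals and additions together, so that a replacement is applied as a single membership change. The following is only a sketch of that idea, not the existing implementation; the function plan_membership_change and the host names are hypothetical.

```python
def plan_membership_change(current_hosts, desired_hosts):
    """Return the hosts to remove and to add in one membership update."""
    current, desired = set(current_hosts), set(desired_hosts)
    return {
        "remove": sorted(current - desired),
        "add": sorted(desired - current),
    }


# hostC is replaced by hostD; both steps appear in the same plan.
plan = plan_membership_change(
    current_hosts=["hostA", "hostB", "hostC"],
    desired_hosts=["hostA", "hostB", "hostD"],
)
```

With such a plan, the removal and the addition could be handled at the same epoch boundary instead of in two consecutive epochs.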
- Node - stores role, ip address, port number
- Control - command_type (add_node, membership_change, etc.), destination node, barrier_group_id (1, 2, 4)
- Message - message to be sent
- Meta - metadata about the message like sender, recipient, request, timestamp
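The four structures listed above could be modelled roughly as follows. This is a sketch under assumptions, not the actual definitions: the field types, the nesting of Control inside Meta and Meta inside Message, and the reading of barrier_group_id values 1, 2, 4 as bit flags for scheduler, server group and worker group are all guesses made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum, IntFlag
from typing import Optional


class Role(Enum):
    SCHEDULER = "scheduler"
    SERVER = "server"
    WORKER = "worker"


class Command(Enum):
    ADD_NODE = "add_node"
    MEMBERSHIP_CHANGE = "membership_change"


class Group(IntFlag):
    # Assumed meaning of the barrier_group_id values 1, 2, 4.
    SCHEDULER = 1
    SERVER_GROUP = 2
    WORKER_GROUP = 4


@dataclass
class Node:
    role: Role
    ip: str
    port: int


@dataclass
class Control:
    command: Command
    destination: Optional[Node] = None
    barrier_group: Optional[Group] = None


@dataclass
class Meta:
    sender: int
    recipient: int
    request: bool
    timestamp: int
    control: Optional[Control] = None   # assumed: control info rides in Meta


@dataclass
class Message:
    meta: Meta
    data: list = field(default_factory=list)   # payload, e.g. keys and values


new_worker = Node(role=Role.WORKER, ip="10.0.0.5", port=9000)
control = Control(
    command=Command.ADD_NODE,
    destination=new_worker,
    barrier_group=Group.SERVER_GROUP | Group.WORKER_GROUP,
)
message = Message(
    meta=Meta(sender=1, recipient=9, request=True, timestamp=0, control=control)
)
```

Using bit flags lets one barrier address several groups at once, for example servers and workers together.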