Bluefog

Bluefog Logo

OVERVIEW

BlueFog is a high-performance distributed training framework for PyTorch built with decentralized optimization algorithms. The goal of Bluefog is to make decentralized algorithms easy to use, fault-tolerant, friendly to heterogeneous environment, and even faster than training frameworks built with parameter server, or ring-allreduce.

In each communication stage, neither the typical star-shaped parameter-server toplogy, nor the pipelined ring-allreduce topology is used, which is fundamentally different from other popular distributed training frameworks, such as DistributedDataParallel provided by PyTorch, Horovod, BytePS, etc. Instead, BlueFog will exploit a virtual and probably dynamic network topology (that can be in any shape) to achieve most communication efficiency.

Main Idea: Replace expensive allreduce averaging over gradients by cheap neighbor averaging over parameters

For each training iteration, one process (or agent) will update its model with information received from its direct neighbors defined by the virtual topology. It is observed all communications only occur over the predefied virtual topolgy and no global communication is required. This is why the algorithms is named decentralized. Decentralized training algorithms are proved in literature that it can converge to the same solution as their standard centralized counterparts.

The topology decides the communication efficiency. BlueFog supports both static topology and dynamic topology usages. After tremendous trials, the dynamic Exponential-2 graph is observed to achieve the best performance if the number of agents is the power of 2, such as 4, 32, 128 agents. In Exponential-2 graph, each agent will communicates with the neighbors that are 2 0, 2 1, …, 2 t hops away. Dynamic toplogy means all agents select one neighbor only in one iteration and select next neighbor in next iteration as illustrated in the following figure:

one-peer-exp2

In this scenario, the communcation cost for each iteration is only one unit delay, one standard parameter size to transmit and no communication conflict happens, which is better than what parameter server or ring-allreduce promises. As for loss and accuracy guarantees, please check out our theoratical paper. [Will add a full tutorial soon].

INSTALLATION