.. Bluefog documentation master file, created by
   sphinx-quickstart on Mon Dec 2 20:20:32 2019.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Bluefog
=======

.. image:: _static/bf_logo_h.png
   :width: 450
   :align: center
   :alt: Bluefog Logo

OVERVIEW
--------

BlueFog is a high-performance distributed training framework for PyTorch built with
**decentralized optimization** algorithms. The goal of BlueFog is to make decentralized
algorithms easy to use, fault-tolerant, friendly to heterogeneous environments, and even
faster than training frameworks built on a parameter server or ring-allreduce.

In each communication stage, BlueFog uses neither the typical star-shaped parameter-server
topology nor the pipelined ring-allreduce topology, which makes it fundamentally different
from other popular distributed training frameworks such as PyTorch's
DistributedDataParallel, Horovod, and BytePS. Instead, BlueFog exploits a virtual, and
possibly dynamic, network topology (which can take any shape) to achieve the best
communication efficiency.

.. Main Idea: Replace expensive allreduce averaging over gradients by cheap
   neighbor averaging over parameters

In each training iteration, every process (or agent) updates its model with information
received from its **direct** neighbors, as defined by the virtual topology. Note that all
communication occurs only over the predefined virtual topology and no global communication
is required, which is why these algorithms are called *decentralized*. Decentralized
training algorithms have been proved in the literature to converge to the same solution as
their standard centralized counterparts.

The topology determines the communication efficiency. BlueFog supports both **static** and
**dynamic** topologies. Extensive experiments show that the dynamic Exponential-2 graph
achieves the best performance when the number of agents is a power of 2, such as 4, 32, or
128 agents. In an Exponential-2 graph, each agent communicates with the neighbors that are
2 :sup:`0`, 2 :sup:`1`, ..., 2 :sup:`t` hops away. A **dynamic** topology means that each
agent selects only one neighbor in one iteration and the next neighbor in the next
iteration, as illustrated in the following figure:

.. figure placeholder: "one-peer-exp2" (illustration of the dynamic one-peer Exponential-2 topology)

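To make the main idea concrete, the sketch below replaces the global allreduce over
gradients with neighbor averaging over parameters on a static Exponential-2 graph. It is
a minimal illustration rather than a recipe from the BlueFog documentation: the topology
helper name ``topology_util.ExponentialTwoGraph`` and the plain per-step
``neighbor_allreduce`` call are assumptions made for illustration; see the API pages and
the examples_ for the interfaces BlueFog actually provides.

.. code-block:: python

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    import bluefog.torch as bf
    from bluefog.common import topology_util

    bf.init()  # start the BlueFog (MPI-based) runtime; one process per agent

    # Assumed helper: build a static Exponential-2 graph, in which agent i is
    # connected to the agents 2**0, 2**1, ... hops away (modulo the world size).
    bf.set_topology(topology_util.ExponentialTwoGraph(bf.size()))

    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # Toy local data shard; in practice each agent loads its own partition.
    dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))
    loader = DataLoader(dataset, batch_size=32)

    for x, y in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()  # local SGD step using local gradients only

        # Instead of allreduce-averaging gradients across *all* agents,
        # average the parameters with the direct neighbors in the topology.
        with torch.no_grad():
            for p in model.parameters():
                p.data = bf.neighbor_allreduce(p.data)

A dynamic one-peer variant keeps the same structure but changes which single neighbor
each agent averages with at every iteration. Such a script is typically launched with
``bfrun`` (see "Launching Application Through bfrun") so that one process is started per
agent.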
In this scenario, the communication cost of each iteration is only one unit of latency and
one standard parameter size to transmit, with no communication conflicts, which is better
than what a parameter server or ring-allreduce promises. As for the loss and accuracy
guarantees, please check out our theoretical paper. [A full tutorial will be added soon.]

.. toctree::
   :maxdepth: 1
   :caption: INSTALLATION

   Installing Bluefog

.. toctree::
   :maxdepth: 2
   :caption: API

   Bluefog Torch API
   Bluefog Topology API

.. toctree::
   :maxdepth: 1
   :caption: More Information

   Bluefog Ops Explanation
   Bluefog Performance
   Static and Dynamic Topology Neighbor Averaging
   Launching Application Through bfrun
   Bluefog Docker Usage
   Bluefog Environment Variable
   Bluefog Timeline
   Spectrum of Machine Learning Algorithm
   Codebase Structure
   Development Guide
   FAQ

.. _openmpi: https://www.open-mpi.org/
.. _examples: https://github.com/Bluefog-Lib/bluefog/tree/master/examples
.. _AWS: https://aws.amazon.com/about-aws/whats-new/2018/12/introducing-amazon-ec2-p3dn-instances-our-most-powerful-gpu-instance-yet/