.. _Ops Explanation: Bluefog Operations Explanation ============================== The implementation of bluefog operations is built upon the MPI APIs. The naming of communication operations are all derived from MPI. However, our usage and definition are slightly different from the MPI since the focus of bluefog ops are highly associated with the virtual topology of network. The communication ops that bluefog supported can be catogorized into three types: 1. Collective Ops: ``broadcast``, ``allreduce``, ``allgather``. 2. Neighbor Collective Ops: ``neighbor_allreduce``, ``neighbor_allgather``. 3. Hierarchical Collective Ops: ``hierarchical_local_allreduce``, ``hierarchical_neighbor_allreduce``. 4. One-sided Communication Ops: ``win_create``, ``win_free``, ``win_put``, ``win_get``, ``win_accumulate``, ``win_update``, ``win_update_then_collect``. We use figure to illustrate all those ops with similar style as in `MPI tutorials blog`_. In the figure, we use *circle* to represent one process, which is exchangeablely called node, agent, host, etc. under some circumstance, and use green and red *square* to represent the data or tensor while orange *square* for one machine. The number inside of circle is the rank of that process and th number inside of square is the value of data. If you need more background information about MPI, we recommend this nice `tutorial`_. Collective Ops -------------- These three ``broadcast``, ``allreduce``, ``allgather`` ops are most basic collective MPI ops. The bluefog implementation is almost exactly the same as the MPI definition. One small difference is allreduce only support average and summation since bluefog focused on the numerical calculation only. allgather ######### .. image:: _static/bf_allgather.png :alt: BluefogAllgatherExplanation :width: 350 allreduce ######### .. image:: _static/bf_allreduce.png :alt: BluefogAllreduceExplanation :width: 300 broadcast ######### .. image:: _static/bf_broadcast.png :alt: BluefogBroadcastExplanation :width: 450 Neighbor Colletive Ops ---------------------- Similar to their collective ops cousins, the behavior of neighbor collective ops is very similar, except that their behavior is determined by the virtual topology as well. Loosenly speaking, allreduce and allgather are the same as running the neighbor_allreduce and neighbor_allgather over fully connected network. In the figure, we use the arrowed line to represent the connection of virtual topology (notice it is the directed graph.) neighbor_allgather ################## .. image:: _static/bf_neighbor_allgather.png :alt: BluefogNeighborAllgatherExplanation :width: 600 neighbor_allreduce ################## .. image:: _static/bf_neighbor_allreduce.png :alt: BluefogNeighborAllreduceExplanation :width: 600 .. Note:: In the figure, we only show the neighbor_allreduce with average with uniform weight. Actually, our API allows for any weights for incoming edges. Check out API doc to see how to use it. Hierarchical Collective Ops --------------------------- In practice, the communication speed and behavior is different between intra-machine and inter-machine communcation. Hence, we also provided two hierarchical collective ops. The basic unit in this case is each (physical) machine. Hence, unlike previous neighbor collective ops, of which the topology is defined over the connection between ranks/processes, the topology of hierarchical collective ops is defined over the connection between machines. hierarchical_local_allreduce ############################ .. image:: _static/bf_hier_local_allreduce.png :alt: BluefogHierarchicalLocalAllreduceExplanation :width: 700 Because it is (machine) local operation, the topology definition will not impact this operation. Hence, the rank 0 and rank 1 simply applied the local allreduce average to get (8+4)/2 = 6. Other ranks is like-wise. hierarchical_neighbor_allreduce ############################### .. image:: _static/bf_hier_neighbor_allreduce.png :alt: BluefogHierarchicalNeighborAllreduceExplanation :width: 700 Similar to the *hierarchical_local_allreduce* operation, it first applied the local allreduce average within the machine. So that in the view of external machines, all processes within same machine forms a super node. Then, the super node exchange the information with their neighbor machines like *neighbor_allreduce*. For example, machine 0, 2, and 3 first formed a local average value 6, 3, and 3 respectively. Then, a machine-wise neighbor allreduce produce (6+3+3)/3 = 4. In order to minimize the cross machines communcation, the real implementation is four steps actually: 1. Local Average. 2. All local rank 0 processes do the *neihbor_allreduce*. 3. local rank 0 processes broadcast the received tensors to other local ranks. 4. Compute the average of received neighbor tensors within the process. .. warning:: hierarchical_neighbor_allreduce should be used under the homogeneous environment only, i.e., each machine owns same number of the local processes. One-sided Communication Ops --------------------------- One-sided communication ops is introduced after MPI-2. The most notable feature of one-sided communication is indicated by the name that allows the communication ops of one process to be decoupled from the behavior of another process. Bluefog heavily relies on this feature to build the asynchronous algorithm. Except the win_create and win_free are the collective ops, all rest ops only need to be called by one process. `Here`_ is a nice introduction for the MPI one-sided communication ops. As mentioned before, please note the usage and definition of Bluefog is slightly different from MPI standard. win_create ########## Win create is always the first step to use the one-sided communication. After this call, each process will allocate the number of incoming neighbor's windows as buffer, which is illustrated in the figure as red square. Each buffer is dedicated to one neighbor. You don't need to know which one is dedicated to which neighbor because these buffers are invisible to the python frontend. The only way to interact with them is through the win_update. .. image:: _static/bf_win_create.png :alt: BluefogWinCreateExplanation :width: 650 win_free ######## .. image:: _static/bf_win_free.png :alt: BluefogWinFreeExplanation :width: 650 .. Note:: In the following figures, we only show the behavior of win_put/get/accumulate/sync to all neighbors with no weights. Actually, you are allowed to customize which neighbor to send/receive and assign any weight on tensor. Please check our API doc to see how to use it. win_put ####### Win_put is one of three main methods to exchange the information between the processes in window. By default, it will *put* its own tensor value into all *outgoing* neighbor's buffer. Note it doesn't need the receiver to do anything. .. image:: _static/bf_win_put.png :alt: BluefogWinPutExplanation :width: 650 win_get ####### Win_get is one of three main methods to exchange the information between the processes in window. By default, it will *get* (fetch) the *incoming* neighbor's local value into the its own buffer. Note it doesn't need the sender to do anything. .. image:: _static/bf_win_get.png :alt: BluefogWinGetExplanation :width: 650 win_accumulate ############## Win_accumulate is one of three main methods to exchange the information between the processes in window. By default, it will *accumulate* its own tensor value into all *outgoing* neighbor's buffer, i.e. sum up. Note it doesn't need the receiver to do anything. .. image:: _static/bf_win_accum.png :alt: BluefogWinAccumExplanation :width: 650 win_update ########## win_update is the bridge to connect the value of buffers (corresponding to the neighbor value) with the local value. It has two functionalities. One is to update the buffer to make sure that the neighbor value, which may be changed through win_put, win_get, and/or win_accumulate, is synchronized and visible to local memory. Another is it updates the local value to the average of self and neighbor's value. .. image:: _static/bf_win_update.png :alt: BluefogWinSyncExplanation :width: 650 win_update_then_collect ####################### .. image:: _static/bf_win_update_collect.png :alt: BluefogWinSyncThenCollectExplanation :width: 675 .. _MPI tutorials blog: https://mpitutorial.com/tutorials/ .. _tutorial: https://computing.llnl.gov/tutorials/mpi/ .. _Here: https://pages.tacc.utexas.edu/~eijkhout/pcse/html/mpi-onesided.html