Global communication (reductions)
DIY supports general reductions to implement complex global operations over blocks. These reductions are
general-purpose (any global communication pattern can be implemented); they operate over blocks, which cycle in and
out-of-core as necessary; the operations are (multi-)threaded automatically using the same mechanism as foreach()
.
All the reductions included in DIY
operate in rounds over a k-ary reduction tree. The value of k used in each round can vary, but if it's fixed, the number
of rounds is log_k(nblocks). In each round of the reduction tree, the distance between blocks (in terms of their block
global ID or gid
) either increases (so-called "distance-doubling") or decreases (so-called "distance-halving"). This
is controlled by the contiguous
argument, which defaults to true
(distance-doubling).
Only blocks decomposed by a diy::RegularDecomposer
(i.e., blocks arranged in a regular block grid) can be reduced this
way. The decomposer object is one of the arguments to the reduction.
The following patterns are currently available. With the exception of all_to_all
, the other patterns all feature a
call to diy::reduce
, passing a derived class of diy::RegularPartners
to differentiate between the different
patterns. The callback function passed to diy::reduce
is similar to the callback function used with
diy::Master:foreach
described in the Callbacks page. The callback function is called by DIY in each
round of the reduction tree. Each time, the communication proxy (ReduceProxy
) acts similarly as in foreach
: the
proxy contains the neighboring blocks that form a communicating group for the current round. Unlike foreach
, this
group changes from one round to the next in a reduction. In addition to a pointer to the block and the communication
proxy, the callback is given the partners class, which can provide additional information. The user needs to write the
body of the callback function, which typically dequeues received messages from the previous round, performs some local
computation, and enqeues messages for the next round.
Merge-reduce
A merge-reduce accumulates information in successively fewer blocks with each round. Each round, one block in a
communicating group (ReduceProxy
) takes ownership of the merged result and continues to the next round, while the other
blocks in the group idle in the following rounds. In the final round, one block, the root of the reduction tree,
has all of the data.
void foo(Block* b, // local block
const diy::ReduceProxy& rp, // communication proxy
const diy::RegularMergePartners&) // partners of the current block
{
// dequeue and merge
for (int i = 0; i < rp.in_link().size(); ++i)
{
int nbr_gid = rp.in_link().target(i).gid;
std::vector<int> in_vals;
rp.dequeue(nbr_gid, in_vals);
// merge with current block data, e.g., accumulate sum
for (size_t j = 0; j < in_vals.size(); ++j)
(b->data)[j] += in_vals[j];
}
// step 2: enqueue
for (int i = 0; i < rp.out_link().size(); ++i) // redundant since out_link size = 1 for merge-reduce
rp.enqueue(rp.out_link().target(i), b->data);
}
int main(int argc, char* argv[])
{
...
int k = 2; // the radix of the k-ary reduction tree
// partners for merge over regular block grid
diy::RegularMergePartners partners(decomposer, // diy::Decomposer object
k, // radix of k-ary reduction (default 2)
contiguous); // contiguous = true: distance doubling (default true)
// contiguous = false: distance halving
// reduction
diy::reduce(master, // diy::Master object
assigner, // diy::Assigner object
partners, // diy::RegularMergePartners object
&foo); // merge operator callback function
...
}
Broadcast
A broadcast is implemented as a merge-reduce in reverse order. Starting with one block, each round features more blocks participating in the communication. The usage is similar to merge-reduce, with the partners object now being `diy::RegularBroadcastPartners'.
void foo(Block* b, // local block
const diy::ReduceProxy& rp, // communication proxy
const diy::RegularBroadcastPartners&); // partners of the current block
int main(int argc, char* argv[])
{
...
// partners for broadcast over regular block grid
diy::RegularBroadcastPartners partners(decomposer, // diy::Decomposer object
k, // radix of k-ary reduction (default 2)
contiguous); // contiguous = true: distance doubling (default true)
// contiguous = false: distance halving
// reduction
diy::reduce(master, // diy::Master object
assigner, // diy::Assigner object
partners, // diy::RegularBroadcastPartners object
&foo); // broadcast operator callback function
...
}
Swap-reduce
Unlike a merge-reduce, the idea of a swap-reduce is to keep all blocks busy in all rounds. Typically data are
split and swapped among the neighboring blocks in each round, but the actual operations performed
are up to the user to write in the callback function. DIY's role in the swap-reduce is to provide different neighbors in
each round to the callback function, and to keep all blocks participating in all rounds. The usage is similar to
the the prior examples, with diy::RegularSwapPartners
being used for the partners object.
void foo(Block* b, // local block
const diy::ReduceProxy& rp, // communication proxy
const diy::RegularSwapPartners&); // partners of the current block
int main(int argc, char* argv[])
{
...
// partners for swap over regular block grid
diy::RegularSwapPartners partners(decomposer, // diy::Decomposer object
k, // radix of k-ary reduction (default 2)
contiguous); // contiguous = true: distance doubling (default true)
// contiguous = false: distance halving
// reduction
diy::reduce(master, // diy::Master object
assigner, // diy::Assigner object
partners, // diy::RegularSwapPartners object
&foo); // swap operator callback function
...
}
All-reduce
An all-reduce is a merge-reduce followed by a broadcast of the result from the root block to all of the
blocks, so that all blocks have the same result. The same usage follows from the previous examples, only changing the
partners object. Usage is similar to the previous examples, with the partners class now being
diy::RegularAllReducePartners
.
void foo(Block* b, // local block
const diy::ReduceProxy& rp, // communication proxy
const diy::RegularAllReducePartners&); // partners of the current block
int main(int argc, char* argv[])
{
...
// partners for all-reduce over regular block grid
diy::RegularAllReducePartners partners(decomposer, // diy::Decomposer object
k, // radix of k-ary reduction (default 2)
contiguous); // contiguous = true: distance doubling (default true)
// contiguous = false: distance halving
// reduction
diy::reduce(master, // diy::Master object
assigner, // diy::Assigner object
partners, // diy::RegularAllReducePartners object
&foo); // all-reduce operator callback function
...
}
All-to-all
The previous examples followed a common pattern. All-to-all is the exception. In an all-to-all reduction, each
block sends and receives from every other block. Even though all-to-all is implemented with a
reduction tree, it is called as if there were a single round of direct communication between all blocks.
This means there is no partners object, and the callback function is written as if there were a
single round with access to all blocks through the ReduceProxy
. However, the radix k
of the k-ary underlying
implementation is still supplied (defaults to 2).
void foo(Block* b, // local block
const diy::ReduceProxy& rp) // communication proxy
{
// outgoing
if (rp.in_link().size() == 0)
{
// enqueue data to the correct blocks
int dest_gid = ... // destination block global ID
diy::BlockID dest = rp.out_link().target(dest_gid); // need a full BlockID; out_link targets are ordered as gids
rp.enqueue(dest, value);
}
// incoming
else
{
for (int i = 0; i < rp.in_link().size(); ++i)
{
int gid = rp.in_link().target(i).gid;
diy::MemoryBuffer& incoming = rp.incoming(gid);
// copy incoming to the block
...
}
}
}
int main(int argc, char* argv[])
{
...
diy::all_to_all(master, // diy::Master object
assigner, // diy::Assigner object
&foo, // all-to-all operator callback function
k); // radix of k-ary underlying implementation (default 2)
...
}