Monday, November 14, 2016

Twisted Torus Topologies for Enhanced Interconnection Networks

Article (PDF Available)inIEEE Transactions on Parallel and Distributed Systems 21(12):1765-1778 · December 2010with64 Reads
DOI: 10.1109/TPDS.2010.30 · Source: DBLP

Many current parallel computers are built around a torus interconnection network. Machines from Cray, HP, and IBM, among others, make use of this topology. In terms of topological advantages, square (2D) or cubic (3D) tori would be the topologies of choice. However, for different practical reasons, 2D and 3D tori with different number of nodes per dimension have been used. These mixed-radix topologies are not edge symmetric, which translates into poor performance due to an unbalanced use of network resources. In this work, we analyze twisted 2D and 3D mixed-radix tori that remove the network bottlenecks present in nontwisted ones. Such topologies recover edge symmetry, and consequently, balance the utilization of their links. The distance-related properties of twisted tori together with a full characterization of their bisection bandwidth are described in this paper. A simulation-based performance evaluation has been carried out to assess the network performance under synthetic and trace-driven workloads. The obtained results show noticeable and consistent performance gains (up to an increase of 74 percent in accepted load). In addition, we propose scalable and practicable packet routing mechanisms and wiring layouts for these interconnection systems. The complexity of the architectural proposals is similar to the one exhibited by routing and folding mechanisms in standard tori.



Full-text (PDF)

Available from: Enrique Vallejo
Twisted Torus Topologies for
Enhanced Interconnection Networks
´M. Ca
´mara, Miquel Moreto
´, Enrique Vallejo, Ramo
´n Beivide, Member,IEEE,
´Miguel-Alonso, Member,IEEE Computer Society, Carmen Martı
´nez, and Javier Navaridas
Abstract—Many current parallel computers are built around a torus interconnection network. Machines from Cray, HP, and IBM,
among others, make use of this topology. In terms of topological advantages, square (2D) or cubic (3D) tori would be the topologies of
choice. However, for different practical reasons, 2D and 3D tori with different number of nodes per dimension have been used. These
mixed-radix topologies are not edge symmetric, which translates into poor performance due to an unbalanced use of network
resources. In this work, we analyze twisted 2D and 3D mixed-radix tori that remove the network bottlenecks present in nontwisted
ones. Such topologies recover edge symmetry, and consequently, balance the utilization of their links. The distance-related properties
of twisted tori together with a full characterization of their bisection bandwidth are described in this paper. A simulation-based
performance evaluation has been carried out to assess the network performance under synthetic and trace-driven workloads. The
obtained results show noticeable and consistent performance gains (up to an increase of 74 percent in accepted load). In addition, we
propose scalable and practicable packet routing mechanisms and wiring layouts for these interconnection systems. The complexity of
the architectural proposals is similar to the one exhibited by routing and folding mechanisms in standard tori.
Index Terms—Multiprocessor interconnection, parallel architectures, routing, supercomputers.
MANY parallel computers using direct interconnection
networks have been designed and commercialized in
last decades. Meshes, tori, and hypercubes have been the most
popular topologies. Nowadays, hypercubes have declined in
favor of lower degree networks such as two-dimensional and
three-dimensional tori and meshes due to its implicit
difficulty to scale. Different machines, such as the Alpha
21364-based HP GS1280 [10] and the Cray X1E vector
computer [9], use 2D tori. Others, such as the Cray T3D and
T3E [23], which preceded the Cray XT3 [7] and XT4 [8], use 3D
tori. The IBM BlueGene family of massively parallel compu-
ters is a remarkable example of mixed radix 3D tori as the
largest configurations of the /L and /P models use 64 32
32 and 72 32 32 nodes, respectively, [1], [13].
Usually, 2D tori arrange their N nodes in a square Mesh
with ffiffiffiffi
pnodes per side. Sides are connected pairwise by
means of 2ffiffiffiffi
pwraparound links. Above, more or less, a
thousand nodes parallel computers should use 3D topologies
as suggested in [2], being a cubic 3D torus of side 3ffiffiffiffi
most desirable solution. However, the number of nodes per
dimension might be different, leading to rectangular and
prismatic topologies for two and three dimensions, respec-
tively. These topologies, denoted by mixed-radix networks
in [11], are often built for practical reasons such as packaging,
modularity, cost, and scalability. For instance, the HP GS1280
employs a 2D rectangular torus network [10] and the IBM
BlueGene a 3D prismatic one [1]. Mixed-radix tori have two
important drawbacks: first, they are not edge symmetric, and
second, the distance-related network parameters (diameter
and average distance) are far from the optimum values of
square and cubic topologies. The edge asymmetry leads to a
lack of balance in utilization of resources, and for many
traffic patterns, the load on the longer dimensions is higher
than the load on the shorter ones [5]. Hence, links in longer
dimensions become network bottlenecks. In addition, max-
imum and average packet delays are relatively long as they
depend on the poor values of diameter and average distance
exhibited by these networks.
In order to avoid or, at least, mitigate these problems, we
analyze in this work alternative mixed-radix 2D and 3D
torus topologies that twist their wraparound links of one or
two network dimensions. In this way, network symmetry is
partially or totally recovered. As we will see, the perfor-
mance measured on these twisted tori is significantly higher
than the one exhibited by their standard mixed-radix
counterparts. We concentrate in this paper on aspect ratios
2:1 and 2:1:1 although our results could be extended to
other cases. The selected ratios have been previously used
by manufacturers [6], [10], allowing the number of nodes to
be a power of 2, which is sometimes a desirable property.
Even more, these ratios represent the simplest way to
upgrade a square or cubic network, doubling their number
of nodes by only rearranging some wraparound links.
.J.M. Ca
´mara is with the Department of Electromechanical Engineering,
University of Burgos, Avda. Cantabria s/n, 09006, Burgos, Spain.
.M. Moreto
´is with the Department of Computer Architecture, Universitat
`cnica de Catalunya, Jordi Girona, 1-3, Office D6-113, Campus Nord,
08034, Barcelona, Spain. E-mail:
.E. Vallejo, R. Beivide, and C. Martı
nez are with the Departamento de
´nica y Computadores, Universidad de Cantabria, Avenida de los
Castros s/n, 39005 Santander, Spain. E-mail:,
{ramon.beivide, carmen.martinez}
.J. Miguel-Alonso and J. Navaridas are with the Department of Computer
Architecture and Technology, University of the Basque Country, P.
Manuel de Lardizabal 1, 20018 Donostia-San Sebastian, Spain.
E-mail: {j.miguel, javier.navaridas}
Manuscript received 14 July 2008; revised 30 Apr. 2009; accepted 2 Sept.
2009; published online 26 Jan. 2010.
Recommended for acceptance by A. Pietracaprina.
For information on obtaining reprints of this article, please send e-mail to:, and reference IEEECS Log Number TPDS-2008-07-0265.
Digital Object Identifier no. 10.1109/TPDS.2010.30.
1045-9219/10/$26.00 ß2010 IEEE Published by the IEEE Computer Society
The main contributions of the paper can be summarized
as follows:
1. The proposal of twisted 2D and 3D tori as alternative
topologies to mitigate the performance flaws on
existing rectangular and prismatic tori.
2. A detailed analysis of their topological properties.
3. A simple routing mechanism for the proposed
4. A performance evaluation, both theoretical and by
means of network simulations, showing that perfor-
mance increases up to 74 percent.
5. A detailed analysis of the injection rate limit under
uniform traffic based on its relation with the
network bisection bandwidth.
6. A layout for the proposed networks that keeps
link length under a bounded value, eliminating
long peripheral wires and facilitating their imple-
The rest of the paper is organized as follows: Section 2 is
about related work. Section 3 analyzes the topological
properties of 2D and 3D mixed-radix twisted tori. Section 4
describes routing. Section 5 analyzes the bisection band-
width and establishes bounds on network throughput
under random traffic. Section 6 describes the simulation
tools employed and Section 7 provides performance
metrics. Section 8 is about network folding and wiring,
and finally, Section 9 summarizes the contributions of this
research and discusses some future work.
The idea of twisting 2D square tori in one of its two
dimensions for obtaining architectural benefits is not new.
A first approach appeared in the ILLIAC IV [4]. This torus-
based parallel computer employed a twist of one unit in the
wraparound links of dimension X. Such a twist was
introduced in order to embed a Hamiltonian ring into the
topology, which facilitated some control and data move-
ment operations. Doubly twisted tori were introduced by
Sequin in [22] looking for optimal mappings of binary trees
onto processor arrays.
Other square meshes with twisted wraparound links
were proposed in [3] to reduce maximum and average
distance among their nodes. However, the inclusion of a
twist in a square torus generates little improvement in
terms of network throughput and delay, [3]. Additionally,
the twist breaks the natural full symmetry of a square
network. Alternatively, the inclusion of two twists (one per
dimension) provides better performance, but has collateral
drawbacks, such as the presence of edge-asymmetry and
the lack of optimal adaptive routing algorithms.
Not too much work has been done in order to improve
the performance of rectangular tori by twisting one of its
two dimensions. The results introduced in [3] appear to be
the first attempt in this direction. More recently, in [10], it
was shown that a 42-node HP GS1280 computer can offer
some extra performance when introducing a “shuffle” in
the wraparound links of the longest dimension. Similar
topologies were proposed in [26] as a component of
hierarchical networks.
The study of 3D prismatic networks is an important
extension of the 2D rectangular case. Nevertheless, up to
our knowledge, no significant work has been published for
optimizing network performance in prismatic 3D tori by
twisting one or two of their network dimensions.
This Section analyzes the main distance properties of
mixed-radix tori. It is organized into four sections. The first
one deals with definitions, the second considers standard
2D and 3D mixed-radix tori, and the third and fourth ones
present 2D and 3D twisted tori, respectively.
3.1 Topological Distances and Link Utilization
Rectangular (2D) and prismatic (3D) standard and twisted
tori are considered in this work. Such networks lay out their
links into orthogonal dimensions. We will consider co-
ordinates (X, Y) in the plane and (X, Y, Z) in the space, as
usual. For practical reason, we restrict our attention to 2a
aRectangular Tori (RT) and 2aaaPrismatic Tori (PT).
Standard RT and PT are shown in Figs. 1a and 1b, for a¼4.
Their twisted versions can be seen in Figs. 2, 3, and 4.
We define next the main topological network para-
meters. The diameter k is the length of the longest
minimum path. The average distance
kis the average
length of all minimum paths. The average distance of
the networks considered in this paper can be split into the
average distances on each of its dimensions, i.e.,
Fig. 1. Regular tori of two and three dimensions, with a¼4. (a) RT of
size 84. (b) PT of size 844.
for 2D topologies and
kzfor 3D topologies.
We denote
kmax as the maximum of the average distances
per dimension, i.e.,
kmax ¼maxð
Utilization, LU, is the average usage of the network links
under uniform random traffic at maximum network load;
LU = 1 means that all the network links are used every
network cycle to convey data. This can be only the case of
completely symmetric networks in which their average
distance is equally distributed among dimensions. Link
Utilization relates with average distance per dimension,
according to the following Proposition:
Proposition 1. In D-dimensional mixed-radix tori, Link
Utilization can be computed as follows:
LU ¼1
kmax ¼1
Proof. Consider first 2D standard xynetworks, with x>y.
Note that the number of links per dimension is identical
(xy in dimension X and yx in dimension Y) but dimension
X is x/y times longer than dimension Y. Hence, the
average distance of the path traversed by a random packet
in dimension X will be x/y times longer than that of
dimension Y. When dimension X uses its links at full
capacity, links of dimension Y are used, at most, at a y/x
kx) rate. Consequently, the longest dimension
constitutes a network bottleneck that limits the use of
the shortest one. The same reasoning applies to higher
dimensional standard and twisted networks. Hence, the
link utilization, LU, will be the sum of the normalized
average distances per dimension, averaged among the
network dimensions. Twisted networks obey the same
rules but in this case, average distances per dimension are
not related with the length of each dimension. tu
3.2 Mixed-Radix Standard Tori
RT networks with 2aanodes are the Cartesian graph
product of two rings of 2aand anodes and PT networks with
2aaanodes are the Cartesian graph product of a ring of
2a nodes and two more rings of anodes, [14]. By product
associativity, a 2aaaPT can also be seen as the Cartesian
product of 2aaRTs times a ring of anodes. In such graph
products, overall distances can be computed as the sum of
their distances per dimension. Therefore, to study the
distance properties of RTs and PTs, we first consider such
properties in an n-node ring. We consider nto be even; having
nodd would only slightly modify the resulting values.
Remark 2. The diameter and the average distance of a ring
of n nodes are k¼n
4, respectively.
Since a 2aaRT is the product of two rings of 2aand a
nodes, respectively, we can straightforwardly infer the
following remark (see [14, Lemma 1.37]):
Remark 3. The diameter k, average distance
k, and average
link utilization LU of an RT with 2aanodes are:
4;LU ¼3a=4
For 3D tori, the following result can be easily obtained
since a PT is the outcome of the composition of three sets of
orthogonal rings:
Remark 4. The diameter k, the average distance
k, and the
average link utilization LU of a PT with 2aaa
Fig. 2. RTT of size 84(a = 4).
Fig. 3. 3D PTT of 844nodes.
Fig. 4. 3D PDTT of 844nodes.
nodes are:
LU ¼a
3.3 Rectangular Twisted Tori
Definition 5. A2aaRTT consists of 2a2nodes arranged in a
rectangle with labels (x, y) such that 0x2a1and
0ya1. All the inner links in the rectangle form an
orthogonal 2D mesh, that is, any node (x, y) such that 0<
x<2a1and 0<y<a1is adjacent to the four nodes
ðxþ1;yÞ;ðx1;yÞ;ðx;yþ1Þ;ðx;y1Þ. The wraparound
links are defined as:
-ðx; 0Þis adjacent to ðxþa; a 1Þ8x:0xa1.
-ðx; 0Þis adjacent to ðxa; a 1Þ8
-(a; y)isadjacenttoð2a1;yaÞ8x:0x
Fig. 2 depicts an RTT for the case a¼4. Note that RTTs
only differ from RTs in the twist of acolumns on the vertical
wraparound links. We describe next the distance properties
of 2aaRTT networks by referring to its distance
distribution aðdÞ, or the number of nodes at distance d
from a given node.
Proposition 6. The distance distribution of an RTT with 2aa
nodes is:
að0Þ¼1; 8d:0<d<a aðdÞ¼4d;
This distance distribution was proved in [16]. In that paper,
a general family of networks denoted by Gaussian networks
was presented. RTTs are particular members of this family
that are optimum in terms of diameter and average
distance, [16]. In addition, the network twist provides them
with edge symmetry. Edge symmetry means that all edges
incident to a node have an identical view of the rest of the
network. Therefore, all the links entering a certain node
must belong to orthogonal rings of the same size. This
symmetry can be intuitively observed by considering the
horizontal and vertical rings that compose the RTT network
in Fig. 2. Both rings, in dimensions Xand Y, have the same
size (2anodes), whereas the rings in the original RT have
different sizes (2anodes and anodes, respectively). Then,
using Proposition 6 and the network edge symmetry
property, we can state the following result:
Corollary 7. The diameter k, the average distance
k, and the link
utilization LU of an RTT with 2aanodes are:
3;LU ¼1:ð5Þ
Therefore, a twist of amplitude ato the vertical 2alinks
of an RT leads to a significant performance improvement.
The diameter and average distance of the original RT are
50 and 12.5 percent higher, respectively. The link
utilization is 33 percent higher in the RTT. The optimum
link utilization of RTTs coming from their edge symmetry
implies that average distances among dimensions are
equalized, a=3each.
3.4 Three-Dimensional Topologies:
Prismatic Twisted and Doubly Twisted Tori
This resource imbalance exhibited by rectangular 2D net-
works also arises in noncubic, or prismatic, 3D Tori. We
employ two different approaches to build 3D networks
derived from RTTs. The simplest one with a single twist,
denoted by Prismatic Twisted Torus (PTT), can be seen as a
stack of RTTs. The second one twists its wraparound links
on both Yand Zdimensions, resulting in a Prismatic
Doubly Twisted Torus (PDTT).
Definition 8. A2aaaPTT consists of 2a3nodes labeled as
triplets (x, y, z) such that 0x2a1and 0y; z a1.
Any node (x, y, z) is adjacent to:
0;zÞ, where ðx; yÞand ðx0;y
0Þmust be adjacent in
a2aaRTT, and
-ðx; y; z 1ðmod aÞÞ.
Fig. 3 depicts a PTT of 844nodes. Note that, by
definition, the PTT is built as the Cartesian graph product of
a2aaRTT and a ring of anodes. Therefore, the
expressions for the distance-related parameters of a PTT
can be computed as stated in the following Proposition:
Proposition 9. The diameter k, the average distance
k, and the
average link utilization LU of a PTT with 2aaanodes are:
12 ;
LU ¼11a=12
12 :ð6Þ
From a topological point of view, the diameter in the
original PT is 33.3 percent higher than that of the PTT,
and the average distance is 9.1 percent higher. Further-
more, PTTs provide average distance equalization on X
and Ydimensions, leading to a significant improvement
on link utilization.
Definition 10. A2aaaPDTT consists of 2a3nodes labeled
as triplets (x, y, z) such that 0x2a1and 0y;
za1. Any node (x, y, z) is adjacent to:
1;zÞ, where ðx; yÞand ðx1;y
1Þmust be adjacent
in a 2aaRTT;
-ðx; y2;z
2Þ, where ðy; zÞand ðy2;z
2Þmust be adjacent
in a 2aaRTT; and
3Þ, where ðx; zÞand ðx3;z
3Þmust be adjacent
in a 2aaRTT.
Intuitively, this network can be interpreted as a
collection of aRTTs in the XY and XZ planes, respec-
tively. Fig. 4 shows horizontal and vertical cuts of a PDTT
of 844nodes. Note that a PDTT is built by applying
twists to Yand Zwraparound links of a PT, which
provides edge symmetry. Although diameter of PDTTs can
be inferred using combinatorial techniques, the complexity
to compute its average distance invited us to use an
empirical procedure.
Remark 11. We have used a breadth-first search (BFS)
algorithm to find the shortest path between nodes in
PDTTs. We have obtained that the diameter k, the
average distance
k, and the average link utilization LU of
a PDTT with 2aaanodes are:
8;LU ¼1:ð7Þ
Hence, the diameter in the original PT is 33.3 percent
higher than in PDTT, while the average distance is
14.28 percent higher. Finally, X,Y, and Z-dimensional
average distances are fully equalized.
A minimal routing algorithm to convey packets in rectan-
gular twisted tori is considered next. There are multiple
implementations of routing record generator units for RTT
networks. Algorithm 1 presents a function to compute a
routing header that records the shortest path between any
pair of nodes. Such an algorithm is based only on simple
operations over node coordinates: sums, subtractions, and
comparisons. In addition, all the operations can be done in
parallel, minimizing the time required for obtaining the
routing record.
Algorithm 1. Routing Record Generator for RTT
For a generic routing operation, a source node interface
will generate a routing record ðX; YÞheading the
packet. Xrepresents the number of links (hops) that
the packet must traverse along the axis of the first
coordinate, and Yalong the second coordinate’s axis.
Their signs indicate East/West and North/South directions.
Routers will process the header information in the same
way as in a standard torus, decrementing the corresponding
field header before sending the packet to the selected
neighbor. A packet with X¼0and Y¼0has reached
its destination and will be consumed. In our experiments,
we will employ an optimal fully adaptive routing mechan-
ism built on this basis, [19].
Proposition 12. The routing in Algorithm I is optimal for RTTs.
Proof. As these kinds of regular graphs can be fully
represented by plane tessellations, the correctness of this
routing mechanism can be geometrically proved, [25]. A
tile of area 2aathat tessellates the plane characterizes
the graph. Its origin node (left lower corner nodes,
highlighted in each rectangle of Fig. 5 for an RTT of 84
nodes) defines each tile. Points in the same position of
different tiles represent the same network node. As the
diameter of the network is a, any source node reaches
any other destination node with no more than ajumps,
as shown in Fig. 5 with a black border comprising the
original tile (shadowed tile in the center of the figure)
and its six immediate neighbor tiles. The shaded
diamonds in Fig. 5 show the possible destinations for
two given source nodes (0, 2) and (7, 3). tu
The next part of this geometric proof is based on
showing, in terms of a, that the eight sides of the black
border only comprise the seven tiles whose left lower
corners are the nodes 0;0Þ;ða; aÞ;ða; aÞ;ð2a; 0Þ;ð2a; 0Þ;
ða; aÞ;ða; aÞg. This implies that the minimal path within
any pair of nodes can be found in that tile set. Following this
scheme, our routing algorithm computes all the possible
paths from a source node to the destination node in each one
of the seven possible tiles. After computing these paths’
lengths in parallel, the algorithm returns the routing record
of the path with minimum length.
With respect to 3D networks, an optimal routing
algorithm for PTTs is very simple as we can independently
compute the routing for the selected ring of the Zdimension
and for the RTT in the selected XY plane. PDTT routing can
be obtained using a lightly more complex three-dimensional
combinatorial procedure. In this case, with the actual
implementation, 23 cases have to be checked in parallel.
Notwithstanding, this procedure can admit optimizations.
Fig. 5. Possible paths departing from an 84RTT. The original tile
(corresponding to nodes 1-32) is shadowed and centered.
In this Section, we analyze the performance of mixed-radix
networks with respect to their Bisection Bandwidth ðBBÞ,
which represents an upper bound in the management of
random uniform traffic. We also relate BBwith the network
topological properties.
The term BBstands for the bandwidth between two
equal partitions of a network when a minimum cut is
applied. According to [11], in networks with uniform
channel bandwidth as the ones we consider here, BBis
proportional to the channel count of the minimum network
cut. We will show that, while the standard RT and PT
behave in this way, for twisted networks, any optimal
routing leads to an effective bisection bandwidth lower than
the one computed through the minimum cut. We start with
standard RTs, which obey the classical definition of BB.
5.1 Bisection Bandwidth of Rectangular Tori
For a 2aaRT as the one depicted in Fig. 1, a minimum cut
of the network in two halves is the one that divides it into
two adjacent aasquares. In Fig. 1, this corresponds to the
cut that leaves columns 0-3 on one side and 4-7 on the other.
The cut is crossed by 2ahorizontal links (ainternal and a
wraparound links). This means that 2aphits can traverse
these links per cycle in each direction (a phit is the physical
data unit transferred by a link). Under random uniform
traffic, the N=2nodes in one-half of the network generate
packets with probability 0.5 to the other half; therefore, a
maximum of 2a
Nphits/cycle can be generated by each
node. For an RT with N¼2a2, the maximum injection rate
per node is 8a
5.2 Bisection Bandwidth of Rectangular
Twisted Tori
Now, we consider a 2aaRTT as the one in Fig. 2. The cut
separating columns 3 and 4 is again a minimal cut.
However, in this case, it comprises 4a links, twice as much
as in the RT network. If we applied the classic BB definition,
we would find a BB of 4a
Nphits/cycle, establishing an
injection limit of 8
aphits/cycle per node. This is twice the
value obtained for the RT. However, as we will see next,
this performance figure is not valid as some wires of the
network bisection are used to communicate nodes in the
same half of the network.
Fig. 6 shows a 16 8RTT (a¼8), with links omitted for
simplicity. The dashed square line comprises one-half of the
network and represents the minimal cut. A replica tile on
the top right corner is partially shown. Routing from node
A to node B (both belonging to the left partition) should
choose a path going right and up, crossing the partition
boundaries twice (from and then to the original partition).
The alternative internal path inside the partition is longer,
and consequently, the routing algorithm discards it. Hence,
traffic internal to each partition crosses the minimum cut.
We define the Effective Bisection Bandwidth ðEBBÞas the
bisection bandwidth that is really used by traffic going from
one partition to the other; EBBdepends on the bisection
bandwidth and the amount of packets internal to one
partition that cross the boundaries. To quantify this value,
we initially consider the effect of the replica network, shown
on the top right corner of Fig. 6. A message from node Ato
any of the black nodes in Fig. 6 will find that a destination
node in the replica is closer crossing the partition
boundaries than using an internal path. The number of
black nodes is determined by Ti, which is the triangular
number for i(i¼3in Fig. 6). The iþ1nodes in gray at
distance diameter k¼a, both in the original and the replica
tile, are equally distant from node A. To balance traffic, we
assume that internal and external paths are randomly
chosen. Finally, the collection of destination nodes being
closer in the replica tile than in the original one is the same
for any of the a1inodes in the same diagonal as A
(four nodes in the example Fig. 6 with i¼3).
If all nodes in the partition send a packet to nodes inside
the partition (uniform traffic), the amount of packets P1
crossing the boundaries toward the replica tile on the top
right corner is given by:
Considering the effect of four replica tiles on the four
network corners (not shown in Fig. 6), the amount of
packets Pthat, being internal to one partition, crosses the
partition boundaries is:
Thus, if every node sends a packet to any other node in
the network, there will be a4packets sent from the left
partition to the right one (which obviously cross the
boundaries), and roughly 2a4
3packets which are
internal to each partition but crossing twice the bisection
boundaries to follow minimal paths. Hence, the traffic
crossing the minimum cut is increased from a4packets to
3from which only a4are using the bisection to depart
from one partition to the other. This means that having 4a
links in the network bisection, on average, 3alinks are
used for traffic crossing from one partition to the other,
while alinks are used for traffic internal to the partition.
Fig. 6. Path between two nodes in an RTT of size 16 8. Vivid circles on
the down left side represent nodes in one-half of the network, while
dashed square represents the bisection. Pale circles on the down right
side represent nodes in the other half of the network. For the sake of
clarity, a fragment of a replica tile is showed in the upright side. Note that
the shortest path from A to B crosses the bisection twice.
Hence, for an RTT with N¼2a2, the injection rate per
node under random traffic will be limited by 3a
5.3 Injection Rate and Topological Properties
The Effective Bisection Bandwidth metric shows that, under
uniform traffic, RTTs improve the maximum injection rate
per node with respect to RTs in a 50 percent. This is due to
the improvements on two topological properties: average
distance and edge symmetry. First, the average distance in
the original RT is 3a
4, while in the RTT, it is 2a
3, correspond-
ing to a performance improvement of 9/8. Second, the link
utilization in RTs is 3/4, while in RTTs, it is 1, correspond-
enhancements, the expected improvement of 94
is obtained.
Considering the definitions of link utilization and
average distance, it is interesting to make the following
Injrate ¼
kLU ¼
Equation (10) shows that the improvement obtained by
the twist is proportional to the reduction of the longest
average distance per dimension among all dimensions,
which at the end limits performance.
5.4 Bisection Bandwidth and Injection Rate of
3D Networks
Considering 3D networks, the maximum injection rate of a
PT is 4=a phits/cycle/node. The EBBof a PTT can be
calculated considering it as a stack of RTTs connecting
nodes in Z dimension by rings. This leads to a maximum
injection rate of 6=a phits/cycle/node that means, as in the
2D case, 1.5 times the maximum PT injection rate. The same
result comes from an improvement factor of 12/11 in
average distance reduction and 11/8 in link efficiency,
whose product is again 1.5.
A calculation of the EBBfor PDTTs, as the one presented
in Section 5.2, is quite more laborious. Therefore, we will
derive the PDTT maximum injection rate from its topolo-
gical improvements over a PT, which has been shown to be
equivalent to calculate its EBB. PDTTs present an improve-
ment of 12/7 (71.43 percent improvement) over the base PT
injection rate: a factor of 8/7 due to the average distance
reduction, and a factor of 3/2 due to the perfect balance in
the use of the links. Thus, the maximum injection rate under
random traffic is 48
This section is organized into two sections. The first one
deals with the simulation platform and performance
metrics employed and the second with workloads.
6.1 Simulation Configuration and
Performance Metrics
Network evaluations are performed using INSEE, an
interconnection network simulator developed by the
authors of [21]. It can manage multiple topologies and
switching technologies. For this study, the router employed,
shown in Fig. 7, is similar to the one implemented in the
IBM BlueGene/L: virtual cut-through switching strategy,
[12], and Bubble flow-control deadlock avoidance, [19], with a
static virtual channel plus two fully adaptive virtual ones.
BlueGene family of supercomputers implements a conges-
tion-control mechanism that prioritizes in-transit traffic
against new injections, which is also implemented in our
router. In our experiments, packets have a fixed length of
16 phits of 4 bytes each.
In our experiments, performance data will be shown in
accepted load versus provided load or throughput.
Provided load is an input parameter, and accepted load is
a measured value that indicates how many phits/cycle/
node the network is able to successfully deliver. When the
network is not saturated, both values are almost identical.
However, when some of the routers (or all of them)
saturate, actual throughput may be much smaller than
applied load. In some cases, we also provide latency data in
the form of plots of average packet delay versus provided
load. We measure the number of cycles between the instant
a packet is injected (stored in the input buffer of the source
router) until it is consumed at its destination node.
In all the experiments, the internal buffer size has been
limited to a practical value. Injection queues have room for,
at most, eight packets and transit queues have room for
storing up to four packets in each virtual channel.
6.2 Workloads
First, we use independent traffic sources under random
uniform traffic (UN). In this case, packets are distributed
evenly along the whole network so that no persistent
bottlenecks are generated. Next, we use random but
nonuniform (hot region, HR) traffic to check the perfor-
mance when the network has an unbalanced usage. In HR,
25 percent of the traffic is addressed to the first 12.5 percent
of the network (in terms of Cartesian coordinates, i.e., the
lowest rows in 2D and planes in 3D) and the remaining
75 percent is uniformly spread along the whole network. In
this way, we apply more pressure on those resources used
by packets traveling toward the lower rows (planes for 3D).
We finish with permutation-based traffic based on [11] (Bit-
Complement (BC), Bit-Reversal (BR), and Perfect Shuffle
(PS)). In this case, each node sends packets to a given
Fig. 7. Model of the simulated routers, with a detailed view of the X+
input port showing the three virtual channels that share this link. The 2D
case would not include ports and links for the Z axis.
destination as defined by each of the permutations. For all
these workloads, the interinjection interval at each node is
random following a Poisson distribution chosen as to
modulate the provided load in terms of phits/cycle/node.
Other communication pattern of interest included in this
study is nearest neighbor communications.
None of the previous workloads considers causality
among messages. This assumption, although commonly
used in performance studies, is unrealistic because in most
current applications, the reception of messages triggers the
emission of new ones. Thus, we also evaluate networks with
message-passing traces of real applications. These traces
have been obtained running a selection of the NAS Parallel
Benchmarks (NPBs) MPI applications on an actual multi-
computer. Trace-based simulations help us to show that the
predicted benefits of mixed-radix twisted tori are also valid
for the traffic patterns generated by current applications.
We have restricted our study to those benchmarks of the
suite that can be mapped onto rectangular networks, and
among them, to those that make intensive use of the
interconnection network: Conjugate Gradient (CG), Integer
Sort (IS), Fourier Transform (FT), LU solver (LU), and
MultiGrid (MG). Limitations on our experimental frame-
work do not allow us to generate traces for applications
with more than 128 nodes.
For these workloads, messages are packetized and
injected into the network as fast as it can accept them but
always obeying causality relationships: if a node is waiting
for a message, the simulator stalls its traffic generator until
that message arrives [20]. Performance will be measured in
terms of the number of cycles required to fully inject and
deliver the complete workload.
This section is organized into five sections, each one dealt
with experiments driven by traffic of different nature.
7.1 Random, Uniform Traffic
Figs. 8a and 8b summarize the obtained data for a collection
of 2D (32 16 and 64 32) and 3D (32 16 16 and
64 32 32) networks. Injected and accepted loads are
identical until reaching the saturation point. Then, even
when the simulator tries to inject more traffic, the load
actually delivered stabilizes in some cases, and drops to a
lower point before stabilizing in others. The reasons for this
drop are analyzed in [18].
We expect that the maximum accepted load under
random traffic is the one determined by the network effective
bisection bandwidth computed in Section 5. Table 1 shows
simulated maximum accepted loads versus theoretical upper
throughput bounds.
Fig. 9a shows latencies for 64 32 and 32 16 networks,
and Fig. 9b does so for 64 32 32 and 32 16 16
networks. The average latency grows steadily with injected
load until reaching saturation. Beyond that point, it grows
boundless. As the average distance is lower in twisted tori,
so are their latencies when in a nonsaturated state.
Twisted networks maximize link utilization. Fig. 10 plots
the measured link utilization in a 32 16 (a) and a 32
16 16 (b) networks. For 2D networks, the X channels in the
RT are (almost) fully used while the utilization of the Y
channels does not reach 50 percent. In the RTT, all channels
are close to full utilization. This effect is even more
noticeable in 3D networks: the bottleneck at the X channels
induces low utilization levels (less than 50 percent) of
channels Y and Z. The single twist of the PTT increases link
utilization by allowing almost full use of X and Y channels,
and higher utilization of Z channels. The PDTT exhibits full
channel utilization.
7.2 Random, Hot Region Traffic
The lack of uniformity in this traffic generates an
unbalanced utilization of network resources. Figs. 11a and
Fig. 8. Throughput for (a) 2D and (b) 3D networks under random
uniform traffic.
Throughput Upper Bound versus Measured Maximum
Throughput for 2D and 3D Networks
11b show throughput curves for 2D (32 16 and 64 32)
and 3D (32 16 16 and 64 32 32) networks.
In the 2D case, the use of a twist forces the use of the
links in the Ydimension for long Xdisplacements, which
increases the use of the lowest rows (those saturated by the
traffic pattern). For this reason, the obtained results show
that twisted torus outperforms the regular torus but not in
the magnitude as when managing uniform traffic.
The introduction of higher pressure in the lowest planes
of the 3D networks allows the twisted torus to keep a
performance increase over the standard torus similar to the
obtained when managing uniform traffic provided that
dimension Zis not twisted. When Zdimension became
twisted, it is used to shorten paths in the Xdimension, and
as in the 2D case, the hot region becomes even more
saturated due to the extra traffic traveling by Z.
In short, using twists results in a reduction of average
distance, and a homogenization of resources for evenly
distributed traffic. If the traffic pattern does not make a
homogeneous use of the network, it may happen that the
topological advantages are not visible.
7.3 Permutation Patterns
Fig. 12a shows performance measurements for the following
permutation patterns: Bit-Complement, Bit-Reversal, and
Perfect Shuffle [11]. It can be observed that in all cases, the
RTT provides a significant improvement with respect to RT,
ranging from 24.3 percent (Bit-Complement) to 41.1 percent
(Bit-reversal). Fig. 12b shows throughput results on 3D
networks under the same traffic patterns. PTT always
outperforms PT, with a speedup ranging from 37.1 percent
(in Bit-Reversal) to 53.2 percent (in Perfect Shuffle). PDTT
improvements over PT are even higher, from 59.7 percent (in
Bit-Reversal) to 74.5 percent (in Perfect Shuffle).
7.4 Nearest Neighbor Traffic
Some parallel applications exhibit traffic patterns in which
nodes communicate with their nearest neighbors in a torus
topology. This can be either due to the inherent symmetry
of the application or because of mapping big data matrices
Fig. 9. Average latency under random uniform traffic for (a) 2D and
(b) 3D networks.
Fig. 10. Average link utilization under random uniform traffic for (a) 2D
and (b) 3D networks, with the network saturated.
Fig. 11. Throughput under HR traffic for (a) 2D and (b) 3D networks.
on the network nodes. Twisted tori, contrary to the
standard ones, do not provide all the orthogonal wrap-
around links needed to optimally manage nearest neighbor
toroidal communications. We introduce next, through an
example, the basis of a simple mapping scheme that
removes this apparent drawback in RTTs.
Fig. 13a shows an 88matrix of processes that require
toroidal nearest neighbor communications. Fig. 13b shows
the proper mapping of such matrix over a 48RTT. The
idea lies on mapping the logical Yrings of the application
over the twisted physical Yrings of the network. Mapping
bigger square matrices only requires a modulo mapping of
their columns over the physical twisted Yrings. When
mapping processes in this way, all local logical commu-
nications are also physically local.
Exceptionally, in some rare cases, small applications with
low parallelism may lead a standard torus to behave better
than a twisted one under nearest neighbor communication.
7.5 Application Traffic
We consider now a preliminary study of how twisted
topologies behave when executing real applications. In
most real applications, the way of mapping tasks to nodes
has a great impact on performance. The study of application
mapping is considered beyond the scope of this paper.
Nevertheless, we have tested two simple mapping experi-
ments: consecutive allocation (task igoes to node i) and
random, denoted by “_c” and “_r,” respectively. For the
latter case, graphs show the average of 20 simulation runs.
Fig. 14 shows the results obtained when processing the
traces of the selected NAS applications.
It can be seen how the RTT performs better than RT for
CG (up to 10 percent), FT (up to 18 percent), and IS (up to
20 percent), independently of the placement. Using a
consecutive placement in LU, the selection of topology is
almost irrelevant. For MG, the twisted alternative is worse
than the nontwisted one for consecutive placement but
better for random placement. Better mappings could
improve these numbers.
The performance gains exhibited by twisted tori come just
from a rearrangement of their wraparound links. Now, we
evaluate the technological consequences of the twists on the
layout of such networks. We consider first the case of
rectangular networks and then the prismatic ones.
Fig. 12. Throughput for BC, BR, and PS traffic patterns in a (a) 32 16
RT and RTT and (b) 64 32 32 PT, PTT, and PDTT.
Fig. 13. (a) 88matrix of processes that require nearest neighbor
communications and (b) mapping over a 48RTT.
Fig. 14. Total time required to fully inject and deliver the workload with
real applications on a 16 8network.
8.1 Bounded Link Length Layout for 2D RTTs
In tori, the length of the wraparound links grows with the
network size. While internal links are supposed to have
unitary length, wraparound links grow as ffiffiffiffi
pin square
tori. Consequently, real implementations could be nega-
tively affected by this unbalance. The folded torus is a good
solution to equalize all the network links by doubling its
unitary wire length [11]. The idea is based on applying
certain shuffle transformations to rows and columns that
interleave the network nodes. Two different shuffle
transformations can be considered. Given a row (or column)
of n nodes ð0;...;n1Þ, the following transformations map
every node location ðx; yÞonto a different one ðx0;yÞ
(respectively ðx; y0Þ) on the same row (respectively, col-
umn). We describe next the two used shuffle transforma-
tions when applying to rows:
x0¼2x; 8x:0xn=2;
Analogously to the torus case, we propose a new folding
for RTTs, which generates a layout in which link lengths
are equalized and bounded by ffiffi
p. The whole folding
process is detailed in Algorithm 2. The interested reader
can refer to [24] for a formal proof of this algorithm on a
similar topology. Fig. 15 shows the resulting Trellis Folded
layout of an 84RTT. As in the folded torus case, two
planes are enough to lay all the links without cutting
each other.
Algorithm 2. Trellis Folding Mapping Function
The previous result is optimal for RTTs as a maximum
link length of 2 is impossible to obtain. To see this, consider
the case of 2a2nodes arranged in the most compact layout: a
psquare. If that were possible, the physical
distance among opposite corners would be
The diameter d¼aensures that those nodes are connected
by, at most, alinks, allowing for a maximum link length of 2
for any a. Any other rectangular layout would present a
longer physical distance between two opposite corner
nodes, requiring longer links. Being ainteger, a square of
2a2nodes is impossible and so is a layout with a maximum
link length of 2. Consequently, our folding mechanism
leads to an optimal link length of ffiffi
pfor a grid layout.
This technique can be also applied in “block mode.” As
an example, consider a 32 16 RTT, where each 44
submesh is a single block. This network can be repre-
sented as the basic 84RTT in Fig. 2, but each node
becomes in a 44submesh and each link becomes in a
group of 4 parallel links joining two such submeshes. After
applying the Trellis Folding algorithm shown in Algo-
rithm 2 to the basic 84network, the 32 16 block folded
layout is obtained by simply substituting each node by the
44submesh, and each link by the four corresponding
parallel links. In general, this method can be applied to
any block size.
8.2 Layouts and Cabinet Distributions for 3D PTT
and PDTT
Networks for large parallel systems are usually distributed
among cabinets. We take, as an example, the system layout
of a typical configuration of the BlueGene/L (64 32 32
nodes), which is organized as 88cabinets of 1,024 nodes
each [6]. Dimension Z evolves inside every pair of cabinets
as every cabinet has an 8816 node configuration.
Cabinets are connected in pairs so that every pair contains
all of the nodes with different Z for the same (X, Y). Pairs of
cabinets are laid on a rectangular array, with a standard
folding applied on rows and columns.
Now we analyze PTTs. Considering that a close group of
cabinets comprises the whole Z dimension for the (x, y)
nodes within it, we can obviate such a dimension. Thus, the
folding problem is reduced to the previous bidimensional
RTT case. Using the network and cabinet sizes stated above
and 88blocks, each pair of cabinets would correspond
with a node in a regular 84RTT.
Thus, the Trellis Folding can be directly applied. Fig. 16
shows the layout. Considering the distance between
cabinets in the same row or column as the unit length, the
maximum link length between cabinets is 2ffiffi
p. Vertical
labels on cabinets mean x, y, z coordinates of each first node
in the block. Note that only links connecting different
cabinets have to be modified, thus preserving the internal
cabinet connectivity.
In the PDTT case, dimension Zcannot be comprised
inside a group of cabinets that share the same (x,y) nodes.
Instead, dimension Zis spread between two groups of
Fig. 15. Trellis folded RTT of size 84.
cabinets corresponding to different (x,y) nodes. Hence, Z
should connect two pairs of cabinets. In this case, we
cannot apply the Trellis Folding, as it places these pairs of
cabinets in opposite locations. However, another folding
technique based on [15] can be applied, which roughly
corresponds to applying a shuffle once on Yand twice on
X. The resulting layout leaves such pairs of cabinets
together, as shown in Fig. 17 (most links are omitted for
simplicity), but increases the maximum link length to 4.
Note that, as before, only links between cabinets have to
be reconnected.
Mixed-radix tori present severe communication bottlenecks
that negatively affect their performance. These bottlenecks
are caused by the asymmetry exhibited by a network that
has dimensions of different sizes. In this paper, we have
analyzed a class of twisted tori that remove these bottle-
necks by equalizing on each dimension the length of the
paths traversed by packets.
We have evaluated the performance of twisted tori both
analytically and by means of simulation. We have described
their main distance-related properties and the relationship
between the effective bisection bandwidth and the network
performance under uniform traffic. The router model
employed for simulations incorporates all the architectural
features of current packet routers and resembles the one
used in the torus network of the BlueGene/L. The networks
have been tested managing both synthetic traffic and
workloads from real application traces. In all cases, the
twisted topologies showed gains over the standard mixed-
radix tori.
The added costs of twisted networks come from both
their packet routing mechanisms and their wireability. We
have proposed scalable and practicable solutions for both
architectural issues. As a conclusion, the proposed topolo-
gies appear as an option to improve the overall perfor-
mance of mixed-radix tori by just rearranging their
wraparound links.
This work has been supported by the Spanish Ministry of
Education and Science (grants TIN2007-68023-C02-01,
TIN2007-68023-C02-02, TIN2007-60625, and AP-2005-3318,
and CONSOLIDER Project CSD2007-00050), the Basque
Government (grant IT-242-07), the European Network of
Excellence on High Performance and Embedded Archi-
tecture and Compilation (HiPEAC, contract IST-217068),
and the SARC European Project (Contract number 27648).
Javier Navaridas is supported by a doctoral grant from
the UPV/EHU.
[1] N.R. Adiga et al. “An Overview of the BlueGene/L Super-
computer,” Proc. ACM/IEEE Conf. Supercomputing (Supercomputing
’02) Technical Papers, Nov. 2002

1 comment:

Blogger said...

AvaTrade is the ultimate forex trading platform for beginning and professional traders.