# Twisted Torus Topologies for Enhanced Interconnection Networks

Abstract

Many
current parallel computers are built around a torus interconnection
network. Machines from Cray, HP, and IBM, among others, make use of this
topology. In terms of topological advantages, square (2D) or cubic (3D)
tori would be the topologies of choice. However, for different
practical reasons, 2D and 3D tori with different number of nodes per
dimension have been used. These mixed-radix topologies are not edge
symmetric, which translates into poor performance due to an unbalanced
use of network resources. In this work, we analyze twisted 2D and 3D
mixed-radix tori that remove the network bottlenecks present in
nontwisted ones. Such topologies recover edge symmetry, and
consequently, balance the utilization of their links. The
distance-related properties of twisted tori together with a full
characterization of their bisection bandwidth are described in this
paper. A simulation-based performance evaluation has been carried out to
assess the network performance under synthetic and trace-driven
workloads. The obtained results show noticeable and consistent
performance gains (up to an increase of 74 percent in accepted load). In
addition, we propose scalable and practicable packet routing mechanisms
and wiring layouts for these interconnection systems. The complexity of
the architectural proposals is similar to the one exhibited by routing
and folding mechanisms in standard tori.

### Full-text (PDF)

Available from: Enrique Vallejo
Twisted Torus Topologies for

Enhanced Interconnection Networks

Jose

´M. Ca

´mara, Miquel Moreto

´, Enrique Vallejo, Ramo

´n Beivide, Member,IEEE,

Jose

´Miguel-Alonso, Member,IEEE Computer Society, Carmen Martı

´nez, and Javier Navaridas

Abstract—Many current parallel computers are built around a torus interconnection network. Machines from Cray, HP, and IBM,

among others, make use of this topology. In terms of topological advantages, square (2D) or cubic (3D) tori would be the topologies of

choice. However, for different practical reasons, 2D and 3D tori with different number of nodes per dimension have been used. These

mixed-radix topologies are not edge symmetric, which translates into poor performance due to an unbalanced use of network

resources. In this work, we analyze twisted 2D and 3D mixed-radix tori that remove the network bottlenecks present in nontwisted

ones. Such topologies recover edge symmetry, and consequently, balance the utilization of their links. The distance-related properties

of twisted tori together with a full characterization of their bisection bandwidth are described in this paper. A simulation-based

performance evaluation has been carried out to assess the network performance under synthetic and trace-driven workloads. The

obtained results show noticeable and consistent performance gains (up to an increase of 74 percent in accepted load). In addition, we

propose scalable and practicable packet routing mechanisms and wiring layouts for these interconnection systems. The complexity of

the architectural proposals is similar to the one exhibited by routing and folding mechanisms in standard tori.

Index Terms—Multiprocessor interconnection, parallel architectures, routing, supercomputers.

Ç

1INTRODUCTION

MANY parallel computers using direct interconnection

networks have been designed and commercialized in

last decades. Meshes, tori, and hypercubes have been the most

popular topologies. Nowadays, hypercubes have declined in

favor of lower degree networks such as two-dimensional and

three-dimensional tori and meshes due to its implicit

difficulty to scale. Different machines, such as the Alpha

21364-based HP GS1280 [10] and the Cray X1E vector

computer [9], use 2D tori. Others, such as the Cray T3D and

T3E [23], which preceded the Cray XT3 [7] and XT4 [8], use 3D

tori. The IBM BlueGene family of massively parallel compu-

ters is a remarkable example of mixed radix 3D tori as the

largest configurations of the /L and /P models use 64 32

32 and 72 32 32 nodes, respectively, [1], [13].

Usually, 2D tori arrange their N nodes in a square Mesh

with ﬃﬃﬃﬃﬃ

N

pnodes per side. Sides are connected pairwise by

means of 2ﬃﬃﬃﬃﬃ

N

pwraparound links. Above, more or less, a

thousand nodes parallel computers should use 3D topologies

as suggested in [2], being a cubic 3D torus of side 3ﬃﬃﬃﬃﬃ

N

pthe

most desirable solution. However, the number of nodes per

dimension might be different, leading to rectangular and

prismatic topologies for two and three dimensions, respec-

tively. These topologies, denoted by mixed-radix networks

in [11], are often built for practical reasons such as packaging,

modularity, cost, and scalability. For instance, the HP GS1280

employs a 2D rectangular torus network [10] and the IBM

BlueGene a 3D prismatic one [1]. Mixed-radix tori have two

important drawbacks: first, they are not edge symmetric, and

second, the distance-related network parameters (diameter

and average distance) are far from the optimum values of

square and cubic topologies. The edge asymmetry leads to a

lack of balance in utilization of resources, and for many

traffic patterns, the load on the longer dimensions is higher

than the load on the shorter ones [5]. Hence, links in longer

dimensions become network bottlenecks. In addition, max-

imum and average packet delays are relatively long as they

depend on the poor values of diameter and average distance

exhibited by these networks.

In order to avoid or, at least, mitigate these problems, we

analyze in this work alternative mixed-radix 2D and 3D

torus topologies that twist their wraparound links of one or

two network dimensions. In this way, network symmetry is

partially or totally recovered. As we will see, the perfor-

mance measured on these twisted tori is significantly higher

than the one exhibited by their standard mixed-radix

counterparts. We concentrate in this paper on aspect ratios

2:1 and 2:1:1 although our results could be extended to

other cases. The selected ratios have been previously used

by manufacturers [6], [10], allowing the number of nodes to

be a power of 2, which is sometimes a desirable property.

Even more, these ratios represent the simplest way to

upgrade a square or cubic network, doubling their number

of nodes by only rearranging some wraparound links.

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 21, NO. 12, DECEMBER 2010 1765

.J.M. Ca

´mara is with the Department of Electromechanical Engineering,

University of Burgos, Avda. Cantabria s/n, 09006, Burgos, Spain.

E-mail: checam@ubu.es.

.M. Moreto

´is with the Department of Computer Architecture, Universitat

Polite

`cnica de Catalunya, Jordi Girona, 1-3, Office D6-113, Campus Nord,

08034, Barcelona, Spain. E-mail: mmoreto@ac.upc.edu.

.E. Vallejo, R. Beivide, and C. Martı

´

nez are with the Departamento de

Electro

´nica y Computadores, Universidad de Cantabria, Avenida de los

Castros s/n, 39005 Santander, Spain. E-mail: enrique@atc.unican.es,

{ramon.beivide, carmen.martinez}@unican.es.

.J. Miguel-Alonso and J. Navaridas are with the Department of Computer

Architecture and Technology, University of the Basque Country, P.

Manuel de Lardizabal 1, 20018 Donostia-San Sebastian, Spain.

E-mail: {j.miguel, javier.navaridas}@ehu.es.

Manuscript received 14 July 2008; revised 30 Apr. 2009; accepted 2 Sept.

2009; published online 26 Jan. 2010.

Recommended for acceptance by A. Pietracaprina.

For information on obtaining reprints of this article, please send e-mail to:

tpds@computer.org, and reference IEEECS Log Number TPDS-2008-07-0265.

Digital Object Identifier no. 10.1109/TPDS.2010.30.

1045-9219/10/$26.00 ß2010 IEEE Published by the IEEE Computer Society

The main contributions of the paper can be summarized

as follows:

1. The proposal of twisted 2D and 3D tori as alternative

topologies to mitigate the performance flaws on

existing rectangular and prismatic tori.

2. A detailed analysis of their topological properties.

3. A simple routing mechanism for the proposed

topologies.

4. A performance evaluation, both theoretical and by

means of network simulations, showing that perfor-

mance increases up to 74 percent.

5. A detailed analysis of the injection rate limit under

uniform traffic based on its relation with the

network bisection bandwidth.

6. A layout for the proposed networks that keeps

link length under a bounded value, eliminating

long peripheral wires and facilitating their imple-

mentations.

The rest of the paper is organized as follows: Section 2 is

about related work. Section 3 analyzes the topological

properties of 2D and 3D mixed-radix twisted tori. Section 4

describes routing. Section 5 analyzes the bisection band-

width and establishes bounds on network throughput

under random traffic. Section 6 describes the simulation

tools employed and Section 7 provides performance

metrics. Section 8 is about network folding and wiring,

and finally, Section 9 summarizes the contributions of this

research and discusses some future work.

2RELATED WORK

The idea of twisting 2D square tori in one of its two

dimensions for obtaining architectural benefits is not new.

A first approach appeared in the ILLIAC IV [4]. This torus-

based parallel computer employed a twist of one unit in the

wraparound links of dimension X. Such a twist was

introduced in order to embed a Hamiltonian ring into the

topology, which facilitated some control and data move-

ment operations. Doubly twisted tori were introduced by

Sequin in [22] looking for optimal mappings of binary trees

onto processor arrays.

Other square meshes with twisted wraparound links

were proposed in [3] to reduce maximum and average

distance among their nodes. However, the inclusion of a

twist in a square torus generates little improvement in

terms of network throughput and delay, [3]. Additionally,

the twist breaks the natural full symmetry of a square

network. Alternatively, the inclusion of two twists (one per

dimension) provides better performance, but has collateral

drawbacks, such as the presence of edge-asymmetry and

the lack of optimal adaptive routing algorithms.

Not too much work has been done in order to improve

the performance of rectangular tori by twisting one of its

two dimensions. The results introduced in [3] appear to be

the first attempt in this direction. More recently, in [10], it

was shown that a 42-node HP GS1280 computer can offer

some extra performance when introducing a “shuffle” in

the wraparound links of the longest dimension. Similar

topologies were proposed in [26] as a component of

hierarchical networks.

The study of 3D prismatic networks is an important

extension of the 2D rectangular case. Nevertheless, up to

our knowledge, no significant work has been published for

optimizing network performance in prismatic 3D tori by

twisting one or two of their network dimensions.

3TOPOLOGICAL PROPERTIES OF

MIXED-RADIX TORI

This Section analyzes the main distance properties of

mixed-radix tori. It is organized into four sections. The first

one deals with definitions, the second considers standard

2D and 3D mixed-radix tori, and the third and fourth ones

present 2D and 3D twisted tori, respectively.

3.1 Topological Distances and Link Utilization

Rectangular (2D) and prismatic (3D) standard and twisted

tori are considered in this work. Such networks lay out their

links into orthogonal dimensions. We will consider co-

ordinates (X, Y) in the plane and (X, Y, Z) in the space, as

usual. For practical reason, we restrict our attention to 2a

aRectangular Tori (RT) and 2aaaPrismatic Tori (PT).

Standard RT and PT are shown in Figs. 1a and 1b, for a¼4.

Their twisted versions can be seen in Figs. 2, 3, and 4.

We define next the main topological network para-

meters. The diameter k is the length of the longest

minimum path. The average distance

kis the average

length of all minimum paths. The average distance of

the networks considered in this paper can be split into the

average distances on each of its dimensions, i.e.,

k¼

kxþ

ky

1766 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 21, NO. 12, DECEMBER 2010

Fig. 1. Regular tori of two and three dimensions, with a¼4. (a) RT of

size 84. (b) PT of size 844.

for 2D topologies and

k¼

kxþ

kyþ

kzfor 3D topologies.

We denote

kmax as the maximum of the average distances

per dimension, i.e.,

kmax ¼maxð

kx;

ky;

kzÞ.Finally,Link

Utilization, LU, is the average usage of the network links

under uniform random traffic at maximum network load;

LU = 1 means that all the network links are used every

network cycle to convey data. This can be only the case of

completely symmetric networks in which their average

distance is equally distributed among dimensions. Link

Utilization relates with average distance per dimension,

according to the following Proposition:

Proposition 1. In D-dimensional mixed-radix tori, Link

Utilization can be computed as follows:

LU ¼1

DX

D

i¼1

ki

kmax ¼1

D

ki

kmax

:ð1Þ

Proof. Consider first 2D standard xynetworks, with x>y.

Note that the number of links per dimension is identical

(xy in dimension X and yx in dimension Y) but dimension

X is x/y times longer than dimension Y. Hence, the

average distance of the path traversed by a random packet

in dimension X will be x/y times longer than that of

dimension Y. When dimension X uses its links at full

capacity, links of dimension Y are used, at most, at a y/x

(or

ky=

kx) rate. Consequently, the longest dimension

constitutes a network bottleneck that limits the use of

the shortest one. The same reasoning applies to higher

dimensional standard and twisted networks. Hence, the

link utilization, LU, will be the sum of the normalized

average distances per dimension, averaged among the

network dimensions. Twisted networks obey the same

rules but in this case, average distances per dimension are

not related with the length of each dimension. tu

3.2 Mixed-Radix Standard Tori

RT networks with 2aanodes are the Cartesian graph

product of two rings of 2aand anodes and PT networks with

2aaanodes are the Cartesian graph product of a ring of

2a nodes and two more rings of anodes, [14]. By product

associativity, a 2aaaPT can also be seen as the Cartesian

product of 2aaRTs times a ring of anodes. In such graph

products, overall distances can be computed as the sum of

their distances per dimension. Therefore, to study the

distance properties of RTs and PTs, we first consider such

properties in an n-node ring. We consider nto be even; having

nodd would only slightly modify the resulting values.

Remark 2. The diameter and the average distance of a ring

of n nodes are k¼n

2and

k¼n

4, respectively.

Since a 2aaRT is the product of two rings of 2aand a

nodes, respectively, we can straightforwardly infer the

following remark (see [14, Lemma 1.37]):

Remark 3. The diameter k, average distance

k, and average

link utilization LU of an RT with 2aanodes are:

k¼2a

2þa

2¼3a

2;

k¼2a

4þa

4¼3a

4;LU ¼3a=4

22a=4¼3

4:

ð2Þ

For 3D tori, the following result can be easily obtained

since a PT is the outcome of the composition of three sets of

orthogonal rings:

Remark 4. The diameter k, the average distance

k, and the

average link utilization LU of a PT with 2aaa

C

AMARA ET AL.: TWISTED TORUS TOPOLOGIES FOR ENHANCED INTERCONNECTION NETWORKS 1767

Fig. 2. RTT of size 84(a = 4).

Fig. 3. 3D PTT of 844nodes.

Fig. 4. 3D PDTT of 844nodes.

nodes are:

k¼2a

2þa

2þa

2¼2a;

k¼2a

4þa

4þa

4¼a;

LU ¼a

32a=4¼2

3:ð3Þ

3.3 Rectangular Twisted Tori

Definition 5. A2aaRTT consists of 2a2nodes arranged in a

rectangle with labels (x, y) such that 0x2a1and

0ya1. All the inner links in the rectangle form an

orthogonal 2D mesh, that is, any node (x, y) such that 0<

x<2a1and 0<y<a1is adjacent to the four nodes

ðxþ1;yÞ;ðx1;yÞ;ðx;yþ1Þ;ðx;y1Þ. The wraparound

links are defined as:

-ðx; 0Þis adjacent to ðxþa; a 1Þ8x:0xa1.

-ðx; 0Þis adjacent to ðxa; a 1Þ8

x:0x

a1ax2a1.

-(0;y)isadjacenttoð2a1;yÞ8x:0xa

10ya1.

-(a; y)isadjacenttoð2a1;yaÞ8x:0x

a10ya1.

Fig. 2 depicts an RTT for the case a¼4. Note that RTTs

only differ from RTs in the twist of acolumns on the vertical

wraparound links. We describe next the distance properties

of 2aaRTT networks by referring to its distance

distribution aðdÞ, or the number of nodes at distance d

from a given node.

Proposition 6. The distance distribution of an RTT with 2aa

nodes is:

að0Þ¼1; 8d:0<d<a aðdÞ¼4d;

aðaÞ¼2a1:ð4Þ

This distance distribution was proved in [16]. In that paper,

a general family of networks denoted by Gaussian networks

was presented. RTTs are particular members of this family

that are optimum in terms of diameter and average

distance, [16]. In addition, the network twist provides them

with edge symmetry. Edge symmetry means that all edges

incident to a node have an identical view of the rest of the

network. Therefore, all the links entering a certain node

must belong to orthogonal rings of the same size. This

symmetry can be intuitively observed by considering the

horizontal and vertical rings that compose the RTT network

in Fig. 2. Both rings, in dimensions Xand Y, have the same

size (2anodes), whereas the rings in the original RT have

different sizes (2anodes and anodes, respectively). Then,

using Proposition 6 and the network edge symmetry

property, we can state the following result:

Corollary 7. The diameter k, the average distance

k, and the link

utilization LU of an RTT with 2aanodes are:

k¼a;

k¼Pa

i¼1iaðiÞ

2a2¼4a21

32a2a2a

3;LU ¼1:ð5Þ

Therefore, a twist of amplitude ato the vertical 2alinks

of an RT leads to a significant performance improvement.

The diameter and average distance of the original RT are

50 and 12.5 percent higher, respectively. The link

utilization is 33 percent higher in the RTT. The optimum

link utilization of RTTs coming from their edge symmetry

implies that average distances among dimensions are

equalized, a=3each.

3.4 Three-Dimensional Topologies:

Prismatic Twisted and Doubly Twisted Tori

This resource imbalance exhibited by rectangular 2D net-

works also arises in noncubic, or prismatic, 3D Tori. We

employ two different approaches to build 3D networks

derived from RTTs. The simplest one with a single twist,

denoted by Prismatic Twisted Torus (PTT), can be seen as a

stack of RTTs. The second one twists its wraparound links

on both Yand Zdimensions, resulting in a Prismatic

Doubly Twisted Torus (PDTT).

Definition 8. A2aaaPTT consists of 2a3nodes labeled as

triplets (x, y, z) such that 0x2a1and 0y; z a1.

Any node (x, y, z) is adjacent to:

-ðx0;y

0;zÞ, where ðx; yÞand ðx0;y

0Þmust be adjacent in

a2aaRTT, and

-ðx; y; z 1ðmod aÞÞ.

Fig. 3 depicts a PTT of 844nodes. Note that, by

definition, the PTT is built as the Cartesian graph product of

a2aaRTT and a ring of anodes. Therefore, the

expressions for the distance-related parameters of a PTT

can be computed as stated in the following Proposition:

Proposition 9. The diameter k, the average distance

k, and the

average link utilization LU of a PTT with 2aaanodes are:

k¼aþa

2¼3a

2;

k¼a

3þa

3

þa

4¼11a

12 ;

LU ¼11a=12

3a=3¼11

12 :ð6Þ

From a topological point of view, the diameter in the

original PT is 33.3 percent higher than that of the PTT,

and the average distance is 9.1 percent higher. Further-

more, PTTs provide average distance equalization on X

and Ydimensions, leading to a significant improvement

on link utilization.

Definition 10. A2aaaPDTT consists of 2a3nodes labeled

as triplets (x, y, z) such that 0x2a1and 0y;

za1. Any node (x, y, z) is adjacent to:

-ðx1;y

1;zÞ, where ðx; yÞand ðx1;y

1Þmust be adjacent

in a 2aaRTT;

-ðx; y2;z

2Þ, where ðy; zÞand ðy2;z

2Þmust be adjacent

in a 2aaRTT; and

-ðx3;y;z

3Þ, where ðx; zÞand ðx3;z

3Þmust be adjacent

in a 2aaRTT.

Intuitively, this network can be interpreted as a

collection of aRTTs in the XY and XZ planes, respec-

tively. Fig. 4 shows horizontal and vertical cuts of a PDTT

of 844nodes. Note that a PDTT is built by applying

twists to Yand Zwraparound links of a PT, which

provides edge symmetry. Although diameter of PDTTs can

be inferred using combinatorial techniques, the complexity

1768 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 21, NO. 12, DECEMBER 2010

to compute its average distance invited us to use an

empirical procedure.

Remark 11. We have used a breadth-first search (BFS)

algorithm to find the shortest path between nodes in

PDTTs. We have obtained that the diameter k, the

average distance

k, and the average link utilization LU of

a PDTT with 2aaanodes are:

k¼3a

2;

k7a

8;LU ¼1:ð7Þ

Hence, the diameter in the original PT is 33.3 percent

higher than in PDTT, while the average distance is

14.28 percent higher. Finally, X,Y, and Z-dimensional

average distances are fully equalized.

4NETWORK ROUTING

A minimal routing algorithm to convey packets in rectan-

gular twisted tori is considered next. There are multiple

implementations of routing record generator units for RTT

networks. Algorithm 1 presents a function to compute a

routing header that records the shortest path between any

pair of nodes. Such an algorithm is based only on simple

operations over node coordinates: sums, subtractions, and

comparisons. In addition, all the operations can be done in

parallel, minimizing the time required for obtaining the

routing record.

Algorithm 1. Routing Record Generator for RTT

For a generic routing operation, a source node interface

will generate a routing record ðX; YÞheading the

packet. Xrepresents the number of links (hops) that

the packet must traverse along the axis of the first

coordinate, and Yalong the second coordinate’s axis.

Their signs indicate East/West and North/South directions.

Routers will process the header information in the same

way as in a standard torus, decrementing the corresponding

field header before sending the packet to the selected

neighbor. A packet with X¼0and Y¼0has reached

its destination and will be consumed. In our experiments,

we will employ an optimal fully adaptive routing mechan-

ism built on this basis, [19].

Proposition 12. The routing in Algorithm I is optimal for RTTs.

Proof. As these kinds of regular graphs can be fully

represented by plane tessellations, the correctness of this

routing mechanism can be geometrically proved, [25]. A

tile of area 2aathat tessellates the plane characterizes

the graph. Its origin node (left lower corner nodes,

highlighted in each rectangle of Fig. 5 for an RTT of 84

nodes) defines each tile. Points in the same position of

different tiles represent the same network node. As the

diameter of the network is a, any source node reaches

any other destination node with no more than ajumps,

as shown in Fig. 5 with a black border comprising the

original tile (shadowed tile in the center of the figure)

and its six immediate neighbor tiles. The shaded

diamonds in Fig. 5 show the possible destinations for

two given source nodes (0, 2) and (7, 3). tu

The next part of this geometric proof is based on

showing, in terms of a, that the eight sides of the black

border only comprise the seven tiles whose left lower

corners are the nodes fð0;0Þ;ða; aÞ;ða; aÞ;ð2a; 0Þ;ð2a; 0Þ;

ða; aÞ;ða; aÞg. This implies that the minimal path within

any pair of nodes can be found in that tile set. Following this

scheme, our routing algorithm computes all the possible

paths from a source node to the destination node in each one

of the seven possible tiles. After computing these paths’

lengths in parallel, the algorithm returns the routing record

of the path with minimum length.

With respect to 3D networks, an optimal routing

algorithm for PTTs is very simple as we can independently

compute the routing for the selected ring of the Zdimension

and for the RTT in the selected XY plane. PDTT routing can

be obtained using a lightly more complex three-dimensional

combinatorial procedure. In this case, with the actual

implementation, 23 cases have to be checked in parallel.

Notwithstanding, this procedure can admit optimizations.

C

AMARA ET AL.: TWISTED TORUS TOPOLOGIES FOR ENHANCED INTERCONNECTION NETWORKS 1769

Fig. 5. Possible paths departing from an 84RTT. The original tile

(corresponding to nodes 1-32) is shadowed and centered.

5NETWORK THROUGHPUT UNDER

UNIFORM TRAFFIC

In this Section, we analyze the performance of mixed-radix

networks with respect to their Bisection Bandwidth ðBBÞ,

which represents an upper bound in the management of

random uniform traffic. We also relate BBwith the network

topological properties.

The term BBstands for the bandwidth between two

equal partitions of a network when a minimum cut is

applied. According to [11], in networks with uniform

channel bandwidth as the ones we consider here, BBis

proportional to the channel count of the minimum network

cut. We will show that, while the standard RT and PT

behave in this way, for twisted networks, any optimal

routing leads to an effective bisection bandwidth lower than

the one computed through the minimum cut. We start with

standard RTs, which obey the classical definition of BB.

5.1 Bisection Bandwidth of Rectangular Tori

For a 2aaRT as the one depicted in Fig. 1, a minimum cut

of the network in two halves is the one that divides it into

two adjacent aasquares. In Fig. 1, this corresponds to the

cut that leaves columns 0-3 on one side and 4-7 on the other.

The cut is crossed by 2ahorizontal links (ainternal and a

wraparound links). This means that 2aphits can traverse

these links per cycle in each direction (a phit is the physical

data unit transferred by a link). Under random uniform

traffic, the N=2nodes in one-half of the network generate

packets with probability 0.5 to the other half; therefore, a

maximum of 2a

N=4¼8a

Nphits/cycle can be generated by each

node. For an RT with N¼2a2, the maximum injection rate

per node is 8a

2a2¼4

phits/cycle.

5.2 Bisection Bandwidth of Rectangular

Twisted Tori

Now, we consider a 2aaRTT as the one in Fig. 2. The cut

separating columns 3 and 4 is again a minimal cut.

However, in this case, it comprises 4a links, twice as much

as in the RT network. If we applied the classic BB definition,

we would find a BB of 4a

N=4¼16a

Nphits/cycle, establishing an

injection limit of 8

aphits/cycle per node. This is twice the

value obtained for the RT. However, as we will see next,

this performance figure is not valid as some wires of the

network bisection are used to communicate nodes in the

same half of the network.

Fig. 6 shows a 16 8RTT (a¼8), with links omitted for

simplicity. The dashed square line comprises one-half of the

network and represents the minimal cut. A replica tile on

the top right corner is partially shown. Routing from node

A to node B (both belonging to the left partition) should

choose a path going right and up, crossing the partition

boundaries twice (from and then to the original partition).

The alternative internal path inside the partition is longer,

and consequently, the routing algorithm discards it. Hence,

traffic internal to each partition crosses the minimum cut.

We define the Effective Bisection Bandwidth ðEBBÞas the

bisection bandwidth that is really used by traffic going from

one partition to the other; EBBdepends on the bisection

bandwidth and the amount of packets internal to one

partition that cross the boundaries. To quantify this value,

we initially consider the effect of the replica network, shown

on the top right corner of Fig. 6. A message from node Ato

any of the black nodes in Fig. 6 will find that a destination

node in the replica is closer crossing the partition

boundaries than using an internal path. The number of

black nodes is determined by Ti, which is the triangular

number for i(i¼3in Fig. 6). The iþ1nodes in gray at

distance diameter k¼a, both in the original and the replica

tile, are equally distant from node A. To balance traffic, we

assume that internal and external paths are randomly

chosen. Finally, the collection of destination nodes being

closer in the replica tile than in the original one is the same

for any of the a1inodes in the same diagonal as A

(four nodes in the example Fig. 6 with i¼3).

If all nodes in the partition send a packet to nodes inside

the partition (uniform traffic), the amount of packets P1

crossing the boundaries toward the replica tile on the top

right corner is given by:

P1¼X

a2

i¼0ða1iÞTiþ1

2X

a2

i¼0ða1iÞðiþ1Þ:ð8Þ

Considering the effect of four replica tiles on the four

network corners (not shown in Fig. 6), the amount of

packets Pthat, being internal to one partition, crosses the

partition boundaries is:

P¼4P1¼a4a2

6a4

6:ð9Þ

Thus, if every node sends a packet to any other node in

the network, there will be a4packets sent from the left

partition to the right one (which obviously cross the

boundaries), and roughly 2a4

6¼a4

3packets which are

internal to each partition but crossing twice the bisection

boundaries to follow minimal paths. Hence, the traffic

crossing the minimum cut is increased from a4packets to

4a4

3from which only a4are using the bisection to depart

from one partition to the other. This means that having 4a

links in the network bisection, on average, 3alinks are

used for traffic crossing from one partition to the other,

while alinks are used for traffic internal to the partition.

1770 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 21, NO. 12, DECEMBER 2010

Fig. 6. Path between two nodes in an RTT of size 16 8. Vivid circles on

the down left side represent nodes in one-half of the network, while

dashed square represents the bisection. Pale circles on the down right

side represent nodes in the other half of the network. For the sake of

clarity, a fragment of a replica tile is showed in the upright side. Note that

the shortest path from A to B crosses the bisection twice.

Hence, for an RTT with N¼2a2, the injection rate per

node under random traffic will be limited by 3a

N=4¼12a

2a2¼6

a

phits/cycle/node.

5.3 Injection Rate and Topological Properties

The Effective Bisection Bandwidth metric shows that, under

uniform traffic, RTTs improve the maximum injection rate

per node with respect to RTs in a 50 percent. This is due to

the improvements on two topological properties: average

distance and edge symmetry. First, the average distance in

the original RT is 3a

4, while in the RTT, it is 2a

3, correspond-

ing to a performance improvement of 9/8. Second, the link

utilization in RTs is 3/4, while in RTTs, it is 1, correspond-

ingtoaperformancegainof4/3.Consideringboth

enhancements, the expected improvement of 94

83¼3

2¼1:5

is obtained.

Considering the definitions of link utilization and

average distance, it is interesting to make the following

derivation:

Injrate ¼

kLU ¼

kRT

kRTT LURT T

LURT ¼

kRT

max

kRT T

max

:ð10Þ

Equation (10) shows that the improvement obtained by

the twist is proportional to the reduction of the longest

average distance per dimension among all dimensions,

which at the end limits performance.

5.4 Bisection Bandwidth and Injection Rate of

3D Networks

Considering 3D networks, the maximum injection rate of a

PT is 4=a phits/cycle/node. The EBBof a PTT can be

calculated considering it as a stack of RTTs connecting

nodes in Z dimension by rings. This leads to a maximum

injection rate of 6=a phits/cycle/node that means, as in the

2D case, 1.5 times the maximum PT injection rate. The same

result comes from an improvement factor of 12/11 in

average distance reduction and 11/8 in link efficiency,

whose product is again 1.5.

A calculation of the EBBfor PDTTs, as the one presented

in Section 5.2, is quite more laborious. Therefore, we will

derive the PDTT maximum injection rate from its topolo-

gical improvements over a PT, which has been shown to be

equivalent to calculate its EBB. PDTTs present an improve-

ment of 12/7 (71.43 percent improvement) over the base PT

injection rate: a factor of 8/7 due to the average distance

reduction, and a factor of 3/2 due to the perfect balance in

the use of the links. Thus, the maximum injection rate under

random traffic is 48

7aphits/cycle/node.

6EXPERIMENTAL ENVIRONMENT

This section is organized into two sections. The first one

deals with the simulation platform and performance

metrics employed and the second with workloads.

6.1 Simulation Configuration and

Performance Metrics

Network evaluations are performed using INSEE, an

interconnection network simulator developed by the

authors of [21]. It can manage multiple topologies and

switching technologies. For this study, the router employed,

shown in Fig. 7, is similar to the one implemented in the

IBM BlueGene/L: virtual cut-through switching strategy,

[12], and Bubble flow-control deadlock avoidance, [19], with a

static virtual channel plus two fully adaptive virtual ones.

BlueGene family of supercomputers implements a conges-

tion-control mechanism that prioritizes in-transit traffic

against new injections, which is also implemented in our

router. In our experiments, packets have a fixed length of

16 phits of 4 bytes each.

In our experiments, performance data will be shown in

accepted load versus provided load or throughput.

Provided load is an input parameter, and accepted load is

a measured value that indicates how many phits/cycle/

node the network is able to successfully deliver. When the

network is not saturated, both values are almost identical.

However, when some of the routers (or all of them)

saturate, actual throughput may be much smaller than

applied load. In some cases, we also provide latency data in

the form of plots of average packet delay versus provided

load. We measure the number of cycles between the instant

a packet is injected (stored in the input buffer of the source

router) until it is consumed at its destination node.

In all the experiments, the internal buffer size has been

limited to a practical value. Injection queues have room for,

at most, eight packets and transit queues have room for

storing up to four packets in each virtual channel.

6.2 Workloads

First, we use independent traffic sources under random

uniform traffic (UN). In this case, packets are distributed

evenly along the whole network so that no persistent

bottlenecks are generated. Next, we use random but

nonuniform (hot region, HR) traffic to check the perfor-

mance when the network has an unbalanced usage. In HR,

25 percent of the traffic is addressed to the first 12.5 percent

of the network (in terms of Cartesian coordinates, i.e., the

lowest rows in 2D and planes in 3D) and the remaining

75 percent is uniformly spread along the whole network. In

this way, we apply more pressure on those resources used

by packets traveling toward the lower rows (planes for 3D).

We finish with permutation-based traffic based on [11] (Bit-

Complement (BC), Bit-Reversal (BR), and Perfect Shuffle

(PS)). In this case, each node sends packets to a given

C

AMARA ET AL.: TWISTED TORUS TOPOLOGIES FOR ENHANCED INTERCONNECTION NETWORKS 1771

Fig. 7. Model of the simulated routers, with a detailed view of the X+

input port showing the three virtual channels that share this link. The 2D

case would not include ports and links for the Z axis.

destination as defined by each of the permutations. For all

these workloads, the interinjection interval at each node is

random following a Poisson distribution chosen as to

modulate the provided load in terms of phits/cycle/node.

Other communication pattern of interest included in this

study is nearest neighbor communications.

None of the previous workloads considers causality

among messages. This assumption, although commonly

used in performance studies, is unrealistic because in most

current applications, the reception of messages triggers the

emission of new ones. Thus, we also evaluate networks with

message-passing traces of real applications. These traces

have been obtained running a selection of the NAS Parallel

Benchmarks (NPBs) MPI applications on an actual multi-

computer. Trace-based simulations help us to show that the

predicted benefits of mixed-radix twisted tori are also valid

for the traffic patterns generated by current applications.

We have restricted our study to those benchmarks of the

suite that can be mapped onto rectangular networks, and

among them, to those that make intensive use of the

interconnection network: Conjugate Gradient (CG), Integer

Sort (IS), Fourier Transform (FT), LU solver (LU), and

MultiGrid (MG). Limitations on our experimental frame-

work do not allow us to generate traces for applications

with more than 128 nodes.

For these workloads, messages are packetized and

injected into the network as fast as it can accept them but

always obeying causality relationships: if a node is waiting

for a message, the simulator stalls its traffic generator until

that message arrives [20]. Performance will be measured in

terms of the number of cycles required to fully inject and

deliver the complete workload.

7PERFORMANCE EVALUATION

This section is organized into five sections, each one dealt

with experiments driven by traffic of different nature.

7.1 Random, Uniform Traffic

Figs. 8a and 8b summarize the obtained data for a collection

of 2D (32 16 and 64 32) and 3D (32 16 16 and

64 32 32) networks. Injected and accepted loads are

identical until reaching the saturation point. Then, even

when the simulator tries to inject more traffic, the load

actually delivered stabilizes in some cases, and drops to a

lower point before stabilizing in others. The reasons for this

drop are analyzed in [18].

We expect that the maximum accepted load under

random traffic is the one determined by the network effective

bisection bandwidth computed in Section 5. Table 1 shows

simulated maximum accepted loads versus theoretical upper

throughput bounds.

Fig. 9a shows latencies for 64 32 and 32 16 networks,

and Fig. 9b does so for 64 32 32 and 32 16 16

networks. The average latency grows steadily with injected

load until reaching saturation. Beyond that point, it grows

boundless. As the average distance is lower in twisted tori,

so are their latencies when in a nonsaturated state.

Twisted networks maximize link utilization. Fig. 10 plots

the measured link utilization in a 32 16 (a) and a 32

16 16 (b) networks. For 2D networks, the X channels in the

RT are (almost) fully used while the utilization of the Y

channels does not reach 50 percent. In the RTT, all channels

are close to full utilization. This effect is even more

noticeable in 3D networks: the bottleneck at the X channels

induces low utilization levels (less than 50 percent) of

channels Y and Z. The single twist of the PTT increases link

utilization by allowing almost full use of X and Y channels,

and higher utilization of Z channels. The PDTT exhibits full

channel utilization.

7.2 Random, Hot Region Traffic

The lack of uniformity in this traffic generates an

unbalanced utilization of network resources. Figs. 11a and

1772 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 21, NO. 12, DECEMBER 2010

Fig. 8. Throughput for (a) 2D and (b) 3D networks under random

uniform traffic.

TABLE 1

Throughput Upper Bound versus Measured Maximum

Throughput for 2D and 3D Networks

11b show throughput curves for 2D (32 16 and 64 32)

and 3D (32 16 16 and 64 32 32) networks.

In the 2D case, the use of a twist forces the use of the

links in the Ydimension for long Xdisplacements, which

increases the use of the lowest rows (those saturated by the

traffic pattern). For this reason, the obtained results show

that twisted torus outperforms the regular torus but not in

the magnitude as when managing uniform traffic.

The introduction of higher pressure in the lowest planes

of the 3D networks allows the twisted torus to keep a

performance increase over the standard torus similar to the

obtained when managing uniform traffic provided that

dimension Zis not twisted. When Zdimension became

twisted, it is used to shorten paths in the Xdimension, and

as in the 2D case, the hot region becomes even more

saturated due to the extra traffic traveling by Z.

In short, using twists results in a reduction of average

distance, and a homogenization of resources for evenly

distributed traffic. If the traffic pattern does not make a

homogeneous use of the network, it may happen that the

topological advantages are not visible.

7.3 Permutation Patterns

Fig. 12a shows performance measurements for the following

permutation patterns: Bit-Complement, Bit-Reversal, and

Perfect Shuffle [11]. It can be observed that in all cases, the

RTT provides a significant improvement with respect to RT,

ranging from 24.3 percent (Bit-Complement) to 41.1 percent

(Bit-reversal). Fig. 12b shows throughput results on 3D

networks under the same traffic patterns. PTT always

outperforms PT, with a speedup ranging from 37.1 percent

(in Bit-Reversal) to 53.2 percent (in Perfect Shuffle). PDTT

improvements over PT are even higher, from 59.7 percent (in

Bit-Reversal) to 74.5 percent (in Perfect Shuffle).

7.4 Nearest Neighbor Traffic

Some parallel applications exhibit traffic patterns in which

nodes communicate with their nearest neighbors in a torus

topology. This can be either due to the inherent symmetry

of the application or because of mapping big data matrices

C

AMARA ET AL.: TWISTED TORUS TOPOLOGIES FOR ENHANCED INTERCONNECTION NETWORKS 1773

Fig. 9. Average latency under random uniform traffic for (a) 2D and

(b) 3D networks.

Fig. 10. Average link utilization under random uniform traffic for (a) 2D

and (b) 3D networks, with the network saturated.

Fig. 11. Throughput under HR traffic for (a) 2D and (b) 3D networks.

on the network nodes. Twisted tori, contrary to the

standard ones, do not provide all the orthogonal wrap-

around links needed to optimally manage nearest neighbor

toroidal communications. We introduce next, through an

example, the basis of a simple mapping scheme that

removes this apparent drawback in RTTs.

Fig. 13a shows an 88matrix of processes that require

toroidal nearest neighbor communications. Fig. 13b shows

the proper mapping of such matrix over a 48RTT. The

idea lies on mapping the logical Yrings of the application

over the twisted physical Yrings of the network. Mapping

bigger square matrices only requires a modulo mapping of

their columns over the physical twisted Yrings. When

mapping processes in this way, all local logical commu-

nications are also physically local.

Exceptionally, in some rare cases, small applications with

low parallelism may lead a standard torus to behave better

than a twisted one under nearest neighbor communication.

7.5 Application Traffic

We consider now a preliminary study of how twisted

topologies behave when executing real applications. In

most real applications, the way of mapping tasks to nodes

has a great impact on performance. The study of application

mapping is considered beyond the scope of this paper.

Nevertheless, we have tested two simple mapping experi-

ments: consecutive allocation (task igoes to node i) and

random, denoted by “_c” and “_r,” respectively. For the

latter case, graphs show the average of 20 simulation runs.

Fig. 14 shows the results obtained when processing the

traces of the selected NAS applications.

It can be seen how the RTT performs better than RT for

CG (up to 10 percent), FT (up to 18 percent), and IS (up to

20 percent), independently of the placement. Using a

consecutive placement in LU, the selection of topology is

almost irrelevant. For MG, the twisted alternative is worse

than the nontwisted one for consecutive placement but

better for random placement. Better mappings could

improve these numbers.

8WIRING MIXED-RADIX TWISTED TORI

The performance gains exhibited by twisted tori come just

from a rearrangement of their wraparound links. Now, we

evaluate the technological consequences of the twists on the

layout of such networks. We consider first the case of

rectangular networks and then the prismatic ones.

1774 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 21, NO. 12, DECEMBER 2010

Fig. 12. Throughput for BC, BR, and PS traffic patterns in a (a) 32 16

RT and RTT and (b) 64 32 32 PT, PTT, and PDTT.

Fig. 13. (a) 88matrix of processes that require nearest neighbor

communications and (b) mapping over a 48RTT.

Fig. 14. Total time required to fully inject and deliver the workload with

real applications on a 16 8network.

8.1 Bounded Link Length Layout for 2D RTTs

In tori, the length of the wraparound links grows with the

network size. While internal links are supposed to have

unitary length, wraparound links grow as ﬃﬃﬃﬃﬃ

N

pin square

tori. Consequently, real implementations could be nega-

tively affected by this unbalance. The folded torus is a good

solution to equalize all the network links by doubling its

unitary wire length [11]. The idea is based on applying

certain shuffle transformations to rows and columns that

interleave the network nodes. Two different shuffle

transformations can be considered. Given a row (or column)

of n nodes ð0;...;n1Þ, the following transformations map

every node location ðx; yÞonto a different one ðx0;yÞ

(respectively ðx; y0Þ) on the same row (respectively, col-

umn). We describe next the two used shuffle transforma-

tions when applying to rows:

SHUFFLE A:

x0¼2x; 8x:0xn=2;

x0¼2n2x1;8x:n=2xn1;

SHUFFLE B:

x0¼2xþ1;8x:0xn=2;

x0¼2n2x2;8x:n=2xn1:

Analogously to the torus case, we propose a new folding

for RTTs, which generates a layout in which link lengths

are equalized and bounded by ﬃﬃﬃ

5

p. The whole folding

process is detailed in Algorithm 2. The interested reader

can refer to [24] for a formal proof of this algorithm on a

similar topology. Fig. 15 shows the resulting Trellis Folded

layout of an 84RTT. As in the folded torus case, two

planes are enough to lay all the links without cutting

each other.

Algorithm 2. Trellis Folding Mapping Function

The previous result is optimal for RTTs as a maximum

link length of 2 is impossible to obtain. To see this, consider

the case of 2a2nodes arranged in the most compact layout: a

aﬃﬃﬃ

2

paﬃﬃﬃ

2

psquare. If that were possible, the physical

distance among opposite corners would be

ﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ

ðaﬃﬃﬃ

2

p1Þ2þðaﬃﬃﬃ

2

p1Þ2

q¼2aﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ

1ﬃﬃﬃ

2

p

aþ2

a2

s2a:

The diameter d¼aensures that those nodes are connected

by, at most, alinks, allowing for a maximum link length of 2

for any a. Any other rectangular layout would present a

longer physical distance between two opposite corner

nodes, requiring longer links. Being ainteger, a square of

2a2nodes is impossible and so is a layout with a maximum

link length of 2. Consequently, our folding mechanism

leads to an optimal link length of ﬃﬃﬃ

5

pfor a grid layout.

This technique can be also applied in “block mode.” As

an example, consider a 32 16 RTT, where each 44

submesh is a single block. This network can be repre-

sented as the basic 84RTT in Fig. 2, but each node

becomes in a 44submesh and each link becomes in a

group of 4 parallel links joining two such submeshes. After

applying the Trellis Folding algorithm shown in Algo-

rithm 2 to the basic 84network, the 32 16 block folded

layout is obtained by simply substituting each node by the

44submesh, and each link by the four corresponding

parallel links. In general, this method can be applied to

any block size.

8.2 Layouts and Cabinet Distributions for 3D PTT

and PDTT

Networks for large parallel systems are usually distributed

among cabinets. We take, as an example, the system layout

of a typical configuration of the BlueGene/L (64 32 32

nodes), which is organized as 88cabinets of 1,024 nodes

each [6]. Dimension Z evolves inside every pair of cabinets

as every cabinet has an 8816 node configuration.

Cabinets are connected in pairs so that every pair contains

all of the nodes with different Z for the same (X, Y). Pairs of

cabinets are laid on a rectangular array, with a standard

folding applied on rows and columns.

Now we analyze PTTs. Considering that a close group of

cabinets comprises the whole Z dimension for the (x, y)

nodes within it, we can obviate such a dimension. Thus, the

folding problem is reduced to the previous bidimensional

RTT case. Using the network and cabinet sizes stated above

and 88blocks, each pair of cabinets would correspond

with a node in a regular 84RTT.

Thus, the Trellis Folding can be directly applied. Fig. 16

shows the layout. Considering the distance between

cabinets in the same row or column as the unit length, the

maximum link length between cabinets is 2ﬃﬃﬃ

2

p. Vertical

labels on cabinets mean x, y, z coordinates of each first node

in the block. Note that only links connecting different

cabinets have to be modified, thus preserving the internal

cabinet connectivity.

In the PDTT case, dimension Zcannot be comprised

inside a group of cabinets that share the same (x,y) nodes.

Instead, dimension Zis spread between two groups of

C

AMARA ET AL.: TWISTED TORUS TOPOLOGIES FOR ENHANCED INTERCONNECTION NETWORKS 1775

Fig. 15. Trellis folded RTT of size 84.

cabinets corresponding to different (x,y) nodes. Hence, Z

should connect two pairs of cabinets. In this case, we

cannot apply the Trellis Folding, as it places these pairs of

cabinets in opposite locations. However, another folding

technique based on [15] can be applied, which roughly

corresponds to applying a shuffle once on Yand twice on

X. The resulting layout leaves such pairs of cabinets

together, as shown in Fig. 17 (most links are omitted for

simplicity), but increases the maximum link length to 4.

Note that, as before, only links between cabinets have to

be reconnected.

9CONCLUSIONS

Mixed-radix tori present severe communication bottlenecks

that negatively affect their performance. These bottlenecks

are caused by the asymmetry exhibited by a network that

has dimensions of different sizes. In this paper, we have

analyzed a class of twisted tori that remove these bottle-

necks by equalizing on each dimension the length of the

paths traversed by packets.

We have evaluated the performance of twisted tori both

analytically and by means of simulation. We have described

their main distance-related properties and the relationship

between the effective bisection bandwidth and the network

performance under uniform traffic. The router model

employed for simulations incorporates all the architectural

features of current packet routers and resembles the one

used in the torus network of the BlueGene/L. The networks

have been tested managing both synthetic traffic and

workloads from real application traces. In all cases, the

twisted topologies showed gains over the standard mixed-

radix tori.

The added costs of twisted networks come from both

their packet routing mechanisms and their wireability. We

have proposed scalable and practicable solutions for both

architectural issues. As a conclusion, the proposed topolo-

gies appear as an option to improve the overall perfor-

mance of mixed-radix tori by just rearranging their

wraparound links.

ACKNOWLEDGMENTS

This work has been supported by the Spanish Ministry of

Education and Science (grants TIN2007-68023-C02-01,

TIN2007-68023-C02-02, TIN2007-60625, and AP-2005-3318,

and CONSOLIDER Project CSD2007-00050), the Basque

Government (grant IT-242-07), the European Network of

Excellence on High Performance and Embedded Archi-

tecture and Compilation (HiPEAC, contract IST-217068),

and the SARC European Project (Contract number 27648).

Javier Navaridas is supported by a doctoral grant from

the UPV/EHU.

REFERENCES

[1] N.R. Adiga et al. “An Overview of the BlueGene/L Super-

computer,” Proc. ACM/IEEE Conf. Supercomputing (Supercomputing

’02) Technical Papers, Nov. 2002
## 1 comment:

AvaTradeis the ultimate forex trading platform for beginning and professional traders.Post a Comment