# Efficient Methods for Multi-Dimensional Array Redistribution 

CHING-HSIEN HSU, YEH-CHING CHUNG, AND CHYI-REN DOW<br>chhsu@fcu.edu.tw ychung@fcu.edu.tw crdow@fcu.edu.tw

Department of Information Engineering, Feng Chia University, Taichung, Taiwan 407, ROC

Final version accepted January 8, 1999


#### Abstract

In many scientific applications, array redistribution is usually required to enhance data locality and reduce remote memory access on distributed memory multicomputers. Since the redistribution is performed at run-time, there is a performance tradeoff between the efficiency of the new data decomposition for a subsequent phase of an algorithm and the cost of redistributing data among processors. In this paper, we present efficient methods for multi-dimensional array redistribution. Based on the previous work, the basic-cycle calculation technique, we present a basic-block calculation (BBC) and a complete-dimension calculation ( $C D C$ ) techniques. We also developed a theoretical model to analyze the computation costs of these two techniques. The theoretical model shows that the $B B C$ method has smaller indexing costs and performs well for the redistribution with small array size. The $C D C$ method has smaller packing/unpacking costs and performs well when array size is large. When implemented these two techniques on an IBM SP2 parallel machine along with the PITFALLS method and the Prylli's method, the experimental results show that the $B B C$ method has the smallest execution time of these four algorithms when the array size is small. The $C D C$ method has the smallest execution time of these four algorithms when the array size is large.


Keywords: array redistribution, distributed memory multicomputers, the basic-block calculation technique, the complete-dimension calculation technique

## 1. Introduction

The data parallel programming model has become a widely accepted paradigm for programming distributed memory multicomputers. To efficiently execute a data parallel program on a distributed memory multicomputer, an appropriate data decomposition is critical. The data decomposition involves data distribution and data alignment. The data distribution deals with how data arrays should be distributed. The data alignment deals with how data arrays should be aligned with respect to one another. The purpose of data decomposition is to balance the computational load and minimize the communication overheads.

Many data parallel programming languages such as High Performance Fortran (HPF) [9], Fortran D [6], Vienna Fortran [33], and High Performance C (HPC) [28] provide compiler directives for programmers to specify array distribution. The array distribution provided by those languages, in general, can be classified into two categories, regular and irregular. The regular array distribution, in general, has three types, BLOCK, CYCLIC, and BLOCK-CYCLIC(c). The BLOCK-CYCLIC(c) is the
most general regular array distribution among them. Dongarra et al. [5] have shown that these distribution are essential for many dense matrix algorithms design in distributed memory machines. The irregular array distribution uses user-defined array distribution functions to specify array distribution.

In some algorithms, such as multi-dimensional fast Fourier transform, the Alternative Direction Implicit (ADI) method for solving two-dimensional diffusion equations, and linear algebra solvers, an array distribution that is well-suited for one phase may not be good for a subsequent phase in terms of performance. Array redistribution is required for those algorithms during run-time. Therefore, many data parallel programming languages support run-time primitives for array redistribution. Since array redistribution is performed at run-time, there is a performance tradeoff between the efficiency of new data decomposition for a subsequent phase of an algorithm and the cost of redistributing array among processors. Thus efficient methods for performing array redistribution are of great importance.

In this paper, based on the basic-cycle calculation technique [4], we present a basic-block calculation (BBC) and a complete-dimension calculation (CDC) technique for multi-dimensional array redistribution. The main idea of the basic-block calculation technique is first to use the basic-cycle calculation technique to determine source/destination processors of some specific array elements in a basic-block. From the source/destination processor/data sets of a basic-block, we can efficiently perform a redistribution. The complete-dimension calculation technique also uses the basic-cycle calculation technique to generate the communication sets of a redistribution. However, it generates the communication sets for array elements in the first row of each dimension of a local array. This will result in a high indexing overheads. But the packing/unpacking overheads can be greatly reduced. In this paper, we also developed a theoretical model to analyze the tradeoff between these two techniques. The two techniques can be easily implemented in a parallelizing compiler, run-time systems, or parallel programs.

This paper is organized as follows. In Section 2, a brief survey of related work will be presented. In Section 3, we will introduce notations and terminology used in this paper. Section 4 presents the basic-block calculation and the complete-dimension calculation techniques for multi-dimensional array redistribution. The theoretical model to analyze the performance tradeoff of these two methods will also be presented in this section. In Section 5, the experimental results of the basic-block calculation technique, the complete-dimension calculation technique, the PITFALLS method, and the Prylli's method will be given.

## 2. Related work

Many methods for performing array redistribution have been presented in the literature. Since techniques of redistribution can be performed either by using the multicomputer compiler technique [27] or using the runtime support technique, we briefly describe the related research in these two approaches.

Gupta et al. [7] derived closed form expressions to efficiently determine the send/receive processor/data sets. They also provided a virtual processor approach [8]
for addressing the problem of reference index-set identification for array statements with BLOCK-CYCLIC ( $c$ ) distribution and formulated active processor sets as closed forms. A recent work in [16] extended the virtual processor approach to address the problem of memory allocation and index-sets identification. By using their method, closed form expressions for index-sets of arrays that were mapped to processors using one-level mapping can be translated to closed form expressions for index-sets of arrays that were mapped to processors using two-level mapping and vice versa. A similar approach that addressed the problems of the index set and the communication sets identification for array statements with BLOCK-CYCLIC(c) distribution was presented in [24]. In [24], the $\operatorname{CYCLIC}(k)$ distribution was viewed as an union of $k$ CYCLIC distribution. Since the communication sets for CYCLIC distribution is easy to determine, communication sets for $\operatorname{CYCLIC}(k)$ distribution can be generated in terms of unions and intersections of some CYCLIC distributions.

In [3], Chatterjee et al. enumerated the local memory access sequence of communication sets for array statements with BLOCK-CYCLIC(c) distribution based on a finite-state machine. In this approach, the local memory access sequence can be characterized by a FSM at most $c$ states. In [17], Kennedy et al. also presented algorithms to compute the local memory access sequence for array statements with BLOCK-CYCLIC( $c$ ) distribution. Lee et al. [18] derived communication sets for statements of arrays which were distributed in arbitrary BLOCK-CYCLIC(c) fashion. They also presented closed form expressions of communication sets for restricted block size.

Thakur et al. [25, 26] presented algorithms for run-time array redistribution in HPF programs. For BLOCK-CYCLIC( $k r$ ) to BLOCK-CYCLIC( $r$ ) redistribution (or vice versa), in most cases, a processor scanned its local array elements once to determine the destination (source) processor for each block of array elements of size $r$ in the local array. In [10], an approach for generating communication sets by computing the intersections of index sets corresponding to the LHS and RHS of array statements was presented. The intersections are computed by a scanning approach that exploits the repetitive pattern of the intersection of two index sets. In [22, 23], Ramaswamy and Banerjee used a mathematical representation, PITFALLS, for regular data redistribution. The basic idea of PITFALLS is to find all intersections between source and destination distributions. Based on the intersections, the send/receive processor/data sets can be determined and general redistribution algorithms can be devised. Prylli et al. [21] proposed runtime scan algorithm for BLOCKCYCLIC array redistribution. Their approach has the same time complexity as that proposed in [23], but has simple basic operation compared to that proposed in [23]. The disadvantage of these approaches is that, when the number of processors is large, iterations of the out-most loop in intersection algorithms increased as well. This leads to high indexing overheads and degrades the performance of a redistribution algorithm.

In [32], a spiral mapping technique was proposed. The main idea of this approach was to map formal processors onto actual processors such that the global communication can be translated to the local communication in a certain processor group. Since the communication is local to a processor group, one can reduce communication conflicts when performing a redistribution. Kalns and Ni [12, 13]
proposed a processor mapping technique to minimize the amount of data exchange for BLOCK to BLOCK-CYCLIC ( $c$ ) redistribution and vice versa. Using the data to logical processors mapping, they show that the technique can achieve the maximum ratio between data retained locally and the total amount of data exchanged. Walker et al. [30] used the standardized message passing interface, MPI, to express the redistribution operations. They implemented the BLOCK-CYCLIC array redistribution algorithms in a synchronous and an asynchronous scheme. Since the excessive synchronization overheads incurred from the synchronous scheme, they also presented the random and optimal scheduling algorithms for BLOCK-CYCLIC array redistribution.

Kaushik et al. [14, 15] proposed a multi-phase redistribution approach for BLOCK$\operatorname{CYCLIC}(s)$ to BLOCK-CYCLIC $(t)$ redistribution. The main idea of multi-phase redistribution is to perform a redistribution as a sequence of redistribution such that the communication cost of data movement among processors in the sequence is less than that of direct redistribution. Instead of redistributing the entry array at one time, a strip mining approach was presented in [31]. In this approach, portions of array elements were redistributed in sequence in order to overlap the communication and computation. In [19], a generalized circulant matrix formalism was proposed to reduce the communication overheads for BLOCK-CYCLIC $(r)$ to BLOCK-CYCLIC( $k r$ ) redistribution. Using the generalized circulant matrix formalism, the authors derived direct, indirect, and hybrid communication schedules for the cyclic redistribution with the block size changed by an integer factor $k$. They also extended this technique to solve some multi-dimensional redistribution problem [20]. However, as the array size increased, the above methods will have a large amount of extra transmission costs and degrades the performance of a redistribution algorithm.

## 3. Preliminaries

In this section, we will present the notations and terminology used in this paper. To simplify the presentation, we use $\operatorname{BC}\left(s_{0}, s_{1}, \ldots, s_{n-1}\right) \rightarrow \operatorname{BC}\left(t_{0}, t_{1}, \ldots, t_{n-1}\right)$ to represent the $\left(\operatorname{CYCLIC}\left(s_{0}\right), \operatorname{CYCLIC}\left(s_{1}\right), \ldots, \operatorname{CYCLIC}\left(s_{n-1}\right)\right)$ to $\left(\operatorname{CYCLIC}\left(t_{0}\right)\right.$, $\left.\operatorname{CYCLIC}\left(t_{1}\right), \ldots, \operatorname{CYCLIC}\left(t_{n-1}\right)\right)$ redistribution for the rest of the paper.

Definition 1. An $n$-dimensional array is defined as the set of array elements $A^{(n)}=$ $A\left[1: n_{0}, 1: n_{1}, \ldots, 1: n_{n-1}\right]=\left\{a_{d_{0}, d_{1}, \ldots, d_{n-1}} \mid 0 \leq d_{\ell} \leq n_{\ell}-1,0 \leq \ell \leq n-1\right\}$. The size of array $A^{(n)}$, denoted by $\left|A^{(n)}\right|$, is equal to $n_{0} \times n_{1} \times \cdots \times n_{n-1}$. In this paper, we assume that array elements are stored in a memory by a row-major manner.

Figure 1(a) shows a two-dimensional array $A^{(2)}=A[1: 12,1: 12]$. There are $12 \times$ 12 array elements in $A[1: 12,1: 12]$, i.e., $\left|A^{(2)}\right|=144$. In Figure 1(a), we use the italic fonts and the normal fonts to represent the indices of each dimension of array $A[1: 12,1: 12]$ and the global array indices of array $A[1: 12,1: 12]$, respectively. We assume that array elements were stored in a row major fashion and the array index starts from 1.

|  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $I$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| 2 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 |
| 3 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 |
| 4 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 |
| 5 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 |
| 6 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 |
| 7 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 |
| 8 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 |
| 9 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
| 10 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 |
| 11 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 |
| 12 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 |



Source: BC(3,2)


Destination: $\mathrm{BC}(2,4)$
(b)

$$
\begin{aligned}
& S L A_{0}^{(2)}=\left(\begin{array}{cccc}
1 & 2 & 7 & 8 \\
13 & 14 & 19 & 20 \\
25 & 26 & 31 & 32 \\
73 & 74 & 79 & 80 \\
85 & 86 & 91 & 92 \\
97 & 98 & 103 & 104
\end{array}\right), \quad S L A_{0,0}^{(2)}=\left(\begin{array}{c}
1 \\
13 \\
25 \\
73 \\
85 \\
97
\end{array}\right), \quad \quad S L A_{0,1}^{(2)}=\left(\begin{array}{l}
1 \\
2 \\
7 \\
8
\end{array}\right), \\
& S L A_{0,0}^{(2)}[2]=\{13\}, \quad S L A_{0,0}^{(2)}[4]=\{73\}, \quad S L A_{0,1}^{(2)}[2]=\{2\}, \quad S L A_{0,1}^{(2)}[4]=\{8\}
\end{aligned}
$$

(c)

Figure 1. (a) A global array $A[12,12]$. (b) $\mathrm{A} \mathrm{BC}(3,2) \rightarrow \mathrm{BC}(2,4)$ redistribution on $A[1: 12,1: 12]$ over $M[2,3]$. (c) Examples of $S L A_{i}^{(n)}, S L A_{i, \ell}^{(n)}$, and $S L A_{i, \ell}^{(n)}[r]$.

Definition 2. An $n$-dimensional processor grid is defined as the set of processors $M^{(n)}=M\left[m_{0}, m_{1}, \ldots, m_{n-1}\right]=\left\{\tilde{p}_{d_{0}, d_{1}, \ldots, d_{n-1}} \mid 0 \leq d_{\ell} \leq m_{\ell}-1,0 \leq \ell \leq n-1\right\}$. The number of processors of $M^{(n)}$, denoted by $\left|M^{(n)}\right|$, is equal to $m_{0} \times m_{1} \times \cdots$ $\times m_{n-1}$.

Figure $1(\mathrm{~b})$ shows a $\mathrm{BC}(3,2)$ to $\mathrm{BC}(2,4)$ redistribution on $A[1: 12,1: 12]$ over a processor grid $M[2,3]$ with six processors. The shadow portions represent the array elements distributed to processor $P_{0}$ before and after the redistribution.

Definition 3. Given an $n$-dimensional processor grid $M^{(n)}$, the rank of processor $\tilde{p}_{d_{0}, d_{1}, \ldots, d_{n-1}}$ is equal to $i=\sum_{k=0}^{n-1}\left(d_{k} \times \prod_{\ell=k+1}^{n-1} m_{\ell}\right)$, where $0 \leq d_{\ell} \leq m_{\ell}-1$, $0 \leq \ell \leq n-1$. To simplify the presentation, we also use processor $P_{i}$ to denote $\tilde{p}_{d_{0}, d_{1}, \ldots, d_{n-1}}$ in this paper, where $0 \leq i \leq\left|M^{(n)}\right|-1$.

According to Definition 3, we know that $\tilde{p}_{0,0}=P_{0}, \tilde{p}_{0,1}=P_{1}, \tilde{p}_{0,2}=P_{2}, \tilde{p}_{1,0}=$ $P_{3}, \tilde{p}_{1,1}=P_{4}, \tilde{p}_{1,2}=P_{5}$.

Definition 4. Given an $\mathrm{BC}\left(s_{0}, s_{1}, \ldots, s_{n-1}\right) \rightarrow \mathrm{BC}\left(t_{0}, t_{1}, \ldots, t_{n-1}\right)$ redistribution, $\operatorname{BC}\left(s_{0}, s_{1}, \ldots, s_{n-1}\right), \operatorname{BC}\left(t_{0}, t_{1}, \ldots, t_{n-1}\right), s_{\ell}$ and $t_{\ell}$ are called the source distribution, the destination distribution, the source distribution factors, and the destination distribution factors of the redistribution, respectively, where $0 \leq \ell \leq n-1$.

In Figure 1(b), the source distribution is $\operatorname{BC}(3,2)$. The destination distribution is $\mathrm{BC}(2,4)$. The source distribution factors in the first and the second dimension are equal to three and two, respectively. The destination distribution factors in the first and the second dimension are equal to two and four respectively.

Definition 5. Given a $\operatorname{BC}\left(s_{0}, s_{1}, \ldots, s_{n-1}\right) \rightarrow \operatorname{BC}\left(t_{0}, t_{1}, \ldots, t_{n-1}\right)$ redistribution on $A^{(n)}$ over $M^{(n)}$, the source local array of processor $P_{i}$, denoted by $S L A_{i}^{(n)}\left[1:\left(n_{0} / m_{0}\right)\right.$, 1: $\left(n_{1} / m_{1}\right), \ldots, 1:\left(n_{n-1} / m_{n-1}\right)$ ], is defined as the set of array elements that are distributed to processor $P_{i}$ in the source distribution, i.e., $\left|S L A_{i}^{(n)}\right|=\prod_{b=0}^{n-1}\left\lceil n_{b} / m_{b}\right\rceil$, where $0 \leq i \leq\left|M^{(n)}\right|-1$. The destination local array of processor $P_{j}$, denoted by $D L A_{j}^{(n)}\left[1:\left(n_{0} / m_{0}\right), 1:\left(n_{1} / m_{1}\right), \ldots, 1:\left(n_{n-1} / m_{n-1}\right)\right]$, is defined as the set of array elements that are distributed to processor $P_{j}$ in the destination distribution, i.e., $\left|D L A_{j}^{(n)}\right|=\prod_{b=0}^{n-1}\left[n_{b} / m_{b}\right\rceil$, where $0 \leq j \leq\left|M^{(n)}\right|-1$.

Definition 6. We define $S L A_{i, \ell}^{(n)}$ as the set of array elements in the first row of the $\ell$ th dimension of $S L A_{i}^{(n)}$, i.e., $S L A_{i, \ell}^{(n)}=S L A_{i}^{(n)}\left[1, \ldots, 1,1:\left(n_{\ell} / m_{\ell}\right), 1, \ldots, 1\right]$, where $0 \leq i \leq\left|M^{(n)}\right|-1$ and $0 \leq \ell \leq n-1$. The number of array elements in $S L A_{i, \ell}^{(n)}$ is equal to $n_{\ell} / m_{\ell} . S L A_{i, \ell}^{(n)}[r]$ is defined as the $r$ th array element of $S L A_{i, \ell}^{(n)}$.

Figure 1(c) shows examples of notations that were defined in Definitions 5 and 6. In Figure $1(\mathrm{c})$, there are 24 array elements in $S L A_{0}^{(2)}$. The sets of array elements in $S L A_{0,0}^{(2)}$ and $S L A_{0,1}^{(2)}$ are $\{1,13,25,73,85,97\}$ and $\{1,2,7,8\}$, respectively. The second and the fourth elements in $S L A_{0,0}^{(2)}$ are $S L A_{0,0}^{(2)}[2]=\{13\}$ and $S L A_{0,0}^{(2)}[4]=$ $\{73\}$, respectively. The second and the fourth elements in $S L A_{0,1}^{(2)}$ are $S L A_{0,1}^{(2)}[2]=$ $\{2\}$ and $S L A_{0,1}^{(2)}[4]=\{8\}$, respectively.

Definition 7. Given a $\operatorname{BC}\left(s_{0}, s_{1}, \ldots, s_{n-1}\right) \rightarrow \operatorname{BC}\left(t_{0}, t_{1}, \ldots, t_{n-1}\right)$ redistribution on $A^{(n)}$ over $M^{(n)}$, a basic-cycle of the $\ell$ th dimension of $S L A_{i}^{(n)}$ (or $D L A_{j}^{(n)}$ ), denoted by $B C_{\ell}$, is defined as the quotient of the least common multiple of $s_{\ell}$ and $t_{\ell}$ to the greatest common divisor of $s_{\ell}$ and $t_{\ell}$, i.e., $B C_{\ell}=\operatorname{lcm}\left(s_{\ell}, t_{\ell}\right) / \operatorname{gcd}\left(s_{\ell}, t_{\ell}\right)$. We
define $S L A_{i, \ell}^{(n)}\left[1: B C_{\ell}\right]\left(D L A_{j, \ell}^{(n)}\left[1: B C_{\ell}\right]\right)$ as the first basic-cycle of $S L A_{i, \ell}^{(n)}\left(D L A_{j, \ell}^{(n)}\right)$ of processor $P_{i}\left(P_{j}\right), S L A_{i, \ell}^{(n)}\left[B C_{\ell}+1: 2 \times B C_{\ell}\right]\left(D L A_{i, \ell}^{(n)}\left[B C_{\ell}+1: 2 \times B C_{\ell}\right]\right)$ as the second basic-cycle of $S L A_{i, \ell}^{(n)}\left(D L A_{j, \ell}^{(n)}\right)$ of processor $P_{i}\left(P_{j}\right)$, and so on, where $0 \leq$ $\ell \leq n-1$.

In the $\mathrm{BC}(3,2)$ to $\mathrm{BC}(2,4)$ redistribution shown in Figure $1(\mathrm{~b})$, in the first dimension, the source and the destination distribution factor are equal to three and two, respectively. According to the above definition, the basic-cycle of the first dimension is $B C_{0}=\operatorname{lcm}(3,2) / \operatorname{gcd}(3,2)=6$. In the second dimension, the source and the destination distribution factor are equal to two and four, respectively. According to the above definition, the basic-cycle of the first dimension is $B C_{1}=$ $\operatorname{lcm}(2,4) / \operatorname{gcd}(2,4)=2$.

Definition 8. Given a $\operatorname{BC}\left(s_{0}, s_{1}, \ldots, s_{n-1}\right) \rightarrow \operatorname{BC}\left(t_{0}, t_{1}, \ldots, t_{n-1}\right)$ redistribution on $A^{(n)}$ over $M^{(n)}$, a basic-block of $S L A_{i}^{(n)}$ (or $D L A_{j}^{(n)}$ ) is defined as the multiplication of the basic-cycles in each dimension. The size of a basic-block is equal to $B C_{0} \times$ $B C_{1} \times \cdots \times B C_{n-1}$.

In Figure $1(\mathrm{~b}), B C_{0}=6$ and $B C_{1}=2$. According to Definition 8, the basic-block is equal to $B C_{0} \times B C_{1}=12$.

## 4. Multi-dimensional array redistribution

To perform a $\operatorname{BC}\left(s_{0}, s_{1}, \ldots, s_{n-1}\right)$ to $\operatorname{BC}\left(t_{0}, t_{1}, \ldots, t_{n-1}\right)$ redistribution, in general, a processor needs to compute the communication sets. Based on the characteristics of a redistribution, we have the following lemmas.

Lemma 1. Given a $\mathrm{BC}\left(s_{0}, s_{1}, \ldots, s_{n-1}\right) \rightarrow \mathrm{BC}\left(t_{0}, t_{1}, \ldots, t_{n-1}\right)$ redistribution on $A^{(n)}$ over $M^{(n)}$, for a source (destination) processor $P_{i}$, if the rank of the destination (source) processor of $S L A_{i, k}^{(n)}\left[r_{k}\right]\left(D L A_{i, k}^{(n)}\left[r_{k}\right]\right)$ is $\tilde{p}_{0, \ldots, j_{k}, 0, \ldots, 0}$, where $0 \leq i \leq\left|M^{(n)}\right|-1, k=$ 0 to $n-1,0 \leq j_{k} \leq m_{k}-1$, and $1 \leq r_{k} \leq\left\lceil n_{k} / m_{k}\right\rceil$, then the destination (source) processor of $S L A_{i}^{(n)}\left[r_{0}, r_{1}, \ldots, r_{n-1}\right]\left(D L A_{i}^{(n)}\left[r_{0}, r_{1}, \ldots, r_{n-1}\right]\right)$ is $P_{j}=\tilde{p}_{j_{0}, j_{1}, \ldots, j_{n-1}}$, where $j=\sum_{k=0}^{n-1}\left(j_{k} \times \prod_{\ell=k+1}^{n-1} m_{\ell}\right)$.

Proof. We only prove the source processor part. The proof of the destination processor part is similar. In a $\mathrm{BC}\left(s_{0}, s_{1}, \ldots, s_{n-1}\right) \rightarrow \mathrm{BC}\left(t_{0}, t_{1}, \ldots, t_{n-1}\right)$ redistribution, we assume that the destination processor of $\operatorname{SL} A_{i}^{(n)}\left[r_{0}\right]\left[r_{1}\right] \ldots\left[r_{n-1}\right]$, is $\tilde{p}_{d_{0}, d_{1}, \ldots, d_{n-1}}$, where $0 \leq d_{\ell} \leq m_{\ell}-1$ and $0 \leq \ell \leq n-1$. If the destination processor of $S L A_{i, 0}^{(n)}\left[r_{0}\right]$ is $\tilde{p}_{j_{0}, 0, \ldots, 0}$, then $d_{0}=j_{0}$, where $0 \leq j_{0} \leq m_{0}-1$ and $1 \leq r_{0} \leq\left\lceil n_{0} / m_{0}\right\rceil$. For the same reason, if the destination processors of $S L A_{i, 1}^{(n)}\left[r_{1}\right], S L A_{i, 2}^{(n)}\left[r_{2}\right], \ldots, S L A_{i, n-1}^{(n)}\left[r_{n-1}\right]$ are $\tilde{p}_{0, j_{1}, 0, \ldots, 0}, \tilde{p}_{0,0, j_{2}, 0, \ldots, 0}, \ldots$, and $\tilde{p}_{0, \ldots, 0, j_{n-1}}$, respectively, we have $d_{1}=j_{1}, d_{2}=$ $j_{2}, \ldots$, and $d_{n-1}=j_{n-1}$. Therefore, according to Definition 3, if the rank of the destination processor of $S L A_{i, k}^{(n)}\left[r_{k}\right]$ is $\tilde{p}_{0, \ldots, j_{k}, 0, \ldots, 0}$, where $0 \leq i \leq\left|M^{(n)}\right|-1, k=0$
to $n-1,0 \leq j_{k} \leq m_{k}-1$, and $1 \leq r_{k} \leq\left\lceil n_{k} / m_{k}\right\rceil$, then the destination processor of $S L A_{i}^{(n)}\left[r_{0}\right]\left[r_{1}\right] \ldots\left[r_{n-1}\right]$ is $P_{j}=\tilde{p}_{j_{0}, j_{1}, \ldots, j_{n-1}}$, where $j=\sum_{k=0}^{n-1}\left(j_{k} \times \prod_{\ell=k+1}^{n-1} m_{\ell}\right)$.

According to Lemma 1, the destination (source) processor of $S L A_{i}^{(n)}\left[r_{0}, r_{1}, \ldots\right.$, $\left.r_{n-1}\right]\left(D L A_{j}^{(n)}\left[r_{0}, r_{1}, \ldots, r_{n-1}\right]\right)$ can be determined by the ranks of the destination (source) processors of $S L A_{i, 0}^{(n)}\left[r_{0}\right], S L A_{i, 1}^{(n)}\left[r_{1}\right], \ldots$, and $S L A_{i, n-1}^{(n)}\left[r_{n-1}\right]\left(D L A_{j, 0}^{(n)}\left[r_{0}\right]\right.$, $D L A_{j, 1}^{(n)}\left[r_{1}\right], \ldots$, and $\left.D L A_{j, n-1}^{(n)}\left[r_{n-1}\right]\right)$. Therefore, how to efficiently determine the communication sets of these array elements is important. In this section, we present two efficient techniques, the basic-block calculation technique and the complete-dimension calculation technique, to deal with this problem. Both techniques are based on the basic-cycle calculation technique proposed in [4]. The main idea of the basic-cycle calculation technique is based on the following lemma.

Lemma 2. Given a $\mathrm{BC}(s) \rightarrow \mathrm{BC}(t)$ and a $\mathrm{BC}(s / \operatorname{gcd}(s, t)) \rightarrow \mathrm{BC}(t / \operatorname{gcd}(s, t))$ redistribution on a one-dimensional array $A[1: N]$ over $M$ processors, for a source (destination) processor $P_{i}\left(P_{j}\right)$, if the destination (source) processor of $S L A_{i}[k]\left(D L A_{j}[k]\right)$ in $\mathrm{BC}(s / \operatorname{gcd}(s, t)) \rightarrow \mathrm{BC}(t / \operatorname{gcd}(s, t))$ redistribution is $P_{j}\left(P_{i}\right)$, then the destination (source) processors of $S L A_{i}[(k-1) \times \operatorname{gcd}(s, t)+1: k \times \operatorname{gcd}(s, t)]\left(D L A_{j}[(k-1) \times\right.$ $\operatorname{gcd}(s, t)+1: k \times \operatorname{gcd}(s, t)])$ in $\mathrm{BC}(s) \rightarrow \mathrm{BC}(t)$ redistribution will also be $P_{j}\left(P_{i}\right)$, where $1 \leq k \leq\lceil N /(M \times \operatorname{gcd}(s, t))\rceil$.

Proof. We only prove the source processor part. The proof of the destination processor part is similar. For a source processor $P_{i}$, if the global array index of $S L A_{i}[k]$ in $\operatorname{BC}(s / \operatorname{gcd}(s, t)) \rightarrow \mathrm{BC}(t / \operatorname{gcd}(s, t))$ redistribution is $\alpha$, then the global array indices of $S L A_{i}[(k-1) \times \operatorname{gcd}(s, t)+1: k \times \operatorname{gcd}(s, t)]$ in $\mathrm{BC}(s) \rightarrow \mathrm{BC}(t)$ redistribution are $(\alpha-1) \times \operatorname{gcd}(s, t)+1,(\alpha-1) \times \operatorname{gcd}(s, t)+2, \ldots, \alpha \times \operatorname{gcd}(s, t)$. If $A[1: N]$ is distributed by $\operatorname{BC}(t / \operatorname{gcd}(s, t))$ distribution, then $A[\alpha]$ is in the $\lceil\alpha \times$ $\operatorname{gcd}(s, t) / t\rceil$ th block of size $t / \operatorname{gcd}(s, t)$. If $A[1: N]$ is distributed by $\mathrm{BC}(t)$ distribution, then $A[(\alpha-1) \times \operatorname{gcd}(s, t)+1], A[(\alpha-1) \times \operatorname{gcd}(s, t)+2], \ldots$, and $A[\alpha \times \operatorname{gcd}(s, t)]$ are in the $\lceil(\alpha-1) \times \operatorname{gcd}(s, t)+1 / t\rceil$ th, the $\lceil(\alpha-1) \times \operatorname{gcd}(s, t)+2 / t\rceil$ th, $\ldots$, and the $\lceil(\alpha \times \operatorname{gcd}(s, t) / t\rceil$ th block of size $t$, respectively. Since $\lceil(\alpha-1) \times \operatorname{gcd}(s, t)+1 / t\rceil=$ $\lceil(\alpha-1) \times \operatorname{gcd}(s, t)+2 / t\rceil=\ldots=\lceil\alpha \times \operatorname{gcd}(s, t) / t\rceil$, if the destination processor of $A[\alpha]$ is $P_{j}$ in $\mathrm{BC}(s / \operatorname{gcd}(s, t)) \rightarrow \mathrm{BC}(t / \operatorname{gcd}(s, t))$ redistribution, then the destination processors of $A[(\alpha-1) \times \operatorname{gcd}(s, t)+1], A[(\alpha-1) \times \operatorname{gcd}(s, t)+2], \ldots$, and $A[\alpha \times \operatorname{gcd}(s, t)]$ are $P_{j}$ in $\mathrm{BC}(s) \rightarrow \mathrm{BC}(t)$ redistribution. Therefore, if the destination processor of $S L A_{i}[k]$ in $\mathrm{BC}(s / \operatorname{gcd}(s, t)) \rightarrow \mathrm{BC}(t / \operatorname{gcd}(s, t))$ redistribution is $P_{j}$, then the destination processors of $S L A_{i}[(k-1) \times \operatorname{gcd}(s, t)+1: k \times \operatorname{gcd}(s, t)]$ in $\mathrm{BC}(s) \rightarrow \mathrm{BC}(t)$ redistribution will also be $P_{j}$, where $0 \leq i, j \leq M-1$ and $1 \leq k \leq$ $\lceil N /(M \times \operatorname{gcd}(s, t))\rceil$.

Given a $\mathrm{BC}\left(s_{0}, s_{1}, \ldots, s_{n-1}\right) \rightarrow \mathrm{BC}\left(t_{0}, t_{1}, \ldots, t_{n-1}\right)$ redistribution, according to Lemmas 1, and 2, we know that the communication sets of $\mathrm{BC}\left(s_{0} / \operatorname{gcd}\left(s_{0}\right.\right.$, $\left.\left.t_{0}\right), s_{1} / \operatorname{gcd}\left(s_{1}, t_{1}\right), \ldots, s_{n-1} / \operatorname{gcd}\left(s_{n-1}, t_{n-1}\right)\right) \rightarrow \operatorname{BC}\left(t_{0} / \operatorname{gcd}\left(s_{0}, t_{0}\right), t_{1} / \operatorname{gcd}\left(s_{1}, t_{1}\right), \ldots\right.$, $\left.t_{n-1} / \operatorname{gcd}\left(s_{n-1}, t_{n-1}\right)\right)$ redistribution can be used to generate the communication
sets of $\operatorname{BC}\left(s_{0}, s_{1}, \ldots, s_{n-1}\right) \rightarrow \operatorname{BC}\left(t_{0}, t_{1}, \ldots, t_{n-1}\right)$ redistribution. Therefore, in the following discussion, for a $\operatorname{BC}\left(s_{0}, s_{1}, \ldots, s_{n-1}\right) \rightarrow \operatorname{BC}\left(t_{0}, t_{1}, \ldots, t_{n-1}\right)$ redistribution, we assume that $\operatorname{gcd}\left(s_{i}, t_{i}\right)$ is equal to 1 , where $1 \leq i \leq n-1$. If $\operatorname{gcd}\left(s_{i}, t_{i}\right)$ is not equal to 1 , we use $s_{i} / \operatorname{gcd}\left(s_{i}, t_{i}\right)$ and $t_{i} / \operatorname{gcd}\left(s_{i}, t_{i}\right)$ as the source and destination distribution factors of the redistribution, respectively.

### 4.1. The basic-block calculation technique

Given a $\mathrm{BC}\left(s_{0}, s_{1}\right) \rightarrow \mathrm{BC}\left(t_{0}, t_{1}\right)$ redistribution on a two-dimensional array $A\left[1: n_{0}\right.$, 1: $\left.n_{1}\right]$ over $M\left[m_{0}, m_{1}\right]$, to perform the redistribution, we have to first construct the communication sets. According to Lemma 1, a source processor $P_{i}$ only needs to determine the destination processor sets for the two basic cycles, $S L A_{i, 0}^{(2)}\left[1: B C_{0}\right]$ and $S L A_{i, 1}^{(2)}\left[1: B C_{1}\right]$. Then it can generate the destination processor sets for the basic block, $S L A_{i}^{(2)}\left[1: B C_{0}, 1: B C_{1}\right]$. For example, if the destination processors of $S L A_{i, 0}^{(2)}\left[r_{0}\right]$ and $S L A_{i, 1}^{(2)}\left[r_{1}\right]$ are $\tilde{p}_{j_{0}, 0}$ and $\tilde{p}_{0, j_{1}}$, respectively, the destination processor $P_{j}$ of $S L A_{i}^{(2)}\left[r_{0}\right]\left[r_{1}\right]$ can be determined by the following equation,

$$
\begin{equation*}
\operatorname{Rank}\left(P_{j}\right)=j_{0} m_{1}+j_{1}, \tag{1}
\end{equation*}
$$

where $\operatorname{rank}\left(P_{j}\right)$ is the rank of destination processor $P_{j}$.
For a source processor $P_{i}$, if $P_{i}=\tilde{p}_{i_{0}, i_{1}}$, according to Definition 3, we have $i_{0}=$ $\left\lfloor i / m_{1}\right\rfloor$ and $i_{1}=\bmod \left(i, m_{1}\right)$ where $0 \leq i_{0} \leq m_{0}-1$ and $0 \leq i_{1} \leq m_{1}-1$. Since the values of $i_{0}$ and $i_{1}$ are known, the destination processors of $S L \bar{A}_{i, 0}^{(2)}\left[1: B C_{0}\right]$ and $S L A_{i, 1}^{(2)}\left[1: B C_{1}\right]$ can be determined by the following equation,

$$
D P_{(\ell)}=\text { Destination_Processors }\left(S L A_{i, \ell}^{(2)}\left[1: B C_{\ell}\right]\right)=\left[\begin{array}{c}
F(1)  \tag{2}\\
F(2) \\
\vdots \\
F\left(B C_{\ell}\right)
\end{array}\right]_{B C_{\ell} \times 1}
$$

where $\ell=0$ and 1 . The function $F(x)$ is defined as follows,

$$
\begin{equation*}
F(x)=\left\lfloor\frac{\bmod \left(\left(i_{\ell}+m_{\ell} \times\left\lfloor\frac{x}{s_{\ell}}\right\rfloor\right) \times s_{\ell}, m_{\ell} \times t_{\ell}\right)}{t_{\ell}}\right\rfloor \tag{3}
\end{equation*}
$$

where $x=1$ to $B C_{\ell}, i_{\ell}$ is the rank of source processor $P_{i}$ in the $\ell$ th dimension, $\ell=0,1$.

For a two-dimensional array redistribution, from Equation 2, we can obtain $D P_{(0)}$ and $D P_{(1)}$ that represent the destination processors of $S L A_{i, 0}^{(2)}\left[1: B C_{0}\right]$ and $S L A_{i, 1}^{(2)}\left[1: B C_{1}\right]$, respectively. According to $D P_{(0)}, D P_{(1)}$ and Equation 1, a source processor $P_{i}$ can determine the destination processor of array elements in the


Figure 2. $\mathrm{A} \mathrm{BC}(3,1) \rightarrow \mathrm{BC}(2,4)$ redistribution on $A[1: 24,1: 24]$ over $M[2,3]$.
first basic-block of $S L A_{i}^{(2)}$, i.e., $S L A_{i}^{(2)}\left[1: B C_{0}, 1: B C_{1}\right]$. Figure 2 shows an example of a $\mathrm{BC}(3,1) \rightarrow \mathrm{BC}(2,4)$ redistribution on $A[1: 24,1: 24]$ over $M[2,3]$. In this example, $B C_{0}=6$ and $B C_{1}=4$. For source processor $P_{0}\left(=\tilde{p}_{0,0}\right)$, according to Equation 2, the destination processors of $S L A_{0,0}^{(2)}\left[1: B C_{0}\right]$ and $S L A_{0,1}^{(2)}\left[1: B C_{1}\right]$ are $D P_{(0)}=[0,0,1,1,1,0]$ and $D P_{(1)}=[0,0,1,2]$, respectively. Based on $D P_{(0)}, D P_{(1)}$,
and Equation 1, the destination processors of $S L A_{0}^{(2)}[1: 6,1: 4]$ are

$$
\left[\begin{array}{llll}
0 & 0 & 1 & 2 \\
0 & 0 & 1 & 2 \\
3 & 3 & 4 & 5 \\
3 & 3 & 4 & 5 \\
3 & 3 & 4 & 5 \\
0 & 0 & 1 & 2
\end{array}\right]_{6 \times 4} .
$$

For a multi-dimensional array redistribution, each basic-block of a local array has the same communication patterns. The following lemmas state this characteristic.

Lemma 3. Given a $\mathrm{BC}(s) \rightarrow \mathrm{BC}(t)$ redistribution on a one-dimensional array $A[1: N]$ over $M$ processors, $S L A_{i}[m], S L A_{i}[m+\operatorname{lcm}(s, t)], S L A_{i}[m+2 \times \operatorname{lcm}(s, t)], \ldots$, and $S L A_{i}[m+(\lfloor N / \operatorname{lcm}(s, t) \times M\rfloor-1) \times \operatorname{lcm}(s, t)]$ have the same destination processor, where $0 \leq i \leq M-1$ and $1 \leq m \leq \operatorname{lcm}(s, t)$.

Proof. The proof of this lemma can be found in [11].
Lemma 4. Given a $\mathrm{BC}\left(s_{0}, s_{1}\right) \rightarrow \mathrm{BC}\left(t_{0}, t_{1}\right)$ redistribution on a two-dimensional array $A\left[1: n_{0}, 1: n_{1}\right]$ over $M\left[m_{0}, m_{1}\right], S L A_{i}^{(2)}[x, y], S L A_{i}^{(2)}\left[x+k_{0} \times B C_{0}, y\right], S L A_{i}^{(2)}[x, y+$ $\left.k_{1} \times B C_{1}\right], S L A_{i}^{(2)}\left[x+k_{0} \times B C_{0}, y+k_{1} \times B C_{1}\right]$ have the same destination processor, where $0 \leq i \leq m_{0} \times m_{1}-1,1 \leq x \leq \operatorname{lcm}\left(s_{0}, t_{0}\right), 1 \leq y \leq \operatorname{lcm}\left(s_{1}, t_{1}\right), 1 \leq k_{0} \leq$ $\left\lfloor n_{0} /\left(\operatorname{lcm}\left(s_{0}, t_{0}\right) \times m_{0}\right)\right\rfloor$ and $1 \leq k_{1} \leq\left\lfloor n_{1} /\left(\operatorname{lcm}\left(s_{1}, t_{1}\right) \times m_{1}\right)\right\rfloor$.

Proof. The proof of this lemma can be easily established according to Lemma 3.

Since each basic-block has the same communication patterns, we can pack local array elements to messages according to the destination processors of array elements in $S L A_{i}^{(2)}\left[1: B C_{0}, 1: B C_{1}\right]$. However, if the value of $B C_{0} \times B C_{1}$ is large, it may take a lot of time to compute the destination processors of array elements in a basic-block by using Equation 1. In the basic-block calculation technique, instead of using the destination processors of array elements in the first basic-block, it uses a table lookup method to pack array elements. Given a $\mathrm{BC}\left(s_{0}, s_{1}\right) \rightarrow \mathrm{BC}\left(t_{0}, t_{1}\right)$ redistribution on a two-dimensional array $A\left[1: n_{0}, 1: n_{1}\right]$ over $M\left[m_{0}, m_{1}\right]$, since the destination processors of $S L A_{i}^{(2)}\left[1: B C_{0}, 1: B C_{1}\right]$ can be determined by $D P_{(0)}$ and $D P_{(1)}$, if we gather the indices of array elements, that have the same destination processor, in $S L A_{i, l}^{(2)}\left[1: B C_{\ell}\right]$ to tables, called Send_Tables, we can also determine the destination processors of $S L A_{i}^{(2)}$ [1: $\left.B C_{0}, 1: B C_{1}\right]$ from Send_Tables. For example, considering the $B C(3,1) \rightarrow B C(2,4)$ redistribution shown in Figure 2. For the source processor $P_{0}$, since $D P_{(0)}=[0,0,1,1,1,0], S L A_{0,0}^{(2)}[1], S L A_{0,0}^{(2)}[2]$ and $S L A_{0,0}^{(2)}[6]$ have the same destination processor $\tilde{p}_{0,0} . S L A_{0,0}^{(2)}[3], S L A_{0,0}^{2}$ [4] and $S L A_{0,0}^{(2)}$ [5] have the same destination processor $\tilde{p}_{1,0}$. Therefore, the indices of array elements in $S L A_{0,0}^{(2)}[1: 6]$ can be classified into two sets as shown in Figure 3(a). Since $D P_{(1)}=$

| Send_Tables | $S T_{0}$ | $D P$ | Indices set | $S T_{1}$ |  | $D P$ |
| :--- | :---: | :---: | :--- | :---: | :---: | :--- |
| Indices set |  |  |  |  |  |  |
|  |  | $\widetilde{p}_{0,0}$ | $1,2,6$ |  | $\widetilde{p}_{0,0}$ | 1,2 |
|  | $\widetilde{p}_{1,0}$ | $3,4,5$ |  | $\widetilde{p}_{0,1}$ | 3 |  |
|  |  |  |  | $\widetilde{p}_{0,2}$ | 4 |  |

(a)

Figure 3. The Send_Tables for $S L A_{0,0}^{(2)}\left[1: B C_{0}\right]$ and $S L A_{0,1}^{(2)}\left[1: B C_{1}\right]$.
$[0,0,1,2], S L A_{0,1}^{(2)}[1]$ and $S L A_{0,1}^{(2)}[2]$ have the same destination processor $\tilde{p}_{0,0}$. The destination processors of $S L A_{0,1}^{(2)}[3]$ and $S L A_{0,1}^{(2)}[4]$ are $\tilde{p}_{0,1}$ and $\tilde{p}_{0,2}$, respectively. Therefore, the indices of array elements in $S L A_{0,1}^{(2)}[1: 4]$ can be classified into three sets as shown in Figure 3(b).

Based on the Send_Tables, we can pack array elements in a source local array to messages, considering the message that processor $P_{0}$ will send to the destination processor $P_{4}$ in the example shown in Figure 2. Since the destination processor $P_{4}=$ $\tilde{p}_{1,1}$, according to Figure 3 and Lemma 1, the destination processor of $S L A_{0}^{(2)}$ [3][3], $S L A_{0}^{(2)}[4][3]$ and $S L A_{0}^{(2)}$ [5][3] is $P_{4}$. According to Lemma 4, each basic-block has the same communication patterns. Processor $P_{0}$ will also send $S L A_{0}^{(2)}[3][3+4]$, $S L A_{0}^{(2)}[4][3+4]$, and $S L A_{0}^{(2)}[5][3+4]$ in the second basic-block, $S L A_{0}^{(2)}[3+6][3]$, $S L A_{0}^{(2)}[4+6][3]$, and $S L A_{0}^{(2)}[5+6][3]$ in the third basic-block, and $S L A_{0}^{(2)}[3+$ 6][3+4], $S L A_{0}^{(2)}[4+6][3+4]$, and $S L A_{0}^{(2)}[5+6][3+4]$ in the fourth basic-block to the destination processor $P_{4}$. Note that array elements are packed in a row-major manner for the techniques presented in this paper.

In the receive phase, according to Lemma 1, a destination processor $P_{j}$ only needs to determine the source processor sets for the two basic-cycles, $D L A_{j, 0}^{(2)}\left[1: B C_{0}\right]$ and $D L A_{j, 0}^{(2)}\left[1: B C_{1}\right]$. Then it can generate the source processor sets for the basic-block, $D L A_{j}^{(2)}\left[1: B C_{0}, 1: B C_{1}\right]$. For example, if the source processors of $D L A_{j, 0}^{(2)}\left[r_{0}\right]$ and $D L A_{j, 1}^{(2)}\left[r_{1}\right]$ are $\tilde{p}_{i_{0}, 0}$ and $\tilde{p}_{0, i_{1}}$, respectively, the source processor $P_{i}$ of $D L A_{j}^{(2)}\left[r_{0}\right]\left[r_{1}\right]$ can be determined by the following equation,

$$
\begin{equation*}
\operatorname{Rank}\left(P_{i}\right)=i_{0} m_{1}+i_{1} \tag{4}
\end{equation*}
$$

where $\operatorname{rank}\left(P_{i}\right)$ is the rank of source processor $P_{i}$.
For a destination processor $P_{j}$, if $P_{j}=\tilde{p}_{j_{0}, j_{1}}$, according to Definition 3, we have $j_{0}=\left\lfloor j / m_{1}\right\rfloor$ and $j_{1}=j \bmod m_{1}$, where $0 \leq j_{0} \leq m_{0}-1$ and $0 \leq j_{1} \leq m_{1}-1$. Since the value of $j_{0}$ and $j_{1}$ are known, the source processors of $D L A_{j, 0}^{(2)}\left[1: B C_{0}\right]$ and $D L A_{j, 1}^{(2)}\left[1: B C_{1}\right]$ can be determined by the following equation,

$$
S P_{(\ell)}=\text { Source_Processors }\left(D L A_{j, \ell}^{(2)}\left[1: B C_{\ell}\right]\right)=\left[\begin{array}{c}
G(1)  \tag{5}\\
G(2) \\
\vdots \\
G\left(B C_{\ell}\right)
\end{array}\right]_{B C_{\ell} \times 1}
$$

where $\ell=0,1$. The function $G(x)$ is defined as follows,

$$
\begin{equation*}
G(x)=\left\lfloor\frac{\bmod \left(\left(j_{\ell}+m_{\ell} \times\left\lfloor\frac{x}{t_{\ell}}\right\rfloor\right) \times t_{\ell}, m_{\ell} \times s_{\ell}\right)}{s_{\ell}}\right\rfloor \tag{6}
\end{equation*}
$$

where $x=1$ to $B C_{\ell}, j_{\ell}$ is the rank of destination processor $P_{j}$ in the $\ell$ th dimension, $\ell=0,1$.

For a two-dimensional array redistribution, from Equation 5, we can obtain $S P_{(0)}$ and $S P_{(1)}$ that represent the source processors of $D L A_{j, 0}^{(2)}\left[1: B C_{0}\right]$ and $D L A_{j, 1}^{(2)}\left[1: B C_{1}\right]$, respectively. According to $S P_{(0)}$ and $S P_{(1)}$, we can also construct the Receive_Tables for the destination processor $P_{j}$ as we construct the Send_Tables in the send phase. Based on the Receive_Tables, we can unpack array elements from the received messages to their appropriate destination local array positions. The algorithm of the basic-block calculation technique is given as follows.

```
Algorithm Basic_Block_Calculation \(\left(s_{0}, \ldots, s_{n-1}, t_{0}, \ldots, t_{n-1}, n_{0}, \ldots, n_{n-1}\right.\),
    \(m_{0}, \ldots, m_{n-1}\) )
        1. Construct Send_Tables (STs);
    2. For \(\left(j=\right.\) myrank, \(\left.z=0 ; z<|M| ; j^{++}, z^{++}\right)\)
    3. \(\quad j=\bmod (j,|M|)\);
    4. Pack the message for destination processor \(P_{j}\) to out_buffer
        according to the STs;
    5. If (out_buffer ! = NULL)
    6. \(\quad\) Send out_buffer to destination processor \(P_{j}\);
    7. Construct Receive_Tables (RTs);
    8. \(x=\) the number of messages to be received;
    . For \(\left(z=0 ; z<x ; z^{++}\right)\)
    10. Receive data sets in buffer from any source processor;
    11. Unpack the received messages according to the RTs;
end_of_Basic_Block_Calculation
```


### 4.2. The complete-dimension calculation technique

In Section 4.1, we stated that each basic-block has the same communication patterns. Therefore, a processor only needs to construct the Send_Tables and the Receive_Tables for the first basic-cycle in each dimension of its local array. Then it can perform a multi-dimensional array redistribution. In this section, we will present a complete-dimension calculation (CDC) technique. In the complete-dimension calculation technique, a processor constructs the Send_Tables and the Receive_Tables not only for array elements that are in the first basic-cycle of each dimension of its local array, but also for array elements in the first row of each dimension of its local array, i.e., $S L A_{i, \ell}^{(n)}\left[1: n_{\ell}\right]$, where $\ell=0$ to $n-1$. In the following, we will describe the complete-dimension calculation technique in details.

Assume that a $\operatorname{BC}\left(s_{0}, s_{1}, \ldots, s_{n-1}\right) \rightarrow \operatorname{BC}\left(t_{0}, t_{1}, \ldots, t_{n-1}\right)$ redistribution on an $n$-dimensional array $A^{(n)}=A\left[1: n_{0}, 1: n_{1}, \ldots, 1: n_{n-1}\right]$ over $M^{(n)}=M\left[m_{0}, m_{1}, \ldots\right.$, $\left.m_{n-1}\right]$ is given. For the complete-dimension calculation technique, in the send phase, a source processor $P_{i}$ computes the destination processors for array elements in $S L A_{i, 0}^{(n)}\left[1: L_{0}\right], S L A_{i, 1}^{(n)}\left[1: L_{1}\right], \ldots, S L A_{i, n-1}^{(n)}\left[1: L_{n-1}\right]$, where $L_{k}$ is the local array size in each dimension, i.e., $L_{k}=\left(n_{k} / m_{k}\right), k=0$ to $n-1$. The destination processors of $S L A_{i, 0}^{(n)}\left[1: L_{0}\right], S L A_{i, 1}^{(n)}\left[1: L_{1}\right], \ldots, S L A_{i, n-1}^{(n)}\left[1: L_{n-1}\right]$ can be determined by the following equation:

$$
D P_{(\ell)}=\text { Destination_Processors }\left(\operatorname{SL} A_{i, \ell}^{(n)}\left[1: L_{\ell}\right]\right)=\left[\begin{array}{c}
F(1)  \tag{7}\\
F(2) \\
\vdots \\
F\left(L_{\ell}\right)
\end{array}\right]_{L_{\ell} \times 1}
$$

where $\ell=0$ to $n-1$. The function $F(x)$ is defined in Equation 3.
For a $\mathrm{BC}\left(s_{0}, s_{1}, \ldots, s_{n-1}\right) \rightarrow \mathrm{BC}\left(t_{0}, t_{1}, \ldots, t_{n-1}\right)$ redistribution, from Equation 7, we can obtain $D P_{(0)}, D P_{(1)}, \ldots, D P_{(n-1)}$ that represent destination processors of $S L A_{i, 0}^{(n)}\left[1: L_{0}\right], S L A_{i, 1}^{(n)}\left[1: L_{1}\right], \ldots, S L A_{i, n-1}^{(n)}\left[1: L_{n-1}\right]$, respectively. Since the destination processors of $S L A_{i, 0}^{(n)}\left[1: L_{0}\right], S L A_{i, 1}^{(n)}\left[1: L_{1}\right], \ldots, S L A_{i, n-1}^{(n)}\left[1: L_{n-1}\right]$ are known, we can construct the Send_Tables for $S L A_{i, 0}^{(n)}\left[1: L_{0}\right], S L A_{i, 1}^{(n)}\left[1: L_{1}\right], \ldots$, $S L A_{i, n-1}^{(n)}\left[1: L_{n-1}\right]$. For example, consider the redistribution shown in Figure 2. The Send_Tables constructed by the source processor $P_{0}$ are shown in Figure 4. Based on the Send_Tables, one can pack array elements in source local arrays to messages.

In the receive phase, a destination processor $P_{j}$ computes the source processors for array elements in $D L A_{j, 0}^{(n)}\left[1: L_{0}\right], D L A_{j, 1}^{(n)}\left[1: L_{1}\right], \ldots, D L A_{j, n-1}^{(n)}\left[1: L_{n-1}\right]$, where $L_{k}$ is the local array size in each dimension, i.e., $L_{k}=\left(n_{k} / m_{k}\right), k=0$ to $n-1$. The source processors of $D L A_{j, 0}^{(n)}\left[1: L_{0}\right], D L A_{j, 1}^{(n)}\left[1: L_{1}\right], \ldots, D L A_{j, n-1}^{(n)}\left[1: L_{n-1}\right]$ can be determined by the following equation,

$$
S P_{(\ell)}=\operatorname{Source} \__{-P r o c e s s o r s}\left(D L A_{j, \ell}^{(n)}\left[1: L_{\ell}\right]\right)=\left[\begin{array}{c}
G(1)  \tag{8}\\
G(2) \\
\vdots \\
G\left(L_{\ell}\right)
\end{array}\right]_{L_{\ell} \times 1}
$$

where $\ell=1$ to $n-1$. The function $G(x)$ is defined in Equation 6.
For a $\mathrm{BC}\left(s_{0}, s_{1}, \ldots, s_{n-1}\right) \rightarrow \operatorname{BC}\left(t_{0}, t_{1}, \ldots, t_{n-1}\right)$ redistribution, from Equation 8, we can obtain $S P_{(0)}, S P_{(1)}, \ldots, S P_{(n-1)}$ that represent the source processors

| Send_Tables | $S T_{\theta}$ | $D P$ | Indices set | $S T_{1}$ | $D P$ | Indices set |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  | $\widetilde{p}_{0,0}$ | $1,2,6,7,8,12$ |  | $\widetilde{p}_{0,0}$ | $\mathbf{1 , 2 , 5 , 6}$ |
|  | $\widetilde{p}_{1,0}$ | $3,4,5,9,10,11$ |  | $\widetilde{p}_{0,1}$ | 3,7 |  |
|  |  |  |  | $\widetilde{p}_{0,2}$ | $\mathbf{4 , 8}$ |  |

Figure 4. The Send_Tables for $S L A_{0,0}^{(2)}\left[1: L_{0}\right]$ and $S L A_{0,1}^{(2)}\left[1: L_{1}\right]$.
of $S L A_{i, 0}^{(n)}\left[1: L_{0}\right], S L A_{i, 1}^{(n)}\left[1: L_{1}\right], \ldots, S L A_{i, n-1}^{(n)}\left[1: L_{n-1}\right]$, respectively. Since the source processors of $D L A_{j, 0}^{(n)}\left[1: L_{0}\right], D L A_{j, 1}^{(n)}\left[1: L_{1}\right], \ldots, D L A_{j, n-1}^{(n)}\left[1: L_{n-1}\right]$ are known, we can construct the Receive_Tables for $D L A_{j, 0}^{(n)}\left[1: L_{0}\right], D L A_{j, 1}^{(n)}\left[1: L_{1}\right], \ldots$, $D L A_{j, n-1}^{(n)}\left[1: L_{n-1}\right]$. Based on the Receive_Tables, we can unpack array elements in the received messages to their appropriate local array positions.

### 4.3. Performance comparisons of $B B C$ and $C D C$

The complete-dimension calculation technique has higher indexing cost than that of the basic-block calculation technique because it constructs larger Send_Tables and Receive_Tables. However, the complete-dimension calculation technique provides more packing/unpacking information than the basic-block calculation technique. It may have lower packing/unpacking costs. Therefore, there is a performance tradeoff between the indexing and packing/unpacking overheads of these two techniques. In this section, we derive a theoretical model to analyze the tradeoff between these two methods.

Given a $\mathrm{BC}\left(s_{0}, s_{1}, \ldots, s_{n-1}\right) \rightarrow \mathrm{BC}\left(t_{0}, t_{1}, \ldots, t_{n-1}\right)$ redistribution on an $n$ dimensional array $A^{(n)}$ over $M^{(n)}$, the computational cost for an algorithm to perform the redistribution, in general, can be modeled as follows:

$$
\begin{equation*}
T_{\text {comp }}=T_{\text {indexing }}+T_{(\text {un) packing }} \tag{9}
\end{equation*}
$$

where $T_{\text {indexing }}$ is the time for an algorithm to compute the source/destination processors of local array elements. $T_{(u n) \text { packing }}$ is the time to pack/unpack array elements. We said that $T_{\text {indexing }}$ and $T_{\text {(un)packing }}$ is the indexing and packing/unpacking time of an algorithm to perform a redistribution, respectively. In the following discussion, since the sending phase and the receiving phase have the same time complexity, we only construct a model for the send phase. We will first construct a model for two-dimensional array redistribution, then, extend the model to multidimensional array redistribution.

Given a $\mathrm{BC}\left(s_{0}, s_{1}\right) \rightarrow \mathrm{BC}\left(t_{0}, t_{1}\right)$ redistribution on a two-dimensional array $A\left[1: n_{0}, 1: n_{1}\right]$ over $M\left[m_{0}, m_{1}\right]$, the indexing time of the basic-block calculation and the complete-dimension calculation can be modeled as follows,

$$
\begin{align*}
& T_{\text {indexing }}(B B C)=O\left(B C_{0}\right)+O\left(B C_{1}\right)  \tag{10}\\
& T_{\text {indexing }}(C D C)=O\left(L_{0}\right)+O\left(L_{1}\right) \tag{11}
\end{align*}
$$

where $B C_{k}$ is the size of basic-cycle in each dimension; $L_{k}$ is the local array size in each dimension, $L_{k}=\left(n_{k} / m_{k}\right), k=0,1$.

In the basic-block calculation technique, the Send_Tables only store the indices of local array elements in the first basic-cycle. A processor needs to calculate the stride distance when it packs local array elements that are in the rest of basic-cycles into messages. Assume that array elements were packed in a row-major manner. The time for a processor to pack array elements to messages in each row is $O\left(L_{1} / B C_{1}\right)$,
where $L_{1} / B C_{1}$ is the number of basic-cycles in dimension 1 . Since there are $L_{0}$ rows in a local array, the time for a processor to pack array elements in dimension 1 to messages is $O\left(\left(L_{0} \times L_{1}\right) / B C_{1}\right)$. Since a processor packs local array elements to messages in a row-major manner, the time for a processor to pack array elements in dimension 0 to messages is $O\left(L_{0} / B C_{0}\right)$. Therefore, the time for a processor to pack array elements to messages can be modeled as follows,

$$
\begin{equation*}
T_{\text {(un)packing }}(B B C)=O\left(\frac{L_{0} \times L_{1}}{B C_{1}}\right)+O\left(\frac{L_{0}}{B C_{0}}\right) \tag{12}
\end{equation*}
$$

In the complete-dimension calculation technique, the Send_Tables store the indices of local array elements in $S L A_{i, 0}^{(n)}\left[1: L_{0}\right]$ and $S L A_{i, 1}^{(n)}\left[1: L_{1}\right]$. According to the Send_Tables, a processor can pack local array elements into messages directly. It does not need to calculate the stride distance when it packs array elements that are not in the first basic-cycle.

According to Equations 9 to 12, the computation time of the complete-dimension calculation is less than that of the basic-block calculation technique if and only if the following equation is true.

$$
\begin{align*}
T_{\text {comp }}(C D C) & <T_{\text {comp }}(B B C) \Leftrightarrow O\left(L_{0}+L_{1}\right) \\
& <O\left(B C_{0}+B C_{1}+\frac{L_{0} \times L_{1}}{B C_{1}}+\frac{L_{0}}{B C_{0}}\right) \tag{13}
\end{align*}
$$

By truncating $B C_{0}, B C_{1}$ and $L_{0} / B C_{0}$ from $T_{\text {comp }}(B B C)$, we obtain the following equation:

$$
\begin{equation*}
T_{\text {comp }}(C D C)<T_{\text {comp }}(B B C) \Leftrightarrow O\left(L_{0}+L_{1}\right)<O\left(\frac{L_{0} \times L_{1}}{B C_{1}}\right) \tag{14}
\end{equation*}
$$

Given a $\mathrm{BC}\left(s_{0}, s_{1}, \ldots, s_{n-1}\right) \rightarrow \mathrm{BC}\left(t_{0}, t_{1}, \ldots, t_{n-1}\right)$ redistribution on an $n$ dimensional array $A^{(n)}=A\left[1: n_{0}, 1: n_{1}, \ldots, 1: n_{n-1}\right]$ over $M^{(n)}=M\left[m_{0}, m_{1}, \ldots\right.$, $m_{n-1}$ ], according to Equation 14, the computation time of the complete-dimension calculation is less than that of the basic-block calculation technique if and only if the following equation is true.

$$
\begin{align*}
T_{\text {comp }}(C D C) & <T_{\text {comp }}(B B C) \Leftrightarrow O\left(L_{0}+L_{1}+\cdots+L_{n-1}\right) \\
& <O\left(\frac{L_{n-1}}{B C_{n-1}} \times L_{n-2} \times \cdots \times L_{0}\right) \tag{15}
\end{align*}
$$

From Equation 15, we can evaluate the tradeoff between the indexing and the packing/unpacking overheads. The performance of the basic-block calculation and the complete-dimension calculation techniques can be also predicted by Equation 15.

## 5. Performance evaluation and experimental results

To evaluate the performance of the basic-block calculation and the completedimension calculation techniques, we have implemented these two techniques along with the PITFALLS method [23] and the Prylli's method [21] on an IBM SP2 parallel machine. All algorithms were written in the single program multiple data (SPMD) programming paradigm with $\mathrm{C}+$ MPI codes. To get the experimental results, we used different redistribution as test samples. For these redistribution samples, we roughly classify them into three types.

- Dimension shift redistribution:
$\mathrm{Ex}: \mathrm{BC}(x, y)$ to $\mathrm{BC}(y, x)$ of two-dimensional arrays, and $\mathrm{BC}(x, y, z)$ to $\mathrm{BC}(y, z, x)$ of three-dimensional arrays, where $x, y$ and $z$ are positive integers.
- Refinement redistribution:
$\mathrm{Ex}: \mathrm{BC}(x, y)$ to $\mathrm{BC}(x / p, y / q)$ of two-dimensional arrays, and $\mathrm{BC}(x, y, z)$ to $\mathrm{BC}(x / p, y / q, z / r)$ of three-dimensional arrays, where $p, q$ and $r$ are factors of $x, y$ and $z$, respectively.
- Block-cyclic redistribution:

Ex: (BLOCK, BLOCK) to (CYCLIC, CYCLIC) of two-dimensional arrays, and (BLOCK, BLOCK, BLOCK) to (CYCLIC, CYCLIC, CYCLIC) of three-dimensional arrays.

Table 1 shows the execution time of these four algorithms to perform a $\operatorname{BC}(5,8)$ to $\mathrm{BC}(8,5)$ (i.e., dimension shift) redistribution with fixed array size on different numbers of processors. From Table 1, we can see that the indexing time of the basic-block calculation technique is independent of the number of processors. The indexing time of the Prylli's method and the PITFALLS method depends on the number of processors. When the number of processors increases, the indexing time of the Prylli's method and the PITFALLS method increases as well. The indexing time of the complete-dimension calculation technique decreases when the number of processors increases. The reason is that when the array size is fixed and the number of processors is increased, the number of array elements that will be processed by the complete-dimension calculation technique decreases.

For the same test sample, the complete-dimension calculation technique has smaller packing/unpacking time than that of other methods. The reason is that the complete-dimension calculation technique provides more packing/unpacking information than other methods. This packing/unpacking information allows the complete-dimension calculation technique to pack/unpack array elements directly. Other methods need to spend time to calculation stride distance of array elements when packing/unpacking array elements. The packing/unpacking time of the basic-block calculation technique, the PITFALLS method and the Prylli's method are similar.
Table 1. The time $(\mathrm{ms})$ of four algorithms to execute a $\mathrm{BC}(5,8)$ to $\mathrm{BC}(8,5)$ redistribution on different number of processors with fixed array size $\left(N_{0}, N_{1}\right)=(400,640)$

| Processor grid | Prylli's |  |  | PITFALLS |  |  | BBC |  |  | $C D C$ |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | Indexing | Packing/ unpacking | Total | Indexing | Packing/ unpacking | Total | Indexing | Packing/ unpacking | Total | Indexing | Packing/ unpacking | Total |
| $8 \times 2$ | 1.498 | 25.617 | 70.930 | 2.236 | 26.423 | 72.63 | 1.396 | 25.120 | 68.408 | 3.648 | 20.954 | 66.675 |
| $8 \times 3$ | 2.038 | 18.604 | 59.882 | 3.110 | 18.442 | 60.226 | 1.386 | 18.781 | 56.085 | 3.489 | 15.894 | 55.856 |
| $8 \times 4$ | 2.381 | 12.558 | 50.664 | 4.166 | 12.893 | 49.955 | 1.403 | 11.077 | 46.21 | 3.013 | 10.171 | 48.053 |
| $8 \times 5$ | 2.445 | 11.346 | 41.245 | 4.585 | 11.637 | 43.96 | 1.383 | 9.771 | 38.43 | 2.607 | 8.945 | 42.022 |
| $8 \times 6$ | 3.748 | 8.226 | 31.417 | 5.175 | 8.259 | 34.378 | 1.388 | 7.179 | 28.538 | 2.241 | 6.542 | 31.824 |
| $8 \times 7$ | 4.492 | 6.389 | 22.372 | 5.421 | 6.351 | 23.221 | 1.394 | 5.304 | 18.914 | 2.152 | 4.869 | 21.555 |

All of these four methods use asynchronous communication schemes. Therefore, the computation and the communication overheads can be overlapped. However, the basic-block calculation and the complete-dimension calculation techniques unpack any received messages in the receiving phase while the PITFALLS and the Prylli's methods unpack messages in a specific order. Therefore, in general, we can expect that the communication time of the basic-block calculation and the completedimension calculation techniques is less than or equal to that of the PITFALLS and the Prylli's methods.

From Table 1, we can see that the complete-dimension calculation technique has the smallest execution time when the number of processors is less than or equal to $24(8 \times 3)$. The basic-block calculation technique has the smallest execution time when the number of processors is greater than or equal to $32(8 \times 4)$. These phenomena match the theoretical analysis given in Equation 15. We also observe that the execution time of the basic-block calculation technique is smaller than that of the PITFALLS and the Prylli's methods for all test samples.

Table 2 shows the performance of these four algorithms to execute a $\operatorname{BC}(10,20)$ to $\mathrm{BC}(5,10)$ (i.e., Refinement) redistribution with fixed array size on different numbers of processors. From Table 2, we have similar observations as those described for Table 1.

Table 3 shows the execution time of these four algorithms to perform a (BLOCK, BLOCK) to (CYCLIC, CYCLIC) redistribution. In this case, the Send_Tables and Receive_Tables constructed by the basic-block calculation technique and the complete-dimension calculation technique are the same. Therefore, they have almost the same execution time. The execution time of both methods for this redistribution is less than that of the PITFALLS method and the Prylli's method.

Table 4 shows the performance of these four algorithms to execute these three redistribution with various array size on a processor grid $M[8,7]$. From Table 4, for the $\operatorname{BC}(5,8)$ to $\operatorname{BC}(8,5)$ and $\operatorname{BC}(10,20)$ to $\operatorname{BC}(5,10)$ redistribution, we can see that the execution time of the complete-dimension calculation technique is less than that of the basic-block calculation technique for all test samples. The reason can be explained by Equation 15. Moreover, the execution time of both methods is less than that of the PITFALLS method and the Prylli's method for all test samples.

For the (BLOCK, BLOCK) to (CYCLIC, CYCLIC) redistribution, the execution time of these four algorithms has the order $T_{\text {exec }}(C D C) \approx T_{\text {exec }}(B B C) \ll$ $T_{\text {exec }}$ (Prylli's $)<T_{\text {exec }}($ PITFALLS $)$. In this case, the PITFALLS method and the Prylli's method have very large execution time compared to that of the $B B C$ method and the $C D C$ method. The reason is that each processor needs to find out all intersections between source and destination distribution with all other processors in the PITFALLS and the Prylli's methods. The computation time of the PITFALLS and the Prylli's methods depends on the number of intersections. In this case, there are $N_{0} / m_{0}+N_{1} / m_{1}$ intersections between each source and destination processor. Therefore, a processor needs to compute $\left\lfloor N_{0} / m_{0}\right\rfloor \times m_{0}+\left\lfloor N_{1} / m_{1}\right\rfloor \times m_{1}$ intersections which demands a lot of computation time when $N_{0}$ and $N_{1}$ are large.

Table 5 shows the performance of these four algorithms to execute different redistribution on three-dimensional arrays. Each redistribution with various array
Table 2. The time $(\mathrm{ms})$ of four algorithms to execute a $\mathrm{BC}(10,20)$ to $\mathrm{BC}(5,10)$ redistribution on different number of processors with fixed array size $\left(N_{0}, N_{1}\right)=(400,640)$

| Processor grid | Prylli's |  |  | PITFALLS |  |  | BBC |  |  | $C D C$ |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | Indexing | Packing/ unpacking | Total | Indexing | Packing/ unpacking | Total | Indexing | Packing/ unpacking | Total | Indexing | Packing/ unpacking | Total |
| $8 \times 2$ | 1.414 | 24.140 | 69.143 | 2.138 | 26.089 | 71.119 | 1.126 | 23.137 | 66.149 | 3.128 | 20.100 | 64.112 |
| $8 \times 3$ | 2.142 | 17.233 | 58.265 | 3.252 | 18.162 | 60.207 | 1.165 | 16.229 | 54.290 | 2.545 | 14.178 | 53.210 |
| $8 \times 4$ | 2.241 | 12.583 | 49.585 | 4.148 | 12.345 | 49.438 | 1.141 | 11.263 | 44.584 | 2.453 | 10.350 | 47.426 |
| $8 \times 5$ | 2.409 | 11.136 | 40.191 | 4.523 | 10.556 | 43.100 | 1.178 | 8.131 | 37.932 | 2.120 | 8.553 | 41.105 |
| $8 \times 6$ | 3.125 | 8.148 | 30.465 | 5.122 | 8.140 | 34.133 | 1.162 | 6.551 | 27.079 | 1.710 | 6.305 | 30.405 |
| $8 \times 7$ | 4.220 | 6.302 | 21.868 | 5.508 | 6.216 | 23.317 | 1.156 | 5.004 | 17.156 | 1.413 | 4.868 | 20.733 |

Table 3. The time ( ms ) of four algorithms to execute a (BLOCK, BLOCK) to (CYCLIC, CYCLIC) redistribution on different number of processors with fixed array size $\left(N_{0}, N_{1}\right)=(400,640)$

| Processor grid | Prylli's |  |  | PITFALLS |  |  | BBC |  |  | $C D C$ |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | Indexing | Packing/ unpacking | Total | Indexing | Packing/ unpacking | Total | Indexing | Packing/ unpacking | Total | Indexing | Packing/ unpacking | Total |
| $8 \times 2$ | 5.093 | 26.105 | 74.802 | 6.112 | 26.860 | 76.758 | 3.242 | 19.479 | 63.530 | 3.673 | 19.516 | 63.572 |
| $8 \times 3$ | 5.121 | 19.113 | 62.119 | 6.126 | 19.907 | 63.766 | 3.170 | 13.511 | 52.643 | 3.132 | 13.593 | 52.557 |
| $8 \times 4$ | 5.149 | 13.138 | 53.171 | 6.134 | 13.105 | 52.115 | 2.114 | 9.636 | 43.857 | 2.174 | 9.621 | 43.121 |
| $8 \times 5$ | 5.231 | 11.310 | 44.248 | 6.301 | 11.152 | 46.208 | 2.184 | 7.250 | 35.190 | 2.272 | 7.166 | 35.201 |
| $8 \times 6$ | 5.363 | 8.551 | 33.348 | 6.515 | 8.237 | 36.399 | 1.396 | 5.482 | 25.416 | 1.510 | 5.336 | 25.328 |
| $8 \times 7$ | 5.903 | 7.108 | 24.943 | 6.603 | 7.685 | 25.729 | 1.804 | 3.984 | 16.894 | 1.934 | 3.962 | 16.644 |

Table 4. The time (ms) of different algorithms to execute different redistribution on a two-dimensional array with various array size on a 56 -node $\mathrm{SP} 2,\left(N_{0}, N_{1}\right)=(1200,1600)$

|  | Prylli's | PITFALLS | BBC | CDC |
| :--- | ---: | ---: | ---: | ---: |
| Array size |  | BC(5, 8) to BC $(8,5)$ |  |  |
| $\left(N_{0}, N_{1}\right)$ | 28.367 | 35.202 | 27.771 | 26.763 |
| $\left(2 N_{0}, 2 N_{1}\right)$ | 144.630 | 150.013 | 130.002 | 123.407 |
| $\left(3 N_{0}, 3 N_{1}\right)$ | 321.218 | 335.736 | 317.212 | 297.565 |
| $\left(4 N_{0}, 4 N_{1}\right)$ | 511.111 | 534.277 | 503.035 | 489.143 |


| $\mathrm{BC}(10,20)$ to $\mathrm{BC}(5,10)$ |  |  |  |  |
| :--- | ---: | ---: | ---: | ---: |
| $\left(N_{0}, N_{1}\right)$ | 27.326 | 33.454 | 27.277 | 25.968 |
| $\left(2 N_{0}, 2 N_{1}\right)$ | 144.408 | 168.581 | 134.247 | 120.319 |
| $\left(3 N_{0}, 3 N_{1}\right)$ | 327.077 | 342.153 | 305.005 | 291.011 |
| $\left(4 N_{0}, 4 N_{1}\right)$ | 518.172 | 539.914 | 508.474 | 484.268 |

(BLOCK, BLOCK) to (CYCLIC, CYCLIC)

| $\left(N_{0}, N_{1}\right)$ | 29.545 | 32.238 | 24.565 | 24.192 |
| :--- | ---: | ---: | ---: | ---: |
| $\left(2 N_{0}, 2 N_{1}\right)$ | 150.153 | 153.357 | 135.406 | 135.497 |
| $\left(3 N_{0}, 3 N_{1}\right)$ | 451.118 | 491.118 | 402.924 | 402.799 |
| $\left(4 N_{0}, 4 N_{1}\right)$ | 931.347 | 1045.838 | 566.802 | 565.324 |

Table 5. The time (ms) of different algorithms to execute different redistribution on a three-dimensional array with various array size on a 56 -node $\mathrm{SP} 2,\left(N_{0}, N_{1}, N_{2}\right)=(120,180,160)$

|  | Prylli's | PITFALLS | $B B C$ | $C D C$ |
| :---: | :---: | :---: | :---: | :---: |
| Array size | $\mathrm{BC}(5,10,20)$ to $\mathrm{BC}(10,20,5)$ |  |  |  |
| $\left(N_{0}, N_{1}, N_{2}\right)$ | 50.961 | 52.100 | 45.476 | 44.423 |
| $\left(2 N_{0}, 2 N_{1}, 2 N_{2}\right)$ | 236.156 | 240.086 | 229.271 | 225.303 |
| $\left(3 N_{0}, 3 N_{1}, 3 N_{2}\right)$ | 409.062 | 427.057 | 361.258 | 343.309 |
| $\left(4 N_{0}, 4 N_{1}, 4 N_{2}\right)$ | 910.413 | 973.718 | 869.111 | 807.249 |
| $\mathrm{BC}(10,20,30)$ to $\mathrm{BC}(1,2,3)$ |  |  |  |  |
| $\left(N_{0}, N_{1}, N_{2}\right)$ | 51.319 | 51.292 | 49.134 | 43.952 |
| $\left(2 N_{0}, 2 N_{1}, 2 N_{2}\right)$ | 244.283 | 255.721 | 238.697 | 227.676 |
| $\left(3 N_{0}, 3 N_{1}, 3 N_{2}\right)$ | 445.187 | 469.731 | 410.987 | 368.073 |
| $\left(4 N_{0}, 4 N_{1}, 4 N_{2}\right)$ | 812.320 | 873.900 | 750.708 | 631.445 |
| (BLOCK, BLOCK, BLOCK) to (CYCLIC, CYCLIC, CYCLIC) |  |  |  |  |
| $\left(N_{0}, N_{1}, N_{2}\right)$ | 61.545 | 77.990 | 37.964 | 37.414 |
| $\left(2 N_{0}, 2 N_{1}, 2 N_{2}\right)$ | 363.723 | 383.345 | 250.983 | 249.725 |
| $\left(3 N_{0}, 3 N_{1}, 3 N_{2}\right)$ | 552.444 | 623.724 | 326.750 | 326.750 |
| $\left(4 N_{0}, 4 N_{1}, 4 N_{2}\right)$ | 1411.378 | 1493.714 | 918.662 | 918.226 |

size on a processor grid $M[2,4,7]$ with 56 processors were tested. From Table 5, we have similar observations as those described for Table 4.

## 6. Conclusions and future work

In many scientific applications, array redistribution is usually required to enhance data locality and reduce remote memory access on distributed memory multicomputers. Since the redistribution is performed at run-time, there is a performance tradeoff between the efficiency of the new data decomposition for a subsequent phase of an algorithm and the cost of redistributing data among processors. In this paper, we have presented efficient algorithms for performing multi-dimensional array redistribution. Based on the basic-cycle calculation technique, we presented a basic-block calculation technique and a complete-dimension calculation technique. In these two methods, the Send_Tables and the Receive_Tables are used to store the packing/unpacking information of a redistribution. From the information of Send_Tables and Receive_Tables, we can efficiently perform a $\operatorname{BC}\left(s_{0}, s_{1}, \ldots, s_{n-1}\right)$ to $\operatorname{BC}\left(t_{0}, t_{1}, \ldots, t_{n-1}\right)$ redistribution of multidimensional arrays. The theoretical model shows that the $B B C$ method has smaller indexing costs and performs well for the redistribution with small array size. The $C D C$ method has smaller packing/unpacking costs and performs well when the array size is large. The experimental results also show that our algorithms can provide better performance than the PITFALLS method and the Prylli's method.

Our techniques can only handle dense arrays and In-core programs. There are some possible extensions could be made. One of the issues would be to consider out-of-core external array redistribution. Another important future research direction would be to investigate the redistribution techniques in irregular scientific computation programs. it would also be interesting to consider the array redistribution of sparse arrays.

## Acknowledgments

The work of this paper was partially supported by NSC of R.O.C. under contract NSC-87-2213-E035-011.

## References

1. S. Benkner. Handling block-cyclic distribution arrays in Vienna Fortran 90. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Limassol, Cyprus, pp. 224-253, June 1995.
2. B. Chapman, P. Mehrotra, H. Moritsch, and H. Zima. Dynamic data distribution in Vienna Fortran. In Proceedings of Supercomputing '93, pp. 284-293, November 1993.
3. S. Chatterjee, J. R. Gilbert, F. J. E. Long, R. Schreiber, and S.-H. Teng. Generating local address and communication sets for data parallel programs. Journal of Parallel and Distributed Computing, 26:72-84, 1995.
4. Y.-C. Chung, C.-H. Hsu, and S.-W. Bai. A basic-cycle calculation technique for efficient dynamic data redistribution. IEEE Transactions on Parallel and Distributed Systems, 9(4):359-377, April 1998.
5. F. Desprez, J. Dongarra, and A. Petitet. Scheduling block-cyclic array redistribution. IEEE Transactions on Parallel and Distributed Systems, 9(2):192-205, February 1998.
6. G. Fox, S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, C.-W. Tseng, and M. Wu. Fortran-D language specification. Technical Report TR-91-170, Department of Computer Science, Rice University, December 1991.
7. S. K. S. Gupta, S. D. Kaushik, C.-H. Huang, and P. Sadayappan. On the generation of efficient data communication for distributed-memory machines. Proceedings of International Computing Symposium, pp. 504-513, 1992.
8. S. K. S. Gupta, S. D. Kaushik, C.-H. Huang, and P. Sadayappan. On compiling array expressions for efficient execution on distributed-memory machines. Journal of Parallel and Distributed Computing, 32:155-172, 1996.
9. High Performance Fortran Forum. High performance Fortran language specification, version 1.1, Rice University, November 1994.
10. S. Hiranandani, K. Kennedy, J. Mellor-Crammey, and A. Sethi. Compilation technique for block-cyclic distribution. In Proceedings of the ACM International Conference on Supercomputing, pp. 392-403, July 1994.
11. C.-H. Hsu and Y.-C. Chung. Efficient methods for $k r \rightarrow r$ and $r \rightarrow k r$ array redistribution. The Journal of Supercomputing, 12(2):253-276, May 1998.
12. E. T. Kalns and L. M. Ni. Processor mapping technique toward efficient data redistribution. IEEE Transactions on Parallel and Distributed Systems, 6(12):1234-1247, December 1995.
13. L. M. Ni, H. Xu, and E. T. Kalns. Issues in scalable library design for massively parallel computers. In Supercomputing '93, pp. 181-190, November 1993.
14. S. D. Kaushik, C. H. Huang, R. W. Johnson, and P. Sadayappan. An approach to communication efficient data redistribution. In Proceedings of the International Conference on Supercomputing, pp. 364-373, July 1994.
15. S. D. Kaushik, C. H. Huang, J. Ramanujam, and P. Sadayappan. Multiphase array redistribution: modeling and evaluation. In Proceedings of the International Parallel Processing Symposium, pp. 441-445, 1995.
16. S. D. Kaushik, C. H. Huang, and P. Sadayappan. Efficient index set generation for compiling HPF array statements on distributed-memory machines. Journal of Parallel and Distributed Computing, 38:237-247, 1996.
17. K. Kennedy, N. Nedeljkovic, and A. Sethi. Efficient address generation for block-cyclic distribution. In Proceedings of the International Conference on Supercomputing, Barcelona, pp. 180-184, July 1995.
18. P.-Z. Lee and W. Y. Chen. Compiler techniques for determining data distribution and generating communication sets on distributed-memory multicomputers. In 29th IEEE Hawaii International Conference on System Sciences, Maui, Hawaii, pp. 537-546, January 1996.
19. Y. W. Lim, P. B. Bhat, and V. K. Prasanna. Efficient algorithms for block-cyclic redistribution of arrays. In Proceedings of the Eighth IEEE Symposium on Parallel and Distributed Processing, pp. 74-83, 1996.
20. Y. W. Lim, N. Park, and V. K. Prasanna. Efficient algorithms for multi-dimensional block-cyclic redistribution of arrays. In Proceedings of the 26th International Conference on Parallel Processing, pp. 234-241, 1997.
21. L. Prylli and B. Touranchean. Fast runtime block cyclic data redistribution on multiprocessors. Journal of Parallel and Distributed Computing, 45:63-72, August 1997.
22. S. Ramaswamy and P. Banerjee. Automatic generation of efficient array redistribution routines for distributed memory multicomputers. In Frontier '95: The Fifth Symposium on the Frontiers of Massively Parallel Computation, McLean, Va., pp. 342-349, February 1995.
23. S. Ramaswamy, B. Simons, and P. Banerjee. Optimization for efficient array redistribution on distributed memory multicomputers. Journal of Parallel and Distributed Computing, 38:217-228, 1996.
24. J. M. Stichnoth, D. O'Hallaron, and T. R. Gross. Generating communication for array statements: design, implementation, and evaluation. Journal of Parallel and Distributed Computing, 21:150-159, 1994.
25. R. Thakur, A. Choudhary, and G. Fox. Runtime array redistribution in HPF programs. In Proceedings of the 1994 Scalable High Performance Computing Conference, pp. 309-316, May 1994.
26. R. Thakur, A. Choudhary, and J. Ramanujam. Efficient algorithms for array redistribution. IEEE Transactions on Parallel and Distributed Systems, 7(6):587-594, June 1996.
27. A. Thirumalai and J. Ramanujam. Efficient computation of address sequence in data parallel programs using closed forms for basis vectors. Journal of Parallel and Distributed Computing, 38:188-203, 1996.
28. V. Van Dongen, C. Bonello, and C. Freehill. High performance C-language specification version 0.8.9. Technical Report CRIM-EPPP-94/04-12, 1994.
29. C. Van Loan. Computational Frameworks for the Fast Fourier Transform. SIAM, Philadelphia, Pa., 1992.
30. D. W. Walker and S. W. Otto. Redistribution of BLOCK-CYCLIC data distributions using MPI. Concurrency: Practice and Experience, 8(9):707-728, November 1996.
31. A. Wakatani and M. Wolfe. A new approach to array redistribution: strip mining redistribution. In Proceeding of Parallel Architectures and Languages Europe, July 1994.
32. A. Wakatani and M. Wolfe. Optimization of array redistribution for distributed memory multicomputers (short communication). Parallel Computing, 21(9):1485-1490, September 1995.
33. H. Zima, P. Brezany, B. Chapman, P. Mehrotra, and A. Schwald. Vienna Fortran-a language specification version 1.1. ICASE Interim Report 21, ICASE NASA Langley Research Center, Hampton, Va., March 1992.
