On Microprocessors, Memory Hierarchies and Amdahl’s Law

David O’Neal
CSM Site Lead
Aeronautical Systems Center MSRC

The National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign

oneal@ncsa.edu
Origins

- Research was initiated in 1997 while working on a paper describing fundamental differences between Cray PVP and MPP architectures.
- Core methods were determined just prior to leaving Carnegie Mellon for UIUC.
- The recent decommissioning of the C916 coupled with the acquisition of a 64-processor Compaq machine regarded by some as a replacement platform prompted me to dredge it up in order to test the indicated 4:1 purchase ratio of DEC EV6 microprocessors to the Cray proprietary design.

Goals

- Originally developed for estimating cache efficiencies of codes designed to run on the Cray T3D and early versions of the Cray T3E.
- Formalize, analyze, and validate the method in order to determine its suitability for (i) making throughput comparisons and (ii) estimating factors used to convert service unit allocations between platforms designed around dissimilar processors.
Amdahl’s Method

The classical form of Amdahl's Law (1967) can be stated as follows: if the serial component of an algorithm accounts for $1 / S$ of its execution time, then the maximum speedup that can be attained by running it on a parallel computer is $S$.

Let $T$ represent execution time as a function of the processor count $P$ and problem size $N$. Then given a sequential code that scales as $N + N^2$, any number of forms can be written.

Examples:
1. $T = N + N^2 / P$  
   partition $O(N^2)$ part, replicate $O(N)$ part.
2. $T = (N + N^2) / P + 100$  
   partition all computations; add fixed cost.
3. $T = (N + N^2) / P + P^2 / 2$  
   partition all computations; add $O(P^2)$ cost of data communications.
Amdahl’s Law for Cache-based Microprocessors

A variation of Amdahl's Law can be written in terms of the ratio of cycle counts required to complete data fetch operations from memory and cache. Consider a basic hierarchy comprised of a CPU, a cache structure, and a monolithic memory system. In this setting, the ratio of fetch times for memory, $t_M$, and cache, $t_C$, resident data determine speedup. The supposition is that every code requires memory-to-processor bandwidth that will eventually limit its performance on a cache-based system. If this part of the code were to suddenly fit into cache, the potential for performance improvement would be $S$. Other definitions we will need to represent the method include:

- $f_C$: fraction of time spent operating on cached data (cache efficiency)
- $f_M$: fraction of time spent operating on memory resident data, or $1-f_C$
- $T_N$: total time required to perform $N$ operations
- $P_N$: performance rate in operations per unit time, or $N / T_N$
- $S$: ratio of data fetch times $t_M / t_C$ (speedup)
Amdahl’s Law for Cache-based Microprocessors

An expression for the time required to perform \( N \) fetch operations on register-sized data of arbitrary residence can be written immediately from the definitions:

\[
T_N = N \left( f_C \, t_C + f_M \, t_M \right)
\]

Substitution of referenced identities followed by a rearrangement of terms yields:

\[
T_N = t_M \, N \left( 1 - f_C \left( S - 1 \right) / S \right)
\]

Dividing both sides by \( T_N \) and then solving for \( P_N \) produces a relationship for estimating absolute performance:

\[
P_N = \left( \frac{t_M \left( 1 - f_C \left( S - 1 \right) / S \right)}{S} \right)^{-1}
\]

Without loss of generality, we can normalize the time unit with respect to \( t_C \) which implies \( t_M = S \) and an expression for estimating relative performance emerges:

\[
p_N = \left( S - f_C \left( S - 1 \right) \right)^{-1}
\]

A related formula for cache efficiency can be written by solving the relative performance function for \( f_C \):

\[
f_C = \left( S - 1 / p_N \right) / \left( S - 1 \right)
\]
• **Analysis**

The ratio of fetch times ($S$) for a given architecture has a significant effect on the relationship between relative performance ($p_N$) and cache efficiency ($f_C$) as illustrated in Fig. 1.

![Figure 1. Relative performance as a function of cache efficiency for selected speedups](image)
Analysis (continued)

The curves of Fig. 2 characterize cache efficiency as a function of relative performance, i.e. the inverse of the relative performance function.

Figure 2. Cache efficiency as a function of relative performance for selected speedups
• **Analysis (continued)**

Finally, consider the curves presented by Fig. 3. in which relative performance is regarded as a function of $S$. This view provides further indication of the way in which speedup affects relative performance. If a speedup improvement ($S \rightarrow 1$) were to be implemented somehow, the least efficient codes would clearly benefit the most. In general, this suggests that it is easier to extract higher levels of relative performance from architectures characterized by lower speedup values.

![Figure 3. Relative performance as a function of speedup for selected cache efficiencies](image-url)
Consideration of the Network Design

A final item worth mentioning is the potential effect of networking design on the performance of the memory channel. As shown by Fig. 4, some of the most popular distributed shared memory (DSM) machines position a network interface (NI) between the processors and memory. The transfer rate across this device cannot be neglected when studying the single processor performance of DSM machines.
Validation

Tools required to validate functions were available in the form of the `perfex` and `hinv` utilities featured by Silicon Graphics computing systems. We were also privy to accounts on similar systems featuring different CPU boards (195 MHz and 250 MHz) which allowed a basic throughput analysis to be completed as well.

<table>
<thead>
<tr>
<th>MIPS R10000</th>
<th>CPU (MHz)</th>
<th>Regs-L1 (GB/s)</th>
<th>L1</th>
<th>L1-L2 (GB/s)</th>
<th>L2</th>
<th>L2 Clock (MHz)</th>
<th>L2-Mem (GB/s)</th>
<th>FPU</th>
<th>FPU mult add</th>
<th>Peak (MFlop/s)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Rev 2.6</td>
<td>195</td>
<td>1.56</td>
<td>32</td>
<td>32</td>
<td>1.06</td>
<td>128</td>
<td>4</td>
<td>130</td>
<td>0.45</td>
<td>R10010</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>Rev 0.0</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>1</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>1</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>390</td>
</tr>
<tr>
<td>Rev 3.4</td>
<td>250</td>
<td>2.00</td>
<td>32</td>
<td>32</td>
<td>1.63</td>
<td>128</td>
<td>4</td>
<td>200</td>
<td>0.47</td>
<td>R999999</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>Rev 0.0</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>1</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>1</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>500</td>
</tr>
<tr>
<td>Δ</td>
<td>28%</td>
<td>28%</td>
<td>0%</td>
<td>0%</td>
<td>54%</td>
<td>0%</td>
<td>0%</td>
<td>54%</td>
<td>4%</td>
<td>Δ</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>0%</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>0%</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>28%</td>
</tr>
</tbody>
</table>

Table 1. Physical characteristics and performance data for the R10000 microprocessor
Validation (continued)

In Table 2, a test value for cache efficiency is established and the corresponding relative and absolute performance figures are developed for comparison purposes. Values for $t_C$ and $t_M$ were derived from the transfer rates given in Table 1. The effect of increasing the clock rate without an accompanying increase in memory bandwidth is clearly indicated by the increase in speedup and the accompanying decrease in relative performance.

<table>
<thead>
<tr>
<th>MIPS R10000</th>
<th>CPU (MHz)</th>
<th>Flops per cycle</th>
<th>Peak (MFlop/s)</th>
<th>$t_C$ (cycles)</th>
<th>$t_M$ (cycles)</th>
<th>S (speedup)</th>
<th>$f_C$ (%)</th>
<th>$p_N$ (%)</th>
<th>$P_N$ (MFlop/s)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Rev 2.6</td>
<td>195</td>
<td>2</td>
<td>390</td>
<td>1.0</td>
<td>3.5</td>
<td>3.5</td>
<td>50%</td>
<td>44.8%</td>
<td>175</td>
</tr>
<tr>
<td>Rev 3.4</td>
<td>250</td>
<td>2</td>
<td>500</td>
<td>1.0</td>
<td>4.3</td>
<td>4.3</td>
<td>50%</td>
<td>38.1%</td>
<td>190</td>
</tr>
<tr>
<td>Δ</td>
<td>28%</td>
<td>0%</td>
<td>28%</td>
<td>0%</td>
<td>-23%</td>
<td>-23%</td>
<td>Δ</td>
<td>-15%</td>
<td>9%</td>
</tr>
</tbody>
</table>

Table 2. R10000 performance data (alternate units) and throughput comparison for $f_C = 0.5$
Validation (continued)

Table 3 summarizes measurements and estimates for two simple benchmarking codes. The horner example evaluates a 12th degree polynomial. The vector triad code evaluates \( a_i = k a_i + b_i \). Note that the peak rate associated with the triad benchmark reflects the two loads must be performed to complete a multadd operation which corresponds to one flop per cycle.

<table>
<thead>
<tr>
<th>R10000 Rev 2.6</th>
<th>Bench Applications</th>
<th>Memory Statistics (perfex)</th>
<th>Estimated Error</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>L1 (hit rate)</td>
<td>L2 (hit rate)</td>
</tr>
<tr>
<td>N = 1E3</td>
<td>horner</td>
<td>0.9708</td>
<td>-</td>
</tr>
<tr>
<td></td>
<td>vector triad</td>
<td>0.8000</td>
<td>-</td>
</tr>
<tr>
<td>N = 1E6</td>
<td>horner</td>
<td>0.3360</td>
<td>-</td>
</tr>
<tr>
<td></td>
<td>vector triad</td>
<td>0.1520</td>
<td>-</td>
</tr>
</tbody>
</table>

Table 3. 195 MHz R10000 benchmarking statistics and estimates
• **Validation (continued)**

In an effort to confirm the results shown by Table 3, a view of the faster R10000 chip is provided by Table 4. As expected, all of the estimates exceed measured values.

<table>
<thead>
<tr>
<th>R10000 Rev 3.4</th>
<th>Bench Applications</th>
<th>Memory Statistics (perfex)</th>
<th>Estimated Error</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>L1 (hit rate)</td>
<td>L2 (hit rate)</td>
</tr>
<tr>
<td>N = 1E3</td>
<td>homer</td>
<td>0.9740</td>
<td>-</td>
</tr>
<tr>
<td></td>
<td>vector triad</td>
<td>0.8050</td>
<td>-</td>
</tr>
<tr>
<td>N = 1E6</td>
<td>homer</td>
<td>0.4000</td>
<td>-</td>
</tr>
<tr>
<td></td>
<td>vector triad</td>
<td>0.1650</td>
<td>-</td>
</tr>
</tbody>
</table>

Table 4. 250 MHz R10000 benchmarking statistics and estimates
We’ll note in closing that a sensitivity study of the relative performance and cache efficiency functions has not yet been completed. However, results for the larger horner cases in particular suggest that the method is more susceptible to error at lower values of $f_c$. The curves of Fig. 3 serve to confirm this hypothesis. At the indicated speedup values, diminishing cache efficiencies correlate to steeper rates of change.

Figure 3. Relative performance as a function of speedup for selected cache efficiencies
Summary

- The difference between the rates at which contemporary microprocessors can fetch words from memory and cache can be used in conjunction with an estimated (or targeted) performance figure to approximate the corresponding cache efficiency. Conversely, an estimated (or targeted) cache efficiency can be used to predict relative performance.

- Hardware design data for a similar study of the IBM POWER2 and POWER3 microprocessors has been gathered and we are currently awaiting installation of a library developed by the RS/6000 Division that provides the same functionality as SGI's perfex utility. Validation of the method on other platforms is considered to be essential.
Summary (continued)

- Increases in caching efficiency can be realized externally through vendor-initiated improvements in hardware or software design, but for fixed set of conditions, the burden of affecting enhancements in cache utilization must be taken on by the user community. Any presumption that implies a need to achieve higher cache efficiencies must consider the amount of work required to accomplish the task. In any case, unrealistic efficiencies in the form of near-peak performance levels should never be used for the purpose of evaluating equipment throughput.

- The set of curves presented within clearly show that the production of efficient code becomes increasingly difficult as the difference between memory and cache fetch rates grows. However, as processor clock rates are quickly approaching physical limits, we expect that this situation will reverse itself. In the mean time, larger caches, improved logic, and additional layers of memory hierarchy must serve to mask imbalances.