Please answer with typed source code.
Question 1:
Suppose we wish to write a procedure that computes the inner product of two vectors u and v. An abstract version of the function has a CPE of 14–18 with x86-64 for different types of integer and floating-point data. By doing the same sort of transformations we did to transform the abstract program combine1 into the more efficient combine4, we get the following code:
Our measurements show that this function has CPEs of 1.50 for integer data and 3.00 for floating-point data. For data type double, the x86-64 assembly code for the inner loop is as follows:
Assume that the functional units have the characteristics listed in Figure 5.12.
(See last page for figures.)
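The code listing the question refers to is not reproduced on this page. As a rough sketch only, a combine4-style inner product might look like the following. It assumes plain arrays with an explicit length; the names inner4 and data_t and the array-based signature are illustrative assumptions, not the original listing.

```c
#include <stddef.h>

/* Hypothetical element type; the question varies this between
   integer and floating-point types. */
typedef double data_t;

/* Inner product that accumulates in a local variable (combine4 style),
   so the loop body touches memory only to load udata[i] and vdata[i]. */
void inner4(const data_t *udata, const data_t *vdata, size_t length,
            data_t *dest)
{
    data_t sum = (data_t) 0;
    for (size_t i = 0; i < length; i++) {
        /* The add into sum is the only loop-carried data dependency. */
        sum = sum + udata[i] * vdata[i];
    }
    *dest = sum;
}
```

Each iteration performs two loads, one multiply, and one add; only the add feeds the next iteration.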
A. Diagram how this instruction sequence would be decoded into operations and show how the data dependencies between them would create a critical path of operations, in the style of textbook Figures 5.13 and 5.14.
1. vmovsd -> Get udata(i)
2. vmovsd -> Load vdata(i)
3. vmulsd -> Multiply
4. vaddsd -> Add to sum
5. Increment i
6. Compare limit
B. For data type double, what lower bound on the CPE is determined by the critical path?
C. Assuming similar instruction sequences for the integer code as well, what lower bound on the CPE is determined by the critical path for integer data?
D. Explain how the floating-point versions can have CPEs of 3.00, even though the multiplication operation requires 5 clock cycles.
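The multiplications are not on the critical path. Each product udata[i] * vdata[i] depends only on the two loads for that element, not on the previous product, so the pipelined multiplier can start a new multiplication every cycle while earlier ones are still in flight. The only loop-carried data dependency is the addition into sum, and floating-point addition has a latency of 3 cycles, which accounts for the CPE of 3.00. Having more than one functional unit that can perform floating-point multiplication raises throughput further, but it does not shorten this critical path.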
Question 2:
Write a version of the inner product procedure described in Question 1 that uses 6 × 6 loop unrolling. Our measurements for this function with x86-64 give a CPE of 1.06 for integer data and 1.01 for floating-point data.
What factor limits the performance to a CPE of 1.00?
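One possible shape for a 6 × 6 version is sketched below: six elements per iteration, each feeding its own accumulator, so that six independent add chains can overlap. The function name inner6x6, the data_t typedef, and the array-based signature are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical element type, as in the question's data_t. */
typedef double data_t;

/* 6 x 6 unrolling: six elements per iteration, six separate accumulators. */
void inner6x6(const data_t *udata, const data_t *vdata, size_t length,
              data_t *dest)
{
    size_t i = 0;
    data_t sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0, sum5 = 0;

    /* Main loop: runs while a full group of six elements remains. */
    for (; i + 5 < length; i += 6) {
        sum0 += udata[i]     * vdata[i];
        sum1 += udata[i + 1] * vdata[i + 1];
        sum2 += udata[i + 2] * vdata[i + 2];
        sum3 += udata[i + 3] * vdata[i + 3];
        sum4 += udata[i + 4] * vdata[i + 4];
        sum5 += udata[i + 5] * vdata[i + 5];
    }

    /* Finish any remaining elements one at a time. */
    for (; i < length; i++) {
        sum0 += udata[i] * vdata[i];
    }

    *dest = (sum0 + sum1) + (sum2 + sum3) + (sum4 + sum5);
}
```

On the limiting factor, a likely answer is load throughput: every element needs two loads (udata[i] and vdata[i]), so if the machine has two load units that each start one load per cycle, as the textbook's reference machine is assumed to here, the CPE cannot drop below 1.00 however many accumulators are used.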
Question 3:
Write a version of the inner product procedure described in Question 1 that uses 6 × 1a loop unrolling to enable greater parallelism. Our measurements for this function give a CPE of 1.10 for integer data and 1.05 for floating-point data.
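A 6 × 1a version could be sketched as follows: one accumulator, with the six products of each iteration combined among themselves before the single add into sum. The name inner6x1a, the data_t typedef, and the array-based signature are illustrative assumptions. For floating-point data this reassociation can change rounding slightly compared with the strictly sequential sum.

```c
#include <stddef.h>

/* Hypothetical element type, as in the question's data_t. */
typedef double data_t;

/* 6 x 1a unrolling: six elements per iteration, one accumulator,
   with the adds reassociated so only one add per iteration depends
   on the previous value of sum. */
void inner6x1a(const data_t *udata, const data_t *vdata, size_t length,
               data_t *dest)
{
    size_t i = 0;
    data_t sum = (data_t) 0;

    /* Main loop: runs while a full group of six elements remains. */
    for (; i + 5 < length; i += 6) {
        /* t does not depend on sum, so it can be computed while the
           previous iteration's add into sum is still in flight. */
        data_t t = (udata[i]     * vdata[i]     + udata[i + 1] * vdata[i + 1])
                 + (udata[i + 2] * vdata[i + 2] + udata[i + 3] * vdata[i + 3])
                 + (udata[i + 4] * vdata[i + 4] + udata[i + 5] * vdata[i + 5]);
        sum = sum + t;
    }

    /* Finish any remaining elements one at a time. */
    for (; i < length; i++) {
        sum = sum + udata[i] * vdata[i];
    }

    *dest = sum;
}
```

The loop-carried chain is now one add per six elements, which is why this form can approach the throughput limit with a single accumulator.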