Homework 6 (docx, 4 pages)

School: Montclair State University
Course: 280 (Computer Science)
Date: Feb 20, 2024
Uploaded by PresidentTeamOkapi31
CSIT 545 (Fall 2023) Homework # 6
(Due December 13, 2023 by 11:59 PM on Canvas "Assignment")

1. [40 Points] Assuming a benchmark program (i.e., workload) with p% of the instructions parallelizable, please analyze the speedup of a 10-processor system vs. a 1-processor system using the following two methods:

a. [15 Points] Strong scaling (i.e., assuming a fixed workload when analyzing the two systems).

Answer: In strong scaling, the workload is fixed and we ask how much faster the same work finishes on 10 processors. Writing f = p/100 for the parallelizable fraction, the serial part (1 - f) takes the same time on both systems, while the parallel part f is divided among the 10 processors. By Amdahl's law, the speedup is

S_strong = T1 / T10 = 1 / ((1 - f) + f/10)

For example, with p = 90 (f = 0.9), S_strong = 1 / (0.1 + 0.09) ≈ 5.26.

b. [15 Points] Weak scaling (i.e., assuming the workload of the parallelizable part grows proportionally to the increase in the number of processors).

Answer: In weak scaling, the parallelizable part of the workload grows 10x, so each processor keeps the same amount of parallel work and the 10-processor run still takes about 1 time unit. The scaled workload would take (1 - f) + 10f time units on a single processor, so by Gustafson's law the speedup is

S_weak = (1 - f) + 10f

For example, with p = 90 (f = 0.9), S_weak = 0.1 + 9 = 9.1.

c. [10 points] Are the speedup results you obtained in (a) and (b) the same? If they are the same, please explain why you obtained the same results using two different methods; if they are different, do they lead to opposite conclusions about whether the performance is improved when scaling from 1 to 10 processors? Please justify your answer.

Answer: The two results are not the same, except in the degenerate cases f = 0 (both give 1) and f = 1 (both give 10). For any 0 < f < 1, the weak-scaling speedup is larger: strong scaling is limited by the fixed serial part (Amdahl's bound of 1/(1 - f) no matter how many processors are added), while weak scaling dilutes the serial part by growing the parallel work. The numerical example above shows this: 9.1 vs. about 5.26 for p = 90. However, the two methods do not lead to opposite conclusions about scaling from 1 to 10 processors: both speedups exceed 1 whenever f > 0, so both say performance improves. They simply answer different questions — strong scaling asks how much faster a fixed job finishes, while weak scaling asks how much more work can be completed in the same time.

2. [60 Points] Given the following C program:

for (i = 0; i < 50; i = i + 1)
    A[i] = A[i] + s;

We assume that the base address of array A in memory is in register x18, and variable s is in register x22.

a. [20 Points] Please write the functionally equivalent assembly code using the regular single-processor version of RISC-V, with a comment (using "//") to explain each line of the code.

Answer (assuming the elements of A are 8-byte doublewords; since the base address of A is already in x18 and s is already in x22, no loads are needed to set them up):

        mv   x19, x18            // x19 = working pointer, starting at A's base address
        li   x30, 0              // loop counter: i = 0
        li   x31, 50             // loop bound (bge compares two registers, so 50 must be in a register)
loop_start:
        bge  x30, x31, loop_end  // exit the loop when i >= 50
        ld   x24, 0(x19)         // x24 = A[i]
        add  x24, x24, x22       // x24 = A[i] + s
        sd   x24, 0(x19)         // A[i] = A[i] + s
        addi x19, x19, 8         // advance the pointer by 8 bytes (one doubleword)
        addi x30, x30, 1         // i = i + 1
        j    loop_start          // repeat
loop_end:

b. [20 Points] Please write the functionally equivalent assembly code using the vector extension of RISC-V, with a comment (using "//") to explain each line of the code. (Note: You can make assumptions about the vector length and element size for the vector processor, which can be expressed in either a comment in English or using a RISC-V instruction. Also, the RISC-V vector instruction descriptions in the page https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc may help you understand and choose the right instructions to use for this program.)
Answer (using the RISC-V vector extension with 64-bit / 8-byte elements; the code uses vsetvli-based strip-mining, so it works for any hardware vector length VLMAX):

        li   x28, 50              // number of elements left to process
        mv   x19, x18             // working pointer, starting at A's base address (given in x18)
loop_start:
        vsetvli t0, x28, e64, m1  // set vl = min(remaining, VLMAX) with 64-bit elements; t0 = vl
        vle64.v v0, (x19)         // load vl doublewords of A into vector register v0
        vadd.vx v0, v0, x22       // add the scalar s (in x22) to every element of v0
        vse64.v v0, (x19)         // store the vl updated elements back to A
        slli t1, t0, 3            // t1 = vl * 8 = number of bytes processed this iteration
        add  x19, x19, t1         // advance the pointer past the processed elements
        sub  x28, x28, t0         // decrement the count of remaining elements
        bnez x28, loop_start      // repeat until all 50 elements are done

c. [20 Points] Comparing your code in (a) and (b), please discuss at least two benefits of using a vector processor and justify your answers using the actual code you wrote. (Hint: Page 14 of Slides 8 may provide you some insights about the benefits you may consider discussing.)

Answer: Benefits of using a vector processor:

1. Data-level parallelism: a vector processor performs the same operation on many data elements at once. In the vectorized code, a single instruction (vadd.vx v0, v0, x22) adds s to up to VLMAX elements simultaneously, whereas the scalar code in (a) executes one add per element. This parallelism improves throughput.

2. Reduced instruction overhead: the scalar loop in (a) executes 7 instructions (branch, load, add, store, two increments, jump) for every single element — on the order of 350 dynamic instructions for 50 elements. The vector loop executes its 8 instructions once per strip of vl elements; with vl = 8, for example, the whole array is processed in 7 iterations, about 56 dynamic instructions. Fetching and decoding far fewer instructions reduces loop overhead and execution time.

Both benefits are visible directly in the code: one vector load/add/store triple replaces an entire strip of scalar load/add/store triples, and the loop-control instructions are amortized over vl elements instead of being paid per element.