
Question
Solve parts (c) and (d).
Q4. Floating-point numbers and arithmetic
(a) For a base-$\beta$ floating-point number system with $p$ digits of precision and exponent ranging from $L$ to $U$, find the total number of normalized floating-point numbers in terms of $\beta$, $p$, $L$, $U$.
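A closed-form count for part (a) can be sanity-checked by brute-force enumeration on a toy system. The sketch below assumes the common convention that a normalized number has a nonzero leading digit and that zero is not counted; the candidate formula $2(\beta-1)\beta^{p-1}(U-L+1)$ and the function names are illustrative, not taken from the question.

```python
from fractions import Fraction
from itertools import product


def count_normalized(beta, p, L, U):
    """Candidate closed form: sign * leading digit * remaining digits * exponents."""
    return 2 * (beta - 1) * beta ** (p - 1) * (U - L + 1)


def enumerate_normalized(beta, p, L, U):
    """Build every normalized value d0.d1...d(p-1) * beta**e exactly, with d0 != 0."""
    values = set()
    for e in range(L, U + 1):
        for lead in range(1, beta):  # the leading digit must be nonzero
            for tail in product(range(beta), repeat=p - 1):
                digits = (lead,) + tail
                mantissa = sum(Fraction(d, beta ** i) for i, d in enumerate(digits))
                value = mantissa * Fraction(beta) ** e
                values.add(value)
                values.add(-value)
    return values


# Toy system: beta = 2, p = 3, exponents -1..1
print(len(enumerate_normalized(2, 3, -1, 1)) == count_normalized(2, 3, -1, 1))
```

Some textbooks add 1 to this count to include zero, so the convention used in the course should be checked.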
(b) Let $a$ be a nonzero real number and $\mathrm{fl}(a)$ be a $k$-digit rounding approximation to $a$ in base 2. Show that the relative error of $k$-digit rounding satisfies the bound
$$\frac{|a - \mathrm{fl}(a)|}{|a|} \le 2^{-k}.$$
Hint: Look at the two cases of rounding separately.
At least how many digits would one need in order to guarantee that the rounding approximation $\mathrm{fl}(a)$ in binary has a relative error of at most $3 \times 10^{-16}$? Comment on how this relates to the number of precision digits for double-precision floating-point numbers.
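For the numerical part of (b), the smallest $k$ with $2^{-k} \le 3 \times 10^{-16}$ can be computed directly. This is a minimal sketch that takes the bound $2^{-k}$ from the question as given; `min_binary_digits` is a hypothetical helper name, and the comparison assumes the host Python uses IEEE 754 doubles.

```python
import math
import sys


def min_binary_digits(target):
    """Smallest integer k such that 2**-k <= target."""
    k = math.ceil(-math.log2(target))
    # Guard against rounding noise in log2 by checking the neighbours.
    while 2.0 ** -k > target:
        k += 1
    while k > 0 and 2.0 ** -(k - 1) <= target:
        k -= 1
    return k


k = min_binary_digits(3e-16)
print(k)                        # 52
print(sys.float_info.mant_dig)  # 53 on IEEE 754 doubles
```

Since $2^{-51} \approx 4.4 \times 10^{-16}$ is too large and $2^{-52} \approx 2.2 \times 10^{-16}$ is small enough, 52 digits suffice. If $k$ counts all significant bits, the 53-bit significand of double precision meets the requirement with one bit to spare.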
(c) Recall from the IEEE 754 standard that $\mathrm{fl}(x) = x(1+\delta)$ with $|\delta| \le \epsilon$ for some machine precision $\epsilon$. Show that
$$|\mathrm{fl}(\mathrm{fl}(x+y)+z) - (x+y+z)| \le (|x+y| + |x+y+z|)\epsilon + |x+y|\epsilon^2,$$
$$|\mathrm{fl}(x+\mathrm{fl}(y+z)) - (x+y+z)| \le (|y+z| + |x+y+z|)\epsilon + |y+z|\epsilon^2.$$
Hint: Let $\mathrm{fl}(x+y) = (x+y)(1+\delta_1)$ and expand out $\mathrm{fl}(\mathrm{fl}(x+y)+z) = (\mathrm{fl}(x+y)+z)(1+\delta_2) = \cdots$.
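The expansion the hint points to can be sketched as follows, assuming each addition introduces its own rounding factor with $|\delta_1|, |\delta_2| \le \epsilon$ and that no overflow or underflow occurs. The second bound follows from the same steps with the roles of $x+y$ and $y+z$ exchanged.

```latex
\begin{align*}
\mathrm{fl}(\mathrm{fl}(x+y)+z)
  &= \bigl((x+y)(1+\delta_1) + z\bigr)(1+\delta_2) \\
  &= (x+y+z) + (x+y)\delta_1 + (x+y+z)\delta_2 + (x+y)\delta_1\delta_2 .
\end{align*}
Subtracting $(x+y+z)$ and applying the triangle inequality:
\begin{align*}
\bigl|\mathrm{fl}(\mathrm{fl}(x+y)+z) - (x+y+z)\bigr|
  &\le |x+y|\,|\delta_1| + |x+y+z|\,|\delta_2| + |x+y|\,|\delta_1|\,|\delta_2| \\
  &\le \bigl(|x+y| + |x+y+z|\bigr)\epsilon + |x+y|\,\epsilon^2 .
\end{align*}
```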
(d) Conclude from part (c) that floating-point addition is in general not associative, i.e.
$$\mathrm{fl}(\mathrm{fl}(x+y)+z) \ne \mathrm{fl}(x+\mathrm{fl}(y+z)).$$
If $|x+y| < |y+z|$, which order of summation from part (c) will give a smaller bound for the absolute error?
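A small numerical experiment can illustrate part (d). The two bounds in (c) differ only in the term $|x+y|$ versus $|y+z|$, so when $|x+y| < |y+z|$ the order $\mathrm{fl}(\mathrm{fl}(x+y)+z)$ has the smaller bound. The sketch below assumes IEEE 754 doubles with round-to-nearest, so $\epsilon$ is taken as the unit roundoff $2^{-53}$; the function names and the sample values 0.1, 0.2, 0.3 are illustrative choices, not part of the question.

```python
from fractions import Fraction

# Assumption: each double-precision addition has relative error at most 2**-53.
EPS = Fraction(1, 2 ** 53)


def sum_left(x, y, z):
    """Computes fl(fl(x + y) + z)."""
    return (x + y) + z


def sum_right(x, y, z):
    """Computes fl(x + fl(y + z))."""
    return x + (y + z)


def abs_error(computed, x, y, z):
    """Exact absolute error against the true sum of the three stored doubles."""
    exact = Fraction(x) + Fraction(y) + Fraction(z)
    return abs(Fraction(computed) - exact)


def bound(inner, x, y, z, eps=EPS):
    """Bound from part (c); `inner` is the exact sum of the pair added first."""
    total = Fraction(x) + Fraction(y) + Fraction(z)
    return (abs(inner) + abs(total)) * eps + abs(inner) * eps ** 2


x, y, z = 0.1, 0.2, 0.3  # here |x + y| < |y + z|
left, right = sum_left(x, y, z), sum_right(x, y, z)
bound_left = bound(Fraction(x) + Fraction(y), x, y, z)
bound_right = bound(Fraction(y) + Fraction(z), x, y, z)

print(left != right)             # the two orders give different doubles
print(bound_left < bound_right)  # adding the smaller pair first gives the smaller bound
print(abs_error(left, x, y, z) <= bound_left)
print(abs_error(right, x, y, z) <= bound_right)
```

The bounds are only upper limits on the error: a smaller bound does not guarantee a smaller actual error for any particular triple.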