Theoretical Overview
Suppose we have a set of data consisting of ordered pairs and we suspect the x and y coordinates are related. It is natural to try to find the best line that fits the data points. If we can find this line, then we can use it to make all sorts of other predictions. In this project, we're going to use several functions to find this line using a technique called least squares regression. The result will be what we call the least squares regression line (or LSRL for short).
In order to do this, you'll need to program a statistical computation called the correlation coefficient, denoted by r in statistical symbols:
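The equation itself does not appear in this text. Assuming the usual sample correlation coefficient, which matches the mean and standard deviation functions required below, it would read as follows, where n is the number of data points, x̄ and ȳ are the sample means, and s_x and s_y are the sample standard deviations:

$$
r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)
$$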
NOTE: The equation is written with the index i starting at 1, but Python lists start at index 0, so the sum runs over indices 0 through n - 1 in your code.
Once you have the correlation coefficient, you use it along with the sample means and sample standard deviations of the x- and y-coordinates to compute the slope and y-intercept of your regression line via these formulas:
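The formulas themselves are not reproduced in this text. Assuming the standard least squares results, with b as the slope and a as the y-intercept of the line y = a + bx (the symbols a and b are labels chosen here, not taken from the assignment), they would be:

$$
b = r \, \frac{s_y}{s_x}, \qquad a = \bar{y} - b \, \bar{x}
$$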
Tasks:
In this project, you must read the x- and y-coordinate pairs in from a data file of unknown length. Each line in the file must contain both coordinates, separated by whitespace, as shown here. In addition, you must use functions in this project, splitting the work up into smaller components and reinforcing your skills with parameter passing and lists.
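A minimal sketch of the input step, assuming each line holds exactly one x value and one y value as numbers. The name readfile comes from the test code in this assignment; skipping blank or incomplete lines is an assumption, not a stated requirement.

```python
def readfile(filename):
    """Read whitespace-separated x y pairs into two parallel lists."""
    x_vals, y_vals = [], []
    with open(filename) as infile:
        # Iterating over the file handles any number of lines.
        for line in infile:
            # split() with no argument splits on any run of spaces or tabs.
            fields = line.split()
            # Assumption: silently skip blank or incomplete lines.
            if len(fields) < 2:
                continue
            x_vals.append(float(fields[0]))
            y_vals.append(float(fields[1]))
    return x_vals, y_vals
```

Because the lists are parallel, x_vals[i] and y_vals[i] always belong to the same data point.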
You are required to create the following functions:
| # | Role | Function’s Objective | Input Parameters | Output | Return Values |
|---|---|---|---|---|---|
| 1 | Input | Read the input file and store the x- and y-coordinates in parallel lists | N/A | N/A | List of x-coordinates; list of y-coordinates |
| 2 | Process | Compute the mean of the data set | List of data | N/A | The mean of the data in the list |
| 3 | Process | Compute the standard deviation of the data set | List of data; the mean of the data | N/A | The standard deviation of the data in the list |
| 4 | Process | Compute the correlation coefficient. Call the mean and standard deviation functions where needed. | List of x-coordinates; list of y-coordinates | N/A | The correlation coefficient of the input lists |
| 5 | Process | Compute the slope. Call the mean, standard deviation, and correlation coefficient functions where needed. | List of x-coordinates; list of y-coordinates | N/A | The slope of the line |
| 6 | Process | Compute the y-intercept. Call the mean, standard deviation, and correlation coefficient functions where needed. | List of x-coordinates; list of y-coordinates | N/A | The y-intercept of the line |
| 7 | Output | Display a mathematical representation of a line to the screen | The y-intercept of the line; the slope of the line | The y-intercept and slope of the line | N/A |
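One possible sketch of the five processing functions. It assumes the sample standard deviation (dividing by n - 1), the correlation coefficient computed as the sum of products of standardized x and y values divided by n - 1, a slope of r * s_y / s_x, and an intercept of ȳ minus slope times x̄. The names calcSlope and calcYint come from the test code in this assignment; calcMean, calcStdDev, and calcCorrelation are hypothetical names chosen for illustration.

```python
import math


def calcMean(data):
    """Hypothetical name: arithmetic mean of a list of numbers."""
    return sum(data) / len(data)


def calcStdDev(data, mean):
    """Hypothetical name: sample standard deviation (assumes n - 1 divisor)."""
    squared_diffs = sum((value - mean) ** 2 for value in data)
    return math.sqrt(squared_diffs / (len(data) - 1))


def calcCorrelation(x_vals, y_vals):
    """Hypothetical name: correlation coefficient r of two parallel lists."""
    n = len(x_vals)
    x_bar, y_bar = calcMean(x_vals), calcMean(y_vals)
    s_x, s_y = calcStdDev(x_vals, x_bar), calcStdDev(y_vals, y_bar)
    total = 0.0
    # The formula sums i = 1..n; list indices run 0..n-1.
    for i in range(n):
        total += ((x_vals[i] - x_bar) / s_x) * ((y_vals[i] - y_bar) / s_y)
    return total / (n - 1)


def calcSlope(x_vals, y_vals):
    """Slope of the regression line: r * s_y / s_x."""
    r = calcCorrelation(x_vals, y_vals)
    s_x = calcStdDev(x_vals, calcMean(x_vals))
    s_y = calcStdDev(y_vals, calcMean(y_vals))
    return r * s_y / s_x


def calcYint(x_vals, y_vals, slope=None):
    """Y-intercept: y_bar - slope * x_bar.

    The optional slope argument is an assumption made so that this works
    both with two arguments and with the three-argument call in the test code.
    """
    if slope is None:
        slope = calcSlope(x_vals, y_vals)
    return calcMean(y_vals) - slope * calcMean(x_vals)
```

For example, with x values 1, 2, 3, 4 and y values 2, 4, 5, 8, this sketch gives a slope of about 1.9 and a y-intercept of about 0.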
Use the following code as your main block; it also serves as the test harness:
if __name__ == "__main__":
    # Expected results from calculations
    expectedSlopes = [-0.59, 3884.98, -6.24]
    expectYIntercepts = [1173.21, -25433.81, 152.06]
    resultCount = 0  # Used for looping through results
    # Rename these files to be your 3 input files; may need full path
    for f in ["data1.txt", "data2.txt", "data3.txt"]:
        # Read in data
        x_vals, y_vals = readfile(f)
        # Calculate the slope
        slope = round(calcSlope(x_vals, y_vals), 2)
        assert slope == expectedSlopes[resultCount], "Got {} but expected slope of {}".format(slope, expectedSlopes[resultCount])
        # Calculate the y-intercept
        y_int = round(calcYint(x_vals, y_vals, slope), 2)
        assert y_int == expectYIntercepts[resultCount], "Got {} but expected Y Intercept of {}".format(y_int, expectYIntercepts[resultCount])
        # Output the regression line
        output_line(y_int, slope)
        resultCount += 1
NOTE: The statistics module can be used to find the mean and standard deviation.
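As a sketch of what that could look like, the two helpers below wrap statistics.mean and statistics.stdev from the Python standard library. The function names calcMean and calcStdDev are hypothetical; statistics.stdev computes the sample standard deviation (n - 1 divisor), whereas statistics.pstdev would give the population version.

```python
import statistics


def calcMean(data):
    """Hypothetical name: mean of the data via the statistics module."""
    return statistics.mean(data)


def calcStdDev(data, mean):
    """Hypothetical name: sample standard deviation via the statistics module.

    Passing the precomputed mean as xbar avoids recalculating it.
    """
    return statistics.stdev(data, xbar=mean)


data = [2.0, 4.0, 4.0, 6.0]
mean = calcMean(data)
print(mean, calcStdDev(data, mean))
```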
Sample Screen Output
Regression line: y = 1166.93 + -0.586788x
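A minimal sketch of the output function that reproduces the line above. The name output_line and its argument order come from the test code; the "g" format, which keeps up to six significant digits, is an assumption, since the assignment does not state how many digits to print.

```python
def output_line(y_int, slope):
    """Print the regression line in the form y = intercept + slope x."""
    # Assumption: the "g" format gives up to six significant digits.
    print("Regression line: y = {:g} + {:g}x".format(y_int, slope))


output_line(1166.93, -0.586788)
```

Note that the test code passes values already rounded to two decimal places, so the printed line there will show fewer digits than this sample.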
Testing:
When you are finished, test your program with four different input files:
- Data File 1
- Data File 2 (density in pounds per cubic foot vs. stiffness in pounds per square inch of particleboards; taken from p. 391 of Probability and Statistics for Scientists and Engineers, 6th ed., Walpole/Myers/Myers)
- Data File 3 (daily rainfall in 0.01 cm vs. air pollution particulate removed in mcg/cum; taken from p. 365 of Walpole/Myers/Myers)
- A data file you've created yourself. Ideally this will be something in the context of your major. Provide information on where the data came from.
What to Submit:
- The code
- Sample runs for each Data File.
- Data File 4 and a brief description of where you found it and what the data represents.