McMaster University
Department of Mathematics and Statistics
STATS 3A03: Applied Regression Analysis with SAS
Fall 2022

SAS Lab 2: Week of September 19-23, 2022

Topics Covered in this Lab

1. PROC PLOT
2. PROC SGPLOT
3. PROC CORR
4. PROC REG
5. Find quantiles of distributions via the QUANTILE function. Find p-values via the CDF function.

1. PROC PLOT

This procedure creates scatter plots. The basic form of the PROC is

    PROC PLOT DATA=lab2.iris;
      plot petal_length*sepal_length;
    run;

In the PLOT statement, the first variable specified goes on the vertical (Y) axis and the second goes on the horizontal (X) axis. By default, SAS uses the letters A through Z as plotting symbols. An A is plotted when only one point falls at (or very close to) the plotting position. When two points need to be plotted on top of each other, a B is printed, and so on. Rather than using these characters, we could use * or + by modifying the PLOT statement:

    PROC PLOT DATA=lab2.iris;
      plot petal_length*sepal_length="*";
    run;

All plots are required to have a title and axis labels. By default, SAS prints the variable names as the axis labels, which is rarely useful. The LABEL statement overrides this default.

    PROC PLOT DATA=lab2.iris;
      plot petal_length*sepal_length="*";
      title "Iris Data Plot";
      label petal_length="Petal Length";
      label sepal_length="Sepal Length";
    run;
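The PROC PLOT examples above read from lab2.iris, so the lab2 library must be assigned before they will run. A minimal sketch, assuming the dataset file sits in a folder of your own (the path below is hypothetical and must be replaced):

```sas
/* Hypothetical path: replace with the folder that holds the iris dataset */
libname lab2 "/home/username/lab2";

/* List the variables in lab2.iris to confirm their names before plotting */
proc contents data=lab2.iris;
run;
```

Running PROC CONTENTS first is a cheap way to catch a misspelled variable name before it causes an error in a PLOT statement.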
2. PROC SGPLOT

The PLOT procedure is the basic plotting method in SAS. Its output is part of the regular SAS text output, so it often does not look great. A better approach is the SGPLOT procedure, which produces a separate graphics plot. This plot can be saved in a number of image file formats for later use. Plots produced by PROC SGPLOT are usually much nicer than those produced by PROC PLOT. Here is an example:

    PROC SGPLOT DATA=lab2.iris;
      scatter x=sepal_length y=petal_length;
    run;

Since there are two species of flowers in the dataset, we can use group=species to give them different colours.

    PROC SGPLOT DATA=lab2.iris;
      scatter x=sepal_length y=petal_length / group=species;
      title "Iris Data Plot";
      label petal_length="Petal Length";
      label sepal_length="Sepal Length";
    run;

The SGPLOT procedure produces a variety of graphs, including bar charts, scatter plots, and line graphs. We can draw the scatterplot together with the fitted regression line for the Iris data in the following way.

    PROC SGPLOT DATA=lab2.iris;
      title "Scatterplot with Regression Line: Iris Data";
      reg y=petal_length x=sepal_length;
    run;

3. PROC CORR

This procedure finds sample correlation coefficients and tests the null hypothesis that the population correlation is 0.

    PROC CORR DATA=lab2.iris;
      VAR sepal_length petal_length petal_width;
    run;

The output includes basic summary statistics on each variable in the VAR statement. The correlations are displayed in a matrix. The diagonal elements are always 1, and the off-diagonal elements give the observed sample correlation coefficient and the p-value for the hypothesis test (Lecture 4). An alternative that is sometimes useful and gives more compact output is:

    PROC CORR DATA=lab2.iris;
      VAR sepal_length;
      with petal_length petal_width;
    run;
This will correlate each of the variables in the VAR statement with each variable in the WITH statement.

4. PROC REG

This is the basic regression procedure in SAS.

    PROC REG DATA=lab2.iris plots=none;
      title "Simple Linear Regression Model -- Iris Data";
      model petal_length = sepal_length;
    run;

The displayed output includes the Analysis of Variance (ANOVA) table, some important quantities such as the Root MSE (σ̂), R² and the adjusted R², and the table of parameter estimates with their standard errors, the t-statistic for testing whether the population parameter is 0, and the p-value of the two-sided test.

We can also use PROC REG to construct a dataset that includes such things as the fitted values and the corresponding residuals for each point. This is often very useful for subsequent analysis of the fitted model. The OUTPUT statement does this as follows:

    PROC REG DATA=lab2.iris plots=none;
      model petal_length = sepal_length;
      output out=lab2.iris_predicted predicted=fitted residual=residuals;
    run;

The out=lab2.iris_predicted option gives the name of the output dataset, which will be created in the lab2 library. The predicted=fitted option tells SAS to store the predicted values in a column called fitted in the output dataset, and residual=residuals tells it to save the residuals in a column called residuals. The names after the = can be any valid names, but the keywords before it must be exactly as given above. Later we will see other things that can be added to this dataset. Note that the original data columns are automatically included in the output dataset as well.
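One way to inspect the dataset written by the OUTPUT statement is sketched below. It assumes the PROC REG step with out=lab2.iris_predicted has already been run. The residual plot goes beyond what the lab asks for and is included only as an illustration of "subsequent analysis".

```sas
/* Print the first 10 rows: original columns plus the new fitted and residuals columns */
proc print data=lab2.iris_predicted(obs=10);
  var sepal_length petal_length fitted residuals;
run;

/* Illustrative residual plot: residuals against fitted values, with a reference line at 0 */
proc sgplot data=lab2.iris_predicted;
  title "Residuals vs Fitted Values: Iris Data";
  scatter x=fitted y=residuals;
  refline 0 / axis=y;
run;
```

The obs=10 dataset option only limits what is printed; the stored dataset still contains one row per observation.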

We can save the estimates, and also get confidence intervals for the intercept and slope, in a new dataset by using the outest= and tableout options in the PROC REG statement:

    PROC REG DATA=lab2.iris plots=none outest=lab2.iris_est tableout;
      model petal_length = sepal_length / alpha=0.01;
    run;
    quit;

We can then examine this dataset like any other dataset to extract the estimates, standard errors, t-statistics, p-values, and 100(1 - α)% confidence intervals for the intercept and each X variable.
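A minimal sketch of examining the saved estimates, assuming the PROC REG step with outest=lab2.iris_est and tableout has already been run:

```sas
/* Print every row of the OUTEST= dataset; the _TYPE_ column identifies each row */
proc print data=lab2.iris_est;
  title "Saved Estimates and Confidence Limits: Iris Data";
run;
```

Each row should be identified by the _TYPE_ variable, for example PARMS for the estimates and STDERR for the standard errors. With alpha=0.01 the confidence limits are expected in rows labelled along the lines of L99B and U99B, but it is worth confirming the exact labels in your own printed output.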
5. Find quantiles of distributions via the QUANTILE function. Find p-values via the CDF function.

The QUANTILE function returns the quantile of a distribution when you specify the left probability (CDF). The "T" indicates that we use a t distribution. We then specify the left probability (e.g. 0.01). For a t distribution, we also need to supply the degrees of freedom (here 6 and 5).

    data quantile;
      q1=QUANTILE("T", 0.01, 6);
      q2=QUANTILE("T", 0.025, 5);
    run;

We can find the p-value using the CDF function. Again, the first argument specifies the distribution you wish to use, the second argument is the quantile, and the rest are the distribution's parameters. Please note that, since the t distribution is symmetric, we find the extreme probability on one side and then multiply by two.

    data pvalue;
      p1=(1-CDF("T", 3.14, 6))*2;
      p2=(1-CDF("T", 2.57, 5))*2;
    run;

Exercise:

(a) Retrieve the dataset oldfaithful from Lab 1.
(b) Draw a scatterplot of the data.
(c) Calculate the sample correlation r.
(d) Perform a linear regression to see how well waiting time can be used to predict the eruption time, and find the estimated intercept, slope, and the estimate of the error standard deviation σ.
(e) What is the p-value for the test of H₀: β₁ = 0 vs H₁: β₁ ≠ 0?
(f) Give a 99% confidence interval for the slope.
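Returning to the QUANTILE and CDF functions of Section 5: the data steps there store their results without displaying them. The sketch below is an illustration only. The dataset name check and its variable names are made up for this example. It prints the values and shows that the two functions undo each other.

```sas
data check;
  /* Upper 2.5% point of a t distribution with 5 degrees of freedom (about 2.571) */
  q = QUANTILE("T", 0.975, 5);
  /* Feeding the quantile back into the CDF recovers the left probability 0.975 */
  p_left = CDF("T", q, 5);
  /* Two-sided p-value; abs() keeps the formula valid for negative statistics too */
  p_two_sided = 2*(1 - CDF("T", abs(q), 5));
run;

/* Display the computed values in the output window */
proc print data=check;
run;
```

Taking abs() of the statistic matters because (1-CDF)*2 without it would exceed 1 when the observed t-statistic is negative.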