McMaster University
Department of Mathematics and Statistics
STATS 3A03: Applied Regression Analysis with SAS
Fall 2022
SAS Lab 2: Week of September 19-23, 2022
Topics Covered in this Lab
1. PROC PLOT
2. PROC SGPLOT
3. PROC CORR
4. PROC REG
5. Find quantiles of distributions via the QUANTILE function. Find p-values via the CDF function.
1. PROC PLOT
This procedure creates scatter plots. The basic form of the PROC is
PROC PLOT DATA=lab2.iris;
plot petal_length*sepal_length;
run;
In the PLOT statement, the first variable specified goes on the vertical (Y) axis and the second
goes on the horizontal (X) axis. The SAS default is to use the letters A through Z as plotting
symbols. An A is plotted when there is only one point at (or very close to) the plotting position.
When two points need to be plotted on top of each other, a B is printed, and so on. Rather than
using these characters for plots, we could use * or + by modifying the plot statement:
PROC PLOT DATA=lab2.iris;
plot petal_length*sepal_length="*";
run;
All plots are required to have a title and labels. The SAS default is to print out the variable
names as the axis labels. This is rarely useful. The LABEL statement can be used to override
this default.
PROC PLOT DATA=lab2.iris;
plot petal_length*sepal_length="*";
title "Iris Data Plot";
label petal_length="Petal Length";
label sepal_length="Sepal Length";
run;
2. PROC SGPLOT
The PLOT procedure is the basic plotting method in SAS. The output from PLOT is part of the
regular SAS output, so it often does not look great. A better approach is to use the SGPLOT
procedure, which produces a separate graphics plot. This plot can be saved as a number of
different types of image files for later use. Plots produced by PROC SGPLOT are usually much
nicer than those produced by PROC PLOT. Here is an example:
PROC SGPLOT DATA=lab2.iris;
scatter x=sepal_length y=petal_length;
run;
Since there are two species of flowers in the dataset, we can use group=species to give
them different colours.
PROC SGPLOT DATA=lab2.iris;
scatter x=sepal_length y=petal_length / group=species;
title "Iris Data Plot";
label petal_length="Petal Length";
label sepal_length="Sepal Length";
run;
The SGPLOT procedure produces a variety of graphs including bar charts, scatter plots,
and line graphs.
We can plot the scatterplot with the fitted line for the Iris data in the
following way.
PROC SGPLOT DATA=lab2.iris;
title "Scatterplot with Regression Line: Iris Data";
reg y=petal_length x=sepal_length;
run;
3. PROC CORR
This is the procedure to find sample correlation coefficients and to test
the null hypothesis that the population correlation is 0.
PROC CORR DATA=lab2.iris;
VAR sepal_length petal_length petal_width;
run;
The output from this includes basic summary statistics on each variable in the VAR statement.
The correlations are displayed in a matrix. The diagonal elements are always 1 and the
off-diagonal elements give the observed sample correlation coefficient and the p-value for the
hypothesis test (Lecture 4).
An alternative way that is sometimes useful and gives a more compact output is:
PROC CORR DATA=lab2.iris;
VAR sepal_length;
with petal_length petal_width;
run;
This will correlate each of the variables in the VAR statement with each variable in the WITH
statement.
4. PROC REG
This is the basic regression procedure in SAS.
PROC REG DATA=lab2.iris PLOTS=none;
title "Simple Linear Regression Model -- Iris Data";
model petal_length =sepal_length;
run;
The displayed output includes the Analysis of Variance (ANOVA) table, some important
quantities such as the Root MSE ($\hat{\sigma}$), $R^2$ and the adjusted $R^2$, and the table of
parameter estimates, their standard errors, the t-statistic for testing if the population
parameter is 0, and the p-value of the two-sided test.
We can also use PROC REG to construct a dataset that includes such things as fitted values
and the corresponding residuals for each point. This is often very useful for subsequent
analysis of the fitted model. The OUTPUT statement will do this as follows:
PROC REG DATA=lab2.iris plots=none;
model petal_length =sepal_length ;
output out=lab2.iris_predicted
predicted=fitted
residual=residuals;
run;
The out=lab2.iris_predicted option gives the name of the output dataset, and this dataset will
be created in the lab2 library. The predicted=fitted option tells SAS to store the predicted
values in a column called fitted in the output dataset, and residual=residuals tells it to save
the residuals in a column called residuals in the output dataset. The names after the = can be
any valid name, but those before must be exactly as given above. Later we will see other things
that can be added to this dataset. Note that the original data columns are automatically
included in the output dataset as well.
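As one illustration of the subsequent analysis mentioned above, the following sketch plots the residuals against the fitted values. It assumes the output dataset lab2.iris_predicted and the column names fitted and residuals created by the OUTPUT statement above. The title text and the horizontal reference line at 0 are additions for illustration, not part of the lab.

```sas
/* Sketch: residuals vs fitted values, using the dataset created by OUTPUT. */
/* Assumes lab2.iris_predicted contains the columns fitted and residuals.   */
PROC SGPLOT DATA=lab2.iris_predicted;
  scatter x=fitted y=residuals;
  refline 0 / axis=y;   /* horizontal line at residual = 0 */
  title "Residuals vs Fitted Values: Iris Data";
  label fitted="Fitted Value";
  label residuals="Residual";
run;
```

If the linear model is reasonable, the points should scatter around the reference line without an obvious pattern.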
We can save the estimates and also get confidence intervals for the intercept and slope in a
new dataset using the outest= and tableout options in the PROC REG statement:
PROC REG DATA=lab2.iris PLOTS=none outest=lab2.iris_est tableout;
model petal_length =sepal_length /alpha=0.01;
run;
quit;
We can then examine this dataset as any other dataset to extract the estimates, standard
errors, p-values, t-statistics, and $100(1-\alpha)\%$ confidence intervals for the intercept and
each X variable.
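A simple way to examine the saved estimates is to print the dataset. The sketch below assumes the dataset lab2.iris_est created by the outest= option above. With tableout, each row is usually identified by the _TYPE_ column (for example PARMS, STDERR, T, PVALUE, and lower and upper confidence limits such as L99B and U99B when alpha=0.01). The exact labels may differ in your output, so check what is actually printed.

```sas
/* Sketch: display the estimates dataset written by outest= and tableout. */
/* Assumes lab2.iris_est exists from the PROC REG step above.             */
PROC PRINT DATA=lab2.iris_est;
  title "Parameter Estimates and Confidence Limits: Iris Data";
run;
```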
5. Find quantiles of distributions via the QUANTILE function. Find p-values via the CDF function.
The QUANTILE function returns the quantile from a distribution when you specify the left
probability (CDF). The "T" indicates that we use a t distribution. We then specify the left
probability (e.g. 0.01). For a t distribution, we also need to put in the degrees of freedom
(6 and 5 here).
data quantile;
q1=QUANTILE("T", 0.01,6);
q2=QUANTILE("T", 0.025,5);
run;
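The data step above stores the quantiles but does not display them. The sketch below shows one way to compute the upper critical value needed for a two-sided $100(1-\alpha)\%$ confidence interval and then print it. The dataset name quantile_ci and the variable names alpha, df, and t_crit are hypothetical, and the values alpha = 0.01 and df = 6 are chosen only for illustration.

```sas
/* Sketch: upper critical value for a two-sided confidence interval. */
/* alpha and df are illustrative values, not taken from a dataset.   */
data quantile_ci;
  alpha = 0.01;
  df = 6;
  /* left probability 1 - alpha/2 gives the upper critical value */
  t_crit = QUANTILE("T", 1 - alpha/2, df);
run;

PROC PRINT DATA=quantile_ci;
run;
```

Because the t distribution is symmetric about 0, QUANTILE("T", alpha/2, df) returns the same value with a negative sign.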
We can find the p-value using the CDF function. Again, the first argument specifies the
distribution you wish to use, the second argument is the quantile, and the rest are model
parameters. Please note that, since the t distribution is symmetric, we find the extreme
probability on one side and then multiply by two.
data pvalue;
p1=(1-CDF("T",3.14,6))*2;
p2=(1-CDF("T",2.57,5))*2;
run;
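The calculation above works as written when the observed t-statistic is positive. A slightly more general sketch takes the absolute value first, so that the same formula also handles a negative t-statistic. The dataset name pvalue_general and the variable names t_stat, df, and p are hypothetical, and the value -3.14 is only for illustration.

```sas
/* Sketch: two-sided p-value that works for either sign of the t-statistic. */
/* t_stat and df are illustrative values.                                   */
data pvalue_general;
  t_stat = -3.14;
  df = 6;
  /* upper-tail probability beyond |t|, doubled for the two-sided test */
  p = 2*(1 - CDF("T", abs(t_stat), df));
run;

PROC PRINT DATA=pvalue_general;
run;
```

By symmetry, this gives the same p-value as p1 in the data step above.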
Exercise:
(a) Retrieve the dataset oldfaithful from Lab 1.
(b) Draw a scatterplot of the data.
(c) Calculate the sample correlation r.
(d) Perform a linear regression to see how well waiting time can be used to predict the
eruption time and find the estimated intercept, slope, and the estimate of the error
standard deviation $\sigma$.
(e) What is the p-value for the test of $H_0: \beta_1 = 0$ vs $H_1: \beta_1 \neq 0$?
(f) Give a 99% confidence interval for the slope.