The cheat sheet is available here.
I have benefited a lot from the UCLA SAS tutorial, especially the chapter on regression diagnostics. However, the content on that webpage seems to be outdated. A great thing about PROC REG is that it creates a clean and concise 3×3 plot panel for residual analysis.
I created a cheat sheet to help interpret the diagnostics panel. Thanks to the ESS project, the BASEBALL data set used by SAS is publicly available. I borrowed this data set as an example, and the cheat sheet also contains the data and the SAS program. The regression model attempts to predict baseball players’ salaries from their performance statistics. The plot panel can be partitioned into four functional zones:
OLS assumption check
The three OLS assumptions are essential for the linear regression estimators to be BLUE (best linear unbiased estimators). However, the residual plot in the top-left panel has a funnel-like shape, a sign of heteroscedasticity, which in practice is usually corrected by a transformation.
Strictly speaking, normality is not required for linear regression. However, most people want to see t-tests, F-tests, or p-values, which rely on the normality of the residuals. The histogram and Q-Q plot in the bottom-left are good references.
Outlier and influential points check
The three top-right plots can be used to identify unusual data points by leverage, Cook’s D, and studentized residuals (RStudent).
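For reference, the three statistics behind these plots can be written in standard textbook notation. The symbols are not from the original post: $e_i$ is the $i$-th residual, $h_{ii}$ its leverage, $p$ the number of parameters, $s^2$ the mean squared error, and $s_{(i)}$ the root mean squared error with observation $i$ left out.

```latex
h_{ii} = x_i^\top (X^\top X)^{-1} x_i
\qquad
D_i = \frac{e_i^2}{p\, s^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}
\qquad
t_i = \frac{e_i}{s_{(i)} \sqrt{1 - h_{ii}}}
```

Common rules of thumb, which may differ from the reference lines in a given panel, flag points with $h_{ii} > 2p/n$, $D_i > 4/n$, or $|t_i| > 2$.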
Rick Wicklin has a thorough introduction to the fit-mean plot. We can also look at the R-square in the bottom-right plot. If the linearity assumption is not well satisfied, SAS/STAT has a few powerful procedures to handle non-linearity and improve the fit, such as the recent ADAPTIVEREG procedure (see a diagram in my previous post).
There are still a few other concerns that need to be addressed for linear regression, such as multicollinearity (diagnosed by the VIF and other options in PROC REG) and overfitting (PROC GLMSELECT now weighs in).
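As a minimal sketch of those two checks: the data set `sashelp.baseball` and the macro variable `demo_vars` below are assumptions made so the example stands on its own. They stand in for the BASEBALL data and the regressor list used in this post. The variance inflation factor for regressor $j$ is $1/(1 - R_j^2)$, where $R_j^2$ comes from regressing that variable on all the others, so large values point to collinear predictors.

```sas
/* Sketch only: sashelp.baseball and demo_vars are assumptions,
   standing in for the BASEBALL data and regressors used in this post. */
%let demo_vars = nAtBat nHits nHome nRuns nRBI nBB YrMajor
                 CrAtBat CrHits CrHome CrRuns CrRbi CrBB;

/* Multicollinearity: VIF, tolerance, and collinearity diagnostics */
proc reg data=sashelp.baseball;
   model salary = &demo_vars / vif tol collin;
run;
quit;

/* Overfitting: LASSO selection, choosing the model by 5-fold cross validation */
proc glmselect data=sashelp.baseball seed=1;
   model salary = &demo_vars / selection=lasso(choose=cv stop=none)
                               cvmethod=random(5);
run;
```

The career totals (the `Cr*` variables) overlap heavily with each other, so they are a natural place to look for inflated VIF values.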
The REG procedure in SAS solves for the parameters with the normal equations rather than gradient descent, which makes it an ideal tool for linear regression diagnostics.
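To make the contrast concrete, here are the two approaches in standard textbook notation. The symbols are assumptions, not taken from the SAS program: $X$ is the design matrix, $y$ the response vector, and $\alpha$ a step size.

```latex
% Closed-form solution of the normal equations X^T X \beta = X^T y
\hat{\beta} = (X^\top X)^{-1} X^\top y

% One gradient descent step on the least-squares loss \tfrac{1}{2}\lVert X\beta - y \rVert^2
\beta^{(k+1)} = \beta^{(k)} - \alpha\, X^\top \left( X \beta^{(k)} - y \right)
```

The closed form also yields $(X^\top X)^{-1}$, which the leverage values and standard errors used in the diagnostics are built on.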
/*I. Grab the baseball data set from the web */
filename myfile url 'https://svn.r-project.org/ESS/trunk/fontlock-test/baseball.sas';
%include myfile;
proc contents data=baseball position;
ods output position = pos;
run;
/*II. Diagnose the multiple linear regression for the players’ salaries*/
proc sql;
select variable into: regressors separated by ' '
from pos
where num between 5 and 20;
quit;
proc reg data=baseball;
model salary = &regressors;
run;
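/*III. Deal with heteroscedasticity*/
/* Assumed step: build the log-transformed response (natural log; the original base is not shown) */
data baseball_t;
set baseball;
logsalary = log(salary);
run;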
proc reg data=baseball_t;
model logsalary =