\part{Linear Discriminant Analysis vs Random Forests}
Package: \texttt{randomForest}
\vspace*{3pt}
<>=
options(SweaveHooks=list(fig=function(){par(cex.main=1.1,
mar=c(4.1,4.1,2.6,2.1),
mgp=c(2.25,0.5,0), tck=-0.02)}))
graphics.off()
@
For linear discriminant analysis, we will use the function
\texttt{lda()} (\textit{MASS} package). Covariates are assumed to
have a common multivariate normal distribution. It may have poor
predictive power where there are complex forms of dependence on the
explanatory factors and variables. Where it is effective, it has the
virtue of simplicity.
The function \texttt{qda()} weakens the assumptions underlying
\texttt{lda()} to allow different variance-covariance matrices
for different groups within the data. This sets a limit on the
minimum group size: each group must have enough observations to
allow estimation of its own variance-covariance matrix.
Where there are two classes, generalized linear models
(\texttt{glm()}) have very similar properties to linear discriminant
analysis, while making weaker assumptions. The trade-off is that
estimated group membership probabilities are conditional on the
observed model matrix.
With all these ``linear'' methods, the model matrix can replace
columns of covariate data by a set of spline (or other) basis columns
that allow for effects that are nonlinear in the covariates. Use
\texttt{termplot()} with a \texttt{glm} object, with the argument
\texttt{smooth=panel.smooth}, to check for hints of nonlinear
covariate effects. Detection of nonlinear effects typically requires
very extensive data.
A good first check on whether these ``linear'' methods are
adequate is comparison with the highly nonparametric
analysis of the function \texttt{randomForest()}
(\textit{randomForest} package). Random Forests may do well when complex
interactions are required to explain the dependence.
Here, attention will mostly be limited to data where there
are two groups only.
\section{Accuracy for Classification Models -- the \texttt{Pima} Data}
Type \texttt{help(Pima.tr2)} (\textit{MASS} package)
to get a description of these data. They are relevant to the
investigation of conditions that may predispose to diabetes.
All the explanatory variables can be treated as continuous variables.
There are no factor columns, or columns (e.g. of 0/1 data) that might
be regarded as factors.
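A quick check of the column types, before any models are fitted:
<>=
## Confirm that all explanatory columns are numeric
library(MASS)
str(Pima.tr)
@ %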
\subsection{Fitting an lda model}
First try linear discriminant analysis, using the model formula
\texttt{type $\sim$ .}. This takes as explanatory variables
all columns of \texttt{Pima.tr} except \texttt{type}.
A first run of the calculations has \texttt{CV=TRUE}, thus using
leave-one-out cross-validation to predict class membership and
hence get an accuracy estimate. The second run of the calculations
has \texttt{CV=FALSE} (the default), allowing the use of
\texttt{predict()} to obtain (among other information) discriminant scores.
<>=
library(MASS)
PimaCV.lda <- lda(type ~ ., data=Pima.tr, CV=TRUE)
tab <- table(Pima.tr$type, PimaCV.lda$class)
## Calculate confusion matrix from cross-validation
conCV1 <- rbind(tab[1,]/sum(tab[1,]), tab[2,]/sum(tab[2,]))
dimnames(conCV1) <- list(Actual=c("No", "Yes"),
"Predicted (cv)"=c("No","Yes"))
print(round(conCV1,3))
## Now refit the model and get prediction scores
Pima.lda <- lda(type ~ ., data=Pima.tr)
## Get a training set accuracy estimate; can be highly optimistic
Pima.hat <- predict(Pima.lda)
tabtrain <- table(Pima.tr$type, Pima.hat$class)
conTrain <- rbind(tabtrain[1,]/sum(tabtrain[1,]),
                  tabtrain[2,]/sum(tabtrain[2,]))
dimnames(conTrain) <- list(Actual=c("No", "Yes"),
                           "Predicted (training)"=c("No","Yes"))
print(round(conTrain,3))
@
Notice that, here, the two accuracy measures are much the same. In
general, the training set measure can be optimistic.
Now plot the discriminant scores. As there are two groups only,
there is just one set of scores.
<>=
library(lattice)
densityplot(~Pima.hat$x, groups=Pima.tr$type)
@ %
A function that calculates the confusion matrices and overall
accuracy would be helpful:
<>=
confusion <- function(actual, predicted, names=NULL,
                      printit=TRUE, prior=NULL){
    if(is.null(names))names <- levels(actual)
    tab <- table(actual, predicted)
    acctab <- t(apply(tab, 1, function(x)x/sum(x)))
    dimnames(acctab) <- list(Actual=names,
                             "Predicted (cv)"=names)
    if(is.null(prior)){
        relnum <- table(actual)
        prior <- relnum/sum(relnum)
        acc <- sum(tab[row(tab)==col(tab)])/sum(tab)
    } else {
        acc <- sum(prior*diag(acctab))
        names(prior) <- names
    }
    if(printit)print(round(c("Overall accuracy"=acc,
                             "Prior frequency"=prior),4))
    if(printit){
        cat("\nConfusion matrix", "\n")
        print(round(acctab,4))
    }
    invisible(acctab)
}
@ %
\subsection{The model that includes first order interactions}
Is the outcome influenced by the combined effect of covariates, e.g.,
by whether they increase or decrease together? A check is to
include the effects of all products of variable values such as
\texttt{npreg*glu}, \texttt{npreg*bp}, etc. In this instance, it will
turn out that this leads to a model that is overfitted.
The model formula \verb!(a+b+c)^2! expands to
\texttt{a+b+c+a:b+a:c+b:c}. Note the following:
\begin{itemize}
\item \texttt{a:a} is the same as \texttt{a}.
\item If \texttt{a} and \texttt{b} are (different) factors, then
\texttt{a:b} is the interaction between \texttt{a} and \texttt{b},
i.e., it allows for effects that are due to the specific combination
of level of \texttt{a} with level of \texttt{b}.
\item If \texttt{a} is a factor and \texttt{b} is a variable,
the interaction \texttt{a:b} allows for different coefficients
of the variable for different levels of the factor.
\item If \texttt{a} and \texttt{b} are (different) variables, then
\texttt{a:b} is the result of multiplying \texttt{a} by \texttt{b},
element by element.
\end{itemize}
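The expansion can be checked directly from the model matrix. The
following sketch uses small synthetic data (the names \texttt{a},
\texttt{b} and \texttt{c} are illustrative only):
<>=
## Synthetic data: a is a two-level factor; b and c are variables
df <- data.frame(a=factor(c("u","v","u","v")),
                 b=c(1,2,3,4), c=c(2,1,4,3))
## Columns include a, b, c, plus the pairwise interaction columns
colnames(model.matrix(~ (a+b+c)^2, data=df))
@ %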
\noindent
\begin{fmpage}{36pc}
\exhead{1}
Try adding interaction terms to the model fitted above:
<>=
## Accuracy estimated by leave-one-out CV
PimaCV2.lda <- lda(type ~ .^2, data=Pima.tr, CV=TRUE)
confusion(Pima.tr$type, PimaCV2.lda$class)
## Now estimate "Accuracy" on training data
Pima2.hat <- predict(lda(type ~ .^2, data=Pima.tr))$class
confusion(Pima.tr$type, Pima2.hat)
@ %
Observe that the training set measure (\textit{resubstitution} accuracy
or \textit{apparent} accuracy) now substantially
exaggerates the accuracy. The model that includes all
interaction terms is in truth giving lower predictive accuracy;
it overfits.
\end{fmpage}
\subsection{Proportion correctly classified}
Consider the fit
<>=
PimaCV.lda <- lda(type ~ ., data=Pima.tr, CV=TRUE)
confusion(Pima.tr$type, PimaCV.lda$class)
@ %
The overall accuracy is estimated as 0.755. If however we keep the same
rule, but change the prior proportions of the two classes, the overall
accuracy will change. If, for example, the two classes are in the
ratio 0.9:0.1, the overall accuracy will be 0.9 $\times$ 0.8636 +
0.1 $\times$ 0.5441 $\simeq$ 0.83. The No's are easier to classify;
with a higher proportion of No's the classification accuracy increases.
However the classification rule that is optimal also changes if the
prior proportions change. The function \texttt{lda()} allows
specification of a prior, thus:
<>=
prior <- c(0.9, 0.1)
PimaCVp.lda <- lda(type ~ ., data=Pima.tr, CV=TRUE, prior=prior)
confusion(Pima.tr$type, PimaCVp.lda$class, prior=c(0.9, 0.1))
@ %
If the rule is modified to be optimal relative to the new prior
proportions, the accuracy thus increases to 0.91, approximately.
\noindent
\begin{fmpage}{36pc}
\exhead{2} Now assume prior proportions of 0.85 and 0.15. Repeat
the above calculations, i.e.
\begin{itemize}
\item Estimate the accuracy using the rule that is designed to be optimal
when the prior proportions are as in the sample.
\item Estimate the accuracy using the rule that is designed to be optimal
when the prior proportions are 0.85:0.15.
\end{itemize}
\end{fmpage}
\subsection{The ROC (receiver operating characteristic)}
This introduces the terms \textit{sensitivity} and
\textit{specificity}. With prior proportions as in the sample
(0.755:0.245), the sensitivity (true positive rate) was estimated as
0.544; this is the probability of correctly identifying a person who
is a diabetic as a diabetic. The false positive rate (1 -
Specificity) was estimated as 0.136. There is a trade-off between
sensitivity and specificity. The ROC curve, which is a plot of
sensitivity against 1 $-$ specificity (the false positive rate),
displays this trade-off graphically.
The analysis assumes that the costs of the two types of
mis-classification are equal. Varying the costs, while keeping the prior probabilities
the same, is equivalent to keeping the costs equal, but varying the
prior probabilities. The following calculation takes advantage of
this equivalence.
\noindent
\begin{fmpage}{36pc}
\exhead{3}
Run the following calculations:
<>=
truepos <- numeric(19)
falsepos <- numeric(19)
p1 <- (1:19)/20
for(i in 1:19){
p <- p1[i]
Pima.CV1p <- lda(type ~ ., data=Pima.tr, CV=TRUE, prior=c(p, 1-p))
confmat <- confusion(Pima.tr$type, Pima.CV1p$class, printit=FALSE)
falsepos[i] <- confmat[1,2]
truepos[i] <- confmat[2,2]
}
@ %
Now plot the curve.
<>=
plot(truepos ~ falsepos, type="l", xlab="False positive rate",
ylab="True positive rate (Sensitivity)")
@ %
Read off the sensitivity at a low false positive rate (e.g., 0.1),
and at a rate around the middle of the range, and comment on the
tradeoff.
\end{fmpage}
The ROC curve allows assessment of the effects of
different trade-offs between the two types of cost.
\subsection{Accuracy on test data}
There is an additional data set -- \texttt{Pima.te} -- that has been
set aside for testing. The following checks the accuracy on these ``test''
data.
<>=
Pima.lda <- lda(type ~ ., data = Pima.tr)
testhat <- predict(Pima.lda, newdata=Pima.te)
confusion(Pima.te$type, testhat$class)
@ %
This improves on the leave-one-out CV accuracy on \texttt{Pima.tr}.
The difference in the prior proportions is too small to have much
effect on the overall accuracy. The apparent improvement may be
a chance effect. Another possibility is that the
division of the data between \texttt{Pima.tr} and \texttt{Pima.te} may
not have been totally random, and \texttt{Pima.te} may have fewer hard
to classify points. There are two checks that may provide insight:
\begin{itemize}
\item Swap the roles of training and test data, and note whether
the relative accuracies are similar.
\item Repeat the calculations on a bootstrap sample of the training data,
to get an indication of the uncertainty in the accuracy assessment.
\end{itemize}
\enlargethispage{36pt}
\noindent
\begin{fmpage}{36pc}
\exhead{4}
Try the effect of swapping the role of training and test data.
<>=
swapCV.lda <- lda(type ~ ., data = Pima.te, CV=TRUE)
confusion(Pima.te$type, swapCV.lda$class)
swap.lda <- lda(type ~ ., data = Pima.te)
otherhat <- predict(swap.lda, newdata=Pima.tr)
confusion(Pima.tr$type, otherhat$class)
@ %
\end{fmpage}
Note that, again, the accuracy is greater for \texttt{Pima.te} than
for \texttt{Pima.tr}, but the difference is smaller.
\begin{fmpage}{36pc}
\exhead{5}
Now check the accuracy on a bootstrap sample:
<>=
prior <- table(Pima.tr$type)
prior <- prior/sum(prior)
index <- sample(1:dim(Pima.tr)[1], replace=TRUE)
boot.lda <- lda(type ~ ., data = Pima.tr[index, ], CV=TRUE)
cmat <- confusion(Pima.tr[index, "type"], boot.lda$class, printit=FALSE)
print(c(acc=round(prior[1]*cmat[1,1]+prior[2]*cmat[2,2],4)))
@ %
The calculations should be repeated several times. The changes in the
predictive accuracy estimates are substantial.
Note the need to relate all accuracies back to the same prior
probabilities, to ensure comparability.
Annotate the code to explain what it does.
\end{fmpage}
(From running this code five times, I obtained results of 0.77, 0.74,
0.72, 0.71 and 0.82.)
\section{Logistic regression -- an alternative to \texttt{lda}}
As the Pima data have only two classes (levels of \texttt{type})
the calculation can be handled as a regression problem, albeit
with the response on a logit scale, i.e., the linear model
predicts $\log(\frac{\pi}{1-\pi})$, where $\pi$ is the probability
of having diabetes.
\begin{fmpage}{36pc}
\exhead{6}
Fit a logistic regression model and check the accuracy.
<>=
Pima.glm <- glm(I(unclass(type)-1) ~ ., data=Pima.tr, family=binomial)
testhat <- round(predict(Pima.glm, newdata=Pima.te, type="response"))
confusion(Pima.te$type, testhat)
@ %
Compare the accuracy with that obtained from \texttt{lda()}.
A cross-validation estimate of accuracy, based on the training
data, can be obtained thus:
<>=
library(DAAG)
CVbinary(Pima.glm)
@%
This should be repeated several times. How consistent are the results?
\end{fmpage}
One advantage of \texttt{glm()} is that asymptotic
standard error estimates are available for parameter estimates:
<>=
round(summary(Pima.glm)$coef, 3)
@ %
These results suggest that \texttt{npreg}, \texttt{bp} and \texttt{skin}
can be omitted without much change to predictive accuracy. Predictive
accuracy may actually increase. There is, however, no guarantee of this,
and it is necessary to check. Even though there is individually no
detectable effect, the combined effect of two or more of them may
be of consequence.
Using this logistic regression approach, there is no built-in
provision to adjust for prior probabilities. Users can however
make their own adjustments.
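One possible adjustment (a sketch only; it assumes the model
\texttt{Pima.glm} fitted above, with an illustrative new prior of 0.1
for \texttt{Yes}) shifts each fitted value on the logit scale by the
log of the ratio of new to sample prior odds:
<>=
## Assumed approach: adjust fitted probabilities to new prior proportions
p.sample <- mean(Pima.tr$type=="Yes")  # proportion of "Yes" in the sample
p.new <- 0.1                           # illustrative new prior for "Yes"
shift <- log(p.new/(1-p.new)) - log(p.sample/(1-p.sample))
phat <- predict(Pima.glm, newdata=Pima.te, type="response")
padj <- plogis(qlogis(phat) + shift)   # adjusted probabilities
@ %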
One advantage of \texttt{glm()} is that the \texttt{termplot()}
function is available to provide a coarse check on possible nonlinear
effects of covariates. Use \texttt{termplot()} with a \texttt{glm}
object as the first argument, and with the argument
\texttt{smooth=panel.smooth}. The resulting graphs can be examined
for hints of nonlinear covariate effects. Detection of nonlinear
effects may require very extensive data.
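For example (assuming the model \texttt{Pima.glm} fitted above):
<>=
## Plot fitted terms with partial residuals and a smooth through them
par(mfrow=c(2,4))
termplot(Pima.glm, partial.resid=TRUE, smooth=panel.smooth)
par(mfrow=c(1,1))
@ %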
\section{Data that are More Challenging -- the \texttt{crx} Dataset}
The data can be copied from the web:
<>=
webpage <- "http://mlearn.ics.uci.edu/databases/credit-screening/crx.data"
webn <- "http://mlearn.ics.uci.edu/databases/credit-screening/crx.names"
test <- try(readLines(webpage)[1])
if (!inherits(test, "try-error")){
download.file(webpage, destfile="crx.data")
crx <- read.csv("crx.data", header=FALSE, na.strings="?")
download.file(webn, destfile="crx.names")
}
@ %
Column 16 is the outcome variable. Factors can be identified as follows:
<>=
if(exists("crx"))
sapply(crx, function(x)if(is.factor(x))levels(x))
@ %
These data have a number of factor columns. It will be important
to understand how they are handled.
\subsection{Factor terms -- contribution of the model matrix}
As with normal theory linear models, the matrix has an initial column
of ones that allows for a constant term. (In all rows, the relevant
parameter is multiplied by 1.0, so that the contribution to the fitted
value is the same in all rows.) For terms that correspond directly
to variables, the model matrix incorporates the variable directly
as one of its columns. With the default handling of a factor term
\begin{itemize}
\item Implicitly there is a column that corresponds to the initial
level of the factor, but as it has all elements 0 it can be
omitted;
\item For the second and any subsequent levels, the model matrix
has a column that is zeros except for rows where the factor is
at that level.
\end{itemize}
A factor that has only two levels will generate a single column,
with 0s corresponding to the first level, and 1s for the second level.
Apart from the response variable \texttt{type}, the Pima data have
no binary variables.
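This coding can be checked from the model matrix; here is a sketch
with a synthetic two-level factor:
<>=
f <- factor(c("lo","hi","lo","hi"), levels=c("lo","hi"))
model.matrix(~ f)   # intercept column, plus one 0/1 column for level "hi"
@ %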
\subsection{Fitting the model}
\noindent
\begin{fmpage}{36pc}
\exhead{7}
Now fit a linear discriminant model:
<>=
if(exists("crx")){
crxRed <- na.omit(crx)
crxCV.lda <- lda(V16 ~ ., data=crxRed, CV=TRUE)
confusion(crxRed$V16, crxCV.lda$class)
}
@ %
Note the message
\begin{verbatim}
Warning message:
In lda.default(x, grouping, ...) : variables are collinear
\end{verbatim}
Now, for comparison, fit the model using \texttt{glm()}. This is one
way to get details on the reasons for collinearity. Also, for using
\texttt{glm()}, the argument \texttt{na.action=na.exclude} is
available, which omits missing values when the model is fit, but then
places \texttt{NA}s in those positions when fitted values, predicted
values, etc., are calculated. This ensures that predicted values
match up with the rows of the original data.
<>=
if(exists("crx")){
crx.glm <- glm(V16 ~ ., data=crx, family=binomial, na.action=na.exclude)
alias(crx.glm)
confusion(crx$V16, round(fitted(crx.glm)))
summary(crx.glm)$coef
}
@ %
From the output from \texttt{alias(crx.glm)}, what can one say about
the reasons for multi-collinearity?
\end{fmpage}
Now display the scores from the linear discriminant calculations:
<>=
if(exists("crx")){
crxRed <- na.omit(crx)
crx.lda <- lda(V16 ~ ., data=crxRed)
crx.hat <- predict(crx.lda)
densityplot(~crx.hat$x, groups=crxRed$V16)
}
@ %
This plot is actually quite interesting. What does it tell you?
\section{Use of Random Forest Results for Comparison}
A good strategy is to use results from the random forests method for
comparison. The accuracy of this algorithm, when it does give a
worthwhile improvement over \texttt{lda()}, is often hard to beat.
This method has the advantage that it can be applied pretty much
automatically. It is good at handling situations where explanatory
variables and factors interact in a relatively complex manner.
Here are results for \texttt{Pima.tr} as training data, at the
same time applying predictions to \texttt{Pima.te} as test data.
Notice that there are two confusion matrices, one giving the OOB
estimates for \texttt{Pima.tr}, and the other for \texttt{Pima.te}.
<>=
library(randomForest)
Pima.rf <- randomForest(type ~ ., xtest=Pima.te[,-8], ytest=Pima.te[,8],
data=Pima.tr)
Pima.rf
@ %
Look at the OOB estimate of accuracy, which is pretty much equivalent
to a cross-validation estimate of accuracy. This error will be
similar to the error on test data that are randomly chosen from the
same population.
The accuracy is poorer than for \texttt{lda()}. As before, the error
rate is lower on \texttt{Pima.te} than on \texttt{Pima.tr}. Note
however the need to re-run the calculation several times, as the
accuracy will vary from run to run.
Here are results for \texttt{crx}.
<>=
if(exists("crxRed")){
crx.rf <- randomForest(V16 ~ ., data=crxRed)
crx.rf
}
@ %
Accuracy is similar to that from use of \texttt{lda()}.
\section{Note -- The Handling of NAs}
The assumption that underlies any analysis that omits missing values
is that, for purposes of the analysis, missingness is
uninformative. This may be incorrect, and it is necessary to ask: Are
the subjects where there are missing values different in some way?
The missing value issue is pertinent both to the Pima data and to the
\texttt{crx} data. There is a further dataset, \texttt{Pima.tr2},
that augments \texttt{Pima.tr} with 100 subjects that have missing
values in one or more of the explanatory variables. The question
then arises: Is the pattern of missingness the same for those
without diabetes as for those with diabetes?
The following shows the numbers of missing values for each of the variables
<>=
## Remove any local copy, so that the version in the MASS package is used
if(exists("Pima.tr2", where=".GlobalEnv", inherits=FALSE))
    rm(Pima.tr2)
sapply(Pima.tr2, function(x)sum(is.na(x)))
sum(!complete.cases(Pima.tr2))
@ %
Note that the variable \texttt{skin} accounts for 98 of the 100
subjects that have one or more missing values.
A first step is to check whether the subjects with one or more missing
values differ in some systematic manner from subjects with no missing
values. The major issue is for values that are missing for
\texttt{skin}. We start by creating
a new variable -- here named \texttt{complete} -- that distinguishes
subjects with missing values for \texttt{skin} from others.
We omit observations that are missing on any
of the other variables.
<>=
newPima <- subset(Pima.tr2, complete.cases(bp) & complete.cases(bmi))
newPima$NOskin <- factor(is.na(newPima$skin), labels=c("skin","NOskin"))
## NB: FALSE (skin is not NA) precedes TRUE in alphanumeric order
newPima$skin <- NULL # Omit the column for skin
@ %
The argument \texttt{labels=c("skin","NOskin")} takes the values
(here \texttt{FALSE} and \texttt{TRUE}) in alphanumeric order, then
making \texttt{skin} and \texttt{NOskin} the levels. Omission of this
argument would result in levels \texttt{FALSE} and
\texttt{TRUE}.\footnote{NB also
\texttt{factor(is.na(newPima\$skin), levels=c(TRUE, FALSE))};
levels would then be \texttt{TRUE} and \texttt{FALSE}, in that
order.}
We now do a linear discriminant analysis in which variables other than
\texttt{skin} are explanatory variables.
<>=
completeCV.lda <- lda(NOskin ~ npreg+glu+bp+bmi+ped+age+type,
data=newPima, CV=TRUE)
confusion(newPima$NOskin, completeCV.lda$class)
@ %
A linear discriminant analysis seems unable to distinguish the two
groups. The overall accuracy does not reach what could be achieved by
predicting all rows as complete.
\subsection{Does the missingness give information on the diagnosis?}
If there is a suspicion that it does, then a valid analysis may be
possible as follows. Missing values of continuous variables are
replaced by 0 (or some other arbitrary number). For each such
variable, and each observation for which it is missing, there must be
a factor, e.g.\ with levels \texttt{miss} and \texttt{nomiss}, that
identifies the subject for whom the value is missing. Where values of
several variables are missing for the one subject, the same factor may
be used. This allows an analysis in which all variables, together
with the newly created factors, are ``present'' for all subjects.
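The following sketch (the factor name \texttt{skinMiss} is chosen for
illustration only) applies this approach to \texttt{skin} in
\texttt{Pima.tr2}:
<>=
impPima <- Pima.tr2
## Factor that flags rows where skin is missing
impPima$skinMiss <- factor(is.na(impPima$skin), labels=c("nomiss","miss"))
impPima$skin[is.na(impPima$skin)] <- 0   # arbitrary replacement value
## (Other variables with missing values would be handled similarly.)
@ %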