\part{Discriminant Methods \& Associated Ordinations}
Packages: \texttt{e1071}, \texttt{ggplot2}, \texttt{mlbench} (this has the
\texttt{Vehicle} dataset), \texttt{mclust},
\texttt{randomForest}
<<2plus-gp, echo=F>>=
options(SweaveHooks=list(fig=function(){par(cex.main=1.1,
mar=c(4.1,4.1,2.6,2.1),
mgp=c(2.25,0.5,0), tck=-0.02)}))
graphics.off()
x11(width=8, height=8)
@
These exercises will introduce classification into three or more
groups. They examine and compare three methods -- linear discriminant
analysis, Support Vector Machines, and random forests.
As noted in an earlier laboratory, linear discriminant analysis
generates scores that can be plotted directly. Support Vector Machines
generate decision scores, one set for each pair of groups; these
can be approximated in a low-dimensional space and plotted. For random
forests the pairwise proximity of points, calculated as the proportion
of trees for which the two observations end up in the same terminal
node, is a natural measure of relative nearness. Proximities can
be subtracted from 1.0 to yield relative distances, with
non-metric scaling then used to obtain a representation in
a low-dimensional space. In favorable cases, metric
scaling may yield a workable representation, without the further
step of non-metric scaling.
Again, it will be handy to have the function \texttt{confusion()}
available.
<>=
confusion <- function(actual, predicted, names=NULL,
                      printit=TRUE, prior=NULL){
  if(is.null(names))names <- levels(actual)
  tab <- table(actual, predicted)
  acctab <- t(apply(tab, 1, function(x)x/sum(x)))
  dimnames(acctab) <- list(Actual=names,
                           "Predicted (cv)"=names)
  if(is.null(prior)){
    relnum <- table(actual)
    prior <- relnum/sum(relnum)
    acc <- sum(tab[row(tab)==col(tab)])/sum(tab)
  } else {
    acc <- sum(prior*diag(acctab))
    names(prior) <- names
  }
  if(printit)print(round(c("Overall accuracy"=acc,
                           "Prior frequency"=prior),4))
  if(printit){
    cat("\nConfusion matrix", "\n")
    print(round(acctab,4))
  }
  invisible(acctab)
}
@ %
\section{Discrimination with Multiple Groups}
With three groups, there are three group centroids. These centroids
determine a plane, onto which data points can be projected. Linear
discriminant analysis assumes, effectively, that discriminants can be
represented without loss of information in this plane. It yields two
sets of linear discriminant scores that can be plotted, giving an
accurate graphical summary of the analysis.
More generally, with $g \ge 4$ groups and $p$ explanatory variables,
there are $\min(g-1, p)$ dimensions, and $\min(g-1, p)$ sets of linear
discriminant scores.
With three (or more) sets of scores, the function \texttt{plot3d()} in
the \textit{rgl} package can be used to give a 3D scatterplot that
rotates dynamically under user control.
The output from printing an \texttt{lda} object that is called with
\texttt{CV=FALSE} (the default) includes ``proportion of trace''
information. The successive linear discriminants explain successively
smaller proportions of the trace, i.e., of the ratio of between-group to
within-group variance. It may turn out that the final discriminant or
the final few discriminants explain a very small proportion of the
trace, so that they can be dropped.
With methods other than \texttt{lda()}, representation in a
low-dimensional space is still in principle possible. However any
such representation arises less directly from the analysis, and there
is nothing directly comparable to the ``proportion of trace''
information.
\subsection{The \texttt{diabetes} dataset}
The package \textit{mclust} has the data set \texttt{diabetes}.
Datasets in this package do not automatically become available when
the package is attached.\footnote{This package does not implement the
lazy data mechanism.} Thus, it is necessary to bring the data set
\texttt{diabetes} into the workspace by typing
<>=
library(mclust)
data(diabetes)
@%
Here is a 3-dimensional cloud plot:
<>=
library(lattice)
cloud(insulin ~ glucose+sspg, groups=class, data=diabetes)
@%
\subsection{Linear discriminant analysis}
First try linear discriminant analysis. We specify \texttt{CV=TRUE},
in order to obtain predictions of class membership that are derived from
leave-one-out cross-validation. We then run the calculations a second
time, now with \texttt{CV=FALSE}, in order to obtain an object from which
we can obtain the discriminant scores:
<>=
library(MASS)
library(lattice)
diabetesCV.lda <- lda(class ~ insulin+glucose+sspg, data=diabetes, CV=TRUE)
confusion(diabetes$class, diabetesCV.lda$class)
diabetes.lda <- lda(class ~ insulin+glucose+sspg, data=diabetes)
diabetes.lda # Linear discriminant 1 explains most of the variance
hat.lda <- predict(diabetes.lda)$x
xyplot(hat.lda[,2] ~ hat.lda[,1], groups=diabetes$class,
auto.key=list(columns=3), par.settings=simpleTheme(pch=16))
@ %
\subsection{Quadratic discriminant analysis}
The plot of linear discriminant scores makes it clear that the
variance-covariance structure is very different between the three
classes. It is therefore worthwhile to try \texttt{qda()}.
<>=
diabetes.qda <- qda(class ~ insulin+glucose+sspg, data=diabetes, CV=TRUE)
confusion(diabetes$class, diabetes.qda$class)
@ %
The prediction accuracy improves substantially.
\noindent
\begin{fmpage}{36pc}
\exhead{1}
Suppose now that the model is used to make predictions for a similar
population, but with a different mix of the different classes of diabetes.
\begin{enumerate}
\item What would be the expected error rate if the three classes occur
with equal frequency?
\item What would be the expected error rate in a population with a mix
of 50\% chemical, 25\% normal and 25\% overt?
\end{enumerate}
\end{fmpage}
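As a check on hand calculations, the \texttt{prior} argument of
\texttt{confusion()} re-weights the class-specific accuracies. The
following is a sketch only, assuming the cross-validated \texttt{qda}
fit from above, and assuming that the levels of
\texttt{diabetes\$class} are ordered chemical, normal, overt:
<<eval=FALSE>>=
## Sketch only: re-weight accuracies for hypothetical class mixes.
## Assumes levels(diabetes$class) are ordered: chemical, normal, overt.
confusion(diabetes$class, diabetes.qda$class, prior=c(1/3, 1/3, 1/3))
confusion(diabetes$class, diabetes.qda$class, prior=c(0.5, 0.25, 0.25))
@ %
In each case the expected error rate is one minus the reported overall
accuracy.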
\subsection{Support vector machines}
This requires the \textit{e1071} package. Here, we can conveniently
specify tenfold cross-validation, in order to estimate predictive accuracy.
A plot of points can be based on a low-dimensional representation of
the ``decision values''.
<>=
library(e1071)
diabetes.svm <- svm(class ~ insulin+glucose+sspg, data=diabetes, cross=10)
summary(diabetes.svm)
pred <- predict(diabetes.svm, diabetes[,-1], decision.values = TRUE)
decision <- attributes(pred)$decision.values
decision.dist <- dist(decision)
decision.cmd <- cmdscale(decision.dist)
library(lattice)
xyplot(decision.cmd[,2] ~ decision.cmd[,1], groups=diabetes$class)
@ %
The cross-validated prediction accuracy is lower than that from
\texttt{lda()}. There is however substantial scope for using another
kernel, and/or setting various tuning parameters. If there is such
tuning, care is needed to ensure that the tuning does not bias the
accuracy measure. Two possibilities are: (1) divide
the data into training and test sets, and use the test set accuracy
rather than the cross-validation accuracy; or (2) repeat the tuning at
each cross-validation fold. Automation of option (2) will require the
writing of special code. Depending on the nature of the tuning, it may
not be straightforward.
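As an illustration of option (1), the following sketch (not run here)
uses \texttt{tune.svm()} from \textit{e1071} to tune on a training set,
then assesses accuracy on a held-out test set. The 70:30 split and the
\texttt{gamma}/\texttt{cost} grids are illustrative choices, not
recommendations:
<<eval=FALSE>>=
## Sketch only: tune on a training set, assess on a held-out test set.
## The 70:30 split and the parameter grids are illustrative choices.
inTrain <- sample(nrow(diabetes), size=round(0.7*nrow(diabetes)))
tuned <- tune.svm(class ~ insulin+glucose+sspg, data=diabetes[inTrain, ],
                  gamma=10^(-3:0), cost=10^(0:2))
testpred <- predict(tuned$best.model, diabetes[-inTrain, ])
confusion(diabetes$class[-inTrain], testpred)
@ %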
\subsubsection{A rubber sheet representation of the decision values}
The plot given above projected the decision values on to a Euclidean
space. A non-metric representation may be preferable. The following
uses the function \texttt{isoMDS()} from the \textit{MASS} package to
obtain a 2-dimensional representation. The function \texttt{isoMDS()}
requires a starting configuration; we use the values from
\texttt{cmdscale()} for that purpose:
<>=
diabetes.mds <- isoMDS(decision.dist, decision.cmd)
@ %
It turns out that observations 4 and 80 have zero distance,
which \texttt{isoMDS()} is unable to handle. We therefore
add a small positive quantity to the distance between these two
observations.
<>=
eps <- min(decision.dist[decision.dist > 0])/2
## Replace zero distances by half the smallest non-zero distance
decision.dist[decision.dist == 0] <- eps
diabetes.mds <- isoMDS(decision.dist, decision.cmd)
xyplot(diabetes.mds$points[,2] ~ diabetes.mds$points[,1],
groups=diabetes$class)
@%
\subsection{Use of \texttt{randomForest()}}
First, fit a randomForest discriminant model, calculating at the same
time the proximities. The proximity of any pair of points is the
proportion of trees in which the two points appear in the same terminal
node:
<>=
## Use randomForest(), obtain proximities
library(randomForest)
diabetes.rf <- randomForest(class~., data=diabetes, proximity=TRUE)
print(diabetes.rf)
@ %
Note the overall error rate.
\noindent
\begin{fmpage}{36pc}
\exhead{2}
Suppose that the model is used to make predictions for a similar
population, but with a different mix of the different classes of diabetes.
\begin{enumerate}
\item What would be the expected error rate if the three classes occur
with equal frequency?
\item What would be the expected error rate in a population with a mix
of 50\% chemical, 25\% normal and 25\% overt?
\end{enumerate}
\end{fmpage}
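As with Exercise 1, the \texttt{prior} argument of \texttt{confusion()}
can be used to check the calculation. When no new data are supplied,
\texttt{predict()} on a \texttt{randomForest} object returns the
out-of-bag predictions. A sketch, again assuming the class levels are
ordered chemical, normal, overt:
<<eval=FALSE>>=
## Sketch only: OOB predictions, re-weighted for hypothetical mixes.
confusion(diabetes$class, predict(diabetes.rf), prior=c(1/3, 1/3, 1/3))
confusion(diabetes$class, predict(diabetes.rf), prior=c(0.5, 0.25, 0.25))
@ %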
Points that in a large proportion of trees appear at the same terminal
node are in some sense ``close together'', whereas points that rarely
appear in the same terminal node are ``far apart''. This is the
motivation for subtracting the proximities from 1.0, and treating the
values obtained as distances in Euclidean space. Initially, a
two-dimensional representation will be tried. If this representation
reproduces the distances effectively, the result will be a plot in
which the visual separation of the points reflects the accuracy with
which the algorithm has been able to separate the points:
<>=
## Euclidean metric scaling
diabetes.rf.cmd <- cmdscale(1-diabetes.rf$proximity)
plot(diabetes.rf.cmd, col=unclass(diabetes$class)+1)
@%
The clear separation between the three groups, on this graph, is remarkable,
though perhaps to be expected given that the OOB error rate is $\sim$2\%.
\subsubsection{A rubber sheet representation of the proximity-based
``distances''}
There is no necessary reason why distances that have been calculated as
above should be Euclidean distances. It is more reasonable to
treat them as relative distances. The \texttt{isoMDS()}
function in the \textit{MASS} package uses the ordinates that are given
by \texttt{cmdscale()} as a starting point for Kruskal's ``non-metric''
multi-dimensional scaling. In effect, distances are represented on a rubber
sheet that can be stretched or shrunk in local parts of the sheet, providing
only that the ordering of the relative distances is unchanged.
Almost certainly, some distances will turn out to be zero.
In order to proceed, zero distances will be replaced by 0.5 divided
by the number of points. (The minimum non-zero distance must be at
least \texttt{1/dim(diabetes)[1]}. Why?)
<>=
sum(1-diabetes.rf$proximity==0)
distmat <- as.dist(1-diabetes.rf$proximity)
distmat[distmat==0] <- 0.5/dim(diabetes)[1]
@ %
We now proceed with the plot.
<>=
diabetes.rf.mds <- isoMDS(distmat, y=diabetes.rf.cmd)
xyplot(diabetes.rf.mds$points[,2] ~ diabetes.rf.mds$points[,1],
groups=diabetes$class, auto.key=list(columns=3))
@%
Note that these plots exaggerate the separation between the groups.
They represent visually the effectiveness of the algorithm in classifying
the training data. To get a fair assessment, the plot should show
points for a separate set of test data.
\noindent
\begin{fmpage}{36pc}
\exhead{3}
Make a table that compares \texttt{lda()}, \texttt{svm()} and
\texttt{randomForest()}, with respect to error rates for the three classes
separately.
\end{fmpage}
\subsection{Vehicle dataset}
Repeat the above, now with the \texttt{Vehicle} data frame from the
\textit{mlbench} package. For the plots, ensure that the \textit{ggplot2}
package (and dependencies) is installed.
Plots will use \texttt{quickplot()}, from the \textit{ggplot2} package.
This makes it easy to get density plots. With respect to arguments
to \texttt{quickplot()}, note that:
\begin{itemize}
\item \texttt{quickplot()} has the \texttt{geom} (not \texttt{geometry})
argument, where base and lattice graphics would use \texttt{type};
\item If different \texttt{colour}s are specified, data will
be grouped according to those colours.
\end{itemize}
<>=
library(ggplot2)
library(mlbench)
data(Vehicle)
VehicleCV.lda <- lda(Class ~ ., data=Vehicle, CV=TRUE)
confusion(Vehicle$Class, VehicleCV.lda$class)
Vehicle.lda <- lda(Class ~ ., data=Vehicle)
Vehicle.lda
hat.lda <- predict(Vehicle.lda)$x
quickplot(hat.lda[,1], hat.lda[,2], colour=Vehicle$Class,
geom=c("point","density2d"))
@ %
\subsubsection{Vehicle dataset -- Quadratic discriminant analysis}
<>=
Vehicle.qda <- qda(Class ~ ., data=Vehicle, CV=TRUE)
confusion(Vehicle$Class, Vehicle.qda$class)
@
\noindent \textbf{Support Vector Machines}
<>=
Vehicle.svm <- svm(Class ~ ., data=Vehicle, cross=10)
summary(Vehicle.svm)
pred <- predict(Vehicle.svm, Vehicle, decision.values = TRUE)
decision <- attributes(pred)$decision.values
decision.dist <- dist(decision)
eps <- min(decision.dist[decision.dist>0])/2
decision.cmd <- cmdscale(decision.dist)
quickplot(decision.cmd[,1], decision.cmd[,2], colour=Vehicle$Class,
geom=c("point","density2d"))
## Now try Kruskal's non-metric MDS
decision.dist[decision.dist == 0] <- eps
decision.mds <- isoMDS(decision.dist, decision.cmd)$points
quickplot(decision.mds[,1], decision.mds[,2], colour=Vehicle$Class,
geom=c("point","density2d"))
@ %
Compare the metric and non-metric plots. Do they tell the same story?
\subsection{Vehicle dataset -- random forests}
<>=
Vehicle.rf <- randomForest(Class~., data=Vehicle, proximity=TRUE)
print(Vehicle.rf)
distmat <- 1-Vehicle.rf$proximity
Vehicle.rf.cmd <- cmdscale(distmat)
quickplot(Vehicle.rf.cmd[,1], Vehicle.rf.cmd[,2], colour=Vehicle$Class,
geom=c("point","density2d"))
eps <- min(distmat[distmat>0])/2
distmat[distmat == 0] <- eps
Vehicle.rf.mds <- isoMDS(distmat, Vehicle.rf.cmd)$points
quickplot(Vehicle.rf.mds[,1], Vehicle.rf.mds[,2], colour=Vehicle$Class,
geom=c("point","density2d"))
@ %
Now calculate the distances for the points generated by \texttt{isoMDS()},
and plot these distances against distances (in \texttt{distmat}) generated
by subtracting the proximities from 1.
<>=
mdsdist <- dist(Vehicle.rf.mds)
plot(as.dist(distmat), mdsdist)
@ %