\part{Discriminant Methods \& Associated Ordinations} Packages: \texttt{e1071}, \texttt{ggplot2}, \texttt{mlbench} (this has the \texttt{Vehicle} dataset), \texttt{mclust}, \texttt{randomForest} <<2plus-gp, echo=F>>= options(SweaveHooks=list(fig=function(){par(cex.main=1.1, mar=c(4.1,4.1,2.6,2.1), mgp=c(2.25,0.5,0), tck=-0.02)})) graphics.off() x11(width=8, height=8) @ These exercises will introduce classification into three or more groups. They examine and compare three methods -- linear discriminant analysis, Support Vector Machines, and random forests. As noted in an earlier laboratory, linear discriminant analysis generates scores that can be plotted directly. Support Vector Machines generate decision scores, one set for each of the $g$ groups; these can be approximated in $g - 1$ dimensions and plotted. For random forests the pairwise proximity of points, calculated as the proportion of trees for which the two observations end up in the same terminal node, is a natural measure of relative nearness. The proximities can be subtracted from 1.0 to yield relative distances, with non-metric scaling then used to obtain a representation in a low-dimensional space. In favorable cases, metric scaling may yield a workable representation, without going on to the further step of non-metric scaling. Again, it will be handy to have the function \texttt{confusion()} available.
<>= confusion <- function(actual, predicted, names=NULL, printit=TRUE, prior=NULL){ if(is.null(names))names <- levels(actual) tab <- table(actual, predicted) acctab <- t(apply(tab, 1, function(x)x/sum(x))) dimnames(acctab) <- list(Actual=names, "Predicted (cv)"=names) if(is.null(prior)){ relnum <- table(actual) prior <- relnum/sum(relnum) acc <- sum(tab[row(tab)==col(tab)])/sum(tab) } else { acc <- sum(prior*diag(acctab)) names(prior) <- names } if(printit)print(round(c("Overall accuracy"=acc, "Prior frequency"=prior),4)) if(printit){ cat("\nConfusion matrix", "\n") print(round(acctab,4)) } invisible(acctab) } @ % \section{Discrimination with Multiple Groups} With three groups, there are three group centroids. These centroids determine a plane, onto which data points can be projected. Linear discriminant analysis assumes, effectively, that discriminants can be represented without loss of information in this plane. It yields two sets of linear discriminant scores that can be plotted, giving an accurate graphical summary of the analysis. More generally, with $g \ge 4$ groups and $p$ features, there are $\min(g-1, p)$ dimensions, and $\min(g-1, p)$ sets of linear discriminant scores. With three (or more) sets of scores, the function \texttt{plot3d()} in the \textit{rgl} package can be used to give a 3D scatterplot that rotates dynamically under user control. The output from printing an \texttt{lda} object that is called with \texttt{CV=FALSE} (the default) includes ``proportion of trace'' information. The successive linear discriminants explain successively smaller proportions of the trace, i.e., of the ratio of the between to within group variance. It may turn out that the final discriminant or the final few discriminants explain a very small proportion of the trace, so that they can be dropped. With methods other than \texttt{lda()}, representation in a low-dimensional space is still in principle possible.
However, any such representation arises less directly from the analysis, and there is nothing directly comparable to the ``proportion of trace'' information. \subsection{The \texttt{diabetes} dataset} The package \textit{mclust} has the data set \texttt{diabetes}. Datasets in this package do not automatically become available when the package is attached.\footnote{This package does not implement the lazy data mechanism.} Thus, it is necessary to bring the data set \texttt{diabetes} into the workspace by typing <>= library(mclust) data(diabetes) @% Here is a 3-dimensional cloud plot: <>= library(lattice) cloud(insulin ~ glucose+sspg, groups=class, data=diabetes) @% \subsection{Linear discriminant analysis} First try linear discriminant analysis. We specify \texttt{CV=TRUE}, in order to obtain predictions of class membership that are derived from leave-one-out cross-validation. We then run the calculations a second time, now with \texttt{CV=FALSE}, in order to obtain an object from which we can extract the discriminant scores: <>= library(MASS) library(lattice) diabetesCV.lda <- lda(class ~ insulin+glucose+sspg, data=diabetes, CV=TRUE) confusion(diabetes$class, diabetesCV.lda$class) diabetes.lda <- lda(class ~ insulin+glucose+sspg, data=diabetes) diabetes.lda # Linear discriminant 1 explains most of the variance hat.lda <- predict(diabetes.lda)$x xyplot(hat.lda[,2] ~ hat.lda[,1], groups=diabetes$class, auto.key=list(columns=3), par.settings=simpleTheme(pch=16)) @ % \subsection{Quadratic discriminant analysis} The plot of linear discriminant scores makes it clear that the variance-covariance structure is very different between the three classes. It is therefore worthwhile to try \texttt{qda()}. <>= diabetes.qda <- qda(class ~ insulin+glucose+sspg, data=diabetes, CV=TRUE) confusion(diabetes$class, diabetes.qda$class) @ % The prediction accuracy improves substantially.
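The \texttt{prior} argument of \texttt{confusion()} makes it straightforward to estimate accuracy under a changed class mix. The following sketch re-uses the cross-validated predictions from the \texttt{qda()} fit above; the prior shown is purely illustrative: <>= ## Accuracy re-weighted for a hypothetical class mix
## (order of the prior follows levels(diabetes$class))
confusion(diabetes$class, diabetes.qda$class, prior=c(0.2, 0.5, 0.3)) @ %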
\noindent \begin{fmpage}{36pc} \exhead{1} Suppose now that the model is used to make predictions for a similar population, but with a different mix of the different classes of diabetes. \begin{enumerate} \item What would be the expected error rate if the three classes occur with equal frequency? \item What would be the expected error rate in a population with a mix of 50\% chemical, 25\% normal and 25\% overt? \end{enumerate} \end{fmpage} \subsection{Support vector machines} This requires the \textit{e1071} package. Here, we can conveniently specify tenfold cross-validation, in order to estimate predictive accuracy. A plot of points can be based on a low-dimensional representation of the ``decision values''. <>= library(e1071) diabetes.svm <- svm(class ~ insulin+glucose+sspg, data=diabetes, cross=10) summary(diabetes.svm) pred <- predict(diabetes.svm, diabetes[,-1], decision.values = TRUE) decision <- attributes(pred)$decision.values decision.dist <- dist(decision) decision.cmd <- cmdscale(decision.dist) library(lattice) xyplot(decision.cmd[,2] ~ decision.cmd[,1], groups=diabetes$class) @ % The cross-validated prediction accuracy is lower than that from \texttt{lda()}. There is, however, substantial scope for using another kernel, and/or setting various tuning parameters. If there is such tuning, care is needed to ensure that the tuning does not bias the accuracy measure. Two possibilities are: (1) divide the data into training and test sets, and use the test set accuracy rather than the cross-validation accuracy; or (2) repeat the tuning at each cross-validation fold. Automation of option (2) will require the writing of special code. Depending on the nature of the tuning, it may not be straightforward. \subsubsection{A rubber sheet representation of the decision values} The plot given above projected the decision values onto a Euclidean space. A non-metric representation may be preferable.
The following uses the \textit{MASS} function \texttt{isoMDS()} to obtain a 2-dimensional representation. The function \texttt{isoMDS()} requires a starting configuration; we use the values from \texttt{cmdscale()} for that purpose: <>= diabetes.mds <- isoMDS(decision.dist, decision.cmd) @ % It turns out that observations 4 and 80 have zero distance, which \texttt{isoMDS()} is unable to handle. We therefore add a small positive quantity to the distance between these two observations. <>= eps <- min(decision.dist[decision.dist > 0]) ## We add half the smallest non-zero distance decision.dist[decision.dist == 0] <- eps diabetes.mds <- isoMDS(decision.dist, decision.cmd) xyplot(diabetes.mds$points[,2] ~ diabetes.mds$points[,1], groups=diabetes$class) @% \subsection{Use of \texttt{randomForest()}} First, fit a random forest discriminant model, calculating at the same time the proximities. The proximity of any pair of points is the proportion of trees in which the two points appear in the same terminal node: <>= ## Use randomForest(), obtain proximities library(randomForest) diabetes.rf <- randomForest(class~., data=diabetes, proximity=TRUE) print(diabetes.rf) @ % Note the overall error rate. \noindent \begin{fmpage}{36pc} \exhead{2} Suppose that the model is used to make predictions for a similar population, but with a different mix of the different classes of diabetes. \begin{enumerate} \item What would be the expected error rate if the three classes occur with equal frequency? \item What would be the expected error rate in a population with a mix of 50\% chemical, 25\% normal and 25\% overt? \end{enumerate} \end{fmpage} Points that in a large proportion of trees appear at the same terminal node are in some sense ``close together'', whereas points that rarely appear in the same terminal node are ``far apart''. This is the motivation for subtracting the proximities from 1.0, and treating the values obtained as distances in Euclidean space.
Initially, a two-dimensional representation will be tried. If this representation reproduces the distances effectively, the result will be a plot in which the visual separation of the points reflects the accuracy with which the algorithm has been able to separate the points: <>= ## Euclidean metric scaling diabetes.rf.cmd <- cmdscale(1-diabetes.rf$proximity) plot(diabetes.rf.cmd, col=unclass(diabetes$class)+1) @% The clear separation between the three groups, on this graph, is remarkable, though perhaps to be expected given that the OOB error rate is $\sim$2\%. \subsubsection{A rubber sheet representation of the proximity-based ``distances''} There is no necessary reason why distances that have been calculated as above should be Euclidean distances. It is more reasonable to treat them as relative distances. The \texttt{isoMDS()} function in the \textit{MASS} package uses the ordinates that are given by \texttt{cmdscale()} as a starting point for Kruskal's ``non-metric'' multi-dimensional scaling. In effect, distances are represented on a rubber sheet that can be stretched or shrunk locally, provided only that the ordering of the distances is preserved. Almost certainly, some distances will turn out to be zero. In order to proceed, zero distances will be replaced by 0.5 divided by the number of points. (The minimum non-zero distance must be at least \texttt{1/dim(diabetes)[1]}. Why?) <>= sum(1-diabetes.rf$proximity==0) distmat <- as.dist(1-diabetes.rf$proximity) distmat[distmat==0] <- 0.5/dim(diabetes)[1] @ % We now proceed with the plot. <>= diabetes.rf.mds <- isoMDS(distmat, y=diabetes.rf.cmd) xyplot(diabetes.rf.mds$points[,2] ~ diabetes.rf.mds$points[,1], groups=diabetes$class, auto.key=list(columns=3)) @% Note that these plots exaggerate the separation between the groups. They represent visually the effectiveness of the algorithm in classifying the training data. To get a fair assessment, the plot should show points for a separate set of test data.
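One way to obtain such an assessment is to hold out a test set, refit the forest to the remaining observations, and tabulate the test-set predictions. A minimal sketch follows; the seed and the size of the held-out set are arbitrary choices: <>= ## Hold out roughly a third of the data, refit, then check test accuracy
set.seed(41)    # arbitrary, for reproducibility
testrows <- sample(1:nrow(diabetes), 48)
diabetes.rftrain <- randomForest(class ~ ., data=diabetes[-testrows, ])
testhat <- predict(diabetes.rftrain, newdata=diabetes[testrows, ])
confusion(diabetes$class[testrows], testhat) @ % With data this limited, the test-set estimate will itself be quite variable; repeating the split several times gives a sense of that variability.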
\noindent \begin{fmpage}{36pc} \exhead{3} Make a table that compares \texttt{lda()}, \texttt{svm()} and \texttt{randomForest()}, with respect to error rate for the three classes separately. \end{fmpage} \subsection{Vehicle dataset} Repeat the above, now with the \texttt{Vehicle} data frame from the \textit{mlbench} package. For the plots, ensure that the \textit{ggplot2} package (and dependencies) is installed. Plots will use \texttt{quickplot()}, from the \textit{ggplot2} package. This makes it easy to get density plots. With respect to arguments to \texttt{quickplot()}, note that: \begin{itemize} \item \texttt{quickplot()} has the \texttt{geom} (not \texttt{geometry}) argument, where base and lattice graphics would use \texttt{type}; \item If different \texttt{colour}s are specified, data will be grouped according to those colours. \end{itemize} <>= library(ggplot2) library(mlbench) data(Vehicle) VehicleCV.lda <- lda(Class ~ ., data=Vehicle, CV=TRUE) confusion(Vehicle$Class, VehicleCV.lda$class) Vehicle.lda <- lda(Class ~ ., data=Vehicle) Vehicle.lda hat.lda <- predict(Vehicle.lda)$x quickplot(hat.lda[,1], hat.lda[,2], colour=Vehicle$Class, geom=c("point","density2d")) @ % \subsubsection{Vehicle dataset -- Quadratic discriminant analysis} <>= Vehicle.qda <- qda(Class ~ ., data=Vehicle, CV=TRUE) confusion(Vehicle$Class, Vehicle.qda$class) @ \noindent \textbf{Support Vector Machines} <>= Vehicle.svm <- svm(Class ~ ., data=Vehicle, cross=10) summary(Vehicle.svm) pred <- predict(Vehicle.svm, Vehicle, decision.values = TRUE) decision <- attributes(pred)$decision.values decision.dist <- dist(decision) eps <- min(decision.dist[decision.dist>0])/2 decision.cmd <- cmdscale(decision.dist) quickplot(decision.cmd[,1], decision.cmd[,2], colour=Vehicle$Class, geom=c("point","density2d")) ## Now try Kruskal's non-metric MDS decision.dist[decision.dist == 0] <- eps decision.mds <- isoMDS(decision.dist, decision.cmd)$points quickplot(decision.mds[,1], decision.mds[,2], colour=Vehicle$Class,
geom=c("point","density2d")) @ % Compare the metric and non-metric plots. Do they tell the same story? \subsection{Vehicle dataset -- random forests} <>= Vehicle.rf <- randomForest(Class~., data=Vehicle, proximity=TRUE) print(Vehicle.rf) distmat <- 1-Vehicle.rf$proximity Vehicle.rf.cmd <- cmdscale(distmat) quickplot(Vehicle.rf.cmd[,1], Vehicle.rf.cmd[,2], colour=Vehicle$Class, geom=c("point","density2d")) eps <- min(distmat[distmat>0])/2 distmat[distmat == 0] <- eps Vehicle.rf.mds <- isoMDS(distmat, Vehicle.rf.cmd)$points quickplot(Vehicle.rf.mds[,1], Vehicle.rf.mds[,2], colour=Vehicle$Class, geom=c("point","density2d")) @ % Now calculate the distances for the points generated by \texttt{isoMDS()}, and plot these distances against distances (in \texttt{distmat}) generated by subtracting the proximities from 1. <>= mdsdist <- dist(Vehicle.rf.mds) plot(as.dist(distmat), mdsdist) @ %
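A single numerical summary of the agreement can complement this plot. One possibility is the rank correlation between the two sets of distances; the stress value that \texttt{isoMDS()} returns is another. (The refit below is only to recover the stress, since the earlier chunk kept \texttt{\$points} alone.) <>= ## How well do the MDS distances reproduce the proximity-based distances?
cor(as.vector(as.dist(distmat)), as.vector(mdsdist), method="spearman")
Vehicle.mdsfit <- isoMDS(distmat, Vehicle.rf.cmd)   # refit, keeping the full result
Vehicle.mdsfit$stress    # Kruskal stress, as a percentage @ %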