\part{Linear Discriminant Analysis vs Random Forests}

Package: \texttt{randomForest}
\vspace*{3pt}
<>=
options(SweaveHooks=list(fig=function(){par(cex.main=1.1, mar=c(4.1,4.1,2.6,2.1),
                                            mgp=c(2.25,0.5,0), tck=-0.02)}))
graphics.off()
@

For linear discriminant analysis, we will use the function \texttt{lda()} (\textit{MASS} package). Covariates are assumed to have a common multivariate normal distribution. The method may have poor predictive power where there are complex forms of dependence on the explanatory factors and variables. Where it is effective, it has the virtue of simplicity.

The function \texttt{qda()} weakens the assumptions underlying \texttt{lda()} to allow different variance-covariance matrices for different groups within the data. This sets limits on the minimum group size.

Where there are two classes, generalized linear models (\texttt{glm()}) have very similar properties to linear discriminant analysis. They make weaker assumptions. The trade-off is that estimated group membership probabilities are conditional on the observed model matrix.

With all these ``linear'' methods, the model matrix can replace columns of covariate data by a set of spline (or other) basis columns that allow for effects that are nonlinear in the covariates. Use \texttt{termplot()} with a \texttt{glm} object, with the argument \texttt{smooth=panel.smooth}, to check for hints of nonlinear covariate effects. Detection of nonlinear effects typically requires very extensive data.

A good first check on whether these ``linear'' methods are adequate is comparison with the highly nonparametric analysis of the function \texttt{randomForest()} (\textit{randomForest} package). Random forests may do well when complex interactions are required to explain the dependence.

Here, attention will mostly be limited to data where there are two groups only.

\section{Accuracy for Classification Models -- the \texttt{Pima} Data}

Type \texttt{help(Pima.tr2)} (\textit{MASS} package) to get a description of these data.
They are relevant to the investigation of conditions that may pre-dispose to diabetes. All the explanatory variables can be treated as continuous variables. There are no factor columns, or columns (e.g.\ of 0/1 data) that might be regarded as factors.

\subsection{Fitting an lda model}

First try linear discriminant analysis, using the model formula \texttt{type $\sim$ .}. This takes as explanatory variables all columns of \texttt{Pima.tr} except \texttt{type}. A first run of the calculations has \texttt{CV=TRUE}, thus using leave-one-out cross-validation to predict class membership and hence get an accuracy estimate. The second run of the calculations has \texttt{CV=FALSE} (the default), allowing the use of \texttt{predict()} to obtain (among other information) discriminant scores.
<>=
library(MASS)
PimaCV.lda <- lda(type ~ ., data=Pima.tr, CV=TRUE)
tab <- table(Pima.tr$type, PimaCV.lda$class)
## Calculate confusion matrix from cross-validation
conCV1 <- rbind(tab[1,]/sum(tab[1,]), tab[2,]/sum(tab[2,]))
dimnames(conCV1) <- list(Actual=c("No", "Yes"),
                         "Predicted (cv)"=c("No","Yes"))
print(round(conCV1,3))
## Now refit the model and get prediction scores
Pima.lda <- lda(type ~ ., data=Pima.tr)
## Get a training set accuracy estimate; can be highly optimistic
Pima.hat <- predict(Pima.lda)
tabtrain <- table(Pima.tr$type, Pima.hat$class)
conTrain <- rbind(tabtrain[1,]/sum(tabtrain[1,]),
                  tabtrain[2,]/sum(tabtrain[2,]))
dimnames(conTrain) <- list(Actual=c("No", "Yes"),
                           "Predicted (training)"=c("No","Yes"))
print(round(conTrain,3))
@

Notice that, here, the two accuracy measures are much the same. In general, the training set accuracy can be optimistic. Now plot the discriminant scores. As there are two groups only, there is just one set of scores.
<>=
library(lattice)
densityplot(~Pima.hat$x, groups=Pima.tr$type)
@ %
A function that calculates the confusion matrices and overall accuracy would be helpful:
<>=
confusion <- function(actual, predicted, names=NULL,
                      printit=TRUE, prior=NULL){
  if(is.null(names))names <- levels(actual)
  tab <- table(actual, predicted)
  acctab <- t(apply(tab, 1, function(x)x/sum(x)))
  dimnames(acctab) <- list(Actual=names,
                           "Predicted (cv)"=names)
  if(is.null(prior)){
    relnum <- table(actual)
    prior <- relnum/sum(relnum)
    acc <- sum(tab[row(tab)==col(tab)])/sum(tab)
  } else {
    acc <- sum(prior*diag(acctab))
    names(prior) <- names
  }
  if(printit)print(round(c("Overall accuracy"=acc,
                           "Prior frequency"=prior),4))
  if(printit){
    cat("\nConfusion matrix", "\n")
    print(round(acctab,4))
  }
  invisible(acctab)
}
@ %

\subsection{The model that includes first order interactions}

Is the outcome influenced by the combined effect of covariates, e.g., by whether they increase or decrease together? A check is to include the effects of all products of variable values such as \texttt{npreg*glu}, \texttt{npreg*bp}, etc. In this instance, it will turn out that this leads to a model that is over-fitted. The model formula \verb!(a+b+c)^2! expands to \texttt{a+b+c+a:b+a:c+b:c}. Note the following:
\begin{itemize}
\item \texttt{a:a} is the same as \texttt{a}.
\item If \texttt{a} and \texttt{b} are (different) factors, then \texttt{a:b} is the interaction between \texttt{a} and \texttt{b}, i.e., it allows for effects that are due to the specific combination of level of \texttt{a} with level of \texttt{b}.
\item If \texttt{a} is a factor and \texttt{b} is a variable, the interaction \texttt{a:b} allows for different coefficients of the variable for different levels of the factor.
\item If \texttt{a} and \texttt{b} are (different) variables, then \texttt{a:b} is the result of multiplying \texttt{a} by \texttt{b}, element by element.
\end{itemize}

\noindent
\begin{fmpage}{36pc}
\exhead{1} Try adding interaction terms to the model fitted above:
<>=
## Accuracy estimated by leave-one-out CV
PimaCV2.lda <- lda(type ~ .^2, data=Pima.tr, CV=TRUE)
confusion(Pima.tr$type, PimaCV2.lda$class)
## Now estimate "accuracy" on the training data
Pima2.hat <- predict(lda(type ~ .^2, data=Pima.tr))$class
confusion(Pima.tr$type, Pima2.hat)
@ %
Observe that the training set measure (\textit{resubstitution} accuracy or \textit{apparent} accuracy) now substantially exaggerates the accuracy. The model that includes all interaction terms is in truth giving lower predictive accuracy; it overfits.
\end{fmpage}

\subsection{Proportion correctly classified}

Consider the fit
<>=
PimaCV.lda <- lda(type ~ ., data=Pima.tr, CV=TRUE)
confusion(Pima.tr$type, PimaCV.lda$class)
@ %
The overall accuracy is estimated as 0.755. If however we keep the same rule, but change the prior proportions of the two classes, the overall accuracy will change. If, for example, the two classes are in the ratio 0.9:0.1, the overall accuracy will be 0.9 $\times$ 0.8636 + 0.1 $\times$ 0.5441 $\simeq$ 0.83. The No's are easier to classify; with a higher proportion of No's the classification accuracy increases.

However the classification rule that is optimal also changes if the prior proportions change. The function \texttt{lda()} allows specification of a prior, thus:
<>=
prior <- c(0.9, 0.1)
PimaCVp.lda <- lda(type ~ ., data=Pima.tr, CV=TRUE, prior=prior)
confusion(Pima.tr$type, PimaCVp.lda$class, prior=c(0.9, 0.1))
@ %
If the rule is modified to be optimal relative to the new prior proportions, the accuracy thus increases to 0.91, approximately.

\noindent
\begin{fmpage}{36pc}
\exhead{2} Now assume prior proportions of 0.85 and 0.15. Repeat the above calculations, i.e.
\begin{itemize}
\item Estimate the accuracy using the rule that is designed to be optimal when the prior proportions are as in the sample.
\item Estimate the accuracy using the rule that is designed to be optimal when the prior proportions are 0.85:0.15.
\end{itemize}
\end{fmpage}

\subsection{The ROC (receiver operating characteristic)}

This introduces the terms \textit{sensitivity} and \textit{specificity}. With prior proportions as in the sample (0.66:0.34), the sensitivity (true positive rate) was estimated as 0.544; this is the probability of correctly identifying a person who is a diabetic as a diabetic. The false positive rate (1 $-$ specificity) was estimated as 0.136.

There is a trade-off between sensitivity and specificity. The ROC curve, which is a plot of sensitivity against 1 $-$ specificity (the false positive rate), displays this trade-off graphically.

The analysis assumes that the costs of the two types of mis-classification are equal. Varying the costs, while keeping the prior probabilities the same, is equivalent to keeping the costs equal, but varying the prior probabilities. The following calculation takes advantage of this equivalence.

\noindent
\begin{fmpage}{36pc}
\exhead{3} Run the following calculations:
<>=
truepos <- numeric(19)
falsepos <- numeric(19)
p1 <- (1:19)/20
for(i in 1:19){
  p <- p1[i]
  Pima.CV1p <- lda(type ~ ., data=Pima.tr, CV=TRUE, prior=c(p, 1-p))
  confmat <- confusion(Pima.tr$type, Pima.CV1p$class, printit=FALSE)
  falsepos[i] <- confmat[1,2]
  truepos[i] <- confmat[2,2]
}
@ %
Now plot the curve.
<>=
plot(truepos ~ falsepos, type="l",
     xlab="False positive rate",
     ylab="True positive rate (Sensitivity)")
@ %
Read off the sensitivity at a low false positive rate (e.g., 0.1), and at a rate around the middle of the range, and comment on the trade-off.
\end{fmpage}

The ROC curve allows assessment of the effects of different trade-offs between the two types of cost.

\subsection{Accuracy on test data}

There is an additional data set -- \texttt{Pima.te} -- that has been set aside for testing. The following checks the accuracy on these ``test'' data.
<>=
Pima.lda <- lda(type ~ ., data = Pima.tr)
testhat <- predict(Pima.lda, newdata=Pima.te)
confusion(Pima.te$type, testhat$class)
@ %
This improves on the leave-one-out CV accuracy on \texttt{Pima.tr}. The difference in the prior proportions is too small to have much effect on the overall accuracy. The apparent improvement may be a chance effect. Another possibility is that the division of the data between \texttt{Pima.tr} and \texttt{Pima.te} may not have been totally random, and \texttt{Pima.te} may have fewer hard-to-classify points. There are two checks that may provide insight:
\begin{itemize}
\item Swap the roles of training and test data, and note whether the relative accuracies are similar.
\item Repeat the calculations on a bootstrap sample of the training data, to get an indication of the uncertainty in the accuracy assessment.
\end{itemize}
\enlargethispage{36pt}

\noindent
\begin{fmpage}{36pc}
\exhead{4} Try the effect of swapping the roles of training and test data.
<>=
swapCV.lda <- lda(type ~ ., data = Pima.te, CV=TRUE)
confusion(Pima.te$type, swapCV.lda$class)
swap.lda <- lda(type ~ ., data = Pima.te)
otherhat <- predict(swap.lda, newdata=Pima.tr)
confusion(Pima.tr$type, otherhat$class)
@ %
\end{fmpage}

Note that, again, the accuracy is greater for \texttt{Pima.te} than for \texttt{Pima.tr}, but the difference is smaller.

\begin{fmpage}{36pc}
\exhead{5} Now check the accuracy on a bootstrap sample:
<>=
prior <- table(Pima.tr$type)
prior <- prior/sum(prior)
index <- sample(1:dim(Pima.tr)[1], replace=TRUE)
boot.lda <- lda(type ~ ., data = Pima.tr[index, ], CV=TRUE)
cmat <- confusion(Pima.tr[index, "type"], boot.lda$class, printit=FALSE)
print(c(acc=round(prior[1]*cmat[1,1]+prior[2]*cmat[2,2],4)))
@ %
The calculations should be repeated several times. The changes in the predictive accuracy estimates are substantial. Note the need to relate all accuracies back to the same prior probabilities, to ensure comparability.
Annotate the code to explain what it does.
\end{fmpage}

(From running this code five times, I obtained results of 0.77, 0.74, 0.72, 0.71 and 0.82.)

\section{Logistic regression -- an alternative to \texttt{lda}}

As the Pima data have only two classes (levels of \texttt{type}), the calculation can be handled as a regression problem, albeit with the response on a logit scale, i.e., the linear model predicts $\log(\frac{\pi}{1-\pi})$, where $\pi$ is the probability of having diabetes.

\begin{fmpage}{36pc}
\exhead{6} Fit a logistic regression model and check the accuracy.
<>=
Pima.glm <- glm(I(unclass(type)-1) ~ ., data=Pima.tr, family=binomial)
testhat <- round(predict(Pima.glm, newdata=Pima.te, type="response"))
confusion(Pima.te$type, testhat)
@ %
Compare the accuracy with that obtained from \texttt{lda()}. A cross-validation estimate of accuracy, based on the training data, can be obtained thus:
<>=
library(DAAG)
CVbinary(Pima.glm)
@ %
This should be repeated several times. How consistent are the results?
\end{fmpage}

One advantage of \texttt{glm()} is that asymptotic standard error estimates are available for parameter estimates:
<>=
round(summary(Pima.glm)$coef, 3)
@ %
These results suggest that \texttt{npreg}, \texttt{bp} and \texttt{skin} can be omitted without much change to predictive accuracy. Predictive accuracy may actually increase. There is, however, no guarantee of this, and it is necessary to check. Even though there is individually no detectable effect, the combined effect of two or more of them may be of consequence.

Using this logistic regression approach, there is no built-in provision to adjust for prior probabilities. Users can however make their own adjustments.

A further advantage of \texttt{glm()} is that the \texttt{termplot()} function is available to provide a coarse check on possible nonlinear effects of covariates. Use \texttt{termplot()} with a \texttt{glm} object as the first argument, and with the argument \texttt{smooth=panel.smooth}.
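For example (a sketch only; it assumes the \texttt{Pima.glm} object fitted above, and the panel layout is an arbitrary choice):
<>=
## One panel per explanatory variable (7 panels for the Pima data)
par(mfrow=c(2,4))
termplot(Pima.glm, partial.resid=TRUE, smooth=panel.smooth)
par(mfrow=c(1,1))
@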
The resulting graphs can be examined for hints of nonlinear covariate effects. Detection of nonlinear effects may require very extensive data.

\section{Data that are More Challenging -- the \texttt{crx} Dataset}

The data can be copied from the web:
<>=
webpage <- "http://mlearn.ics.uci.edu/databases/credit-screening/crx.data"
webn <- "http://mlearn.ics.uci.edu/databases/credit-screening/crx.names"
test <- try(readLines(webpage)[1])
if (!inherits(test, "try-error")){
  download.file(webpage, destfile="crx.data")
  crx <- read.csv("crx.data", header=FALSE, na.strings="?")
  download.file(webn, destfile="crx.names")
}
@ %
Column 16 is the outcome variable. Factors can be identified as follows:
<>=
if(exists("crx")) sapply(crx, function(x)if(is.factor(x))levels(x))
@ %
These data have a number of factor columns. It will be important to understand how they are handled.

\subsection{Factor terms -- contribution to the model matrix}

As with normal theory linear models, the model matrix has an initial column of ones that allows for a constant term. (In all rows, the relevant parameter is multiplied by 1.0, so that the contribution to the fitted value is the same in all rows.) For terms that correspond directly to variables, the model matrix incorporates the variable directly as one of its columns. With the default handling of a factor term:
\begin{itemize}
\item Implicitly there is a column that corresponds to the initial level of the factor, but as its coefficient is set to zero it can be omitted;
\item For the second and any subsequent levels, the model matrix has a column that is zeros except for rows where the factor is at that level.
\end{itemize}
A factor that has only two levels will generate a single column, with 0s corresponding to the first level, and 1s for the second level. The Pima data have, except for the response variable \texttt{type}, no binary variables.
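A small sketch, using a made-up factor (not part of the \texttt{crx} analysis), shows the pattern:
<>=
## Illustrative factor with three levels
f <- factor(c("lo","mid","hi","mid"), levels=c("lo","mid","hi"))
model.matrix(~ f)
## Columns: (Intercept), fmid, fhi -- no column for the first level "lo"
@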
\subsection{Fitting the model}

\noindent
\begin{fmpage}{36pc}
\exhead{7} Now fit a linear discriminant model:
<>=
if(exists("crx")){
  crxRed <- na.omit(crx)
  crxCV.lda <- lda(V16 ~ ., data=crxRed, CV=TRUE)
  confusion(crxRed$V16, crxCV.lda$class)
}
@ %
Note the message
\begin{verbatim}
Warning message:
In lda.default(x, grouping, ...) : variables are collinear
\end{verbatim}
Now, for comparison, fit the model using \texttt{glm()}. This is one way to get details on the reasons for the collinearity. Also, with \texttt{glm()}, the argument \texttt{na.action=na.exclude} is available. This omits missing values when the model is fit, but then places \texttt{NA}s in those positions when fitted values, predicted values, etc., are calculated. This ensures that predicted values match up with the rows of the original data.
<>=
if(exists("crx")){
  crx.glm <- glm(V16 ~ ., data=crx, family=binomial, na.action=na.exclude)
  alias(crx.glm)
  confusion(crx$V16, round(fitted(crx.glm)))
  summary(crx.glm)$coef
}
@ %
From the output from \texttt{alias(crx.glm)}, what can one say about the reasons for the multi-collinearity?
\end{fmpage}

Now display the scores from the linear discriminant calculations:
<>=
if(exists("crx")){
  crxRed <- na.omit(crx)
  crx.lda <- lda(V16 ~ ., data=crxRed)
  crx.hat <- predict(crx.lda)
  densityplot(~crx.hat$x, groups=crxRed$V16)
}
@ %
This plot is actually quite interesting. What does it tell you?

\section{Use of Random Forest Results for Comparison}

A good strategy is to use results from the random forests method for comparison. The accuracy of this algorithm, when it does give a worthwhile improvement over \texttt{lda()}, is often hard to beat. The method has the advantage that it can be applied pretty much automatically, and it is good at handling situations where explanatory variables and factors interact in a relatively complex manner. Here are results with \texttt{Pima.tr} as training data, at the same time applying predictions to \texttt{Pima.te} as test data.
Notice that there are two confusion matrices, one giving the OOB estimates for \texttt{Pima.tr}, and the other for \texttt{Pima.te}.
<>=
library(randomForest)
Pima.rf <- randomForest(type ~ ., xtest=Pima.te[,-8], ytest=Pima.te[,8],
                        data=Pima.tr)
Pima.rf
@ %
Look at the OOB (out-of-bag) estimate of accuracy, which is pretty much equivalent to a cross-validation estimate of accuracy. This error will be similar to the error on test data that are randomly chosen from the same population. Here the accuracy is poorer than for \texttt{lda()}. As before, the error rate is lower on \texttt{Pima.te} than on \texttt{Pima.tr}. Note however the need to re-run the calculation several times, as the accuracy will vary from run to run.

Here are results for \texttt{crx}.
<>=
if(exists("crxRed")){
  crx.rf <- randomForest(V16 ~ ., data=crxRed)
  crx.rf
}
@ %
Accuracy is similar to that from use of \texttt{lda()}.

\section{Note -- The Handling of NAs}

The assumption that underlies any analysis that omits missing values is that, for purposes of the analysis, missingness is uninformative. This may be incorrect, and it is necessary to ask: are the subjects with missing values different in some way? The missing value issue is pertinent both to the Pima data and to the \texttt{crx} data.

There is a further dataset, \texttt{Pima.tr2}, that augments \texttt{Pima.tr} with 100 subjects that have missing values in one or more of the explanatory variables. The question then arises: is the pattern of missingness the same for those without diabetes as for those with diabetes? The following shows the numbers of missing values for each of the variables:
<>=
if(exists("Pima.tr2", where=".GlobalEnv", inherits=FALSE)) rm(Pima.tr2)
sapply(Pima.tr2, function(x)sum(is.na(x)))
sum(!complete.cases(Pima.tr2))
@ %
Note that the variable \texttt{skin} accounts for 98 of the 100 subjects where there is one or more missing value.
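The question just raised, whether the pattern of missingness is the same for those with and without diabetes, can be checked directly by cross-tabulating \texttt{type} against missingness in \texttt{skin} (a sketch; the label \texttt{missSkin} is made up for the illustration):
<>=
## Cross-tabulate diabetes status against missingness in skin
with(Pima.tr2, table(type, missSkin=is.na(skin)))
@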
A first step is to check whether the subjects with one or more missing values differ in some systematic manner from subjects with no missing values. The major issue is for values that are missing for \texttt{skin}. We start by creating a new variable -- here named \texttt{NOskin} -- that distinguishes subjects with missing values for \texttt{skin} from others. We omit observations that are missing on any of the other variables.
<>=
newPima <- subset(Pima.tr2, complete.cases(bp) & complete.cases(bmi))
newPima$NOskin <- factor(is.na(newPima$skin), labels=c("skin","NOskin"))
## NB: FALSE (skin is not NA) precedes TRUE in alphanumeric order
newPima$skin <- NULL   # Omit the column for skin
@ %
The argument \texttt{labels=c("skin","NOskin")} takes the values (here \texttt{FALSE} and \texttt{TRUE}) in alphanumeric order, making \texttt{skin} and \texttt{NOskin} the levels. Omission of this argument would result in levels \texttt{FALSE} and \texttt{TRUE}.\footnote{NB also \texttt{factor(is.na(newPima\$skin), levels=c(TRUE, FALSE))}; levels would then be \texttt{TRUE} and \texttt{FALSE}, in that order.}

We now do a linear discriminant analysis in which the variables other than \texttt{skin} are explanatory variables.
<>=
completeCV.lda <- lda(NOskin ~ npreg+glu+bp+bmi+ped+age+type,
                      data=newPima, CV=TRUE)
confusion(newPima$NOskin, completeCV.lda$class)
@ %
A linear discriminant analysis seems unable to distinguish the two groups. The overall accuracy does not reach what could be achieved by predicting all rows as complete.

\subsection{Does the missingness give information on the diagnosis?}

If there is a suspicion that it does, then a valid analysis may be possible as follows. Missing values of continuous variables are replaced by 0 (or some other arbitrary number). For each such variable, and each observation for which it is missing, there must be a factor, e.g.\ with levels \texttt{miss} and \texttt{nomiss}, that identifies the subject for whom the value is missing.
Where values of several variables are missing for the same subject, the same factor may be used. This allows an analysis in which all variables, together with the newly created factors, are ``present'' for all subjects.
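The device just described can be sketched as follows for \texttt{skin} (the names \texttt{Pima0} and \texttt{missSkin} are illustrative, not from the datasets):
<>=
## Sketch of the missing-indicator device, applied to skin
Pima0 <- Pima.tr2[complete.cases(Pima.tr2[, c("bp","bmi")]), ]
Pima0$missSkin <- factor(ifelse(is.na(Pima0$skin), "miss", "nomiss"))
Pima0$skin[is.na(Pima0$skin)] <- 0   # arbitrary fill-in value
## Both skin and missSkin can now enter a model fitted to Pima0,
## so that the missingness itself is available as a predictor
@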