<>=
options(SweaveHooks=list(fig=function(){par(cex.main=1.1, mar=c(4.1,4.1,2.6,2.1),
                         mgp=c(2.25,0.5,0), tck=-0.02)}))
@

\part{Extending the Linear Model}

Packages: \texttt{DAAG}, \texttt{lattice}, \texttt{splines}, \texttt{MASS}
\vspace*{6pt}

Ideas that will be important in the expansive view of linear models that will now be illustrated include: \textit{basis function}, \textit{factor}, and \textit{interaction}. The reach of R's \textit{model formulae} is wide.

\section{A One-way Classification -- Eggs in the Cuckoo's Nest}
This demonstrates the use of linear models to fit qualitative effects. Like many of nature's creatures, cuckoos have a nasty habit: they lay their eggs in the nests of other birds. First, let's see how egg length changes with the host species. This will use the graphics function \texttt{stripplot()} from the \textit{lattice} package. The data frame \texttt{cuckoos} is in the \textit{DAAG} package.

<>=
par(mfrow=c(1,2))
library(DAAG)
library(lattice)
names(cuckoos)[1] <- "length"
table(cuckoos$species)
stripplot(species ~ length, data=cuckoos)
## Look also at the relationship between length and breadth;
## is it the same for all species?
cuckoo.strip <- stripplot(breadth ~ length, groups=species, data=cuckoos,
                          auto.key=list(columns=3))
print(cuckoo.strip)
par(mfrow=c(1,1))
@

\begin{fmpage}{36pc}
\exhead{1}
Now estimate the means and standard deviations for each of the groups, using direct calculation:
<>=
with(cuckoos, sapply(split(length, species), mean))
with(cuckoos, sapply(split(length, species), sd))
@%
[The function \texttt{split()} splits the lengths into six sublists, one for each species. The function \texttt{sapply()} then applies the specified function (here \texttt{mean()} or \texttt{sd()}) to the vector of lengths in each separate sublist.]
Check that the calculated SD is smallest for those species where the spread of lengths seems, visually, to be smallest.
\end{fmpage}

\begin{fmpage}{36pc}
\exhead{2}
Obtain, for each species, the standard error of the mean.
This is obtained by dividing the standard deviation by the square root of the number of values:
<>=
sdev <- with(cuckoos, sapply(split(length, species), sd))
n <- with(cuckoos, sapply(split(length, species), length))
sdev/sqrt(n)
@
\end{fmpage}

\begin{fmpage}{36pc}
\exhead{3}
Now obtain the means for each of the groups from a model that fits a separate constant term for each species. The model can be specified using several different, but equivalent, formulations, or parameterizations.
\begin{itemize}
\item The following fits the species means directly, as parameters of the model. The \texttt{-1} removes the intercept, so that all parameters are species means:
<>=
lm(length ~ -1 + species, data=cuckoos)
@
\item In the following alternative parameterization, the first parameter estimate is the mean for the first species, while later estimates are differences from the first species. In other words, the first species becomes the baseline.
<>=
lm(length ~ species, data=cuckoos)
@ %
\end{itemize}
Now answer the following:
\begin{enumerate}
\item Use the function \texttt{fitted()} to calculate the fitted values in each case, and check that they are the same. Reconcile the two sets of results.
\item Examine and interpret the model matrices for the two different parameterizations.
\item Use the \texttt{termplot()} function to show the effects of the different factor levels. Be sure to call the function with \texttt{partial.resid=TRUE} and with \texttt{se=TRUE}. Does the variability seem similar for all host species?
\end{enumerate}
\end{fmpage}

\begin{fmpage}{36pc}
\exhead{4}
The following creates (in the first line) and uses (second line) a function that calculates the standard error of the mean.
<>=
se <- function(x)sd(x)/sqrt(length(x))
with(cuckoos, sapply(split(length, species), se))
@
\end{fmpage}

\begin{fmpage}{36pc}
\exhead{4, continued}
The standard error calculation can be done in a single line of code, without formally creating the function \texttt{se()}.
Instead, the function definition can be inserted as an anonymous function, so that it does not need a name. The function definition is inserted where the function name would otherwise appear:
<>=
with(cuckoos, sapply(split(length, species),
                     function(x)sd(x)/sqrt(length(x))))
@
\end{fmpage}

\section{Regression Splines -- One Explanatory Variable}

This pursues the use of smoothing methods, here formally within a regression context.

\begin{fmpage}{36pc}
\exhead{5}
The following is based on the \texttt{fruitohms} data frame (in \textit{DAAG}).
\begin{enumerate}
\item Plot \texttt{ohms} against \texttt{juice}.
\item Try the following:
<>=
plot(ohms ~ juice, data=fruitohms)
with(fruitohms, lines(lowess(juice, ohms)))
with(fruitohms, lines(lowess(juice, ohms, f=0.2), col="red"))
@%
Which of the two fitted curves best captures the pattern of change?
\item Now fit a natural regression spline. Try for example:
<>=
library(splines)
plot(ohms ~ juice, data=fruitohms)
hat <- with(fruitohms, fitted(lm(ohms ~ ns(juice, 4))))
with(fruitohms, lines(juice, hat, col=3))
## Check the locations of the internal knots
attributes(ns(fruitohms$juice, 4))$knots
@%
Experiment with different choices for the number of degrees of freedom. How many degrees of freedom seem needed to adequately capture the pattern of change? Plot the spline basis functions. Add vertical lines to the plot that show the knot locations.
\end{enumerate}
\end{fmpage}

\begin{fmpage}{36pc}
\exhead{6}
In the \texttt{geophones} data frame (\textit{DAAG}), plot \texttt{thickness} against \texttt{distance}. Use regression splines, as in Exercise 5, to fit a suitable curve through the data. How many degrees of freedom seem needed? Add vertical lines to the plot that show the locations of the knots.
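As a sketch of how the knots might be marked (this assumes, as in Exercise 5, a natural spline basis from the \textit{splines} package; the choice of 4 degrees of freedom is an arbitrary starting point, not a recommendation):
<>=
library(splines)
plot(thickness ~ distance, data=geophones)
geo.ns <- lm(thickness ~ ns(distance, 4), data=geophones)
## Sort on distance before joining the fitted values
ord <- order(geophones$distance)
with(geophones, lines(distance[ord], fitted(geo.ns)[ord], col=2))
## ns() stores the internal knots as an attribute of the basis matrix
abline(v=attributes(ns(geophones$distance, 4))$knots, lty=2)
@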
\end{fmpage}

\vspace*{8pt}

\section{Regression Splines -- Two or More Explanatory Variables}

We begin by using the \texttt{hills2000} data (\textit{DAAG} version 0.84 or later) to fit a model that is linear in \texttt{log(dist)} and \texttt{log(climb)}, thus
<>=
lhills2k <- log(hills2000[, 2:4])
names(lhills2k) <- c("ldist", "lclimb", "ltime")
lhills2k.lm <- lm(ltime ~ ldist + lclimb, data = lhills2k)
@ %
Use \texttt{termplot()} to check departures from linearity in \texttt{lhills2k.lm}. Note whether there seem to be any evident outliers.

\begin{fmpage}{36pc}
\exhead{7}
Now use regression splines to take out the curvature that was evident when \texttt{ltime} was modeled as a linear function of \texttt{ldist} and \texttt{lclimb}. Use \texttt{termplot()} to guide your choice of degrees of freedom for the spline bases. For example, you might try
<>=
lhills2k.ns <- lm(ltime ~ ns(ldist,2) + lclimb, data = lhills2k)
@
Again, examine the diagnostic plots. Are there any points that should perhaps be regarded as outliers? Does a natural spline with 2 degrees of freedom in \texttt{ldist} seem to give any benefit over a polynomial of degree 2?
\end{fmpage}

Note: The coefficients of the spline basis terms do not give useful information on how many degrees of freedom are required for the spline curve. See Exercise 9 below. If one fits a spline with 3 degrees of freedom to a relationship that is essentially linear, all three coefficients are likely to be highly significant. Rather, check how the residual sum of squares changes as the number of degrees of freedom for the spline curve increases. [F-tests are not strictly valid, as successive models are not nested (the basis functions change), but they may be a helpful guide.]

\begin{fmpage}{36pc}
\exhead{8}
The \textit{MASS} package has the function \texttt{lqs()} that can be used for a \textit{resistant} regression fit, i.e., the effect of outlying points is attenuated.
Try the following
<>=
library(MASS)
lhills2k.lqs <- lqs(ltime ~ ns(ldist,2) + lclimb, data = lhills2k)
plot(resid(lhills2k.lqs) ~ fitted(lhills2k.lqs))
big4 <- order(abs(resid(lhills2k.lqs)), decreasing=TRUE)[1:4]
text(fitted(lhills2k.lqs)[big4], resid(lhills2k.lqs)[big4],
     labels=rownames(lhills2k)[big4], pos=4)
@ %
Try the plot without the two largest outliers. Does this make any difference of consequence to the fitted values?
\end{fmpage}

\begin{fmpage}{36pc}
\exhead{9}
Try the following:
\begin{verbcode}
x <- 11:20
y <- 5 + 1.25*x + rnorm(10)
summary(lm(y ~ ns(x,2)))$coef
summary(lm(y ~ ns(x,3)))$coef
summary(lm(y ~ ns(x,4)))$coef
\end{verbcode}
Note that the coefficients are in all cases very much larger than their standard errors. With 2 degrees of freedom, both spline basis terms, additional to the constant, are needed to fit a line. With 3 degrees of freedom, all three terms are required, and so on! Splines do not give a parsimonious representation, if the form of the model is known.
\end{fmpage}

\section{Errors in Variables}

\begin{fmpage}{36pc}
\exhead{10}
Run the accompanying function \texttt{errorsINx()} several times. Comment on the results. The underlying relationship between $y$ and $x$ is the same in all cases. The error in $x$ is varied, from values of $x$ that are exact to values of $x$ that have random errors added with a standard deviation that ranges up to twice the standard deviation of $x$.
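As a pointer to what to look for, the following minimal sketch (independent of \texttt{errorsINx()}; the sample size, slope, and seed are arbitrary) shows how error in $x$ attenuates the fitted slope towards zero:
<>=
set.seed(29)   ## arbitrary seed, for reproducibility
x <- rnorm(500)
y <- 5 + 2.5*x + rnorm(500)
coef(lm(y ~ x))["x"]          ## close to the true slope, 2.5
xerr <- x + rnorm(500)        ## error SD equal to SD(x)
coef(lm(y ~ xerr))["xerr"]    ## attenuated, by a factor of roughly 0.5
@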
\end{fmpage}

\subsection{Function used}
This is available from \url{http://www.maths.anu.edu.au/~johnm/r/functions/}
\begin{verbatim}
"errorsINx" <-
function(mu = 8.25, n = 100, a = 5, b = 2.5, SDx=1, sigma = 2,
         timesSDx=(1:5)/2.5){
    mat <- matrix(0, nrow=n, ncol=length(timesSDx)+2)
    x0 <- mu*exp(rnorm(n, 0, SDx/mu))
    y <- a + b*x0 + rnorm(n, 0, sigma)
    mat[, 2] <- x0
    mat[, 1] <- y
    sx <- sd(x0)
    k <- 2
    for(i in timesSDx){
        k <- k+1
        xWITHerror <- x0 + rnorm(n, 0, sx*i)
        mat[, k] <- xWITHerror
    }
    df <- as.data.frame(mat)
    names(df) <- c("y", "x", paste("x", timesSDx, sep=""))
    df
}
## Now use the function to simulate y vs x relationships, with several
## different values of timesSDx, which specifies the ratio of the SD of
## the error in x to SD(x)
oldpar <- par(mar=c(3.6,3.1,1.6,0.6), mgp=c(2.5,0.75,0),
              oma=c(1,1,0.6,1), mfrow=c(2,3), pty="s")
mu <- 20; n <- 100; a <- 15; b <- 2.5; sigma <- 12.5
timesSigma <- (1:5)/2.5
mat <- errorsINx(mu = mu, n = n, a = a, b = b, sigma = sigma,
                 timesSDx = timesSigma)
beta <- numeric(dim(mat)[2]-1)
sx <- sd(mat[, 2])
y <- mat[, 1]
for(j in 1:length(beta)){
    xj <- mat[, j+1]
    plot(y ~ xj, xlab="", ylab="", col="gray30")
    if(j==1) mtext(side=3, line=0.5, "No error in x") else{
        xm <- timesSigma[j-1]
        mtext(side=3, line=0.5, substitute(tau == xm*s[z], list(xm=xm)))
    }
    if(j>=4)mtext(side=1, line=2, "x")
    if(j%%3 == 1)mtext(side=2, line=2, "y")
    errors.lm <- lm(y ~ xj)
    abline(errors.lm)
    beta[j] <- coef(errors.lm)[2]
    bigsigma <- summary(errors.lm)$sigma
    print(bigsigma/sigma)
    abline(a, b, lty=2)
}
print(round(beta, 3))
\end{verbatim}

\begin{verbatim}
plotIntersalt <-
function (dset = intersalt1, figno = 2)
{
    oldpar <- par(oma = c(6.5, 0, 0, 0), mar = par()$mar - c(0, 0, 3.5, 0))
    on.exit(par(oldpar))
    lowna <- c(4, 5, 24, 28)
    plot(dset$na, dset$bp, pch = 15, ylim = c(50, 85),
         xlab = "Median sodium excretion (mmol/24hr)",
         ylab = "Median diastolic BP (mm Hg)", type = "n")
    points(dset$na[-lowna], dset$bp[-lowna], pch = 16)
    points(dset$na[lowna], dset$bp[lowna], pch = 1, lwd = 3)
    u <- lm(bp ~ na, data = dset)
    abline(u$coef[1], u$coef[2])
    figtxt <- paste("Fig. ", figno,
                    ": Plot of median blood pressure versus salt",
                    "\n(measured by sodium excretion) for 52 human",
                    "\npopulations. Four results (open circles) are for",
                    "\nnon-industrialised societies with very low salt intake,",
                    "\nwhile other results are for industrialised societies.",
                    sep = "")
    mtext(side = 1, line = 6, figtxt, cex = 1.1, adj = 0, at = -20)
}
\end{verbatim}
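The \texttt{intersalt1} data frame itself is not supplied with this document. To check that \texttt{plotIntersalt()} runs, a small artificial data frame with the column names (\texttt{na}, \texttt{bp}) that the function expects can stand in for it:
\begin{verbatim}
## Artificial stand-in for intersalt1 (52 rows, as in the figure caption);
## the values are made up, for checking the function only
fakeIntersalt <- data.frame(na = seq(5, 240, length.out=52),
                            bp = 60 + 0.05*seq(5, 240, length.out=52) +
                                 rnorm(52, 0, 3))
plotIntersalt(dset = fakeIntersalt, figno = 2)
\end{verbatim}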