<<>>=
options(SweaveHooks=list(fig=function(){par(cex.main=1.1,
mar=c(4.1,4.1,2.6,2.1),
mgp=c(2.25,0.5,0), tck=-0.02)}))
@
\part{Extending the Linear Model}
Package: \texttt{DAAG}.
\vspace*{6pt}
Ideas that will be important in the expansive view of linear models
that will now be illustrated include: \textit{basis function},
\textit{factor}, and \textit{interaction}. The reach of R's
\textit{model formulae} is wide.
\section{A One-way Classification -- Eggs in the Cuckoo's Nest}
This demonstrates the use of linear models to fit qualitative effects.
Like many of nature's creatures, cuckoos have a nasty habit. They lay
their eggs in the nests of other birds. First, let's see how egg
length changes with the host species. This will use the graphics
function \texttt{stripplot()} from the \textit{lattice} package.
The data frame \texttt{cuckoos} is in the \textit{DAAG} package.
<<>>=
par(mfrow=c(1,2))
library(DAAG)
library(lattice)
names(cuckoos)[1] <- "length"
table(cuckoos$species)
stripplot(species ~ length, data=cuckoos)
## Look also at the relationship between length and breadth;
## is it the same for all species?
cuckoo.strip <- stripplot(breadth ~ length, groups=species,
data=cuckoos, auto.key=list(columns=3))
print(cuckoo.strip)
par(mfrow=c(1,1))
@
\begin{fmpage}{36pc}
\exhead{1}
Now estimate the means and standard deviations for each of the groups,
using direct calculation:
<<>>=
with(cuckoos, sapply(split(length, species), mean))
with(cuckoos, sapply(split(length, species), sd))
@%
[The function \texttt{split()} splits the lengths into six sublists,
one for each species. The function \texttt{sapply()} then applies the
specified function (\texttt{mean} or \texttt{sd}) to the vector of
lengths in each sublist.]
Check that the computed SDs are smallest for those species where,
visually, the scatter in the lengths seems smallest.
\end{fmpage}
\begin{fmpage}{36pc}
\exhead{2}
Obtain, for each species, the standard error of the mean. This is
obtained by dividing the standard deviation by the square root of the
number of values:
<<>>=
sdev <- with(cuckoos, sapply(split(length, species), sd))
n <- with(cuckoos, sapply(split(length, species), length))
sdev/sqrt(n)
@
\end{fmpage}
\begin{fmpage}{36pc}
\exhead{3} Now obtain the means for each of the groups
from a model that fits a separate constant term for each species.
The model can be specified using several different, but equivalent,
formulations, or parameterizations.
\begin{itemize}
\item The following estimates the species means directly, as the
  parameters of the model. The \texttt{-1} removes the intercept,
  forcing each parameter to be a species mean:
<<>>=
lm(length ~ -1 + species, data=cuckoos)
@
\item In the following alternative
parameterization, the first parameter estimate is the mean for the
first species, while later estimates are differences from the first
species. In other words, the first species becomes the baseline.
<<>>=
lm(length ~ species, data=cuckoos)
@ %
\end{itemize}
Now answer the following:
\begin{enumerate}
\item Use the function \texttt{fitted()} to calculate the fitted values in
each case, and check that they are the same. Reconcile the two sets
of results.
\item Examine and interpret the model matrices for the two different
parameterizations.
\item Use the \texttt{termplot()} function to show the effects of the
different factor levels. Be sure to call the function with
\texttt{partial.resid=TRUE} and with \texttt{se=TRUE}.
Does the variability seem similar for all host species?
\end{enumerate}
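A possible starting point (a sketch only; the two models are the
parameterizations given above):
<<eval=FALSE>>=
cuckoos.lm0 <- lm(length ~ -1 + species, data=cuckoos)
cuckoos.lm <- lm(length ~ species, data=cuckoos)
## 1: fitted values should agree across the two parameterizations
all.equal(fitted(cuckoos.lm0), fitted(cuckoos.lm))
## 2: compare the model matrices
head(model.matrix(cuckoos.lm0))
head(model.matrix(cuckoos.lm))
## 3: factor-level effects, with partial residuals and standard errors
termplot(cuckoos.lm, partial.resid=TRUE, se=TRUE)
@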
\end{fmpage}
\begin{fmpage}{36pc}
\exhead{4} The following creates (in the first line) and uses
(second line) a function that calculates the standard error of the
mean.
<<>>=
se <- function(x)sd(x)/sqrt(length(x))
with(cuckoos, sapply(split(length, species), se))
@
\end{fmpage}
\begin{fmpage}{36pc}
\exhead{4, continued} The standard error calculation can be done in
a single line of code, without formally creating the function
\texttt{se()}. Instead, the function definition can be inserted as
an anonymous function, so that it does not need a name. The
function definition is inserted where the function name would
otherwise appear:
<<>>=
with(cuckoos, sapply(split(length, species),
function(x)sd(x)/sqrt(length(x))))
@
\end{fmpage}
\section{Regression Splines -- one explanatory variable}
This pursues the use of smoothing methods, here formally within
a regression context.
\begin{fmpage}{36pc}
\exhead{5}
The following is based on the \texttt{fruitohms} data frame
(in \textit{DAAG}).
\begin{enumerate}
\item Plot \texttt{ohms} against \texttt{juice}.
\item Try the following:
<<>>=
plot(ohms ~ juice, data=fruitohms)
with(fruitohms, lines(lowess(juice, ohms)))
with(fruitohms, lines(lowess(juice, ohms, f=0.2), col="red"))
@%
Which of the two fitted curves best captures the pattern of change?
\item Now fit a natural regression spline. Try for example:
<<>>=
library(splines)
plot(ohms ~ juice, data=fruitohms)
hat <- with(fruitohms, fitted(lm(ohms ~ ns(juice, 4))))
with(fruitohms, lines(juice,hat, col=3))
## Check the locations of the internal knots.
attributes(ns(fruitohms$juice, 4))$knots
@%
Experiment with different choices for the number of degrees of
freedom. How many degrees of freedom seem needed to adequately
capture the pattern of change? Plot the spline basis functions.
Add vertical lines to the plot that show the knot locations.
\end{enumerate}
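One way to plot the basis functions and mark the knots (a sketch;
it assumes \textit{DAAG} and \textit{splines} are already loaded):
<<eval=FALSE>>=
basis <- ns(fruitohms$juice, df=4)
ord <- order(fruitohms$juice)
## Plot each of the four basis functions against juice content
matplot(fruitohms$juice[ord], basis[ord, ], type="l", lty=1,
        xlab="juice", ylab="Basis function value")
## Vertical lines at the internal knots
abline(v=attributes(basis)$knots, lty=2)
@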
\end{fmpage}
\begin{fmpage}{36pc}
\exhead{6}
In the \texttt{geophones} data frame (\textit{DAAG}), plot
\texttt{thickness} against \texttt{distance}. Use regression splines,
as in Exercise 5, to fit a suitable curve through the data.
How many degrees of freedom seem needed? Add vertical lines
to the plot that show the locations of the knots.
\end{fmpage}
\vspace*{8pt}
\section{Regression Splines -- Two or More Explanatory Variables}
We begin by using the \texttt{hills2000} data (\textit{DAAG}
version 0.84 or later) to fit a model that is linear in
\texttt{log(dist)} and \texttt{log(climb)}, thus
<<>>=
lhills2k <- log(hills2000[, 2:4])
names(lhills2k) <- c("ldist", "lclimb", "ltime")
lhills2k.lm <- lm(ltime ~ ldist + lclimb, data = lhills2k)
@ %
Use \texttt{termplot()} to check departures from linearity in
\texttt{lhills2k.lm}. Note whether there seem to be any evident outliers.
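For example (a sketch; \texttt{termplot()} gives one panel per term):
<<eval=FALSE>>=
par(mfrow=c(1,2))
termplot(lhills2k.lm, partial.resid=TRUE, se=TRUE)
par(mfrow=c(1,1))
@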
\begin{fmpage}{36pc}
\exhead{7}
Now use regression splines to take out the curvature that
was evident when \texttt{ltime} was modeled as a linear function
of \texttt{ldist} and \texttt{lclimb}. Use \texttt{termplot()}
to guide your choice of degrees of freedom for the spline bases.
For example, you might try
<<>>=
lhills2k.ns <- lm(ltime ~ ns(ldist,2) + lclimb, data = lhills2k)
@
Again, examine the diagnostic plots. Are there any points that
should perhaps be regarded as outliers?
Does a natural spline with 2 degrees of freedom in \texttt{ldist} seem
to give any benefit over a polynomial of degree 2?
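One way to make the comparison is to fit both models and compare
residual sums of squares (a sketch):
<<eval=FALSE>>=
lhills2k.poly <- lm(ltime ~ poly(ldist, 2) + lclimb, data = lhills2k)
## A smaller residual sum of squares indicates the closer fit
c(ns = deviance(lhills2k.ns), poly = deviance(lhills2k.poly))
plot(lhills2k.ns)   # the usual diagnostic plots
@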
\end{fmpage}
Note: The coefficients of the spline basis terms do not give useful
information on what degree of spline curve is required. See exercise
9 below. If one fits a spline of degree 3 to a relationship that is
essentially linear, all three coefficients are likely to be highly
significant. Rather, check how the residual sum of squares changes as
the number of degrees of freedom for the spline curve increases.
[F-tests are not strictly valid, as successive models are not nested
(the basis functions change), but they may be a helpful guide.]
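The change in the residual sum of squares can be checked along the
following lines (a sketch):
<<eval=FALSE>>=
## Residual sum of squares as the spline degrees of freedom increase
sapply(1:5, function(d)
       deviance(lm(ltime ~ ns(ldist, d) + lclimb, data = lhills2k)))
@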
\begin{fmpage}{36pc}
\exhead{8}
The \textit{MASS} package has the function \texttt{lqs()}
that can be used for a \textit{resistant} regression
fit, i.e., the effect of outlying points is attenuated.
Try the following:
<<>>=
library(MASS)
lhills2k.lqs <- lqs(ltime ~ ns(ldist,2) + lclimb, data = lhills2k)
plot(resid(lhills2k.lqs) ~ fitted(lhills2k.lqs))
big4 <- order(abs(resid(lhills2k.lqs)), decreasing=TRUE)[1:4]
text(fitted(lhills2k.lqs)[big4], resid(lhills2k.lqs)[big4],
     labels=rownames(lhills2k)[big4], pos=4)
@ %
Try the plot without the two largest outliers. Does this make
any difference of consequence to the fitted values?
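A possible approach (a sketch; \texttt{big2}, introduced here, holds
the indices of the two largest residuals):
<<eval=FALSE>>=
big2 <- order(abs(resid(lhills2k.lqs)), decreasing=TRUE)[1:2]
lhills2k.lqs2 <- lqs(ltime ~ ns(ldist, 2) + lclimb,
                     data = lhills2k[-big2, ])
## Compare fitted values at the retained points
plot(fitted(lhills2k.lqs)[-big2], fitted(lhills2k.lqs2),
     xlab = "Fitted, all points", ylab = "Fitted, outliers omitted")
abline(0, 1)
@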
\end{fmpage}
\begin{fmpage}{36pc}
\exhead{9}
Try the following:
\begin{verbcode}
x <- 11:20
y <- 5 + 1.25*x+rnorm(10)
summary(lm(y ~ ns(x,2)))$coef
summary(lm(y ~ ns(x,3)))$coef
summary(lm(y ~ ns(x,4)))$coef
\end{verbcode}
Note that the coefficients are in all cases very much larger than their
standard errors. It takes both of the degree 2 spline basis terms,
additional to the constant, to fit a line. All three degree 3 terms are
required, and so on! Splines do not give a parsimonious representation
if the form of the model is known.
\end{fmpage}
\section{Errors in Variables}
\begin{fmpage}{36pc}
\exhead{10} Run the accompanying function \texttt{errorsINx()}
several times. Comment on the results. The underlying relationship
between $y$ and $x$ is the same in all cases. The error in $x$ is
varied, from values of $x$ that are exact to values of $x$ that have
random errors added with a standard deviation up to twice the
standard deviation of $x$.
\end{fmpage}
\subsection{Function used}
This is available from
\url{http://www.maths.anu.edu.au/~johnm/r/functions/}
\begin{verbatim}
"errorsINx" <-
function(mu = 8.25, n = 100, a = 5, b = 2.5, SDx=1, sigma = 2,
timesSDx=(1:5)/2.5){
mat <- matrix(0, nrow=n, ncol=length(timesSDx)+2)
x0 <- mu*exp(rnorm(n,0,SDx/mu))/exp(0)
y <- a + b*x0+rnorm(n,0,sigma)
mat[, length(timesSDx)+2] <- y
mat[,2] <- x0
mat[,1] <- y
sx <- sd(x0)
k <- 2
for(i in timesSDx){
k <- k+1
xWITHerror <- x0+rnorm(n, 0, sx*i)
mat[, k] <- xWITHerror
}
df <- as.data.frame(mat)
names(df) <- c("y", "x", paste("x",timesSDx,sep=""))
df
}
## Now use the function to simulate y vs x relationships, with several
## different values of timesSDx, which specifies the ratio of the
## SD of the added errors in x to SD(x)
oldpar <- par(mar=c(3.6,3.1,1.6,0.6), mgp=c(2.5,0.75,0),
oma=c(1,1,0.6,1),
mfrow=c(2,3), pty="s")
mu <- 20; n <- 100; a <- 15; b <- 2.5; sigma <- 12.5; timesSigma<-(1:5)/2.5
mat <- errorsINx(mu = 20, n = 100, a = 15, b = 2.5, sigma = 5,
timesSDx=(1:5)/2.5)
beta <- numeric(dim(mat)[2]-1)
sx <- sd(mat[,2])
y <- mat[, 1]
for(j in 1:length(beta)){
xj <- mat[,j+1]
plot(y ~ xj, xlab="", ylab="", col="gray30")
if(j==1)
mtext(side=3, line=0.5, "No error in x") else{
xm <- timesSigma[j-1]
mtext(side=3, line=0.5, substitute(tau == xm*s[z], list(xm=xm)))
}
if(j>=4)mtext(side=1, line=2, "x")
if(j%%3 == 1)mtext(side=2, line=2, "y")
errors.lm <- lm(y ~ xj)
abline(errors.lm)
beta[j] <- coef(errors.lm)[2]
bigsigma <- summary(errors.lm)$sigma
print(bigsigma/sigma)
abline(a, b, lty=2)
}
print(round(beta, 3))
\end{verbatim}
\begin{verbatim}
plotIntersalt <-
function (dset = intersalt1, figno = 2)
{
oldpar <- par(oma = c(6.5, 0, 0, 0), mar = par()$mar - c(0,
0, 3.5, 0))
on.exit(par(oldpar))
lowna <- c(4, 5, 24, 28)
plot(dset$na, dset$bp, pch = 15, ylim = c(50, 85),
xlab = "Median sodium excretion (mmol/24hr)",
ylab = "Median diastolic BP (mm Hg)", type = "n")
points(dset$na[-lowna], dset$bp[-lowna], pch = 16)
points(dset$na[lowna], dset$bp[lowna], pch = 1, lwd = 3)
u <- lm(bp ~ na, data = dset)
abline(u$coef[1], u$coef[2])
figtxt <-
paste("Fig. ", figno, ": Plot of median blood pressure versus salt",
            "\n(measured by sodium excretion) for 52 human",
            "\npopulations. Four results (open circles) are for",
            "\nnon-industrialised societies with very low salt intake,",
            "\nwhile other results are for industrialised societies.",
sep = "")
mtext(side = 1, line = 6, figtxt, cex = 1.1, adj = 0, at = -20)
}
\end{verbatim}