<>=
options(SweaveHooks=list(fig=function(){par(cex.main=1.1,
mar=c(4.1,4.1,2.6,2.1),
mgp=c(2.25,0.5,0), tck=-0.02)}))
graphics.off()
@
\part{Multi-level Models}
\section{Description and Display of the Data}
\subsection{Description}
This laboratory will work with data on corn yields from the Caribbean
islands of Antigua and St Vincent. Data are yields from packages on
eight sites on the Caribbean island of Antigua. The data frames
\texttt{ant111b} and \texttt{vince111b} hold yields for the standard
treatment, here identified as \texttt{111}, for sites on Antigua and
St Vincent respectively. Additionally, there will be some use of the
more extensive data in the data frame \texttt{antigua}. All three
data frames are in recent versions ($\geq$ 0.84) of the \textit{DAAG}
package. See \texttt{help(ant11b)} for details of the source of these
data.
The data frame \texttt{ant111b} has data for $n$=4 packages of land at
each of eight sites, while \texttt{vince111b} data for four packages
at each of nine sites. As will be described below, two possible
predictions are:
\begin{enumerate}
\item Predictions for new packages of land in one of the existing sites.
\item Predictions for new packages in a new site.
\end{enumerate}
The accuracies for the second type of prediction may be much less
accurate than for the first type. A major purpose of this laboratory
is to show how such differences in accuracy can be modeled.
\subsection{Display}
We begin by examining plots, for the treatment \texttt{111}, from the
combined data for the two islands. This information for the separate
islands is summarized in the datasets \texttt{ant111b} and
\texttt{vince111b} in the \textit{DAAG} package.
A first step is to combine common columns of \texttt{ant111b} and
\texttt{vince111b} into the single data frame \texttt{corn111b}.
<>=
library(lattice)
library(DAAG)
corn111b <- rbind(ant111b[,-8], vince111b)
corn111b$island <- c("Antigua", "StVincent")[corn111b$island]
@ %
<<2panels, echo=f>>=
corn.strip1 <- stripplot(site ~ harvwt | island, data = corn111b,
xlab="Harvest weight")
@
<<2panels-free, echo=f>>=
corn.strip2 <- stripplot(site ~ harvwt | island, data = corn111b,
xlab="Harvest weight",
scale=list(y=list(relation="free")))
@
<>=
corn.strip3 <- stripplot(site ~ harvwt, data = corn111b, groups=island,
xlab="Harvest weight", auto.key=list(columns=2))
@
\begin{itemize}
\item The following plot uses different panels for the two islands:
<<2islands-2panels, echo=t, eval=f>>=
<<2panels>>
@
\item The following plot uses different panels for the two islands, but
allows separate ("free" = no relation) vertical scales for the two plots.
<<2islands-2panels-free, eval=f>>=
<<2panels-free>>
@
\item The following uses a single panel, but uses different colours
(or, on a black and white device, different symbols) to distinguish
the two islands. Notice the use of \texttt{auto.key} to generate an
automatic key:
<<<1panel-color-island, echo=t, eval=f>>=
<>
@
\end{itemize}
\setkeys{Gin}{width=\textwidth}
\begin{figure}[H]
\begin{center}
<>=
print(corn.strip1, position=c(0, 0.525, 0.475,1))
print(corn.strip2, position=c(0, 0, 1, 0.475), newpage=FALSE)
print(corn.strip3, position=c(0.525, 0.525, 1, 1), newpage=FALSE)
@
\caption{Yields for the four packages of corn on
sites on the islands of Antigua and St Vincent.}
\label{fig:ant111}
\end{center}
\end{figure}
Next, we will obtain package means for the Antiguan data, for all treatments.
<>=
with(antigua, # antigua is from the DAAG package
antp <<- aggregate(harvwt, by = list(site = site,
package = block,
trt = trt),
FUN = mean)
)
names(antp)[4] <- "harvwt"
@ %
Notice the use of the version \verb!<<-! of the assignment symbol to ensure
that assignment takes place in the workspace.
Now plot mean harvest weights for each treatment, by site:
\setkeys{Gin}{width=\textwidth}
\begin{figure}[H]
\begin{center}
<>=
antp.trt <- stripplot(trt ~ harvwt|site, groups=trt, data=antp)
print(antp.trt)
@
\caption{Yields for the four packages of corn on each
of eight sites on the island of Antigua.}
\label{fig:ant111-sites}
\end{center}
\end{figure}
\subsubsection*{Questions and Exercises}
\begin{enumerate}
\item Which set of sites (Antigua or St Vincent) shows the largest
yields?
\item Create a plot that compares the logarithms of the yields,
within and between sites on the two islands. From this plot,
what, if anything, can you say about the differet variabilities
in yield, within and between sites on each island?
\end{enumerate}
\section{Multi-level Modeling}
\paragraph{*Analysis using \textit{lme}:}
The modeling command takes the form:
<>=
library(nlme)
ant111b.lme <- lme(fixed=harvwt ~ 1, random = ~1|site,
data=ant111b)
@
The only fixed effect is the overall mean. The argument
\verb!random = ~1|site! fits random variation between sites.
Variation between the individual units that are nested within sites,
i.e., between packages, are by default treated as random. Here is the
default output:
<>=
options(digits=4)
ant111b.lme
@
Notice that \textit{lme} gives, not the components of variance, but
the standard deviations (\texttt{StdDev}) which are their square
roots. Observe that, according to \textit{lme},
$\widehat{\sigma_B^2}$ = 0.76$^2$ = 0.578, and
$\widehat{\sigma_L^2}$ = 1.539$^2$ = 2.369.
The variance for an individual package is $\widehat{\sigma_B^2}$
+ $\widehat{\sigma_L^2}$.
\newline
{\small Those who are familiar with an analysis of variance table for
such data should note that \textit{lme} does not give the mean
square at any level higher than level 0, not even in this balanced
case.}
Note that the yields are not independent between different packages on
the same site, in the population that has packages from multiple
sites. (Conditional on coming from one particular site, package
yields are however independent.)
The take-home message from this analysis is:
\begin{itemize}
\item[o] For prediction for a new package at one of the existing sites,
the standard error is 0.76
\item[o] For prediction for a new package at a new site,
the standard error is $\sqrt{1.539^2+.76^2} = 1.72$
\item[o] For prediction of the mean of $n$ packages at a new site,
the standard error is $\sqrt{1.539^2+ 0.76^2/n}$. This is NOT
inversely proportional to $n$, as would happen if the yields
were independent within sites.
\end{itemize}
Where there are multiple levels of variation, the predictive accuracy
can be dramatically different, depending on what is to be predicted.
Similar issues are arise in repeated measures contexts, and in time
series. Repeated measures data has multiple profiles, i.e., many small
time series.
\subsection{Simulation}
The following function simulates results from a multilevel model
for the case where there are \texttt{npackages} packages at each
of \texttt{nplots} plots.
<>=
"simMlevel" <-
function(nsites=8, npackages=4, mu=4, sigmaL=1.54, sigmaB=0.76){
facSites <- factor(1:nsites)
facPackages <- factor(1:npackages)
dframe <- expand.grid(facPackages=facPackages, facSites=facSites)
nall <- nsites*npackages
siteEffects <- rnorm(nsites, 0, sigmaL)
err <- rnorm(nall, 0, sigmaB) + siteEffects[unclass(dframe$facSites)]
dframe$yield <- mu+err
dframe
}
@ %
The default arguments are \texttt{sigmaB} = 0.76 and \texttt{sigma}
= 1.54, as for the Antiguan data.
\subsection{Questions and Exercises}
\begin{enumerate}
\item Repeat the analysis
\begin{itemize}
\item[(a)] for the Antiguan data, now using a logarithmic scale.
\item[(b)] for the St Vincent data, using a logarithmic scale.
\end{itemize}
\item Overlay plots, for each of the two islands, that show how the
variance of the mean can be expected to change with the number of
packages $n$.
\item Are there evident differences between islands in the
contributions of the two components of variance? What are the
practical implications that flow from such differences as you may
observe?
\item Use the function \texttt{simMlevel()} to simulate a new set of data,
using the default arguments. Analyse the simulated data. Repeat this
exercise 25 or more times. How closely do you reproduce the values of
\texttt{sigmaL}=1.54 and \texttt{sigmaB}=0.76 that were used for the
simulation?
\end{enumerate}
\section{Multi-level Modeling -- Attitudes to Science Data}
These data are from in the \textit{DAAG} package for R. The data are
measurements of attitudes to science, from a survey where there were
results from 20 classes in 12 private schools and 46 classes in 29
public (i.e. state) schools, all in and around Canberra,
Australia. Results are from a total of 1385 year 7 students. The
variable \verb!like! is a summary score based on two of the
questions. It is on a scale from 1~(dislike) to 12~(like). The number
in each class from whom scores were available ranged from 3 to 50,
with a median of 21.5.
There are three variance components:
\vspace*{2pt}
\begin{verbatim}
Between schools 0.00105
Between classes 0.318
Between students 3.05
\end{verbatim}
The between schools component can be neglected. The variance for a
class mean is $0.318 + 3.05/n$, where $n$ is the size of the class.
The two contributions are about equal when $n$ =10.
\section{*Additional Calculations}
We return again to the corn yield data.
\paragraph{Is variability between packages similar at all sites?:}
<>=
if(dev.cur()==2)invisible(dev.set(3))
vars <- sapply(split(ant111b$harvwt, ant111b$site), var)
vars <- vars/mean(vars) # Standardize to a variance of one
qqplot(qchisq(ppoints(vars),3), 3*vars)
@
\paragraph{Does variation within sites follow a normal distribution?:}
<>=
qqnorm(residuals(ant111b.lme))
@
\paragraph{What is the pattern of variation between sites?}
<>=
locmean <- sapply(split(log(ant111b$harvwt), ant111b$site), mean)
qqnorm(locmean)
@
The distribution seems remarkably close to normal.
\paragraph{Fitted values and residuals in \textit{lme}:}
By default fitted values account for all random effects, except those
at level 0. In the example under discussion
\texttt{fitted(ant111b.lme)} calculates fitted values at level 1,
which can be regarded as estimates of the site means. They are not
however the site means, as the graph given by the following
calculation demonstrates:
<>=
hat.lm <- fitted(lm(harvwt ~ site, data=ant111b))
hat.lme <- fitted(ant111b.lme) # By default, level=1)
plot(hat.lme ~ hat.lm, xlab="Site means",
ylab="Fitted values (BLUPS) from lme")
abline(0,1,col="red")
@
The fitted values are known as BLUPs (Best Linear Unbiased
Predictors). Relative to the site means, they are pulled in
toward the overall mean. The most extreme site means will on
average, because of random variation, be more extreme than the
corresponding ``true'' means for those sites. There is a
theoretical result that gives the factor by which they should be
shrunk in towards the true mean.
Residuals are by default the residuals from the package means, i.e.,
they are residuals from the fitted values at the highest level
available. To get fitted values and residuals at level 0, enter:
<>=
hat0.lme <- fitted(ant111b.lme, level=0)
res0.lme <- resid(ant111b.lme, level=0)
plot(res0.lme, ant111b$harvwt - hat0.lme) # Points lie on y=x
@
\section{Notes -- Other Forms of Complex Error Structure}
Time series are another important special case. A first step is,
often, to subtract off any trend, and base further analysis on
residuals about this trend. Observations that are close together
in time are typically more closely correlated than observations
that are widely separated in time.
The variances of the mean of $n$ observations with variance
$\sigma^2$ will, assuming that positive correlation between neighbouring
observations makes the major contribution to the correlation
structure, be greater than $\frac{\sigma^2}{n}$.
Here is a simple way to generate data that are sequentially correlated.
The autocorrelation plot shows how the estimated correlation changes as
observations move further apart.
<>=
y <- rnorm(200)
y1 <- y[-1] + 0.5*y[-length(y)]
acf(y1)
@
Of course the multiplier in \texttt{y1 <- y[-1] + 0.5*y[-1000]} can
be any number at all, and more complex correlation structures
can be generated by incorporating further lags of \texttt{y}.