\documentclass[a4paper,10pt]{book}
\SweaveOpts{engine=R, echo=t, quiet=t}
% \VignetteIndexEntry{chapter 7: Solutions to Exercises}
\usepackage{a4wide}
\usepackage{fancyvrb}
\usepackage{float}
\usepackage{exercises}
\begin{document}
\markright{\bf Chapter 7 Exercises}
\centerline{Data Analysis \& Graphics Using R, 2$^{nd}$ edn -- Solutions to Exercises (\today)}
<<echo=FALSE>>=
options(show.signif.stars=FALSE, digits=4)
options(SweaveHooks=list(fig=function(){par(mar=c(4.6,4.1,3.1,2.1),
                                            mgp=c(2.5,0.5,0), tck=-0.02)}))
@
\begin{fmpage}{32pc}
\noindent \textit{Preliminaries}
<<>>=
library(DAAG)
library(splines)
@
\end{fmpage}

\begin{fmpage}{32pc}
\exhead{1} Re-analyze the sugar weight data of Subsection 7.1.1 using
\verb!log(weight)! in place of \verb!weight!.
\end{fmpage}

From the scatterplot in Figure 7.1, it is clear that the treatment
variances are not constant. Perhaps a logarithmic transformation will
stabilize the variances.
<<>>=
sugarlog.aov <- aov(log(weight) ~ trt, data=sugar)
summary.lm(sugarlog.aov)
@
On the log scale, the differences from control remain discernible.
However, the plot should be compared with plots from random normal
data, and the comparison should be repeated several times. Occasional
samples will show changes in the variability of the residuals of the
extent observed for these data.
\begin{figure}[H]
\begin{center}
\setkeys{Gin}{width=0.75\textwidth}
<<fig=TRUE>>=
par(mfrow=c(2,3))
plot(sugarlog.aov, which=1)
title(main="Sugar data", line=1.75)
for(i in 1:5){
  plot(aov(rnorm(12) ~ sugar$trt), which=1)
  title(main="Random normal data", line=1.75)
}
par(mfrow=c(1,1))
@
\caption{Plot of residuals versus fitted values, for the log(sugar
weight) data.}
\end{center}
\end{figure}

\begin{fmpage}{32pc}
\exhead{3} Use the method of Section 7.3 to determine, formally,
whether there should be different regression lines for the two data
frames \verb!elastic1! and \verb!elastic2! from Exercise 1 in Section
5.11.
\end{fmpage}

It will be convenient to work with a single data frame:
<<>>=
elastic2$expt <- rep(2, length(elastic2$stretch))
elastic1$expt <- rep(1, length(elastic1$stretch))
elastic <- rbind(elastic1, elastic2)
elastic$expt <- factor(elastic$expt)
@
We fit three models as follows:
<<>>=
e.lm1 <- lm(distance ~ stretch, data=elastic)  # a single line
e.lm2 <- lm(distance ~ stretch + expt, data=elastic)  # two parallel lines
e.lm3 <- lm(distance ~ stretch + expt + stretch:expt,
            data=elastic)  # two separate lines
@
The following sequential analysis of variance table indicates that
there is mild evidence against the two lines having the same
intercept.
<<>>=
anova(e.lm1, e.lm2, e.lm3)
@
Recall, however, from Exercise 5.1, that observation 7 is an
influential outlier. Let's check what happens to the three models when
this observation is deleted.
<<>>=
e.lm1 <- lm(distance ~ stretch, data=elastic[-7, ])
e.lm2 <- lm(distance ~ stretch + expt, data=elastic[-7, ])
e.lm3 <- lm(distance ~ stretch + expt + stretch:expt, data=elastic[-7, ])
anova(e.lm1, e.lm2, e.lm3)
@
Now we see that there is very little evidence of a difference between
the two lines. Observation 7 seems different in character from the
other observations.

\begin{fmpage}{32pc}
\exhead{4} The data frame \verb!toycars! consists of 27 observations
on the distance (in meters) traveled by one of three different toy
cars on a smooth surface, starting from rest at the top of a
16-inch-long ramp tilted at varying angles (measured in degrees).
Because of differing frictional effects for the three different cars,
we seek three regression lines relating distance traveled to angle.
%\vspace*{-6.8pt}
\begin{enumerate}
\item As a first try, fit the model in which the three lines have the
same slope but have different intercepts.
\item Note the value of $R^2$ from the summary table. Examine the
diagnostic plots carefully. Is there an influential outlier? How
should it be treated?
\item The physics of the problem actually suggests that the three
lines should have the same intercept (very close to 0, in fact), and
possibly differing slopes, where the slopes are inversely related to
the coefficient of dynamic friction for each car. Fit the model, and
note that the value of $R^2$ is slightly lower than that for the
previously fitted model. Examine the diagnostic plots. What has
happened to the influential outlier? In fact, we have exhibited an
example where taking $R^2$ too seriously could be somewhat hazardous;
in this case, a more carefully thought out model can accommodate all
of the data satisfactorily. Maximizing $R^2$ does not necessarily give
the best model!
\end{enumerate}
%toycars.lm0 <- lm(distance ~ angle+factor(car), data=toycars)
%toycars.lm1 <- lm(distance ~ angle+angle:factor(car), data=toycars)
\end{fmpage}

<<>>=
toycars$car <- factor(toycars$car)  # car should be a factor
toycars.lm <- lm(distance ~ angle + car, data=toycars)
summary(toycars.lm)
@
From the diagnostics (below), we see that there is an influential
outlier. The model is not fitting all of the data satisfactorily.
\begin{figure}[H]
\begin{center}
\setkeys{Gin}{width=0.75\textwidth}
<<fig=TRUE>>=
par(mfrow=c(1,4))
plot(toycars.lm)
par(mfrow=c(1,1))
@
\caption{Diagnostic plots for \texttt{toycars.lm}.}
\end{center}
\end{figure}
To fit the model with a constant intercept and possibly differing
slopes, we proceed as follows:
<<>>=
toycars.lm2 <- lm(distance ~ angle + angle:car, data=toycars)
summary(toycars.lm2)
@
We can see from the diagnostics below that observation 17 is still
somewhat influential, but it is no longer an outlier. All of the data
are accommodated reasonably well by this new model.
\begin{figure}[H]
\begin{center}
\setkeys{Gin}{width=0.75\textwidth}
<<fig=TRUE>>=
par(mfrow=c(1,4))
plot(toycars.lm2)
par(mfrow=c(1,1))
@
\caption{Diagnostic plots for \texttt{toycars.lm2}.}
\end{center}
\end{figure}

\begin{fmpage}{32pc}
\exhead{5} The data frame \verb!cuckoos!
holds data on the lengths and breadths of eggs of cuckoos, found in
the nests of six different species of host birds. Fit models for the
regression of length on breadth that have:
\vspace*{-6.8pt}
\begin{itemize}
\item [A:] a single line for all six species.
\item [B:] different parallel lines for the different host species.
\item [C:] separate lines for the separate host species.
\end{itemize}
Use the \verb!anova()! function to print out the sequential analysis
of variance table. Which of the three models is preferred? Print out
the diagnostic plots for this model. Do they show anything worthy of
note? Examine the output coefficients from this model carefully, and
decide whether the results seem grouped by host species. How might the
results be summarized for reporting purposes?
\end{fmpage}

<<>>=
cuckoos.lm <- lm(length ~ breadth, data=cuckoos)  # one line
cuckoos.lm2 <- lm(length ~ breadth + species, data=cuckoos)  # parallel lines
cuckoos.lm3 <- lm(length ~ breadth + species + species:breadth,
                  data=cuckoos)  # different lines
anova(cuckoos.lm, cuckoos.lm2, cuckoos.lm3)
@
From the anova summary, we see that the second model is preferable.
The standard diagnostics are given below.
\begin{figure}[H]
\begin{center}
\setkeys{Gin}{width=0.75\textwidth}
<<fig=TRUE>>=
par(mfrow=c(1,4))
plot(cuckoos.lm2)
par(mfrow=c(1,1))
@
\caption{Diagnostic plots for \texttt{cuckoos.lm2}.}
\end{center}
\end{figure}
There is nothing on these plots that calls for special attention.
<<>>=
summary(cuckoos.lm2)
@
The baseline species is the hedge sparrow, and we see some groupings
among the host species. The relation between length and breadth of the
eggs is similar when the host species is the hedge sparrow, pied
wagtail or tree pipit. Even when the robin is the host species, there
is little evidence of a difference in the way in which length and
breadth are related. However, the linear relation between length and
breadth has a smaller intercept when the host species is either the
meadow pipit or the wren.
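The arithmetic behind such species-by-species intercepts may be worth
spelling out: under R's default treatment contrasts, each non-baseline
group's intercept is the baseline intercept plus that group's
coefficient. A minimal self-contained sketch of this, using simulated
data with hypothetical group names and values (not the cuckoos
measurements), is:

```r
# Sketch: recovering group-specific intercepts from a parallel-lines fit.
# All data here are simulated; groups "A", "B", "C" are hypothetical.
set.seed(1)
x <- runif(60, 20, 25)
grp <- factor(rep(c("A", "B", "C"), each = 20))
offsets <- rep(c(0, -1, 2), each = 20)   # true group shifts (hypothetical)
y <- 5 + offsets + 0.6 * x + rnorm(60, sd = 0.2)
fit <- lm(y ~ x + grp)   # common slope, one intercept offset per group
b <- coef(fit)
# Baseline group "A" has intercept b["(Intercept)"]; the other groups
# add their treatment-contrast coefficients, exactly as the species
# terms do in the cuckoos model:
intercepts <- unname(b["(Intercept)"]) +
  c(A = 0, B = unname(b["grpB"]), C = unname(b["grpC"]))
round(intercepts, 2)
```

Grouping species whose recovered intercepts are close is one way to
arrive at a compact summary for reporting.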
\begin{fmpage}{32pc}
\exhead{8} Apply spline regression to the \verb!geophones! data frame.
Specifically, regress thickness against distance, and check the fits
of 4-, 5- and 6-degree-of-freedom cases. Which case gives the best fit
to the data? How does this fitted curve compare with the polynomial
curves obtained in the previous exercise? Calculate pointwise
confidence bounds for the 5-degree-of-freedom case.
\end{fmpage}

We fit the 4-, 5-, and 6-degree-of-freedom spline models to the
geophones data as follows:
<<>>=
geo.spl4 <- lm(thickness ~ ns(distance, df=4), data=geophones)
geo.spl5 <- lm(thickness ~ ns(distance, df=5), data=geophones)
geo.spl6 <- lm(thickness ~ ns(distance, df=6), data=geophones)
@
The fitted curves are plotted thus:
<<>>=
plot(geophones)
lines(spline(geophones$distance, predict(geo.spl4)), col=1)
lines(spline(geophones$distance, predict(geo.spl5)), col=2, lty=2)
lines(spline(geophones$distance, predict(geo.spl6)), col=3, lty=4)
bottomleft <- par()$usr[c(1,3)]
legend(bottomleft[1], bottomleft[2], lty=c(1:2,4), col=1:3,
       legend=c("4 df", "5 df", "6 df"), xjust=0, yjust=0)
@
The 6-degree-of-freedom case gives the best fit to the data; it
captures some of the curvature at the large distance values, while
retaining smoothness in the other regions. The 5-degree-of-freedom
case is smoother than the quartic, while capturing a similar amount of
curvature in the large distance region.
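As a cross-check on this visual comparison, fits with differing
degrees of freedom can also be ranked by AIC, which trades goodness of
fit against model complexity. The sketch below illustrates the idea on
simulated data standing in for the geophones measurements (all values
hypothetical):

```r
# Sketch: ranking natural-spline fits of increasing df by AIC.
# Simulated data only; the true geophones data would be used in practice.
library(splines)
set.seed(42)
distance <- seq(10, 290, by = 5)
thickness <- 250 + 30 * sin(distance / 60) +
  rnorm(length(distance), sd = 5)
fits <- lapply(4:6, function(d) lm(thickness ~ ns(distance, df = d)))
aics <- sapply(fits, AIC)   # lower AIC = better fit/complexity trade-off
names(aics) <- paste0(4:6, " df")
aics
```

Because the 4, 5 and 6 df spline bases are not strictly nested, AIC is
a more defensible comparison here than a sequential F-test.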
The 95\% confidence bounds for the 5-degree-of-freedom case can be
obtained and plotted as follows:
<<>>=
plot(geophones, pty="s")
lines(spline(geophones$distance, predict(geo.spl5)), col=2)
lines(geophones$distance,
      predict(geo.spl5, interval="confidence")[, "lwr"], col=2)
lines(geophones$distance,
      predict(geo.spl5, interval="confidence")[, "upr"], col=2)
@
% \setkeys{Gin}{width=\textwidth}
\begin{figure}[H]
\begin{center}
\begin{minipage}[c]{0.325\linewidth}
<<fig=TRUE, echo=FALSE>>=
par(mar=c(4.1,3.6, 1.6,.6))
plot(geophones)
lines(spline(geophones$distance, predict(geo.spl4)), col=1)
lines(spline(geophones$distance, predict(geo.spl5)), col=2, lty=2)
lines(spline(geophones$distance, predict(geo.spl6)), col=3, lty=4)
bottomleft <- par()$usr[c(1,3)]
legend(bottomleft[1], bottomleft[2], lty=c(1:2,4), col=1:3,
       legend=c("4 df", "5 df", "6 df"), xjust=0, yjust=0)
mtext(side=3, line=0.5, "A", adj=0)
@
\end{minipage}
\hspace{.03\linewidth}
\begin{minipage}[c]{0.325\linewidth}
<<fig=TRUE, echo=FALSE>>=
par(mar=c(4.1,3.6, 1.6,.6))
plot(geophones, pty="s")
lines(spline(geophones$distance, predict(geo.spl5)), col=2)
lines(geophones$distance,
      predict(geo.spl5, interval="confidence")[, "lwr"], col=2)
lines(geophones$distance,
      predict(geo.spl5, interval="confidence")[, "upr"], col=2)
mtext(side=3, line=0.5, "B", adj=0)
@
\end{minipage}
\hspace{.03\linewidth}
\begin{minipage}[c]{0.24\linewidth}
\caption{Panel A shows the 4, 5 and 6 df spline curves fitted to the
geophones data. Panel B shows confidence bounds for expected
thickness, for the 5 df fit.}
\end{minipage}
\end{center}
\end{figure}

\begin{fmpage}{32pc}
\exhead{10} Check the diagnostic plots for the results of exercise 8
for the 5-degree-of-freedom case. Are there any influential outliers?
\end{fmpage}

The standard diagnostics for the 5-degree-of-freedom spline model fit
to the geophones data can be plotted using
\begin{figure}[H]
\begin{center}
\setkeys{Gin}{width=0.75\textwidth}
<<fig=TRUE>>=
par(mfrow=c(1,4))
plot(geo.spl5)
par(mfrow=c(1,1))
@
\caption{Diagnostic plots for the 5 df spline model.}
\end{center}
\end{figure}
There are no extreme outliers. Observation 19 is a mild outlier which
exerts moderate influence; this should not be of major concern. The
plot of the residuals versus the fitted values does, however, indicate
that some of the nonlinearity has not been satisfactorily modeled.

\begin{fmpage}{32pc}
\exhead{11} Continuing to refer to exercise 8, obtain plots of the
spline basis curves for the 5-degree-of-freedom case. That is, plot
the relevant columns of the model matrix against distance.
\end{fmpage}

The first basis function is a constant, which supplies the intercept
in the model. (Note that this implies that there are actually 6
degrees of freedom in the model.) The remaining basis functions are
plotted as follows:
\begin{figure}[H]
\begin{center}
\begin{minipage}[c]{0.32\linewidth}
<<fig=TRUE>>=
X5 <- model.matrix(geo.spl5)
plot(X5[,2] ~ geophones$distance, type="l")
lines(X5[,3] ~ geophones$distance, col=3)
lines(X5[,4] ~ geophones$distance, col=4)
lines(X5[,5] ~ geophones$distance, col=5)
lines(X5[,6] ~ geophones$distance, col=6)
@
\end{minipage}\hspace{.05\linewidth}
\begin{minipage}[c]{0.34\linewidth}
\caption{Spline basis functions.}
\end{minipage}
\end{center}
\end{figure}
We could also use \verb+matplot()+ for this problem:
\begin{verbatim}
matplot(geophones$distance, X5[,-1], type="l")
\end{verbatim}

\begin{fmpage}{32pc}
\exhead{13} The \verb!ozone! data frame holds data, for nine months
only, on ozone levels at the Halley Bay station between 1956 and 2000.
(See Christie (2000) and Shanklin (2001) for the scientific
background. Up-to-date data are available from the web page
%\newline
\texttt{http://www.nerc-bas.ac.uk/public/icd/jds/ozone/}.)
Replace zeros by missing values. Determine, for each month, the number
of missing values. Plot the October levels against Year, and fit a
smooth curve. At what point does there seem to be clear evidence of a
decline? Plot the data for other months also. Do other months show a
similar pattern of decline?
\end{fmpage}

A simple way to replace zeros by missing value codes is the following:
<<>>=
names(ozone)
Ozone <- ozone
for (i in 2:11){
  Ozone[ozone[, i] == 0, i] <- NA
}
@
One way to count the monthly numbers of missing values is the
following:
<<>>=
sapply(Ozone[, -c(1,11)], function(x) sum(is.na(x)))
@
A plot of the October ozone levels against Year can be obtained as
follows:
\setkeys{Gin}{width=\textwidth}
\begin{figure}[H]
\begin{center}
\begin{minipage}[c]{0.45\linewidth}
<<fig=TRUE>>=
attach(Ozone, warn.conflicts=FALSE)
plot(Oct ~ Year, ylab="October ozone levels (DU)", pch=16)
lines(lowess(Year, Oct, f=.35))
detach(Ozone)
@
\end{minipage}\hspace{.1\linewidth}
\begin{minipage}[c]{0.3\linewidth}
\caption{Lowess curve fitted to the October ozone data.}
\end{minipage}
\end{center}
\end{figure}
We see that the ozone level decreases throughout the period, with an
acceleration in the decline from the mid- to late 1970s.

To plot the data for the other months, we can do the following:
\begin{figure}[H]
\begin{center}
\begin{minipage}[c]{0.65\linewidth}
<<fig=TRUE>>=
ozone.stk <- stack(Ozone, select=2:10)
ozone.stk$Year <- rep(seq(1956, 2000), 9)
names(ozone.stk) <- c("Ozone", "Month", "Year")
ozone.stk$Month <- factor(ozone.stk$Month,
                          levels=c("Aug", "Sep", "Oct", "Nov", "Dec",
                                   "Jan", "Feb", "Mar", "Apr"))
library(lattice)
print(xyplot(Ozone ~ Year | Month, data=ozone.stk))
@
\end{minipage}\hspace{.075\linewidth}
\begin{minipage}[c]{0.25\linewidth}
\caption{Change in ozone levels over time, by month.}
\end{minipage}
\end{center}
\end{figure}
Similar declines, though less steep in some cases, are evident in
several of the other months.
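The reshaping step above relies on \verb!stack()! running down each
selected column in turn, which is why \verb!rep(seq(1956,2000), 9)!
lines the years up correctly. A toy illustration with a small
hypothetical data frame (names and values invented) makes the
mechanism explicit:

```r
# Sketch: how stack() reshapes wide monthly columns into long form.
# Toy data frame with hypothetical values, mirroring the Ozone reshaping.
wide <- data.frame(Year = 2001:2003,
                   Jan = c(310, 305, 300),
                   Feb = c(320, 318, 311))
long <- stack(wide, select = c("Jan", "Feb"))
# stack() concatenates the Jan column, then the Feb column, so the
# Year column recycles in the same order:
long$Year <- rep(wide$Year, 2)
names(long) <- c("Ozone", "Month", "Year")
long
```

The same column-major ordering is what makes the Month factor levels
and the lattice panels come out in the intended sequence.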
\begin{fmpage}{32pc}
\exhead{15*} Compare the two results
\fvset{xleftmargin=0pt}
\begin{verbcode}
seedrates.lm <- lm(grain ~ rate + I(rate^2), data=seedrates)
seedrates.pol <- lm(grain ~ poly(rate,2), data=seedrates)
\end{verbcode}
Check that the fitted values and residuals from the two calculations
are the same, and that the $t$-statistic and $p$-value are the same
for the final coefficient, i.e., the same for the coefficient labeled
\verb!poly(rate, 2)2! in the polynomial regression as for the
coefficient labeled \verb!I(rate^2)! in the regression on \verb!rate!
and \verb!rate^2!.

\hspace*{10pt}Regress the second column of
\verb!model.matrix(seedrates.pol)! on \verb!rate! and
\verb!I(rate^2)!, and similarly for the third column of
\verb!model.matrix(seedrates.pol)!. Hence express the first and second
orthogonal polynomial terms as functions of \verb!rate! and
\verb!rate^2!.
\end{fmpage}

The following shows that the fitted values and residuals are the same
for the two calculations. The $t$-statistic and $p$-value are also the
same for the final coefficient.
<<>>=
seedrates.lm <- lm(grain ~ rate + I(rate^2), data=seedrates)
seedrates.pol <- lm(grain ~ poly(rate, 2), data=seedrates)
fitted(seedrates.lm) - fitted(seedrates.pol)
resid(seedrates.lm) - resid(seedrates.pol)
summary(seedrates.lm)
summary(seedrates.pol)
@
From the following output, we can infer that the first orthogonal
polynomial is $p_1(x) = -1.265 + 0.01265x$ and the second orthogonal
polynomial is $p_2(x) = 3.742 - 0.08552x + 0.0004276x^2$.
<<>>=
attach(seedrates)
y <- model.matrix(seedrates.pol)[, 2]
y.lm <- lm(y ~ rate + I(rate^2))
coef(y.lm)
y <- model.matrix(seedrates.pol)[, 3]
y.lm <- lm(y ~ rate + I(rate^2))
coef(y.lm)
@
Among other things, the polynomials given above have the orthogonality
property
$p_1(50)p_2(50) + p_1(75)p_2(75) + p_1(100)p_2(100) +
p_1(125)p_2(125) + p_1(150)p_2(150) = 0$,
since the values of the predictor are:
<<>>=
rate
detach(seedrates)
@
\end{document}