\part{Further Practice with R}
\section{Information about the Columns of Data Frames}
\begin{fmpage}{36pc}
\exhead{1}
Try the following:
<<>>=
class(2)
class("a")
library(MASS)           # cabbages is in the MASS package
class(cabbages$HeadWt)
class(cabbages$Cult)
@ %
Now do \texttt{sapply(cabbages, class)}, and note which columns hold
numerical data. Extract those columns into a separate data frame,
perhaps named \texttt{numcabbages}.\newline
[Hint: \texttt{cabbages[, c(2,3)]} is not the correct answer, but it is,
after a manner of speaking, close!]
\end{fmpage}
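
The column selection asked for in Exercise 1 can be driven by a logical
vector rather than by typed column numbers. The following is a minimal
sketch only. It assumes the built-in data frame \texttt{iris} as a
stand-in (it is not one of the data sets used above), and the names
\texttt{isnum} and \texttt{numiris} are illustrative.
<<>>=
## Illustrative sketch: iris (datasets package) stands in for the exercise data
isnum <- sapply(iris, is.numeric)   # TRUE for each numeric column
isnum
numiris <- iris[, isnum]            # keep only the numeric columns
str(numiris)
@ %
Because the selection is computed from the data, the same two lines
should carry over unchanged to any other data frame.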
\vspace*{3pt}
\begin{fmpage}{36pc}
\exhead{2} Functions that may be used to get information about data
frames include \texttt{str()}, \texttt{dim()}, \texttt{row.names()}
and \texttt{names()}. Try each of these functions with the data
frames \texttt{allbacks}, \texttt{ant111b} and \texttt{tinting}
(all in \textit{DAAG}).
For getting information about each column of a data frame, use
\texttt{sapply()}. For example, the following applies the function
\texttt{class()} to each column of the data frame \texttt{ant111b}.
<<>>=
library(DAAG)
sapply(ant111b, class)
@ %
For columns in the data frame \texttt{tinting} that are factors, use
\texttt{table()} to tabulate the number of values for each level.
\end{fmpage}
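
One way to tabulate every factor column in a single step is to combine
\texttt{sapply()} with \texttt{lapply()}. The sketch below assumes the
built-in data frame \texttt{warpbreaks} as a stand-in for
\texttt{tinting}; \texttt{isfac} and \texttt{tabs} are illustrative
names.
<<>>=
## Illustrative sketch: warpbreaks (datasets package) stands in for tinting
isfac <- sapply(warpbreaks, is.factor)       # which columns are factors?
tabs <- lapply(warpbreaks[, isfac], table)   # one table per factor column
tabs
@ %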
\vspace*{6pt}
\section{Tabulation Exercises}
\begin{fmpage}{36pc}
\exhead{3}
In the data set \texttt{nswpsid1} (\textit{DAAGxtras}),
create a factor that categorizes subjects as: (i) black; (ii)
hispanic; (iii) neither black nor hispanic. You can do this as
follows:
<<3-way>>=
library(DAAGxtras)  # provides nswpsid1
gps <- with(nswpsid1, 1 + black + hisp*2)
table(gps)  # Check that there are no 4s, i.e., both black and hispanic!
grouping <- factor(c("other", "black", "hisp")[gps])
table(grouping)
@ %
\end{fmpage}
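
The indexing step in Exercise 3 is perhaps easier to follow on a toy
example. The sketch below assumes two made-up 0/1 indicator vectors for
five hypothetical subjects; the values are not taken from
\texttt{nswpsid1}.
<<>>=
## Made-up indicators for five hypothetical subjects
black <- c(0, 1, 0, 0, 1)
hisp  <- c(0, 0, 1, 0, 0)
gps <- 1 + black + hisp*2     # 1 = neither, 2 = black, 3 = hisp
gps
## Each code selects the label in the matching position
grouping <- factor(c("other", "black", "hisp")[gps])
table(grouping)
@ %
A subject with both indicators equal to 1 would get the code 4, for
which there is no label, and so would appear as \texttt{NA}.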
\vspace*{3pt}
\begin{fmpage}{36pc}
\exhead{4}
Tabulate the number of observations in each of the different districts
in the data frame \texttt{rockArt} (\textit{DAAGxtras}). Create a
factor \texttt{groupDis} in which all \texttt{District}s with fewer than
5 observations are grouped together into the category \texttt{other}.
\end{fmpage}
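<<>>=
library(DAAGxtras)
groupDis <- as.character(rockArt$District)
tab <- table(rockArt$District)   # number of observations per district
## TRUE for rows whose district has 4 or fewer observations
le4 <- rockArt$District %in% names(tab)[tab <= 4]
groupDis[le4] <- "other"
groupDis <- factor(groupDis)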
@ %
\vspace*{6pt}
\section{Data Exploration -- Distributions of Data Values}
\begin{fmpage}{36pc}
\exhead{5}
The data frame \texttt{rainforest} (\textit{DAAG}
package) has data on four different rainforest species. Use
\verb!table(rainforest$species)! to check the names and numbers of
the species present. In the sequel, attention will be limited to
the species \textit{Acmena smithii}. The following plots a histogram
showing the distribution of the diameter at breast height (\texttt{dbh}):
<<>>=
library(DAAG) # The data frame rainforest is from DAAG
Acmena <- subset(rainforest, species=="Acmena smithii")
hist(Acmena$dbh)
@ %
Above, frequencies were used to label the
vertical axis (this is the default). An alternative is to use a
density scale (\texttt{prob=TRUE}). The histogram is then interpreted as
a crude density plot. The density, which estimates the proportion of
values per unit interval, changes in discrete jumps at the
breakpoints (= class boundaries). The histogram can then be
directly overlaid with a density plot, thus:
<<>>=
hist(Acmena$dbh, prob=TRUE, xlim=c(0,50)) # Use a density scale
lines(density(Acmena$dbh, from=0))
@ %
Why use the argument \texttt{from=0}? What is the effect of omitting it?
\vspace{3pt}
[Density estimates, as given by R's function \texttt{density()},
change smoothly and do not depend on an arbitrary choice of
breakpoints, making them generally preferable to histograms. They do
sometimes require tuning to give a sensible result. Note especially
the parameter \texttt{bw}, which determines how the bandwidth is
chosen, and hence affects the smoothness of the density estimate.]
\end{fmpage}
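
The effect of \texttt{from}, and of the bandwidth settings, can be
checked numerically as well as graphically. The following sketch assumes
simulated positive values in place of the tree diameters, so the numbers
are illustrative only. It uses \texttt{adjust}, an argument of
\texttt{density()} that multiplies the bandwidth, and the
\texttt{bw="SJ"} selector as one alternative to the default rule.
<<>>=
set.seed(41)                  # arbitrary seed, for reproducibility only
x <- rexp(100)                # made-up positive values, not real data
d0 <- density(x)              # default: the curve extends beyond the data
d1 <- density(x, from=0)      # estimate evaluated from 0 upwards only
range(d0$x)
range(d1$x)
d2 <- density(x, adjust=2)    # twice the default bandwidth: smoother curve
dSJ <- density(x, bw="SJ")    # Sheather-Jones bandwidth selector
c(default=d0$bw, doubled=d2$bw, SJ=dSJ$bw)
plot(d0, main="")
lines(d2, lty=2)
lines(dSJ, lty=3)
@ %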
\vspace*{3pt}
\section{The \texttt{paste()} Function}
\begin{fmpage}{36pc}
\exhead{6}
Here are examples that illustrate the use of \texttt{paste()}:
<<>>=
paste("Leo", "the", "lion")
paste("a", "b")
paste("a", "b", sep="")
@ %
\end{fmpage}
\begin{fmpage}{36pc}
\exhead{6, continued}
\vspace*{-15pt}
<<>>=
paste(1:5)
paste("a", 1:5)
paste("a", 1:5, sep="")
paste(1:5, collapse="")
paste(letters[1:5], collapse="")
## possumsites is from the DAAG package
with(possumsites, paste(row.names(possumsites), " (", altitude, ")", sep=""))
@ %
What are the respective effects of the parameters \texttt{sep} and
\texttt{collapse}?
\end{fmpage}
\vspace*{3pt}
\section{Random Samples}
\begin{fmpage}{36pc}
\exhead{7}
By taking repeated random samples from the normal distribution, and
plotting the distribution for each such sample, one can get an idea
of the effect of sampling variation on the sample distribution.
A random sample of 100 values from a normal distribution (with mean 0
and standard deviation 1) can be obtained, and a histogram and overlaid
density plot shown, thus:
<<>>=
y <- rnorm(100)
hist(y, probability=TRUE) # probability=TRUE gives a y density scale
lines(density(y))
@ %
Repeat this several times.
Then, in place of the single sample of 100 values:
\begin{itemize}
\item[(a)] Take 5 samples of size 25, and show the plots.
\item[(b)] Take 5 samples of size 100, and show the plots.
\item[(c)] Take 5 samples of size 500, and show the plots.
\item[(d)] Take 5 samples of size 2000, and show the plots.
\end{itemize}
(Hint: By preceding the plots with \texttt{par(mfrow=c(4,5))},
all 20 plots can be displayed on the one graphics page. To bunch
the graphs up more closely, make the further settings
\texttt{par(mar=c(3.1,3.1,0.6,0.6), mgp=c(2.25,0.5,0))}.)
\vspace*{3pt}
Comment on the usefulness of a sample histogram and/or density plot
for judging whether the population distribution is likely to be close
to normal.
\end{fmpage}
\vspace*{4pt}
Histograms and density plots are, for ``small'' samples, notoriously
variable under repeated sampling. This is true even for sample sizes as
large as 50 or 100.
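
The 20 plots requested in Exercise 7 can be generated with a pair of
nested loops. This is one possible sketch, not the only sensible layout;
the names \texttt{sizes}, \texttt{n}, \texttt{i} and \texttt{opar} are
illustrative.
<<>>=
## 4 rows (one per sample size) by 5 columns (one per sample)
opar <- par(mfrow=c(4,5), mar=c(3.1,3.1,0.6,0.6), mgp=c(2.25,0.5,0))
sizes <- c(25, 100, 500, 2000)
for (n in sizes) {
  for (i in 1:5) {
    y <- rnorm(n)                       # a fresh sample of size n
    hist(y, probability=TRUE, main="")  # density scale on the vertical axis
    lines(density(y))
  }
}
par(opar)                               # restore the previous settings
@ %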
\vspace*{3pt}
\begin{fmpage}{36pc}
\exhead{8} This explores the function \texttt{sample()}, used to
take a sample of values that are stored or enumerated in a
vector. Samples may be with or without replacement; specify
\texttt{replace = FALSE} (the default) or \texttt{replace =
TRUE}. The parameter \texttt{size} determines the size of the
sample. By default the sample has the same size (length) as the
vector from which samples are taken. Take several samples of size 5
from the vector \texttt{1:5}, with \texttt{replace=FALSE}. Then
repeat the exercise, this time with \texttt{replace=TRUE}. Note how
the two sets of samples differ.
\end{fmpage}
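
The calls needed for Exercise 8 are short. The sketch below shows their
general form; the seed is an arbitrary choice, made only so that the
results can be reproduced.
<<>>=
set.seed(17)                  # arbitrary seed, for reproducibility only
sample(1:5)                   # a permutation: each value appears exactly once
sample(1:5, replace=TRUE)     # values may repeat, and others may be missing
sample(1:5, size=3)           # a sample smaller than the vector
@ %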
\vspace*{8pt}
\begin{fmpage}{36pc}
\exhead{9$^*$}
If in Exercise 5 above a new random sample of trees could be taken,
the histogram and density plot would change. How much might we
expect them to change?
\vspace*{3pt}
The bootstrap approach treats the one available sample as a microcosm
of the population. Repeated with-replacement samples are taken from
the one available sample. This is equivalent to repeating each sample
value an infinite number of times, then taking random samples from the
population that is thus created. The expectation is that variation between
those samples will be comparable to variation between samples from the
original population.
\begin{itemize}
\item[(a)] Take repeated (5 or more) bootstrap samples from the Acmena
dataset of Exercise 5, and show the density plots. [Use
\verb!sample(Acmena$dbh, replace=TRUE)!].
\item[(b)] Repeat, now with the \texttt{cerealsugar} data from \textit{DAAG}.
\end{itemize}
\end{fmpage}
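
A bootstrap comparison of the kind described in Exercise 9 might be set
up as below. The sketch assumes simulated positive values in place of
the tree diameters, so that it runs without any package; the names
\texttt{y} and \texttt{ystar} are illustrative.
<<>>=
set.seed(29)                  # arbitrary seed, for reproducibility only
y <- rexp(50, rate=0.1)       # made-up positive values, not real data
plot(density(y, from=0), main="", lwd=2)   # the one available sample
for (i in 1:5) {
  ystar <- sample(y, replace=TRUE)         # bootstrap sample, same size as y
  lines(density(ystar, from=0), col="gray")
}
@ %
The spread of the gray curves around the heavy curve gives a rough
impression of how far a density estimate might move under resampling.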
\vspace*{6pt}
\section{{\Large\textbf{*}}Further Practice with Data Input}
One option is to experiment with using the R Commander GUI
to input the data used in the exercise that follows.
\begin{fmpage}{36pc}
\exhead{10\textbf{*}}
With a live internet connection, files can be read directly from a
web page. Here is an example:
<<>>=
webfolder <- "http://www.maths.anu.edu.au/~johnm/datasets/text/"
webpage <- paste(webfolder, "molclock.txt", sep="")
molclock <- read.table(url(webpage))
@ %
Use this approach to input the file \textbf{travelbooks.txt}, which
is available from the same web folder.
\end{fmpage}
\vspace*{6pt}