\part{Further Practice with R}

\section{Information about the Columns of Data Frames}

\begin{fmpage}{36pc}
\exhead{1}
Try the following:
<<>>=
class(2)
class("a")
library(MASS)            # cabbages is in the MASS package
class(cabbages$HeadWt)
class(cabbages$Cult)
@ %
Now do \texttt{sapply(cabbages, class)}, and note which columns hold
numerical data. Extract those columns into a separate data frame,
perhaps named \texttt{numtinting}.\newline
[Hint: \texttt{cabbages[, c(2,3)]} is not the correct answer, but it
is, after a manner of speaking, close!]
\end{fmpage}

\vspace*{3pt}

\begin{fmpage}{36pc}
\exhead{2}
Functions that may be used to get information about data frames
include \texttt{str()}, \texttt{dim()}, \texttt{row.names()} and
\texttt{names()}. Try each of these functions with the data frames
\texttt{allbacks}, \texttt{ant111b} and \texttt{tinting} (all in
\textit{DAAG}).

For getting information about each column of a data frame, use
\texttt{sapply()}. For example, the following applies the function
\texttt{class()} to each column of the data frame \texttt{ant111b}.
<<>>=
library(DAAG)
sapply(ant111b, class)
@ %
For columns in the data frame \texttt{tinting} that are factors, use
\texttt{table()} to tabulate the number of values for each level.
\end{fmpage}

\vspace*{6pt}

\section{Tabulation Exercises}

\begin{fmpage}{36pc}
\exhead{3}
In the data set \texttt{nswpsid1} (\textit{DAAGxtras}) create a factor
that categorizes subjects as: (i) black; (ii) hispanic; (iii) neither
black nor hispanic. You can do this as follows:
<<3-way>>=
gps <- with(nswpsid1, 1 + black + hisp*2)
table(gps)   # Check that there are no 3s, i.e., black and hispanic!
grouping <- c("other", "black", "hisp")[gps]
table(grouping)
@ %
\end{fmpage}

\vspace*{3pt}

\begin{fmpage}{36pc}
\exhead{4}
Tabulate the number of observations in each of the different districts
in the data frame \texttt{rockArt} (\textit{DAAGxtras}).
Create a factor \texttt{groupDis} in which all \texttt{District}s with
fewer than 5 observations are grouped together into the category
\texttt{other}.
\end{fmpage}

<<>>=
library(DAAGxtras)
groupDis <- as.character(rockArt$District)
tab <- table(rockArt$District)
le4 <- rockArt$District %in% names(tab)[tab <= 4]
groupDis[le4] <- "other"
groupDis <- factor(groupDis)
@ %

\vspace*{6pt}

\section{Data Exploration -- Distributions of Data Values}

\begin{fmpage}{36pc}
\exhead{5}
The data frame \texttt{rainforest} (\textit{DAAG} package) has data on
four different rainforest species. Use
\verb!table(rainforest$species)! to check the names and numbers of the
species present. In the sequel, attention will be limited to the
species \textit{Acmena smithii}. The following plots a histogram
showing the distribution of the diameter at breast height:
<<>>=
library(DAAG)     # The data frame rainforest is from DAAG
Acmena <- subset(rainforest, species=="Acmena smithii")
hist(Acmena$dbh)
@ %
Above, frequencies were used to label the vertical axis (this is the
default). An alternative is to use a density scale
(\texttt{prob=TRUE}). The histogram is then interpreted as a crude
density plot. The density, which estimates the proportion of values
per unit interval, changes in discrete jumps at the breakpoints
(= class boundaries). The histogram can then be directly overlaid with
a density plot, thus:
<<>>=
hist(Acmena$dbh, prob=TRUE, xlim=c(0,50))   # Use a density scale
lines(density(Acmena$dbh, from=0))
@ %
Why use the argument \texttt{from=0}? What is the effect of omitting
it?

\vspace{3pt}
[Density estimates, as given by R's function \texttt{density()},
change smoothly and do not depend on an arbitrary choice of
breakpoints, making them generally preferable to histograms. They do
sometimes require tuning to give a sensible result. Note especially
the parameter \texttt{bw}, which determines how the bandwidth is
chosen, and hence affects the smoothness of the density estimate.]
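
\vspace{3pt}
As a minimal sketch of how the bandwidth choice affects a density
estimate, the following compares three settings. The vector \texttt{x}
is simulated, a hypothetical stand-in for \verb!Acmena$dbh!, so that
the code runs without \textit{DAAG}; the seed and the gamma parameters
are arbitrary assumptions, not values taken from the data.
<<>>=
set.seed(1)                          # arbitrary seed, for reproducibility
x <- rgamma(26, shape=2, scale=8)    # hypothetical stand-in for Acmena$dbh
d_default <- density(x, from=0)            # default rule, bw="nrd0"
d_sj <- density(x, bw="SJ", from=0)        # Sheather-Jones selector
d_wide <- density(x, adjust=2, from=0)     # twice the default bandwidth
## The bandwidths that were actually used
c(default=d_default$bw, SJ=d_sj$bw, wide=d_wide$bw)
plot(d_default, main="")
lines(d_sj, lty=2)
lines(d_wide, lty=3)
@ %
The component \texttt{bw} of the object that \texttt{density()}
returns holds the bandwidth that was used. A larger bandwidth gives a
smoother curve.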
\end{fmpage}

\vspace*{3pt}

\section{The \texttt{paste()} Function}

\begin{fmpage}{36pc}
\exhead{6}
Here are examples that illustrate the use of \texttt{paste()}:
<<>>=
paste("Leo", "the", "lion")
paste("a", "b")
paste("a", "b", sep="")
@ %
\end{fmpage}

\begin{fmpage}{36pc}
\exhead{6, continued}
\vspace*{-15pt}
<<>>=
paste(1:5)
paste("a", 1:5)
paste("a", 1:5, sep="")
paste(1:5, collapse="")
paste(letters[1:5], collapse="")
## possumsites is from the DAAG package
with(possumsites, paste(row.names(possumsites), " (", altitude, ")",
                        sep=""))
@ %
What are the respective effects of the parameters \texttt{sep} and
\texttt{collapse}?
\end{fmpage}

\vspace*{3pt}

\section{Random Samples}

\begin{fmpage}{36pc}
\exhead{7}
By taking repeated random samples from the normal distribution, and
plotting the distribution for each such sample, one can get an idea of
the effect of sampling variation on the sample distribution. A random
sample of 100 values from a normal distribution (with mean 0 and
standard deviation 1) can be obtained, and a histogram and overlaid
density plot shown, thus:
<<>>=
y <- rnorm(100)
hist(y, probability=TRUE)   # probability=TRUE gives a y density scale
lines(density(y))
@ %
Repeat several times.

In place of the 100 sample values:
\begin{itemize}
\item[(a)] Take 5 samples of size 25, and show the plots.
\item[(b)] Take 5 samples of size 100, and show the plots.
\item[(c)] Take 5 samples of size 500, and show the plots.
\item[(d)] Take 5 samples of size 2000, and show the plots.
\end{itemize}
(Hint: By preceding the plots with \texttt{par(mfrow=c(4,5))}, all 20
plots can be displayed on the one graphics page. To bunch the graphs
up more closely, make the further settings
\texttt{par(mar=c(3.1,3.1,0.6,0.6), mgp=c(2.25,0.5,0))}.)

\vspace*{3pt}
Comment on the usefulness of a sample histogram and/or density plot
for judging whether the population distribution is likely to be close
to normal.
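
\vspace*{3pt}
The layout described in the hint can be sketched as follows. This is
one possible arrangement, not the only one: each row of the page holds
the five samples for one sample size. The seed and the axis labels are
illustrative assumptions.
<<>>=
set.seed(17)                      # arbitrary seed, for reproducibility only
sizes <- c(25, 100, 500, 2000)    # one row of plots per sample size
opar <- par(mfrow=c(4,5), mar=c(3.1,3.1,0.6,0.6), mgp=c(2.25,0.5,0))
for (n in sizes) {
  for (i in 1:5) {
    y <- rnorm(n)                 # a fresh sample of size n
    hist(y, probability=TRUE, main="", xlab=paste("n =", n))
    lines(density(y))
  }
}
par(opar)                         # restore the previous graphics settings
@ %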
\end{fmpage}

\vspace*{4pt}
Histograms and density plots are, for ``small'' samples, notoriously
variable under repeated sampling. This is true even for sample sizes
as large as 50 or 100.

\vspace*{3pt}

\begin{fmpage}{36pc}
\exhead{8}
This explores the function \texttt{sample()}, used to take a sample of
values that are stored or enumerated in a vector. Samples may be with
or without replacement; specify \texttt{replace = FALSE} (the default)
or \texttt{replace = TRUE}. The parameter \texttt{size} determines the
size of the sample. By default the sample has the same size (length)
as the vector from which samples are taken. Take several samples of
size 5 from the vector \texttt{1:5}, with \texttt{replace=FALSE}. Then
repeat the exercise, this time with \texttt{replace=TRUE}. Note how
the two sets of samples differ.
\end{fmpage}

\vspace*{8pt}

\begin{fmpage}{36pc}
\exhead{9$^*$}
If in Exercise 5 above a new random sample of trees could be taken,
the histogram and density plot would change. How much might we expect
them to change?

\vspace*{3pt}
The bootstrap approach treats the one available sample as a microcosm
of the population. Repeated with-replacement samples are taken from
the one available sample. This is equivalent to repeating each sample
value an infinite number of times, then taking random samples from the
population that is thus created. The expectation is that variation
between those samples will be comparable to variation between samples
from the original population.
\begin{itemize}
\item[(a)] Take repeated (5 or more) bootstrap samples from the Acmena
dataset of Exercise 5, and show the density plots. [Use
\verb!sample(Acmena$dbh, replace=TRUE)!].
\item[(b)] Repeat, now with the \texttt{cerealsugar} data from
\textit{DAAG}.
\end{itemize}
\end{fmpage}

\vspace*{6pt}

\section{{\Large\textbf{*}}Further Practice with Data Input}

One option is to experiment with using the R Commander GUI to input
these data.
\begin{fmpage}{36pc}
\exhead{10\textbf{*}}
With a live internet connection, files can be read directly from a web
page. Here is an example:
<<>>=
webfolder <- "http://www.maths.anu.edu.au/~johnm/datasets/text/"
webpage <- paste(webfolder, "molclock.txt", sep="")
molclock <- read.table(url(webpage))
@ %
With a live internet connection available, use this approach to input
the file \textbf{travelbooks.txt} that is available from this same web
page.
\end{fmpage}

\vspace*{6pt}
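
The same pattern of calls can be tried without a network connection.
The sketch below, which assumes a Unix-like file path, writes a small
made-up data frame (\texttt{toy}, with hypothetical values and names)
to a temporary file, then reads it back through \texttt{url()} using a
\texttt{file://} address.
<<>>=
## toy is a hypothetical data frame, used only for illustration
toy <- data.frame(weight=c(250.5, 700.25, 1075.5),
                  volume=c(412.5, 953.75, 1034.5),
                  row.names=c("b1", "b2", "b3"))
tmp <- tempfile(fileext=".txt")
write.table(toy, file=tmp)                   # row names are written by default
localurl <- paste("file://", tmp, sep="")    # assumes a path that starts with "/"
back <- read.table(url(localurl))            # no header= argument is given
all.equal(toy, back)
unlink(tmp)                                  # remove the temporary file
@ %
Because \texttt{write.table()} writes row names by default, the first
line of the file has one fewer field than the lines that follow, and
\texttt{read.table()} then treats that first line as a header. For a
file laid out differently, \texttt{header=TRUE} may need to be given
explicitly.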