<>=
options(SweaveHooks=list(fig=function(){par(cex.main=1.1,
mar=c(4.1,4.1,2.6,2.1), pty="s",
mgp=c(2.25,0.5,0), tck=-0.02)}))
@%
\part{Populations \& Samples -- Theoretical \& Empirical Distributions}
R functions that will be used in this laboratory include:
\begin{enumerate}
\item \texttt{dnorm()}: Obtain the density values for the theoretical
normal distribution;
\item \texttt{pnorm()}: Given a normal deviate or deviates, obtain the
cumulative probability;
\item \texttt{qnorm()}: Given the cumulative probabilty. calculate the
normal deviate;
\item \texttt{sample()}: take a sample from a vector of values.
Values may be taken without replacement (once taken from the vector, the
value is not available for subsequent draws), or with replacement (values
may be repeated);
\item \texttt{density()}: fit an empirical density curve to a set of values;
\item \texttt{rnorm()}: Take a random sample from a theoretical normal
distribution;
\item \texttt{runif()}: similar to \texttt{rnorm()}, but sampling is
from a uniform distribution;
\item \texttt{rt()}: similar to \texttt{rnorm()}, but sampling is
from a $t$-distribution (the degrees of freedom must be given as the second
parameter);
\item \texttt{rexp()}: similar to \texttt{rnorm()}, but sampling is
from an exponential distribution;
\item \texttt{qqnorm()}: Compare the empirical distribution of a set of
values with the empirical normal distribution.
\end{enumerate}
\section{Populations and Theoretical Distributions}
\begin{fmpage}{36pc}
\exhead{1}
\vspace*{-15pt}
\begin{enumerate}
\item Plot the density and the cumulative probability
curve for a normal distribution with a mean of 2.5 and SD = 1.5.\newline
Code that will plot the curve is
<>=
curve(dnorm(x, mean=2.5, sd=1.5), from = 2.5-3*1.5, to = 2.5+3*1.5)
curve(pnorm(x, mean=2.5, sd=1.5), from = 2.5-3*1.5, to = 2.5+3*1.5)
@
\item From the cumulative probability curve in (a), read off the area
under the density curve between \texttt{x=0.5} and \texttt{x=4}.
Check your answer by doing the calculation
<>=
pnorm(4, mean=2.5, sd=1.5) - pnorm(0.5, mean=2.5, sd=1.5)
@ %
\end{enumerate}
\end{fmpage}
\begin{fmpage}{36pc}
\exhead{1, continued}
\vspace*{-15pt}
\begin{enumerate}
\item The density for the distribution in items (i) and (ii), given by
\texttt{dnorm(x, 2.5, 1.5)}, gives the relative number of observations
per unit interval that can be expected at the value \texttt{x}.
For example \texttt{dnorm(x=2, 2.5, 1.5)} $\simeq$ 0.2516. Hence
\begin{itemize}
\item [(i)] In a sample of 100 the expected number of observations per unit interval,
in the immediate vicinity of \texttt{x = 2}, is 25.16
\item [(ii)] In a sample of 1000 the expected number of observations per unit interval,
in the immediate vicinity of \texttt{x = 2}, is 251.6
\item [(iii)] The expected number of values from a sample of 100,
between 1.9 amd 2.1, is approximately 0.2 $\times$ 251.6 = 50.32
\vspace*{-9pt}
\begin{minipage}[c]{0.75\linewidth}
\[
\left[ \begin{array}{l}
\mbox{The number can be calculated more exactly as}\\
\mbox{\texttt{(pnorm(2.1, 2.5, 1.5) - pnorm(1.9, 2.5, 1.5)) * 1000}}\\
\end{array}
\right]
\]
\end{minipage}
\hspace*{.175\linewidth}
\end{itemize}
Repeat the calculation to get approximate and more exact values for the
expected number
\begin{itemize}
\item [(i)] between 0.9 and 1.1
\item [(ii)] between 2.9 and 3.1
\item [(iii)] between 3.9 and 4.1
\end{itemize}
\end{enumerate}
\end{fmpage}
By way of example, here is the code for (a):
<>=
curve(dnorm(x, mean=2.5, sd=1.5), from = 2.5-3*1.5, to = 2.5+3*1.5)
curve(pnorm(x, mean=2.5, sd=1.5), from = 2.5-3*1.5, to = 2.5+3*1.5)
@
\begin{fmpage}{36pc}
\exhead{2}
\vspace*{-15pt}
\begin{enumerate}
\item Plot the density and the cumulative probability curve for a
$t$-distribution with 3 degrees of freedom. Overlay, in
each case, a normal distribution with a mean of 0 and SD=1.
\newline
[Replace \texttt{dnorm} by \texttt{dt}, and specify \texttt{df=10}]
\item Plot the density and the cumulative probability curve for an
exponential distribution with a rate parameter equal to 1
(the default). Repeat, with a rate parameter equal to 2. (When used
as a failure time distribution; the rate parameter is the expected
number of failures per unit time.)
\end{enumerate}
\end{fmpage}
\vspace*{6pt}
\section{Samples and Estimated Density Curves}
\begin{fmpage}{36pc}
\exhead{3}
Use the function \texttt{rnorm()} to draw a random sample of 25
values from a normal distribution with a mean of 0 and a standard deviation
equal to 1.0. Use a histogram, with \texttt{probability=TRUE} to display
the values. Overlay the histogram with: (a) an estimated density curve;
(b) the theoretical density curve for a normal distribution with mean 0 and
standard deviation equal to 1.0.
Repeat with samples of 100 and 500 values, showing the different displays
in different panels on the same graphics page.
\end{fmpage}
<>=
par(mfrow=c(1,3), pty="s")
x <- rnorm(50)
hist(x, probability=TRUE)
lines(density(x))
xval <- pretty(c(-3,3), 50)
lines(xval, dnorm(xval), col="red")
@
\begin{fmpage}{36pc}
\exhead{4}
Data whose distribution is close to lognormal are common. Size measurements
of biological organisms often have this character. As an example, consider
the measurements of body weight (\texttt{body}), in the data frame
\texttt{Animals} (\textit{MASS}). Begin by drawing a histogram of the
untransformed values, and overlaying a density curve. Then
\begin{enumerate}
\item Draw an estimated density curve for the logarithms of the values.
Code is given immediately below.
\item Determine the mean and standard deviation of \verb!log(Animals$body)!.
Overlay the estimated density with the theoretical density for a normal
distribution with the mean and standard deviation just obtained.
\end{enumerate}
Does the distribution seem normal, after transformation to a logarithmic scale?
\end{fmpage}
<>=
library(MASS)
plot(density(Animals$body))
logbody <- log(Animals$body)
plot(density(logbody))
av <- mean(logbody)
sdev <- sd(logbody)
xval <- pretty(c(av-3*sdev, av+3*sdev), 50)
lines(xval, dnorm(xval, mean=av, sd=sdev))
@
\begin{fmpage}{36pc}
\exhead{5}
The following plots an estimated density curve for a random sample
of 50 values from a normal distribution:
<>=
plot(density(rnorm(50)), type="l")
@ %
\begin{enumerate}
\item Plot estimated density curves, for random samples of 50 values,
from (a) the normal distribution; (b) the uniform distribution
(\texttt{runif(50)}); (c) the $t$-distribution with 3 degrees of freedom.
Overlay the three plots (use \texttt{lines()} in place of \texttt{plot()}
for densities after the first).
\item Repeat the previous exercise, but taking random samples of 500 values.
\end{enumerate}
\end{fmpage}
\begin{fmpage}{36pc}
\exhead{6}
There are two ways to make an estimated density smoother:
\begin{itemize}
\item[(a)] One is to increase the number of samples, For example:
<>=
plot(density(rnorm(500)), type="l")
@
\end{itemize}
\end{fmpage}
\begin{fmpage}{36pc}
\exhead{6, continued}
\vspace*{-15pt}
\begin{itemize}
\item[(b)] The other is to increase the bandwidth. For example
<>=
plot(density(rnorm(50), bw=0.2), type="l")
plot(density(rnorm(50), bw=0.6), type="l")
@
\end{itemize}
Repeat each of these with bandwidths (\texttt{bw}) of 0.15,
with the default choice of bandwidth, and with the bandwidth set to 0.75.
\end{fmpage}
\begin{fmpage}{36pc}
\exhead{7}
Here we experiment with the use of \texttt{sample()} to take a sample
from an empirical distribution, i.e., from a vector of values that is
given as argument. Here, the sample size will be the number of values in
the argument. Any size of sample is however permissible.
<>=
sample(1:5, replace=TRUE) # With replacement sample, size 5
for(i in 1:10)print(sample(1:5, replace=TRUE)) # 10 such samples
plot(density(log10(Animals$body)))
lines(density(sample(log10(Animals$body), replace=TRUE)), col="red")
@
Repeat the final density plot several times, perhaps using different
colours for the curve on each occasion. This gives an indication of
the stability of the estimated density curve with respect to sample
variation.
\end{fmpage}
\vspace*{9pt}
\section{*Normal Probability Plots}
\begin{fmpage}{36pc}
\exhead{8}
Partly because of the issues with bandwidth and choice of kernel,
and partly because it is hard to
density estimates are not a very effective means for judging normality.
A much better tool is the normal probability plot, which works with
cumulative probability distributions. Try
<>=
qqnorm(rnorm(10))
qqnorm(rnorm(50))
qqnorm(rnorm(200))
@ %
For samples of modest to large sizes, the points lie close to a line.
The function \texttt{qreference()} (\textit{DAAG}) takes one sample as
a reference (by default it uses a random sample) and by default
provides 5 other random normal samples for comparison. For example:
<>=
library(DAAG)
qreference(m=10)
qreference(m=50)
qreference(m=200)
@ %
\end{fmpage}
\begin{fmpage}{36pc}
\exhead{9}
The intended use of \texttt{qreference()} is to draw a normal probability for
a set of data, and place alongside it some number of normal probability plots
for random normal data. For example
<>=
qreference(possum$totlngth)
@ %
Obtain similar plots for each of the variables \texttt{taill},
\texttt{footlgth} and \texttt{earconch} in the possum data.
Repeat the exercise for males and females separately
\end{fmpage}
\begin{fmpage}{36pc}
\exhead{10} Use normal probability plots to assess whether the
following sets of values, all from data sets in the \texttt{DAAG}
package, have distributions that seem consistent with the assumption
that they have been sampled from a normal distribution?
\begin{enumerate}
\item the difference \texttt{heated - ambient}, in the data frame
\texttt{pair65} (\textit{DAAG})?
\item the values of \texttt{earconch}, in the \texttt{possum} data frame
(\textit{DAAG})?
\item the values of \texttt{body}, in the data frame \texttt{Animals}
(\textit{MASS})?
\item the values of \texttt{log(body)}, in the data frame \texttt{Animals}
(\textit{MASS})?
\end{enumerate}
\end{fmpage}
\vspace*{9pt}
\section{Boxplots -- Simple Summary Information on a Distribution}
In the data frame \texttt{cfseal} (\textit{DAAG}), several of the columns
have a number of missing values. A relevant question is: ``Do missing
and non-missing rows have similar values, for columns that are complete?''
\begin{fmpage}{36pc}
\exhead{11}
Use the following to find, for each column of the data frame \texttt{cfseal},
the number of missing values:
\begin{verbcode}
sapply(cfseal, function(x)sum(is.na(x)))
\end{verbcode}
Observe that for \texttt{lung}, \texttt{leftkid}, \texttt{rightkid},
and \texttt{intestines} values are missing in the same six rows. For
each of the remaining columns compare, do boxplots that compare the
distribution of values for the 24 rows that had no missing values with
the distribution of values for the 6 rows that had missing values.
\end{fmpage}
Here is code that can be used to get started:
\begin{verbcode}
present <- complete.cases(cfseal)
boxplot(age ~ present, data=cfseal)
\end{verbcode}
Or you might use the lattice function and do the following:
\begin{verbcode}
present <- complete.cases(cfseal)
library(lattice)
present <- complete.cases(cfseal)
bwplot(present ~ age, data=cfseal)
\end{verbcode}
\begin{fmpage}{36pc}
\exhead{12}
Tabulate, for the same set of columns for which boxplots were obtained
in Exercise 2, differences in medians, starting with:
\begin{verbcode}
median(age[present]) - median(age[!present]))
\end{verbcode}
Calculate also the ratios of the two interquartile ranges,
i.e.
\begin{verbcode}
IQR(age[present]) - IQR(age[!present]))
\end{verbcode}
\end{fmpage}