The useR! 2009 conference (~470 participants) in Rennes was followed, a few days later, by the smaller DSC2009 (Directions in Statistical Computing) conference (~48 participants) in Copenhagen. Presentations at DSC2009 were in several instances more specialized versions of talks that had been given at Rennes. Audience questions at DSC2009 often focused on quite technical issues, in a manner that would have been out of place at Rennes.
The talks identified below are a selection only. I will be glad to hear from anyone who is able to fill gaps in my account. Please also let me know of any mistakes. Here are web links to other sources of information and comment:
The following focuses mostly on useR! 2009, but draws attention to follow-up issues and discussion at DSC2009. Some general comments follow:
The R system was spawned in the Unix world, by researchers who were familiar with that world. More generally, there was a familiarity with ideas that were commonplace among those who had worked in an era when serious use of computer systems often required close contact with hardware details. R came out of a historical coincidence in which the developers had similar and complementary abilities. Much of the technical knowledge is not written down anywhere in a systematic manner. A major part of the use of R has moved from a community where such technical understanding can be assumed, into the mainstream. Increasingly, R has users who are unfamiliar with the details of its functionality. These “under the hood” details, which can be ignored much of the time, do sometimes bite users who lack the knowledge that could be assumed in this historical culture. The following are examples:
(JM’s comment: Partly, these issues are a result of the increasing sophistication of the tasks that relatively naive users can now attempt. They are to that extent issues for use of any statistical software, not just R.)
Just as important as the abilities that R provides directly may be its role in bringing together a community of individuals with common interests but, often, very different expertise. Creation of the R2wd package was possible because of the confluence of expertise at useR! 2009. So maybe useR! conferences should incorporate sessions of something akin to a Google Summer of Code exercise?
Both REvolution and Netezza are heavily into parallel processing and large datasets. REvolution has recently released several R packages (foreach, iterators and doMC) that address parallel processing.
The bigdata package makes it possible to replace R’s common creation of multiple copies of a dataset that can be represented as a matrix, by a single copy that is held on disk and accessed via pointers. All or part of the data can then be brought into physical memory, as required for processing. The ff package makes it possible, with some limitations, to store data frames in this way.
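The idea of a single on-disk copy that is read and changed in place can be sketched in a few lines. The following is a minimal Python illustration using only the standard library. It is not how either package is implemented: the class name DiskMatrix, the row-major 8-byte floating point layout and the temporary file are all assumptions made for this example.

```python
import mmap
import os
import struct
import tempfile


class DiskMatrix:
    """Hypothetical row-major matrix of doubles, held in a file and accessed via mmap."""

    ITEM = struct.calcsize("d")  # bytes per stored value

    def __init__(self, path, nrow, ncol):
        self.nrow, self.ncol = nrow, ncol
        size = nrow * ncol * self.ITEM
        # Create a zero-filled backing file of exactly the required size.
        with open(path, "wb") as f:
            f.truncate(size)
        self._file = open(path, "r+b")
        # The operating system pages data in and out of physical memory as needed.
        self._map = mmap.mmap(self._file.fileno(), size)

    def _offset(self, i, j):
        if not (0 <= i < self.nrow and 0 <= j < self.ncol):
            raise IndexError("index out of range")
        return (i * self.ncol + j) * self.ITEM

    def __getitem__(self, ij):
        return struct.unpack_from("d", self._map, self._offset(*ij))[0]

    def __setitem__(self, ij, value):
        struct.pack_into("d", self._map, self._offset(*ij), value)

    def row(self, i):
        """Bring one row into memory as an ordinary list."""
        return list(struct.unpack_from(f"{self.ncol}d", self._map, self._offset(i, 0)))

    def close(self):
        self._map.flush()
        self._map.close()
        self._file.close()


path = os.path.join(tempfile.mkdtemp(), "matrix.bin")
m = DiskMatrix(path, nrow=3, ncol=2)
m[1, 1] = 2.5
alias = m            # a second name for the same on-disk data, not a copy
alias[0, 0] = 1.0    # the change is visible through m as well
first = m[0, 0]
second_row = m.row(1)
m.close()
```

Only the values that are actually touched need to be in physical memory, and assigning the object to a second name does not duplicate the data.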
An alternative, for some applications, is the use of functions in Thomas Lumley’s survey and mitools packages to store and access data from a database in data frame format. These use the R-DBI and RODBC interfaces. The relevant functions will in the near future be placed in a separate package.
DSC2009: Hadley Wickham’s talk at DSC2009 canvassed yet another approach. Although objects in R’s underlying model are non-mutable (assignment creates a new instance of the object), environment objects are a little different. This allows the creation of mutable objects which, as happens with the files that bigdata creates on disk, can be changed in place.
At the tutorial session on lme4 (multi-level modeling), Doug Bates discussed some slick sparse matrix methodology that allows R to fit models with very large numbers of random effects. A particular application was to educational measurement data.
This is by no means a complete list. Parallel computing was a major theme.
There are many different 2-D views of multivariate data sets. Dynamic graphics allow one Euclidean representation to morph into another. Additionally, different methods, e.g., multi-dimensional scaling and correspondence analysis, can morph via various intermediate forms of representation into one another. Michael Greenacre’s talk, which demonstrated some of the possibilities, was an exercise in virtuosity.
The present drivers can be frustratingly slow. Also, graphics features such as text may not scale up when the size of the graph changes. Some current devices do not support transparency, useful in allowing the amount of over-plotting to determine the color density. Several projects are aimed at rectifying these deficiencies:
Development versions of both Acinonyx and mosaiq are available from R-forge. The development version of Acinonyx works best on Mac OS X systems, should work on Windows, and is untested on other systems.
Andrew Runnalls has made substantial progress with this project, at present limiting attention to base R and the recommended packages. The project is intended to make it easier for researchers to develop experimental versions of the R interpreter. There is currently some increase in computational time, relative to the C implementation that we download from CRAN.
This is a major issue for taking advantage of the impending ubiquity of multiple processors in all systems, laptops included. (The Intel Core 2 Duo is already widely available on desktop and laptop systems.) R’s lack of multi-threading limits its ability to take advantage of the multiple processors.
This is designed to facilitate cooperative use of R via the internet, using, e.g., systems such as the Amazon cloud. Each participant can look in, at any time, on the work of other participants. See http://biocep-distrib.r-forge.r-project.org/.
Adele Cutler gave an insightful talk on random forests. Potential for over-fitting, if there is total reliance on OOB (Out-Of-Bag) accuracy estimates, may be more of an issue than had been acknowledged. Current proximities are pretty much for the training data, and are on that account likely to exaggerate differences between classes. She described a new method for estimating proximities, which yields (following multi-dimensional scaling) rather different low-dimensional representations. (See http://www.math.usu.edu/~adele, re RAFT). The class weights are supposed to have the effect of assigning prior probabilities. Because of problems in adapting the code for incorporation into R, these do not, in the R version, work as intended.
John Storey described joint work with Leek that addressed problems arising from a dependence structure between observations (slides) in the analysis of expression array and other high-dimensional data. Dependence commonly arises in expression array data because some slides (perhaps from cancer tissue samples from different individuals) share common but unmeasured covariates. Omission of these covariates from the analysis commonly leads to coefficients for some of the expression indices that are biased and possibly misleading. The large number of unexpressed features provides information that makes it possible to use a singular value decomposition to provide a low-order approximation to effects due to the missing covariates. There can be substantial changes in the sequences that are identified as exhibiting differential expression.
[JM: This is surely an important addition to the list of items that call for attention when analyzing expression array data.]
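The role of the singular value decomposition can be illustrated with a small simulation. The following Python sketch is a deliberately simplified stand-in, not the algorithm that Storey and Leek described: it removes the fitted effect of a known covariate from each feature, then takes the leading right singular vector of the residual matrix as an estimate of one unmeasured covariate. The simulated data, the function names and the use of power iteration in place of a full decomposition are all assumptions made for the example.

```python
import math
import random


def residuals_after_fit(y, x):
    """Residuals from a least-squares fit of y on an intercept and x."""
    n = len(y)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    return [(yi - y_mean) - slope * (xi - x_mean) for xi, yi in zip(x, y)]


def leading_right_singular_vector(rows, iterations=100, seed=1):
    """Power iteration on R^T R, giving the leading right singular vector of R."""
    rng = random.Random(seed)
    n = len(rows[0])
    v = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(iterations):
        scores = [sum(r_j * v_j for r_j, v_j in zip(row, v)) for row in rows]   # R v
        v = [sum(row[j] * s for row, s in zip(rows, scores)) for j in range(n)]  # R^T (R v)
        norm = math.sqrt(sum(v_j * v_j for v_j in v))
        v = [v_j / norm for v_j in v]
    return v


def correlation(a, b):
    n = len(a)
    a_mean, b_mean = sum(a) / n, sum(b) / n
    num = sum((ai - a_mean) * (bi - b_mean) for ai, bi in zip(a, b))
    den = math.sqrt(sum((ai - a_mean) ** 2 for ai in a) * sum((bi - b_mean) ** 2 for bi in b))
    return num / den


# Simulated data (an assumption for illustration): 200 features on 12 slides.
rng = random.Random(0)
n_slides, n_features = 12, 200
treatment = [0.0] * 6 + [1.0] * 6                                   # measured covariate
hidden = [1.0 if j % 2 == 0 else -1.0 for j in range(n_slides)]    # unmeasured covariate

expression = []
for g in range(n_features):
    effect = 2.0 if g < 20 else 0.0   # only the first 20 features respond to treatment
    loading = rng.gauss(0, 1)         # every feature is affected by the hidden covariate
    expression.append([effect * treatment[j] + loading * hidden[j] + rng.gauss(0, 0.5)
                       for j in range(n_slides)])

residual_rows = [residuals_after_fit(row, treatment) for row in expression]
estimate = leading_right_singular_vector(residual_rows)
agreement = abs(correlation(estimate, hidden))   # the sign of a singular vector is arbitrary
```

In this simulation the value of agreement should be close to 1, because the many features that do not respond to treatment still carry the imprint of the hidden covariate. The estimated vector could then be added as an extra covariate when the per-feature models are refitted.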
Jerome Friedman discussed his algorithm for a form of penalized regression where both the multiplier lambda and the Lp norm (with p in the range from 0 to 2) have to be chosen. Cross-validation was used to choose the optimum. In the cases investigated, there was a preference for p between 0 and 1.
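A generic way to write down a penalized regression of this kind is given below. It is a sketch only: the squared-error loss and the exact form of the penalty are assumptions, and the penalty family used in the talk may differ in detail.

```latex
\hat{\beta}(\lambda, p) \;=\; \arg\min_{\beta}\;
  \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - x_i^{\top} \beta \bigr)^2
  \;+\; \lambda \sum_{j=1}^{d} \lvert \beta_j \rvert^{p},
  \qquad \lambda \ge 0, \quad 0 \le p \le 2 .
```

Here p = 2 gives ridge regression and p = 1 gives the lasso. As p approaches 0 the penalty approaches a count of the nonzero coefficients, so that values of p between 0 and 1 favor sparser solutions than the lasso.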