useR! 2009 (Rennes, July 8-10) and DSC2009 (Copenhagen, July 13-14)

The useR! 2009 conference (~470 participants) in Rennes was followed, a few days later, by the smaller DSC2009 (Directions in Statistical Computing) conference (~48 participants) in Copenhagen. Presentations at DSC2009 were in several instances more specialized versions of talks that had been given at Rennes. Audience questions at DSC2009 often focused on quite technical issues, in a manner that would have been out of place at Rennes.

The talks that are identified below are a selection only. I will be glad to hear from anyone who is able to fill gaps in my account. Please let me know, also, of any mistakes. Here are web links to further or other sources of information and comment:

The following focuses mostly on useR! 2009, while drawing attention to follow-up issues and discussion at DSC2009. Some general comments now follow:

  1. Attendance at useR! has grown steadily with each successive year. More than 470 attended the 2009 meeting. Around 250 attended one or other of the tutorials held on the Tuesday.
  2. Participants were widely drawn: from a variety of application areas, from finance and business, from various areas of science, from computing, etc, as well as from academic and applied statistics.
  3. Organizations with a commercial interest in R were much in evidence: REvolution Computing (who are pushing R quite aggressively), Netezza (whose involvement with R is fairly new; they provide hardware and software solutions for work with very large datasets), Mango Solutions (a UK-based data analysis company), and Tibco (the people who now market S-PLUS).
  4. Strong themes were:
    1. Parallel computing and/or the processing of very large datasets.
    2. Interactive display of large datasets.
    3. Display of multivariate data, sometimes using dynamic display in innovative ways.
  5. Several of the talks demonstrated innovative and/or unusual uses of R.
    1. Use of a Nintendo Wii, joystick-like, to drive R. For example, this type of control can be used to navigate through a cloud of points. (Jensen Landon, Shah Vatsal)
    2. The seewave package for sound analysis and synthesis was developed initially for working with birdsong data.
    3. One participant developed, over the course of the conference and assisted by the expertise of other participants, an alpha version of a package (provisionally called R2wd) that creates Word files, combining text and R output (including graphs and tables) at will; see the sketch following this list.
    4. Development and prototyping of an environmental monitoring system, to be implemented on a real-time setup with limited memory.
    5. R code was a major part of a system that automates the setting of last-minute price offers at Thomas Cook, extracting relevant data from Reuters and other sources. (Jan Wijffels)
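
As an indication of the intended style of use of R2wd: the function names below follow my recollection of the package as it later appeared on CRAN, and should be treated as assumptions; the alpha version demonstrated at the conference may well differ.

    ## Hypothetical sketch: assemble a Word document from R (requires MS Word).
    ## Function names are assumptions; check the package documentation.
    library(R2wd)
    wdGet()                           # connect to (or start) Word
    wdTitle("Conference demo")
    wdBody("Summary of the cars data:")
    wdTable(head(cars))               # insert a table
    wdPlot(cars$speed, cars$dist)     # insert a graph
    wdSave("demo.doc")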

Reflections and More Detailed Comments

An Increasingly Diverse User Community (Peter Dalgaard)

The R system was spawned in the Unix world, by researchers who were familiar with that world. More generally, there was a familiarity with ideas that were pretty much commonplace to those who had worked in an era where serious use of computer systems often required close contact with hardware details. R came out of a historical coincidence where the developers had similar and complementary abilities. Much of the technical knowledge is not written down anywhere in a systematic manner. A major part of the use of R has moved from a community where such technical understanding can be assumed into the mainstream. Increasingly, R has users who are unfamiliar with the details of its functionality. These “under the hood” details, which can be ignored much of the time, do sometimes bite users who lack the knowledge that could be assumed in that historical culture. The following are examples:

(JM’s comment: Partly, these issues are a result of the increasing sophistication of the tasks that relatively naive users can now attempt. They are to that extent issues for use of any statistical software, not just R.)

Strength in Community!

Just as important as the abilities that R provides directly may be its role in bringing together a community of individuals with common interests but, often, very different expertise. Creation of the R2wd package was possible because of the confluence of expertise at useR! 2009. So maybe useR! conferences should incorporate sessions akin to a Google Summer of Code exercise?

Parallel Processing and Large Datasets

Both REvolution and Netezza are heavily into parallel processing and large datasets. REvolution has recently released several R packages (foreach, iterators and doMC) that address parallel processing.
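
A minimal sketch of the style of use (the backend, number of cores and the toy computation are illustrative only):

    ## Parallel loop with foreach; doMC supplies a multicore backend (Unix-alikes).
    library(foreach)
    library(doMC)
    registerDoMC(cores = 2)              # register the parallel backend
    ## Run independent tasks in parallel, combining the results into a vector
    res <- foreach(i = 1:4, .combine = c) %dopar% {
        mean(rnorm(1e6))
    }
    res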

The bigmemory package makes it possible to replace R’s common creation of multiple copies of a dataset that can be represented as a matrix by a single copy that is held on disk and accessed via pointers. All or part of the data can then be brought into physical memory, as required for processing. The ff package makes it possible, with some limitations, to store data frames in this way.
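
A minimal sketch, with invented file names and dimensions:

    ## File-backed matrix: the data live on disk, not in R's heap.
    library(bigmemory)
    x <- filebacked.big.matrix(nrow = 1e6, ncol = 10, type = "double",
                               backingfile = "big.bin",
                               descriptorfile = "big.desc")
    x[1:5, ] <- rnorm(50)          # write a small block
    colMeans(x[1:5, ])             # bring only the needed rows into memory
    ## Another R session can attach the same data without copying it:
    ## y <- attach.big.matrix("big.desc")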

An alternative, for some applications, is the use of functions in Thomas Lumley’s survey and mitools packages to store and access data from a database in data frame format. These use the DBI and RODBC interfaces. The relevant functions will, in the near future, be placed in a separate package.
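
A sketch of a database-backed design object (the table, variable and file names are invented):

    ## Survey design whose data remain in an SQLite database;
    ## variables are fetched only as individual analyses require them.
    library(survey)
    dhs <- svydesign(ids = ~psu, strata = ~region, weights = ~wt,
                     data = "household",        # a table name, not a data frame
                     dbtype = "SQLite", dbname = "survey.db")
    svymean(~income, dhs)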

DSC2009: Hadley Wickham’s talk at DSC2009 canvassed yet another approach. Although objects in R’s underlying model are non-mutable (assignment creates a new instance of the object), environment objects are a little different. This allows the creation of mutable objects which, as happens with the files that bigmemory creates on disk, can be changed in place.
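
In essence (a small base-R illustration of the point, not code from the talk):

    ## Ordinary objects are copied on modification; environments are not.
    counter <- new.env()
    counter$n <- 0
    bump <- function(e) e$n <- e$n + 1   # modifies the environment in place
    bump(counter); bump(counter)
    counter$n                            # 2: no copy was made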

At the tutorial session on lme4 (multi-level modeling), Doug Bates discussed some slick sparse matrix methodology that allows R to fit models with very large numbers of random effects. A particular application was to educational measurement data.
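
The model class in question looks something like the following (the data set and variable names are hypothetical):

    ## Crossed random effects, e.g. for students and the items they attempt;
    ## lme4's sparse-matrix methods keep this feasible with many levels.
    library(lme4)
    fit <- lmer(score ~ 1 + (1 | student) + (1 | item), data = results)
    summary(fit)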

This is by no means a complete list. Parallel computing was a major theme.

Dynamic Graphics

There are many different 2-D views of multivariate data sets. Dynamic graphics allows one Euclidean representation to morph into another. Additionally, different methods, e.g. multi-dimensional scaling and correspondence analysis, can morph via various intermediate forms of representation into one another. Michael Greenacre’s talk, which demonstrated some of the possibilities, was an exercise in virtuosity.

DSC2009 (mostly): Low-level graphics abilities

The present graphics devices can be frustratingly slow. Also, graphics features such as text may not scale up when the size of the graph changes. Some current devices do not support transparency, which is useful in allowing the amount of over-plotting to determine the color density. Several projects are aimed at rectifying these deficiencies:

Development versions of both Acinonyx and mosaiq are available from R-forge. The development version of Acinonyx works best on Mac OS X systems, should work on Windows, and is untested on other systems.
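
On a device that does support transparency, the over-plotting point above can be illustrated in base R with semi-transparent colors (simulated data):

    ## 50,000 heavily over-plotted points; alpha-blending lets density show through
    x <- rnorm(50000); y <- x + rnorm(50000)
    plot(x, y, pch = 16, col = rgb(0, 0, 1, alpha = 0.05))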

DSC2009: Reengineering the R Interpreter Into C++

Andrew Runnalls has made substantial progress with this project, at present limiting attention to base R and the recommended packages. The project is intended to make it easier for researchers to develop experimental versions of the R interpreter. There is currently some increase in computational time, relative to the standard C implementation that we copy down from CRAN.
(http://www.cs.kent.ac.uk/projects/cxxr/)

DSC2009 (mostly): Multi-threading

This is a major issue for taking advantage of the impending ubiquity of multiple processors in all systems, including laptops. (The Intel Core 2 Duo is already widely available on desktop and laptop systems.) R’s lack of multi-threading limits its ability to take advantage of the multiple processors.

Biocep Computational Open Platform

This is designed to facilitate cooperative use of R via the internet, using, e.g., systems such as the Amazon cloud. Each participant can look in, at any time, on the work of other participants. See http://biocep-distrib.r-forge.r-project.org/.

Talks on Analysis Methodology

Random Forests

Adele Cutler gave an insightful talk on random forests. Potential for over-fitting, if there is total reliance on OOB (Out-Of-Bag) accuracy estimates, may be more of an issue than had been acknowledged. Current proximities are pretty much for the training data, and are on that account likely to exaggerate differences between classes. She described a new method for estimating proximities, which yields (following multi-dimensional scaling) rather different low-dimensional representations. (See http://www.math.usu.edu/~adele, re RAFT). The class weights are supposed to have the effect of assigning prior probabilities. Because of problems in adapting the code for incorporation into R, these do not, in the R version, work as intended.
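
For orientation, the standard randomForest interface already gives OOB error estimates and (training-data) proximities; the sketch below uses the stock iris data and the current proximity calculation, not the new method that Cutler described:

    ## Fit a forest, report OOB error, and view proximities after MDS scaling.
    library(randomForest)
    rf <- randomForest(Species ~ ., data = iris, proximity = TRUE)
    rf$confusion                 # OOB-based confusion matrix
    MDSplot(rf, iris$Species)    # classical MDS on 1 - proximity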

Dependence structure in high dimensional data

John Storey described joint work with Leek that addressed problems arising from dependence between observations (slides) in the analysis of expression array and other high-dimensional data. Dependence commonly arises in expression array data because some slides (perhaps cancer tissue samples from different individuals) share common but unmeasured covariates. Omission of these covariates from the analysis commonly leads to coefficients for some of the expression indices that are biased and possibly misleading. The large number of unexpressed features provides information that makes it possible to use a singular value decomposition to provide a low-order approximation to effects due to the missing covariates. There can be substantial changes in the sequences that are identified as exhibiting expression.
[JM: This is surely an important addition to the list of items that call for attention when analyzing expression array data.]
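
A rough base-R sketch of the general idea, not the authors' algorithm (their own methods are implemented in the sva package on Bioconductor): estimate the effect of the measured covariate, take a singular value decomposition of the residuals, and include the leading singular vectors as surrogate covariates. All names and data below are simulated for illustration.

    ## dat: genes x arrays expression matrix (simulated); grp: measured covariate
    set.seed(1)
    n <- 50; G <- 1000
    grp <- rep(0:1, each = n/2)
    hidden <- rnorm(n)                             # unmeasured covariate
    dat <- matrix(rnorm(G * n), G, n) + outer(rnorm(G), hidden)
    ## Residuals after removing the fitted group effect, gene by gene
    fit <- lm(t(dat) ~ grp)
    res <- t(residuals(fit))
    ## Leading right singular vectors approximate the hidden structure
    sv <- svd(res, nu = 0, nv = 2)$v
    ## Re-fit each gene, including the surrogate variables as covariates
    fit2 <- lm(t(dat) ~ grp + sv)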

Penalized Regression

Jerome Friedman discussed his algorithm for a form of penalized regression where both the penalty multiplier lambda and the exponent p of the Lp norm (with p in the range from 0 to 2) have to be chosen. Cross-validation was used to choose the optimum. In the cases investigated, there was a preference for p between 0 and 1.
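
Friedman's fast coordinate-descent code is available in the glmnet package, though there the penalty is the elastic-net mixture of the L1 and L2 norms rather than a general Lp norm. As a loosely related illustration of cross-validated choice of the tuning parameter (simulated data, not from the talk):

    ## Cross-validated elastic-net fit; alpha mixes the L1 and L2 penalties.
    ## (This illustrates CV-based tuning; it is not the Lp-norm method of the talk.)
    library(glmnet)
    x <- matrix(rnorm(100 * 20), 100, 20)
    y <- x[, 1] - 0.5 * x[, 2] + rnorm(100)
    cvfit <- cv.glmnet(x, y, alpha = 0.5)
    cvfit$lambda.min                 # lambda chosen by cross-validation
    coef(cvfit, s = "lambda.min")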