Data Mining

Course — Classification, Modern Regression and Multivariate Exploration Using R

Here, I comment on motivations for this course.

Data Analysis Demands in 2013

"If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what's getting ubiquitous and cheap? Data. And what is complementary to data? Analysis. So my recommendation is to take lots of courses about how to manipulate and analyze data: databases, machine learning, econometrics, statistics, visualization, and so on."
[Hal Varian, Chief Economist at Google] Read the whole interview. Varian speaks: YouTube video.

A New Twist on Data Deployment and Analysis

Data mining gives a new twist to data deployment and analysis methodologies that have been developed over the past century or more. A good overview is the online article by Mike Loukides at O'Reilly Radar, "What is data science?":

"The future belongs to the companies and people that turn data into products."

Technological and methodological changes and advances have included:

Huge increases in computational power and in computer storage
A synergy between theoretical and algorithmic advances, advances in software and in computational power
Integration of what were formerly stand-alone abilities into single software systems with a single interface and command language (the R system is a prime example; see below)
New types of data, and new opportunities for collecting data, arising from advances in instrumentation, from the internet and from widespread deployment of databases. Chapter 5 of Ayres (2007), entitled "Why now?", has interesting commentary on the impact of such advances.

Classification is a major preoccupation of data mining, with a more limited focus on regression with a continuous outcome variable. The aim is typically prediction rather than the more challenging task of interpreting model parameters.

The R System

Students intending to take this course will find it useful to come with some initial familiarity with the open-source R system for scientific and statistical computing and for graphics. The R system is available without charge for downloading from the internet. It is a marvelous example of what can be achieved when highly skilled specialists co-operate internationally, using the internet for communication and co-ordination.

Background Reading

Ian Ayres 2007, Super Crunchers. Why Thinking-By-Numbers is the New Way to be Smart. Bantam. [This places data mining in a wider context of data-based decision-making in business, government and consumer affairs. While popular in style and short on analysis detail, it offers a useful overview of ways in which applications of data mining and related analytical techniques are developing and changing, in part because of the new opportunities and challenges of the internet.]

Thomas H. Davenport and Jeanne G. Harris 2007, Competing on Analytics: The New Science of Winning. Harvard Business School Press. [Analytics is a buzzword for the application of data-mining-type approaches in commerce. Davenport and Harris give a useful overview of issues for the deployment of analytical techniques within organizations: benefits and traps, choice of amenable tasks, the role of management, skill base issues, etc.]

Nate Silver 2012, The Signal and the Noise: The Art and Science of Prediction. [Nate Silver called the outcome of the last two US presidential campaigns to within a whisker. Important points about practical data analysis issues are well made. He covers the financial crisis that started in 2007, prediction of election results, picking likely sports stars, weather and climate forecasting, earthquake prediction, economic forecasting, prediction of epidemics, and much more. He documents the recent record of successes and failures of prediction in these areas.]

John Maindonald and John Braun 2010, Data Analysis and Graphics Using R: An Example-Based Approach, 3rd edn. Cambridge University Press. [Of greatest relevance to the course are Chapter 2 on Styles of Data Analysis, Chapters 5 & 6 (through to 6.3) on Linear Models, Chapter 8 (through to 8.3) on logistic regression, Chapter 11 on Tree-based Methods, and Chapter 12 (through to 12.2) on Multivariate Data Exploration & Discrimination.]
