Monthly Archives: August 2011

Books for data mining and related topics

Below is a running list of books and references for those interested in learning about data mining, statistical inference and related modeling. Feel free to comment with any titles you think should be added to this list.
1. “The Elements of Statistical Learning: Data Mining, Inference, and Prediction,” by Hastie, Tibshirani and Friedman.

This is an in-depth overview of many machine learning and data mining techniques. In addition, the authors wrote a number of the more powerful R packages, which they use to produce the analysis for the examples in the book. Wherever possible, different algorithms, software packages and approaches are compared in terms of computational speed, accuracy and usefulness. You can download the data sets used for these examples for a hands-on learning experience. Check out their website for more details.
2. “Bayesian Data Analysis,” by Gelman, Carlin, Stern, and Rubin.

This is practically the bible on the subject and a must-have for anyone working in this field. At almost 700 pages, it’s likely the most comprehensive treatise on the subject. General information, updates, and the data sets used in the examples can be found on Gelman’s site.

3. “Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions,” by Seni and Elder.

Here the focus is on ways to combine different machine learning models to provide more effective and flexible analysis. The book is very clear and a great introduction to the field at large. There are also plenty of R code examples to play around with. It is available, for example, here.

4. “Mining the Social Web,” by Russell.

This is a fun overview of using data mining techniques to analyze trends in social media. The book focuses on the “twitter” Python package and has step-by-step instructions, technical overviews, and plenty of source code. You can find it here. One thing to note is that there are two Python packages called “twitter”, so make sure you are using the one cited in the book.

5. “Hidden Markov Models for Time Series: An Introduction Using R,” by Zucchini and MacDonald.

The book provides a good introduction to Markov models and an in-depth treatment of the title topic. Springer has an entire series dedicated to various branches of statistical analysis geared towards the R language and its packages. Many university library systems allow you to download these for free.

6. “Data Analysis Using Open Source Tools,” by Janert.

A great introduction to the tools and techniques necessary for data analysis. There is a chapter on essentially every topic you need to cover as a budding data scientist: data visualization, data cleaning, predictive analysis, unsupervised learning, time series, an overview of the necessary mathematics, etc. You can find the book here.