Books for data mining and related topics

Below is a running list of books and references for those interested in learning about data mining, statistical inference and related modeling. Feel free to comment on ones you feel need to be added to this list.
1. “The Elements of Statistical Learning: Data Mining, Inference, and Prediction,” by Hastie, Tibshirani and Friedman.

This is an in-depth overview of many machine learning and data mining techniques. In addition, the authors wrote a number of the more powerful R packages, which they use to produce the analyses for the examples in the book. Wherever possible, different algorithms, software packages and approaches are compared in terms of computational speed, accuracy and usefulness. You can download the data sets used for these examples for a hands-on learning experience. Check out their website for more details.
2. “Bayesian Data Analysis,” by Gelman, Carlin, Stern, and Rubin.

This is practically the bible on the subject and a must-have for anyone working in this field. At almost 700 pages, it’s likely the most comprehensive treatise on the subject. General information, updates, as well as the data sets used in the examples can be found on Gelman’s site.

3. “Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions,” by Seni and Elder.

Here the focus is on ways to combine different machine learning models to provide more effective and flexible analysis. The book is very clear and a great introduction to the field at large. There are also plenty of R code examples to play around with. It is available, for example, here.

4. “Mining the Social Web,” by Russell.

This is a fun overview of using data mining techniques to analyze trends in social media. The book focuses on the “twitter” Python package and has step-by-step instructions, technical overviews, and plenty of source code. You can find it here. One thing to note is that there are two Python packages called “twitter”, so make sure you are using the one cited in the book.

5. “Hidden Markov Models for Time Series: An Introduction Using R,” by Zucchini and MacDonald.

The book provides a good introduction to Markov models and an in-depth treatment of the title topic. Springer has an entire series dedicated to various branches of statistical analysis geared towards the R language and the packages therein. Many university library systems allow you to download these for free.

6. “Data Analysis Using Open Source Tools,” by Janert.

A great introduction to the tools and techniques necessary for data analysis. There is a chapter on essentially every topic you need to cover as a budding data scientist: data visualization, data cleaning, predictive analysis, unsupervised learning, time series, an overview of the necessary mathematics, etc. You can find the book here.
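
To give a flavor of the model-combining idea behind book 3 (Seni and Elder), here is a minimal Python sketch of majority voting. It is not taken from the book, whose examples are in R; the three threshold classifiers and all of the names below are made up purely for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among the base models' predictions."""
    return Counter(predictions).most_common(1)[0][0]


# Three hypothetical, deliberately crude classifiers on a single number x.
# Each one draws its decision boundary in a different place.
def model_a(x):
    return int(x > 2)


def model_b(x):
    return int(x > 4)


def model_c(x):
    return int(x > 6)


def ensemble(x):
    """Combine the three base models by letting them vote."""
    return majority_vote([m(x) for m in (model_a, model_b, model_c)])


labels = [ensemble(x) for x in [1, 3, 5, 7]]
print(labels)
```

With an odd number of binary voters there are never ties, and the ensemble ends up following whichever boundary at least two of the models agree on. Real ensemble methods such as bagging or boosting are considerably more refined than this, but the voting step is the same basic idea.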


2 thoughts on “Books for data mining and related topics”

  1. human mathematics says:

    1. Have you read all of #1? How do you think it contrasts to Bishop?

    2. Can you contrast #2 with some other Bayesian books / notes?

    • notjustmath says:

      1. I can’t claim to have read all of Hastie et al., but I have read a lot of it, including more than half of the chapters in their entirety. The book is a bit dense to just read cover to cover, but it is an incredibly handy reference. The chapters are clearly separated and can be read independently. I have not spent much time with Bishop at all, but reliable sources tell me it’s a good overview of ML. What I can say is that Hastie et al. is very good at discussing and comparing learners in tandem with specific R packages, so you get to see where, say, splines do better than naive Bayes and such things.
      2. There isn’t really a single place I have learned Bayesian statistics from, so it’s hard for me to compare “Bayesian Data Analysis” to other sources. However, this book basically contains everything I have seen anywhere else. What is regularly frustrating in this text, and this is true of many Bayesian and other stats books, is that a theory or example is worked out by starting with the words “so we know that the prior is…” or “we have a normal distribution underlying our prior likelihood…” or some statement of the sort. But no mention is made of how the authors came up with this starting point or how you would arrive there. Of course, you learn how to deal with such things.
