A running list of tools useful for the data miner, with a few words about each.
1. The R language.
R is one of the most powerful statistical analysis tools out there. The more than 1600 packages and libraries contain just about everything you have ever heard of, plus a wealth of data sets to play around with. The documentation is, for the most part, very good, with plenty of example code. The IDE is easy to install, and adding a package takes just a few button clicks in the menu. Syntactically the language is somewhat similar to Python and is pretty intuitive, with a lot of thought having gone into making data set manipulation as easy as possible. The standard plot libraries, as well as more advanced ones such as ggplot2, provide great visualization capabilities.
R has some issues with scaling, i.e. dealing with large data, but these problems are being solved (see, for example, here). Of course, there are technical drawbacks (for example, R is very slow at running explicit loops, and is not great for date/time manipulation in time series, etc.); I will try to address some of these in subsequent posts.
2. Python.
Python is a high-level programming language, also open source, and in conjunction with the NumPy and SciPy packages it is one of the most powerful computational tools out there. It allows a wide range of programming styles and interoperates with a number of other languages such as C and C++. There is also a well-developed library of modules for scientific computing and data mining; for example, there are a number of modules for mining and investigating social networks such as Twitter that are fun to look into, as well as modules for things like regression and SVMs. The matplotlib library is a powerful visualization tool.
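To give a flavor of the NumPy side of this, here is a minimal sketch of a straight-line regression fit. The data is synthetic and made up purely for illustration (a line with slope 2 and intercept 1 plus some noise), and the variable names are my own; the only library call that matters is `np.linalg.lstsq`, which solves the least-squares problem.

```python
import numpy as np

# Synthetic data, invented for this example: y = 2x + 1 plus small Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Design matrix: one column for x, one column of ones for the intercept.
A = np.column_stack([x, np.ones_like(x)])

# Ordinary least squares: find coef minimising ||A @ coef - y||^2.
coef, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)
slope, intercept = coef

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

The fitted slope and intercept should land close to the values used to generate the data. Note that there is no explicit loop anywhere; the whole fit is expressed as array operations, which is the usual way of working in NumPy.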
3. Weka.
Weka is a data mining and analytics suite that is incredibly easy to use. It provides an interface packed with all the standard data mining tools you have heard of and lets you combine them in a click-and-drag fashion. Model adjustment is similarly easy. Everything is coded in Java, which I don't know, so I have no experience getting under the hood of this thing.