Data mining tools

A running list of tools useful for the data miner, with a few words about each.

1. The R language.

R is one of the most powerful statistical analysis tools out there. Its more than 1,600 packages and libraries contain just about everything you have ever heard of, plus a wealth of data sets to play around with. The documentation is for the most part very good, with plenty of example code. The IDE is easy to install, and adding a package takes just a few button clicks on the menu. Syntactically the language is similar to Python and is pretty intuitive, with a lot of thought having gone into making data set manipulation as easy as possible. The standard plot libraries, as well as more advanced ones such as ggplot2, provide great visualization capabilities.
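
A minimal sketch of the kind of interactive analysis I mean, using the built-in mtcars data set and ggplot2 (nothing here is specific to any project; the columns come with the data set):

    library(ggplot2)

    data(mtcars)
    summary(mtcars)                                  # quick look at every column
    aggregate(mpg ~ cyl, data = mtcars, FUN = mean)  # average mpg by cylinder count

    # scatter plot of weight vs. fuel efficiency, coloured by cylinder count
    ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
      geom_point() +
      labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")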

R has some issues with scaling, i.e. dealing with large data, but these problems are being solved (see, for example, here). Of course, there are technical drawbacks (for example, R is very slow at running loops, and it is not great for date/time manipulation in time series, etc.); I will try to address some of these in subsequent posts.
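
As a quick illustration of the loop issue (the data here is just random numbers, chosen only so the timings are visible), compare an explicit loop with its vectorized equivalent:

    x <- rnorm(1e6)

    # explicit loop, accumulating element by element
    system.time({
      total <- 0
      for (v in x) total <- total + v
    })

    # vectorized equivalent, usually far faster
    system.time(sum(x))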

2. Python.

Python is a high-level programming language, also open source, and in conjunction with the numpy and scipy packages it is one of the most powerful computational tools out there. It allows a wide range of programming styles and interoperates with a number of other languages such as C and C++. There is also a well-developed module library of scientific as well as data mining tools; for example, there are a number of modules that let you mine and investigate social networks such as Twitter, which are fun to look into, as well as things like regression and SVMs. The matplotlib library is a powerful visualization tool.

3. Weka.

Weka is a data mining and analytics suite that is incredibly easy to use. It provides an interface packed with all the standard data mining tools you have heard of and allows you to combine these in a click-and-drag fashion. Model adjustment is similarly easy. Everything is coded in Java, which I don't know, so I have no experience getting under the hood of this thing.


4 thoughts on “Data mining tools”

  1. Hans Engler says:

    Adding to the description of R: RStudio provides a good environment for programming, and managing your data, code, and plots. And the “rattle” data mining package has many of the functionalities of Weka and even a similar interface, but it works within R and even generates R code. Being implemented in an R setting, it also has fewer memory limitations than Weka.
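
    Getting started with rattle from inside R is about as simple as it gets – a minimal sketch (the GUI takes over from there):

        install.packages("rattle")   # one-time install
        library(rattle)
        rattle()                     # launches the rattle GUI from within an R session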

    You can therefore use Weka to explore many possible algorithms for a particular task, replicate the good ones in rattle, and then tweak them in R.

    Both Weka and rattle are good teaching tools in my view. I have used Weka in a class on data mining because it is so easy to learn, but the fact that you can’t easily look “under the hood” always leaves some of us hungry.

    • notjustmath says:

      Thank you for the info. RStudio does look very nice – I am about to download it and take a closer look. I have used Weka in the past, but as you say have found it too restrictive; there are ways to integrate it with R, which I tried to do at some point and ran into a number of issues. Perhaps rattle is the better approach.
      Most of the time I program in R and Python almost in parallel, i.e. while something is running in R, I am coding the next Python script and vice versa. Due to R’s slowness at reading and writing data, all my preprocessing is now done in Python, usually via a wrapper that connects to a database directly – I have found this saves a lot of time on many occasions and avoids overloading RAM. Since a lot of the datasets I work with are large enough to cause memory and computational issues, I like to have as much control as possible.

      I know there are packages that allow you to call R from Python, etc., but I have not used these since the syntax seems awkward and it adds just another layer of something that can go wrong in production – have you done this? Usually I just write a bash script to call the code one after the other, both reading/writing from the same database.

      I have just ordered a Dell machine with a powerful GPU and some other custom components – the GPU comes configured with CUDA, which should give me full control over distributing/parallelizing computation (some of which I’ll be able to do directly from within the Python or R code). Will keep you updated as to how much that helps.

      • Hans Engler says:

        As to parallelization in R – use the package “snow” on a multicore machine, and get into the habit of using l/s/tapply (lapply, sapply, tapply) as much as possible, as this tends to let one discover opportunities to parallelize a task. I don’t know much about Python, I have to confess – it may very well be a superior approach all around.
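
        Roughly, the pattern I have in mind looks like the sketch below (the worker function and inputs are just placeholders, not anything from a real project):

            library(snow)

            cl <- makeCluster(4, type = "SOCK")        # four workers on a multicore box

            slow_task <- function(x) {                 # placeholder for an expensive per-item job
              mean(rnorm(1e5, mean = x))
            }

            results <- parSapply(cl, 1:100, slow_task) # parallel counterpart of sapply()

            stopCluster(cl)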

        What is your experience with getting to choose the programming environment? I am in an R shop here, and would probably not be able to get the relevant users to adopt something different such as Python.

  2. notjustmath says:

    Snow looks good – thank you. I have been replacing most R loops with l/s/tapply functions and generally get better results, sometimes much, much better. I say “generally” because there are mixed reviews and some confusion as to whether these are actually faster than loops – see, for example, here: http://stackoverflow.com/questions/5533246/why-is-apply-method-slower-than-a-for-loop-in-r – and in some instances I have noticed loops to be on par or slightly faster. What I have found most effective is a combination of intelligent binning and the sapply() family. For example, when I have needed to evaluate a function, or model, on a large set, breaking that set up into smaller chunks, looping over those, and putting everything back together has proven to be the difference between a computation that takes an hour or more, or simply stalls, and one that takes less than 60 seconds.
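
    A rough sketch of what I mean by binning plus the apply family (the data and the scoring function here are made up purely for illustration):

        score_row <- function(z) sqrt(sum(z^2))            # stand-in for a model evaluation

        big      <- matrix(rnorm(2e6), ncol = 10)          # a "large" data set, ~200,000 rows
        chunk_id <- ceiling(seq_len(nrow(big)) / 1000)     # contiguous bins of 1,000 rows

        # evaluate each chunk on its own, then stitch the results back together in order
        per_chunk <- lapply(split(seq_len(nrow(big)), chunk_id), function(idx)
          apply(big[idx, , drop = FALSE], 1, score_row))
        scores <- unlist(per_chunk, use.names = FALSE)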

    For Python, lately I have gravitated towards coding and running a console within the Aquamacs editor – this seems to work perfectly fine and is fairly convenient; you can also configure Aquamacs to work with IPython, which allows tab completion, intelligent copy/paste, and other functionality. Since I run a lot of scripts on the server, that is all done in the terminal. Python is indeed a very good language and allows for a lot of flexibility, as well as optimization. However, with the abundance of packages, R is hard to walk away from. I have tried a number of different Python packages, such as scikit, but have so far found these disappointing in terms of performance. Of course, for all data pre-processing and some of the small models I have coded up, Python has been very good, but coding a full model from scratch each time I want to run a new type of experiment is not reasonable. All in all I have a good balance between Python for data pre-processing and R for most modeling/experimenting, and with some intelligence, as well as good technology, the latter can be pulled up to speed for almost any application.
