
I will give a brief survey of the workings of a few of these packages on just one machine (extending this to a 'real' cluster basically only requires making sure that all packages and dependencies are installed on all machines and that passwordless ssh login is enabled).

So suppose that you have a dual-core machine, like my Mac OS X 10.6 machine here, and you want to run a process in parallel. You can start with the package snowfall, which is a wrapper for snow, which in turn depends on Rmpi, socket connections and so on; you can see the dependency tree here. However, none of that is important at the moment if you just want to try running some computation faster. Start by installing snowfall, say with install.packages("snowfall") in the R console, and proceed as follows:

1) Initialize a two-CPU cluster (or however many CPUs you have on board)

1.5) If necessary, push data to all the CPUs

2) Run your computation

3) Stop the cluster

Snowfall has its own version of all the functions in the apply() family, as well as some other network management tools.

Let's try an example in both parallel and sequential mode:

library(snowfall)

sfInit(parallel = TRUE, cpus = 2, type = "SOCK", socketHosts = rep('localhost', 2))

### If you have any data that will be used in the computation, such as a
### matrix or data frame, you need to make sure these are pushed (written)
### to all the cores:
sfExportAll()
### ...or, if you want to export just one object:
sfExport("data")

system.time(sfLapply(1:100000, function(i){exp(i)}))
#  user  system elapsed
# 0.072   0.009   0.172

system.time(lapply(1:100000, function(i){exp(i)}))
#  user  system elapsed
# 0.140   0.021   0.163

sfStop()

So what happened? The regular lapply() was actually faster than the parallel sfLapply(). This is due to a combination of the time it takes to copy the data plus communication latency, and the moral of the story is that parallel computing is not always better; it makes sense only when the computation time is actually significantly longer than the copying/latency overhead.

But let us take a look at an example that actually justifies the post. Here is a code snippet from a tree ensemble model. The bit we'll look at has R predicting from a fitted rpart tree, here called 'tlist[[1]],' i.e. the first tree in my ensemble, on a data set of 22 variables and 65,000 rows, called 'datatrain.'

library(snowfall)

sfInit(parallel = TRUE, cpus = 2, type = "SOCK", socketHosts = rep('localhost', 2))
sfExport('tlist')
sfExport('datatrain')

system.time(result <- sfLapply(1:nrow(datatrain), function(r) {predict(tlist[[1]], datatrain[r, ])}))
#   user  system elapsed
#  0.357   0.497 158.179

system.time(result <- lapply(1:nrow(datatrain), function(r) {predict(tlist[[1]], datatrain[r, ])}))
#    user  system elapsed
# 281.359   8.689 290.506

sfStop()

Generally I've seen about a 2x improvement in such calculations, which makes sense on two cores. Now let's try the same with another package called "multicore," which is designed for exactly this purpose: computing on a single machine with multiple cores.

library("multicore")

system.time(result <- mclapply(1:nrow(datatrain), function(r) {predict(tlist[[1]], datatrain[r, ])}))
#    user  system elapsed
# 235.624   9.320 132.308

Even better. multicore also has a number of functions for loop parallelization, as well as some other tools.

So why would you ever use snowfall over multicore? Well, one reason is that multicore only offers an lapply() equivalent, whereas snowfall has the entire apply() family, but that might not be a sticking point for you. More importantly, managing multiple machines with snowfall really doesn't take more work than managing a dual-core one, so if you are planning to write code that potentially needs to be scalable, this is a good way to go.
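To illustrate the scalability point, here is a sketch of what the same snowfall setup looks like on several machines. The host names node1 and node2 are hypothetical, and the sketch assumes R, snowfall, and passwordless ssh are set up on each host:

```r
library(snowfall)

## Hypothetical host names -- replace with your own machines.
## Listing a host twice uses two CPUs on that machine.
hosts <- c("node1", "node1", "node2", "node2")

sfInit(parallel = TRUE, cpus = length(hosts), type = "SOCK",
       socketHosts = hosts)

## Push the fitted tree and the data to every worker, as before.
sfExport("tlist")
sfExport("datatrain")

result <- sfLapply(1:nrow(datatrain), function(r) {
  predict(tlist[[1]], datatrain[r, ])
})

sfStop()
```

Note that relative to the single-machine example, only the socketHosts argument has changed.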

Either way, both of these packages give a very quick and easy way to see immediate performance improvement in your computations. Of course, this is only a small part of the story which I will continue in subsequent posts.

Filed under: Data Mining, Technical


There is a developing field called algebraic statistics which explores probability and statistics problems involving discrete random variables using methods coming from commutative algebra and algebraic geometry. The *basic* point is that the parameters of such statistical models are often constrained by polynomial relationships, and these are exactly the subject of commutative algebra and algebraic geometry. I would like to learn something more about this relationship, so in this post I'll describe one example that I worked through; it comes from a book on the subject written by Bernd Sturmfels. Disclaimer: the rest of this post is technical.

Suppose that you have three independent exponential random variables $X_1, X_2, X_3$ with parameters $\lambda_1, \lambda_2, \lambda_3$ respectively. Recall that this means $X_i$ has the density function

$f_i(x) = \lambda_i e^{-\lambda_i x}$ when $x \ge 0$ and $f_i(x) = 0$ otherwise.

Such a random variable is often interpreted in terms of waiting time to first occurrence for a Poisson process.

Suppose that rather than observing the times of occurrence directly, we only get to observe whether $X_1$ and $X_2$ occur before $X_3$ or not. (I'll leave you to think about how this could come up in practice, but it seems like it could.) So what we observe is a discrete random variable with four possible states, indicating whether or not variable $X_i$ ($i = 1, 2$) occurs before variable three. Now for an exercise in working with probability distributions (something the students in the class I taught could hopefully do!): compute the probabilities of the four states.
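The exercise can be worked out by conditioning on $X_3 = t$ and integrating against its density. In my notation (not the original's), $p_{ij}$ is the probability that the indicator of $\{X_1 < X_3\}$ equals $i$ and the indicator of $\{X_2 < X_3\}$ equals $j$:

```latex
\begin{aligned}
p_{00} &= P(X_1 \ge X_3,\; X_2 \ge X_3) = \frac{\lambda_3}{\lambda_1+\lambda_2+\lambda_3},\\[2pt]
p_{10} &= P(X_1 < X_3,\; X_2 \ge X_3) = \frac{\lambda_3}{\lambda_2+\lambda_3} - \frac{\lambda_3}{\lambda_1+\lambda_2+\lambda_3},\\[2pt]
p_{01} &= P(X_1 \ge X_3,\; X_2 < X_3) = \frac{\lambda_3}{\lambda_1+\lambda_3} - \frac{\lambda_3}{\lambda_1+\lambda_2+\lambda_3},\\[2pt]
p_{11} &= P(X_1 < X_3,\; X_2 < X_3) = 1 - \frac{\lambda_3}{\lambda_1+\lambda_3} - \frac{\lambda_3}{\lambda_2+\lambda_3} + \frac{\lambda_3}{\lambda_1+\lambda_2+\lambda_3}.
\end{aligned}
```

The four probabilities sum to 1, and each is a rational function of the $\lambda_i$, homogeneous of degree zero.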

Great. Now suppose that we sample the variable some large number of times, and record the number of times that it takes on each of the four values above. Given these observations, we would like to estimate the three values $\lambda_1, \lambda_2, \lambda_3$. One way of doing this is, given a choice of values for the $\lambda_i$'s, to compute the probability of observing the given data. Then we choose the values to maximize this probability (this is known as the method of maximum likelihood estimation). Rather than maximize this function, we'll maximize the log of it (which amounts to the same thing) to take advantage of the multiplicative structure of the function. You can compute this function:
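Writing $u_1, \dots, u_4$ for the observed counts and $p_1, \dots, p_4$ for the corresponding state probabilities (my labels, not the original's), the function to maximize is the log-likelihood, up to an additive constant not depending on the $\lambda_i$:

```latex
\ell(\lambda_1, \lambda_2, \lambda_3) \;=\; u_1 \log p_1 \,+\, u_2 \log p_2 \,+\, u_3 \log p_3 \,+\, u_4 \log p_4 .
```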

Since the probability function involves rational functions of degree zero in the $\lambda_i$, there's no harm in setting $\lambda_3 = 1$ and solving for $\lambda_1$ and $\lambda_2$. Any other set of solutions will be the same as these up to scaling all the $\lambda_i$. After doing this, the maximizing equations are obtained by setting the gradient components of $\ell$ with respect to $\lambda_1$ and $\lambda_2$ to zero.
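In general form (again with my labels $p_i$ for the state probabilities, viewed as functions of $\lambda_1, \lambda_2$ after setting $\lambda_3 = 1$), the critical-point equations read:

```latex
\frac{\partial \ell}{\partial \lambda_j} \;=\; \sum_{i=1}^{4} \frac{u_i}{p_i}\,\frac{\partial p_i}{\partial \lambda_j} \;=\; 0, \qquad j = 1, 2 .
```

Since each $p_i$ is a ratio of polynomials in $\lambda_1$ and $\lambda_2$, these are rational equations rather than polynomial ones.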

Things are starting to look more like polynomials! If someone asked you to solve this, you would probably try to clear denominators, use one equation to solve for one of the variables, and plug it into the second equation to get one equation in one unknown. The problem is that when you clear denominators, you will introduce extra solutions into your equations. That is, you will find solutions to your new equations which are not solutions to the equations you are trying to solve. In this example, you will find a solution to the problem with cleared denominators which is not a solution to the original equations.

How do you know which solutions are extra and which are not? In commutative algebra, there is a process for exactly this, and it's called the saturation of an ideal. In our example, the ideal we must use is generated by the two equations above with cleared denominators, and we saturate it with respect to the common denominator of all terms appearing in the two equations. (Our ideal is an ideal in a polynomial ring in $\lambda_1$ and $\lambda_2$.)
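For reference, the saturation of an ideal $I \subseteq R$ with respect to an element $f \in R$ is

```latex
I : f^{\infty} \;=\; \{\, g \in R \;:\; f^{\,n} g \in I \text{ for some } n \ge 0 \,\} .
```

Geometrically, the zero set of $I : f^{\infty}$ is the closure of the part of the zero set of $I$ lying outside $\{f = 0\}$, which is exactly what discards the spurious solutions introduced by clearing the denominator $f$.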

Saturation of an ideal is a process which is cumbersome to carry out by hand for all but the simplest examples, but it has been implemented in computer algebra packages like Macaulay, Sage, and Singular. In general it's very computationally intensive, so it may be slow for 'large' ideals. One strategy that will speed it up in this example is to use actual numbers in place of the observed counts.

Plugging in specific counts, I calculated the values of $\lambda_1$ and $\lambda_2$ (there was only one positive root; full disclosure: I used Mathematica). In general the result of the saturation will allow you to solve for one of the two parameters in terms of the other, and the remaining parameter will be given by the zeros of some polynomial which is usually of degree 3.

Filed under: Technical, Things to look into

$p(s)$ – the probability that the transmitted symbol is $s$.

$p(r \mid s)$ – the conditional probability that the received symbol is $r$ given that the transmitted symbol is $s$.

$p(s, r)$ – the joint probability that the received symbol is $r$ and the transmitted symbol is $s$.

$p(r)$ – the probability that the received symbol is $r$.

$p(s \mid r)$ – the conditional probability that the transmitted symbol is $s$ given that the received symbol is $r$.

$\alpha_s$ – the odds paid on the occurrence of $s$, i.e. the number of dollars returned on a one-dollar bet on $s$.

$a(s \mid r)$ – the fraction of capital that the gambler decides to bet on the occurrence of $s$ after receiving $r$.

First, suppose that the odds are fair, i.e. $\alpha_s = 1/p(s)$. Note that if $\alpha_s > 1/p(s)$ for some $s$, then the gambler can capitalize on making repeated bets on $s$, and this extra betting would in turn lower $\alpha_s$ in most natural settings (this is known as "arbitrage" in the market world). Also, let's suppose that there is no "track take," i.e. $\sum_s 1/\alpha_s = 1$, and that the gambler bets his entire capital regardless of the symbol received, i.e. $\sum_s a(s \mid r) = 1$ (note: he can withhold money by placing canceling bets, so this is a reasonable assumption).

The amount of capital after $N$ bets is now $V_N = V_0 \prod_{s,r} \big(\alpha_s\, a(s \mid r)\big)^{W_{sr}}$, where $W_{sr}$ is the number of times the transmitted symbol is $s$ and the received symbol is $r$. The log difference of the $N$'th and starting capital is

$\log \frac{V_N}{V_0} = \sum_{s,r} W_{sr} \log\big(\alpha_s\, a(s \mid r)\big)$,

and, since $W_{sr}/N \to p(s,r)$ by the law of large numbers, we have

$\frac{1}{N} \log \frac{V_N}{V_0} \;\longrightarrow\; \sum_{s,r} p(s,r) \log\big(\alpha_s\, a(s \mid r)\big) \;\equiv\; G$

with probability 1.

Since $\alpha_s = 1/p(s)$, we have that

$G = \sum_{s,r} p(s,r) \log a(s \mid r) + H(X)$,

where $H(X)$ is Shannon's source rate. We can maximize the first term by putting $a(s \mid r) = p(s \mid r)$, and what we get is $G_{\max} = H(X) - H(X \mid Y)$, which is Shannon's rate of transmission as talked about by Matt here.
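The reason $a(s \mid r) = p(s \mid r)$ maximizes the first term is Gibbs' inequality: for each fixed $r$, and any betting fractions with $\sum_s a(s \mid r) = 1$,

```latex
\sum_s p(s \mid r) \log a(s \mid r) \;\le\; \sum_s p(s \mid r) \log p(s \mid r),
```

with equality if and only if $a(s \mid r) = p(s \mid r)$, since the difference of the two sides is $-D\big(p(\cdot \mid r)\,\|\,a(\cdot \mid r)\big) \le 0$, a negated Kullback–Leibler divergence.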

Now if the odds are not necessarily fair, but there is still no track take, i.e. $\sum_s 1/\alpha_s = 1$, we have

$G = \sum_{s,r} p(s,r) \log \alpha_s + \sum_{s,r} p(s,r) \log a(s \mid r)$,

and $G$ is still maximized by setting $a(s \mid r) = p(s \mid r)$. Hence,

$G_{\max} = H(\alpha) - H(X \mid Y)$,

where

$H(\alpha) = \sum_s p(s) \log \alpha_s$.

Notice that to maximize $G$, the gambler has ignored the odds, i.e. it is only the information rate that matters in his strategy. In a future post I'll discuss further implications of these results, as well as the "track take" scenario.

Filed under: Technical, Things to look into