When you use R enough for data processing and mining, you’ll soon realize that R is slow when it comes to loops. If your loop takes a long time to finish, or you know it will require a substantial number of memory accesses, it is usually best to replace the whole thing with one of the many optimized functions in the R armory. It seems like every data manipulation, tabulation, cross-reference, etc. I have ever had to do has a function written specifically for that procedure, so it’s worth looking around.

One very useful tool is the sapply() function; there are actually many variants of it, which you can read about here. Skipping some optional arguments, its basic form is

sapply(X, FUN)

where “X” is a vector or list and “FUN” is a function to be applied to every element of “X.”

For example, let’s generate 20 random numbers from a normal distribution with mean 0 and standard deviation 5, and round each of them.

> numbers <- rnorm(20, mean=0, sd=5)
> numbers
 [1]  1.4150660 -3.3882659 -1.9812246  6.4669487  8.6084292 -6.8745447  6.9130365 -0.4088883 10.3087670  3.7913419 -0.9671000  5.6355830 -2.7645819 -2.9150188  2.3123911 -3.5230559
[17]  4.5095781  4.2253857 -2.8721957 -1.0569906
> sapply(numbers, round)
 [1]  1 -3 -2  6  9 -7  7  0 10  4 -1  6 -3 -3  2 -4  5  4 -3 -1

So sapply() returns a vector where the function round has been applied to every element of “numbers.” Of course, I could also just do
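As an aside, the difference between sapply() and lapply() is the shape of the result: lapply() always returns a list, while sapply() tries to simplify the result into a vector or matrix. A quick sketch (the variable names here are just for illustration):

```r
nums <- c(1.4, -3.4, 2.0)
s <- sapply(nums, round)   # simplified to a plain numeric vector
l <- lapply(nums, round)   # stays a list, one element per input
is.numeric(s)   # TRUE
is.list(l)      # TRUE
```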

round(numbers)

to get the same result, but sometimes this is not possible, because the function you write requires manipulation of data outside the list “X,” or some such thing.
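Here is a sketch of that situation: the function passed to sapply() iterates over indices, but it also reads from a second vector that is not part of X (the names x and offsets are made up for this example):

```r
x <- c(1, 2, 3)
offsets <- c(10, 20, 30)   # external data the function needs
# the anonymous function uses both x and offsets, so plain
# vectorized round()/etc. style calls wouldn't cover this case
shifted <- sapply(seq_along(x), function(i) x[i] + offsets[i])
shifted   # 11 22 33
```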

Let’s look at another example, where we have a data frame with two columns: “time,” a numeric time-stamp value, and “glu,” some glucose levels. The first 10 rows look like this:

> df[1:10,]
           time glu
1  -61821514560 120
2  -61821514260 116
3  -61821513960 114
4  -61821513660 114
5  -61821513360 114
6  -61821513060 116
7  -61821512760 116
8  -61821512460 112
9  -61821512160 112
10 -61821511860 114

and we want to create an autocorrelation plot of this time series. Here is the version with a loop:

# make a correlation coef. vector
corcoef <- vector(mode="numeric")
l <- nrow(df)
for (i in 1:l){
  corcoef[i] <- cor(df$glu[i:l], df$glu[1:(l-i+1)])
}

Here is the version with sapply().

corcoef <- sapply((1:l), function(i) cor(df$glu[i:l], df$glu[1:(l-i+1)]))

For every index i from 1 to l, sapply() produces the appropriate correlation coefficient

cor(df$glu[i:l], df$glu[1:(l-i+1)])

which is then written into corcoef.
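The df from the post isn’t reproduced here, but the same one-liner can be checked on a simulated series. The random-walk glu below is a stand-in, and the index stops at l - 2 so every overlap passed to cor() has at least three points:

```r
set.seed(42)
glu <- 110 + cumsum(rnorm(100))   # stand-in for the glucose column
l <- length(glu)
corcoef <- sapply(1:(l - 2), function(i) cor(glu[i:l], glu[1:(l - i + 1)]))
corcoef[1]   # lag 0 compares the series with itself, so exactly 1
```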

Let’s plot a bit of it:

plot(corcoef[1:1000], type="l", col="blue")

See more about this data and autocorrelation on this post on mathbabe.org.

Of course, this is just a simple example and you can do much more. In my experience using sapply() has decreased computation time significantly, 50+ fold in many cases.

Do you know what’s actually going on in the (*)apply functions? Upon first search I couldn’t find the code that was actually being executed.

Here’s a simple example to show that you really don’t always get an improvement – lapply (1.507 s) was much faster than the for loop (4.074 s), but sapply (5.409 s) was actually slower.

n <- 1e6
square <- rep(0, n)   # preallocate for the loop version

system.time(for (i in 1:n) square[i] <- i^2)
system.time(square <- lapply(1:n, function(i) i^2))
system.time(square <- sapply(1:n, function(i) i^2))
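For completeness, the fully vectorized form (no per-element function call at all) is usually the fastest of the bunch by a wide margin, since the whole computation happens in C:

```r
n <- 1e6
system.time(square <- (1:n)^2)   # one vectorized call, no R-level loop
```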

All I can say is that sapply() is a wrapper for lapply(), and the actual loop is executed in C.
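That relationship is easy to verify at the R level: sapply() is essentially lapply() followed by simplify2array(), so the two forms below produce identical results:

```r
x <- 1:5
a <- sapply(x, function(i) i^2)                    # simplified directly
b <- simplify2array(lapply(x, function(i) i^2))    # list, then simplified
identical(a, b)   # TRUE
```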

More details here https://svn.r-project.org/R/trunk/src/main/apply.c

Also, there is a package called plyr, http://cran.r-project.org/web/packages/plyr/index.html, that has many useful optimized functions and is worth looking into.