
Time-stamps and R

I recently found out that dealing with time-stamps in R can be a real pain. On the surface everything works fine: we can call Sys.time(), take the resulting object of class "POSIXct"/"POSIXt," and transform it into a numeric (i.e., the number of seconds since midnight 01/01/1970 UTC) for arithmetic manipulation.

> Sys.time()
[1] "2011-09-03 18:18:03 EDT"
> class(Sys.time())
[1] "POSIXct" "POSIXt"
> as.numeric(Sys.time())
[1] 1315088290
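
The conversion also goes the other way; a quick sketch (note that as.POSIXct() needs an explicit origin when fed a numeric):

```r
# Recover a time-stamp from its numeric (epoch-seconds) form;
# as.POSIXct() requires the origin when converting from a numeric.
t0 <- 1315088290
as.POSIXct(t0, origin = "1970-01-01", tz = "America/New_York")
# Arithmetic works on the numeric form, e.g. one hour later:
as.POSIXct(t0 + 3600, origin = "1970-01-01", tz = "America/New_York")
```

This round trip is what makes the numeric form convenient for elapsed-time calculations.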

Now all seems fine, unless your time-stamp is in some other format. Suppose it looks like "12/15/10 0:00." From what I have seen, R can't deal well with a format like this out of the box, and short of string-parsing to coerce it into something R likes, what has worked for me is the following. You could break the time-stamp above into two pieces, the date and the time. The time R handles ok; the date is the problem. Luckily, in this example I didn't have to do the splitting myself, as the csv file contained separate columns for date and time in addition to the time-stamp. So we can transform the date into a format R likes via

> as.Date(gdata$Date[1:10], "%m/%d/%Y")
[1] "0010-12-15" "0010-12-15" "0010-12-15" "0010-12-15" "0010-12-15" "0010-12-15" "0010-12-15" "0010-12-15" "0010-12-15" "0010-12-15"

Ok, so far so good. Then we can take the time column on its own and paste the two strings together.

> paste(as.Date(gdata$Date, "%m/%d/%Y"), gdata$Time)[1:10]
 [1] "0010-12-15 0:04:00" "0010-12-15 0:09:00" "0010-12-15 0:14:00" "0010-12-15 0:19:00" "0010-12-15 0:24:00"
 [6] "0010-12-15 0:29:00" "0010-12-15 0:34:00" "0010-12-15 0:39:00" "0010-12-15 0:44:00" "0010-12-15 0:49:00"

Now R still won't like this if you try to transform these stamps to numeric, so we first have to run them through strptime().

> as.numeric(strptime(paste(as.Date(gdata$Date, "%m/%d/%Y"), gdata$Time), "%Y-%m-%d %H:%M:%S"))[1:10]
 [1] -61821514560 -61821514260 -61821513960 -61821513660 -61821513360 -61821513060 -61821512760 -61821512460
 [9] -61821512160 -61821511860

Now you probably notice that the numeric values are negative; that's because the as.Date() call has interpreted "10" as the year 0010, so if you care about the actual date/time and not just the time elapsed, you could add the appropriate number of seconds (2000 years' worth) to the entire vector.

There is probably (hopefully) a better way to deal with time formats that R does not like, short of going over to Python and its really nice datetime package, so if you know one, post a comment.
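
For the record, one sketch of such a way, assuming the stamps all share the "%m/%d/%y %H:%M" layout: strptime() can parse the original string in one shot, and the two-digit %y specifier maps "10" to 2010 (POSIX convention: 69-99 go to the 1900s, 00-68 to the 2000s), so no year correction is needed.

```r
# Parse the original time-stamp directly; %y handles the two-digit year,
# so "10" becomes 2010 rather than the year 0010.
stamp  <- "12/15/10 0:00"
parsed <- strptime(stamp, "%m/%d/%y %H:%M")
format(parsed, "%Y-%m-%d %H:%M:%S")  # "2010-12-15 00:00:00"
as.numeric(parsed)                   # positive epoch seconds, no 2000-year fix
```
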

Loops in R

When you use R enough for data processing and mining, you'll soon realize that R is very slow when it comes to loops. Basically, if your loop takes a long time to finish, or you know it will require a substantial number of calls to memory, it is best to replace the whole thing with one of the many optimized functions in the R armory. It seems like every data manipulation, tabulation, cross-reference, etc. I have ever had to do has a function written specifically for that procedure, so it's worth looking around.

One very useful tool is the sapply() function; there are actually many variants of it, which you can read about here. Skipping some optional details,


sapply(X,FUN)

where “X” is a vector or list and “FUN” is a function to be applied to every element of “X.”
For example, let's generate 20 normally distributed random numbers with mean 0 and standard deviation 5, and round each of them.

> numbers <- rnorm(20, mean=0, sd=5)
> numbers
 [1]  1.4150660 -3.3882659 -1.9812246  6.4669487  8.6084292 -6.8745447  6.9130365 -0.4088883 10.3087670  3.7913419 -0.9671000  5.6355830 -2.7645819 -2.9150188  2.3123911 -3.5230559
[17]  4.5095781  4.2253857 -2.8721957 -1.0569906
> sapply(numbers, round)
[1]  1 -3 -2  6  9 -7  7  0 10  4 -1  6 -3 -3  2 -4  5  4 -3 -1

So sapply() returns a vector where the function "round" has been applied to every element of "numbers." Of course, I could also just do

round(numbers)

to get the same result, but sometimes this is not possible because the function you write requires manipulation of data outside the list "X," or some such thing.
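
A toy sketch of that situation (the names weights and vals are made up for illustration): the anonymous function passed to sapply() closes over a vector that lives outside the list it iterates over, which a bare round()-style call can't do.

```r
# `weights` lives outside the list; the anonymous function closes over it.
weights <- c(2, 3)
vals    <- list(c(1, 2), c(3, 4), c(5, 6))
sapply(vals, function(v) sum(v * weights))  # 8 18 28
```
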

Let's look at another example, where we have a data frame with two columns: "time," a numeric time-stamp value, and "glu," some glucose levels. The first 10 rows look like this:

> df[1:10,]
           time glu
1  -61821514560 120
2  -61821514260 116
3  -61821513960 114
4  -61821513660 114
5  -61821513360 114
6  -61821513060 116
7  -61821512760 116
8  -61821512460 112
9  -61821512160 112
10 -61821511860 114

and we want to create an autocorrelation plot of this time series. Here is the version with a loop:

#make a correlation coef. vector
corcoef <- vector(mode="numeric")
l <- nrow(df)

for (i in 1:l){
  corcoef[i] <- cor(df$glu[i:l], df$glu[1:(l-i+1)])
}

Here is the version with sapply().

corcoef <- sapply((1:l), function(i) cor(df$glu[i:l], df$glu[1:(l-i+1)]))

For every index i from 1 to l, sapply() produces the appropriate correlation coefficient cor(df$glu[i:l], df$glu[1:(l-i+1)]), which is then written into corcoef.

Let's plot a bit of it:

plot(corcoef[1:11000], type="l", col="blue")

See more about this data and autocorrelation on this post on mathbabe.org.
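
In the spirit of "there is already a function for that": base R's stats package ships acf(), which computes (and by default plots) the autocorrelation function directly. Here is a sketch on simulated data, since the glucose series itself isn't reproduced here; note that acf()'s estimator uses a common mean and divisor, so its values will differ slightly from the sliding cor() version above.

```r
# acf() computes the autocorrelation function directly.
set.seed(1)
glu <- as.numeric(arima.sim(list(ar = 0.8), n = 500)) + 110  # stand-in series
res <- acf(glu, lag.max = 50, plot = FALSE)
res$acf[1]  # lag 0, always exactly 1
```
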

Of course, this is just a simple example and you can do much more.  In my experience using sapply() has decreased computation time significantly, 50+ fold in many cases.

Data mining tools

A running list of tools useful for the data miner, with a few words about each.

1. The R language.

R is one of the most powerful statistical analysis tools out there. Its over 1600 packages and libraries contain just about everything you have ever heard of, plus a wealth of data sets to play around with. The documentation is for the most part very good, with plenty of example code. The IDE is easy to install, and adding a package is just a button click on the menu. Syntactically the language is similar to Python and is pretty intuitive, with a lot of thought having gone into making data set manipulation as easy as possible. The standard plot libraries, as well as more advanced ones such as ggplot2, provide great visualization capabilities.

R has some issues with scaling, i.e. dealing with large data, but these problems are being solved (see, for example, here). Of course, there are technical drawbacks (for example, R is very slow running loops and is not great for date/time manipulation in time series); I will try to address some of these in subsequent posts.

2. Python.

Python is a high-level programming language, also open source, and in conjunction with the numpy and scipy packages it is one of the most powerful computational tools out there. It supports a wide range of programming styles and interoperates with a number of other languages such as C and C++. There is also a well-developed library of scientific and data mining modules; for example, there are a number of modules that let you mine and investigate social networks such as Twitter, which are fun to look into, as well as tools for things like regression and SVMs. The matplotlib library is a powerful visualization tool.

3. Weka.

Weka is a data mining and analytics suite that is incredibly easy to use. It provides an interface packed with all the standard data mining tools you have heard of and lets you combine them in a drag-and-drop fashion. Model adjustment is similarly easy. Everything is coded in Java, which I don't know, so I have no experience getting under the hood of this thing.