Converting timestamped state event logs to runtime in R data.table


I have a large data set of logged timestamps corresponding to state changes (e.g., light switch flips) that looks like this:

    library(data.table)
    library(lubridate)
    foo <- data.table(ts = ymd_hms("2013-01-01 01:00:01",
                                   "2013-01-01 05:34:34",
                                   "2013-01-02 14:12:12",
                                   "2013-01-02 20:01:00",
                                   "2013-01-02 23:01:00",
                                   "2013-01-03 03:00:00",
                                   "2013-05-04 05:00:00"),
                      state = c(1, 0, 1, 0, 0, 1, 0))

I'm trying to (1) convert the history of state logs into run-times in seconds, and (2) convert these into daily cumulative run-times. Most (but not all) of the time, consecutive logged state values alternate. Here is a kludgy start, but it falls a little short.

    foo[, dif:=diff(ts)]
    foo[state==1][, list(runtime = sum(dif)), .(floor_date(ts, "day"))]

In particular, when the state is "on" during a period that crosses midnight, this approach isn't smart enough to split things up, and it incorrectly reports a runtime longer than one day. Also, using diff is not very intelligent either, since it will make mistakes if there are consecutive identical states or NAs.

Any suggestions that will correctly resolve the runtime while still being fast and efficient for large data sets?

This should work. I played around with different starting values of foo, but there could still be some edge cases that I didn't factor in. One thing you need to take note of: if your real data has a timezone that observes daylight saving time, this will break when making the data.table of dates. You can work around that by doing a force_tz to UTC or GMT first (you can change it back later). On the other hand, if you need to account for a 25-hour or 23-hour day, you'll need to strategically change them back to your timezone.
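To illustrate the timezone caveat, here is a minimal sketch. The zone "America/New_York" and the two timestamps are assumptions chosen only because they straddle a DST change; they are not from the data above. It shows that force_tz keeps the clock readings and only reinterprets them, so every day becomes exactly 24 hours long.

```r
library(lubridate)

# Hypothetical timestamps spanning the 2013 spring-forward night in an
# assumed DST-observing zone (not part of the original data)
ts_local <- ymd_hms("2013-03-09 22:00:00", "2013-03-10 06:00:00",
                    tz = "America/New_York")

# Keep the same clock readings, but treat them as UTC
ts_utc <- force_tz(ts_local, tzone = "UTC")

# Elapsed hours: the local span loses the skipped hour, the UTC one does not
hours_local <- as.numeric(difftime(ts_local[2], ts_local[1], units = "hours"))
hours_utc   <- as.numeric(difftime(ts_utc[2], ts_utc[1], units = "hours"))
```

Whether the 7-hour or the 8-hour figure is the "right" runtime depends on whether you care about elapsed time or clock time, which is the trade-off mentioned above.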

    # I'm using the devel version of data.table, which includes the shift function for leading/lagging variables
    foo[, (paste0("next", names(foo))) := shift(.SD, 1, 0, "lead")]
    # shift with fill=NA produced an error for some reason; this is the workaround
    foo[nrow(foo), `:=`(nextts = NA, nextstate = NA)]
    # make a data.table with every date from the min ts to the max ts
    complete <- data.table(datestamp = seq(from = floor_date(foo[, min(ts)], unit = "day"),
                                           to = ceiling_date(foo[, max(ts)], unit = "day"),
                                           by = "days"))
    # make a column for the end of each day
    complete[, enddate := datestamp + hours(23) + minutes(59) + seconds(59.999)]
    # set keys and do the overlapping join
    setkey(foo, ts, nextts)
    setkey(complete, datestamp, enddate)
    overlap <- foverlaps(foo[state == 1], complete, type = "any")
    # compute the run time for each row, clipped to the day it falls in
    overlap[, runtime := as.numeric(difftime(pmin(datestamp + days(1), nextts),
                                             pmax(datestamp, ts),
                                             units = "secs"))]
    # summarize down to seconds per day
    overlap[, list(runtime = sum(runtime)), by = datestamp]
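The question also mentions consecutive identical states and NAs. One possible pre-cleaning step, which is not part of the answer above and is only a sketch, is to drop rows with a missing state and then keep just the first row of each run of equal states before computing the lead columns. The table `events` below is hypothetical example data.

```r
library(data.table)

# Hypothetical log containing a repeated "off" and a missing state
events <- data.table(
  ts = as.POSIXct(c("2013-01-02 14:12:12", "2013-01-02 20:01:00",
                    "2013-01-02 23:01:00", "2013-01-03 01:00:00",
                    "2013-01-03 03:00:00"), tz = "UTC"),
  state = c(1, 0, 0, NA, 1)
)

# Drop rows whose state is unknown
clean <- events[!is.na(state)]

# Keep a row only if its state differs from the previous row's state
# (the first row is always kept)
clean <- clean[c(TRUE, diff(state) != 0)]
```

After this step the states strictly alternate, so the time to the next row is the full duration of each state. Dropping NA rows assumes a missing reading means "no change"; if NAs should instead end an "on" period, they would need different handling.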
