July 16, 2014

Comparison Between esProc’s Sequence Table Object and R’s Data Frame (II)

Comparison Between esProc’s Sequence Table Object and R’s Data Frame (I)

Actual case

In this part we use a real case to compare the data frame and the sequence table comprehensively.
Computation target: based on daily transaction data, select the blue-chip stocks whose prices rose for 5 days in a row.

Ideas: import the data; keep only the previous month's data; group it by ticker; sort each group by date; compute the change in closing price over the previous day; compute the number of consecutive days of positive growth; select the stocks that rose for 5 or more days in a row.
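
The counting step at the heart of these ideas can be sketched on a toy example. The prices below are hypothetical values for a single stock, not data from the Excel file, and the variable names are illustrative only. The first day is given a change of 0 because it has no previous day to compare against.

```r
# Hypothetical closing prices for one stock over nine trading days
close <- c(10, 11, 12, 11, 12, 13, 14, 15, 16)

# Change in closing price over the previous day (first day has no predecessor)
inc <- c(0, diff(close))

# Running count of consecutive rising days, reset to 0 on any non-rising day
cid <- numeric(length(close))
for (j in seq_along(close)[-1]) {
  cid[j] <- if (inc[j] > 0) cid[j - 1] + 1 else 0
}

print(cid)

# TRUE when the stock rose for 5 or more days in a row
rises_5_in_a_row <- max(cid) >= 5
print(rises_5_in_a_row)
```

The count resets on the fourth day, where the price falls, and only reaches 5 on the last day.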

Sequence Table Solution:


Data frame Solution:

library(gdata)  # provides read.xls() for importing Excel files
A1 <- read.xls("e:\\data\\all.xlsx")  # import data
A1$Date <- as.Date(A1$Date)  # parse the dates once so they compare and sort correctly
A2 <- subset(A1, Date >= as.Date("2012-06-01") & Date <= as.Date("2012-06-30"))  # filter by date
A3 <- split(A2, A2$Code)  # group by Code
A8 <- list()
for (i in seq_along(A3)) {
  g <- A3[[i]]
  if (nrow(g) == 0) next  # skip empty groups (unused Code levels)
  g <- g[order(g$Date), ]  # sort by Date in each group; the result must be assigned back
  g$INC <- c(0, diff(g$Close))  # add a column: change in closing price over previous day
  g$CID <- 0  # add a column: consecutive rising days
  if (nrow(g) > 1) {
    for (j in 2:nrow(g)) {
      if (g$INC[j] > 0) {
        g$CID[j] <- g$CID[j - 1] + 1
      } else {
        g$CID[j] <- 0
      }
    }
  }
  if (max(g$CID) >= 5) {  # stock rose for 5 or more days in a row
    A8[[length(A8) + 1]] <- g
  }
}
A9 <- lapply(A8, function(x) x$Code[[1]])  # finally, the stock codes

Comparison
1. The data frame's function set is not rich enough for this kind of task. In this case we need nested loops to meet the requirement, which is computationally inefficient. The sequence table has rich and diverse functions, so the same goal can be achieved without loop statements. The code is shorter and simpler, and the performance is higher.

2. Code written for the data frame is obscure and hard to write. With the sequence table, the code is clear and easy to understand, so the learning cost is lower.

3. When a large amount of data is involved in this scenario, memory consumption is huge. The sequence table computes by reference, which consumes less memory. The data frame passes by value, so its memory consumption is several times that of the sequence table, and memory overflow can easily result.

4. To import Excel data into a data frame, R requires third-party packages, and they seem to have difficulty working together. The data import takes ten minutes to complete; with the sequence table it takes only tens of seconds.

Test Performance

Test 1: Generate 10 million records in memory, each consisting of three fields whose values are all random numbers. Filter the records and sum each field.
         
Sequence table:


Data frame:

> library(timeDate)
> start=Sys.timeDate()
> col1=rnorm(n=10000000,mean=20000,sd=10000)
> col2=rnorm(n=10000000,mean=40000,sd=10000)
> col3=rnorm(n=10000000,mean=80000,sd=10000)
> data1=data.frame(col1,col2,col3)
> data2=subset(data1,col1>90)
> result=colSums(data2)
> print(result)
        col1         col2         col3
200844165732 390691612886 781453730448
> end=Sys.timeDate()
> print(end-start)
Time difference of 1.533333 mins

Comparison: the sequence table needs 50.534 seconds, while the data frame needs 91.999 seconds. The gap is obvious.
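
For readers who want to repeat the data frame side of this measurement without the timeDate package, here is a minimal sketch using base R's system.time(). The row count n is an assumption chosen so the sketch runs quickly; it is far smaller than the 10 million records used in the test, so the elapsed time is not comparable with the figures above.

```r
set.seed(1)  # fixed seed so the random data is reproducible
n <- 1e5     # assumption: far fewer rows than the 10 million used in the test

elapsed <- system.time({
  # Build three columns of normally distributed random numbers
  data1 <- data.frame(
    col1 = rnorm(n, mean = 20000, sd = 10000),
    col2 = rnorm(n, mean = 40000, sd = 10000),
    col3 = rnorm(n, mean = 80000, sd = 10000)
  )
  # Filter the records, then sum each field
  data2 <- subset(data1, col1 > 90)
  result <- colSums(data2)
})

print(result)
print(elapsed["elapsed"])  # wall-clock seconds for generate + filter + sum
```

Setting n to 1e7 would reproduce the size of the original test, memory permitting.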

Test 2: Read a 1.2G txt file, filter the records, and sum two fields.

Sequence Table:


Data frame:

> library(timeDate)
> start=Sys.timeDate()
> data<-read.table("d:/T21.txt",sep = "\t")
> data1=subset(data,V1>90,select=c(V9,V11))
> result=colSums(data1)
> print(result)
         V9         V11
 5942982895 59484930179
> end=Sys.timeDate()
> print(end-start)
Time difference of 1.134722 hours

Comparison: the sequence table takes 87.122 seconds, while the data frame takes 1.1347 hours (about 4,085 seconds), a difference of roughly 47 times. The main reason is the extremely low speed of file reading.
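
Since file reading dominates the data frame's time here, it is worth noting that the R documentation for read.table suggests declaring column types with colClasses and setting comment.char = "" to reduce reading overhead. The sketch below applies those two arguments to the same filter-and-sum steps. It runs on a small temporary file of hypothetical numbers standing in for d:/T21.txt, and it assumes all 11 columns are numeric, which the post does not state. Whether this would narrow the gap reported above was not measured.

```r
# Write a small tab-separated file to stand in for the 1.2G file (hypothetical data)
path <- tempfile(fileext = ".txt")
set.seed(1)
toy <- data.frame(matrix(round(runif(11 * 1000, 0, 200)), ncol = 11))
write.table(toy, path, sep = "\t", row.names = FALSE, col.names = FALSE)

# Declaring the column types up front spares read.table from guessing them;
# comment.char = "" turns off scanning for comment characters
data <- read.table(path, sep = "\t",
                   colClasses = rep("numeric", 11),
                   comment.char = "")

# Same steps as in the test: filter on V1, then sum V9 and V11
data1 <- subset(data, V1 > 90, select = c(V9, V11))
result <- colSums(data1)
print(result)
```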

From the above comparison, we can see that the sequence table is better than the data frame in terms of feature richness, ease of syntax, memory consumption, development effort, library function performance, and code performance. Of course, the data frame is not the full strength of the R language. R has powerful vector and matrix types and a large number of associated functions, which make it more professional than esProc in scientific and engineering computation.
