
A Method for Grouping and Summarizing Data from Big Text Files in R

It is common to use R to group and summarize data from files. Sometimes the files are comparatively big: the source data is large, yet the computed result is small, and the whole file cannot be loaded into memory for the computation. The only practical solution is to import and compute the data in batches and then merge the partial results. The following example illustrates how to group and summarize data from big text files in R.

Here is a 1GB file, sales.txt, containing a large number of sales order records. We want to group the data by the CLIENT field and summarize the AMOUNT field. The file uses a tab ("\t") as its column separator. The first rows of data look like this:

ORDERID  CLIENT   SELLERID  AMOUNT  ORDERDATE
1        WVF Vip  5         440.0   2009-02-03
2        UFS Com  13        1863.4  2009-07-05
3        SWFR     2         1813.0  2009-07-08
4        JFS Pep  27        670.8   2009-07-08
5        DSG      15        3730.0  2009-07-09
6        JFE      10        1444.8  2009-07-10
7        OLF      16        625.2   2009-07-11

R’s solution:

1  con <- file("E:\\sales.txt", "r")                                    # open the file handle
2  result <- read.table(con, nrows=100000, sep="\t", header=TRUE)       # import the first batch, with header
3  result <- aggregate(result[,4], list(result[,2]), sum)               # group by CLIENT, sum AMOUNT
4  while (nrow(databatch <- tryCatch(read.table(con, header=FALSE, nrows=100000, sep="\t", col.names=c("ORDERID","Group.1","SELLERID","x","ORDERDATE")), error=function(e) data.frame())) != 0) {   # tryCatch yields an empty data frame at end of file, since read.table errors when no lines remain
5    databatch <- databatch[,c(2,4)]                                    # keep CLIENT and AMOUNT
6    result <- rbind(result, databatch)
7    result <- aggregate(result[,2], list(result[,1]), sum)             # re-group the merged data
8  }
9  close(con)

Part of the computed result:

    Group.1         x
1       ARO  17981798
2       BDR  85584558
3       BON  51293129
4       BSF 287908788
5       CHO  23482348

 

Code interpretation:

The 1st line: Open the file handle.

The 2nd ~ 3rd lines: Import the first batch of 100,000 rows of data, group and summarize it, and save the result in result.

The 4th ~ 8th lines: Import the remaining data in a loop, 100,000 rows per batch, storing each batch in the variable databatch. Then keep the second and fourth fields, i.e. CLIENT and AMOUNT, merge databatch into result, and group and summarize the merged data again.

Note that at any given moment only databatch, which holds 100,000 rows of data, and result, the summarized result, occupy memory. The latter is usually small and will not cause a memory overflow.

The 9th line: Close the file handle.

Points to note:

Data frame. Because R's data frame cannot compute over big files directly, a loop is needed to do the job here. The steps are: import a batch of data and merge it into the data frame result; group and summarize result; then import the next batch. As you can see, the loop code is somewhat complicated.

Column names. Since the first row of the file holds the column names, header=TRUE can be used for the first batch to set them directly. The subsequent batches have no column names, so they must be imported with header=FALSE. By default, header=FALSE produces the column names V1, V2 and so forth, whereas after grouping and summarizing the columns are named Group.1 and x. col.names is therefore needed to rename the imported columns, keeping the structure consistent before and after grouping and summarizing and setting the stage for the subsequent merging. The column-name handling deserves attention because it is easy to get wrong.

Alternative solutions:

Python, esProc and Perl can also perform this operation. Like R, each can group and summarize data from big text files and carry out the subsequent structured-data computations. Below we briefly introduce the esProc and Python approaches.

esProc processes data in batches automatically, so the programmer needs no manual loop control and the code is quite simple:

A1  =file("e:/sales.txt").cursor@t()
A2  =A1.groups(CLIENT;sum(AMOUNT))

The cursor is a data type used for structured data computing in esProc. Its usage is similar to that of the data frame, but it is better at processing big files and performing complicated computations. Moreover, the @t option in the code indicates that the first line of the file contains the column names, so the column names can be used directly in the subsequent computation.

Python's code structure, which also requires manual loop control, is similar to that of R. But Python itself has no structured data type such as the data frame or cursor, so its code works at a lower level:

from itertools import groupby
from operator import itemgetter

result = []
myfile = open("E:\\sales.txt", 'r')
BUFSIZE = 10240000
myfile.readline()                                        # skip the header line
lines = myfile.readlines(BUFSIZE)                        # read about BUFSIZE bytes of lines per batch
while lines:
    for line in lines:
        record = line.split('\t')
        result.append([record[1], float(record[3])])     # keep CLIENT and AMOUNT
    result = sorted(result, key=lambda x: x[0])          # sort before grouping
    batch = []
    for key, items in groupby(result, itemgetter(0)):    # group by CLIENT with groupby
        value = 0
        for subItem in items:
            value += subItem[1]                          # summarize AMOUNT within the group
        batch.append([key, value])                       # merge the summarized results into a two-dimensional list
    result = batch
    lines = myfile.readlines(BUFSIZE)
myfile.close()

 

Besides the plain two-dimensional list used above, Python can perform the operation with third-party packages. For example, pandas offers a structured data object similar to the data frame and simplifies the code in much the same way as R does. But pandas has limited support for computing over big files, so a loop is still needed when programming.
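For instance, here is a minimal sketch of the batch-and-merge pattern using pandas' read_csv with its chunksize parameter; the chunk size below simply mirrors the 100,000-row batches used above, and the file path is the one from this article:

import pandas as pd

# read the file in chunks of 100,000 rows; each chunk arrives as a DataFrame
chunks = pd.read_csv("E:\\sales.txt", sep="\t", chunksize=100000)

result = None
for chunk in chunks:
    partial = chunk.groupby("CLIENT")["AMOUNT"].sum()    # group and summarize the chunk
    # merge the partial sums into the running total, aligning on CLIENT
    result = partial if result is None else result.add(partial, fill_value=0)

print(result)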

 

 




Comment by Jessica May on September 9, 2014 at 1:24am

I agree with you, Ran! Shell is a simple and effective approach in many cases, especially for filter operations that involve traversing the data.


But concerning this example, if the data volume is very large, grouping after sorting will be too slow. Sorting has very high computational complexity because it requires many traversals of the big data, which makes the sort-based grouping and summarizing much slower. In shell, however, it is difficult to implement grouping with only one traversal, which would be highly efficient, so a programming language is still necessary (see the sketch after this comment).


Anyway, this approach is very clever indeed.
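The one-traversal grouping mentioned above is straightforward in a general-purpose language. Here is a minimal Python sketch using a plain dict as the accumulator; it assumes the sales.txt layout from the article and is an illustration, not code from the post:

sums = {}
with open("E:\\sales.txt") as f:
    f.readline()                                  # skip the header line
    for line in f:
        fields = line.split('\t')
        # accumulate AMOUNT (field 4) per CLIENT (field 2) in a single pass
        sums[fields[1]] = sums.get(fields[1], 0) + float(fields[3])
for client, total in sums.items():
    print(client, total)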

Comment by Ran Locar on August 30, 2014 at 2:53am

I recently had to process lots of data on a remote machine, where I could not install anything and where the data was too big to transfer to a more 'friendly' machine. So I became friends with Linux's out-of-the-box utilities.

Assuming the 'really big file' is on a linux/unix machine, the same 'group-by' could be accomplished using the following short shell script:

awk -F, '{print $2 "~" $4}' try.txt | sort | awk -F~ '{if ($1!=prev){print prev, sum; sum=0} sum+=$2; prev=$1} END {print prev,sum}'

(which basically - takes fields 2 and 4, sorts the file, and then calculates a running sum per value of the client field)

My point is not that shell scripts >> R or python, but that for simple analytical purposes (counts, sums, group-by, sorting) and text manipulation (basically XML to TSV) shell tools like awk, sed, grep, sort, uniq -c, do a WONDERFUL job; they are universally available, FAST, easy to combine, etc.
