Working with Large Data Sets in R – The data.table Module

R is a great tool for statistical analysis and data mining. The abundance of high-quality user-contributed packages makes even the most difficult task a true pleasure. Unfortunately, the flexibility of the language comes at a price: performance and memory consumption. Try loading a large file into a data.frame on a home computer (and by large I mean 1GB+) and you might end up with all your RAM and a lot of the swap space gone in minutes. This is because of the overhead of how R manages objects, and because of the unnecessary copying done behind the scenes by many common operations.

Luckily, there is a package that solves some of these problems with an optimized version of the data.frame structure: the data.table package. The naming similarity is not a coincidence. The data.table class inherits from the core data.frame, but that doesn’t mean it can be used as a drop-in replacement everywhere. Some standard functions work equally well with both, while others have specialized versions. The Data Table structure offers fast loading of big CSV files, computation of aggregates, joins and pivoting/casting, even on GB-sized data.

The package is installed using the standard CRAN method. Data Tables can be created in the same way as Data Frames, or easily converted back and forth between the two.
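A minimal sketch of installing the package, creating a table and converting between the two classes (the customer/value columns are just illustrative sample data, reused throughout this article):

install.packages("data.table")
library(data.table)

# create a Data Table directly, just like a data.frame
DT <- data.table(customer = c("a", "b", "a", "b"),
                 value    = c(10, 20, 30, 40))

# convert back and forth between the two classes
DF <- as.data.frame(DT)
DT <- as.data.table(DF)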

Data Tables are indexed differently from traditional Data Frames. In core R you can access individual elements using square brackets and element coordinates, e.g.:
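For example, with a plain data.frame holding the same sample data:

DF <- data.frame(customer = c("a", "b", "a", "b"),
                 value    = c(10, 20, 30, 40),
                 stringsAsFactors = FALSE)
DF[1, 1]
# [1] "a"   (the element in row 1, column 1)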

In Data Tables this doesn’t work:
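With the sample Data Table defined above (the output shown matches the data.table behaviour described here; newer releases of the package return a one-column data.table instead):

DT[1, 1]
# [1] 1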

Note that the command returned “1” instead of “a”. This is because the indexing operator (square brackets) behaves more like a SELECT query in SQL. The pattern is:

DT[WHERE, SELECT, by=GROUPBY]

So, our DT[1,1] is translated to: SELECT 1 FROM DT WHERE row = 1. To get the desired value we need to use:
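With the sample table, we reference the column by name in the second argument:

DT[1, customer]
# [1] "a"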

With the above, it is very easy to calculate aggregates, e.g. the sum of values by customer:
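With the illustrative customer/value data defined earlier:

DT[, sum(value), by = customer]
#    customer V1
# 1:        a 40
# 2:        b 60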

This translates to SELECT sum(value) FROM DT GROUP BY customer. See the pattern? Just remember the WHERE-SELECT-GROUPBY mnemonic above and using Data Tables will be pure fun.

To load a CSV file straight into a Data Table we need to use a specialized function. Remember: do not use the standard read.csv function and then cast the result into a data.table. That approach is highly inefficient and memory hungry. Instead, use the fread function from the data.table package. It can automatically detect field separators and keeps stringsAsFactors set to FALSE by default.
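A minimal sketch (the file name is a placeholder):

DT <- fread("big_file.csv")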

To get more verbose output, pass the verbose=TRUE parameter. To list all tables currently in memory, use tables():
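For example:

DT <- fread("big_file.csv", verbose = TRUE)   # prints parsing and timing details
tables()   # name, row count, size in MB, columns and key of every data.table in memory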

Two Data Tables can be easily joined together. First, let us create a new Data Table with customer names:
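The customer names below are made up for illustration; the lookup table shares the customer column with DT:

DTnames <- data.table(customer = c("a", "b"),
                      name     = c("Alice Smith", "Bob Jones"))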

We need to set up sort keys for both DT and DTnames to make the join efficient. The setkey function sorts the Data Table using the supplied column:
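Continuing the example, both tables are keyed on the shared customer column:

setkey(DT, customer)
setkey(DTnames, customer)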

Joining both tables is now performed by supplying the second table inside the indexing operator:
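With the keyed sample tables:

DT[DTnames]
# returns the rows of DT matched against DTnames on the key,
# with the name column appended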

Personally, I think the most valuable feature of the data.table package is the ability to quickly calculate pivot tables/casts using the dcast.data.table function. As an example, let us look at the french_fries data from the reshape package. It contains the results of a consumer test of french fry cooking oils.

The “treatment” column contains an oil brand indicator, while the “subject” column is a consumer identifier. To calculate a pivot table of mean “buttery” scores for treatment vs subject we use:
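A sketch, assuming the french_fries data is available from the reshape package as described above (it also ships with reshape2):

library(reshape)       # provides the french_fries data set
library(data.table)

data(french_fries)
FF <- as.data.table(french_fries)

dcast.data.table(FF, treatment ~ subject, value.var = "buttery",
                 fun.aggregate = mean, na.rm = TRUE)
# one row per treatment, one column per subject,
# each cell holding the mean "buttery" score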

Even for large tables, dcast is many times faster than reshape’s cast (not to mention that it can handle data sizes cast cannot).

Consult the package documentation at CRAN to learn about the full feature set of Data Tables and happy number crunching!
