The RFM Customer Segmentation model is an embarrassingly simple way of segmenting the customer base inside a marketing database. The resulting groups are easy to understand, analyze and act on, with no complex mathematics involved. Unfortunately, not all CRM platforms contain a module to perform RFM Customer Segmentation. This article gives you a sketch of how to calculate it in R, Pandas and Apache Spark.

# Define Metrics

The first step in RFM Customer Segmentation is to define the three attributes. The model allows for a certain flexibility with definitions and you can adjust them to the specifics of your business. The three attributes are:

- **Recency** represents the “freshness” of customer activity. Naturally, we would like to identify *active* and *inactive* customers. The basic definition for this attribute is the number of days since the last order or interaction.
- **Frequency** captures how often the customer buys. By default this could be the total number of orders in the last year or during the customer’s whole lifetime.
- **Monetary** value indicates how much the customer is spending. In many cases this is just the sum of all order values.

Certain businesses and industries will require slightly modified versions of the attributes. In business models based on frequent micro-transactions (e.g. in the online gaming industry) the average time between orders can be used for **Frequency**. If your customers buy only once a year (e.g. under a subscription model) then consider using the highest order value as **Monetary** value.
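As a quick illustration of the micro-transaction variant, the average time between orders can be computed from a customer’s sorted order dates. This sketch uses toy dates that are not from the article’s data set:

```python
from datetime import date

# Toy order dates for a single customer (illustrative values only)
order_dates = sorted([date(2014, 1, 1), date(2014, 1, 4), date(2014, 1, 10)])

# Gaps between consecutive orders, in days: here 3 and 6
gaps = [(b - a).days for a, b in zip(order_dates, order_dates[1:])]

# Average inter-order time, a possible Frequency metric: (3 + 6) / 2 = 4.5
avg_gap = sum(gaps) / len(gaps)
```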

In this article we choose the default definitions (in pseudocode) as follows:

- Time since last order: *R = NOW – max(Order Date)*
- Number of orders: *F = COUNT(OrderId)*
- Total order value: *M = SUM(Order Value)*
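The three pseudocode definitions translate directly into plain Python. This sketch uses a hand-made list of orders for one customer (toy data, not the article’s input file):

```python
from datetime import date

NOW = date(2014, 3, 2)

# Toy orders for one customer: (order date, order id, order value)
orders = [
    (date(2014, 1, 1), 235, 263),
    (date(2014, 1, 15), 301, 120),
    (date(2014, 2, 20), 350, 99),
]

R = (NOW - max(d for d, _, _ in orders)).days  # days since last order
F = len(orders)                                # number of orders
M = sum(v for _, _, v in orders)               # total order value
```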

# Compute the RFM Table

E-commerce systems use common data structures to record sales, customer data etc. We will use a synthetic flat file to simulate a “data dump” from an e-commerce platform. The file contains one line for each order with pipe-separated (“|”) fields recording:

- **Order Date** (YYYY-MM-DD)
- **Order Number**
- **Customer Id**
- **Product Id**, encoded as the letter “P” followed by a unique product id
- **Order Value**

Top 5 rows in the input file look like this:

```
2014-01-01|235|40|P42|263
2014-01-01|236|30|P18|308
2014-01-01|237|40|P26|328
2014-01-02|238|74|P40|230
2014-01-02|239|7|P39|286
```

At the core of RFM Customer Segmentation is the **RFM Table**. It contains one row for each customer, with Customer Id and the R, F and M attributes in columns.

## R

The calculation of the RFM Table in R is simple and requires three aggregates to be performed. Depending on the size of data, you can use a data.frame or a data.table to read the input file. Start by importing the necessary packages. We will use *lubridate* to quickly parse Order Dates and *data.table* to read larger input files.

```r
library(lubridate)
library(data.table)
```

Read the input file into a data.frame using the *read.csv* function. Note that the file does not contain a header, so we set the column names manually. In the example, I am using *stringsAsFactors=FALSE*, but you might prefer to convert product names or ids into factors for speed and efficiency.

```r
raw_data <- read.csv("data.txt", sep="|", stringsAsFactors=FALSE, header=FALSE)
names(raw_data) <- c("Date", "OrderNumber", "CustomerId", "ProductId", "Value")
```

Now, parse the Order Date string into the standard R date type. The *lubridate* package provides a much faster equivalent of *strptime*. For convenience, we also store the “current” date in the NOW variable.

```r
raw_data$ParsedDate <- fast_strptime(raw_data$Date, "%Y-%m-%d")
NOW <- fast_strptime("2014-03-02", "%Y-%m-%d")
```

Calculate the three attributes separately and then combine them into an RFM Table by merging on Customer Id. **Recency** is calculated in two steps: first compute the maximum Order Date for each customer and then, separately, the difference between that date and NOW. **Frequency** and **Monetary** value are computed similarly with an aggregation. All three attributes are then joined using two merge operations.

```r
# Calculate R
R_table <- aggregate(ParsedDate ~ CustomerId, raw_data, FUN=max)
R_table$R <- as.numeric(NOW - R_table$ParsedDate)

# Calculate F
F_table <- aggregate(OrderNumber ~ CustomerId, raw_data, FUN=length)

# Calculate M
M_table <- aggregate(Value ~ CustomerId, raw_data, FUN=sum)

# Merge R with F, then merge M into RF
RFM_table <- merge(R_table, F_table, by.x="CustomerId", by.y="CustomerId")
RFM_table <- merge(RFM_table, M_table, by.x="CustomerId", by.y="CustomerId")

RFM_table$ParsedDate <- NULL                        # Remove unnecessary column
names(RFM_table) <- c("CustomerId", "R", "F", "M")  # And change names
```

In cases when your file is large and does not fit into memory in the standard *data.frame*, try using the awesome *data.table* package. Note that there are some significant differences in working with data in this format (please read the manual or my previous article on data.table).

The input file is imported using *fread* instead of *read.csv*, and column names should be set with *setnames* (data.table’s in-place function, not base R’s *setNames*, which would only return a renamed copy).

```r
raw_data <- fread("data.txt", sep="|", stringsAsFactors=FALSE)
setnames(raw_data, c("Date", "OrderNumber", "CustomerId", "ProductId", "Value"))
# data.table columns cannot hold POSIXlt, so convert the parsed dates to POSIXct
raw_data[, ParsedDate := as.POSIXct(fast_strptime(Date, "%Y-%m-%d"))]
NOW <- fast_strptime("2014-03-02", "%Y-%m-%d")
```

The actual RFM Table calculation can be combined into a single command thanks to *data.table*‘s aggregation syntax. Note the *by="CustomerId"* parameter which specifies the aggregation column.

```r
RFM_table <- raw_data[, list(
    R = as.numeric(NOW - max(ParsedDate)),
    F = length(OrderNumber),
    M = sum(Value)),
  by="CustomerId"]
```

## Pandas

Pandas provides high-performance data structures for Python. If you are an R developer, you will find similar concepts and features in both systems. First, we load the necessary modules and provide shortened aliases.

```python
import pandas as pd
import numpy as np
import datetime as dt
```

Now, load the input data file and parse the Order Date column.

```python
data = pd.read_csv('data.txt', sep='|',
                   names=['Date', 'OrderNumber', 'CustomerId', 'ProductId', 'Value'])
data['ParsedDate'] = data['Date'].apply(lambda x: dt.datetime.strptime(x, "%Y-%m-%d"))
NOW = dt.datetime(2014, 3, 2)
```

To calculate the actual RFM table, use the Split-Apply-Combine pattern:

- **Split** the data into groups based on some criteria (Customer Id in our case).
- **Apply** a function to each group independently.
- **Combine** the results into a new data structure.

We will group the data by Customer Id and apply three different functions to each column. We provide them as anonymous lambda functions inside a dictionary with keys indicating columns to which they are to be applied.

```python
rfmTable = data.groupby('CustomerId').agg({
    'ParsedDate': lambda x: (NOW - x.max()).days,  # R
    'OrderNumber': lambda x: len(x),               # F
    'Value': lambda x: x.sum()})                   # M
```

Pandas will automatically store the calculated **Recency** as a *numpy.timedelta64*, which is not what we want. To convert it back to an integer representing the number of days, use *astype(int)* on the column. For additional convenience we can rename the columns (note the *inplace=True*, which tells Pandas not to copy the underlying data).

```python
rfmTable['ParsedDate'] = rfmTable['ParsedDate'].astype(int)
rfmTable.rename(columns={'ParsedDate': 'R', 'OrderNumber': 'F', 'Value': 'M'},
                inplace=True)
```

## Spark

For Big Data situations, Apache Spark is the best way to compute the RFM Table. It is a fast, large-scale data processing engine. In contrast to Hadoop MapReduce, it can perform calculations purely in RAM, which provides significant speedups. We will use Scala as our language of choice to interact with Spark. Please note that this code is more complicated than the R or Pandas versions, and you should be familiar with Scala, Spark and core MapReduce concepts. Apache Spark can be installed on a single machine and used without a cluster, and is generally easier to set up than Hadoop.

We start by importing the key packages: *SimpleDateFormat* and *Date* to handle Order Dates, and *SparkContext* to use the context variable *sc*.

```scala
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
```

Spark can work with various data storage systems, e.g. HDFS, Cassandra, S3 etc. In this example we will use a simple local file. The *textFile* method on the *sc* variable (the default SparkContext) reads a text file from HDFS or the local filesystem and returns an RDD (Resilient Distributed Dataset) of Strings, one for each line.

```scala
val file = sc.textFile("data.txt")
```

For convenience, create a SimpleDateFormat parser.

```scala
val defaultDateFormatter = new SimpleDateFormat("yyyy-MM-dd")
```

Spark reads text files line by line, so we need to convert each one manually. We use the *map* function to process each line through an anonymous function that splits it on the pipe character and then matches the resulting array against a template to extract (and convert) the data. Each resulting tuple has two elements: Customer Id and an embedded tuple with Order Date, Order Id, Product Id and Order Value:

(CustomerId, (OrderDate, OrderId, ProductId, Value))

Note the *cache()* call at the end. It tells Spark to persist the resulting RDD of Tuples (records) in memory for faster computations.

```scala
val records = file.map(s =>
  s.split("\\|") match {
    case Array(date, orderId, customerId, productId, ordValue) =>
      (customerId.toInt,
        (defaultDateFormatter.parse(date), orderId.toInt, productId, ordValue.toInt))
  }).cache()
```

Now, calculate the three attributes, **Recency**, **Frequency** and **Monetary** value, separately. First, we use the *map* method to turn each record into a Key-Value tuple where Customer Id is the key and the required piece of information is the value. Then, we use *reduceByKey* to compute the actual aggregate. For example, to calculate **Recency**, we generate a (Customer Id, Days since last order) tuple and then reduce it by key with a *min* function. Note that in the code “_1” returns the first element of an n-tuple, “_2” the second element, etc. For **Frequency**, we generate a tuple with Customer Id as key and 1 as value; if we then reduce it by summing, we get the count of tuples (orders). **Monetary** value is now simple to calculate by adding Order Values together.

```scala
val rRecords = records.map(v =>
    (v._1, ((new Date).getTime - v._2._1.getTime) / (24 * 60 * 60 * 1000)))
  .reduceByKey((a: Long, b: Long) => Math.min(a, b))
val fRecords = records.map(v => (v._1, 1)).reduceByKey((a: Int, b: Int) => a + b)
val mRecords = records.map(v => (v._1, v._2._4)).reduceByKey((a: Int, b: Int) => a + b)
```

Merge the three resulting RDDs into one with a chain of *join* calls. Note that the result will have a nested structure of tuples, so you will need to flatten the output with the *map* function and an anonymous pattern-matching function into:

(Customer Id, (R, F, M))

Each record can then be transformed into a pipe-separated String and saved to a text file.

```scala
val rfmRecords = rRecords.join(fRecords).join(mRecords).map(v => v match {
  case (custId, ((rVal, fVal), mVal)) => (custId, (rVal, fVal, mVal))
}).cache()

rfmRecords.map(t => "%d|%d|%d|%d".format(t._1, t._2._1, t._2._2, t._2._3))
  .coalesce(1, shuffle=true)
  .saveAsTextFile("/home/crow/projekty/mktdistillery/data2.csv")
```

# Select segment boundaries

Now that the RFM Table is ready, we need to decide how to allocate attribute values into segments. There are multiple ways of doing this, but the most basic uses quantiles. Simply put, the k-th quantile is a value *x* such that k% of values are less than *x*. So if the 25th quantile is 10, then 25% of all values are smaller than 10. Note that the 50th quantile is actually the median.

The basic quantile-based segmentation uses a uniform quantile split, e.g. [0%, 25%, 50%, 75%, 100%], and allocates each metric into the proper interval. For example, if the values at the above quantiles are [0, 10, 35, 70, 120], then 5 would go into interval 1, 40 would go into interval 3, etc. To define segments in the RFM Customer Segmentation model, calculate the above quantiles for R, F and M separately and assign each value into the proper interval.
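The interval allocation described above can be sketched in a few lines of Python; the boundaries [0, 10, 35, 70, 120] are the example quantile values from the text, and `bisect_right` mimics the behavior of R’s *findInterval*:

```python
from bisect import bisect_right

# Example quantile boundaries at [0%, 25%, 50%, 75%, 100%]
quantiles = [0, 10, 35, 70, 120]

def interval(value, bounds):
    """1-based interval index, analogous to R's findInterval."""
    return bisect_right(bounds, value)

seg_5 = interval(5, quantiles)    # 5 falls between 0 and 10 -> interval 1
seg_40 = interval(40, quantiles)  # 40 falls between 35 and 70 -> interval 3
```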

To calculate segmentation boundaries in R use the *quantile* function and then the *findInterval* function to allocate rows of RFM Table into segments:

```r
RFM_table$Rsegment <- findInterval(RFM_table$R, quantile(RFM_table$R, c(0.0, 0.25, 0.50, 0.75, 1.0)))
RFM_table$Fsegment <- findInterval(RFM_table$F, quantile(RFM_table$F, c(0.0, 0.25, 0.50, 0.75, 1.0)))
RFM_table$Msegment <- findInterval(RFM_table$M, quantile(RFM_table$M, c(0.0, 0.25, 0.50, 0.75, 1.0)))
```

We can do the same in Pandas using *quantile* and *searchsorted*:

```python
Rquant = rfmTable['R'].quantile([0.0, 0.25, 0.50, 0.75, 1.0])
Fquant = rfmTable['F'].quantile([0.0, 0.25, 0.50, 0.75, 1.0])
Mquant = rfmTable['M'].quantile([0.0, 0.25, 0.50, 0.75, 1.0])

rfmTable['Rsegment'] = rfmTable['R'].apply(lambda x: Rquant.searchsorted(x, side='right')[0])
rfmTable['Fsegment'] = rfmTable['F'].apply(lambda x: Fquant.searchsorted(x, side='right')[0])
rfmTable['Msegment'] = rfmTable['M'].apply(lambda x: Mquant.searchsorted(x, side='right')[0])
```

Unfortunately, the code for Spark would be too complex to cover in this post. It is possible to do a quick-and-dirty quantile calculation by sending all data to one node of the Spark cluster and sorting, but by doing so we would lose the real power of distributed computing. In most cases, the RFM Table should fit into a single machine’s memory, and quantile segmentation can be done in R or Pandas. Therefore, Spark would only be used to transform huge transaction logs into a much smaller RFM Table.

# Analyze

Now that customers have been assigned to segments, the most interesting part begins – interpretation and analytics. The details will largely depend on your business and campaign plan, but here are some tips to get you started:

- The **Recency** segment can be used for customer activity analysis. Split your base into Active/Inactive groups for reactivation campaigns.
- **Frequency** and **Monetary** value will help you select VIP and Fresh customers.
- Note that the RFM Model can be viewed as putting customers into buckets inside a 3D cube (R, F and M being the dimensions). Some segments should be combined for simplicity.
- Analyze the flow of customers between segments.
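As a toy sketch of the first two tips, the R/F/M segment numbers (1 = lowest quartile) can be collapsed into a handful of named groups. The labels and thresholds below are illustrative assumptions, not part of the RFM model itself:

```python
def label(r_seg, f_seg, m_seg):
    """Collapse quartile segments into a coarse group.

    Assumes a low R segment means a recent (active) customer,
    since R counts days since the last order.
    """
    active = r_seg <= 2                    # bought recently
    valuable = f_seg >= 3 and m_seg >= 3   # high Frequency and Monetary value
    if active and valuable:
        return "Active VIP"
    if active:
        return "Active"
    if valuable:
        return "Lapsed VIP"  # prime reactivation targets
    return "Inactive"
```

For example, a customer in R segment 1 with F and M segments of 4 would be labeled “Active VIP”, while the same spender in R segment 4 becomes a “Lapsed VIP” reactivation target.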
