Cut through the customer base: a simple example on who to target

When positioning a brand in the market, the most important question to ask is who we want to communicate with. With a poor understanding of the customer base, the results may be, to put it gently, ‘not so outstanding’. If your customers are posh yuppies and you go for a TV ad about how pricey you are, then guess what: you’re standing on the edge of a cliff and taking a huge step forward.

How, then, do you determine which drums you should be playing? This article walks through a simple example.



Most probably you are already collecting data about your customers; even if you don’t, Facebook does it for you. When customers like your page or download your app, a nice chunk of data lands in your warehouse.


To avoid getting lost in the data crowd, statistics offers a simple yet powerful tool: logistic regression. In broad terms, it is a model that lets you explain a 0-1 event (did the customer make a purchase? Has he purchased again?) with a set of predefined variables. The output is:

  • the probability that a given set of variable values will result in 1; e.g. what are the chances that a 25-year-old man who works in a bank will buy my brand new FX strategy? (I would bet it’s close to 0.00)
  • the relative importance of each variable in the equation, e.g. which characteristics of my customers are driving purchases? A certain age? Sex?
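
For reference, the model behind those two outputs fits in one line. This is the standard textbook form of logistic regression, with \(x\) standing for one customer’s variable values and \(\beta\) for the coefficients that the model estimates:

\[ P(y = 1 \mid x) = \frac{1}{1 + e^{-x^\top \beta}} \]

The probability in the first bullet is the left-hand side; the relative importance in the second bullet is read off the sign and size of each element of \(\beta\).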

Here is a quick guide on how to train a logistic regression in R on a sample dataset.

Let’s get our data into it:

And see what’s there:
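
A minimal sketch of these two steps, assuming a simulated stand-in for the real customer table. The column names (purchased, sex, education, age) follow the variables discussed in this article; the sample size, the seed and the effects used to generate purchases are made up purely for illustration.

```r
# Simulated stand-in for the customer table; all numbers here are made up.
set.seed(42)
n <- 500
customers <- data.frame(
  sex       = rbinom(n, 1, 0.5),                       # 1 = male, 0 = female
  education = factor(sample(1:4, n, replace = TRUE)),  # 1 = lowest level
  age       = sample(18:70, n, replace = TRUE)
)

# Assumed "true" effects, used only to generate the 0-1 outcome
eta <- -1 + 1.0 * customers$sex -
  0.4 * (as.integer(customers$education) - 1) +
  0.01 * customers$age
customers$purchased <- rbinom(n, 1, plogis(eta))

# Put the modelled variable in the first column
customers <- customers[, c("purchased", "sex", "education", "age")]

head(customers)  # first few rows
str(customers)   # column types
```

With a real export, the simulation part would be replaced by reading your own file, for example with read.csv().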

We have the variable “purchased” in the first column, and that is the one we would like to model. In some cases a variable is better off grouped (it doesn’t really matter if you’re 32 or 33), so let’s do that with age:
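
One way to do the grouping is base R’s cut(). The sketch below is an illustration on a handful of made-up ages; the break points (30 and 40) and the labels are assumptions, not necessarily the bands used for the results in this article.

```r
# A few made-up ages to show how the bands are assigned
age <- c(19, 25, 32, 33, 41, 58, 67)

# Intervals are right-closed by default: (0,30], (30,40], (40,Inf]
age_group <- cut(
  age,
  breaks = c(0, 30, 40, Inf),
  labels = c("<=30", "31-40", "41+")
)

table(age_group)  # how many customers fall into each band
```

Note that 32 and 33 end up in the same band, which is exactly the point of grouping.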

Pretty self-explanatory so far. Now let’s train the logistic regression:

Couldn’t be easier. The argument family = "binomial" tells glm, the generalized linear models function, which kind of model to fit. If you want to explore more, always remember to read the help page; to do so, type ?glm.
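
As a sketch of what such a call looks like, here is glm() with family = "binomial" on a small simulated dataset. The data frame d, its columns and the effects used to generate it are hypothetical.

```r
# Simulated data; the effects below are invented for the example
set.seed(1)
n <- 400
d <- data.frame(
  sex = rbinom(n, 1, 0.5),             # 1 = male
  age = sample(18:70, n, replace = TRUE)
)
d$purchased <- rbinom(n, 1, plogis(-1 + 1.2 * d$sex + 0.01 * d$age))

# family = "binomial" uses the logit link by default,
# which makes this a logistic regression
fit <- glm(purchased ~ sex + age, data = d, family = "binomial")

coef(fit)  # estimated coefficients, on the log-odds scale
```

Writing purchased ~ . instead of naming the predictors would use every other column of the data frame.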

Now let’s see what the model can tell us:

So what can we read from it?

  • whenever sex is set to 1 (in our case that means men), we can expect a significantly higher purchase rate (marked with ***; the more stars, the stronger the evidence!)
  • education is significant only from level 3. The higher the education, the lower (check out the minus sign!) the purchase rate
  • age seems quite irrelevant, although the older the audience, the better for us
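
The same reading can be done programmatically from the coefficient table that summary() returns. The sketch below runs on simulated data with invented effects, so its numbers are not the ones discussed above; the 0.05 cut-off is a conventional choice, not something fixed by glm.

```r
# Simulated data with made-up effects
set.seed(7)
n <- 600
d <- data.frame(
  sex       = rbinom(n, 1, 0.5),
  education = factor(sample(1:4, n, replace = TRUE))
)
d$purchased <- rbinom(
  n, 1,
  plogis(-0.5 + 1.0 * d$sex - 0.5 * (as.integer(d$education) - 1))
)
fit <- glm(purchased ~ sex + education, data = d, family = "binomial")

# Matrix with columns: Estimate, Std. Error, z value, Pr(>|z|)
tab <- coef(summary(fit))

# Sign gives the direction, p-value gives the strength of evidence
readout <- data.frame(
  term        = rownames(tab),
  estimate    = round(tab[, "Estimate"], 2),
  direction   = ifelse(tab[, "Estimate"] > 0, "raises", "lowers"),
  p_value     = signif(tab[, "Pr(>|z|)"], 2),
  significant = tab[, "Pr(>|z|)"] < 0.05,
  row.names   = NULL
)
readout
```

Education level 1 does not appear as a row: it is the reference level, and education2 to education4 are measured against it.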

So we can already have a marketing persona in mind: a 40+ male with as little education as possible. But before rushing to the offline marketing department with bright ideas about ads featuring retired physical workers talking their slang, let’s check how strongly diversified the final response is.

To do so, let’s group identical clients into segments and see what purchase rate the model predicts for each of them:

The first line grabs the unique customer types. The library ‘boot’ contains a very basic inverse logit function used to transform the logistic regression output into probabilities. The third line collapses the descriptives of each group into a label, which is nice to have on the plot. Speaking of which, let’s see the data visualized using a very powerful tool in R: ggplot2.
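
The three steps described above can be sketched as follows, on simulated data with made-up effects. One deliberate swap: instead of the inverse logit from ‘boot’, the sketch uses plogis() from base R, which computes the same function, and then checks the result against predict() with type = "response".

```r
# Simulated data; effects and group labels are invented
set.seed(3)
n <- 600
d <- data.frame(
  sex       = rbinom(n, 1, 0.5),
  education = factor(sample(1:4, n, replace = TRUE)),
  age_group = factor(sample(c("<=30", "31-40", "41+"), n, replace = TRUE))
)
d$purchased <- rbinom(
  n, 1,
  plogis(-0.5 + d$sex - 0.4 * (as.integer(d$education) - 1))
)
fit <- glm(purchased ~ sex + education + age_group,
           data = d, family = "binomial")

# 1. One row per distinct customer type
profiles <- unique(d[, c("sex", "education", "age_group")])

# 2. Log-odds from the model, turned into probabilities by the inverse logit
log_odds <- predict(fit, newdata = profiles)
profiles$prediction <- plogis(log_odds)

# 3. Collapse the descriptives of each profile into a single label
profiles$label <- paste(
  ifelse(profiles$sex == 1, "M", "F"),
  paste0("edu", profiles$education),
  profiles$age_group,
  sep = " / "
)

# Shortcut: ask predict() for probabilities directly
direct <- predict(fit, newdata = profiles, type = "response")
all.equal(unname(profiles$prediction), unname(direct))
```

Both routes give the same probabilities, so the choice between them is a matter of taste.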

The first line basically tells the program what the aesthetics are (i.e. colors, types, x’s and y’s). The second does the real job: it makes the bars. The third line makes our labels look nice, i.e. perpendicular to the axis. The next one removes the legend, as there is no point in having a legend when only one series is plotted. The last one sorts our labels in descending order of probability: the best first! So the output looks like this:
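
A plot along those lines could be sketched as below, assuming ggplot2 is installed. The three profiles and their probabilities are invented placeholders, and the sorting is done with reorder() inside aes(), which is one possible way to get the descending order rather than necessarily the one described above.

```r
library(ggplot2)

# Placeholder profiles; labels and probabilities are invented
profiles <- data.frame(
  label      = c("F / edu4 / <=30", "M / edu1 / 41+", "M / edu1 / 31-40"),
  prediction = c(0.18, 0.62, 0.55)
)

p <- ggplot(
  profiles,
  # reorder() sorts the labels by descending predicted probability
  aes(x = reorder(label, -prediction), y = prediction, fill = label)
) +
  geom_bar(stat = "identity") +                    # bar height = the value itself
  theme(
    axis.text.x     = element_text(angle = 90, hjust = 1),  # perpendicular labels
    legend.position = "none"                       # a single series needs no legend
  ) +
  labs(x = "Customer profile", y = "Probability of making a purchase")

p
```

Sorting inside aes() leaves the data frame itself untouched, so there is no separate ordering step that could fail when a column is missing.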


Probabilities of making a purchase

So what we see is that the probability drops only a little at the beginning, where the uneducated, older men are grouped. Youngsters are far behind. Now we are more certain about the direction of our marketing communication.

You might wonder: the data here is small, but what if it gets really big? Not a problem! If the logistic regression you want to train needs hundreds of variables and runs on millions of rows, you might do it on Hadoop, using Apache Mahout. Once you know how logistic regression works, the size of the data matters less thanks to modern technology. Understand the principles, and you shall not be afraid of any data.


Let’s have a closer look at the output of GLM in R:

  • The deviance residuals: these are descriptive statistics of the GLM equivalent of the residuals in a standard linear model. In a linear model we would analyze the distribution of \(y-\hat{y} = y-X\hat{\beta}\), while in a GLM we start from the deviance \(2\left(l(y)-l(\hat{\mu})\right)\), with \(l(y)\) being the log-likelihood function evaluated at the observed values and \(l(\hat{\mu})\) at the predicted ones; each deviance residual is the signed square root of one observation’s contribution to that sum. The rule of thumb is similar to linear models: we want this distribution to be roughly symmetric around zero with a small spread, although for a 0-1 response it will never look truly normal.
  • Coefficients: here we can see the maximum likelihood estimates \(\hat{\beta}_{MLE}\), their standard errors \(\hat{\sigma}_{MLE}\), the z-value \(\frac{\hat{\beta}_{MLE}}{\hat{\sigma}_{MLE}}\) and the p-value for the hypothesis that the parameter \(\beta=0\). For the binomial family, R computes this p-value from the asymptotic standard normal distribution of the z-value, which is why the output says “z value” rather than “t value”.
  • Null and residual deviance: here we can compare how well our model describes the data. The null deviance is the deviance of a model with solely a constant. If our model is bad, the two values will be close to each other.
  • AIC is the Akaike Information Criterion; similarly to the deviance, it measures the goodness of fit, but it also penalizes the number of parameters. It is defined as \(2\,\text{dim}\beta-2l(\hat{\mu})\), where \(\text{dim}\beta\) is the number of parameters.
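
These quantities can be recomputed by hand as a sanity check. The sketch below does so on simulated data with made-up effects and compares each result with what R reports; it relies on the fact that, for a 0-1 response, the saturated log-likelihood \(l(y)\) equals zero.

```r
# Simulated data; the effects are invented
set.seed(11)
n <- 300
d <- data.frame(sex = rbinom(n, 1, 0.5), age = sample(18:70, n, replace = TRUE))
d$purchased <- rbinom(n, 1, plogis(-1 + 1.0 * d$sex + 0.01 * d$age))
fit <- glm(purchased ~ sex + age, data = d, family = "binomial")

y  <- d$purchased
mu <- fitted(fit)                                   # predicted probabilities

# Log-likelihood at the fitted values, l(mu-hat)
loglik <- sum(y * log(mu) + (1 - y) * log(1 - mu))

# Residual deviance: 2 * (l(y) - l(mu-hat)), with l(y) = 0 for 0-1 data
dev_by_hand <- -2 * loglik

# Null deviance: the same quantity for a constant-only model
null_fit <- glm(purchased ~ 1, data = d, family = "binomial")

# AIC: 2 * number of parameters - 2 * log-likelihood
aic_by_hand <- 2 * length(coef(fit)) - 2 * loglik

# z-values and two-sided p-values from the standard normal distribution
tab <- coef(summary(fit))
z_by_hand <- tab[, "Estimate"] / tab[, "Std. Error"]
p_by_hand <- 2 * pnorm(-abs(z_by_hand))

all.equal(dev_by_hand, deviance(fit))
all.equal(deviance(null_fit), fit$null.deviance)
all.equal(aic_by_hand, AIC(fit))
all.equal(p_by_hand, tab[, "Pr(>|z|)"])
```

Each of the four comparisons returns TRUE, so the hand calculations agree with what glm reports.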

7 responses to “Cut through the customer base: a simple example on who to target”

  1. AJ Bahnken

    Loved this blog post. It would be nice if you provided some further detail about the dataset (does 0 or 1 represent male?) and about interpreting the output of glm.

  2. wije

    I believe you can do predict(logisticRegression, profiles, type="response") to get probabilities directly rather than converting the log-odds with “boot”. Nice article!

  3. darma


    I am getting Error while executing the last statement.

    Error in order(profiles$prediction) : argument 1 is not a vector

    any help?

