While positioning a brand in the market, the most important question to ask is who we want to communicate with. With a poor understanding of the customer base, the results may be, to put it gently, not so outstanding. If your customers are posh yuppies and you launch a TV ad about how pricey you are, then guess what: you’re standing on the edge of a cliff and taking a big step forward.

How, then, do you determine which drums you should be playing? This article walks through some simple examples.

Most probably you are already collecting data about your customers; even if you aren’t, Facebook does it for you. When customers like your page or download your app, a nice chunk of data lands in your warehouse.

To avoid getting lost in the data crowd, statistics offers a simple yet powerful tool: logistic regression. In broad terms, it is a model that explains a 0–1 event (did the customer make the purchase? has he purchased again?) with a set of predefined variables. The output is:

- the probability that a given set of variable values will result in 1; e.g. what are the chances that a 25-year-old man who works in a bank will buy my brand new FX strategy? (I would bet it’s close to 0.00)
- the relative importance of each variable in the equation; e.g. which characteristics of my customers are driving purchases? A certain age? Sex?
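Under the hood, both outputs come from the logistic function, which squeezes any weighted sum of variables into the 0–1 range. A minimal sketch of that transform (the inputs below are made up for illustration):

```r
# The logistic (inverse logit) function maps any real-valued score
# to a probability between 0 and 1
inv_logit <- function(x) 1 / (1 + exp(-x))

inv_logit(0)    # a score of 0 maps to probability 0.5
inv_logit(-5)   # strongly negative scores map close to 0
inv_logit(5)    # strongly positive scores map close to 1
```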

Here is a quick guide on how to train a logistic regression in R on a sample dataset.

Let’s get our data in:

```r
dane <- read.table("dataset.csv", sep = ",", dec = ".",
                   header = TRUE, stringsAsFactors = FALSE)
```

And see what’s there:

```r
> head(dane)
  purchased sex age education
1         1   0  32         5
2         1   0  32         3
3         0   0  29         4
4         1   0  37         5
5         0   0  33         6
6         1   0  43         5
```

We have the variable “purchased” in the first column, which we would like to model. In some cases a variable is better grouped (it doesn’t really matter whether you’re 32 or 33), so let’s do that with age:

```r
dane$age_grouped <- cut(dane$age,
                        breaks = c(-Inf, 18, 25, 29, 39, Inf),
                        labels = c("underage", "19-25", "26-29", "30-39", "40+"))
```
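A quick sanity check of what `cut()` does, on a small made-up age vector (not the article’s dataset); note that the default intervals are closed on the right, so 18 still counts as underage:

```r
# cut() assigns each value to the interval it falls into
ages <- c(17, 22, 27, 32, 45)
grouped <- cut(ages,
               breaks = c(-Inf, 18, 25, 29, 39, Inf),
               labels = c("underage", "19-25", "26-29", "30-39", "40+"))
as.character(grouped)
```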

Pretty self-explanatory so far. Now let’s train the logistic regression:

```r
logisticRegression <- glm(purchased ~ sex + age_grouped + as.factor(education),
                          data = dane, family = "binomial")
```

Couldn’t be easier. The argument family = “binomial” tells the generalized linear model function what to do. If you want to explore more, always remember to read the help page; to do so, type ?glm.
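If you would like to poke at glm() without the article’s dataset, here is a self-contained toy fit on synthetic data (all numbers below are invented for illustration):

```r
# Simulate a purchase variable where men (sex = 1) buy more often;
# the "true" log-odds effect of sex is 0.8
set.seed(1)
n <- 1000
sex <- rbinom(n, 1, 0.5)
age <- sample(18:60, n, replace = TRUE)
p <- 1 / (1 + exp(-(-1 + 0.8 * sex)))
purchased <- rbinom(n, 1, p)
toy <- data.frame(purchased, sex, age)

# Fit the model and inspect the estimated log-odds coefficients;
# the sex coefficient should land near 0.8
fit <- glm(purchased ~ sex + age, data = toy, family = "binomial")
coef(fit)
```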

Now let’s see what the model can tell us:

```r
> summary(logisticRegression)

Call:
glm(formula = purchased ~ sex + age_grouped + as.factor(education),
    family = "binomial", data = dane)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.5924  -0.6420   0.4572   0.6690   2.2740

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)
(Intercept)           -13.13421  216.02581  -0.061 0.951519
sex                     0.80560    0.07323  11.001  < 2e-16 ***
age_grouped19-25       11.76301  216.02516   0.054 0.956575
age_grouped26-29       13.20880  216.02516   0.061 0.951244
age_grouped30-39       14.90907  216.02515   0.069 0.944977
age_grouped40+         15.65349  216.02516   0.072 0.942235
as.factor(education)2  -0.64962    0.54537  -1.191 0.233591
as.factor(education)3  -1.11923    0.53495  -2.092 0.036419 *
as.factor(education)4  -1.54924    0.54007  -2.869 0.004123 **
as.factor(education)5  -1.94170    0.53926  -3.601 0.000317 ***
as.factor(education)6  -2.36842    0.59124  -4.006 6.18e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6842.5  on 4999  degrees of freedom
Residual deviance: 4889.0  on 4989  degrees of freedom
AIC: 4911

Number of Fisher Scoring iterations: 12
```

So what can we read from it?

- whenever sex is set to 1 (in our case, men), we can expect a *significantly* higher purchase rate (marked with *** – the more stars the better!)
- education is significant only from level 3 upwards; the higher the education, the lower (note the minus sign!) the purchase rate
- age seems rather irrelevant, although the older the audience, the better for us
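The estimates above are on the log-odds scale; exponentiating turns them into odds ratios, which are easier to talk about. Plugging in the sex coefficient reported by summary():

```r
# Coefficients are log-odds effects; exp() converts them to odds ratios
beta_sex <- 0.80560
exp(beta_sex)  # about 2.24: men have roughly 2.2x the odds of purchasing
```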

So we can already picture a *marketing persona* – a 40+ male with as low an education as possible. But before running to the offline marketing department with bright ideas about ads featuring retired physical workers talking their slang, let’s check how strongly diversified the final response is.

To do so, let’s group identical customers into segments and see what purchase rate the model predicts for each:

```r
profiles <- unique(dane[, c("sex", "age_grouped", "education")])
library('boot')
profiles <- data.frame(profiles,
                       profile = apply(profiles, 1, paste, collapse = ", "))
profiles <- data.frame(profiles,
                       prediction = inv.logit(predict(logisticRegression, profiles)))
```

The first line grabs the unique customer types. The library ‘boot’ contains a very basic inverse logit function used to transform the logistic regression output into probabilities. The third line collapses the descriptors of each group into a label, which will be nice to have on the plot. Speaking of which, let’s see the data visualized using a very powerful tool in R: ggplot2.
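As a side note, inv.logit is nothing exotic: it is exactly the logistic transform 1/(1 + exp(-x)). A quick check on a few made-up log-odds values:

```r
# boot::inv.logit and the base-R one-liner give identical probabilities
library(boot)
x <- c(-2, 0, 1.5)
all.equal(inv.logit(x), 1 / (1 + exp(-x)))  # TRUE
```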

```r
library('ggplot2')
ggplot(profiles, aes(x = profile, y = prediction)) +
  geom_bar(stat = "sum") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  theme(legend.position = "none") +
  scale_x_discrete(limits = profiles$profile[rev(order(profiles$prediction))])
```

The ggplot() call basically tells the program what the *aesthetics* are (i.e. colours, types, x’s and y’s). geom_bar() does the real job: it makes the bars. The first theme() line makes our labels look nice, i.e. perpendicular to the axis; the second removes the legend, as there is no point in having a legend when only one series is plotted. The last line sorts our labels in descending order of probability – the best first! So the output looks like this:

What we see is that the probability drops only slightly at the top of the ranking, which groups the less educated, older men; the youngsters are far behind. Now we are more certain about the direction of our marketing communication.

You might wonder: the data here is small, but what if it gets really big? Not a problem! If the logistic regression you want to train needs hundreds of variables and runs on millions of rows, you can fit it directly in Hive using the Mahout package. Once you know how logistic regression works, the size of the data matters less thanks to modern technology. Understand the principles, and you need not be afraid of any data.
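One reason the model scales so well is that the log-likelihood gradient decomposes over rows, so the coefficients can be updated one observation (or one chunk) at a time. A toy stochastic-gradient sketch on synthetic data illustrates the principle (this is not how Mahout is implemented, just the idea):

```r
# Simulate data from a known logistic model, then recover the
# coefficients with single-row gradient updates -- no need to hold
# more than one observation's gradient in memory at a time
set.seed(42)
n <- 2000
x <- cbind(1, rnorm(n))                        # intercept + one feature
true_beta <- c(-0.5, 1.2)
y <- rbinom(n, 1, 1 / (1 + exp(-(x %*% true_beta))))

beta <- c(0, 0)
lr <- 0.1                                      # learning rate
for (epoch in 1:10) {
  for (i in sample(n)) {
    p <- 1 / (1 + exp(-sum(x[i, ] * beta)))    # predicted probability
    beta <- beta + lr * (y[i] - p) * x[i, ]    # single-row gradient step
  }
  lr <- lr * 0.8                               # decay the step size
}
beta  # should land near c(-0.5, 1.2)
```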

**Appendix**

Let’s have a closer look at the output of GLM in R:

- The deviance residuals: these are descriptive statistics of the GLM equivalent of residuals in a standard linear model. In a linear model we would analyze the distribution of \(y-\hat{y} = y-X\beta \), while in a GLM we look at \(2\left(l(y)-l(\hat{\mu})\right)\), with \(l(y)\) being the log-likelihood function evaluated at the observed values and \(l(\hat{\mu})\) at the predicted ones. The rule of thumb is the same as in linear models: we want this distribution to be as close to normal as possible, with small variance.
- Coefficients: here we can see the maximum likelihood estimates: \(\hat{\beta}_{MLE}\), their standard errors: \(\hat{\sigma}_{MLE}\), the z-value: \(\frac{\hat{\beta}_{MLE}}{\hat{\sigma}_{MLE}}\) and the p-value of the hypothesis that the parameter \(\beta=0\), using the asymptotic normal distribution of the estimator (hence the z value rather than a t value).
- Null and residual deviance: here we can compare how well our model describes the data. The null deviance is the deviance for a model with solely a constant. If our model is bad, the two values will be close to each other.
- AIC is the Akaike Information Criterion; similarly to deviance, it measures the goodness of fit while penalizing model size. It is defined as: \(2\,\text{dim}\beta-2l(\hat{\mu})\)
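The AIC definition can be checked numerically on a toy model (synthetic data, purely illustrative):

```r
# Reproduce AIC from the log-likelihood: AIC = 2 * dim(beta) - 2 * logLik
set.seed(7)
d <- data.frame(y = rbinom(50, 1, 0.4), x = rnorm(50))
fit <- glm(y ~ x, data = d, family = "binomial")
k <- length(coef(fit))                      # dim(beta), here 2
manual_aic <- 2 * k - 2 * as.numeric(logLik(fit))
all.equal(manual_aic, AIC(fit))             # TRUE
```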

Loved this blog post. It would be nice if you provided some further detail about the dataset (does 0 or 1 represent male?) and about interpreting the output of glm.

Hi AJ!

Thank you for your comment – I have added an appendix to the article. As for the data, please treat it as artificial – when modelling your own data, it doesn’t matter whether 0 is male or female – although being gentlemen might suggest treating 1 as woman 😉

Awesome, thank you!

I believe you can do predict(logisticRegression, profiles, type="response") to get probabilities directly rather than converting the log-odds with "boot". Nice article!

Yes, the predict function indeed can be used here and produces the same output. As with many things in R, there is no single solution. Thank you for careful reading! 🙂

Hi,

I am getting Error while executing the last statement.

Error in order(profiles$prediction) : argument 1 is not a vector

any help?

Regards

DARMA

Hi Darma,

I have just executed it with no errors – did you load all the libraries? What does head(profiles$prediction) look like for you? I get something like this: “[1] 0.4583861 0.6582773 0.1862367 0.3558180 0.6405096 0.8021910”.

Regards,

Krzysztof