We will base this post around a story. You are working as a Data Scientist at one of the biggest online gift shops. Your job is to analyze data and make sure campaigns and products match what the customers are looking for. Click streams, sales history, social network interactions: all this data sits in the shiny new Big Data platform, waiting for you to mine it for insight.

One day the Head of Marketing approaches your desk and says: “*The management team wants to send out a newsletter with gifts for Valentine’s Day. It needs to be personal and targeted. Can you do it?*” You respond with a quick “*yes*”. This sounds like a simple task in predictive analytics. You launch Pig or Hive, or connect to a Redshift instance, and pour yourself a cup of coffee. A couple of queries later you pull the data out and create basic visualizations to look for patterns.

Once you plot the Average Order Value per customer on a histogram, you notice something (see Fig. 1): three distinct groups of customers, centered around low, mid and high-valued purchases. Looking at this chart you say: “*Yes, this is what I need. We will use the Average Order Value to build a segmentation.*” All that is left now is for the Sales Department to prepare separate offers for low, mid and high-value customers.

You decide to fit a mixture of three normal distributions to the histogram in Fig. 1 using the *mixtools* package in R. You will then classify each customer in the database by picking the closest cluster center. It is a simple methodology, often used in data mining projects: easy to explain and efficient to compute on large databases. The R script returns a mixture of three normal distributions:

- Low-value cluster centered at 32 GBP with a standard deviation of 13.0 GBP
- Mid-value cluster centered at 75 GBP with a standard deviation of 12.5 GBP
- High-value cluster centered at 121 GBP with a standard deviation of 21.74 GBP
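The story uses *mixtools* in R for this fit. Purely as an illustration, here is a rough Python sketch of the same idea, a small EM loop for a one-dimensional Gaussian mixture, run on synthetic data drawn from the three segments above. The function names and the synthetic data are mine, not part of the original analysis:

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def fit_gmm_1d(data, mus, sigmas, weights, n_iter=200):
    """Tiny EM loop for a 1-D Gaussian mixture with a fixed number of components."""
    k = len(mus)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate means, standard deviations and mixing weights
        for j in range(k):
            rj = [r[j] for r in resp]
            nj = sum(rj)
            mus[j] = sum(r * x for r, x in zip(rj, data)) / nj
            sigmas[j] = math.sqrt(sum(r * (x - mus[j]) ** 2 for r, x in zip(rj, data)) / nj)
            weights[j] = nj / len(data)
    return mus, sigmas, weights

# Synthetic order values, roughly matching the three segments in the story
random.seed(0)
data = ([random.gauss(32, 13.0) for _ in range(300)]
        + [random.gauss(75, 12.5) for _ in range(300)]
        + [random.gauss(121, 21.74) for _ in range(300)])
mus, sigmas, weights = fit_gmm_1d(data, [20.0, 70.0, 130.0],
                                  [10.0, 10.0, 10.0], [1 / 3, 1 / 3, 1 / 3])
```

On this synthetic sample the recovered means land close to the generating centers of 32, 75 and 121.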

You put together a short Pig script and a few minutes later the Big Data system responds with a list of customers assigned to the appropriate segments. Score one for the analytics team!
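The Pig script itself is not shown in the story; here is a minimal Python sketch of the nearest-cluster rule it would implement, using the fitted centers from above (the segment names are illustrative):

```python
# Cluster centers are the fitted means from the mixture model in the text
CENTERS = {"low": 32, "mid": 75, "high": 121}

def assign_segment(avg_order_value):
    """Assign the segment whose center is closest to the customer's average order value."""
    return min(CENTERS, key=lambda seg: abs(CENTERS[seg] - avg_order_value))

print(assign_segment(90))  # prints: mid (90 is closer to 75 than to 121)
```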

Two weeks later your boss rushes into your office and says: “*Stop whatever you are doing! The newsletter campaign is a disaster. The unsubscribe rate on our newsletter is through the roof!*”. “*Impossible*”, you respond, “*This was based on rigorous statistical modeling! Big Data doesn’t lie.*”

You clearly did everything by the book here, right? Well, not quite.

What we came across in this short story is a problem we like to call “Small Data”. **It arises when you have lots of data, but not enough real information to build a model.** Let us get back to the story above. If you were to reexamine the database, you might find that many customers made only one or two purchases. Despite this, you used a nearest-cluster classification mechanism that often relied on a single observation. That does not sound right when you think about it again, does it?

The problem with Small Data is that it often hides inside Big Data systems. **Big Data is not really about how much data you have stored in the data warehouse. It is about how much actual data you can use in your models.** In the example above you often had only a single observation per customer, even though your data warehouse contained millions of records. Small Data problems in a Big Data world.

So, how can we deal with this issue? The solution to Small Data problems lies not in crisp models, but in the area of Bayesian statistics. **In the Bayesian setting we accept that our knowledge about the world is often expressed as degrees of belief.** Instead of putting each customer into a single segment, we calculate the probability that he belongs to each one of the three groups. New observations update our beliefs according to Bayes’ equation.

Let us work through a simple example. Assume Customer A registers for your newsletter and has no sales history yet. In the classical model we have no information about this person, and hence our naive classifier will not even work. In the Bayesian world, we would use some initial beliefs about Customer A, which we call a *prior*. The simplest assumption is that he is equally likely to be in the low, mid or high-value segment. Sometimes you may be able to use other sources of information to build a prior. For example, your customers might be required to fill in a survey when signing up for the newsletter. We will continue with a *tabula rasa* approach and a flat prior:

$$P(A \in Low, A \in Mid, A \in High) = (\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$$

Sometime later, Customer A makes his first purchase, for 90 pounds. In the classical setting, that puts him in the mid-value segment (90 pounds is closer to 75 than it is to 121). But what if this was just a small gift for someone’s birthday? What if in reality this person is willing to spend around 120 pounds on most occasions? **The naive classifier would make a mistake and cause the customer to unsubscribe from the newsletter.** Reexamine Fig. 2 and you may notice that even mid-value customers can sometimes buy expensive gifts.

Working out the math behind the Bayesian approach is not that difficult. We start with the prior: a ⅓ probability that the customer belongs to each of the segments. The purchase for 90 pounds allows us to update our beliefs using Bayes’ equation. For the low-value segment:

$$P(A \in Low | 90) = \frac{P(90| A \in Low) P(A \in Low)}{P(90)} \enspace ,$$

where:

- \(P(A \in Low | 90)\) is the posterior probability that Customer A is in the low-value segment after he makes the purchase for 90 pounds.
- \(P(90| A \in Low)\) is the probability of a purchase of 90 pounds given that the customer is a low-value customer. We call this the likelihood.
- \(P(A \in Low)\) is our prior belief that Customer A is in the low-value cluster.
- \(P(90)\) is the overall probability of observing a purchase of 90 pounds.

We will use \(\mathcal{N}_{\mu,\sigma}(x)\) to represent the value at \(x\) of the density of a normal distribution centered at \(\mu\) with standard deviation \(\sigma\). With the flat prior and a mixture of normal distributions as the likelihood, we calculate:

$$
\begin{align}
P(A \in Low | 90) &= \frac{P(90 | A \in Low)\, P(A \in Low)}{P(90)} \\
&= \frac{\frac{1}{3} \mathcal{N}_{32,13.0}(90)}{P(90 | A \in Low) P(A \in Low) + \dots + P(90 | A \in High) P(A \in High)} \\
&= \frac{\frac{1}{3} \mathcal{N}_{32,13.0}(90)}{\frac{1}{3} \mathcal{N}_{32,13.0}(90) + \frac{1}{3} \mathcal{N}_{75,12.5}(90) + \frac{1}{3} \mathcal{N}_{121,21.74}(90)} \\
&\approx 0
\end{align}
$$

Almost identical calculations for the mid and high-value segments give us the posterior distribution:

$$P(A \in Low, A \in Mid, A \in High | 90) \approx (0, 0.70, 0.30)$$
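As a sanity check, the whole update can be reproduced in a few lines. This is a Python sketch using the fitted parameters and the flat prior from the text; the function and variable names are mine. With the stated parameters it yields roughly (0, 0.70, 0.30):

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Fitted mixture components (mean, standard deviation) and the flat prior
SEGMENTS = {"low": (32, 13.0), "mid": (75, 12.5), "high": (121, 21.74)}
PRIOR = {"low": 1 / 3, "mid": 1 / 3, "high": 1 / 3}

def posterior(purchase, prior):
    """Update segment beliefs after observing one purchase, via Bayes' equation."""
    unnorm = {seg: prior[seg] * normal_pdf(purchase, mu, sigma)
              for seg, (mu, sigma) in SEGMENTS.items()}
    evidence = sum(unnorm.values())  # P(purchase), the normalizing constant
    return {seg: u / evidence for seg, u in unnorm.items()}

post = posterior(90, PRIOR)
# post is roughly {'low': 0.0001, 'mid': 0.70, 'high': 0.30}
```

The posterior could then be fed back in as the prior for the customer’s next purchase, so each sale refines the beliefs further.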

As you can see, our belief in the high-value segment has not fallen to zero, as it did in the simple classifier. **Each incoming sale gives us more information** about the customer’s value.

All we need now is a way to send newsletters based on a probability distribution instead of a single cluster membership. If you are sending many emails over a period of time, you can simply pick a version at random for each email being sent, using the posterior as the randomization distribution. A slightly better approach is to create pools of low, mid and high-value products and select items from them randomly according to your beliefs about the customer’s value.
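The first idea, picking a newsletter version at random from the posterior, can be sketched with Python’s `random.choices`. The variant names and belief values here are illustrative, taken from the worked example above:

```python
import random

# Hypothetical newsletter variants, one per segment
VARIANTS = ["low", "mid", "high"]
belief = [0.0, 0.70, 0.30]  # posterior beliefs after the 90-pound purchase

def pick_newsletter(beliefs, rng=random):
    """Sample a newsletter version according to the posterior over segments."""
    return rng.choices(VARIANTS, weights=beliefs, k=1)[0]

random.seed(42)
sends = [pick_newsletter(belief) for _ in range(1000)]
# Over many sends, roughly 70% mid-value and 30% high-value versions go out
```

Over many emails the customer therefore mostly sees mid-value offers, but still occasionally receives high-value ones, matching our residual uncertainty about him.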

Big Data systems often feel like the best solution to many data-driven problems. But before jumping into the sea of information, **stop and think!** You might be dealing with a Small Data problem. **The Bayesian approach is far superior and more natural in such circumstances.** Do not be afraid to use a distribution and a bit of randomness to get the result you need.
