MarketingDistillery.com is about web analytics, data science and marketing strategy

# Marketing channel attribution on big data – on the fly calculation with AWK

For marketing budget owners, the key to spending decisions is understanding the consequences and results driven by each investment. As outlined in the previous article by Pascal Moyon, there is no single solution to this puzzle. In practice, this means there is great value in being able to quickly compare the results of different attribution models and cross them with market knowledge, experience and, let’s be honest, intuition. What is essential is to retrieve the output of a given attribution model on the spot, and to do it quickly. Let’s have a look at a simple attribution problem.

If you want to see a marketing team fight, here’s a sample recipe: ask which channel owner should be proud of bringing in a $50 sale, given a customer path like the one below:

*(figure: sample customer path)*

• It’s the brand strength!
• No, it’s the great deal with our PartnerX!
• No, you are all wrong – it is the heavy effort of our paid search activity that gives the customer recognition of our product!
• ... and so on!

But how would we actually measure each channel’s contribution? This article aims at showing a quick way to put actual numbers behind such a discussion.

# One of the many attribution models

Suppose we believe that in our industry the real fight happens at the end of the path, when customers search and compare offers. In such cases the goods are often substitutable and price sensitivity is high. A good example might be the market for computer parts: you will spend quite a lot of time choosing which CPU you want, but once decided you will rather shop by price, shop reputation, warranty length and so on. So we would like to put more stress on the end of the path. One of infinitely many options is a time decay model, where a touch point’s contribution is inversely proportional to its time distance from the conversion:

$$c_{t} \sim \frac{1}{t_{conversion}-t+1}$$

The maximum weight (before normalizing) is given to contact points on the day of purchase. Clicks or site entries one day before will have half that weight. The contribution quickly dwindles towards the asymptote of zero, but is always positive. In the sample scenario presented above, we would assign the values in the following way:

*(figure: output of the attribution model)*

# But how can we do it fast on big data?

In most scenarios, the clickstream to a website is quite large. We need a solution that does not mind a very long stream of rows and processes them fast. This is where the AWK language comes in very handy.
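As a quick sanity check on the decay formula (assuming t is measured in whole days), the unnormalized weights for touch points 0, 1 and 2 days before conversion can be printed with a one-line AWK sketch:

```shell
# Unnormalized time-decay weights 1/(d+1) for d = 0, 1, 2 days before conversion
awk 'BEGIN { for (d = 0; d <= 2; d++) printf "%d day(s) before: %.4f\n", d, 1/(d+1) }'
```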
The AWK language was created precisely to process large amounts of text quickly and ‘on the fly’ – that is, with little memory usage. The data flows in and out, and in the meantime we can do some calculations on top of it. Its logic is very simple. In the first lines we can define functions to be used later. In the BEGIN clause, we can set some rules before reading the stream. In this example, we tell AWK that the fields will come separated by a tab, and we also want AWK to print with a tab as separator (OFS – the output field separator). In the later part, we define what to do with the stream. It is important to notice that we can benefit from Perl-style variables like $0 (which holds the current line), $1 (which holds the first column of the current line) and so on. In the END clause, we have a final call to close or output the result.
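The structure described above can be sketched with a minimal, self-contained example (the input and the helper function here are made up for illustration):

```shell
printf 'a\tb\nc\td\n' | awk '
function shout(s) { return toupper(s) }   # helper defined before the rules
BEGIN { FS = "\t"; OFS = "\t"; n = 0 }    # separators set before reading
{ n++; print NR, shout($1), $2 }          # $0 is the line, $1 its first field
END { print "lines:", n }                 # final output after the stream ends
'
```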
In our case, we need to first define our attribution model:

We do it in 3 steps:

• we need to be able to subtract dates, so first we convert them to an AWK-digestible format
• having two dates as numeric values, we can subtract them; the result is in seconds, so we convert to days by dividing by 86400 = 24 * 60 * 60
• finally, we compute the weight of the touch point from the time decay formula above

Now it’s time to code the core of the program. Suppose we have data stored as:
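A hedged illustration of the assumed input format (the customer IDs would normally be full MD5 hashes; all values here are made up), already sorted by customer ID and then by date descending:

```
2014-05-20	1b2ca301…	paid-search	50.00
2014-05-19	1b2ca301…	partner-x	0.00
2014-05-18	1b2ca301…	brand	0.00
2014-05-20	5fe0a3cc…	direct	10.00
```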

In the first column we have the date of the event, in the second the MD5-hashed customer ID, the third column is the channel from which the customer landed on our page, and in the last we hold the purchase value. The sorting of the data is important – we need it sorted by customer ID first (so that each customer arrives in a single chunk) and by date descending (as we subtract from the date of conversion, we need it to come first). In our AWK program, all we need to do is compute the channel contribution for each row and keep track of changing keys. To do that, we simply check whether the previous key is equal to the current one. If so, we add the row’s contribution to its channel. If not, we compute the normalizing constant (otherwise we would duplicate income!) and print the final values. After doing so, it is important to reset the temporary holders so as not to mix up different customers. In the END clause, we need to remember that the last customer has not yet been processed, and it’s high time to do it.

The output of the script is fully additive and should be far smaller than the input. Of course, if each customer path is a single point, the number of rows in input and output will be the same. The last part is to aggregate the values by channel and analyze the output. For the given example, the split of the purchase made by the customer will look like this:
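The aggregation by channel is itself a one-line AWK job; the input rows here are illustrative stand-ins for the script’s output (customer, channel, attributed value):

```shell
printf 'cust1\tsearch\t33.33\ncust1\tpartner\t16.67\ncust2\tsearch\t10\n' |
awk 'BEGIN { FS = OFS = "\t" }
     { sum[$2] += $3 }                  # accumulate attributed value per channel
     END { for (ch in sum) print ch, sum[ch] }' |
sort
```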