As the business world has evolved over the past couple of decades, one variable has become particularly valuable to understand – and as a data scientist, you'll be asked about it a lot – and that is:
Customer "churn" refers to the ratio of customers who have stopped, or are about to stop, using a company's services during a given timeframe, compared to its total number of customers. So basically, when we talk about churn, we are talking about customer retention.
You can imagine how important that information is for companies with subscription models (gyms, Netflix, Amazon Prime), but all kinds of businesses benefit from better understanding their churn: restaurants, retail, and even banks, which is what we will focus on today with a real-world example.
Banks need to be good at predicting everything, especially churn. If they don't have a proper estimate of how long customers will stay with them, they can't have a reliable estimate of how much money they have available for loans, credit, savings, anything. But the good news about banks is that they have loads of data, so let's take a look at a real-world dataset and see if we can predict churn.
If you would like to have a churn prediction project in your portfolio, here is a link to my tutorial, where you can follow along with my step-by-step guidance.
Now, if you remember my friend RIC, we talked about him in the "Learn to code fast" tutorial. In that video I mention that at the beginning of a big project like this, before we start coding, we first need to refer back to our original question. What is it we are trying to solve? Well, in this case we are trying to figure out the churn rate of our bank's customers.
Before we split our data, we need to figure out which variables will matter most for our model. Let's use our intuition and essentially guess-and-check which variables are relevant: age, credit score, and how long a customer has already been with the bank all seem promising. So let's keep those in mind when we build our model.
After performing data analysis, cleaning our data, and producing different visualisations to surface the most relevant correlations between variables, the results were as follows.
Customer Churn Analytics Results:
In the image above, the first thing to notice is that our data is unbalanced: approximately 80% of customers have not churned and approximately 20% have. In other words, 80% of the bank's customers are staying with the company, which is a very good sign, but 20% are choosing to drop the bank's services. There are several techniques for handling unbalanced data; the most common, and the one I used in my tutorial, is resampling the dataset.
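As a sketch of the resampling idea (the tutorial itself likely uses library helpers; the function and field names here are made up for illustration), the minority class can be upsampled with replacement using only the standard library:

```python
import random

def upsample_minority(rows, label_key="churned", seed=42):
    """Balance a binary dataset by resampling the minority class with replacement."""
    rng = random.Random(seed)
    majority = [r for r in rows if not r[label_key]]
    minority = [r for r in rows if r[label_key]]
    if len(minority) > len(majority):
        majority, minority = minority, majority
    # draw from the minority class, with replacement, until it matches the majority size
    upsampled = [rng.choice(minority) for _ in range(len(majority))]
    return majority + upsampled

# toy data: 8 retained customers, 2 churned (roughly the 80/20 split above)
data = [{"churned": False}] * 8 + [{"churned": True}] * 2
balanced = upsample_minority(data)  # now 8 retained and 8 churned rows
```

After balancing, both classes contribute equally during training, so the model is not rewarded for simply predicting "no churn" every time.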
Performing more data analysis, you can see in the image above that the majority of the bank's customers are between their 30s and their 50s.
In the image above you can see a plot of the correlation between the age and churn variables. We see that most churns come from people in their mid-20s and from those in their 50s to 60s. One possible explanation is that people in their 20s do not yet have a stable career and work life, hence why they churn, and something similar happens for people in their mid-50s to 60s, when it can be quite tough to keep a job before going on pension. This is just an assumption; we will see whether it holds after we build our classifiers.
In the image above we see the correlation between geography and churn rate. Relative to its number of customers, the highest churn rate is in Germany. This metric is most likely correlated with the age of the customers in Germany as well as their income (more to conclude after the results of our classifiers).
Machine Learning Classifier Results:
Moving on to classifiers, the first one I implemented was the random forest classifier, which predicted customer churn with an accuracy of 86.6% without any parameter tuning, a result which is quite good. When I say we have built our classifier, I mean we have done the prep work on our data, intuitively decided which variables we would like to plug into our machine learning model, fed the prepped data into the classifier, and performed the prediction.
After creating the classifier, I also checked the random forest's feature importance, which tells me which variables actually contribute to a good prediction accuracy. From the plot we see that age, salary, and credit score have a huge influence on how long customers stay with the bank. As we saw in the visualisations above, a whole lot of people in their 20s with smaller salaries decided to churn, and people in their 50s with the same issue also chose to drop the bank's services.
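As an illustration of how such a feature-importance check can be produced (on synthetic stand-in data, not the actual bank dataset; the feature names simply mirror the ones discussed above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
# synthetic stand-ins for the bank data: here age drives churn, the rest is noise
age = rng.integers(18, 70, n)
salary = rng.normal(50_000, 15_000, n)
credit_score = rng.integers(300, 850, n)
X = np.column_stack([age, salary, credit_score])
# churn is made more likely for the very young and the near-retirement customers
y = ((age < 28) | (age > 55)).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, imp in zip(["age", "salary", "credit_score"], clf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

Because the synthetic label depends only on age, the age column dominates the importance scores; on the real bank data the split between age, salary, and credit score is what the plot above shows.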
The second classifier we built is logistic regression, with an accuracy of 80.2%, much lower than the random forest classifier. Whenever we build a classifier, our prediction accuracy should aim toward 100%; of course, 100% is practically impossible to achieve given the nature of the data, but the higher the prediction accuracy score, the better your classifier performs.
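Accuracy here simply means the share of customers whose churn label the model got right. As a quick sketch (the labels below are illustrative, not from the bank dataset):

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# 4 of 5 churn labels predicted correctly -> 0.8, i.e. 80%
print(accuracy([1, 0, 0, 1, 0], [1, 0, 1, 1, 0]))
```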
In other words, the better your machine learning model is at predicting churn, the more you will know about your customers, and the better you can adapt your services to their needs. A good classification result can ultimately mean more clients and more stability for your business.
So why is it important to predict churn? Because usually, the longer a customer is with you, the more money they are spending with you. But hopefully through this example you also see that the longer a customer is with you, the more you are learning about them, and about your own company.
When you understand customers' spending habits, you're also learning about your own products: which accounts are performing well and which are not, at what age it is worth spending marketing money on a client and at what age it is a waste. There are answers to these questions; you just need the right data. Because, luckily, data is a science.
Fastest Sorting Algorithm?
Knock knock, it's Quick Sort here… I think you may have heard about me in class, or even from your friends or at work. Anyway, in case you haven't, let me introduce myself.
A lot of people who work with me say that I am the fastest sorting algorithm, with a best-case time complexity of O(N log N) and a worst-case time complexity of O(N²). What I mean by this is that most of the time I sort your list very quickly, but I gotta be honest with you: when my pivots are chosen badly, I get a little lazy and tedious, and it takes me longer to sort the list compared to my cousin Merge Sort.
That worst-case time of O(N²) usually happens when the pivot is picked as the smallest or biggest element of your array, so you gotta watch out for that. This condition makes me very slow, so be careful: for better performance, it's smartest to pick pivots away from the extremes (a random pivot works well) rather than the biggest or smallest element of the array.
Oh, how rude of me – don’t let me bore you with the minute details of my life. Instead, let my friend Bianca take you on a trip to one of my clients that is a store owner – she’s going to show you exactly how I sorted his merchandise:
This clothing retail client, Harry, just had a weekend sale, and he wants to find out which brands of shirts sold best so he can stock up for a new promotion. Now, since Harry has over 25 brands of shirts, this is where I shine: Quick Sort algorithm to the rescue!
Now you gotta know a little secret about me! I am also a divide and conquer algorithm, and by divide and conquer I mean I split the big problem into smaller ones, solve those first and then put everything together. Below you can see the steps I follow in order to sort any array.
Step 1 − Pick any element as the pivot
Step 2 − Partition the array around the pivot
Step 3 − Apply quick sort to the left partition recursively
Step 4 − Apply quick sort to the right partition recursively
So I basically split the array into smaller arrays then swap elements inside the array based on the comparison with the “pivot” element.
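The four steps above can be sketched in Python. This out-of-place version trades a bit of extra memory for readability, and uses the first element as the pivot (the choice I recommend below):

```python
def quick_sort(arr):
    """Sort a list using quick sort with the first element as the pivot."""
    if len(arr) <= 1:                       # base case: nothing left to split
        return arr
    pivot, rest = arr[0], arr[1:]           # Step 1: pick a pivot
    left = [x for x in rest if x <= pivot]  # Step 2: smaller elements go left...
    right = [x for x in rest if x > pivot]  # ...bigger elements go right
    # Steps 3 & 4: recurse on each partition, then stitch everything together
    return quick_sort(left) + [pivot] + quick_sort(right)

shirt_sales = [12, 3, 45, 7, 19, 3, 28]
print(quick_sort(shirt_sales))  # → [3, 3, 7, 12, 19, 28, 45]
```

Note that the classic in-place version swaps elements inside the array instead of building new lists, but it follows the exact same four steps.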
But wait, a lot of people ask me… how do I choose the pivot element?
The pivot is chosen arbitrarily: you can pick the first element of your array, the last, or the middle. To keep things simple, I suggest choosing the first element as the pivot, so you can then find its sorted position in the array by following the steps above.
There is one BIG RULE you need to keep in mind!!!!
The aim of me, the quick sort algorithm, is to have all elements to the left of the pivot smaller than the pivot and all elements to the right of the pivot bigger than the pivot.
As long as you follow this rule you will be on your way to sorting your array the proper way using the Quick Sort algorithm.
As the world is constantly on the lookout for new technological advancements, advertisement recommendations for sales leads have become a necessity for new and established businesses to differentiate themselves from their competition.
This project explores Bayesian Marketing Mix Modelling and reports the findings of the resulting machine learning model, with the scope of predicting sales and recommending the most appropriate advertisement strategy for sales leads.
Bayesian inference is a method of assigning a probability distribution to a parameter. These probability distributions express a degree of uncertainty about the parameter's value. The aim of inference is then no longer to find the single "best" value for the parameters of the model, as in other statistical approaches, but rather to determine the posterior distribution of the parameters. This posterior distribution is determined by a combination of the likelihood of the data and prior knowledge of the parameter distribution, and is described by a simple expression of Bayes' Theorem:
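In its standard form, which matches the description above (the posterior is proportional to the likelihood times the prior), Bayes' Theorem reads:

$$ p(\theta \mid \text{data}) = \frac{p(\text{data} \mid \theta)\, p(\theta)}{p(\text{data})} \propto p(\text{data} \mid \theta)\, p(\theta) $$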
The likelihood distribution summarizes the data by presenting a range of parameter values, each accompanied by the likelihood that that parameter value describes the data. The prior distribution encodes estimates of the parameter values before any data is sampled. If no prior knowledge is available, non-informative estimates, such as a normal distribution, can be used.

Adstock is a method of quantifying the effect an advertisement has on demand over time. It typically depends on two factors: first, the decay effect, by which an advertisement loses impact over time; and second, the fact that demand for a product typically does not scale linearly with the amount of advertising. To take this advertisement saturation into account, the adstock must undergo a transformation. The adstock value can then be defined recursively as:
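A plausible form of this recursion, consistent with the definitions that follow (λ here is a hypothetical decay rate between 0 and 1, not a symbol taken from the project), is:

$$ A_t = f(x_t; \theta) + \lambda A_{t-1} $$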
where A_t is the adstock at time t, and f(x_t; θ) is the transformation function, which should be chosen to take ad saturation into account. This project utilizes a sigmoid transformation of the adstock.
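A minimal sketch of this recursion (the sigmoid is shifted so that zero spend contributes zero adstock, and `decay` and `theta` are illustrative parameter names, not the project's fitted values):

```python
import math

def saturate(x, theta=1.0):
    """Sigmoid-shaped saturation with saturate(0) == 0: extra spend has diminishing effect."""
    return 2.0 / (1.0 + math.exp(-theta * x)) - 1.0

def adstock(spend, decay=0.5, theta=1.0):
    """Recursive adstock: today's saturated spend plus the decayed carry-over from yesterday."""
    values, carry = [], 0.0
    for x in spend:
        carry = saturate(x, theta) + decay * carry
        values.append(carry)
    return values

# one burst of spend, then nothing: the effect decays away over the following days
print(adstock([1.0, 0.0, 0.0], decay=0.5))
```

The two factors described above are both visible here: `decay` controls how quickly an old advertisement loses impact, and `saturate` bounds how much effect any single day's spend can contribute.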
Seasonality traits are periodic fluctuations, not attributable to the given data, that might increase or decrease sales. Examples include Christmas sales and the like, or weekly occurrences such as the weekend.
The Markov Chain Monte Carlo (MCMC) method is used to approximate the posterior distribution of n parameters through random sampling in an n-dimensional probability space.
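As a sketch of the idea behind MCMC, here is a one-dimensional random-walk Metropolis sampler, the simplest MCMC variant (the project itself may well use a more sophisticated sampler; the target below is a toy standard normal, not the project's posterior):

```python
import math
import random

def metropolis(log_post, start=0.0, steps=20_000, scale=1.0, seed=0):
    """Random-walk Metropolis: collected samples approximate the target posterior."""
    rng = random.Random(seed)
    x, samples = start, []
    for _ in range(steps):
        proposal = x + rng.gauss(0.0, scale)
        # accept with probability min(1, posterior ratio), computed in log space
        accept_prob = math.exp(min(0.0, log_post(proposal) - log_post(x)))
        if rng.random() < accept_prob:
            x = proposal
        samples.append(x)
    return samples

# toy one-parameter posterior: a standard normal (log-density up to a constant)
samples = metropolis(lambda t: -0.5 * t * t)
```

After discarding an initial burn-in, the histogram of the samples approximates the posterior, which is exactly how the parameter distributions in this project are obtained.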
In this project, all of these methods have been combined to predict future sales and to recommend which channels are best to advertise on. Through Bayesian inference, the likelihood of the data has been calculated as:
Using the MCMC method, approximations of the posterior distribution for each of the parameters were found. The effect an advertisement channel has on sales can be evaluated by removing that channel from the model's prediction and noting the effect on the prediction. The drop in sales, compared to the money spent on the channel, can then be taken as an indication of its usefulness.
Intuitively, if the model still predicts well even after the data describing a channel has been removed, that channel has very limited effect on sales. Similarly, if predicted sales drop significantly after removing the advertisement data, the advertisement must have a positive impact on sales.
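This removal test can be sketched with a toy linear sales model (the channel weights and figures below are made up for illustration, not taken from the project's fitted model):

```python
def predicted_sales(impressions, weights, base=100.0):
    """Toy linear sales model: base demand plus weighted channel contributions."""
    return base + sum(weights[ch] * x for ch, x in impressions.items())

def channel_effect(impressions, weights, channel):
    """Percentage drop in predicted sales when one channel is zeroed out."""
    full = predicted_sales(impressions, weights)
    without = predicted_sales({**impressions, channel: 0.0}, weights)
    return 100.0 * (full - without) / full

spend = {"tv": 50.0, "radio": 20.0, "ooh": 30.0}
weights = {"tv": 2.0, "radio": 1.0, "ooh": 0.01}
for ch in spend:
    print(ch, round(channel_effect(spend, weights, ch), 1), "% drop")
```

In this toy setup, zeroing the heavily weighted TV channel causes a large drop, while zeroing the near-zero-weight 'ooh' channel barely moves the prediction, which mirrors the reasoning applied to the real model below.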
The dataset consists of daily data points spanning three years, describing the amount of money spent on several different types of advertising and their respective impressions. It also contains two external factors: the amount of sunshine, and the EUR/USD exchange rate.
Using a model with all five impression features of the dataset, a Root Mean Squared Error (RMSE) of 36.55 was achieved on unseen data.
Moreover, if the same model is applied to the training set, it achieves an RMSE of 66.04. This result can also be seen in Table 1.
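For reference, RMSE measures the average prediction error in the same units as the sales themselves, and can be computed as (the sample numbers are purely illustrative):

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Squared Error: typical prediction error, in the units of the target."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

print(rmse([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0]))
```

Lower is better, so the 36.55 on unseen data above means the model's predictions are off by roughly that many units of sales on a typical day.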
Table 1 also shows the scores of different model setups on the entire training dataset. The only parameter changed in this experiment was the number of features, and more specifically which ones; the seasonality patterns and prior distributions were left as in the baseline model.
The results of the method in Table 2 showed that removing TV ads dropped sales by ∼57%. Removing radio ads dropped average sales by ∼17%. Removing 'ooh' ads, sales essentially stayed the same. Removing 'searchppc' and 'display' ads in fact increased sales, by ∼3.5% and ∼13.5% respectively.
To find seasonality traits, periodically poor predictions were observed. If an inaccuracy recurs in periodic jumps, it indicates a seasonality trait at play that the model cannot describe. These seasonality traits were manually identified and implemented in the model, resulting in a large improvement in prediction scores. An example of such a trait in the project data is Black Friday: in the data, sales spike on this particular day due to the seasonal event. Moreover, a pattern for each quarter of the year was found, where the same three months of the year show somewhat similar tendencies across all three years of data. These patterns can be seen in figure A.1.
The current model performs well; however, improvements could still be made. Experimentation with hyperparameters has been limited due to time constraints.
However, Table 1 shows results from testing different combinations of features on the same baseline model. It can be seen that 'searchppc', 'ooh', and 'display' have the smallest impact on the model when removed. Working with only the TV and radio impressions, however, had a larger impact on model performance than the first results suggested. The preferred model for this project is therefore the one using all of the impression features.
According to the established model, TV and radio advertisement have the greatest effect on sales. TV ads in particular produce a very large increase in sales: even though the most money was spent on them (∼7,000,000), the drop in sales relative to money spent when removing TV ads from the prediction is very large. Sales did not fall as drastically when removing radio ads, also when compared to the amount of money spent (∼40% less than TV ads). Furthermore, it can be deduced from the model that the ∼3,000,000 spent on 'ooh' ads could have been better spent, as they showed very little influence on sales in our model.
Although our model shows that money is best spent on TV ads, it does not take the saturation effect of ads into account. At some point it would be better to spend additional money on radio ads instead, once the saturation of TV ads is so high that the demand gained per dollar falls below that of radio ads. Finding this point could be an analysis for future work.
Concluding remarks and future work
This project was about Marketing Mix Modelling: predicting future sales and recommending optimal channels for advertisement. It was found that spending money on 'Out Of Home (ooh)' ads was largely a waste of money compared to TV and radio advertising. This was shown both by removing the feature completely from the model and by removing it during predictions.
Due to lack of time, the model does not include two variables from the dataset: sunshine and the EUR/USD exchange rate. The model therefore has no external factors impacting performance. A good next step would be to incorporate these variables. Moreover, more work on finding appropriate priors would also be an ideal next step for the project.
A.1 Seasons – Spring, Summer, Autumn, Winter
A.2 Resulting Graph
Figure 2: Predicted values (blue) compared to true values (red). The Y-axis depicts sales against time on the X-axis.