Cross-selling is a marketing strategy that many companies practice these days. Its main goal is to increase revenue from existing customers by selling them additional product lines. Finance companies in particular, which offer a number of different services, actively implement cross-selling strategies. JPMorgan Chase is a good example: it not only provides banking services but also promotes credit cards and insurance to its existing customers. Cross-selling can be a cost-efficient business tool because companies can use their preexisting customer databases. Moreover, customers tend to show less resistance toward a brand they already associate with. Increasing the chance of a successful cross-sell is therefore a critical business interest for many firms. However, deciding whom to target is still a difficult task: the average response rate in the industry is usually less than 1.5%.
The dataset for this project comes from a life insurance company in India that is trying to sell car insurance to its existing customers. The project's main purpose is to find the most effective predictive model for increasing the chance of cross-selling, using data analytics tools and supervised machine learning methods.
Findings from this project will help the company determine which customer segment to target for the best cross-selling outcome. The insights will also help other finance companies that are trying to effectively target customer segments with a high potential to purchase additional financial services. The research also shows how companies can incorporate machine learning methods to use their data effectively and extract meaningful business insights from it.
Understanding and Cleaning the Data
The dataset used in this project is the Kaggle "Health Insurance Cross Sell Prediction" dataset. It contains 381,109 rows and 12 variables.
Before starting the data cleaning, I checked the structure of the data.
Since the main purpose of this project is to predict whether a customer will respond or not, I use "Response" as the dependent variable. The rest of the variables, excluding "id", serve as predictors, so the "id" column will be eliminated. The categorical variables other than "Region_Code" and "Policy_Sales_Channel" will be converted into dummy variables rather than factors, which helps the models run more efficiently. Data exploration will then determine which parts of the data can be eliminated. I drew a histogram of each variable (Exhibit 1) to see the distribution of the data.
- Gender: There are slightly more males than females.
- Age: Most customers are under 30.
- Driving License: Most customers have a driving license.
- Region Code: Most customers are in region codes 28 and 8. We could eliminate region codes with few customers from our analysis.
- Previously Insured: Slightly more customers are not insured than insured.
- Vehicle Age: Most customers' vehicles are less than 2 years old.
- Vehicle Damage: Slightly more than half of the customers have experienced vehicle damage.
- Annual Premium: Customers are mainly divided into two groups: a low annual plan and a high annual plan. Since the distribution is skewed, we could apply a log transformation (Exhibit 2).
- Policy Sales Channel: Except for 3 channels, most channels don't have a meaningful customer base. We could eliminate the insignificant policy channels.
- Vintage: Vintages are almost equally distributed.
- Response: Approximately 14% responded.
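The preprocessing steps described above — 0/1 dummies for the binary categoricals and a log transform for the skewed premium — can be sketched as follows. The actual analysis was done in R; this is an illustrative pandas equivalent, and the toy rows below are made up (only the column names follow the Kaggle dataset).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Kaggle data; values are invented for illustration.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Vehicle_Age": ["< 1 Year", "1-2 Year", "> 2 Years"],
    "Vehicle_Damage": ["Yes", "No", "Yes"],
    "Annual_Premium": [2630.0, 40454.0, 38294.0],
})

# Binary categoricals become 0/1 dummies instead of factor columns.
df["Gender"] = (df["Gender"] == "Male").astype(int)
df["Vehicle_Damage"] = (df["Vehicle_Damage"] == "Yes").astype(int)

# Multi-level categorical -> one dummy per level (dropping one baseline).
df = pd.get_dummies(df, columns=["Vehicle_Age"], drop_first=True)

# Log transform to reduce the right skew in Annual_Premium (Exhibit 2).
df["Log_Annual_Premium"] = np.log(df["Annual_Premium"])
```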
I also analyzed the relationship between "Response" and the continuous variables using boxplots (Exhibit 3).
According to the results, older people tended to respond more, so we can assume that age has a meaningful impact on response. There was no significant difference in response rate in terms of annual premium or vintage, so we could assume annual premium and vintage are not significant predictors. I also checked the relationships between the continuous variables using scatter plots (Exhibit 4).
Judging from the trendlines, there are no significant correlations between the continuous variables. Based on the findings from the EDA, I decided to eliminate rare values in "Region_Code" and "Policy_Sales_Channel". I checked the frequencies to decide which region codes and policy channels to eliminate, and settled on removing codes and channels with fewer than 2,000 customers. The distributions of "Region_Code" and "Policy_Sales_Channel" became much simpler and clearer, as shown in Exhibit 5.
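The frequency filter just described can be sketched in pandas as below. The original work was done in R; the frame here is a made-up frequency pattern, but the cutoff of 2,000 customers is the one actually used.

```python
import pandas as pd

# Invented frequency pattern: two large region codes and one rare one.
df = pd.DataFrame({
    "id": range(4750),
    "Region_Code": [28] * 2500 + [8] * 2100 + [3] * 150,
})

# Keep only region codes with at least 2,000 customers.
counts = df["Region_Code"].value_counts()
keep = counts[counts >= 2000].index
df = df[df["Region_Code"].isin(keep)]
```

The same `value_counts` filter applies unchanged to "Policy_Sales_Channel".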
This reduced the number of rows from 381,109 to 332,006. As mentioned earlier, "Region_Code" and "Policy_Sales_Channel" are categorical, so I changed their data type to factor. I also eliminated the unneeded "id" column. The final version of the data I will work on is as below.
Lastly, I ran a correlation matrix (Exhibit 6) to see the correlations between the variables.
"Vehicle_Age" and "Age" have high positive correlations with "Response", and "Previously_Insured" has a negative correlation with it. Among the predictors, "Vehicle_Age vs. Age", "Vehicle_Damage vs. Age", and "Vehicle_Damage vs. Vehicle_Age" show high positive correlations, while "Channel vs. Age", "Previously_Insured vs. Age", "Vehicle_Damage vs. Previously_Insured", and "Channel vs. Vehicle_Age" show high negative correlations. Highly correlated predictors could be eliminated, but I will make that decision after running the models.
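A correlation matrix of this kind can be reproduced with pandas' `corr()`. The frame below is a toy stand-in for the cleaned predictors of Exhibit 6; the values are invented and only meant to mimic the signs reported above.

```python
import pandas as pd

# Toy numeric frame mimicking the reported pattern: Age and Vehicle_Age
# move together, Previously_Insured moves against Response.
df = pd.DataFrame({
    "Age": [23, 45, 31, 60, 52],
    "Vehicle_Age": [0, 2, 1, 2, 2],
    "Previously_Insured": [1, 0, 1, 0, 0],
    "Response": [0, 1, 0, 1, 1],
})

# Pairwise Pearson correlations between every pair of columns.
corr = df.corr()
```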
Since the dataset has a very large number of observations, I randomly selected 20,000 of them. I then partitioned the data into a training set (70%) and a test set (30%); models are built on the training data and evaluated on the test data. For partitioning, I used the "createDataPartition" function so that the training and test sets have the same ratio of response values.
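In scikit-learn terms, the stratified split performed by caret's `createDataPartition` can be sketched with `train_test_split(..., stratify=...)`. The frame below is a toy example with roughly the 14% response rate seen in the data; the real workflow first downsampled to 20,000 rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a ~14% response rate (14 responders out of 100).
df = pd.DataFrame({"Age": range(100), "Response": [1] * 14 + [0] * 86})

# stratify= keeps roughly the same response ratio in both splits,
# playing the role of caret's createDataPartition.
train, test = train_test_split(
    df, test_size=0.3, stratify=df["Response"], random_state=1
)
```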
With the training data, I ran a penalized logistic regression using the lasso method. The lasso is a powerful tool because it penalizes insignificant variables, shrinking their coefficients toward zero.
As seen in the fitting graph above, there are two dotted lines. The left one marks the lambda that gives the minimum binomial deviance; this is lambda.min. The right one marks the lambda that minimizes the number of predictors while still enabling accurate prediction; this is lambda.1se. The advantage of using lambda.1se is that it mitigates overfitting. I used both lambdas to build models on the training data and tested them on the test data. I did not rely solely on accuracy, since it reflects neither sensitivity nor specificity; instead, I used AUC (Area Under the Curve) as the main tool for evaluating each model.
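The report's lasso fit uses R's glmnet. A scikit-learn analogue is sketched below on synthetic data (the real predictors are not reproduced here). Note one difference: `LogisticRegressionCV` selects only the lambda.min analogue (the best mean cross-validated score); the 1se rule would have to be applied manually.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data: only the first two of five features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=2000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# L1 penalty mimics glmnet's lasso; CV over a grid of 10 C values
# (C is the inverse of glmnet's lambda).
model = LogisticRegressionCV(
    penalty="l1", solver="liblinear", Cs=10, cv=5, scoring="roc_auc"
).fit(X_tr, y_tr)

# Evaluate by AUC on the held-out test split, as in the report.
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```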
Model1: Penalized Logistic Regression Analysis using lambda.min (lambda = 0.0006642758)
- Result: Accuracy: 0.875 / AUC: 0.859
Model2: Penalized Logistic Regression Analysis using lambda.1se (lambda = 0.005143246)
- Result: Accuracy: 0.875 / AUC: 0.859
As seen in the above results, Model2's accuracy and AUC are the same as Model1's. However, since Model2 has fewer predictors and is simpler than Model1, I decided to keep Model2. I also ran a basic logistic regression and random forests, as shown below.
Model3: Logistic Regression (using all predictors)
- Result: Accuracy: 0.8748 / AUC: 0.86
Model4: Random Forest (using all predictors)
- Result: Accuracy: 0.8703 / AUC: 0.840
Model5: Random Forest (using selected predictors)
- Result: Accuracy: 0.8748 / AUC: 0.775
As seen in the above results, Model3 performs almost the same as Model2, but since it uses more predictors, it is not preferable. Model4 and Model5 underperformed Model2. Therefore, I chose Model2 as the final model.
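A model comparison of this kind — fitting several classifiers on the same split and ranking them by test AUC — can be sketched as below. The data are synthetic and the results are not the report's; this only illustrates the comparison loop (the report's forests were fit in R).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Same kind of synthetic signal as before: two informative features.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=2000) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Fit each candidate on the training split and score it on the test split.
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=1),
}
aucs = {
    name: roc_auc_score(y_te, m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
    for name, m in models.items()
}
```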
According to Model2, which used the lambda.1se value, "Age", "Driving_License", "Region_Code", "Previously_Insured", "Vehicle_Age", "Vehicle_Damage", and "Policy_Sales_Channel" turned out to be significant variables (Exhibit 7). Specifically, among the region codes, codes 3, 11, and 28 turned out to be important, and among the policy sales channels, channels 1152, 1156, 1157, and 1160 turned out to be significant.
Among these variables, "Age" and "Previously_Insured" have negative coefficients. Based on this result, I recommend the company target younger customers and focus on those who were not previously insured. On the contrary, "Driving_License", "Vehicle_Damage", and "Vehicle_Age" had very high positive coefficients, so I recommend targeting customers who hold a driving license and have experienced vehicle damage; owners of older cars would also be a good target. In terms of region code, codes 3, 11, and 28 had positive coefficients, so I recommend targeting customers living in those regions. Lastly, among the policy sales channels, channels 1152, 1156, and 1160 had negative coefficients. Based on this result, the company should either avoid these channels for cross-selling purposes or retrain the workers in charge of them. On the contrary, channel 1157 showed a positive coefficient, so I recommend strengthening the cross-selling effort in that channel.
Since the data contains no sensitive personal information, I believe there are no ethical issues. However, there could be some risks associated with deploying the model. The "Vehicle_Age" variable is imperfect in the sense that it does not distinguish among cars older than 2 years, so it is hard to generalize vehicle age as a significant predictor.
The insights from this project could be helpful for life insurance companies that also sell car insurance or plan to enter the car insurance market. The findings will help them figure out which predictors to use to target customers effectively. Also, since many companies do not use region or channel information to target customers, the model in this project could be a good reference for improving their prediction models.