How to best predict used car prices

Hongguin J. Kim
Nov 23, 2020

When I first came to the US, one of the most difficult tasks was buying a car. There were so many options to choose from, from expensive brand-new German cars to reasonably priced used Japanese cars. I also had to judge whether the asking prices were fair. There were sources like TrueCar and CarMax, but they gave no clear indication of how they arrived at those price tags! This made me wonder whether a model could produce a good estimate of a car's price, and that is how we decided to build a price prediction model for used cars.

Business Understanding

Last year, 57.8 million cars were sold in the US, and 40.8 million of them were used cars. Estimating the price of a car is not an easy task for retailers, buyers, or sellers, because many different variables need to be considered. For example, buyers tend to weigh mileage before purchasing, since it affects both the maintenance cost and the life expectancy of the vehicle. Sellers likewise consider mileage, along with other variables, to reach a fair asking price. One solution would be to ask for an expert's opinion; however, each expert weights variables according to his or her own preferences, which can lead to an unfair judgement of a car's price. For this data analysis project, our team of data analysts worked with a dataset to find a model that accurately estimates the price of a car. We then investigated the used-car market in different states to find out which states had thriving markets where it was easier to sell a car, and whether there existed a meaningful difference between the prices of cars sold in different states. This data mining solution helps sellers establish a fair price to offer when a buyer wants to purchase a car.

Understanding the data

We used the “Used Cars Dataset” from Kaggle to construct our model and carry out the rest of the analysis. This large dataset contains 423,857 records and 25 variables/columns, including price, year, manufacturer, model, condition, cylinders, fuel, and transmission (Exhibit 1). The variable we prioritized and inspected was the “price” of a sold car, since predicting it is the ultimate goal of the model. To do so, we needed a model that predicts price using only the most relevant predictor variables. The data contains two types of variables: factors and numbers (Exhibit 1). Factors are categorical variables with different levels, and in our analysis each level is treated as a dummy variable; numbers are continuous variables and are treated as such. Before cleaning the data, we excluded the variables “id”, “url”, “region_url”, “vin”, and “image_url” because we deemed them irrelevant to this analysis.

Exhibit 1
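
This first preparation step can be sketched in Python with pandas. The frame below is a tiny mock with invented values (the real dataset has 423,857 rows and 25 columns); only the column names follow the dataset.

```python
import pandas as pd

# Tiny mock of the Kaggle listing data; values are invented, column
# names follow the dataset.
df = pd.DataFrame({
    "id": [1, 2],
    "url": ["u1", "u2"],
    "region_url": ["r1", "r2"],
    "vin": ["v1", "v2"],
    "image_url": ["i1", "i2"],
    "price": [15000, 22000],
    "manufacturer": ["bmw", "toyota"],
})

# Drop the columns we deemed irrelevant to price prediction.
irrelevant = ["id", "url", "region_url", "vin", "image_url"]
df = df.drop(columns=irrelevant)
print(list(df.columns))  # ['price', 'manufacturer']
```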

Cleaning the data

It is important to understand which variables add value to the analysis and which may be redundant. Domain knowledge plays a vital role at this stage, since a solid understanding of the industry helps in deciding which variables are relevant and which are not.

In addition to the variables above, we eliminated the “county” column because all of its cells contained NULL values. We also eliminated the “size” column, because 75% of its values were NULL and its content was uninformative, and the “description” column, which contained the free text accompanying each listing and adds no value to the analysis. The “region” column was dropped as well, since the “state” column already provides location information. The final table of variables used is shown in Exhibit 1.

One of the decisions we needed to make concerned the brand and model of each car. Understandably, brand and model are among the most decisive factors in determining price. However, since we did not have the computing power to handle the full variety of brands and models, we decided to restrict the analysis to German and Japanese cars and to drop the model variable. We also excluded “Porsche”, which had only 6 records. Other points required filtering as well: for example, 8,572 cars had a price of zero, which makes no sense, so we eliminated them. The maximum values of the “price” column also contained some bizarre figures that we assumed were typos, so we kept only cars priced under $200,000, because linear regression is extremely sensitive to outliers.

The main issue in our dataset was missing values; for example, the “condition” column contains 42,610 NULL values. There is a set of solutions to this problem, such as filling cells with the mode or mean of the column, using a neural network to predict the missing values, or eliminating rows with NULL values entirely. Each approach has its own advantages and disadvantages: a neural network could fill the NULL values, but computing them all could take a few days. Given that we had a large dataset, we decided to eliminate every record with a NULL value, which left 29,493 records. The next cleaning step was to eliminate cars priced under $1,000. We removed these 449 records because, considering each car's model and year of manufacture, their listed “price” appeared to be a monthly lease payment rather than a sale price.
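
These filters are straightforward in pandas. The sketch below uses a handful of illustrative rows (the actual dataset is far larger) and combines the NULL removal with the two price cutoffs described above.

```python
import pandas as pd

# Illustrative rows only; the real dataset has hundreds of thousands.
df = pd.DataFrame({
    "price": [0, 500, 15000, 22000, 250000],
    "condition": ["good", None, "excellent", "good", "like new"],
})

df = df.dropna()                  # drop every record with a NULL value
df = df[df["price"] >= 1000]      # drop lease-payment-like prices
df = df[df["price"] < 200000]     # drop implausibly high prices (typos)
print(len(df))  # 2 rows survive
```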

Through EDA on the “year” column, we found a group of classic cars that were few in number but biased our dataset adversely, so we decided to keep only cars produced between 1995 and 2020. Next, we eliminated records whose “cylinders” value was “other”. We took the same approach with the “fuel” variable, deleting 96 records with the value “other”. The “fuel” column also had only 12 records with the value “electric”; we removed these as well, since so few records make it difficult to draw any significant conclusions. For “odometer”, we kept only values below 300,000 miles, because the records beyond that point are few but made the distribution highly skewed. In the “title_status” column, we had 11 records marked “missing” and 6 marked “parts only”; for the same reason, we eliminated these too. In the “transmission” column we removed records with the value “other”, and in the “type” column we removed the 182 records labeled “other” along with the single “bus”.
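
The same range and level filters look like this in pandas; again the rows are invented solely to illustrate the logic (here only “fuel” stands in for the several categorical columns we cleaned).

```python
import pandas as pd

df = pd.DataFrame({
    "year":     [1960, 2005, 2012, 2018, 2015],
    "odometer": [90000, 310000, 80000, 40000, 120000],
    "fuel":     ["gas", "gas", "other", "diesel", "gas"],
})

df = df[df["year"].between(1995, 2020)]   # drop classic cars
df = df[df["odometer"] < 300000]          # drop extreme-mileage records
df = df[df["fuel"] != "other"]            # drop the uninformative "other" level
print(len(df))  # 2
```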

Before constructing our linear regression models, it is advisable to check for high correlation between independent variables to avoid multicollinearity. After creating the correlation matrix (Exhibit 2), we noticed a high correlation between the “odometer” and “year” of the registered cars, which makes sense: older cars typically have higher mileage. We decided to eliminate the “year” variable, as we believe “odometer” is the better predictor of the two.

Exhibit 2: Correlation Matrix
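
A correlation check like this takes one line with pandas. The data below is synthetic, generated so that older cars accumulate more miles, mimicking the relationship we observed.

```python
import numpy as np
import pandas as pd

# Synthetic year/odometer data: each year of age adds ~12,000 miles
# plus noise, so the two variables are strongly (negatively) related.
rng = np.random.default_rng(0)
year = rng.integers(1995, 2021, size=1000)
odometer = (2021 - year) * 12000 + rng.normal(0, 20000, size=1000)

corr = pd.DataFrame({"year": year, "odometer": odometer}).corr()
print(corr.loc["year", "odometer"])  # strongly negative
```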

After eliminating the “year” column, we ended up with 27,202 records and 14 variables. We kept the “lat” and “long” variables to identify thriving markets for selling cars, but we did not use them when developing the models.

Modeling

In our search for the best model to predict the price of sold cars, we developed several candidates. Since this is a regression problem, we built both linear regression and Random Forest models. Before explaining how we constructed them, we should briefly describe the two techniques.

  • Linear regression attempts to model the relationship between two variables and does so by fitting a linear equation to observed data. One variable is considered to be a dependent variable, while the other is explanatory. There is no certainty that one variable causes the other, but analysts should look out for an association between the two before attempting to fit a linear model to observed data.
  • Random Forest is a supervised learning algorithm that can be used for both classification and regression issues. The “forest” that it builds is an ensemble of decision trees, usually trained with the “bagging method”. Put simply, random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.
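
The two model families above can both be fitted in a few lines; the sketch below uses scikit-learn on synthetic data (price driven mostly by mileage, an assumption made purely for illustration), not our actual cleaned dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in: price falls linearly with odometer, plus noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 300000, size=(500, 1))               # odometer
y = 30000 - 0.08 * X[:, 0] + rng.normal(0, 1500, 500)   # price

# Fit one model of each family on the same data.
lin = LinearRegression().fit(X, y)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

print(lin.coef_[0])  # recovers a slope close to the true -0.08
```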

In this pursuit, we developed 2 multiple linear regression models and 3 Random Forest models and compared their results using RMSE and residual analysis. Finally, we chose the two most predictive models and ran cross-validation to select the one most accurate at predicting the price of a car.

We began by constructing a multiple linear regression model using all independent variables, calculating its RMSE, and analyzing its residuals. We checked the distribution of the residuals and how closely the plot of predicted versus actual values followed the y = x line; the closer that plot lies to y = x, the better the model (Exhibit 3). It is also important to bear in mind that the residuals must be approximately normally distributed for the model to be considered well behaved. The “model1” metrics were an R-squared of 0.62 and an RMSE of 5,883; we were not satisfied with this result.

Exhibit 3
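
The holdout evaluation just described, fitting on a training split and computing RMSE and residuals on a test split, can be sketched like this (synthetic data again, with a known noise level so the expected RMSE is clear):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic data with noise of std 1500, so a well-specified model
# should reach an RMSE near 1500.
rng = np.random.default_rng(0)
X = rng.uniform(0, 300000, size=(1000, 1))
y = 30000 - 0.08 * X[:, 0] + rng.normal(0, 1500, 1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

pred = model.predict(X_te)
rmse = np.sqrt(mean_squared_error(y_te, pred))
residuals = y_te - pred   # should be roughly normal, centered on 0
print(rmse)               # close to the noise level of 1500
```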

The second multiple linear regression model was created by running a “stepwise” procedure on the first one. The stepwise results showed that, controlling for all other variables, the “transmission” variable carries no predictive power. The metrics of the second model did not improve significantly over “model1”; however, by Occam's razor, we should adopt the simpler model, which in this case is “model2”.

After the linear regressions, we continued our search with Random Forest models. We created several, tuning different parameters to improve accuracy. Our Random Forest models appeared to beat the linear regressions on RMSE, but to confirm this we needed to run cross-validation on the most promising model from each group. We chose “model2”, the second linear model, and “RandFor1”, which had the lowest RMSE on the test dataset; “RandFor1” gave us an RMSE of 4,643, the best across all the models we constructed.

We ran cross-validation with 10 folds for each model. Below is a brief explanation of why we chose cross-validation as our method of comparison.

• Cross-validation (also called rotation estimation or out-of-sample testing) is one way to ensure a model is robust; it does so by using multiple, sequential holdout samples that together cover all of the data. A holdout sample is used to test the model, unlike the “classical” method of model testing, which uses all of the data for testing. In the classical approach, there is also no certainty about which portion of the data should be adopted as the test set; most datasets are not homogeneous across their entire length, so choosing the wrong chunk could invalidate a perfectly good model.
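
A 10-fold cross-validation of a Random Forest can be sketched with scikit-learn; the data here is the same kind of synthetic mileage/price stand-in used above, not our actual dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic mileage/price data with noise of std 1500.
rng = np.random.default_rng(0)
X = rng.uniform(0, 300000, size=(500, 1))
y = 30000 - 0.08 * X[:, 0] + rng.normal(0, 1500, 500)

# Score the model on 10 sequential holdout folds covering all the data.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(
    RandomForestRegressor(n_estimators=100, random_state=0),
    X, y, cv=cv, scoring="neg_root_mean_squared_error",
)
rmse_cv = -scores.mean()  # average RMSE across the 10 folds
print(rmse_cv)
```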

After running 10-fold cross-validation on our best models, we found that “RandFor1” was the most predictive. We used RMSE and R-squared to compare the two models (Exhibit 4). As a result, we recommend that “RandFor1” be used to predict the price of cars.

Exhibit 4: RMSE
Exhibit 4: R-Squared

After choosing the best model to estimate the price of a car, we can analyze whether there is a meaningful difference between the prices of cars sold in different states. Indeed, there are such differences. In our analysis we considered the coefficients and p-values of all the states in our model. We filtered out the states whose p-values exceeded 0.05 and set Alabama, the baseline state during model creation, to 0. We illustrated these differences by sampling 10 of the states with p-values below 0.05 and plotting their coefficients on a scatter plot (Exhibit 5).

Exhibit 5

Notice that these coefficients are calculated relative to Alabama and carry no meaning on their own. Furthermore, we plotted the latitude and longitude of the places where cars were registered to show the locations with a thriving market for used cars (Exhibit 6).

Exhibit 6

Deployment

The results of our project can assist retailers and buyers in the following ways:

  • Retailers or buyers can use the model to estimate the price of a car more precisely and avoid the confusion that arises from weighing many variables by hand. One could instead rely on experts for price estimation; however, each expert weighs the variables according to his or her own opinion, whereas what we have here is an objective model developed solely from the data.
  • Identifying the states that enjoy a thriving market, together with the insight gained from comparing state coefficients, enables retailers and buyers to make better decisions. For example, one can calculate how much a car's price would change if sold in a different state, so a potential buyer may compare the cost of buying a car in another state and bringing it back to the state where he or she lives. Overall, this analysis sheds light on an aspect of the car market that has been neglected.



Duke University, Fuqua School of Business