Using webscraping & NLP to select predictive features for car sales
Webscraping
Like similar online automobile sales websites, https://www.carsales.com.au allows prospective car buyers in Australia to browse new and second-hand cars listed by private sellers and dealerships. Listings are organized by state, seller and car type, among other features. The well-structured website makes it a suitable candidate for scraping with Python’s Beautiful Soup library, after which we can test whether car sale prices can be predicted from the scraped features. We also use Natural Language Processing, courtesy of Python’s nltk library, to determine which words in the seller’s description of the car are more or less helpful in predicting its price.
I scrape listings for the most popular car brands sold in NSW, Australia:
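As a rough illustration of the scraping step, the sketch below parses a single listing with Beautiful Soup. The HTML snippet, its class names and the `data-key` attribute are all invented for the example and do not reflect the real markup of carsales.com.au; `parse_listing` is likewise a hypothetical helper, not the code used in the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical listing markup: the tags, class names and data-key attributes
# are invented for illustration and are NOT the site's real HTML.
SAMPLE_HTML = """
<div class="listing">
  <h3 class="title">2015 Toyota Corolla Ascent</h3>
  <span class="price">$15,990</span>
  <ul class="specs">
    <li data-key="doors">5</li>
    <li data-key="seats">5</li>
    <li data-key="odometer">62,000 km</li>
  </ul>
  <p class="description">One owner, full service history.</p>
</div>
"""


def parse_listing(html):
    """Extract a flat dict of features from one listing's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    listing = soup.find("div", class_="listing")

    # Strip the currency symbol and thousands separator before converting.
    price_text = listing.find("span", class_="price").get_text(strip=True)
    record = {
        "title": listing.find("h3", class_="title").get_text(strip=True),
        "price": float(price_text.replace("$", "").replace(",", "")),
        "description": listing.find("p", class_="description").get_text(strip=True),
    }

    # Each spec item becomes one column, keyed by its data-key attribute.
    for item in listing.find("ul", class_="specs").find_all("li"):
        record[item["data-key"]] = item.get_text(strip=True)
    return record


record = parse_listing(SAMPLE_HTML)
print(record)
```

A list of such dicts, one per listing, can be passed straight to `pandas.DataFrame` to build the feature table.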
- Toyota
- Mazda
- BMW
- Volkswagen
- Ford
- Audi
- Volvo
and my features of interest include:
- Number of Doors
- Number of Seats
- Engine Size
- Number of Cylinders
- Seller description
- Age of car
- Vehicle Type
- Mileage
which I save in a pandas DataFrame. Sale prices appear to follow a lognormal distribution, so I model the log of the sale price instead, which is approximately normally distributed:
Figure 1: NSW car sale prices appear to follow a lognormal distribution
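The effect of the log transform can be checked numerically. The sketch below uses synthetic prices drawn from a lognormal distribution as a stand-in for the scraped data (the parameters and the column names `price` and `log_price` are assumptions) and compares the skewness before and after taking the log.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the scraped prices; the lognormal parameters are
# arbitrary and chosen only to give plausible-looking values.
cars = pd.DataFrame({"price": rng.lognormal(mean=10.0, sigma=0.6, size=5000)})

# Log-transform the target so that it is roughly symmetric.
cars["log_price"] = np.log(cars["price"])

# Skewness near zero suggests a symmetric, normal-like distribution.
skew_raw = cars["price"].skew()
skew_log = cars["log_price"].skew()
print(f"skew of price: {skew_raw:.2f}, skew of log price: {skew_log:.2f}")
```

The raw prices show a strong right skew, while the log prices are close to symmetric, which is what makes the log price a more comfortable target for a linear model.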
Using the quantitative features scraped from the website to predict the log sale price with an OLS model under 5-fold cross-validation, we arrive at an R-squared of 79%.
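A minimal sketch of this evaluation with scikit-learn is shown below. The data is synthetic (three made-up features and invented coefficients), and shuffling the folds is an assumption; the source does not say how the folds were built, so the scores here say nothing about the real result.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for scraped features (all values are invented).
age = rng.uniform(0, 20, n)                       # years
mileage = age * 12000 + rng.normal(0, 15000, n)   # km, correlated with age
engine = rng.uniform(1.0, 4.0, n)                 # litres
X = np.column_stack([age, mileage, engine])

# Synthetic log price: linear in the features plus Gaussian noise.
log_price = (10.5 - 0.08 * age - 2e-6 * mileage + 0.2 * engine
             + rng.normal(0, 0.25, n))

# 5-fold cross-validation; LinearRegression is ordinary least squares.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, log_price, cv=cv, scoring="r2")
mean_r2 = scores.mean()
print(f"R-squared per fold: {np.round(scores, 3)}, mean: {mean_r2:.3f}")
```

Reporting the mean of the out-of-fold R-squared values, rather than the in-sample fit, gives a less optimistic estimate of how well the model generalizes.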
NLP with Seller Descriptions
Next, I try to fit the residuals of the OLS model using NLP on the seller description column. Using scikit-learn's TfidfVectorizer, I take the top 1000 words by frequency as regression features. I then use a Lasso regression to pare these down to the top 10 words that have some explanatory power on car sale price. I end up explaining 26% of the variance in the residuals of the OLS model above.
Figure 2: Top 10 predictive words in Seller Description
Complete notebook available here