Using webscraping & NLP to select predictive features for car sales

Webscraping

Like similar online automobile sales websites, https://www.carsales.com.au allows prospective car buyers in Australia to view new and second-hand cars for sale from private sellers and dealerships. Listings are organized by state, seller and car type, among other features. The well-structured website makes it a suitable candidate for scraping with Python’s Beautiful Soup library, after which we can test whether car sale prices can be predicted from the scraped features. We then use Natural Language Processing, courtesy of Python’s nltk library, to determine which words in the seller’s description of the car, if any, help predict its price.
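As a rough sketch of the scraping step, the snippet below parses one listing with Beautiful Soup. The HTML, class names and `data-key` attributes are hypothetical stand-ins, since the real page structure of carsales.com.au is not shown in this post, and `parse_listing` is an illustrative helper rather than the code from the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the tag and class names below are assumptions,
# not the actual structure of a carsales.com.au listing page.
SAMPLE_HTML = """
<div class="listing">
  <h2 class="title">2012 Toyota Corolla</h2>
  <span class="price">$12,500</span>
  <ul class="specs">
    <li data-key="doors">4</li>
    <li data-key="seats">5</li>
    <li data-key="odometer">85,000 km</li>
  </ul>
  <p class="description">One owner, full service history.</p>
</div>
"""


def parse_listing(html):
    """Extract a flat dict of features from one (hypothetical) listing."""
    soup = BeautifulSoup(html, "html.parser")
    listing = soup.find("div", class_="listing")

    # Map each spec name to its raw text, e.g. {"doors": "4", ...}
    specs = {
        li["data-key"]: li.get_text(strip=True)
        for li in listing.select("ul.specs li")
    }
    price_text = listing.find("span", class_="price").get_text(strip=True)

    return {
        "title": listing.find("h2", class_="title").get_text(strip=True),
        "price": int(price_text.replace("$", "").replace(",", "")),
        "doors": int(specs["doors"]),
        "seats": int(specs["seats"]),
        "mileage_km": int(specs["odometer"].replace(",", "").replace("km", "").strip()),
        "description": listing.find("p", class_="description").get_text(strip=True),
    }


row = parse_listing(SAMPLE_HTML)
print(row)
```

A list of such dicts can be passed straight to `pandas.DataFrame` to build the feature table.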

I scrape listings for the most popular car brands sold in NSW, Australia:

Toyota
Mazda
BMW
Volkswagen
Ford
Audi
Volvo

and my features of interest include:

Number of Doors
Number of Seats
Engine Size
Number of Cylinders
Seller description
Age of car
Vehicle Type
Mileage

which I save in a pandas dataframe. The distribution of car sale prices appears to be lognormal, so I model the log of the sale price instead, which is approximately normally distributed:

Figure 1: NSW Car Sale prices appear to follow a LogNormal Distribution
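To illustrate why the log transform helps, here is a minimal sketch on synthetic data. The prices are drawn from a lognormal distribution with made-up parameters, not taken from the scraped dataset, and the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for scraped prices: the lognormal parameters are
# arbitrary and chosen only to produce a right-skewed distribution.
rng = np.random.default_rng(42)
df = pd.DataFrame({"price": rng.lognormal(mean=9.5, sigma=0.6, size=5000)})

# The log of a lognormal variable is normally distributed.
df["log_price"] = np.log(df["price"])

# Skewness near zero suggests a roughly symmetric distribution.
raw_skew = df["price"].skew()
log_skew = df["log_price"].skew()
print(f"skew of price:     {raw_skew:.2f}")
print(f"skew of log price: {log_skew:.2f}")
```

The raw prices are strongly right-skewed, while the skewness of the logged prices is close to zero, which is the pattern a histogram of the real data would be checked for.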


Using the quantitative features scraped from the website to predict car sale prices with an OLS model under 5-fold cross-validation, we arrive at an R-squared of 79%.
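A cross-validated OLS fit of this kind can be sketched with scikit-learn as follows. The data here is synthetic, with invented coefficients for age, mileage and engine size, so the score it prints says nothing about the 79% reported above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic features: the ranges and coefficients are illustrative assumptions.
rng = np.random.default_rng(0)
n = 1000
age = rng.uniform(0, 20, n)             # years
mileage = rng.uniform(0, 300_000, n)    # km
engine = rng.uniform(1.0, 4.0, n)       # litres
X = np.column_stack([age, mileage, engine])

# Log price as a linear function of the features plus noise.
log_price = 10.5 - 0.08 * age - 3e-6 * mileage + 0.2 * engine + rng.normal(0, 0.2, n)

# 5-fold cross-validation: fit on four folds, score R-squared on the fifth.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, log_price, cv=cv, scoring="r2")
mean_r2 = scores.mean()
print("R-squared per fold:", np.round(scores, 3))
print("mean R-squared:", round(mean_r2, 3))
```

Shuffling before splitting is a design choice here: scraped listings often arrive grouped by brand or page, and unshuffled folds would then test on brands the model never saw.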


NLP with Seller Descriptions

Next, I try to fit the residuals of the OLS model using NLP on the seller description column. Using the TfidfVectorizer, I take the top 1000 words by frequency as regression features. I then use a Lasso regression to pare these down to the 10 words with the most explanatory power for car sale price. This ends up explaining 26% of the variance in the residuals of the OLS model above.

Figure 2: Top 10 predictive words in Seller Description

Complete notebook available here

Written on November 17, 2017