This is a very common type of dataset and prediction problem, similar to what you may see in your project or workplace. The dataset is available for download on Kaggle, a website that hosts data science competitions.

    Kaggle is an awesome resource for aspiring data scientists or anyone looking to improve their machine learning skills. There is nothing like getting hands-on practice and receiving real-time feedback to help you improve your skills.

    Kaggle provides:

    • Interesting datasets
    • Feedback on how you’re doing
    • A leaderboard to see what’s good, what’s possible, and what’s state-of-the-art
    • Blog posts by winning contestants sharing useful tips and techniques

    Until now all our datasets have been available to download through fastai’s integrated dataset system. However, the dataset we will be using in this chapter is only available from Kaggle. Therefore, you will need to register on the site, then go to the Bluebook for Bulldozers competition page. On that page click “Rules,” then “I Understand and Accept.” (Although the competition has finished, and you will not be entering it, you still have to agree to the rules to be allowed to download the data.)

    The easiest way to download Kaggle datasets is to use the Kaggle API. You can install it by running this in a notebook cell:
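    !pip install kaggle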

    You need an API key to use the Kaggle API; to get one, click on your profile picture on the Kaggle website, choose My Account, then click Create New API Token. This will save a file called kaggle.json to your PC. You need to copy this key to your GPU server. To do so, open the file you downloaded, copy the contents, and paste them into the following cell in the notebook associated with this chapter (e.g., creds = '{"username":"xxx","key":"xxx"}'):

    In [3]:

    creds = ''

    Then execute this cell (this only needs to be run once):

    In [4]:

    cred_path = Path('~/.kaggle/kaggle.json').expanduser()
    if not cred_path.exists():
        cred_path.parent.mkdir(exist_ok=True)
        cred_path.write_text(creds)  # write the pasted credentials to disk
        cred_path.chmod(0o600)       # restrict permissions, as the Kaggle API expects

    Now you can download datasets from Kaggle! Pick a path to download the dataset to:

    In [5]:

    path = URLs.path('bluebook')
    path

    Out[5]:

    Path('/home/jhoward/.fastai/archive/bluebook')


    And use the Kaggle API to download the dataset to that path, and extract it:

    In [7]:

    from kaggle import api  # Kaggle API client; authenticates with ~/.kaggle/kaggle.json on import

    if not path.exists():
        path.mkdir(parents=True)
        api.competition_download_cli('bluebook-for-bulldozers', path=path)
        file_extract(path/'bluebook-for-bulldozers.zip')

    path.ls(file_type='text')

    Out[7]:

      Now that we have downloaded our dataset, let’s take a look at it!

      Look at the Data

      Kaggle provides information about some of the fields of our dataset. The Data page explains that the key fields in train.csv are:

      • SalesID: The unique identifier of the sale.
      • MachineID: The unique identifier of a machine. A machine can be sold multiple times.
      • saleprice: What the machine sold for at auction (only provided in train.csv).
      • saledate: The date of the sale.

      In any sort of data science work, it’s important to look at your data directly to make sure you understand the format, how it’s stored, what types of values it holds, etc. Even if you’ve read a description of the data, the actual data may not be what you expect. We’ll start by reading the training set into a Pandas DataFrame. Generally it’s a good idea to specify low_memory=False unless Pandas actually runs out of memory and returns an error. The low_memory parameter, which is True by default, tells Pandas to only look at a few rows of data at a time to figure out what type of data is in each column. This means that Pandas can actually end up using different data types for different rows, which generally leads to data processing errors or model training problems later.

      Let’s load our data and have a look at the columns:

      In [8]:

      df = pd.read_csv(path/'TrainAndValid.csv', low_memory=False)

      In [9]:

      df.columns

      Out[9]:

      At this point, a good next step is to handle ordinal columns. This refers to columns containing strings or similar, but where those strings have a natural ordering. For instance, here are the levels of ProductSize:

      In [10]:

      df['ProductSize'].unique()

      Out[10]:

      array([nan, 'Medium', 'Small', 'Large / Medium', 'Mini', 'Large', 'Compact'], dtype=object)

      We can tell Pandas about a suitable ordering of these levels like so:

      In [11]:

      sizes = 'Large','Large / Medium','Medium','Small','Mini','Compact'

      In [12]:

      df['ProductSize'] = df['ProductSize'].astype('category')
      df['ProductSize'] = df['ProductSize'].cat.set_categories(sizes, ordered=True)
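
      As a quick check (not part of the original notebook), you can confirm that the ordering took effect: an ordered categorical stores each value as an integer code that follows the declared order, and it supports elementwise comparisons between levels:

      # Codes follow the order given in `sizes`; missing values become -1
      df['ProductSize'].cat.codes
      # Ordered categoricals allow direct comparisons against a level
      df['ProductSize'] < 'Medium'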

      The most important data column is the dependent variable—that is, the one we want to predict. Recall that a model’s metric is a function that reflects how good the predictions are. It’s important to note what metric is being used for a project. Generally, selecting the metric is an important part of the project setup. In many cases, choosing a good metric will require more than just selecting a variable that already exists. It is more like a design process. You should think carefully about which metric, or set of metrics, actually measures the notion of model quality that matters to you. If no variable represents that metric, you should see if you can build the metric from the variables that are available.

      However, in this case Kaggle tells us what metric to use: root mean squared log error (RMSLE) between the actual and predicted auction prices. We need to do only a small amount of processing to use this: we take the log of the prices, so that the RMSE of that value will give us what we ultimately need:

      In [13]:

      dep_var = 'SalePrice'

      In [14]:

      df[dep_var] = np.log(df[dep_var])
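
      To see why this is all we need (a minimal sketch with made-up prices, not from the original notebook): RMSLE on raw values is, by definition, just RMSE computed on their logs, so once the dependent variable is log-transformed, an ordinary RMSE metric measures exactly what Kaggle scores. (This takes RMSLE as the root mean squared difference of logs, matching the np.log used above rather than the log1p variant some definitions use.)

      import numpy as np

      # Hypothetical prices, purely for illustration
      actual = np.array([10_000., 25_000., 60_000.])
      pred = np.array([12_000., 22_000., 65_000.])

      # RMSLE computed directly on the raw prices...
      rmsle = np.sqrt(np.mean((np.log(pred) - np.log(actual)) ** 2))

      # ...equals plain RMSE computed on the log-transformed prices
      log_actual, log_pred = np.log(actual), np.log(pred)
      rmse_on_logs = np.sqrt(np.mean((log_pred - log_actual) ** 2))

      assert np.isclose(rmsle, rmse_on_logs)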

      We are now ready to explore our first machine learning algorithm for tabular data: decision trees.