First Kaggle Competition

August 1, 2015
This was my first competition and I finished (cue the drums) 116th/377. A spectacular failure exceeded only by the failure in my next competition. Nevertheless, here is what I did. Overall I was not all that inaccurate on the validation set, and probably even less inaccurate in real life, surprise surprise. This is a converted IPython notebook; "lessons learned" follow in the next post.
Note: All confusion matrices reflect the performance of the learners on the validation set

Preparation

SciKit

We are using the brand new scikit-learn 0.16.0. Upgrade using:

conda install scikit-learn

Data Preparation

Make sure trainLabels.csv and mix_lbp.csv are both present in the same directory as the .ipynb (this notebook).

Utilities

Developed to facilitate this:

  • tr_utils.py – often-used functions
  • train_files.py – aids in file manipulation (loading features)
  • SupervisedLearning.py – thin wrapper around scikit supervised learning algorithms
  • train_nn.py – neural net using PyBrain

Misc

If <tab> completion is not working, install pyreadline:

easy_install pyreadline

Methodology

We are running three classifiers and blending their results:

  1. SVM
  2. Neural net
  3. Calibrated random forest

mix_lbp.csv contains all the features (described below) used for the experiments, as well as for the final competition submission, for every sample of the training set. For this demo, we split this set 9:1: 90% is used for training and 10% for validation.

We then run three different classifiers on these sets (training on the larger one and validating on the smaller one), and try to blend the results by simple voting in order to decrease the resulting log loss (computed on the validation dataset).
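For reference, the metric we are optimizing is the standard multi-class logarithmic loss,

$$logloss = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log(p_{ij})$$

where $N$ is the number of samples, $M$ the number of classes, $y_{ij}$ is 1 if sample $i$ belongs to class $j$ and 0 otherwise, and $p_{ij}$ is the predicted probability of that assignment.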

Feature Selection

We are training with three types of features:

  1. Features extracted from the binary files as described in the 1dlbp article; these produce a histogram of 256 bins for each file (a sketch of the computation follows the list)
  2. Features extracted from the .asm files; these are binary features indicating whether a given file contains a certain Windows API call, and the 141 top APIs are picked for this purpose
  3. The number of subroutines in each file (one additional feature)

Together these give a 398-dimensional feature vector for every sample.
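To make feature type 1 concrete, here is a minimal sketch of a 1D local binary pattern histogram over a byte stream. The window of 4 neighbors on each side (giving 8 comparison bits, hence 256 bins) and the helper name lbp_histogram are my assumptions, not necessarily what the 1dlbp article prescribes:

import numpy as np

def lbp_histogram(data, neighbors = 4):
    # sketch: for each byte, compare `neighbors` bytes on each side to the
    # center byte; the 8 comparison bits form a code in [0, 255]
    data = np.asarray(data, dtype = np.int16)
    hist = np.zeros(256, dtype = np.int64)
    powers = 2 ** np.arange(2 * neighbors)  # bit weights for the 8 neighbors
    for i in range(neighbors, len(data) - neighbors):
        window = np.concatenate((data[i - neighbors : i], data[i + 1 : i + 1 + neighbors]))
        hist[int(((window >= data[i]) * powers).sum())] += 1
    return hist

# example: histogram of a small random "binary file"
rng = np.random.RandomState(0)
print lbp_histogram(rng.randint(0, 256, size = 1000))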
In [2]:
from SupervisedLearning import SKSupervisedLearning
from train_files import TrainFiles
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import log_loss, confusion_matrix
from sklearn.calibration import CalibratedClassifierCV
from tr_utils import vote
import matplotlib.pylab as plt
import numpy as np
from train_nn import createDataSets, train

An auxiliary function

Let’s define a function that plots the confusion matrix, to see how accurate our predictions really are.

In [3]:
def plot_confusion(sl):
    conf_mat = confusion_matrix(sl.Y_test, sl.clf.predict(sl.X_test_scaled)).astype(dtype='float')
    # normalize each row, so every cell is a fraction of its true class
    norm_conf_mat = conf_mat / conf_mat.sum(axis = 1)[:, None]

    fig = plt.figure()
    plt.clf()
    ax = fig.add_subplot(111)
    ax.set_aspect(1)
    res = ax.imshow(norm_conf_mat, cmap=plt.cm.jet, 
                    interpolation='nearest')
    cb = fig.colorbar(res)
    labs = np.unique(sl.Y_test)
    x = labs - 1

    plt.xticks(x, labs)
    plt.yticks(x, labs)

    # annotate every cell with the percentage it represents
    for i in x:
        for j in x:
            ax.text(i - 0.2, j + 0.2, "{:3.0f}".format(norm_conf_mat[j, i] * 100.))
    return conf_mat

Load data from the text file

Loaded data contains all of the training examples.

NOTE: Actually, almost all: 8 samples are missing, because binary features could not be extracted from them.

In [4]:
train_path_mix = "./mix_lbp.csv"
labels_file = "./trainLabels.csv"
X, Y_train, Xt, Y_test = TrainFiles.from_csv(train_path_mix, test_size = 0.1)

The last line above does the following:

  1. Loads the examples from the csv file, assuming the labels are in the last column
  2. Splits the result into a training and a validation dataset using sklearn magic, based on the test_size parameter (defaults to 0.1), $test\_size \in [0, 1]$
  3. Returns four arrays: training examples, training labels, validation examples, validation labels (a plain-sklearn approximation is sketched below)
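For readers without the helper modules, the same thing can be approximated with plain sklearn (0.16-era API). The column layout is my assumption from the description above; note that train_test_split returns its arrays in a different order than TrainFiles.from_csv:

import numpy as np
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in later versions

def from_csv_sketch(path, test_size = 0.1):
    data = np.loadtxt(path, delimiter = ',')   # labels live in the last column
    features, labels = data[:, :-1], data[:, -1]
    # returns X_train, X_test, Y_train, Y_test
    return train_test_split(features, labels, test_size = test_size)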

Training

Training consists of training three models:

  1. SVM with an RBF kernel
  2. Random forest with the calibration classifier introduced in scikit-learn 0.16.0
  3. Neural net

Train SVM

We neatly wrap this into our $\color{green}{SKSupervisedLearning}$ class. The procedure is simple:

  1. Instantiate the class with the tuple returned by the TrainFiles instance or method above and the desired classifier
  2. Apply standard scaling (in scikit this is the Z-score scaling, which centers each feature and scales its standard deviation to 1). NOTE: this is what the SVM classifier expects
  3. Set training parameters
  4. Call $\color{green}{fit\_and\_validate()}$ to retrieve the $\color{green}{log\_loss}$. This function computes the log loss on the validation dataset. It also returns the training log loss, which may be interesting but is not our goal. Either way, it is going to be spectacularly small. 🙂 (A sketch of what these calls boil down to follows the list.)
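Under the hood, steps 2 and 4 amount to something like the following minimal sketch, assuming a classifier that supports predict_proba; the actual SKSupervisedLearning internals may differ:

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import log_loss

def fit_and_validate_sketch(clf, X, Y_train, Xt, Y_test):
    # Z-score scaling: fit on the training set only, apply to both sets
    scaler = StandardScaler().fit(X)
    X_scaled, Xt_scaled = scaler.transform(X), scaler.transform(Xt)

    clf.fit(X_scaled, Y_train)
    # log loss on the training and the validation set
    return (log_loss(Y_train, clf.predict_proba(X_scaled)),
            log_loss(Y_test, clf.predict_proba(Xt_scaled)))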
In [5]:
sl = SKSupervisedLearning(SVC, X, Y_train, Xt, Y_test)
sl.fit_standard_scaler()
sl.train_params = {'C': 100, 'gamma': 0.01, 'probability' : True}
ll_trn, ll_tst = sl.fit_and_validate()

print "SVC log loss: ", ll_tst
SVC log loss:  0.0500220173885

You can play with the parameters here to see how the log loss changes. SKSupervisedLearning also wraps the sklearn grid search technique for finding optimal parameters in one call; you can take a look at the implementation details (a plain-sklearn equivalent is sketched below).
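In plain 0.16-era sklearn the search would look roughly like this; the grid values here are illustrative, not the ones used in the competition:

from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in later versions

param_grid = {'C': [1, 10, 100], 'gamma': [0.001, 0.01, 0.1]}
# log-loss scoring requires probability estimates, hence probability=True
gs = GridSearchCV(SVC(probability = True), param_grid, scoring = 'log_loss', cv = 3)
gs.fit(sl.X_train_scaled, Y_train)
print gs.best_params_, gs.best_score_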

Let’s plot the confusion matrix to see how well we are doing (the values inside the squares are percentages). Change the magic below to %matplotlib qt to get an out-of-browser graph.

In [6]:
%matplotlib inline
conf_svm = plot_confusion(sl)

As expected, we are not doing so well in class 5, where there are very few samples.

Train Neural Net

This is a fun one, I promise. 🙂

The neural net is built by PyBrain and has just one hidden layer, whose size is $\frac{1}{4}$ of the input layer's. The hidden layer activation is sigmoid, the output is softmax (since this is a multi-class neural net), and there are bias units for the hidden and the output layers. We use the PyBrain $\color{green}{buildNetwork()}$ function, which builds the network in one call (sketched below).
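The construction described above looks roughly like this in PyBrain (a sketch; the actual code lives in train_nn.py and may differ):

from pybrain.tools.shortcuts import buildNetwork
from pybrain.structure import SigmoidLayer, SoftmaxLayer

n_in = 398                        # our feature vector dimension
n_hidden = n_in / 4               # single hidden layer, 1/4 of the input
n_out = len(np.unique(Y_train))   # one output unit per class

fnn = buildNetwork(n_in, n_hidden, n_out,
                   hiddenclass = SigmoidLayer,  # sigmoid hidden activation
                   outclass = SoftmaxLayer,     # softmax output (multi-class)
                   bias = True)                 # bias units for hidden and output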

NOTE: We are still using all the scaled features to train the neural net

I am setting %matplotlib to qt so training can be watched in real time. You will see each training epoch charted: the graph on the left shows the % error, the one on the right the log loss.

You can play with the “test error” or “epochs” parameters to control how long it runs. We limit it to just 10 epochs for this experiment.

In [6]:
%matplotlib qt
trndata, tstdata = createDataSets(sl.X_train_scaled, Y_train, sl.X_test_scaled, Y_test)
fnn = train(trndata, tstdata, epochs = 10, test_error = 0.07, momentum = 0.15, weight_decay = 0.0001)
epoch:    1   train error:  6.28%   test error:  8.29%  test logloss: 0.3495  train logloss: 0.2659
epoch:    2   train error:  3.98%   test error:  5.52%  test logloss: 0.2491  train logloss: 0.1615
epoch:    3   train error:  2.89%   test error:  5.16%  test logloss: 0.2025  train logloss: 0.1166
epoch:    4   train error:  2.55%   test error:  4.51%  test logloss: 0.1740  train logloss: 0.0970
epoch:    5   train error:  2.02%   test error:  4.60%  test logloss: 0.1768  train logloss: 0.0806
epoch:    6   train error:  1.71%   test error:  4.05%  test logloss: 0.1618  train logloss: 0.0735
epoch:    7   train error:  1.64%   test error:  3.87%  test logloss: 0.1584  train logloss: 0.0656
epoch:    8   train error:  1.54%   test error:  3.41%  test logloss: 0.1436  train logloss: 0.0600
epoch:    9   train error:  1.23%   test error:  3.13%  test logloss: 0.1382  train logloss: 0.0529
epoch:   10   train error:  1.20%   test error:  3.41%  test logloss: 0.1400  train logloss: 0.0536

Train Random Forest with Calibration

Finally, we train the random forest (which happens to train in seconds) with the calibration classifier (which takes 2 hours or so).

Random forests are very accurate, but the problem is that they make over-confident predictions (or at least that is what predict_proba, the function that is supposed to return the probability of each class, gives us). So, god forbid we are ever wrong! If the forest assigns a probability of 0 to the correct class, the log loss goes to infinity. The calibration classifier makes predict_proba return something sane.
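To see why overconfidence is punished so brutally, look at the per-sample penalty $-\log(p)$ for a few values of the probability $p$ assigned to the true class (in practice sklearn's log_loss clips probabilities, so the loss is merely huge rather than literally infinite):

for p in [0.9, 0.5, 0.1, 1e-5, 1e-15]:
    print "p = %7.0e   -log(p) = %6.2f" % (p, -np.log(p))
# p = 9e-01   -log(p) =   0.11
# p = 5e-01   -log(p) =   0.69
# p = 1e-01   -log(p) =   2.30
# p = 1e-05   -log(p) =  11.51
# p = 1e-15   -log(p) =  34.54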

In [7]:
sl_ccrf = SKSupervisedLearning(CalibratedClassifierCV, X, Y_train, Xt, Y_test)
sl_ccrf.train_params = \
    {'base_estimator': RandomForestClassifier(**{'n_estimators' : 7500, 'max_depth' : 200}), 'cv': 10}
sl_ccrf.fit_standard_scaler()
ll_ccrf_trn, ll_ccrf_tst = sl_ccrf.fit_and_validate()

print "Calibrated log loss: ", ll_ccrf_tst
Calibrated log loss:  0.0614970801316

As you can see, we are simply wrapping the $\color{green}{RandomForestClassifier}$ in the $\color{green}{CalibratedClassifierCV}$. Plot the matrix (after a couple of hours):

In [8]:
%matplotlib inline
conf_ccrf = plot_confusion(sl_ccrf)

Voting

Now we can gather the results of our experiments and blend them. We use a simple weighted voting scheme for that.

The $\color{green}{vote}$ function is implemented in tr_utils.py (a sketch of what it might look like follows).

Here we are trying to balance the weights of the SVM and the calibrated RF.
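tr_utils.py is not reproduced here, but a weighted vote over class-probability matrices presumably amounts to something like this sketch (my assumption, not the actual implementation):

def vote_sketch(probas, weights):
    # probas: list of (n_samples, n_classes) arrays, one per classifier
    # weights: one scalar weight per classifier
    blended = sum(w * p for w, p in zip(weights, probas))
    # renormalize each row to a proper probability distribution
    return blended / blended.sum(axis = 1)[:, None]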

In [9]:
%matplotlib inline
# candidate weights: 1, 1/2, ..., 1/5 for one model, the complement for the other
x = 1. / np.arange(1., 6.)
y = 1 - x

# log loss of each weight pair: lls1 weighs the SVM by x, the RF by y; lls2 the reverse
lls1 = np.zeros((x.size, y.size))
lls2 = np.zeros((x.size, y.size))

for i, x_ in enumerate(x):
    for j, y_ in enumerate(y):
        proba = vote([sl.proba_test, sl_ccrf.proba_test], [x_, y_])
        lls1[i, j] = log_loss(Y_test, proba)

        proba = vote([sl.proba_test, sl_ccrf.proba_test], [y_, x_])
        lls2[i, j] = log_loss(Y_test, proba)

fig = plt.figure()
plt.clf()
ax = fig.add_subplot(121)
ax1 = fig.add_subplot(122)

ax.set_aspect(1)
ax1.set_aspect(1)

res = ax.imshow(lls1, cmap=plt.cm.jet, 
                interpolation='nearest')
res = ax1.imshow(lls2, cmap=plt.cm.jet, 
                interpolation='nearest')

cb = fig.colorbar(res)

The graphs show the log loss of the blended predictions. The matrix on the left blends the SVM and RF probabilities with weights “favoring” the SVM, the one on the right with weights “favoring” the RF.
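To pick the winning combination, look up the smallest entry in either matrix, e.g. something like:

i, j = np.unravel_index(lls1.argmin(), lls1.shape)
print "best blended log loss: %.4f at SVM weight %.2f, RF weight %.2f" % (lls1[i, j], x[i], y[j])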
