Note: All confusion matrices below reflect the performance of the learners on the validation set
Preparation¶
Data Preparation¶
Make sure trainLabels.csv and mix_lbp.csv are both present in the same directory as this notebook (the .ipynb file)
Utilities¶
The following were developed to facilitate this:
- tr_utils.py – frequently used functions
- train_files.py – aids in file manipulation (loading features)
- SupervisedLearning.py – thin wrapper around scikit-learn supervised learning algorithms
- train_nn.py – neural net using PyBrain
Methodology¶
We are running three classifiers and blending their results:
- SVM
- Neural net
- Calibrated random forest
mix_lbp.csv contains all the features (described below) used for the experiments (as well as for the final competition submission). It contains all the samples of the training set. For this demo, we split this set 9:1, where 90% is used for training and 10% for validation.
We then run three different classifiers on these sets (training on the larger one and validating on the smaller one), and try to blend the results by simple voting in order to decrease the resulting log loss (computed on the validation dataset).
Feature Selection¶
We are training with two types of features:
- Features extracted from the binary files as described in the 1dlbp article (see the sketch after this list). These produce a histogram of 256 bins for each file
- Features extracted from .asm files. These are binary features indicating whether a given file contains a certain Windows API; the top 141 APIs are picked for this purpose
- Number of subroutines in each file (one additional feature)

Together these give a 398-dimensional feature vector per sample (256 + 141 + 1 = 398).
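For intuition, here is a minimal sketch of a 1D LBP histogram. This is a hypothetical helper, not the actual extraction code used to build mix_lbp.csv; details such as the comparison operator and window handling may differ from the 1dlbp article.

import numpy as np

def lbp_1d_histogram(bytez, radius = 4):
    """256-bin histogram of 1D local binary patterns over a byte array."""
    hist = np.zeros(256)
    for i in range(radius, len(bytez) - radius):
        center = bytez[i]
        # 4 neighbors on each side of the center byte -> an 8-bit code in [0, 255]
        neighbors = list(bytez[i - radius : i]) + list(bytez[i + 1 : i + radius + 1])
        code = 0
        for b in neighbors:
            code = (code << 1) | (1 if b >= center else 0)
        hist[code] += 1
    return hist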
from SupervisedLearning import SKSupervisedLearning
from train_files import TrainFiles
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import log_loss, confusion_matrix
from sklearn.calibration import CalibratedClassifierCV
from tr_utils import vote
import matplotlib.pylab as plt
import numpy as np
from train_nn import createDataSets, train
An auxiliary function¶
Let’s define a function that plots the confusion matrix to see how accurate our predictions really are.
def plot_confusion(sl):
    conf_mat = confusion_matrix(sl.Y_test, sl.clf.predict(sl.X_test_scaled)).astype(dtype='float')
    # normalize each row so it shows the fraction of samples per true class
    norm_conf_mat = conf_mat / conf_mat.sum(axis = 1)[:, None]

    fig = plt.figure()
    plt.clf()
    ax = fig.add_subplot(111)
    ax.set_aspect(1)
    res = ax.imshow(norm_conf_mat, cmap=plt.cm.jet, interpolation='nearest')
    cb = fig.colorbar(res)

    # class labels are 1-based, tick positions are 0-based
    labs = np.unique(sl.Y_test)
    x = labs - 1
    plt.xticks(x, labs)
    plt.yticks(x, labs)

    # annotate each cell with the percentage of its true class
    for i in x:
        for j in x:
            ax.text(i - 0.2, j + 0.2, "{:3.0f}".format(norm_conf_mat[j, i] * 100.))
    return conf_mat
Load data from the text file¶
The loaded data contains all of the training examples.
NOTE: Actually, almost all: 8 samples are missing because binary features could not be extracted from them.
train_path_mix = "./mix_lbp.csv"
labels_file = "./trainLabels.csv"
X, Y_train, Xt, Y_test = TrainFiles.from_csv(train_path_mix, test_size = 0.1)
The last line above does the following:
- Loads the examples from the csv file, assuming the labels are in the last column
- Splits the result into a training and a validation dataset with scikit-learn, based on the test_size parameter (defaults to 0.1), where $test\_size \in [0, 1]$
- Returns two tuples, (training, training_labels) and (testing, testing_labels), unpacked above into X, Y_train, Xt, Y_test
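For reference, a minimal sketch of what such a loader could look like, assuming a comma-separated file with the label in the last column (the actual implementation lives in train_files.py):

import numpy as np
from sklearn.cross_validation import train_test_split  # scikit-learn 0.16-era module

def from_csv_sketch(path, test_size = 0.1):
    data = np.loadtxt(path, delimiter = ',')
    X, Y = data[:, :-1], data[:, -1]
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = test_size)
    # same ordering as TrainFiles.from_csv above
    return X_train, Y_train, X_test, Y_test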
Training¶
Training consists of fitting three models:
- SVM with an RBF kernel
- Random forest with the calibration classifier introduced in scikit-learn 0.16.0
- Neural net
Train SVM¶
We neatly wrap this into our $\color{green}{SKSupervisedLearning}$ class. The procedure is simple:
- Instantiate the class with the tuple returned in by the TrainFiles instance or method above and the desired classifier
- Apply standard scaling (in scikit-learn this is Z-score scaling, $z = \frac{x - \mu}{\sigma}$ per feature, which centers the samples and scales them to unit variance). NOTE: This is what the SVM classifier expects
- Set training parameters
- Call $\color{green}{fit\_and\_validate()}$ to retrieve the $\color{green}{log\_loss}$. This function computes the log loss on the validation dataset. It also returns the training log loss, which may be interesting but is not our goal. Either way, it is going to be spectacularly small. 🙂
sl = SKSupervisedLearning(SVC, X, Y_train, Xt, Y_test)
sl.fit_standard_scaler()
sl.train_params = {'C': 100, 'gamma': 0.01, 'probability' : True}
ll_trn, ll_tst = sl.fit_and_validate()
print "SVC log loss: ", ll_tst
You can play with the parameters here to see how the log loss changes. SKSupervisedLearning wraps the scikit-learn grid search technique for finding optimal parameters in one call; you can take a look at the implementation details.
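If you prefer to call scikit-learn directly, a minimal stand-alone sketch of such a search (the parameter grid values here are illustrative, not the ones used for the competition):

from sklearn.grid_search import GridSearchCV  # scikit-learn 0.16-era module
from sklearn.svm import SVC

grid = GridSearchCV(SVC(probability = True),
                    param_grid = {'C': [1, 10, 100], 'gamma': [0.001, 0.01, 0.1]},
                    scoring = 'log_loss', cv = 3)
grid.fit(sl.X_train_scaled, Y_train)
print "best params: ", grid.best_params_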
Let’s plot the confusion matrix to see how well we are doing (values inside the squares are percentages). Change the magic below to %matplotlib qt to get an out-of-browser graph.
%matplotlib inline
conf_svm = plot_confusion(sl)
As expected, we are not doing so well in class 5 where there are very few samples.
Train Neural Net¶
This is a fun one, I promise. 🙂
The neural net is built with PyBrain and has just one hidden layer, sized at $\frac{1}{4}$ of the input layer. The hidden layer activation is sigmoid, the output layer is softmax (since this is a multi-class neural net), and there are bias units for the hidden and output layers. We use the PyBrain $\color{green}{buildNetwork()}$ function, which builds the network in one call (a sketch follows the note below).
NOTE: We are still using all the scaled features to train the neural net
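A minimal sketch of such a network with PyBrain’s $\color{green}{buildNetwork()}$. The layer sizes assume our 398 features and the competition’s 9 classes; the exact arguments live in train_nn.py.

from pybrain.tools.shortcuts import buildNetwork
from pybrain.structure.modules import SigmoidLayer, SoftmaxLayer

n_in = 398           # size of the feature vector
n_hidden = n_in / 4  # one hidden layer, 1/4 of the input layer
n_out = 9            # number of malware classes

net = buildNetwork(n_in, n_hidden, n_out,
                   hiddenclass = SigmoidLayer,
                   outclass = SoftmaxLayer,
                   bias = True)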
I am setting %matplotlib to qt so training can be watched in real time. You will see each training epoch charted: the graph on the left shows % error, the one on the right shows log loss.
You can play with the test_error and epochs parameters to control how long it runs. We limit it to just 10 epochs for this experiment.
%matplotlib qt
trndata, tstdata = createDataSets(sl.X_train_scaled, Y_train, sl.X_test_scaled, Y_test)
fnn = train(trndata, tstdata, epochs = 10, test_error = 0.07, momentum = 0.15, weight_decay = 0.0001)
Train Random Forest with Calibration¶
Finally, we train the random forest (which happens to train in seconds) with the calibration classifier (which takes 2 hours or so).
Random forests are very accurate, but the problem is that they make over-confident predictions (or at least that is what the predict_proba function, which is supposed to return per-class probabilities, gives us). So, god forbid we are ever wrong! If the forest assigns a probability of 0 to the correct class, the log loss goes to infinity. The calibration classifier makes predict_proba return something sane.
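To see why a single over-confident mistake is so costly, recall the multi-class log loss, where $N$ is the number of samples and $p_{i,y_i}$ is the predicted probability of sample $i$'s true class $y_i$:

$$logloss = -\frac{1}{N}\sum_{i=1}^{N} \log\, p_{i,y_i}$$

A prediction with $p_{i,y_i} = 0$ contributes $-\log 0 = \infty$. (In practice scikit-learn’s log_loss clips probabilities away from 0, so you get a very large finite penalty rather than a literal infinity.)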
sl_ccrf = SKSupervisedLearning(CalibratedClassifierCV, X, Y_train, Xt, Y_test)
sl_ccrf.train_params = \
{'base_estimator': RandomForestClassifier(**{'n_estimators' : 7500, 'max_depth' : 200}), 'cv': 10}
sl_ccrf.fit_standard_scaler()
ll_ccrf_trn, ll_ccrf_tst = sl_ccrf.fit_and_validate()
print "Calibrated log loss: ", ll_ccrf_tst
As you can see, we are simply wrapping the $\color{green}{RandomForestClassifier}$ in the $\color{green}{CalibratedClassifierCV}$. Plot the matrix (after a couple of hours):
%matplotlib inline
conf_ccrf = plot_confusion(sl_ccrf)
Voting¶
Now we can gather the results of our experiments and blend them. We use a simple weighted voting scheme for that.
The $\color{green}{vote}$ function is implemented in tr_utils.py; a sketch of the idea follows.
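Conceptually, a weighted soft vote is just a weighted average of the classifiers’ probability matrices. A minimal sketch (a hypothetical helper; the real implementation in tr_utils.py may differ):

import numpy as np

def vote_sketch(probas, weights):
    # probas: list of (n_samples, n_classes) arrays from predict_proba
    # weights: one scalar per classifier
    blended = np.average(np.array(probas), axis = 0, weights = weights)
    # re-normalize rows so each sample's class probabilities sum to 1
    return blended / blended.sum(axis = 1)[:, None]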
Here we are trying to balance the weights of SVM and calibrated RF.
%matplotlib inline
x = 1. / np.arange(1., 6.)
y = 1. - x

# blended log losses: lls1 favors SVM, lls2 favors the calibrated RF
lls1 = np.zeros((x.size, y.size))
lls2 = np.zeros((x.size, y.size))

for i, x_ in enumerate(x):
    for j, y_ in enumerate(y):
        proba = vote([sl.proba_test, sl_ccrf.proba_test], [x_, y_])
        lls1[i, j] = log_loss(Y_test, proba)
        proba = vote([sl.proba_test, sl_ccrf.proba_test], [y_, x_])
        lls2[i, j] = log_loss(Y_test, proba)

fig = plt.figure()
plt.clf()
ax = fig.add_subplot(121)
ax1 = fig.add_subplot(122)
ax.set_aspect(1)
ax1.set_aspect(1)
res = ax.imshow(lls1, cmap=plt.cm.jet, interpolation='nearest')
res1 = ax1.imshow(lls2, cmap=plt.cm.jet, interpolation='nearest')
cb = fig.colorbar(res1)
The graphs show the “blended” log loss. The matrix on the left blends SVM and RF predictions with weights favoring SVM; the one on the right favors RF.
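To pick the winning weight pair programmatically rather than by eyeballing the heat maps, something like this works:

# index of the smallest blended log loss in the SVM-favoring matrix
i, j = np.unravel_index(lls1.argmin(), lls1.shape)
print "best weights (SVM, RF): ", x[i], y[j], " log loss: ", lls1[i, j]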