Author Archives: fierval

Fundus Image Segmentation

Source Code

On my GitHub This code is wrapped in a class which makes it harder to post to a Notebook (a few too many lines for a post).

Segmenting Fundus Images with kNN

After preprocessing the images as described in a previous post, I tried to extract learnable features by applying the k-nearest neighbors algorithm. In order to seed the cluster I have applied my (immense) domain knowledge in the area of retinology. I used the annotations tool that comes with OpenCV. It resides in the applications folder of the OpenCV project created with CMake and built with Visual Studio 2013.

I ran the tool on a small set of images and selected the following regions (to the best of my knowledge acquired after reading articles on retinopathy and watching slides). This produces a text file where each line has the following format:

annotations\1196_right.jpeg $\color{green}{10\ 1318\ 182\ 22}\color{red}{\ 12\ 1356\ 195\ 8}$…

Here after the name of the file are the coordinates of annotated rectangles, where each rectangle is a 4-tuple: $(x, y, \Delta x, \Delta y)$, where $(x, y)$ are the coordinates of the left top corner and the rest are lenghts of the sides of the rectangle.

I have annotated 10 rectangles in the following order:

  • 0 – 1: drusen/exudates
  • 2 – 3: texture
  • 4 – 5: Camera effects
  • 6 – 7: Haemorages/blood vessels
  • 8 – 9: Optic Disc

and did these annotations for 7 images selected at random

I used k-nearest neighbors algorithm to segment each image:

In [17]:
%matplotlib inline
import matplotlib.pylab as plt
%run "../regions_detect_knn.py"
In [18]:
# path to the images preprocessed by histogram specification from 16_left.jpeg
root = '/kaggle/retina/train/labelled' # root of a tree, where each subdirectory is a class label of the image (0 - 4)
annotations = 'annotations.txt' # annotations file
masks_dir = '/kaggle/retina/train/masks' # masks produced during preprocessing, identifying the eye
im_file = '4/16_left.jpeg' # preprocessed image
In [19]:
knn = KNeighborsRegions(root, im_file, annotations, masks_dir, n_neighbors = 3)
show_images([knn._image])

In order to seed the clusters, I read the annotatations from all ten regions of the seven files, and for each region stacked them together, averaged the images and then averaged the resulting image. I thus obtained a 1×1 seed for each of the 5 classes I wanted to learn, 2 seeds for each class (i.e., in the current scheme of things, two neighbors vote for the same class) :

In [20]:
# these functions are not called directly by users of the class
knn._rects = np.array([])
knn._avg_pixels = np.array([])
knn._process_annotations()
knn._get_initial_classes()
# this is what our seeds look like
knn.display_average_pixels()

(These do look like swatches of fabric or paint samples…)

Now we simply apply the KNeighborsClassifier() from scikit-learn to learn our classes:

In [21]:
# first need to set the class labels
knn.labels = [Labels.Drusen, Labels.Drusen, Labels.Background, Labels.Background, 
                    Labels.CameraHue, Labels.CameraHue, 
                    Labels.Haemorage, Labels.Haemorage, Labels.OD, Labels.OD]
segments = knn.analyze_image()
In [24]:
# segmented image
plt.imshow(segments)
plt.show()

Let’s see how we did on regions of interest (drusen, haemorages):

In [26]:
knn.display_artifact(segments, Labels.Drusen, (0, 255, 0), "Drusen")
knn.display_artifact(segments, Labels.Haemorage, (0, 0, 255), "Haemorages")

So kinda ok but very far from ideal. The “Drusen” image shows lots and lots of false positives and the haemorages image includes blood vessels and fovea. I took some steps to resolve these problems (detecting blood vessels, removing bright areas adjacent to the optical disk), but either there was not enough time or the selected methodology was not suitable for this, I was not very succesful at coming up with good features this way.

Nevertheless, in the words of my former colleage (a scientist): “succesful experiment with negative results”.

In [ ]:
 

Fast Image Pre-processing with OpenCV 2.4, C++, CUDA: Memory, CLAHE

Previous couple of posts describe some retina images pre-processing with OpenCV and IPython notebooks. Python is great but having to pre-process about 88,000 images (35,000 train and 53,000 test) I had my doubts about how long it would all take. Besides, I am a huge fan of CUDA, I have a GTX Titan GPU and I am not afraid to use it! OpenCV 2.4.11 in Python, unfortunately, does not support CUDA, so we turn to C++.

Installation

Just a quick note. There are plenty of guides on OpenCV installation. Checking out the sources and building with CMake is a breeze. It is also required for CUDA support.

Sources

All sources for this are on my github.

Preprocessing Steps

We do the following preprocessing on each image:
  1. Find eye contours, create mask
  2. Color transfer via histogram specification
  3. Convert to HSV, apply CLAHE on value channel
  4. Convert back to BGR (this OpenCV, so not RGB, but BGR), apply mask to filter out the background
I will only highlight the more interesting parts of this process briefly.

Data Structures, Memory Management

Unlike Python, where OpenCV images are stored in NumPy arrays, in C++ OpenCV 2.4 uses Mat and GpuMat. Some algorithms work on GPU, some don’t. That means memory moves between RAM and GPU memory may become an issue, since it is one of the more time consuming operations in GPU development. Straight CUDA code for memory moves is cumbersome, but not a problem in OpenCV. On the other hand, when wrapping all kinds of transforms in a class, trying to keep track of where the current object on which a transform should be performed resides, may become complicated and hard to maintain. My first stab at this was, perhaps, not the most elegant:
  1. class TransformImage
  2. {
  3. protected:
  4.  
  5.     Mat _image; //original
  6.     Mat _enhanced; //after all transforms
  7.     Mat _buf; //buffer for intermediate operations
  8.  
  9.     gpu::GpuMat g_image;
  10.     gpu::GpuMat g_enhanced;
  11.     gpu::GpuMat g_buf; 
  12.  
  13.     inline void MakeSafe() { g_enhanced.copyTo(g_buf); }
  14. ...
  15. }
OpenCV APIs for image transforms take a source and a destination parameter. It was not immediately obvious to me in which cases these two may refer to the same structure, so to be safe, I often made use of the buffer. Of course, accessors now take on an additional role of moving images back and forth between regular memory and the GPU:
  1. void setImage(Mat& image) 
  2.     { image.copyTo(_image); g_image.upload(image); g_image.copyTo(g_enhanced); }
  3. Mat& getImage() { return _image; }
  4. void setChannel(Channels channel) { _channel = channel; }
  5. Mat& getEnhanced() 
  6.     { g_enhanced.download(_enhanced); return _enhanced; }
As the code shows, moving stuff between GPU and RAM is a breeze in OpenCV from the coder’s perspective. We use upload() and download() for RAM -> GPU and GPU -> RAM respectively.

CLAHE on GPU

For a discussion of histogram equalization, see this tutorial.

This is sorta out-of-band for this post, but to dilute all this book-keeping, I thought it would be good to demonstrate at least how to do CLAHE on the GPU. The code does not differ much from a CPU code!

My transform class applies the CLAHE algorithm to a single channel of the image. I have g_oneChannel array that holds different channels of the image. CLAHE is designed to work on one channel (grey scale) images. If we want to apply it to an RGB (or, in the case of OpenCV BGR) images, we first need to convert the image to HSV (or HSI or some such), apply the algorithm on the value (intensity) channel, and merge that channel back into the original image. This is where the channel stuff in my C++ code comes from. Fortunately we can do all of this without leaving the GPU.

  1. gpu::GpuMat& ApplyClahe()
  2. {
  3.     Ptr<gpu::CLAHE> clahe = gpu::createCLAHE();
  4.  
  5.     clahe->setClipLimit(4.);
  6.     clahe->setTilesGridSize(Size(16, 16));
  7.     clahe->apply(g_oneChannel[(int)_channel], g_buf);
  8.  
  9.     g_buf.copyTo(g_oneChannel[(int)_channel]);
  10.  
  11.     gpu::merge(g_oneChannel, g_enhanced);
  12.     return g_enhanced;
  13. }

Notice the use of g_buf to hold an intermediate result to be safe. Not sure if we could have skipped it, I didn’t check.

This code is reminiscent of the Python code (as the tutorial shows):

  1. import numpy as np
  2. import cv2
  3.  
  4. # create a CLAHE object (Arguments are optional).
  5. clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8,8))
  6. cl1 = clahe.apply(img)
  7. return cl1

Here are the “before” and “after” CLAHE images of Kaggle 16_left.jpeg

eyeeye_enhanced

Matched Filters with OpenCV

Creating Custom Filter Banks with OpenCV

Suppose in order to extract curved lines from the image, we create a bank of filters (matched filters) designed to illicit a response from line segments of different orientation. To reconstruct the curves we apply each filter in the bank to the given image and then return the maximum response:

$$I_{res} = \max\limits_{x,y} (I * F_i)$$

where “*” is a convolution operation, $I$ – original image, $F_i$ – filters in the filter bank.

This article uses matched filters based on Gaussian and derivative of Gaussian for blood vessel detection – an important part of retinal image segmentation.

They use a Gaussian of the form:

$$f(x, y)= \frac{1}{\sqrt{2\pi}s}\exp(-\frac{x^2}{2s^2}) – m,\ for\ |x| \leq t\cdot s,\ |y| \leq L/2 \ \ \ (1)$$

Where $m$ – is the averaged value of the filter: $$m = \frac{\int\limits_{-ts}^{ts} \frac{1}{\sqrt{2\pi}s}\exp(-\frac{x^2}{2s^2})dx }{2ts} $$

They also use the first order derivative of this filter (FDOG):

$$g(x, y) = – \frac{x}{\sqrt{2\pi}s^3}\exp(-\frac{x^2}{2s^2}),\ for\ |x| \leq t\cdot s,\ |y| \leq L/2 \ \ \ (2)$$
In [1]:
# Auxillary stuff

%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import cm
import numpy as np
import cv2
import pandas as pd
from math import exp, pi, sqrt
from numbapro import vectorize

def show_images(images,titles=None, scale=1.3):
    """Display a list of images"""
    n_ims = len(images)
    if titles is None: titles = ['(%d)' % i for i in range(1,n_ims + 1)]
    fig = plt.figure()
    n = 1
    for image,title in zip(images,titles):
        a = fig.add_subplot(1,n_ims,n) # Make subplot
        if image.ndim == 2: # Is image grayscale?
            plt.imshow(image, cmap = cm.Greys_r)
        else:
            plt.imshow(cv2.cvtColor(image, cv2.COLOR_RGB2BGR))
        a.set_title(title)
        plt.axis("off")
        n += 1
    fig.set_size_inches(np.array(fig.get_size_inches(), dtype=np.float) * n_ims / scale)
    plt.show()

To create these kernels we first fill the $2ts \times L$ rectangle with values of its $x$ coordinate. Then we create a ufunc using numbapro $\color{green}{@vectorize}$ decorator so that (1) or (2) above could be applied in one shot. Closures appear to be supported by this decorator, so these functions can be generated dynamically, based on the $s$ parameter. Also, since OpenCV $\color{green}{filter2D}$ API uses correlation rather than convolution to apply filters, we need to use $\color{green}{flip}$ before the kernel is returned.

In [2]:
def _filter_kernel_mf_fdog(L, sigma, t = 3, mf = True):
    dim_y = int(L)
    dim_x = 2 * int(t * sigma)
    arr = np.zeros((dim_y, dim_x), 'f')
    
    ctr_x = dim_x / 2 
    ctr_y = int(dim_y / 2.)

    # an un-natural way to set elements of the array
    # to their x coordinate. 
    # x's are actually columns, so the first dimension of the iterator is used
    it = np.nditer(arr, flags=['multi_index'])
    while not it.finished:
        arr[it.multi_index] = it.multi_index[1] - ctr_x
        it.iternext()

    two_sigma_sq = 2 * sigma * sigma
    sqrt_w_pi_sigma = 1. / (sqrt(2 * pi) * sigma)
    if not mf:
        sqrt_w_pi_sigma = sqrt_w_pi_sigma / sigma ** 2

    @vectorize(['float32(float32)'], target='cpu')
    def k_fun(x):
        return sqrt_w_pi_sigma * exp(-x * x / two_sigma_sq)

    @vectorize(['float32(float32)'], target='cpu')
    def k_fun_derivative(x):
        return -x * sqrt_w_pi_sigma * exp(-x * x / two_sigma_sq)

    if mf:
        kernel = k_fun(arr)
        kernel = kernel - kernel.mean()
    else:
       kernel = k_fun_derivative(arr)

    # return the "convolution" kernel for filter2D
    return cv2.flip(kernel, -1) 

And to call this helper:

In [3]:
def fdog_filter_kernel(L, sigma, t = 3):
    '''
    K = - (x/(sqrt(2 * pi) * sigma ^3)) * exp(-x^2/2sigma^2), |y| <= L/2, |x| < s * t
    '''
    return _filter_kernel_mf_fdog(L, sigma, t, False)

def gaussian_matched_filter_kernel(L, sigma, t = 3):
    '''
    K =  1/(sqrt(2 * pi) * sigma ) * exp(-x^2/2sigma^2), |y| <= L/2, |x| < s * t
    '''
    return _filter_kernel_mf_fdog(L, sigma, t, True)
In [4]:
gf = gaussian_matched_filter_kernel(15, 5)
fdog = fdog_filter_kernel(15, 5)

# visualize:
show_images([gf, fdog], ["Gaussian", "FDOG"])

Now we create a bank of these filters, each one rotated by $15^{\circ}$ from the previous one: $F_i=Rotate(F_{i-1}, 15^{\circ})$ Thus we create a bank of 12 such filters.

Rotation can be expressed as an affine transformation. Thus, in order to accomplish this with OpenCV, we first create a rotation matrix and then apply it using corresponding OpenCV APIs.

In [5]:
def createMatchedFilterBank(K, n = 12):
    '''
    Given a kernel, create matched filter bank
    '''
    rotate = 180 / n
    center = (K.shape[1] / 2, K.shape[0] / 2)
    cur_rot = 0
    kernels = [K]

    for i in range(1, n):
        cur_rot += rotate
        r_mat = cv2.getRotationMatrix2D(center, cur_rot, 1)
        k = cv2.warpAffine(K, r_mat, (K.shape[1], K.shape[0]))
        kernels.append(k)

    return kernels

And to see the results:

In [6]:
bank_gf = createMatchedFilterBank(gf, 4)
bank_fdog = createMatchedFilterBank(fdog, 4)

print "Gaussian"
show_images(bank_gf)
print "FDOG"
show_images(bank_fdog)
Gaussian
FDOG
In [7]:
# here are the values for one of these rotated images:
bank_gf[1]
Out[7]:
array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.00706917,  0.02442968,  0.01962757,  0.01276451,
         0.00619412, -0.00023838, -0.0062054 , -0.01145587, -0.01605165,
        -0.01994629, -0.02313657, -0.0257983 , -0.02769874, -0.02931222,
        -0.03055923, -0.03133851, -0.03196213, -0.02729871, -0.00505532],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.00912069,  0.03521829,  0.03231619,  0.02605832,  0.01962757,
         0.01276451,  0.00619412, -0.00023838, -0.0062054 , -0.01145587,
        -0.01605165, -0.01994629, -0.02313657, -0.0257983 , -0.02769874,
        -0.02931222, -0.03055923, -0.03133851, -0.03196212, -0.02729871],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.01008518,
         0.04215763,  0.04169458,  0.03756618,  0.03231619,  0.02605832,
         0.01962757,  0.01276451,  0.00619412, -0.00023838, -0.0062054 ,
        -0.01145587, -0.01605165, -0.01994629, -0.02313657, -0.0257983 ,
        -0.02769874, -0.02931222, -0.03055923, -0.03133851, -0.03196213],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.00971224,  0.04303707,
         0.0461037 ,  0.04496814,  0.04169458,  0.03756618,  0.0323162 ,
         0.02605832,  0.01962757,  0.01276451,  0.00619412, -0.00023838,
        -0.0062054 , -0.01145587, -0.01605165, -0.01994629, -0.02313657,
        -0.0257983 , -0.02769874, -0.02931222, -0.03055923, -0.02937985],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.00802595,  0.03868837,  0.04439882,
         0.04590621,  0.0461037 ,  0.04496814,  0.04169458,  0.03756618,
         0.03231619,  0.02605832,  0.01962757,  0.01276451,  0.00619412,
        -0.00023838, -0.0062054 , -0.01145587, -0.01605165, -0.01994629,
        -0.02313657, -0.0257983 , -0.02769874, -0.0274802 , -0.00668483],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.0055217 ,  0.02927613,  0.03669004,  0.0412676 ,
         0.04439883,  0.04590621,  0.0461037 ,  0.04496814,  0.04169458,
         0.03756618,  0.0323162 ,  0.02605832,  0.01962757,  0.01276451,
         0.00619412, -0.00023838, -0.0062054 , -0.01145587, -0.01605165,
        -0.01994629, -0.02313657, -0.02418591, -0.0060591 ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.00253091,  0.01728239,  0.02524208,  0.03122787,  0.03669004,
         0.0412676 ,  0.04439882,  0.04590621,  0.0461037 ,  0.04496814,
         0.04169458,  0.03756618,  0.0323162 ,  0.02605832,  0.01962757,
         0.01276451,  0.00619412, -0.00023838, -0.0062054 , -0.01145587,
        -0.01605165, -0.01869964, -0.00506112,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        , -0.00029528,
         0.00472601,  0.01156989,  0.01843455,  0.02524207,  0.03122787,
         0.03669004,  0.0412676 ,  0.04439882,  0.04590621,  0.0461037 ,
         0.04496814,  0.04169458,  0.03756618,  0.0323162 ,  0.02605832,
         0.01962757,  0.01276451,  0.00619412, -0.00023838, -0.0062054 ,
        -0.01073988, -0.0035113 ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , -0.00268081, -0.0067271 ,
        -0.00134984,  0.00504108,  0.01156989,  0.01843455,  0.02524208,
         0.03122787,  0.03669004,  0.0412676 ,  0.04439882,  0.04590621,
         0.0461037 ,  0.04496814,  0.04169458,  0.03756618,  0.0323162 ,
         0.02605832,  0.01962757,  0.01276451,  0.00619412, -0.00022348,
        -0.00135743,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        , -0.00446562, -0.01579774, -0.01225513,
        -0.00717557, -0.00134984,  0.00504108,  0.01156989,  0.01843455,
         0.02524207,  0.03122787,  0.03669004,  0.0412676 ,  0.04439882,
         0.04590621,  0.0461037 ,  0.04496814,  0.04169458,  0.03756618,
         0.03231619,  0.02605832,  0.01962757,  0.01196673,  0.00135496,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        , -0.00573721, -0.02212451, -0.02041427, -0.01685092,
        -0.01225513, -0.00717557, -0.00134984,  0.00504108,  0.01156989,
         0.01843455,  0.02524208,  0.03122787,  0.03669004,  0.0412676 ,
         0.04439882,  0.04590621,  0.0461037 ,  0.04496814,  0.04169458,
         0.03756618,  0.0323162 ,  0.02442968,  0.00429353,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [-0.00646024, -0.02627413, -0.02622723, -0.02359948, -0.02041427,
        -0.01685092, -0.01225513, -0.00693303, -0.00134984,  0.00504108,
         0.01156989,  0.01843455,  0.02524207,  0.03122787,  0.03669004,
         0.0412676 ,  0.04439882,  0.04590621,  0.0461037 ,  0.04496814,
         0.04169458,  0.03521829,  0.00706917,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [-0.02878211, -0.02953251, -0.02802574, -0.02622723, -0.02359948,
        -0.02041427, -0.01685092, -0.01225513, -0.00717557, -0.00134984,
         0.00504108,  0.01156989,  0.01843455,  0.02524207,  0.03122787,
         0.03669004,  0.04112527,  0.04439882,  0.04590621,  0.0461037 ,
         0.04215763,  0.00912069,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [-0.0314802 , -0.03070092, -0.02953251, -0.02802574, -0.02622723,
        -0.02359948, -0.02041427, -0.01685092, -0.01225513, -0.00717557,
        -0.00134984,  0.00504108,  0.01156989,  0.01843455,  0.02524208,
         0.03122787,  0.03669004,  0.04112527,  0.04439882,  0.04303707,
         0.01008518,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [-0.01385007, -0.0314802 , -0.03070092, -0.02953251, -0.02802574,
        -0.02622723, -0.02359948, -0.02041427, -0.01685092, -0.01225513,
        -0.00717557, -0.00134984,  0.00504108,  0.01156989,  0.01843455,
         0.02524207,  0.03122787,  0.03669004,  0.03855494,  0.00971224,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ]], dtype=float32)
In [8]:
#Applying the filter bank
def applyFilters(im, kernels):
    '''
    Given a filter bank, apply them and record maximum response
    '''
    images = np.array([cv2.filter2D(im, -1, k) for k in kernels])
    return np.max(images, 0)

Color Transfer with OpenCV, Python

Color Transfer by Histogram Specification

Auxilary functions (described in the previous post)

In [1]:
%matplotlib inline

from matplotlib import pyplot as plt
from os import path
import numpy as np
import cv2
import pandas as pd

# display a list of images with titles
def show_images(images,titles=None, scale=1.3):
    """Display a list of images"""
    n_ims = len(images)
    if titles is None: titles = ['(%d)' % i for i in range(1,n_ims + 1)]
    fig = plt.figure()
    n = 1
    for image,title in zip(images,titles):
        a = fig.add_subplot(1,n_ims,n) # Make subplot
        if image.ndim == 2: # Is image grayscale?
            plt.imshow(image)
        else:
            plt.imshow(cv2.cvtColor(image, cv2.COLOR_RGB2BGR))
        a.set_title(title)
        plt.axis("off")
        n += 1
    fig.set_size_inches(np.array(fig.get_size_inches(), dtype=np.float) * n_ims / scale)
    plt.show()

# finding the eye    
def find_eye(image, thresh = 4, size=256):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Canny edge finder
    edges = np.array([])
    edges = cv2.Canny(gray, thresh, thresh * 3, edges)

    # Find contours
    # second output is hierarchy - we are not interested in it.
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    
    # Now let's get only what we need out of it
    hull_contours = cv2.convexHull(np.vstack(np.array(contours)))
    hull = np.vstack(hull_contours)
    
    def createMask((rows, cols), hull):
        # black image
        mask = np.zeros((rows, cols), dtype=np.uint8)
        # blit our contours onto it in white color
        cv2.drawContours(mask, [hull], 0, 255, -1)
        return mask

    mask = createMask(image.shape[0:2], hull)
    
    # returning the hull to illustrate a few issues below
    return mask, hull

Image Preprocessing

  1. Downsample with Gaussian Pyramid
  2. Find the eye (see previous post).
  3. Apply Histogram Specification to normalize color maps
    1. Compute CDF of the reference image
    2. Compute CDF of the input image
    3. Create a mapping from refrence to image CDF
    4. Realize mapping
  4. Apply the mask (above) to clean up the background

The images supplied vary drastically in their palettes. To enable feature extraction we need to first normalize them to the same color scheme. We pick a reference image and apply color transfer to the rest:

In [2]:
img_path = "/kaggle/retina/train/sample"
# this is the image into the colors of which we want to map
ref_image = cv2.imread(path.join(img_path, "6535_left.jpeg"))
# images picked to illustrate different problems arising during algorithm application
image_names = ["16_left.jpeg", "10130_right.jpeg", "21118_left.jpeg"]
image_paths = map(lambda t: path.join(img_path, t), image_names)
images = map(lambda p: cv2.imread(p), image_paths)

# let's see what we've got
image_titles = map(lambda i: path.splitext(i)[0], image_names)
show_images(images, image_titles, scale = 0.8)

show_images([ref_image], ["reference image"], scale = 0.8)
In [3]:
#Pyramid Down & blurr
# Easy-peesy
def pyr_blurr(image):
    return cv2.GaussianBlur(cv2.pyrDown(image), (7, 7), 30.)

ref_image = pyr_blurr(ref_image)
images = map(lambda i: pyr_blurr(i), images)

Histogram Specification

For all 3 color channels of each image:

  1. Create histograms: reference image, input image
  2. Create CDFs: CDF is defined as $H_i= {N \over |I|} \sum\limits_{j=0}^{i}h_j$, where $h_j$ is the j-th bin of histogram $h$, $N$ is the maximum color value (255 in our case)
  3. Map image to reference, i.e. for color $i$ in the input image, find color $j$ in the reference image: $\DeclareMathOperator*{\argmin}{\arg\!\min} j = \argmin_{k}{|H^{inp}_i – H^{ref}_k|}$
In [4]:
# here is the histogram of the reference image green channel:
hist_green = cv2.calcHist([ref_image], [1], None, [256], np.array([0, 255]))
fig = plt.figure()
plt.bar(np.arange(256), hist_green, width=2, color='g', edgecolor='none')
fig.set_size_inches(np.array(fig.get_size_inches(), dtype=np.float) * 2)
plt.show()

The above histogram has two peaks with lots of 0 values – not surprising, since we are computing it over the entire image. We need to mask off the background

In [5]:
# we should really take it over the masked region, to remove the irrelevent background
mask, hull = find_eye(ref_image)
hist_green = cv2.calcHist([ref_image], [1], mask, [256], np.array([0, 255]))
fig = plt.figure()
plt.bar(np.arange(256), hist_green, width=2, color='g', edgecolor='none')
fig.set_size_inches(np.array(fig.get_size_inches(), dtype=np.float) * 2)
plt.show()

Let’s find the CDF. This code is heavily influenced by the LINQ way of doing everything at once without writing loops. I am starting to do better in Python, though, I promise.

In [6]:
old_images = images
images = [ref_image]

# always return colors between 0 and 255
def saturate (v):
    return map(lambda a: min(max(round(a), 0), 255), v)

# given a list of images, create masks for them
def get_masks(images):
    return map(lambda i: find_eye(i)[0], images)

# list of masks
maskHull = get_masks(images)
# list of lists of channels
channels = map(lambda i: cv2.split(i), images)
# match mask to channel
imMask = zip(channels, maskHull)
# area of the image (the |I| constant in the CDF formaula above)
nonZeros = map(lambda m: cv2.countNonZero(m), maskHull)

# grab three histograms - one for each channel
histPerChannel = map(lambda (c, mask): \
                     [cv2.calcHist([cimage], [0], mask,  [256], np.array([0, 255])) for cimage in c], imMask)
# compute the cdf's. 
# they are normalized & saturated: values over 255 are cut off.
cdfPerChannel = map(lambda (hChan, nz): \
                     [saturate(np.cumsum(h) * 255.0 / nz) for h in hChan], zip(histPerChannel, nonZeros))

# let's see a cdf over the green channel (plot the first image CDF)
fig = plt.figure()
plt.bar(np.arange(256), cdfPerChannel[0][1], width=2, color='g', edgecolor='none')
fig.set_size_inches(np.array(fig.get_size_inches(), dtype=np.float) * 2)
plt.show()

Putting it all together

In [7]:
# code above neatly in one function   
def calc_hist(images, masks):
    channels = map(lambda i: cv2.split(i), images)
    imMask = zip(channels, masks)
    nonZeros = map(lambda m: cv2.countNonZero(m), masks)
    
    # grab three histograms - one for each channel
    histPerChannel = map(lambda (c, mask): \
                         [cv2.calcHist([cimage], [0], mask,  [256], np.array([0, 255])) for cimage in c], imMask)
    # compute the cdf's. 
    # they are normalized & saturated: values over 255 are cut off.
    cdfPerChannel = map(lambda (hChan, nz): \
                        [saturate(np.cumsum(h) * 255.0 / nz) for h in hChan], \
                        zip(histPerChannel, nonZeros))
    
    return np.array(cdfPerChannel)

# compute color map based on minimal distances beteen cdf values of ref and input images    
def getMin (ref, img):
    l = [np.argmin(np.abs(ref - i)) for i in img]
    return np.array(l)

# compute and apply color map on all channels of the image
def map_image(image, refHist, imageHist):
    # each of the arguments contains histograms over 3 channels
    mp = np.array([getMin(r, i) for (r, i) in zip(refHist, imageHist)])

    channels = np.array(cv2.split(image))
    mappedChannels = np.array([mp[i,channels[i]] for i in range(0, 3)])
    
    return cv2.merge(mappedChannels).astype(np.uint8)

# compute the histograms on all three channels for all images
def histogram_specification(ref, images, masks):
        cdfs = calc_hist(images, masks)
        mapped = [map_image(images[i], ref[0], cdfs[i, :, :]) for i in range(len(images))]
        return mapped
In [8]:
# the fruit of our labor
images = old_images
refMask = get_masks([ref_image])
refHist = calc_hist([ref_image], refMask)
masks = get_masks(images)
histSpec = histogram_specification(refHist, images, masks)

print "Reference Image"
show_images([ref_image], ["Ref"], scale = 0.9)
print "Original Images"
show_images(images, titles = image_titles, scale = 0.9)
print "Mapped Images"
show_images(histSpec, titles = image_titles, scale = 0.9)
Reference Image
Original Images
Mapped Images

At this point it seems like a good idea to filter out the background noise which appears to be present in image 21118_left: to at least make sure the background is entirely black, so that future steps of the algorithm could mask it out easily.

Not a problem:

In [9]:
def mask_background(image, mask):
    # copy to preserve the original
    im = image.copy()
    im[mask == 0, :] = 0
    return im

maskedBg = [mask_background(h, m) for (h, m) in zip(histSpec, masks)]
print "Masked Background"
show_images(maskedBg, titles=image_titles, scale=0.9)
Masked Background

Now what?! This does not look desirable at all! Since we were histogramming using a mask, and our mask appears to have consumed so much of the actual image, we need to do something about it. Let’s take a look at the mask. This is a problem we handled in the previous post.

In [10]:
mask, _ = find_eye(images[2], 1)
show_images([mask], scale = 0.9)

let’s apply all possible masks and see how they influence the resulting image:

In [11]:
mask, _ = find_eye(images[2], 4)
masks[2] = mask

histSpec = histogram_specification(refHist, [images[2]], [masks[2]])
print "Defective mask: We apply the mask we got by thresholding the Canny filter at 4 prior to color transfer"
show_images(histSpec, titles = image_titles[2:], scale = 0.9)

mask, _ = find_eye(images[2], 1)
masks[2] = mask
histSpec = histogram_specification(refHist, [images[2]], [masks[2]])
print "Corrected Mask: Applying the mask above (Canny threshold = 1) prior to color transfer "
show_images(histSpec, titles = image_titles[2:], scale = 0.9)
maskedBg = [mask_background(h, m) for (h, m) in zip(histSpec, [masks[2]])]
print "Masked Background: Actually mask the background by applying appropriate masks"
show_images(maskedBg, titles=image_titles[2:], scale=0.9)
Defective mask: We apply the mask we got by thresholding the Canny filter at 4 prior to color transfer
Corrected Mask: Applying the mask above (Canny threshold = 1) prior to color transfer 
Masked Background: Actually mask the background by applying appropriate masks

Image Processing with OpenCV and Python

The following few posts are a fallout from the Kaggle Diabetic Retinopathy Detection competition. I have been learning image processing with OpenCV 2.4/C++/GPU, Python 2.7, scikit-image, PIL, etc. It was a lot of fun.

Finding the Eye

We have a dataset of over 35,000 fundus images and we need to do some processing on them. Seems rational to first learn to locate the object of interest and create a mask of it, so that we can reliably separte the irrelevant background from the actual image we are going to process. This is not a particularly hard problem. In this notebook we use OpenCV 2.4 for the job

Plan of attack

The eye itself is a circular object (in theory) clearly separated from the dark background. If we try to find all the edges in the image, the convex hull that unites all these edges should (in theory) be the eye!

In [1]:
# Auxilary function to display a list of images

%matplotlib inline

from matplotlib import pyplot as plt
from os import path
import numpy as np
import cv2
import pandas as pd

def show_images(images,titles=None, scale=1.3):
    """Display a list of images"""
    n_ims = len(images)
    if titles is None: titles = ['(%d)' % i for i in range(1,n_ims + 1)]
    fig = plt.figure()
    n = 1
    for image,title in zip(images,titles):
        a = fig.add_subplot(1,n_ims,n) # Make subplot
        if image.ndim == 2: # Is image grayscale?
            plt.imshow(image)
        else:
            plt.imshow(cv2.cvtColor(image, cv2.COLOR_RGB2BGR))
        a.set_title(title)
        plt.axis("off")
        n += 1
    fig.set_size_inches(np.array(fig.get_size_inches(), dtype=np.float) * n_ims / scale)
    plt.show()
    

Image Preprocessing

  1. Downsample with Gaussian Pyramid
  2. Find the eye.
    1. Canny edge finder
    2. Find contours
    3. Find the joining convex hull
    4. Create a mask for this hull. The mask is resized to the final image size
In [2]:
img_path = "/kaggle/retina/train/sample"

image_names = ["16_left.jpeg", "10130_right.jpeg", "21118_left.jpeg"]
image_paths = map(lambda t: path.join(img_path, t), image_names)
images = map(lambda p: cv2.imread(p), image_paths)

# let's see what we've got
image_titles = map(lambda i: path.splitext(i)[0], image_names)
show_images(images, image_titles, scale = 0.8)

Prepare: PyrDown & Blurr

In [3]:
#Pyramid Down & blurr
# Easy-peesy
def pyr_blurr(image):
    return cv2.GaussianBlur(cv2.pyrDown(image), (7, 7), 30.)

images = map(lambda i: pyr_blurr(i), images)

Find the Eye

In [4]:
# Function to display contours using OpenCV.
def display_contours(image, contours, color = (255, 0, 0), thickness = -1, title = None):
    # Contours are drawn on the original image, so let's make a copy first
    imShow = image.copy()
    for i in range(0, len(contours)):
        cv2.drawContours(imShow, contours, i, color, thickness)
    show_images([imShow], scale=0.7, titles=title)
    
image = images[1]

# this is a good threshold for Canny edge finder, but it does not always work. We will see how to deal with it furhter on.
thresh = 4
# Searcing for the eye
# Let's see how this works setp-by-step
# convert to a one channel image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Canny edge finder
edges = np.array([])
edges = cv2.Canny(gray, thresh, thresh * 3, edges)

# Find contours
# second output is hierarchy - we are not interested in it.
contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Let's see what we've got:
display_contours(image, contours, thickness=2)
print "{:d} points".format(len(np.vstack(np.array(contours))))
26935 points
In [5]:
# Now let's get only what we need out of it
hull_contours = cv2.convexHull(np.vstack(np.array(contours)))
hull = np.vstack(hull_contours)

# we only get one contour out of it, let's see it
display_contours(image, [hull], thickness=3, color=(0, 255, 0))

Finally, in order to create a mask, we simply draw the hull on a totally black image, using “fill” collor white with the drawContours API.

In [6]:
# Now let's create a mask for this image
def createMask((rows, cols), hull):
    # black image
    mask = np.zeros((rows, cols), dtype=np.uint8)
    # blit our contours onto it in white color
    cv2.drawContours(mask, [hull], 0, 255, -1)
    return mask

mask = createMask(image.shape[0:2], hull)

show_images([mask])

Putting it all together:

In [7]:
def find_eye(image, thresh = 4):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Canny edge finder
    edges = np.array([])
    edges = cv2.Canny(gray, thresh, thresh * 3, edges)

    # Find contours
    # second output is hierarchy - we are not interested in it.
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    
    # Now let's get only what we need out of it
    hull_contours = cv2.convexHull(np.vstack(np.array(contours)))
    hull = np.vstack(hull_contours)
    
    def createMask((rows, cols), hull):
        # black image
        mask = np.zeros((rows, cols), dtype=np.uint8)
        # blit our contours onto it in white color
        cv2.drawContours(mask, [hull], 0, 255, -1)
        return mask

    mask = createMask(image.shape[0:2], hull)
    
    # returning the hull to illustrate a few issues below
    return mask, hull

In order to apply the mask, we just use numpy magic

In [8]:
def mask_background(image, mask):
    # copy to preserve the original
    im = image.copy()
    im[mask == 0, :] = 0
    return im
    
# find the eye and get the background mask
mask, _ = find_eye(image)

maskedBg = mask_background(image, mask)
print "Masked Background"
show_images([maskedBg], scale=0.9)
Masked Background

How about 21118_left? This was an especially low quality image.

In [9]:
mask, _ = find_eye(images[2])
plt.imshow(mask)
plt.show()

Ok. This is not good. The image was too dark, so we need to set a lower threshold for the Canny edge finder

In [10]:
mask, _ = find_eye(images[2], 1)
plt.imshow(mask)
plt.show()

This is much better. A simple heuristic works pretty well on our data set: if the resulting mask occupies less than 41% percent of the image, we reduce the threshold to 1. We must be careful, though. If we approach the problem with this threshold at the very beginning i.e, set the threshold to 1 for all images, for many images we will catch a lot of background noise in our mask and it will be no mask at all!

In [30]:
# Must be careful not to set the threshold too low from the start.
mask, _ = find_eye(images[1], 1)
plt.imshow(mask)
plt.show()

Kaggle Microsoft Malware Competition: Lessons Learned

Here is a link to the winning solution for this competition. Interestingly enough, this wonderful team found that pixel intensity was a feature that worked really well in malware. This is probably why I did not completely bomb out: my “golden feature” was a 1D LBP pattern extracted from the binary images. It is quite amusing if not fascinating, that malware binaries exhibit enough properties of actual images to enable a relatively accurate classification on the basis of purely image processing features.

Thus, I have extracted a couple of feature classes:

  1. 1D LPB pattern from the set of binary files
  2. Count of occurrences of top 150 or so APIs most frequent in the text files making the dataset

This gave me a set of about 400 features (256 from the LBP). I did not perform any dimensionality reduction

The key error was, of course, passing on some application of N-grams. It probably would not be a mistake to say, that all solutions up there on the leaderboard for this contest involved some flavor of N-grams extraction from the disassemblies. The winning solution does an interesting version of it. Unfortunately, I got so wrapped up with looking for a better learning algorithm (or an ensemble of those), that when I finally reached the moment I could start thinking about N-grams (kinda on the surface, really), I was running out of time.

Lessons Learned

  1. Feature engineering is key. Most of the time has to go there. To say that 95% or more is spent getting good features is not an exaggeration. Corollary: Domain knowledge is essential.
  2. Fancy machine learning algorithms without good features are NOT key. If features don’t correlate very well to classes, no amount of learning will matter
  3. Fast failure is essential. Need to be able to try something very quickly, fail, and move on.
  4. Visualizations are indispensable. Investing effort in doing those can save a lot of precious time down the line
  5. Forget everything you learned in a machine learning/data science class. Well, not as crudely as that, but when studying learning methods the huge bias for “clean” datasets is unavoidable, because the studies have to focus on learning algorithms. In reality data sucks badly and whipping it into shape is necessary. Only after that is done, one may start remembering those machine learning/data science classes.

First Kaggle Competition

This was my first competition and I finished (cue the drums) 116th/377. A spectacular failure exceeded only by the failure in my next competition. Nevertheless, here is what I did and overall I was not all that inaccurate on the validation set and probably even less inaccurate in real life, surprise surprise. This is a converted IPython notebook, “lessons learned” in the next post.
Note: All confusion matrices reflect workings of the learners on the validation set

Preparation

SciKit

We are using the brand new 0.16.0. Upgrade using:

conda install scikit-learn

Data Preparation

Make sure trainLabels.csv and mix_lbp.csv are both present in the same directory as the .ipynb (this notebook)

Utilities

Developed to faciliate this:

  • tr_utils.py – often used functions
  • train_files.py – aids in file manipulation (loading features)
  • SupervisedLearning.py – thin wrapper arround scikit supervised learning algorithms
  • train_nn.py – neural net using PyBrain

Misc

If <tab> completion is not working, install pyreadline:

easy_install pyreadline

Methodology

We are running three classifiers and blending their results:

  1. SVM
  2. Neural net
  3. Calibrated random forest

mix_lbp.csv contains all the features (below) used for the experiments (as well as for final competition participation). It contains all the samples of the training set. For this demo, we split this set 9:1, where 90% is used for training and 10% for validation.

We then run three different classifiers on these sets (training on the larger one and validating on the smaller one), and try to blend the results by simple voting in order to decrease the resulting log loss (computed on the validation dataset).

Feature Selection

We are training with two types of features:

  1. Features selected from the binary files as described in the 1dlbp article. These produce a histogram of 256 bins for each file
  2. Features selected from .asm files are binary features, indicating whehter a given file contains a certain Windows API. 141 top APIs are picked for this purpose
  3. Number of subroutines in each file (one additional feature). Giving us a 398-dimensional vector
In [2]:
from SupervisedLearning import SKSupervisedLearning
from train_files import TrainFiles
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import log_loss, confusion_matrix
from sklearn.calibration import CalibratedClassifierCV
from tr_utils import vote
import matplotlib.pylab as plt
import numpy as np
from train_nn import createDataSets, train

An auxilary function

Let’s define a function that plots the confusion matrix to see how accurate our predictions really are

In [3]:
def plot_confusion(sl):
    conf_mat = confusion_matrix(sl.Y_test, sl.clf.predict(sl.X_test_scaled)).astype(dtype='float')
    norm_conf_mat = conf_mat / conf_mat.sum(axis = 1)[:, None]

    fig = plt.figure()
    plt.clf()
    ax = fig.add_subplot(111)
    ax.set_aspect(1)
    res = ax.imshow(norm_conf_mat, cmap=plt.cm.jet, 
                    interpolation='nearest')
    cb = fig.colorbar(res)
    labs = np.unique(Y_test)
    x = labs - 1

    plt.xticks(x, labs)
    plt.yticks(x, labs)

    for i in x:
        for j in x:
            ax.text(i - 0.2, j + 0.2, "{:3.0f}".format(norm_conf_mat[j, i] * 100.))
    return conf_mat

Load data from the text file

Loaded data contains all of the training examples.

NOTE: Actually almost all. 8 are missing, because binary features could not be extracted from them.

In [4]:
train_path_mix = "./mix_lbp.csv"
labels_file = "./trainLabels.csv"
X, Y_train, Xt, Y_test = TrainFiles.from_csv(train_path_mix, test_size = 0.1)

The last line above does the following:

  1. Loads the examples from the csv files, assuming the labels are in the last column
  2. Splits the results in a training and a validation dataset using sklearn magic, based on the test_size parameter (defaults to 0.1). $test\_size \in [0, 1]$.
  3. Returns two tuples: (training, training_labels), (testing, testing_labels)

Training

Training consists of training three models

  1. SVM with an RBF kernel
  2. Random foreset with the calibration classifier installed in 0.16.0 of scikit
  3. Neural net

Train SVM

We neatly wrap this into our $\color{green}{SKSupervisedLearning}$ class The procedure is simple:

  1. Instantiate the class with the tuple returned in by the TrainFiles instance or method above and the desired classifier
  2. Apply standard scaling (in scikit this is the Z-score scaling which centers the samples and reduces std to 1). NOTE: This is what the SVM classifier expects
  3. Set training parameters
  4. Call $\color{green}{fit\_and\_validate()}$ to retrieve the $\color{green}{log\_loss}$. This function will compute the log loss on the validation dataset. It will also return the training log loss, which may be interesting but is not out goal. Either way, it is going to be spectacularly small. 🙂
In [5]:
sl = SKSupervisedLearning(SVC, X, Y_train, Xt, Y_test)
sl.fit_standard_scaler()
sl.train_params = {'C': 100, 'gamma': 0.01, 'probability' : True}
ll_trn, ll_tst = sl.fit_and_validate()

print "SVC log loss: ", ll_tst
SVC log loss:  0.0500220173885

You can play with the parameters here to see how log loss changes. SKSupervisedLearning wraps the sklearn grid search technique for searching for optimal parameters in one call. You can take a look at the implementation details.

Let’s plot the confusion matrix to see how well we are doing (values inside squares are %s): (change magic below to %matplotlib qt to get the out-of-browser graph)

In [6]:
%matplotlib inline
conf_svm = plot_confusion(sl)

As expected, we are not doing so well in class 5 where there are very few samples.

Train Neural Net`

This is a fun one, I promise. 🙂

The neural net is built by PyBrain, has just one hidden layer which is equal to $\frac{1}{4}$th of the input layer. The hidden layer activation is sigmoid, the output – softmax (since this is a multi-class neural net), and has bias units for the hidden and the output layers. We use the PyBrain $\color{green}{buildNetwork()}$ function that builds the network in one call.

NOTE: We are still using all the scaled features to train the neural net

I am setting %matplotlib to qt so training can be watched in real time You will see each training epoch charted. The graph on the left shows % error, the one on the right – log loss.

You can play with the “test error” or “epochs” parameter to control how long it runs.Limiting it to just 10 epochs for this experiment

In [6]:
%matplotlib qt
trndata, tstdata = createDataSets(sl.X_train_scaled, Y_train, sl.X_test_scaled, Y_test)
fnn = train(trndata, tstdata, epochs = 10, test_error = 0.07, momentum = 0.15, weight_decay = 0.0001)
epoch:    1   train error:  6.28%   test error:  8.29%  test logloss: 0.3495  train logloss: 0.2659
epoch:    2   train error:  3.98%   test error:  5.52%  test logloss: 0.2491  train logloss: 0.1615
epoch:    3   train error:  2.89%   test error:  5.16%  test logloss: 0.2025  train logloss: 0.1166
epoch:    4   train error:  2.55%   test error:  4.51%  test logloss: 0.1740  train logloss: 0.0970
epoch:    5   train error:  2.02%   test error:  4.60%  test logloss: 0.1768  train logloss: 0.0806
epoch:    6   train error:  1.71%   test error:  4.05%  test logloss: 0.1618  train logloss: 0.0735
epoch:    7   train error:  1.64%   test error:  3.87%  test logloss: 0.1584  train logloss: 0.0656
epoch:    8   train error:  1.54%   test error:  3.41%  test logloss: 0.1436  train logloss: 0.0600
epoch:    9   train error:  1.23%   test error:  3.13%  test logloss: 0.1382  train logloss: 0.0529
epoch:   10   train error:  1.20%   test error:  3.41%  test logloss: 0.1400  train logloss: 0.0536

Train Random Forest with Calibration

Finally, we train the random forest (which happens to train in seconds) with the calibration classifier (which takes 2 hours or so)

Random forests are very accurate, the problem is that they maker over-confident predictions (or at least that is what the predict_proba function, which is supposed to return probabilities of each class gives us). So, god forbid we are ever wrong! Since it predicts the probability of 0 on the correct class, log loss goes to infinity. Calibration classifier makes predict_proba return something sane.

In [7]:
sl_ccrf = SKSupervisedLearning(CalibratedClassifierCV, X, Y_train, Xt, Y_test)
sl_ccrf.train_params = \
    {'base_estimator': RandomForestClassifier(**{'n_estimators' : 7500, 'max_depth' : 200}), 'cv': 10}
sl_ccrf.fit_standard_scaler()
ll_ccrf_trn, ll_ccrf_tst = sl_ccrf.fit_and_validate()

print "Calibrated log loss: ", ll_ccrf_tst
Calibrated log loss:  0.0614970801316

As you can see, we are simply wrapping the $\color{green}{RandomForestClassifier}$ in the $\color{green}{CalibratedClassifier}$. Plot the matrix (after a couple of hours):

In [8]:
%matplotlib inline
conf_ccrf = plot_confusion(sl_ccrf)

Voting

Now we can gather the results of our experiments and blend them. We use a simple weighted voting scheme for that.

$\color{green}{vote}$ function is implemented in tr_utils.py.

Here we are trying to balance the weights of SVM and calibrated RF.

In [9]:
%matplotlib inline
x = 1. / np.arange(1., 6)
y = 1 - x

xx, yy = np.meshgrid(x, y)
lls1 = np.zeros(xx.shape[0] * yy.shape[0]).reshape(xx.shape[0], yy.shape[0])
lls2 = np.zeros(xx.shape[0] * yy.shape[0]).reshape(xx.shape[0], yy.shape[0])

for i, x_ in enumerate(x):
    for j, y_ in enumerate(y):
        proba = vote([sl.proba_test, sl_ccrf.proba_test], [x_, y_])
        lls1[i, j] = log_loss(Y_test, proba)

        proba = vote([sl.proba_test, sl_ccrf.proba_test], [y_, x_])
        lls2[i, j] = log_loss(Y_test, proba)

fig = plt.figure()
plt.clf()
ax = fig.add_subplot(121)
ax1 = fig.add_subplot(122)

ax.set_aspect(1)
ax1.set_aspect(1)

res = ax.imshow(lls1, cmap=plt.cm.jet, 
                interpolation='nearest')
res = ax1.imshow(lls2, cmap=plt.cm.jet, 
                interpolation='nearest')

cb = fig.colorbar(res)

The graphs show a “blended” log loss. A matrix on the left blends SVM and RF log losses with weights “favoring” SVM, and the one on the right “favors” RF.

Extracting 1DLBP Pattern with Python, Numba & CUDA

Now that I have learned how to convert IPython notebooks to WordPress-able html, here is an old post revisited.

Extracting a 1D Local Binary Pattern Histogram on NVIDIA GPU with CUDA and Numbapro

This was done for the Microsoft Malware competition on Kaggle. In this contest, a bunch of malware files needed to be classified in 9 categories. The files were presented in two sets, 10,868 each:

  • text (disassembly of the malware)
  • binary (actual binaries minus PE header to render them harmless)

I chose the 1DLBP algorithm described in the linked paper to extract (hopefully meaningful) features from each binary file. These features, in combination with some text-based features extracted from the disassemblies, proved to be quite effective in training an accurate classifier

Prerequisits

  • basic knowledge of CUDA programming (kernels, grid, blocks, threads, etc)
  • numba
In [1]:
from numba import *
from timeit import default_timer as timer

import numpy as np
import matplotlib.pylab as plt

Algorithm in Brief

Since the paper (above) does a great job explaining the algorithm, I will only describe it very briefly.

For every byte (b) in the image, we look at 4 (or any neighborhood) of bytes immediately to the “left” and at 4 bytes immediately to the “right” of the current byte.

We now have 8 selected bytes, call them c[], where size(c) = 8. Each of the elements of array c is converted into a binary digit, based on the following rule:

In [2]:
c = np.random.randint(0, 256, size=8) #neighborhood bytes to the left and to the right contain these values
b = np.random.randint(256) # center of the neighborhood
digits = np.array([0 if d < b else 1 for d in c]) # this is the number that represents the pattern
print "Center: ", b
print "Bytes around it: ", c
print "Pattern: ", digits 

# In order to compute the actual pattern, we just mulitply each of the digits by a power of two and add them up
powers = 1 << np.arange(0, 8)
print "As number base 10: ", np.dot(powers, digits.T)
Center:  2
Bytes around it:  [252 156   3  24 120  25  85  60]
Pattern:  [1 1 1 1 1 1 1 1]
As number base 10:  255

Which is actually a non-negative integer less than $2^8$.

Then a histogram of these values is built. It has the size of 256 ints and in training experiments proves to be much more effective than a simple byte histogram.

The functions below translate this description into Python:

In [3]:
def extract_1dlbp_cpu(input, neighborhood, p):
    """
    Extract the 1d lbp pattern on CPU
    """
    res = np.zeros(1 << (2 * neighborhood))
    for i in range(neighborhood, len(input) - neighborhood):
        left = input[i - neighborhood : i]
        right = input[i + 1 : i + neighborhood + 1]
        both = np.r_[left, right]
        res[np.sum(p [both >= input[i]])] += 1
    return res

With numba, this can be expressed on the GPU using CUDA. If you have numbapro, CUDA support can be validated:

In [4]:
import numbapro
numbapro.check_cuda()
------------------------------libraries detection-------------------------------
Finding cublas
	located at C:\Anaconda\DLLs\cublas64_60.dll
	trying to open library...	ok
Finding cusparse
	located at C:\Anaconda\DLLs\cusparse64_60.dll
	trying to open library...	ok
Finding cufft
	located at C:\Anaconda\DLLs\cufft64_60.dll
	trying to open library...	ok
Finding curand
	located at C:\Anaconda\DLLs\curand64_60.dll
	trying to open library...	ok
Finding nvvm
	located at C:\Anaconda\DLLs\nvvm64_20_0.dll
	trying to open library...	ok
	finding libdevice for compute_20...	ok
	finding libdevice for compute_30...	ok
	finding libdevice for compute_35...	ok
-------------------------------hardware detection-------------------------------
Found 1 CUDA devices
id 0       GeForce GTX TITAN                              [SUPPORTED]
                      compute capability: 3.5
                           pci device id: 0
                              pci bus id: 2
Summary:
	1/1 devices are supported
PASSED
Out[4]:
True

I am working with GTX Titan, 14 SMs (14 * 192 = 2688 SPs), 6 Gb RAM. This used to be cool a year and a half ago, but a lot has happened since then.

The Kernel

In [5]:
@cuda.jit('void(uint8[:], int32, int32[:], int32[:])')
def lbp_kernel(input, neighborhood, powers, h):
    i = cuda.grid(1)
    r = 0
    if i < input.shape[0] - 2 * neighborhood:
        i += neighborhood
        for j in range(i - neighborhood, i):
            if input[j] >= input[i]:
                r += powers[j - i + neighborhood]
    
        for j in range(i + 1, i + neighborhood + 1):
            if input[j] >= input[i]:
                r += powers[j - i + neighborhood - 1]

        cuda.atomic.add(h, r, 1)

This looks like a regular CUDA kernel. It almost looksy “C++y”. Each thread is tasked with computing the pattern (“digits” above) around one single element of the original input array. We compute the pattern, then wait for all of the threads in the block to finish and update the histogram with their results.

Note: implementation of histogramming using atomics is the most straightforward, but not the most efficient one. For our purposes it will do just fine.

Calling the Kernel

That is pretty standard: allocate memory and move it to the GPU.

In [6]:
def extract_1dlbp_gpu(input, neighborhood, d_powers):
    '''
    input - the input array
    neighborhood - size of the neighborhood 
    d_powers - device address of the powers of two constants used to compute the final pattern
    '''
    maxThread = 512
    blockDim = maxThread
    d_input = cuda.to_device(input)

    hist = np.zeros(2 ** (2 * neighborhood), dtype='int32')
    gridDim = (len(input) - 2 * neighborhood + blockDim) / blockDim

    d_hist = cuda.to_device(hist)

    lbp_kernel[gridDim, blockDim](d_input, neighborhood, d_powers, d_hist)
    d_hist.to_host()
    return hist

Testing

Running comparison for arrays of several lengths.

In [7]:
X = np.arange(3, 7)
X = 10 ** X
neighborhood = 4

cpu_times = np.zeros(X.shape[0])
gpu_times = np.zeros(X.shape[0])

p = 1 << np.array(range(0, 2 * neighborhood), dtype='int32')
d_powers = cuda.to_device(p)

for i, x in enumerate(X):
    input = np.random.randint(0, 256, size = x).astype(np.uint8)
    start = timer()
    h_cpu = extract_1dlbp_cpu(input, neighborhood, p)
    cpu_times[i] = timer() - start
    print "Finished on CPU: length: {0}, time: {1:3.4f}s".format(x, cpu_times[i])

    start = timer()
    h_gpu = extract_1dlbp_gpu(input, neighborhood, d_powers)
    gpu_times[i] = timer() - start
    print "Finished on GPU: length: {0}, time: {1:3.4f}s".format(x, gpu_times[i])
    print "All h_cpu == h_gpu: ", (h_cpu == h_gpu).all()
Finished on CPU: length: 1000, time: 0.0331s
Finished on GPU: length: 1000, time: 0.0018s
All h_cpu == h_gpu:  True
Finished on CPU: length: 10000, time: 0.3532s
Finished on GPU: length: 10000, time: 0.0014s
All h_cpu == h_gpu:  True
Finished on CPU: length: 100000, time: 3.1897s
Finished on GPU: length: 100000, time: 0.0023s
All h_cpu == h_gpu:  True
Finished on CPU: length: 1000000, time: 30.9789s
Finished on GPU: length: 1000000, time: 0.0107s
All h_cpu == h_gpu:  True

Charting

Here both axes are log-scaled.

In [64]:
f = plt.figure(figsize=(10, 5))

plt.plot(X, cpu_times, label = "CPU")
plt.plot(X, gpu_times, label = "GPU")
plt.xlabel('input length')
plt.ylabel('time, sec')
plt.yscale('log')
plt.xscale('log')
plt.legend()
plt.show()

Improvements

The above results are a bit extreme and very boring. GPU is faster than CPU, however, 4 orders of magnitue?! Can we do better on CPU? Obviously the slowdown is due to Python complexity: array manipulations we are using for once… Here are things to try:

  • Mimic CUDA kernel on CPU:
In [8]:
def extract_1dlbp_gpu_debug(input, neighborhood, powers, res):
    maxThread = 512
    blockDim = maxThread
    gridDim = (len(input) - 2 * neighborhood + blockDim) / blockDim
    
    for block in range(0, gridDim):
        for thread in range(0, blockDim):
            r = 0
            i = blockDim * block + thread
            if i < input.shape[0] - 2 * neighborhood:
                i += neighborhood
                for j in range(i - neighborhood, i):
                    if input[j] >= input[i]:
                        r += powers[j - i + neighborhood]
    
                for j in range(i + 1, i + neighborhood + 1):
                    if input[j] >= input[i]:
                        r += powers[j - i + neighborhood - 1]

                res[r] += 1
    return res

A good thing, too, because this is actually a way to debug a CUDA kernel in Python

  • Use numba to compile this simple code to something non-Python. For this we just decorate the definition like so:
In [10]:
@jit("int32[:](uint8[:], int64, int32[:], int32[:])", nopython=True)
def extract_1dlbp_cpu_jit(input, neighborhood, powers, res):
    maxThread = 512
    blockDim = maxThread
    gridDim = (len(input) - 2 * neighborhood + blockDim) / blockDim
    
    for block in range(0, gridDim):
        for thread in range(0, blockDim):
            r = 0
            i = blockDim * block + thread
            if i < input.shape[0] - 2 * neighborhood:
                i += neighborhood
                for j in range(i - neighborhood, i):
                    if input[j] >= input[i]:
                        r += powers[j - i + neighborhood]
    
                for j in range(i + 1, i + neighborhood + 1):
                    if input[j] >= input[i]:
                        r += powers[j - i + neighborhood - 1]

                res[r] += 1
    return res

Now we modify the test:

In [12]:
X = np.arange(3, 7)
X = 10 ** X
neighborhood = 4

cpu_times = np.zeros(X.shape[0])
cpu_times_simple = cpu_times.copy()
cpu_times_jit = cpu_times.copy()
gpu_times = np.zeros(X.shape[0])

p = 1 << np.array(range(0, 2 * neighborhood), dtype='int32')
d_powers = cuda.to_device(p)

for i, x in enumerate(X):
    input = np.random.randint(0, 256, size = x).astype(np.uint8)

    print "Length: {0}".format(x)
    print "--------------"

    start = timer()
    h_cpu = extract_1dlbp_cpu(input, neighborhood, p)
    cpu_times[i] = timer() - start
    print "Finished on CPU: time: {0:3.5f}s".format(cpu_times[i])

    res = np.zeros(1 << (2 * neighborhood), dtype='int32')
    start = timer()
    h_cpu_simple = extract_1dlbp_gpu_debug(input, neighborhood, p, res)
    cpu_times_simple[i] = timer() - start
    print "Finished on CPU (simple): time: {0:3.5f}s".format(cpu_times_simple[i])

    res = np.zeros(1 << (2 * neighborhood), dtype='int32')
    start = timer()
    h_cpu_jit = extract_1dlbp_cpu_jit(input, neighborhood, p, res)
    cpu_times_jit[i] = timer() - start
    print "Finished on CPU (numba: jit): time: {0:3.5f}s".format(cpu_times_jit[i])

    start = timer()
    h_gpu = extract_1dlbp_gpu(input, neighborhood, d_powers)
    gpu_times[i] = timer() - start
    print "Finished on GPU: time: {0:3.5f}s".format(gpu_times[i])
    print "All h_cpu == h_gpu: ", (h_cpu_jit == h_gpu).all() and (h_cpu_simple == h_cpu_jit).all() and (h_cpu == h_cpu_jit).all()
    print ''

f = plt.figure(figsize=(10, 5))

plt.plot(X, cpu_times, label = "CPU")
plt.plot(X, cpu_times_simple, label = "CPU non-vectorized")
plt.plot(X, cpu_times_jit, label = "CPU jit")
plt.plot(X, gpu_times, label = "GPU")
plt.yscale('log')
plt.xscale('log')
plt.xlabel('input length')
plt.ylabel('time, sec')
plt.legend()
plt.show()
Length: 1000
--------------
Finished on CPU: time: 0.02263s
Finished on CPU (simple): time: 0.00430s
Finished on CPU (numba: jit): time: 0.00018s
Finished on GPU: time: 0.00281s
All h_cpu == h_gpu:  True

Length: 10000
--------------
Finished on CPU: time: 0.28701s
Finished on CPU (simple): time: 0.04830s
Finished on CPU (numba: jit): time: 0.00045s
Finished on GPU: time: 0.00144s
All h_cpu == h_gpu:  True

Length: 100000
--------------
Finished on CPU: time: 2.88549s
Finished on CPU (simple): time: 0.57943s
Finished on CPU (numba: jit): time: 0.00408s
Finished on GPU: time: 0.00274s
All h_cpu == h_gpu:  True

Length: 1000000
--------------
Finished on CPU: time: 29.97346s
Finished on CPU (simple): time: 5.73384s
Finished on CPU (numba: jit): time: 0.03651s
Finished on GPU: time: 0.00268s
All h_cpu == h_gpu:  True

Now we are cooking! This grapph is much more interesting than the one above. GPU is still the winner, but, as expected, the investment of moving things between RAM and GPU memory pays off once the size of our data is sufficiently large. The GPU doesn’t get out of bed for anything smaller than about $10^5$. For anything else, if perf is crucial (as it very often is), we are better off writing it in a simple way and jitting. In other words: the best thing is not to write it in Python, but to do it in C++ from the get-go instead. Which appears to be a foregone conclusion, of course.