Monthly Archives: August 2015

Fast Image Pre-processing with OpenCV 2.4, C++, CUDA: Memory, CLAHE

Previous couple of posts describe some retina images pre-processing with OpenCV and IPython notebooks. Python is great but having to pre-process about 88,000 images (35,000 train and 53,000 test) I had my doubts about how long it would all take. Besides, I am a huge fan of CUDA, I have a GTX Titan GPU and I am not afraid to use it! OpenCV 2.4.11 in Python, unfortunately, does not support CUDA, so we turn to C++.

Installation

Just a quick note. There are plenty of guides on OpenCV installation. Checking out the sources and building with CMake is a breeze. It is also required for CUDA support.

Sources

All sources for this are on my github.

Preprocessing Steps

We do the following preprocessing on each image:

Find eye contours, create mask
Color transfer via histogram specification
Convert to HSV, apply CLAHE on value channel
Convert back to BGR (this OpenCV, so not RGB, but BGR), apply mask to filter out the background

I will only highlight the more interesting parts of this process briefly.

Data Structures, Memory Management

Unlike Python, where OpenCV images are stored in NumPy arrays, in C++ OpenCV 2.4 uses Mat and GpuMat. Some algorithms work on GPU, some don’t. That means memory moves between RAM and GPU memory may become an issue, since it is one of the more time consuming operations in GPU development. Straight CUDA code for memory moves is cumbersome, but not a problem in OpenCV. On the other hand, when wrapping all kinds of transforms in a class, trying to keep track of where the current object on which a transform should be performed resides, may become complicated and hard to maintain. My first stab at this was, perhaps, not the most elegant:

```
class TransformImage
```
```
{
```
```
protected:
```
```
 
```
```
    Mat _image; //original
```

    Mat _enhanced; //after all transforms

    Mat _buf; //buffer for intermediate operations

```
 
```
```
    gpu::GpuMat g_image;
```
```
    gpu::GpuMat g_enhanced;
```
```
    gpu::GpuMat g_buf; 
```
```
 
```

    inline void MakeSafe() { g_enhanced.copyTo(g_buf); }

```
...
```
```
}
```

OpenCV APIs for image transforms take a source and a destination parameter. It was not immediately obvious to me in which cases these two may refer to the same structure, so to be safe, I often made use of the buffer. Of course, accessors now take on an additional role of moving images back and forth between regular memory and the GPU:

```
void setImage(Mat& image) 
```

    { image.copyTo(_image); g_image.upload(image); g_image.copyTo(g_enhanced); }

```
Mat& getImage() { return _image; }
```

void setChannel(Channels channel) { _channel = channel; }

```
Mat& getEnhanced() 
```

    { g_enhanced.download(_enhanced); return _enhanced; }

As the code shows, moving stuff between GPU and RAM is a breeze in OpenCV from the coder’s perspective. We use upload() and download() for RAM -> GPU and GPU -> RAM respectively.

CLAHE on GPU

For a discussion of histogram equalization, see this tutorial.

This is sorta out-of-band for this post, but to dilute all this book-keeping, I thought it would be good to demonstrate at least how to do CLAHE on the GPU. The code does not differ much from a CPU code!

My transform class applies the CLAHE algorithm to a single channel of the image. I have g_oneChannel array that holds different channels of the image. CLAHE is designed to work on one channel (grey scale) images. If we want to apply it to an RGB (or, in the case of OpenCV BGR) images, we first need to convert the image to HSV (or HSI or some such), apply the algorithm on the value (intensity) channel, and merge that channel back into the original image. This is where the channel stuff in my C++ code comes from. Fortunately we can do all of this without leaving the GPU.

```
gpu::GpuMat& ApplyClahe()
```
```
{
```

    Ptr<gpu::CLAHE> clahe = gpu::createCLAHE();

```
 
```
```
    clahe->setClipLimit(4.);
```

    clahe->setTilesGridSize(Size(16, 16));

    clahe->apply(g_oneChannel[(int)_channel], g_buf);

```
 
```

    g_buf.copyTo(g_oneChannel[(int)_channel]);

```
 
```

    gpu::merge(g_oneChannel, g_enhanced);

```
    return g_enhanced;
```
```
}
```

Notice the use of g_buf to hold an intermediate result to be safe. Not sure if we could have skipped it, I didn’t check.

This code is reminiscent of the Python code (as the tutorial shows):

```
import numpy as np
```
```
import cv2
```
```
 
```

# create a CLAHE object (Arguments are optional).

clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8,8))

```
cl1 = clahe.apply(img)
```
```
return cl1
```

Here are the “before” and “after” CLAHE images of Kaggle 16_left.jpeg

Matched Filters with OpenCV

Creating Custom Filter Banks with OpenCV¶

Suppose in order to extract curved lines from the image, we create a bank of filters (matched filters) designed to illicit a response from line segments of different orientation. To reconstruct the curves we apply each filter in the bank to the given image and then return the maximum response:

$$I_{res} = \max\limits_{x,y} (I * F_i)$$

where “*” is a convolution operation, $I$ – original image, $F_i$ – filters in the filter bank.

This article uses matched filters based on Gaussian and derivative of Gaussian for blood vessel detection – an important part of retinal image segmentation.

They use a Gaussian of the form:

$$f(x, y)= \frac{1}{\sqrt{2\pi}s}\exp(-\frac{x^2}{2s^2}) – m,\ for\ |x| \leq t\cdot s,\ |y| \leq L/2 \ \ \ (1)$$

Where $m$ – is the averaged value of the filter: $$m = \frac{\int\limits_{-ts}^{ts} \frac{1}{\sqrt{2\pi}s}\exp(-\frac{x^2}{2s^2})dx }{2ts} $$

They also use the first order derivative of this filter (FDOG):

$$g(x, y) = – \frac{x}{\sqrt{2\pi}s^3}\exp(-\frac{x^2}{2s^2}),\ for\ |x| \leq t\cdot s,\ |y| \leq L/2 \ \ \ (2)$$

In [1]:

# Auxillary stuff

%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import cm
import numpy as np
import cv2
import pandas as pd
from math import exp, pi, sqrt
from numbapro import vectorize

def show_images(images,titles=None, scale=1.3):
    """Display a list of images"""
    n_ims = len(images)
    if titles is None: titles = ['(%d)' % i for i in range(1,n_ims + 1)]
    fig = plt.figure()
    n = 1
    for image,title in zip(images,titles):
        a = fig.add_subplot(1,n_ims,n) # Make subplot
        if image.ndim == 2: # Is image grayscale?
            plt.imshow(image, cmap = cm.Greys_r)
        else:
            plt.imshow(cv2.cvtColor(image, cv2.COLOR_RGB2BGR))
        a.set_title(title)
        plt.axis("off")
        n += 1
    fig.set_size_inches(np.array(fig.get_size_inches(), dtype=np.float) * n_ims / scale)
    plt.show()

To create these kernels we first fill the $2ts \times L$ rectangle with values of its $x$ coordinate. Then we create a ufunc using numbapro $\color{green}{@vectorize}$ decorator so that (1) or (2) above could be applied in one shot. Closures appear to be supported by this decorator, so these functions can be generated dynamically, based on the $s$ parameter. Also, since OpenCV $\color{green}{filter2D}$ API uses correlation rather than convolution to apply filters, we need to use $\color{green}{flip}$ before the kernel is returned.

In [2]:

def _filter_kernel_mf_fdog(L, sigma, t = 3, mf = True):
    dim_y = int(L)
    dim_x = 2 * int(t * sigma)
    arr = np.zeros((dim_y, dim_x), 'f')
    
    ctr_x = dim_x / 2 
    ctr_y = int(dim_y / 2.)

    # an un-natural way to set elements of the array
    # to their x coordinate. 
    # x's are actually columns, so the first dimension of the iterator is used
    it = np.nditer(arr, flags=['multi_index'])
    while not it.finished:
        arr[it.multi_index] = it.multi_index[1] - ctr_x
        it.iternext()

    two_sigma_sq = 2 * sigma * sigma
    sqrt_w_pi_sigma = 1. / (sqrt(2 * pi) * sigma)
    if not mf:
        sqrt_w_pi_sigma = sqrt_w_pi_sigma / sigma ** 2

    @vectorize(['float32(float32)'], target='cpu')
    def k_fun(x):
        return sqrt_w_pi_sigma * exp(-x * x / two_sigma_sq)

    @vectorize(['float32(float32)'], target='cpu')
    def k_fun_derivative(x):
        return -x * sqrt_w_pi_sigma * exp(-x * x / two_sigma_sq)

    if mf:
        kernel = k_fun(arr)
        kernel = kernel - kernel.mean()
    else:
       kernel = k_fun_derivative(arr)

    # return the "convolution" kernel for filter2D
    return cv2.flip(kernel, -1)

And to call this helper:

In [3]:

def fdog_filter_kernel(L, sigma, t = 3):
    '''
    K = - (x/(sqrt(2 * pi) * sigma ^3)) * exp(-x^2/2sigma^2), |y| <= L/2, |x| < s * t
    '''
    return _filter_kernel_mf_fdog(L, sigma, t, False)

def gaussian_matched_filter_kernel(L, sigma, t = 3):
    '''
    K =  1/(sqrt(2 * pi) * sigma ) * exp(-x^2/2sigma^2), |y| <= L/2, |x| < s * t
    '''
    return _filter_kernel_mf_fdog(L, sigma, t, True)

In [4]:

gf = gaussian_matched_filter_kernel(15, 5)
fdog = fdog_filter_kernel(15, 5)

# visualize:
show_images([gf, fdog], ["Gaussian", "FDOG"])

Now we create a bank of these filters, each one rotated by $15^{\circ}$ from the previous one: $F_i=Rotate(F_{i-1}, 15^{\circ})$ Thus we create a bank of 12 such filters.

Rotation can be expressed as an affine transformation. Thus, in order to accomplish this with OpenCV, we first create a rotation matrix and then apply it using corresponding OpenCV APIs.

In [5]:

def createMatchedFilterBank(K, n = 12):
    '''
    Given a kernel, create matched filter bank
    '''
    rotate = 180 / n
    center = (K.shape[1] / 2, K.shape[0] / 2)
    cur_rot = 0
    kernels = [K]

    for i in range(1, n):
        cur_rot += rotate
        r_mat = cv2.getRotationMatrix2D(center, cur_rot, 1)
        k = cv2.warpAffine(K, r_mat, (K.shape[1], K.shape[0]))
        kernels.append(k)

    return kernels

And to see the results:

In [6]:

bank_gf = createMatchedFilterBank(gf, 4)
bank_fdog = createMatchedFilterBank(fdog, 4)

print "Gaussian"
show_images(bank_gf)
print "FDOG"
show_images(bank_fdog)

Gaussian

FDOG

In [7]:

# here are the values for one of these rotated images:
bank_gf[1]

Out[7]:

array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.00706917,  0.02442968,  0.01962757,  0.01276451,
         0.00619412, -0.00023838, -0.0062054 , -0.01145587, -0.01605165,
        -0.01994629, -0.02313657, -0.0257983 , -0.02769874, -0.02931222,
        -0.03055923, -0.03133851, -0.03196213, -0.02729871, -0.00505532],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.00912069,  0.03521829,  0.03231619,  0.02605832,  0.01962757,
         0.01276451,  0.00619412, -0.00023838, -0.0062054 , -0.01145587,
        -0.01605165, -0.01994629, -0.02313657, -0.0257983 , -0.02769874,
        -0.02931222, -0.03055923, -0.03133851, -0.03196212, -0.02729871],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.01008518,
         0.04215763,  0.04169458,  0.03756618,  0.03231619,  0.02605832,
         0.01962757,  0.01276451,  0.00619412, -0.00023838, -0.0062054 ,
        -0.01145587, -0.01605165, -0.01994629, -0.02313657, -0.0257983 ,
        -0.02769874, -0.02931222, -0.03055923, -0.03133851, -0.03196213],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.00971224,  0.04303707,
         0.0461037 ,  0.04496814,  0.04169458,  0.03756618,  0.0323162 ,
         0.02605832,  0.01962757,  0.01276451,  0.00619412, -0.00023838,
        -0.0062054 , -0.01145587, -0.01605165, -0.01994629, -0.02313657,
        -0.0257983 , -0.02769874, -0.02931222, -0.03055923, -0.02937985],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.00802595,  0.03868837,  0.04439882,
         0.04590621,  0.0461037 ,  0.04496814,  0.04169458,  0.03756618,
         0.03231619,  0.02605832,  0.01962757,  0.01276451,  0.00619412,
        -0.00023838, -0.0062054 , -0.01145587, -0.01605165, -0.01994629,
        -0.02313657, -0.0257983 , -0.02769874, -0.0274802 , -0.00668483],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.0055217 ,  0.02927613,  0.03669004,  0.0412676 ,
         0.04439883,  0.04590621,  0.0461037 ,  0.04496814,  0.04169458,
         0.03756618,  0.0323162 ,  0.02605832,  0.01962757,  0.01276451,
         0.00619412, -0.00023838, -0.0062054 , -0.01145587, -0.01605165,
        -0.01994629, -0.02313657, -0.02418591, -0.0060591 ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.00253091,  0.01728239,  0.02524208,  0.03122787,  0.03669004,
         0.0412676 ,  0.04439882,  0.04590621,  0.0461037 ,  0.04496814,
         0.04169458,  0.03756618,  0.0323162 ,  0.02605832,  0.01962757,
         0.01276451,  0.00619412, -0.00023838, -0.0062054 , -0.01145587,
        -0.01605165, -0.01869964, -0.00506112,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        , -0.00029528,
         0.00472601,  0.01156989,  0.01843455,  0.02524207,  0.03122787,
         0.03669004,  0.0412676 ,  0.04439882,  0.04590621,  0.0461037 ,
         0.04496814,  0.04169458,  0.03756618,  0.0323162 ,  0.02605832,
         0.01962757,  0.01276451,  0.00619412, -0.00023838, -0.0062054 ,
        -0.01073988, -0.0035113 ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , -0.00268081, -0.0067271 ,
        -0.00134984,  0.00504108,  0.01156989,  0.01843455,  0.02524208,
         0.03122787,  0.03669004,  0.0412676 ,  0.04439882,  0.04590621,
         0.0461037 ,  0.04496814,  0.04169458,  0.03756618,  0.0323162 ,
         0.02605832,  0.01962757,  0.01276451,  0.00619412, -0.00022348,
        -0.00135743,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        , -0.00446562, -0.01579774, -0.01225513,
        -0.00717557, -0.00134984,  0.00504108,  0.01156989,  0.01843455,
         0.02524207,  0.03122787,  0.03669004,  0.0412676 ,  0.04439882,
         0.04590621,  0.0461037 ,  0.04496814,  0.04169458,  0.03756618,
         0.03231619,  0.02605832,  0.01962757,  0.01196673,  0.00135496,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        , -0.00573721, -0.02212451, -0.02041427, -0.01685092,
        -0.01225513, -0.00717557, -0.00134984,  0.00504108,  0.01156989,
         0.01843455,  0.02524208,  0.03122787,  0.03669004,  0.0412676 ,
         0.04439882,  0.04590621,  0.0461037 ,  0.04496814,  0.04169458,
         0.03756618,  0.0323162 ,  0.02442968,  0.00429353,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [-0.00646024, -0.02627413, -0.02622723, -0.02359948, -0.02041427,
        -0.01685092, -0.01225513, -0.00693303, -0.00134984,  0.00504108,
         0.01156989,  0.01843455,  0.02524207,  0.03122787,  0.03669004,
         0.0412676 ,  0.04439882,  0.04590621,  0.0461037 ,  0.04496814,
         0.04169458,  0.03521829,  0.00706917,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [-0.02878211, -0.02953251, -0.02802574, -0.02622723, -0.02359948,
        -0.02041427, -0.01685092, -0.01225513, -0.00717557, -0.00134984,
         0.00504108,  0.01156989,  0.01843455,  0.02524207,  0.03122787,
         0.03669004,  0.04112527,  0.04439882,  0.04590621,  0.0461037 ,
         0.04215763,  0.00912069,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [-0.0314802 , -0.03070092, -0.02953251, -0.02802574, -0.02622723,
        -0.02359948, -0.02041427, -0.01685092, -0.01225513, -0.00717557,
        -0.00134984,  0.00504108,  0.01156989,  0.01843455,  0.02524208,
         0.03122787,  0.03669004,  0.04112527,  0.04439882,  0.04303707,
         0.01008518,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [-0.01385007, -0.0314802 , -0.03070092, -0.02953251, -0.02802574,
        -0.02622723, -0.02359948, -0.02041427, -0.01685092, -0.01225513,
        -0.00717557, -0.00134984,  0.00504108,  0.01156989,  0.01843455,
         0.02524207,  0.03122787,  0.03669004,  0.03855494,  0.00971224,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ]], dtype=float32)

In [8]:

#Applying the filter bank
def applyFilters(im, kernels):
    '''
    Given a filter bank, apply them and record maximum response
    '''
    images = np.array([cv2.filter2D(im, -1, k) for k in kernels])
    return np.max(images, 0)

Color Transfer with OpenCV, Python

Color Transfer by Histogram Specification¶

Auxilary functions (described in the previous post)¶

In [1]:

%matplotlib inline

from matplotlib import pyplot as plt
from os import path
import numpy as np
import cv2
import pandas as pd

# display a list of images with titles
def show_images(images,titles=None, scale=1.3):
    """Display a list of images"""
    n_ims = len(images)
    if titles is None: titles = ['(%d)' % i for i in range(1,n_ims + 1)]
    fig = plt.figure()
    n = 1
    for image,title in zip(images,titles):
        a = fig.add_subplot(1,n_ims,n) # Make subplot
        if image.ndim == 2: # Is image grayscale?
            plt.imshow(image)
        else:
            plt.imshow(cv2.cvtColor(image, cv2.COLOR_RGB2BGR))
        a.set_title(title)
        plt.axis("off")
        n += 1
    fig.set_size_inches(np.array(fig.get_size_inches(), dtype=np.float) * n_ims / scale)
    plt.show()

# finding the eye    
def find_eye(image, thresh = 4, size=256):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Canny edge finder
    edges = np.array([])
    edges = cv2.Canny(gray, thresh, thresh * 3, edges)

    # Find contours
    # second output is hierarchy - we are not interested in it.
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    
    # Now let's get only what we need out of it
    hull_contours = cv2.convexHull(np.vstack(np.array(contours)))
    hull = np.vstack(hull_contours)
    
    def createMask((rows, cols), hull):
        # black image
        mask = np.zeros((rows, cols), dtype=np.uint8)
        # blit our contours onto it in white color
        cv2.drawContours(mask, [hull], 0, 255, -1)
        return mask

    mask = createMask(image.shape[0:2], hull)
    
    # returning the hull to illustrate a few issues below
    return mask, hull

Image Preprocessing¶

Downsample with Gaussian Pyramid
Find the eye (see previous post).
Apply Histogram Specification to normalize color maps
1. Compute CDF of the reference image
2. Compute CDF of the input image
3. Create a mapping from refrence to image CDF
4. Realize mapping
Apply the mask (above) to clean up the background

The images supplied vary drastically in their palettes. To enable feature extraction we need to first normalize them to the same color scheme. We pick a reference image and apply color transfer to the rest:

In [2]:

img_path = "/kaggle/retina/train/sample"
# this is the image into the colors of which we want to map
ref_image = cv2.imread(path.join(img_path, "6535_left.jpeg"))
# images picked to illustrate different problems arising during algorithm application
image_names = ["16_left.jpeg", "10130_right.jpeg", "21118_left.jpeg"]
image_paths = map(lambda t: path.join(img_path, t), image_names)
images = map(lambda p: cv2.imread(p), image_paths)

# let's see what we've got
image_titles = map(lambda i: path.splitext(i)[0], image_names)
show_images(images, image_titles, scale = 0.8)

show_images([ref_image], ["reference image"], scale = 0.8)

In [3]:

#Pyramid Down & blurr
# Easy-peesy
def pyr_blurr(image):
    return cv2.GaussianBlur(cv2.pyrDown(image), (7, 7), 30.)

ref_image = pyr_blurr(ref_image)
images = map(lambda i: pyr_blurr(i), images)

Histogram Specification¶

For all 3 color channels of each image:¶

Create histograms: reference image, input image
Create CDFs: CDF is defined as $H_i= {N \over |I|} \sum\limits_{j=0}^{i}h_j$, where $h_j$ is the j-th bin of histogram $h$, $N$ is the maximum color value (255 in our case)
Map image to reference, i.e. for color $i$ in the input image, find color $j$ in the reference image: $\DeclareMathOperator*{\argmin}{\arg\!\min} j = \argmin_{k}{|H^{inp}_i – H^{ref}_k|}$

In [4]:

# here is the histogram of the reference image green channel:
hist_green = cv2.calcHist([ref_image], [1], None, [256], np.array([0, 255]))
fig = plt.figure()
plt.bar(np.arange(256), hist_green, width=2, color='g', edgecolor='none')
fig.set_size_inches(np.array(fig.get_size_inches(), dtype=np.float) * 2)
plt.show()

The above histogram has two peaks with lots of 0 values – not surprising, since we are computing it over the entire image. We need to mask off the background

In [5]:

# we should really take it over the masked region, to remove the irrelevent background
mask, hull = find_eye(ref_image)
hist_green = cv2.calcHist([ref_image], [1], mask, [256], np.array([0, 255]))
fig = plt.figure()
plt.bar(np.arange(256), hist_green, width=2, color='g', edgecolor='none')
fig.set_size_inches(np.array(fig.get_size_inches(), dtype=np.float) * 2)
plt.show()

Let’s find the CDF. This code is heavily influenced by the LINQ way of doing everything at once without writing loops. I am starting to do better in Python, though, I promise.

In [6]:

old_images = images
images = [ref_image]

# always return colors between 0 and 255
def saturate (v):
    return map(lambda a: min(max(round(a), 0), 255), v)

# given a list of images, create masks for them
def get_masks(images):
    return map(lambda i: find_eye(i)[0], images)

# list of masks
maskHull = get_masks(images)
# list of lists of channels
channels = map(lambda i: cv2.split(i), images)
# match mask to channel
imMask = zip(channels, maskHull)
# area of the image (the |I| constant in the CDF formaula above)
nonZeros = map(lambda m: cv2.countNonZero(m), maskHull)

# grab three histograms - one for each channel
histPerChannel = map(lambda (c, mask): \
                     [cv2.calcHist([cimage], [0], mask,  [256], np.array([0, 255])) for cimage in c], imMask)
# compute the cdf's. 
# they are normalized & saturated: values over 255 are cut off.
cdfPerChannel = map(lambda (hChan, nz): \
                     [saturate(np.cumsum(h) * 255.0 / nz) for h in hChan], zip(histPerChannel, nonZeros))

# let's see a cdf over the green channel (plot the first image CDF)
fig = plt.figure()
plt.bar(np.arange(256), cdfPerChannel[0][1], width=2, color='g', edgecolor='none')
fig.set_size_inches(np.array(fig.get_size_inches(), dtype=np.float) * 2)
plt.show()

Putting it all together¶

In [7]:

# code above neatly in one function   
def calc_hist(images, masks):
    channels = map(lambda i: cv2.split(i), images)
    imMask = zip(channels, masks)
    nonZeros = map(lambda m: cv2.countNonZero(m), masks)
    
    # grab three histograms - one for each channel
    histPerChannel = map(lambda (c, mask): \
                         [cv2.calcHist([cimage], [0], mask,  [256], np.array([0, 255])) for cimage in c], imMask)
    # compute the cdf's. 
    # they are normalized & saturated: values over 255 are cut off.
    cdfPerChannel = map(lambda (hChan, nz): \
                        [saturate(np.cumsum(h) * 255.0 / nz) for h in hChan], \
                        zip(histPerChannel, nonZeros))
    
    return np.array(cdfPerChannel)

# compute color map based on minimal distances beteen cdf values of ref and input images    
def getMin (ref, img):
    l = [np.argmin(np.abs(ref - i)) for i in img]
    return np.array(l)

# compute and apply color map on all channels of the image
def map_image(image, refHist, imageHist):
    # each of the arguments contains histograms over 3 channels
    mp = np.array([getMin(r, i) for (r, i) in zip(refHist, imageHist)])

    channels = np.array(cv2.split(image))
    mappedChannels = np.array([mp[i,channels[i]] for i in range(0, 3)])
    
    return cv2.merge(mappedChannels).astype(np.uint8)

# compute the histograms on all three channels for all images
def histogram_specification(ref, images, masks):
        cdfs = calc_hist(images, masks)
        mapped = [map_image(images[i], ref[0], cdfs[i, :, :]) for i in range(len(images))]
        return mapped

In [8]:

# the fruit of our labor
images = old_images
refMask = get_masks([ref_image])
refHist = calc_hist([ref_image], refMask)
masks = get_masks(images)
histSpec = histogram_specification(refHist, images, masks)

print "Reference Image"
show_images([ref_image], ["Ref"], scale = 0.9)
print "Original Images"
show_images(images, titles = image_titles, scale = 0.9)
print "Mapped Images"
show_images(histSpec, titles = image_titles, scale = 0.9)

Reference Image

Original Images

Mapped Images

At this point it seems like a good idea to filter out the background noise which appears to be present in image 21118_left: to at least make sure the background is entirely black, so that future steps of the algorithm could mask it out easily.

Not a problem:

In [9]:

def mask_background(image, mask):
    # copy to preserve the original
    im = image.copy()
    im[mask == 0, :] = 0
    return im

maskedBg = [mask_background(h, m) for (h, m) in zip(histSpec, masks)]
print "Masked Background"
show_images(maskedBg, titles=image_titles, scale=0.9)

Masked Background

Now what?! This does not look desirable at all! Since we were histogramming using a mask, and our mask appears to have consumed so much of the actual image, we need to do something about it. Let’s take a look at the mask. This is a problem we handled in the previous post.

In [10]:

mask, _ = find_eye(images[2], 1)
show_images([mask], scale = 0.9)

let’s apply all possible masks and see how they influence the resulting image:

In [11]:

mask, _ = find_eye(images[2], 4)
masks[2] = mask

histSpec = histogram_specification(refHist, [images[2]], [masks[2]])
print "Defective mask: We apply the mask we got by thresholding the Canny filter at 4 prior to color transfer"
show_images(histSpec, titles = image_titles[2:], scale = 0.9)

mask, _ = find_eye(images[2], 1)
masks[2] = mask
histSpec = histogram_specification(refHist, [images[2]], [masks[2]])
print "Corrected Mask: Applying the mask above (Canny threshold = 1) prior to color transfer "
show_images(histSpec, titles = image_titles[2:], scale = 0.9)
maskedBg = [mask_background(h, m) for (h, m) in zip(histSpec, [masks[2]])]
print "Masked Background: Actually mask the background by applying appropriate masks"
show_images(maskedBg, titles=image_titles[2:], scale=0.9)

Defective mask: We apply the mask we got by thresholding the Canny filter at 4 prior to color transfer

Corrected Mask: Applying the mask above (Canny threshold = 1) prior to color transfer

Masked Background: Actually mask the background by applying appropriate masks

Image Processing with OpenCV and Python

The following few posts are a fallout from the Kaggle Diabetic Retinopathy Detection competition. I have been learning image processing with OpenCV 2.4/C++/GPU, Python 2.7, scikit-image, PIL, etc. It was a lot of fun.

Finding the Eye¶

We have a dataset of over 35,000 fundus images and we need to do some processing on them. Seems rational to first learn to locate the object of interest and create a mask of it, so that we can reliably separte the irrelevant background from the actual image we are going to process. This is not a particularly hard problem. In this notebook we use OpenCV 2.4 for the job

Plan of attack¶

The eye itself is a circular object (in theory) clearly separated from the dark background. If we try to find all the edges in the image, the convex hull that unites all these edges should (in theory) be the eye!

In [1]:

# Auxilary function to display a list of images

%matplotlib inline

from matplotlib import pyplot as plt
from os import path
import numpy as np
import cv2
import pandas as pd

def show_images(images,titles=None, scale=1.3):
    """Display a list of images"""
    n_ims = len(images)
    if titles is None: titles = ['(%d)' % i for i in range(1,n_ims + 1)]
    fig = plt.figure()
    n = 1
    for image,title in zip(images,titles):
        a = fig.add_subplot(1,n_ims,n) # Make subplot
        if image.ndim == 2: # Is image grayscale?
            plt.imshow(image)
        else:
            plt.imshow(cv2.cvtColor(image, cv2.COLOR_RGB2BGR))
        a.set_title(title)
        plt.axis("off")
        n += 1
    fig.set_size_inches(np.array(fig.get_size_inches(), dtype=np.float) * n_ims / scale)
    plt.show()

Image Preprocessing¶

Downsample with Gaussian Pyramid
Find the eye.
1. Canny edge finder
2. Find contours
3. Find the joining convex hull
4. Create a mask for this hull. The mask is resized to the final image size

In [2]:

img_path = "/kaggle/retina/train/sample"

image_names = ["16_left.jpeg", "10130_right.jpeg", "21118_left.jpeg"]
image_paths = map(lambda t: path.join(img_path, t), image_names)
images = map(lambda p: cv2.imread(p), image_paths)

# let's see what we've got
image_titles = map(lambda i: path.splitext(i)[0], image_names)
show_images(images, image_titles, scale = 0.8)

Prepare: PyrDown & Blurr¶

In [3]:

#Pyramid Down & blurr
# Easy-peesy
def pyr_blurr(image):
    return cv2.GaussianBlur(cv2.pyrDown(image), (7, 7), 30.)

images = map(lambda i: pyr_blurr(i), images)

Find the Eye¶

In [4]:

# Function to display contours using OpenCV.
def display_contours(image, contours, color = (255, 0, 0), thickness = -1, title = None):
    # Contours are drawn on the original image, so let's make a copy first
    imShow = image.copy()
    for i in range(0, len(contours)):
        cv2.drawContours(imShow, contours, i, color, thickness)
    show_images([imShow], scale=0.7, titles=title)
    
image = images[1]

# this is a good threshold for Canny edge finder, but it does not always work. We will see how to deal with it furhter on.
thresh = 4
# Searcing for the eye
# Let's see how this works setp-by-step
# convert to a one channel image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Canny edge finder
edges = np.array([])
edges = cv2.Canny(gray, thresh, thresh * 3, edges)

# Find contours
# second output is hierarchy - we are not interested in it.
contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Let's see what we've got:
display_contours(image, contours, thickness=2)
print "{:d} points".format(len(np.vstack(np.array(contours))))

26935 points

In [5]:

# Now let's get only what we need out of it
hull_contours = cv2.convexHull(np.vstack(np.array(contours)))
hull = np.vstack(hull_contours)

# we only get one contour out of it, let's see it
display_contours(image, [hull], thickness=3, color=(0, 255, 0))

Finally, in order to create a mask, we simply draw the hull on a totally black image, using “fill” collor white with the drawContours API.

In [6]:

# Now let's create a mask for this image
def createMask((rows, cols), hull):
    # black image
    mask = np.zeros((rows, cols), dtype=np.uint8)
    # blit our contours onto it in white color
    cv2.drawContours(mask, [hull], 0, 255, -1)
    return mask

mask = createMask(image.shape[0:2], hull)

show_images([mask])

Putting it all together:¶

In [7]:

def find_eye(image, thresh = 4):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Canny edge finder
    edges = np.array([])
    edges = cv2.Canny(gray, thresh, thresh * 3, edges)

    # Find contours
    # second output is hierarchy - we are not interested in it.
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    
    # Now let's get only what we need out of it
    hull_contours = cv2.convexHull(np.vstack(np.array(contours)))
    hull = np.vstack(hull_contours)
    
    def createMask((rows, cols), hull):
        # black image
        mask = np.zeros((rows, cols), dtype=np.uint8)
        # blit our contours onto it in white color
        cv2.drawContours(mask, [hull], 0, 255, -1)
        return mask

    mask = createMask(image.shape[0:2], hull)
    
    # returning the hull to illustrate a few issues below
    return mask, hull

In order to apply the mask, we just use numpy magic

In [8]:

def mask_background(image, mask):
    # copy to preserve the original
    im = image.copy()
    im[mask == 0, :] = 0
    return im
    
# find the eye and get the background mask
mask, _ = find_eye(image)

maskedBg = mask_background(image, mask)
print "Masked Background"
show_images([maskedBg], scale=0.9)

Masked Background

How about 21118_left? This was an especially low quality image.

In [9]:

mask, _ = find_eye(images[2])
plt.imshow(mask)
plt.show()

Ok. This is not good. The image was too dark, so we need to set a lower threshold for the Canny edge finder

In [10]:

mask, _ = find_eye(images[2], 1)
plt.imshow(mask)
plt.show()

This is much better. A simple heuristic works pretty well on our data set: if the resulting mask occupies less than 41% percent of the image, we reduce the threshold to 1. We must be careful, though. If we approach the problem with this threshold at the very beginning i.e, set the threshold to 1 for all images, for many images we will catch a lot of background noise in our mask and it will be no mask at all!

In [30]:

# Must be careful not to set the threshold too low from the start.
mask, _ = find_eye(images[1], 1)
plt.imshow(mask)
plt.show()

Kaggle Microsoft Malware Competition: Lessons Learned

Here is a link to the winning solution for this competition. Interestingly enough, this wonderful team found that pixel intensity was a feature that worked really well in malware. This is probably why I did not completely bomb out: my “golden feature” was a 1D LBP pattern extracted from the binary images. It is quite amusing if not fascinating, that malware binaries exhibit enough properties of actual images to enable a relatively accurate classification on the basis of purely image processing features.

Thus, I have extracted a couple of feature classes:

1D LPB pattern from the set of binary files
Count of occurrences of top 150 or so APIs most frequent in the text files making the dataset

This gave me a set of about 400 features (256 from the LBP). I did not perform any dimensionality reduction

The key error was, of course, passing on some application of N-grams. It probably would not be a mistake to say, that all solutions up there on the leaderboard for this contest involved some flavor of N-grams extraction from the disassemblies. The winning solution does an interesting version of it. Unfortunately, I got so wrapped up with looking for a better learning algorithm (or an ensemble of those), that when I finally reached the moment I could start thinking about N-grams (kinda on the surface, really), I was running out of time.

Lessons Learned

Feature engineering is key. Most of the time has to go there. To say that 95% or more is spent getting good features is not an exaggeration. Corollary: Domain knowledge is essential.
Fancy machine learning algorithms without good features are NOT key. If features don’t correlate very well to classes, no amount of learning will matter
Fast failure is essential. Need to be able to try something very quickly, fail, and move on.
Visualizations are indispensable. Investing effort in doing those can save a lot of precious time down the line
Forget everything you learned in a machine learning/data science class. Well, not as crudely as that, but when studying learning methods the huge bias for “clean” datasets is unavoidable, because the studies have to focus on learning algorithms. In reality data sucks badly and whipping it into shape is necessary. Only after that is done, one may start remembering those machine learning/data science classes.

First Kaggle Competition

This was my first competition and I finished (cue the drums) 116th/377. A spectacular failure exceeded only by the failure in my next competition. Nevertheless, here is what I did and overall I was not all that inaccurate on the validation set and probably even less inaccurate in real life, surprise surprise. This is a converted IPython notebook, “lessons learned” in the next post.
Note: All confusion matrices reflect workings of the learners on the validation set

Preparation¶

SciKit¶

We are using the brand new 0.16.0. Upgrade using:

conda install scikit-learn

Data Preparation¶

Make sure trainLabels.csv and mix_lbp.csv are both present in the same directory as the .ipynb (this notebook)

Utilities¶

Developed to faciliate this:

tr_utils.py – often used functions
train_files.py – aids in file manipulation (loading features)
SupervisedLearning.py – thin wrapper arround scikit supervised learning algorithms
train_nn.py – neural net using PyBrain

Misc¶

If <tab> completion is not working, install pyreadline:

easy_install pyreadline

Methodology¶

We are running three classifiers and blending their results:

SVM
Neural net
Calibrated random forest

mix_lbp.csv contains all the features (below) used for the experiments (as well as for final competition participation). It contains all the samples of the training set. For this demo, we split this set 9:1, where 90% is used for training and 10% for validation.

We then run three different classifiers on these sets (training on the larger one and validating on the smaller one), and try to blend the results by simple voting in order to decrease the resulting log loss (computed on the validation dataset).

Feature Selection¶

We are training with two types of features:

Features selected from the binary files as described in the 1dlbp article. These produce a histogram of 256 bins for each file
Features selected from .asm files are binary features, indicating whehter a given file contains a certain Windows API. 141 top APIs are picked for this purpose
Number of subroutines in each file (one additional feature). Giving us a 398-dimensional vector

In [2]:

from SupervisedLearning import SKSupervisedLearning
from train_files import TrainFiles
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import log_loss, confusion_matrix
from sklearn.calibration import CalibratedClassifierCV
from tr_utils import vote
import matplotlib.pylab as plt
import numpy as np
from train_nn import createDataSets, train

An auxilary function¶

Let’s define a function that plots the confusion matrix to see how accurate our predictions really are

In [3]:

def plot_confusion(sl):
    conf_mat = confusion_matrix(sl.Y_test, sl.clf.predict(sl.X_test_scaled)).astype(dtype='float')
    norm_conf_mat = conf_mat / conf_mat.sum(axis = 1)[:, None]

    fig = plt.figure()
    plt.clf()
    ax = fig.add_subplot(111)
    ax.set_aspect(1)
    res = ax.imshow(norm_conf_mat, cmap=plt.cm.jet, 
                    interpolation='nearest')
    cb = fig.colorbar(res)
    labs = np.unique(Y_test)
    x = labs - 1

    plt.xticks(x, labs)
    plt.yticks(x, labs)

    for i in x:
        for j in x:
            ax.text(i - 0.2, j + 0.2, "{:3.0f}".format(norm_conf_mat[j, i] * 100.))
    return conf_mat

Load data from the text file¶

Loaded data contains all of the training examples.

NOTE: Actually almost all. 8 are missing, because binary features could not be extracted from them.

In [4]:

train_path_mix = "./mix_lbp.csv"
labels_file = "./trainLabels.csv"
X, Y_train, Xt, Y_test = TrainFiles.from_csv(train_path_mix, test_size = 0.1)

The last line above does the following:

Loads the examples from the csv files, assuming the labels are in the last column
Splits the results in a training and a validation dataset using sklearn magic, based on the test_size parameter (defaults to 0.1). $test\_size \in [0, 1]$.
Returns two tuples: (training, training_labels), (testing, testing_labels)

Training¶

Training consists of training three models

SVM with an RBF kernel
Random foreset with the calibration classifier installed in 0.16.0 of scikit
Neural net

Train SVM¶

We neatly wrap this into our $\color{green}{SKSupervisedLearning}$ class The procedure is simple:

Instantiate the class with the tuple returned in by the TrainFiles instance or method above and the desired classifier
Apply standard scaling (in scikit this is the Z-score scaling which centers the samples and reduces std to 1). NOTE: This is what the SVM classifier expects
Set training parameters
Call $\color{green}{fit\_and\_validate()}$ to retrieve the $\color{green}{log\_loss}$. This function will compute the log loss on the validation dataset. It will also return the training log loss, which may be interesting but is not out goal. Either way, it is going to be spectacularly small. 🙂

In [5]:

sl = SKSupervisedLearning(SVC, X, Y_train, Xt, Y_test)
sl.fit_standard_scaler()
sl.train_params = {'C': 100, 'gamma': 0.01, 'probability' : True}
ll_trn, ll_tst = sl.fit_and_validate()

print "SVC log loss: ", ll_tst

SVC log loss:  0.0500220173885

You can play with the parameters here to see how log loss changes. SKSupervisedLearning wraps the sklearn grid search technique for searching for optimal parameters in one call. You can take a look at the implementation details.

Let’s plot the confusion matrix to see how well we are doing (values inside squares are %s): (change magic below to %matplotlib qt to get the out-of-browser graph)

In [6]:

%matplotlib inline
conf_svm = plot_confusion(sl)

As expected, we are not doing so well in class 5 where there are very few samples.

Train Neural Net`¶

This is a fun one, I promise. 🙂

The neural net is built by PyBrain, has just one hidden layer which is equal to $\frac{1}{4}$th of the input layer. The hidden layer activation is sigmoid, the output – softmax (since this is a multi-class neural net), and has bias units for the hidden and the output layers. We use the PyBrain $\color{green}{buildNetwork()}$ function that builds the network in one call.

NOTE: We are still using all the scaled features to train the neural net

I am setting %matplotlib to qt so training can be watched in real time You will see each training epoch charted. The graph on the left shows % error, the one on the right – log loss.

You can play with the “test error” or “epochs” parameter to control how long it runs.Limiting it to just 10 epochs for this experiment

In [6]:

%matplotlib qt
trndata, tstdata = createDataSets(sl.X_train_scaled, Y_train, sl.X_test_scaled, Y_test)
fnn = train(trndata, tstdata, epochs = 10, test_error = 0.07, momentum = 0.15, weight_decay = 0.0001)

epoch:    1   train error:  6.28%   test error:  8.29%  test logloss: 0.3495  train logloss: 0.2659
epoch:    2   train error:  3.98%   test error:  5.52%  test logloss: 0.2491  train logloss: 0.1615
epoch:    3   train error:  2.89%   test error:  5.16%  test logloss: 0.2025  train logloss: 0.1166
epoch:    4   train error:  2.55%   test error:  4.51%  test logloss: 0.1740  train logloss: 0.0970
epoch:    5   train error:  2.02%   test error:  4.60%  test logloss: 0.1768  train logloss: 0.0806
epoch:    6   train error:  1.71%   test error:  4.05%  test logloss: 0.1618  train logloss: 0.0735
epoch:    7   train error:  1.64%   test error:  3.87%  test logloss: 0.1584  train logloss: 0.0656
epoch:    8   train error:  1.54%   test error:  3.41%  test logloss: 0.1436  train logloss: 0.0600
epoch:    9   train error:  1.23%   test error:  3.13%  test logloss: 0.1382  train logloss: 0.0529
epoch:   10   train error:  1.20%   test error:  3.41%  test logloss: 0.1400  train logloss: 0.0536

Train Random Forest with Calibration¶

Finally, we train the random forest (which happens to train in seconds) with the calibration classifier (which takes 2 hours or so)

Random forests are very accurate, the problem is that they maker over-confident predictions (or at least that is what the predict_proba function, which is supposed to return probabilities of each class gives us). So, god forbid we are ever wrong! Since it predicts the probability of 0 on the correct class, log loss goes to infinity. Calibration classifier makes predict_proba return something sane.

In [7]:

sl_ccrf = SKSupervisedLearning(CalibratedClassifierCV, X, Y_train, Xt, Y_test)
sl_ccrf.train_params = \
    {'base_estimator': RandomForestClassifier(**{'n_estimators' : 7500, 'max_depth' : 200}), 'cv': 10}
sl_ccrf.fit_standard_scaler()
ll_ccrf_trn, ll_ccrf_tst = sl_ccrf.fit_and_validate()

print "Calibrated log loss: ", ll_ccrf_tst

Calibrated log loss:  0.0614970801316

As you can see, we are simply wrapping the $\color{green}{RandomForestClassifier}$ in the $\color{green}{CalibratedClassifier}$. Plot the matrix (after a couple of hours):

In [8]:

%matplotlib inline
conf_ccrf = plot_confusion(sl_ccrf)

Voting¶

Now we can gather the results of our experiments and blend them. We use a simple weighted voting scheme for that.

$\color{green}{vote}$ function is implemented in tr_utils.py.

Here we are trying to balance the weights of SVM and calibrated RF.

In [9]:

%matplotlib inline
x = 1. / np.arange(1., 6)
y = 1 - x

xx, yy = np.meshgrid(x, y)
lls1 = np.zeros(xx.shape[0] * yy.shape[0]).reshape(xx.shape[0], yy.shape[0])
lls2 = np.zeros(xx.shape[0] * yy.shape[0]).reshape(xx.shape[0], yy.shape[0])

for i, x_ in enumerate(x):
    for j, y_ in enumerate(y):
        proba = vote([sl.proba_test, sl_ccrf.proba_test], [x_, y_])
        lls1[i, j] = log_loss(Y_test, proba)

        proba = vote([sl.proba_test, sl_ccrf.proba_test], [y_, x_])
        lls2[i, j] = log_loss(Y_test, proba)

fig = plt.figure()
plt.clf()
ax = fig.add_subplot(121)
ax1 = fig.add_subplot(122)

ax.set_aspect(1)
ax1.set_aspect(1)

res = ax.imshow(lls1, cmap=plt.cm.jet, 
                interpolation='nearest')
res = ax1.imshow(lls2, cmap=plt.cm.jet, 
                interpolation='nearest')

cb = fig.colorbar(res)

The graphs show a “blended” log loss. A matrix on the left blends SVM and RF log losses with weights “favoring” SVM, and the one on the right “favors” RF.