Here is a link to the winning solution for this competition. Interestingly enough, this wonderful team found that pixel intensity was a feature that worked really well for malware classification. This is probably why I did not completely bomb out: my “golden feature” was a 1D LBP pattern extracted from the binary images. It is quite amusing, if not fascinating, that malware binaries exhibit enough properties of actual images to enable relatively accurate classification on the basis of purely image-processing features.
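As an aside, the pixel-intensity idea rests on rendering the raw bytes of a binary as a grayscale image, one byte per pixel. Below is a minimal sketch of that rendering plus a few intensity statistics; the row width and the choice of statistics are illustrative assumptions on my part, not the winning team's actual parameters.

```python
import numpy as np

def binary_to_image(path, width=256):
    """Render a binary file as a grayscale image: one byte = one pixel.

    The width of 256 is an arbitrary illustrative choice, not the
    winning team's actual setting.
    """
    raw = np.fromfile(path, dtype=np.uint8)
    rows = len(raw) // width          # drop the trailing partial row
    return raw[: rows * width].reshape(rows, width)

def intensity_features(img):
    """A few simple pixel-intensity statistics as stand-in features."""
    return [img.mean(), img.std(), float(np.median(img))]
```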
Thus, I extracted a couple of feature classes:
- A 1D LBP pattern extracted from the set of binary files (sketched below)
- Counts of occurrences of the 150 or so APIs that appear most frequently in the text files making up the dataset (also sketched below)
This gave me a set of about 400 features (256 from the LBP). I did not perform any dimensionality reduction.
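Since the 1D LBP histogram was the “golden feature,” here is a minimal sketch of one way to compute it: treat the file's bytes as a 1D signal, compare each byte against its four neighbors on either side, and pack the eight comparison bits into a code, which gives the 256-bin histogram mentioned above. The neighborhood radius and bit ordering are illustrative assumptions, not necessarily the exact variant used.

```python
import numpy as np

def lbp_1d_histogram(path, radius=4):
    """256-bin histogram of 1D LBP codes over a binary file.

    Each byte is compared with its `radius` neighbors on each side;
    the 2 * radius comparison bits (8 here) form the LBP code.
    """
    sig = np.fromfile(path, dtype=np.uint8)
    if sig.size < 2 * radius + 1:
        return np.zeros(256)
    center = sig[radius : sig.size - radius]
    code = np.zeros(center.shape, dtype=np.uint8)
    offsets = [o for o in range(-radius, radius + 1) if o != 0]
    for bit, off in enumerate(offsets):
        neighbor = sig[radius + off : sig.size - radius + off]
        code |= (neighbor >= center).astype(np.uint8) << bit
    hist = np.bincount(code, minlength=256).astype(np.float64)
    return hist / hist.sum()  # normalize so file size does not dominate
```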
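The API-count features lend themselves to a simple two-pass scheme: one pass over the corpus to select the most frequent API names, a second pass to build fixed-order count vectors per file. The regex below, matching `call` instructions, is a hypothetical simplification of how API calls appear in the disassembly listings; the real files need sturdier parsing.

```python
from collections import Counter
import re

# Hypothetical pattern for API calls in the disassembly text; the
# actual extraction depends on the listing format.
API_RE = re.compile(r"\bcall\s+(?:ds:)?(\w+)", re.IGNORECASE)

def count_apis(path):
    """Count API-like call targets in one disassembly file."""
    with open(path, errors="ignore") as f:
        return Counter(API_RE.findall(f.read()))

def build_vocabulary(paths, top_k=150):
    """First pass: pick the top_k most frequent API names in the corpus."""
    total = Counter()
    for p in paths:
        total.update(count_apis(p))
    return [name for name, _ in total.most_common(top_k)]

def api_feature_vector(path, vocab):
    """Second pass: fixed-order count vector for one file."""
    counts = count_apis(path)
    return [counts[name] for name in vocab]
```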
The key error was, of course, passing on any application of N-grams. It probably would not be a mistake to say that all the solutions near the top of the leaderboard for this contest involved some flavor of N-gram extraction from the disassemblies. The winning solution does an interesting version of it. Unfortunately, I got so wrapped up in looking for a better learning algorithm (or an ensemble of those) that by the time I could finally start thinking about N-grams (only superficially, really), I was running out of time.
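For completeness, here is roughly what a basic opcode N-gram extraction from a disassembly could look like. The one-mnemonic-per-line parsing is a hypothetical simplification (real IDA-style listings prefix each line with section and address), and the winning solution's actual N-gram pipeline is considerably more involved.

```python
from collections import Counter
import re

def opcode_ngrams(asm_path, n=4):
    """Count opcode n-grams in a disassembly listing.

    Assumes, for illustration, that the instruction mnemonic is the
    first lowercase token on a line; real .asm files need a real parser.
    """
    opcodes = []
    with open(asm_path, errors="ignore") as f:
        for line in f:
            m = re.match(r"\s*([a-z]{2,7})\b", line)
            if m:
                opcodes.append(m.group(1))
    grams = zip(*(opcodes[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)
```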
Lessons Learned
- Feature engineering is key. Most of the time has to go there; to say that 95% or more of it is spent getting good features is not an exaggeration. Corollary: domain knowledge is essential.
- Fancy machine learning algorithms without good features are NOT key. If the features don’t correlate well with the classes, no amount of learning will help.
- Fast failure is essential. You need to be able to try something quickly, fail, and move on.
- Visualizations are indispensable. Investing effort in them early can save a lot of precious time down the line.
- Forget everything you learned in a machine learning/data science class. Well, not quite as crudely as that, but courses on learning methods carry an unavoidable bias toward “clean” datasets, because they have to focus on the learning algorithms themselves. In reality, data sucks badly, and whipping it into shape is necessary. Only once that is done can one start remembering those machine learning/data science classes.