Kaggle: Open Source Competitions

Introduction

Kaggle is an online platform that hosts Machine Learning competitions. The Kaggle rank distribution looks like this:

Rank	#Holders
Grandmaster	262
Master	1,843
Expert	8,191
Contributor	70,159
Novice	102,352

Note: One conclusion that I would draw from the rank distribution is that few people in the ML field are persistent, and that there is a good deal of hype and surface-scratching going on.

Build a good ML model in 3 stages

Turn your business problem into an ML problem (build the dataset right)
Build a good ML model
- pick the right approach
- do good feature engineering
- statistically evaluate the model using (k-fold) Cross-Validation
- use a regularizer or dropout to avoid overfitting
Productionize the model
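The cross-validation and regularization steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a recipe from the interview: the data is synthetic, the regularizer is a plain L2 (ridge) penalty on a linear model, and the helper names `ridge_fit` and `kfold_mse` are made up for this example.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression: solve (X^T X + alpha * I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def kfold_mse(X, y, alpha, k=5, seed=0):
    """Return the validation mean squared error of each of the k folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))      # shuffle once, then split
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        val_idx = folds[i]                 # fold i is held out
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train_idx], y[train_idx], alpha)
        residuals = X[val_idx] @ w - y[val_idx]
        scores.append(float(np.mean(residuals ** 2)))
    return scores

# Synthetic data (assumption for illustration): few samples, many features,
# which is a setting where an unregularized fit tends to overfit.
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 40))
true_w = np.zeros(40)
true_w[:5] = [2.0, -1.0, 0.5, 3.0, -2.0]
y = X @ true_w + rng.normal(scale=1.0, size=60)

# Compare regularization strengths by their cross-validated error.
for alpha in (0.0, 1.0, 10.0):
    scores = kfold_mse(X, y, alpha)
    print(f"alpha={alpha:>5}: mean MSE={np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```

Reporting the spread across folds alongside the mean is what makes the comparison statistical: a difference between two settings that is smaller than the fold-to-fold variation is probably noise.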

High performing models

Gradient boosting (e.g. XGBoost) tends to slightly outperform random forests, which are also an ensemble method
Random forests (the ensemble variant of decision trees)
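One way to check this kind of claim on your own data is to cross-validate both model families side by side. The sketch below assumes scikit-learn is installed and uses its `GradientBoostingClassifier` as a stand-in for XGBoost; the dataset is synthetic, so the numbers it prints say nothing about which method wins in general.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem (assumption for illustration).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=8, random_state=0
)

models = {
    # scikit-learn's gradient boosting, used here in place of XGBoost
    "gradient boosting": GradientBoostingClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# 5-fold cross-validated accuracy for each model on the same data.
results = {}
for name, model in models.items():
    results[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {results[name].mean():.3f} +/- {results[name].std():.3f}")
```

Both models are left at mostly default settings here; in a competition setting the gap between them usually depends on how carefully each one is tuned.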

What separates winning entries from others

Good feature engineering. Here are some creative examples from top submissions:
- Extend a text dataset through back-translation with Google Translate: $A \rightarrow B \rightarrow A'$, with $A$ and $B$ being the same sentence in different languages. The round trip is lossy (translating to another language and back again won’t necessarily return the original sentence), so $A'$ can be added as a new example.
- Extract a lot (>70) of different features from just a date or timestamp, such as the season (summer/winter) or whether it’s a weekday or weekend, and merge in other data such as events (holiday or not?)
Appropriate and rich image augmentation for computer vision (CV) tasks
For Reinforcement Learning competitions with a simulator that only offers a couple of test runs a day, an excellent strategy is to actually rebuild the simulator (mimic the rewards and structure as closely as possible within a reasonable timespan) to have a local evaluation tool available for tuning
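The date and timestamp bullet above can be sketched with the standard library alone. This is a small illustration, not the feature set of any actual submission: the `HOLIDAYS` set is a hypothetical stand-in for an external events table, the season mapping assumes northern-hemisphere meteorological seasons, and the function name `timestamp_features` is made up for this example.

```python
from datetime import date, datetime

# Hypothetical holiday calendar; in practice this would be merged in
# from an external events dataset.
HOLIDAYS = {date(2021, 12, 25), date(2022, 1, 1)}

def season(month):
    """Map a month number to a season (northern hemisphere, meteorological)."""
    if month in (12, 1, 2):
        return "winter"
    if month in (3, 4, 5):
        return "spring"
    if month in (6, 7, 8):
        return "summer"
    return "autumn"

def timestamp_features(ts):
    """Derive several model features from a single datetime value."""
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "hour": ts.hour,
        "day_of_week": ts.weekday(),            # Monday = 0, Sunday = 6
        "day_of_year": ts.timetuple().tm_yday,
        "week_of_year": ts.isocalendar()[1],    # ISO week number
        "quarter": (ts.month - 1) // 3 + 1,
        "is_weekend": ts.weekday() >= 5,
        "season": season(ts.month),
        "is_holiday": ts.date() in HOLIDAYS,
    }

features = timestamp_features(datetime(2021, 12, 25, 14, 30))
for name, value in features.items():
    print(f"{name}: {value}")
```

Getting from a dozen features like these to the 70 or more mentioned above would typically involve further derived columns (cyclical encodings, distance to the next holiday, lags), which are left out here to keep the sketch short.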

Reinforcement Learning in Kaggle

Kaggle now also invests in Reinforcement Learning through simulation-based challenges, like the Lux AI competition (see image above).

References

A lot of the info in this blog post comes from an interview with Anthony Goldbloom, founder of Kaggle. He talks about approaches that are commonly used by winning competitors and how submissions evolved over the years as Data Science matured and Deep Learning entered the field as another competitive approach.