Kaggle: Open Source Competitions
Introduction
Kaggle is an online platform that hosts Machine Learning competitions. The Kaggle rank distribution looks like this:
| Rank | # Holders |
| --- | --- |
| Grandmaster | 262 |
| Master | 1,843 |
| Expert | 8,191 |
| Contributor | 70,159 |
| Novice | 102,352 |
Note: One conclusion that I would draw from the rank distribution is that there are not that many persistent people in the ML field, and that there is a rather big amount of hype and surface-level engagement going on.
Build a good ML model in 3 stages
- Turn your business problem into a ML problem (build the dataset right)
- Build a good ML model
- pick the right approach
- do good feature engineering
- statistically evaluate the model using (k-fold) Cross-Validation
- use a regularizer or dropout to avoid overfitting
- Productionize the model
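The evaluation step of stage 2 can be sketched with scikit-learn's k-fold cross-validation. The dataset and model below are placeholders; any estimator plugs into `cross_val_score` the same way.

```python
# Sketch of the evaluation step: k-fold cross-validation with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a real competition dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold CV: train on 4/5 of the data, validate on the held-out fold, repeat.
scores = cross_val_score(model, X, y, cv=5)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean and spread across folds gives a much more honest estimate of generalization than a single train/test split.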
High performing models
- Gradient boosting (XGBoost) seems to outperform random forests (also an ensemble method) by a little bit
- Random forests (the ensemble variant of decision trees)
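A minimal way to compare the two ensemble families on your own data is to cross-validate both under identical conditions. The sketch below uses scikit-learn's built-in implementations as stand-ins for XGBoost, on a synthetic dataset; results on toy data will not mirror competition leaderboards.

```python
# Compare gradient boosting and random forests under identical CV conditions.
# scikit-learn's GradientBoostingClassifier stands in for XGBoost here.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

results = {}
for name, model in [
    ("gradient boosting", GradientBoostingClassifier(random_state=0)),
    ("random forest", RandomForestClassifier(random_state=0)),
]:
    scores = cross_val_score(model, X, y, cv=5)
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f}")
```

Which family wins depends heavily on the dataset and on tuning, which is why winning entries typically try both.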
What separates winning entries from the rest
- Good feature engineering. Here are some creative examples from top submissions:
- Extend a text dataset via back-translation with Google Translate (non-linear, because if you translate to another language and then back again the result won’t necessarily be the original sentence): $A \rightarrow B \rightarrow A’$ with $A$ and $B$ being the same sentence in different languages.
- Extract many (>70) different features from just a date or timestamp, such as the season (summer/winter) or whether it’s a weekday or weekend, and merge with external data such as events (holiday or not?)
- Appropriate and rich image augmentation for computer vision (CV) tasks
- For Reinforcement Learning competitions with a simulator that only offers a couple of test runs a day, an excellent strategy is to rebuild the simulator (mimic the rewards and structure as closely as possible within a reasonable timespan) to have a local evaluation tool available for tuning
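The timestamp idea above is easy to sketch with pandas. The column names and the holiday table here are made up for illustration; a real submission would derive far more features than these.

```python
# Sketch: deriving calendar features from a single timestamp column.
import pandas as pd

df = pd.DataFrame({"timestamp": pd.to_datetime([
    "2021-07-24 14:30", "2021-12-25 09:00", "2021-03-01 22:15",
])})

ts = df["timestamp"].dt
df["month"] = ts.month
df["hour"] = ts.hour
df["dayofweek"] = ts.dayofweek          # 0 = Monday
df["is_weekend"] = ts.dayofweek >= 5
df["season"] = ts.month % 12 // 3       # 0=winter, 1=spring, 2=summer, 3=autumn

# Merge with external data, e.g. a (hypothetical) holiday list:
holidays = {pd.Timestamp("2021-12-25").date()}
df["is_holiday"] = ts.date.isin(holidays)
print(df)
```

From there, adding features for quarter, week of year, time-of-day buckets, distance to the nearest holiday, and so on quickly gets you into the dozens.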
Reinforcement Learning in Kaggle
Kaggle now also invests in Reinforcement Learning through simulation-based challenges, such as the Lux AI competition.
References
A lot of the info in this blog post comes from an interview with Anthony Goldbloom, founder of Kaggle. He talks about approaches that are commonly used by winning competitors and how submissions have evolved over the years as Data Science matured and Deep Learning entered the field as another competitive approach.