Week 4 – Galvanize Data Science Bootcamp

This week, we covered non-parametric supervised machine learning (with kNN also usable in an unsupervised setting) via the following methods:

  • SVMs
  • K Nearest Neighbors
  • Decision Trees
  • Bagging
  • Random Forests
  • Boosting (AdaBoost and gradient boosting)

Mathematically, SVMs are rather beautiful. They carry the same mathematical rigor as linear and logistic regression, numerically minimizing a well-defined cost function, yet they can separate data non-linearly by using the kernel trick to implicitly map the data into a higher-dimensional space where a linear boundary exists.
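To make the kernel idea concrete, here is a minimal sketch using scikit-learn's SVC with an RBF kernel on a toy two-circles dataset; the dataset and parameter values are purely illustrative, not what we used in the exercises.

```python
# Minimal sketch (assumes scikit-learn): an RBF-kernel SVM separating data
# that is not linearly separable in its original 2-D space.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: no straight line can split the two classes in 2-D.
X, y = make_circles(n_samples=500, factor=0.3, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel implicitly maps points into a higher-dimensional space
# where a separating hyperplane exists; C and gamma control the fit.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```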

Though decision trees are not as elegant mathematically, in practice they simply work and are conceptually easy to understand. Bootstrap sampling methods such as bagging and random forests are extensions of the tree concept: you build many overfit trees and average them, resulting in a single model with good accuracy and low variance. Boosting achieves a similar result using the opposite approach: start with weak, shallow trees that have poor accuracy but low variance, then slowly add more trees fit to the weighted misclassifications or residuals of the previous trees' decisions. This can take time to build, but it eventually reaches very good accuracy while keeping variance low from the start.
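Roughly, the contrast between the two ensemble ideas looks like this in scikit-learn; the toy dataset and hyperparameters below are illustrative only, not tuned values from our exercises.

```python
# Sketch contrasting the two ensemble strategies: a random forest averages
# many deep, overfit trees, while gradient boosting sequentially adds many
# shallow, weak trees fit to the previous trees' errors.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Variance reduction: fully grown trees on bootstrap samples, then averaged.
forest = RandomForestClassifier(n_estimators=200, max_depth=None, random_state=0)

# Bias reduction: shallow trees added one at a time at a small learning rate.
boosted = GradientBoostingClassifier(n_estimators=200, max_depth=3,
                                     learning_rate=0.1, random_state=0)

for name, model in [("random forest", forest), ("gradient boosting", boosted)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, "mean CV accuracy:", scores.mean())
```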

This week’s exercises were tough. They ran late, and additional reading at home kept the days quite busy. Aside from the new material above, we also began learning new methods for optimizing our models as we develop them, using scikit-learn’s Pipeline and GridSearchCV classes. Galvanize does a very good job of focusing on the core concepts of a new algorithm while introducing a new best practice for building models in each of the exercises.
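As a rough sketch of that pattern (the breast-cancer dataset, the scaling step, and the SVC model below are placeholders, not the actual exercise):

```python
# Sketch of the Pipeline + GridSearchCV workflow: chain preprocessing and a
# model, then tune hyperparameters of any pipeline step by name.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# The pipeline ensures preprocessing is refit inside each CV fold,
# preventing information from the validation fold leaking into training.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.001]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```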

The week ended by splitting into teams of four and competing in Kaggle’s Blue Book for Bulldozers challenge. Galvanize set up a scoring system so teams could submit models on test data and automatically post scores to our cohort’s Slack channel – the competition was on! With only four hours, our team attacked the competition by splitting into two pairs: one pair focused on building the pipeline for testing, while the other focused on cleaning up the data. After three hours, our model pipeline was built and we had clean data to start sending in. That left an hour to try linear regression, random forests, and gradient boosting. We optimized model parameters using GridSearchCV and won the competition within our cohort using a random forest. After the competition we were surprised to find that we were within the top 100 models submitted to Kaggle’s competition. Reflecting on this experience, it seemed clear that splitting the work this way was the most efficient strategy, allowing us to try several models and still leave time for optimization.