Week 6 - Galvanize Data Science Bootcamp

Week 6 was our last week of the very intense core curriculum. As I write this post, we are entering our break week (an entire week off to review previous material, flesh out ideas for capstone projects, and prepare for recruiting). We then have two more weeks of specialized topics and a couple of weeks for capstone projects.

This week's material was fun and lighter than Week 5's -- thank the lord. The focus was on various techniques for recommendation systems using the GraphLab module for Python, which I thought was nicely written and a very fast numerical engine for calculating recommendations. We also did some more work on NLP using non-negative matrix factorization (NMF) -- a constrained form of U-V decomposition that allows a more intuitive understanding of the latent features in unsupervised learning. In our sprints, we used the same NYT data as in Week 5 and were able to discover much better groupings of articles with NMF modeling.
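For a flavor of what that looked like, here is a minimal sketch of NMF topic extraction with scikit-learn (not our exact sprint code -- the corpus, topic count, and vectorizer settings are stand-ins):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# docs would be the list of NYT article texts
docs = ["...article text...", "...article text..."]

# Build the TF-IDF document-term matrix V (docs x terms)
tfidf = TfidfVectorizer(max_features=5000, stop_words='english')
V = tfidf.fit_transform(docs)

# Factor V ~ W * H with non-negative entries:
# W (docs x topics) gives topic weights per article,
# H (topics x terms) gives term weights per topic
nmf = NMF(n_components=10, random_state=42)
W = nmf.fit_transform(V)
H = nmf.components_

# Top terms per latent topic -- these read like article groupings
terms = tfidf.get_feature_names_out()
for k, row in enumerate(H):
    top = [terms[i] for i in row.argsort()[::-1][:8]]
    print(f"Topic {k}: {', '.join(top)}")
```

Because W and H are non-negative, each topic is a purely additive mix of terms, which is why the latent features come out so much more interpretable than a plain SVD.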

We also completed two group challenges this week. In my opinion, these are the most fun days in the program. The idea is to split into teams of four and compete for the best scores on regression/classification problems. Our challenges this week were:

  1. Churn prediction on ride-share data (super exciting dataset! Spoiler alert: it wasn't from Flywheel :) ). The idea was to predict whether a user would churn or not.
  2. A recommender model to give the best suggestions to new users on new movies that were not used in training. We were given user info (age, sex, location, etc.) and movie info (title, year, genre, etc.) to allow for learning based on user profiles and movie types.

Churn prediction challenge:

Whoa, this was fun. Our dataset was much cleaner than in our previous Kaggle Blue Book challenge, so we got to spend more time modeling and less time cleaning -- yay! My group focused on modifying the metric used to rate our model. In the problem statement, it was OK to have false positives (people who have not churned but whom we predict have), but very important to limit false negatives (people who have churned but whom we don't predict to). By optimizing our models to these constraints, we found the best classifiers to be random forest and AdaBoost ensemble models. Unfortunately, it was hard to compare our results against other groups because we optimized to a different metric, though I expect our approach to be in line with expectations in business.
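One simple way to encode that asymmetry (a sketch of the general approach rather than our exact code -- the beta value and hyperparameters here are placeholders) is to score models with an F-beta metric, where beta > 1 weights recall (catching churners) more heavily than precision:

```python
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import cross_val_score

# beta > 1 penalizes false negatives (missed churners)
# more than false positives
churn_scorer = make_scorer(fbeta_score, beta=2)

models = {
    'random_forest': RandomForestClassifier(n_estimators=500),
    'adaboost': AdaBoostClassifier(n_estimators=200),
}

# X, y would be the cleaned ride-share features and churn labels
for name, model in models.items():
    scores = cross_val_score(model, X, y, scoring=churn_scorer, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Cross-validating every candidate against the same recall-weighted scorer is what let us pick the ensembles honestly, even though our numbers weren't directly comparable to groups optimizing plain accuracy.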

Movie-to-user recommendation challenge:

This challenge was pretty easy using GraphLab, but it became very frustrating to debug errors, as GraphLab does not notify you if datatypes are incorrect for modeling -> two hours later, we had something working :). We were able to achieve competitive accuracy scores for rating new movies and new users based on their associated feature matrices. One key to our good ratings was that we built the test split of our cross-validated training data to consist mostly of unknown movies and unknown users, which is exactly the cold-start situation in the real test data. This enabled our model iterations to more accurately reflect real improvements in scores on the real test data.
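Roughly, the setup looked like the sketch below. Hedging here: the file and column names are illustrative and I'm going from memory on the GraphLab API -- the key pieces are the side-data arguments, which let the factorization model use user and movie features to score users and items it never saw in training, and the holdout built from mostly-unseen users:

```python
import graphlab as gl

# ratings: (user_id, movie_id, rating); users/movies hold side features
ratings = gl.SFrame.read_csv('ratings.csv')
users = gl.SFrame.read_csv('users.csv')      # age, sex, location, ...
movies = gl.SFrame.read_csv('movies.csv')    # title, year, genre, ...

# Hold out a validation set dominated by unseen users, mirroring
# the cold-start test data we were actually graded on
holdout_users = ratings['user_id'].unique().sample(0.2)
valid = ratings.filter_by(holdout_users, 'user_id')
train = ratings.filter_by(holdout_users, 'user_id', exclude=True)

# user_data/item_data let the model fall back on profile features
# for users and movies with no ratings in the training set
model = gl.recommender.factorization_recommender.create(
    train,
    user_id='user_id', item_id='movie_id', target='rating',
    user_data=users, item_data=movies)

print(model.evaluate_rmse(valid, target='rating'))
```

With a validation split like this, an improvement in RMSE actually meant the model was getting better at the cold-start problem, rather than just memorizing users it had already seen.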

Finally, the week ended with a resume workshop and branding exercises. Building our own brands and selling our skills is key to going out and getting the job you want. Now it's time to relax and review during break week!