Week 8 - Galvanize Data Science Bootcamp

Week 8 is the last week of the formalized curriculum at Galvanize. It covers visualization, web app development with Flask and Bootstrap, and a two-day challenge to build a full end-to-end data pipeline for predicting fraud on EventBrite's dataset.

We started the week by extending our visualization skills with Bokeh. Bokeh enables interactive visualizations in modern web browsers, giving you d3.js-style graphics without the messy JavaScript needed to work with d3.js directly. The graphics are sick, and they are simple to build. The afternoon was flexible: explore a dataset and extract insight through plotting. I downloaded and plotted crime data for the Seattle area. The result is below: a heat map of crime rates across the various districts of the city. Working with geographic data is difficult. Shapely was required to work with the shapefiles for districting, while descartes was required to patch the polygons onto the plot. Furthermore, these crime rates need to be normalized by the population within each district, which may or may not share boundaries with the census data.

[Figure: Seattle_crimes.png, a heat map of crime rates by Seattle district]
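For anyone curious how the map above was pieced together, here is a minimal sketch of the shapely + descartes approach described above. The file name seattle_districts.geojson, the district property key, and the crime_rates values are all hypothetical placeholders, and the rates are assumed to be pre-normalized to [0, 1].

```python
# Minimal choropleth sketch: shapely parses the district boundaries,
# descartes turns each polygon into a matplotlib patch.
import json

import matplotlib.pyplot as plt
from matplotlib import cm
from descartes import PolygonPatch
from shapely.geometry import shape

with open("seattle_districts.geojson") as f:   # hypothetical boundary file
    districts = json.load(f)["features"]

# Hypothetical, pre-normalized crime rates per district (0 to 1).
crime_rates = {"Ballard": 0.12, "Capitol Hill": 0.35, "Downtown": 0.71}

fig, ax = plt.subplots(figsize=(8, 10))
for feature in districts:
    name = feature["properties"]["district"]   # assumed property key
    rate = crime_rates.get(name, 0.0)
    polygon = shape(feature["geometry"])       # shapely geometry from GeoJSON
    ax.add_patch(PolygonPatch(polygon, fc=cm.Reds(rate), ec="gray"))

ax.set_aspect("equal")
ax.autoscale_view()                            # fit axes to the added patches
ax.set_title("Seattle crime rates by district")
plt.savefig("Seattle_crimes.png")
```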


Next we learned Flask and Bootstrap, tools for creating nice-looking web apps and dashboards with limited web development or JavaScript knowledge. My guess is that this is in the curriculum specifically so we can make flashy apps for our projects. Either way, they're quick to learn and simple to use.
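To give a flavor of how little code this takes, here is a minimal Flask sketch that serves a single Bootstrap-styled page. The page content is hypothetical; a real dashboard would render templates and pull data from a database.

```python
from flask import Flask

app = Flask(__name__)

# One hypothetical page that pulls Bootstrap from its CDN for styling.
PAGE = """
<!doctype html>
<html>
<head>
  <link rel="stylesheet"
        href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css">
</head>
<body>
  <div class="container">
    <h1>Fraud Dashboard</h1>
    <p class="text-muted">Recent events and their risk levels would go here.</p>
  </div>
</body>
</html>
"""

@app.route("/")
def index():
    return PAGE

if __name__ == "__main__":
    app.run(debug=True)
```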

On to the big-ticket item for the week: a two-day challenge to build a full end-to-end pipeline for predicting fraud on EventBrite's dataset. The deliverable was a web app dashboard to track and flag fraudulent events in real time (for the real-time analysis, an instructor set up a server that served a new event every few seconds). We worked in teams of four and began executing. The first day consisted mainly of feature engineering and modeling. In building an application like this, we were very aware that our models, vectorizers, and pipeline had to scale and align, which is a difficult task on a team of four. After careful planning and a lot of whiteboarding, we had a working vectorizer and a model that bucketed events into low, medium, and high likelihood of fraud based on the model's probability output.
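As a rough illustration of that last step, here is a sketch of bucketing a classifier's fraud probability into the three risk levels. The 0.3 and 0.7 cutoffs are hypothetical, not the thresholds our team actually used.

```python
def risk_level(prob, low=0.3, high=0.7):
    """Map a model's P(fraud) to a categorical risk label.

    The 0.3/0.7 cutoffs are illustrative placeholders.
    """
    if prob < low:
        return "low"
    if prob < high:
        return "medium"
    return "high"

# e.g. probs = model.predict_proba(X)[:, 1] with a scikit-learn classifier
print([risk_level(p) for p in (0.05, 0.5, 0.9)])  # ['low', 'medium', 'high']
```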

The second day was focused on building the web app and dashboard, along with a backend Postgres database to store risk predictions. We connected to the streaming events from the server using the requests module, then vectorized each event and predicted its risk. Risk scores and event details were stored in the Postgres database and surfaced through a Flask dashboard that displayed the events most likely to be fraudulent.
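Here is a hedged sketch of that polling loop, assuming a pickled vectorizer and model from day one and a predictions table in Postgres. The endpoint URL, file names, event fields, and table schema are all hypothetical placeholders.

```python
import pickle
import time

import psycopg2
import requests

# Load the day-one vectorizer and model (file names hypothetical).
with open("vectorizer.pkl", "rb") as f:
    vectorizer = pickle.load(f)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

conn = psycopg2.connect(dbname="fraud", user="postgres")  # assumed local DB

def score_and_store(url="http://example.com/event"):  # hypothetical endpoint
    """Fetch one streamed event, predict its fraud risk, store the result."""
    event = requests.get(url).json()
    features = vectorizer.transform([event["description"]])  # assumed field
    prob = model.predict_proba(features)[0, 1]                # P(fraud)
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO predictions (event_id, risk) VALUES (%s, %s)",
            (event["object_id"], float(prob)),
        )

while True:
    score_and_store()
    time.sleep(5)  # the server served a new event every few seconds
```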

Building this full end-to-end product was a great experience on several fronts:

  1. Working with a team to build a scalable end-to-end product
  2. Building a working dashboard that executives could use to make decisions
  3. Attacking all parts of the data pipeline (exploratory analysis, feature engineering, model building, web app development, backend database engineering) to create a single data product

Next week begins three weeks of our personal capstone projects. I will write a single post on my project once it's complete. Can't wait!