Week 7 - Galvanize Data Science Bootcamp

After a one-week break, we are now back in full swing. Unfortunately, we lost two cohort members during the break: they interviewed at companies and accepted offers. I'm really proud of them and glad to see the tangible results the Galvanize Data Science program can lead to.

The focus of Week 7 was machine learning at scale. The days this week focused on:

  • Threading and multiprocessing
  • MapReduce
  • Apache Spark
  • Apache Spark on AWS

I was very excited to get my hands on MapReduce and Apache Spark. It is such a hot skillset to have, and most big data employers want data scientists who have this experience.

The MapReduce framework came out of Google in 2004 and is based on the map and reduce functions commonly used in functional programming. The following example converts a string into an acronym by:

  1. Mapping a lambda function over the words to extract the first letter of each and uppercase it
  2. Reducing the resulting list of uppercase letters into a single string with a second lambda function

  from functools import reduce  # reduce lives in functools in Python 3

  def acronym(s):
      # map each word to its uppercased first letter, then concatenate
      chars_list = map(lambda x: x[0].upper(), s.split())
      return reduce(lambda x, y: x + y, chars_list)

Now imagine the case with billions of words in a string. It would take too long to do this on one processor, so the work would need to be parallelized. Each word would be shipped to a different processor within a managed cluster for the map step, and a reduce step would assemble all the capitalized first letters back into a single string. This is a great way to process very large datasets, but it is limited in the functions you can apply.
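To make that split-and-recombine idea concrete on a single machine, here is a minimal sketch using Python's multiprocessing module. The helper names (first_letter, acronym_parallel) and the process count are illustrative assumptions, not part of any MapReduce framework.

  from functools import reduce
  from multiprocessing import Pool

  def first_letter(word):
      # the "map" step: uppercase the first letter of a single word
      return word[0].upper()

  def acronym_parallel(s, processes=4):
      # ship the words to a pool of worker processes for the map step,
      # then reduce the letters back into one string in the main process
      with Pool(processes=processes) as pool:
          letters = pool.map(first_letter, s.split())
      return reduce(lambda x, y: x + y, letters)

  if __name__ == '__main__':
      print(acronym_parallel('galvanize data science bootcamp'))  # GDSB

The same pattern scales out once the workers live on different machines rather than different cores, which is exactly what a managed MapReduce cluster handles for you.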

Enter Apache Spark, a framework that extends the MapReduce model with support for machine learning algorithms, graph processing, and SQL-like queries.
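As a rough point of comparison, the acronym example above can be written against Spark's RDD API. This is only a local-mode sketch; the SparkContext configuration and app name are assumptions for illustration, not how our class clusters were set up.

  from pyspark import SparkContext

  sc = SparkContext('local[*]', 'acronym')

  # parallelize splits the words across partitions; the map step runs on
  # the executors, and collect() brings the letters back to the driver,
  # where joining them preserves the original word order
  words = sc.parallelize('galvanize data science bootcamp'.split())
  letters = words.map(lambda w: w[0].upper()).collect()
  print(''.join(letters))  # GDSB

  sc.stop()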

Initially we built clusters on our individual quad-core MacBooks, using tmux to multiplex the master and slave nodes. We wrote Spark SQL queries, joining tables and pulling data much as we would in PostgreSQL or SQLite. We then fired up 20 cores on AWS and ran similar queries. Apache Spark on AWS is a game changer for machine learning on large datasets, which is why it is such a highly sought-after skill.
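For flavor, a Spark SQL session along those lines might look something like the sketch below. The file name, column names, and the use of the SparkSession entry point are my assumptions for illustration, not the exact queries we ran in class.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName('spark_sql_demo').getOrCreate()

  # load a CSV into a DataFrame and register it as a temporary SQL table
  users = spark.read.csv('users.csv', header=True, inferSchema=True)
  users.createOrReplaceTempView('users')

  # query it much like a table in PostgreSQL or SQLite
  spark.sql("""
      SELECT city, COUNT(*) AS n_users
      FROM users
      GROUP BY city
      ORDER BY n_users DESC
  """).show()

  spark.stop()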

My takeaway for this week was the impressive benefit of parallelization. However, the cost is considerable. First, it takes a lot of technical knowledge to set up AWS and get all the modules and environments working properly (I was actually surprised how difficult this was, given that AWS and Apache Spark are meant to make big data analytics accessible to all). The other expense is the sheer abstraction when it comes to understanding errors. Running partitioned chunks of data (RDDs) across multiple cores on multiple executors, all controlled by a driver with which you interact, makes it nearly impossible to debug. Simple advice: only use Apache Spark on AWS if your data is too big to analyze on your MacBook.