This week’s focus was Natural Language Processing and Neural Networks. These are big topics, and the hours were the toughest we’ve seen to date.
Beginning with neural networks, we built models to differentiate cats from dogs, and forests from beaches. Image featurization was critical: colors mattered much more for beaches vs. forests, while edge detection worked much better for cats vs. dogs. The instructors put together benchmark prediction accuracies with non-neural-net models and challenged us to beat their scores. It was a cool challenge, and I think we all managed to do it, which really emphasized the power of these models for image analysis.
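To make the featurization point concrete, here is a minimal sketch of the two feature types mentioned above, in plain NumPy. The image here is hypothetical random noise just to show the shapes; the real coursework used actual photos, and the exact featurizers we used may have differed.

```python
import numpy as np

def color_histogram(img, bins=8):
    """Concatenated per-channel histograms: captures the color palette,
    which is what separates beaches from forests."""
    feats = []
    for c in range(img.shape[2]):
        hist, _ = np.histogram(img[:, :, c], bins=bins, range=(0, 256))
        feats.append(hist / hist.sum())  # normalize so image size doesn't matter
    return np.concatenate(feats)

def edge_density(img):
    """Mean gradient magnitude on a grayscale copy: a crude edge feature,
    the kind of shape/texture cue that helps with cats vs. dogs."""
    gray = img.mean(axis=2)
    gy, gx = np.gradient(gray)
    return np.array([np.hypot(gx, gy).mean()])

# Hypothetical 32x32 RGB "image" of random noise, just to exercise the code.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3)).astype(float)
print(color_histogram(img).shape)  # (24,) -> 8 bins x 3 channels
print(edge_density(img).shape)     # (1,)
```

Either feature vector can then be fed to any classifier, neural net or not, which is what made the benchmark comparison possible.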
We then moved into natural language processing, but getting the full value out of it required some serious upgrades to our data pipelines and infrastructure knowledge. We began by teaching ourselves MongoDB, a schemaless database, to hold whatever data we scrape from the web. We then learned web scraping, and by the end of the day we had pulled more than 1,000 NYTimes articles into our own database.
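The pipeline is simple in outline: fetch a page, pull out the article text, and insert it as a free-form document. Here is a dependency-free sketch using the standard library's HTML parser; the `html` string, URL, and field names are made up for illustration (the real thing would fetch live pages and, with a running server, hand each dict to pymongo's `insert_one`).

```python
from html.parser import HTMLParser

class ArticleTextParser(HTMLParser):
    """Minimal scraper sketch: collects the text inside <p> tags."""
    def __init__(self):
        super().__init__()
        self._in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p and data.strip():
            self.paragraphs.append(data.strip())

# Stand-in HTML; in practice this comes from an HTTP request.
html = "<html><body><h1>Headline</h1><p>First paragraph.</p><p>Second.</p></body></html>"
parser = ArticleTextParser()
parser.feed(html)

# The schemaless part: each article is just a dict ("document"), no table
# schema to design up front. With MongoDB this would be collection.insert_one(doc).
doc = {"url": "http://example.com/article", "body": " ".join(parser.paragraphs)}
print(doc["body"])  # First paragraph. Second.
```

The appeal of a schemaless store here is that scraped pages are messy: missing fields or extra fields just become different-shaped documents instead of migration headaches.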
This accomplishment was a holy shit moment for me. I can honestly say it was the most exciting day of the course thus far, and I most certainly want to get better at scraping through my capstone.
Now that we had a solid text dataset, we learned unsupervised clustering techniques like k-means, along with the featurization techniques they require, which are difficult and complicated for text: every word in the vocabulary becomes a feature, so the datasets get absurdly wide. Thus we needed techniques to simplify them.
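The standard recipe looks roughly like this: TF-IDF turns each document into a (very wide) vector, then k-means groups the vectors. The four-document toy corpus below is invented to stand in for the ~1,000 scraped articles.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy stand-in corpus (the real data was ~1,000 scraped NYTimes articles).
docs = [
    "the senate passed the budget bill",
    "congress debates the new tax bill",
    "the team won the championship game",
    "star player scores in the final game",
]

# TF-IDF featurization: one column per vocabulary term, so real corpora
# produce matrices with tens of thousands of features.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

# Unsupervised clustering with k-means (k=2: politics vs. sports here).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # the two politics docs share one label, the two sports docs the other
```

Note that k-means never sees any topic labels; the grouping falls out of the word-usage overlap alone, which is what "unsupervised" buys you.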
Enter dimensionality reduction via Principal Component Analysis (PCA). This is an eigenvector problem: you rotate your data matrix onto a new set of axes with a reduced number of features that are linearly independent, yet often still describe >90% of the variance. This can be very important when memory is a concern, though interpretability is often limited, because meaning is extracted from the rotated matrix rather than the original features.
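Here is that eigenvector framing worked out directly in NumPy, on synthetic data I constructed so that 5 features really only contain 2 directions of variation (plus small noise). In practice you would reach for `sklearn.decomposition.PCA`, but the decomposition below is the same idea spelled out.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data: 200 samples, 5 features, but only 2 latent directions
# of variation, so 2 principal components should capture nearly everything.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

# PCA as an eigenvector problem on the covariance matrix.
Xc = X - X.mean(axis=0)                             # center the data
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)              # eigh: ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # largest variance first

explained = eigvals / eigvals.sum()
print(explained[:2].sum())  # close to 1.0 here: 2 components cover ~all the variance

# "Rotate" onto the top-2 axes: 5 features -> 2 linearly independent ones.
X_reduced = Xc @ eigvecs[:, :2]
print(X_reduced.shape)  # (200, 2)
```

The interpretability caveat shows up in `eigvecs`: each new axis is a weighted blend of all the original features, so "component 1 is large" no longer maps to any single original variable.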
We ended the week by discussing project ideas. We walked through a few examples from previous cohorts, and the projects are pretty amazing. Check out these incredible and inspiring ideas from previous Galvanize fellows: