Recapping the first four weeks of my data science class with Data Science Dojo

About a month back, I shared that I had started a 16-week Data Science course with Data Science Dojo. In that post, I mentioned that I didn’t necessarily want to jump feet first into a new data science career. Instead, I wanted to tell better stories with data, understand different machine learning models, and add predictive analytics skills to my marketing toolbelt. Here’s an update on what I’ve learned in the first four weeks of the program. It’s been fun, challenging, and eye-opening.

Week 1 - Data exploration, visualization, and feature engineering
In week 1, we discussed the foundational elements of predictive analytics. Before getting to the sexy part of machine learning and predictive modeling, we learned that you must ensure your dataset is of high quality (garbage in, garbage out). We used R to explore, clean, and visualize our dataset so we could form hypotheses and communicate insights. I discovered that approximately 80% of your time goes into these janitorial-type tasks because even the best algorithms can’t save you from poor-quality data.

As a marketer, I realized that I should be asking more in-depth questions about my data (how it was collected and the quality and variety of the data). I should be particular when working with data scientists to ensure they fully understand the business problem we’re trying to solve.

Week 2 - Types of visualization techniques and feature engineering
We worked on an R coding exercise to visualize and explore two real-life datasets. One fascinating dataset included passenger information from the Titanic disaster, which we would later use to build a model that predicts who survived and who died in the sinking. With the Titanic data, we created box plots, density plots, and multi-dimensional plots (plus a few others) using an R library called ggplot2. Because you’re often working with vast datasets, plotting your data helps you interpret it more efficiently and spot trends.

I learned that it’s so important to question everything about your dataset. Clarify what each column (feature) heading means, and don’t make any initial assumptions about your data.

Week 3 - Predictive analytics, classification, and decision trees
What a fun week! After exploring, visualizing, and feature engineering our dataset in the first two weeks, we got to be hands-on and build a decision tree model for our Titanic dataset. This is the sexy 20% of the work!

I learned that decision trees can be a great choice when you’re solving a classification problem (predicting a label, e.g., Survived or Dead) and when your data is labeled and known (supervised learning). From there, we took 70% of the Titanic data and trained our decision tree model on it. To see how well we did, we took the other 30% and tested our model. It was sort of like an open-book test: since we already knew the real answers for the 30% of testing data, we could grade how well our decision tree predicted the correct outcomes.
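To make the 70/30 idea concrete, here is a minimal sketch of the workflow. The class did this in R; I'm using plain Python here only to keep the example self-contained. The rows are made-up toy data (not real Titanic records), the function names are my own, and the "model" is a one-question stump on a single feature rather than a full decision tree.

```python
import random

# Toy, made-up rows standing in for passenger records: (sex, survived).
# These are NOT real Titanic data; they only illustrate the workflow.
ROWS = (
    [("female", 1)] * 30 + [("female", 0)] * 10
    + [("male", 1)] * 10 + [("male", 0)] * 50
)

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows, then split it into training and testing sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def fit_stump(train):
    """Learn a one-question 'tree': for each sex, predict the majority label seen in training."""
    counts = {}
    for sex, survived in train:
        counts.setdefault(sex, [0, 0])[survived] += 1
    return {sex: (1 if c[1] > c[0] else 0) for sex, c in counts.items()}

def accuracy(model, test):
    """Share of held-out rows where the prediction matches the known answer."""
    correct = sum(1 for sex, survived in test if model.get(sex, 0) == survived)
    return correct / len(test)

train, test = train_test_split(ROWS)   # 70% to learn from, 30% held back
model = fit_stump(train)               # the model never sees the test rows
print(f"train rows: {len(train)}, test rows: {len(test)}")
print(f"accuracy on held-out rows: {accuracy(model, test):.2f}")
```

The key point is that the model is graded only on rows it never saw during training, which is what makes the score a fair estimate of how it would do on new data.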

Week 4 - Finding near-optimal solutions for your decision tree
Here we expanded on our work from the previous week. We discussed which attributes to test, how to determine the best splits, and when a decision tree should stop splitting. We learned how to maximize the information gain at each split by measuring how mixed the resulting groups are, using Gini impurity (how mixed the classes are within each group), entropy (the disorder or messiness of the data), and classification error. Luckily, R will calculate much of this for you without having to break out your old math books.
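For the curious, here is a rough sketch of what those three measures compute and how a split gets scored. Again, the class used R; this is plain Python for illustration, the function names are my own, and the labels are made up. I use "gain" loosely to mean the drop in impurity from the parent group to the weighted average of its child groups.

```python
from collections import Counter
from math import log2

def proportions(labels):
    """Fraction of the group that belongs to each class."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    """Gini impurity: 0 for a pure group, 0.5 for an even two-class mix."""
    return 1.0 - sum(p * p for p in proportions(labels))

def entropy(labels):
    """Entropy in bits: 0 for a pure group, 1 for an even two-class mix."""
    return sum(-p * log2(p) for p in proportions(labels))

def classification_error(labels):
    """Share of the group you get wrong by always guessing the majority class."""
    return 1.0 - max(proportions(labels))

def gain(parent, children, impurity):
    """Parent impurity minus the size-weighted impurity of the child groups."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted

# Made-up labels: a 50/50 parent group split into two mostly pure children.
parent = ["survived"] * 5 + ["died"] * 5
left = ["survived"] * 4 + ["died"] * 1
right = ["survived"] * 1 + ["died"] * 4

for name, measure in [("gini", gini), ("entropy", entropy),
                      ("error", classification_error)]:
    print(f"{name}: parent={measure(parent):.3f}, "
          f"gain={gain(parent, [left, right], measure):.3f}")
```

A tree-building routine would try many candidate splits like this and keep the one with the largest gain, stopping when the groups are pure enough or too small to split further.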

I am excited to share updates in month two as we reach the halfway point in the class. Please let me know if you have any questions about the course up to this point.  

Note: I chose Data Science Dojo over others like General Assembly after talking to other marketing professionals who had completed the course. They assured me that the coding in the class wouldn’t overwhelm me (my biggest concern). I found that Data Science Dojo gives you the foundation of the R programming language and supplies tons of resources and examples so you can cherry-pick your way through a problem. Also, Data Science Dojo was the only course I found that came with the backing of a university. After completing the class, I’ll receive 7 hours of continuing education from The University of New Mexico.

Bob Hazlett