Our Trail Map
This tutorial features an end-to-end data science & natural language processing pipeline, starting with raw data and running through preparing, modeling, visualizing, and analyzing the data. We'll touch on the following points:
A tour of the dataset
Introduction to text processing with spaCy
Automatic phrase modeling
Topic modeling with LDA
Visualizing topic models with pyLDAvis
Word vector models with word2vec
Visualizing word2vec with t-SNE
...and we might even learn a thing or two about Python along the way.
Let's get started!
The Yelp Dataset
The Yelp Dataset is published by the business review service Yelp for academic research and educational purposes. I really like it as a subject for machine learning and natural language processing demos, because it's big (but not so big that you need your own data center to process it), well-connected, and anyone can relate to it: it's largely about food, after all!
Note: If you'd like to execute this notebook interactively on your local machine, you'll need to download your own copy of the Yelp dataset. If you're reviewing a static copy of the notebook online, you can skip this step. Here's how to get the dataset:
Visit the Yelp dataset webpage
Click "Get the Data"
Please review, agree to, and respect Yelp's terms of use!
The dataset downloads as a compressed .tgz file; uncompress it
Place the uncompressed dataset files (yelp_academic_dataset_business.json, etc.) in a directory named yelp_dataset_challenge_academic_dataset
Place the yelp_dataset_challenge_academic_dataset directory inside the data directory of the Modern NLP in Python project folder
That's it! You're ready to go.
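If you'd rather script the uncompress-and-place steps than do them by hand, here's a minimal sketch using Python's standard tarfile module. The archive name yelp_dataset.tgz and the extract_dataset helper are hypothetical (use whatever name your download has), and the sketch assumes you run it from the project folder. It copies every regular file in the archive straight into the target directory, ignoring any folder structure inside the archive.

```python
import os
import shutil
import tarfile


def extract_dataset(archive_path, project_dir="."):
    """Copy every regular file in a .tgz archive into
    <project_dir>/data/yelp_dataset_challenge_academic_dataset."""
    target_dir = os.path.join(
        project_dir, "data", "yelp_dataset_challenge_academic_dataset"
    )
    os.makedirs(target_dir, exist_ok=True)

    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories, links, and other non-file entries
            # Keep only the file name, so nested folders in the archive are flattened.
            file_name = os.path.basename(member.name)
            destination = os.path.join(target_dir, file_name)
            with archive.extractfile(member) as src, open(destination, "wb") as dst:
                shutil.copyfileobj(src, dst)

    return target_dir


# Hypothetical archive name; replace it with the file you actually downloaded.
archive_path = "yelp_dataset.tgz"

if os.path.exists(archive_path):
    print("Extracted to:", extract_dataset(archive_path))
```

Flattening by file name keeps the .json files directly inside yelp_dataset_challenge_academic_dataset, which is the layout the steps above describe, even if the archive wraps them in a top-level folder.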
The current iteration of the Yelp dataset (as of this demo) consists of the following data:
552K users
77K businesses
2.2M user reviews
When focusing on restaurants alone, there are approximately 22K restaurants with approximately 1M user reviews written about them.
The data is provided in a handful of files in .json format. We'll be using the following files for our demo:
The files are UTF-8 text files with one JSON object per line, each corresponding to an individual data record. Let's take a look at a few examples.
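Because each line is a complete JSON object, we can peek at a few records without loading a whole file into memory. Here's a minimal sketch of that idea using only the standard library. The helper names iter_json_lines and head_records are made up for illustration, and the path assumes the directory layout from the setup steps, with the notebook running from the project folder.

```python
import json
import os
from itertools import islice


def iter_json_lines(lines):
    """Yield one parsed record per non-empty line of line-delimited JSON."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines, such as a trailing newline
            yield json.loads(line)


def head_records(path, n=3):
    """Return the first n records of a UTF-8 file with one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        # islice stops after n records, so the rest of the file is never read.
        return list(islice(iter_json_lines(f), n))


# Assumed location, based on the setup steps; adjust to your own layout.
business_path = os.path.join(
    "data",
    "yelp_dataset_challenge_academic_dataset",
    "yelp_academic_dataset_business.json",
)

if os.path.exists(business_path):
    for record in head_records(business_path, n=1):
        # Print the field names of the first business record.
        print(sorted(record.keys()))
```

Reading lazily like this matters for the larger files: a file holding millions of reviews can be streamed one record at a time instead of being parsed in one go.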