Our Trail Map
This tutorial features an end-to-end data science & natural language processing pipeline, starting with raw data and running through preparing, modeling, visualizing, and analyzing the data. We'll touch on the following points:
- A tour of the dataset
- Introduction to text processing with spaCy
- Automatic phrase modeling
- Topic modeling with LDA
- Visualizing topic models with pyLDAvis
- Word vector models with word2vec
- Visualizing word2vec with t-SNE
...and we might even learn a thing or two about Python along the way.
Let's get started!
The Yelp Dataset
The Yelp Dataset is published by the business review service Yelp for academic research and educational purposes. I really like the Yelp dataset as a subject for machine learning and natural language processing demos, because it's big (but not so big that you need your own data center to process it), well-connected, and anyone can relate to it; it's largely about food, after all!
Note: If you'd like to execute this notebook interactively on your local machine, you'll need to download your own copy of the Yelp dataset. If you're reviewing a static copy of the notebook online, you can skip this step. Here's how to get the dataset:
- Please visit the Yelp dataset webpage here
- Click "Get the Data"
- Please review, agree to, and respect Yelp's terms of use!
- The dataset downloads as a compressed .tgz file; uncompress it
- Place the uncompressed dataset files (`yelp_academic_dataset_business.json`, etc.) in a directory named `yelp_dataset_challenge_academic_dataset`
- Place the `yelp_dataset_challenge_academic_dataset` directory within the `data` directory in the Modern NLP in Python project folder
That's it! You're ready to go.
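Before moving on, it can be worth a quick sanity check that the files landed where the rest of the tutorial expects them. Here's a minimal sketch, assuming the notebook is run from the project folder and uses the directory layout described above; the variable names and the relative `data` path are illustrative, not part of the dataset itself:

```python
import json
import os

# assumed layout, matching the setup steps above:
#   <project folder>/data/yelp_dataset_challenge_academic_dataset/
data_directory = os.path.join('data',
                              'yelp_dataset_challenge_academic_dataset')

businesses_filepath = os.path.join(data_directory,
                                   'yelp_academic_dataset_business.json')

# each line of the file is a self-contained JSON record;
# peek at the first business to confirm the file is where we expect it
with open(businesses_filepath, encoding='utf_8') as f:
    first_business_record = json.loads(f.readline())

print(first_business_record.keys())
```

If the `open()` call raises a `FileNotFoundError`, double-check where you unpacked the .tgz file and the spelling of the directory names.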
The current iteration of the Yelp dataset (as of this demo) consists of the following data:
- 552K users
- 77K businesses
- 2.2M