[Recommended] (Python) Modern Natural Language Processing: A Worked Example Analyzing One Million Yelp Reviews

机器学习研究会 (Machine Learning Research Society)  · WeChat official account  · AI  · 2017-03-20 18:54


Summary

Reposted from: 爱可可-爱生活

Our Trail Map

This tutorial features an end-to-end data science & natural language processing pipeline, starting with raw data and running through preparing, modeling, visualizing, and analyzing the data. We'll touch on the following points:

  1. A tour of the dataset

  2. Introduction to text processing with spaCy

  3. Automatic phrase modeling

  4. Topic modeling with LDA

  5. Visualizing topic models with pyLDAvis

  6. Word vector models with word2vec

  7. Visualizing word2vec with t-SNE

...and we might even learn a thing or two about Python along the way.

Let's get started!

The Yelp Dataset

The Yelp Dataset is a dataset published by the business review service Yelp for academic research and educational purposes. I really like the Yelp dataset as a subject for machine learning and natural language processing demos, because it's big (but not so big that you need your own data center to process it), well-connected, and anyone can relate to it — it's largely about food, after all!

Note: If you'd like to execute this notebook interactively on your local machine, you'll need to download your own copy of the Yelp dataset. If you're reviewing a static copy of the notebook online, you can skip this step. Here's how to get the dataset:

  1. Visit the Yelp dataset webpage

  2. Click "Get the Data"

  3. Please review, agree to, and respect Yelp's terms of use!

  4. The dataset downloads as a compressed .tgz file; uncompress it

  5. Place the uncompressed dataset files (yelp_academic_dataset_business.json, etc.) in a directory named yelp_dataset_challenge_academic_dataset

  6. Place the yelp_dataset_challenge_academic_dataset directory within the data directory in the Modern NLP in Python project folder

That's it! You're ready to go.
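
As a quick sanity check on the layout described in the steps above, here is a minimal sketch that builds the expected file paths and reports anything not yet in place. The helper names `dataset_paths` and `missing_files` are hypothetical, and the sketch assumes it is run from the project root folder; only the directory and file names come from this tutorial.

```python
from pathlib import Path

# Directory and file names as given in this tutorial.
DATASET_DIR_NAME = "yelp_dataset_challenge_academic_dataset"
DATASET_FILES = (
    "yelp_academic_dataset_business.json",
    "yelp_academic_dataset_review.json",
)


def dataset_paths(project_root):
    """Return the expected path of each dataset file under <project_root>/data."""
    data_dir = Path(project_root) / "data" / DATASET_DIR_NAME
    return {name: data_dir / name for name in DATASET_FILES}


def missing_files(project_root):
    """List the dataset files that are not present at their expected path."""
    return [
        name
        for name, path in dataset_paths(project_root).items()
        if not path.is_file()
    ]


if __name__ == "__main__":
    # Assumption: the current working directory is the project root folder.
    missing = missing_files(".")
    print("All files in place." if not missing else f"Missing: {missing}")
```

If the script reports missing files, the most likely cause is that the uncompressed directory was placed next to `data` rather than inside it.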

The current iteration of the Yelp dataset (as of this demo) consists of the following data:

  • 552K users

  • 77K businesses

  • 2.2M user reviews

When focusing on restaurants alone, there are approximately 22K restaurants with approximately 1M user reviews written about them.
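
One way to arrive at a restaurant-only subset is to collect the IDs of businesses tagged as restaurants and then keep only the reviews that point at those IDs. The sketch below illustrates that idea on a few made-up records. The field names `business_id` and `categories`, the category label `"Restaurants"`, and both helper functions are assumptions for illustration, not something this excerpt specifies.

```python
# Made-up records; the field names and the "Restaurants" label are assumptions.
businesses = [
    {"business_id": "b1", "categories": ["Restaurants", "Italian"]},
    {"business_id": "b2", "categories": ["Automotive"]},
    {"business_id": "b3", "categories": ["Restaurants", "Bars"]},
]
reviews = [
    {"business_id": "b1", "text": "Great pasta."},
    {"business_id": "b2", "text": "Fast oil change."},
    {"business_id": "b3", "text": "Good cocktails, slow service."},
    {"business_id": "b1", "text": "Would come back."},
]


def restaurant_ids(business_records):
    """Collect the IDs of businesses tagged with the 'Restaurants' category."""
    return {
        b["business_id"]
        for b in business_records
        if "Restaurants" in (b.get("categories") or [])
    }


def restaurant_reviews(review_records, ids):
    """Keep only the reviews written about businesses whose ID is in `ids`."""
    return [r for r in review_records if r["business_id"] in ids]


ids = restaurant_ids(businesses)
kept = restaurant_reviews(reviews, ids)
print(f"{len(ids)} restaurants, {len(kept)} restaurant reviews")
# prints "2 restaurants, 3 restaurant reviews"
```

Holding the IDs in a set keeps each membership test constant-time, which matters once the review list grows to millions of records.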

The data is provided in a handful of files in .json format. We'll be using the following files for our demo:

  • yelp_academic_dataset_business.json — the records for individual businesses

  • yelp_academic_dataset_review.json — the records for reviews users wrote about businesses

The files are UTF-8 text files with one JSON object per line, each object corresponding to an individual data record. Let's take a look at a few examples.
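
Since each line is a complete JSON object, the files can be read one record at a time without loading everything into memory. The following is a minimal sketch of that pattern using only the standard library. It writes two made-up records to a temporary file so that it runs without the real dataset; the field names in the sample and the helper names `iter_records` and `write_sample` are illustrative assumptions.

```python
import json
import tempfile
from pathlib import Path

# Made-up records standing in for the real file; field names are illustrative.
SAMPLE_RECORDS = [
    {"business_id": "b1", "name": "Café Uno", "categories": ["Restaurants", "Cafes"]},
    {"business_id": "b2", "name": "Quick Lube", "categories": ["Automotive"]},
]


def write_sample(path):
    """Write the sample records as UTF-8 text, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in SAMPLE_RECORDS:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def iter_records(path):
    """Yield one dict per non-empty line of a one-object-per-line JSON file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample_business.json"
    write_sample(sample_path)
    records = list(iter_records(sample_path))

for record in records:
    print(record["name"], record["categories"])
```

For the real files, the same `iter_records` generator can be pointed at a dataset path and consumed lazily, for example with `itertools.islice` to peek at the first few records.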


Link:

http://nbviewer.jupyter.org/github/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb


Original post:

http://weibo.com/1402400261/EACamCado?ref=collection&type=comment#_rnd1489998367365
