Column: 机器学习研究会 (Machine Learning Research Society)
The Machine Learning Research Society is a student organization under the Big Data and Machine Learning Innovation Center at Peking University, built as a platform where machine learning practitioners can exchange ideas. Besides sharing timely news from the field, the society also hosts talks by industry and academic leaders, salon-style sharing sessions, real-data innovation competitions, and other events.

[Learning] Text Mining Stack Overflow Questions

机器学习研究会 · Official Account · AI · 2017-07-08 21:56

Main text



Summary
 

Reposted from: 网路冷眼

This week, my fellow Stack Overflow data scientist David Robinson and I are happy to announce the publication of our book Text Mining with R, published by O'Reilly. We are so excited to see this project out in the world, and so relieved to finally be finished with it! Text data is being generated all the time around us, in healthcare, finance, tech, and beyond; text mining allows us to transform that unstructured text data into real insight that can increase understanding and inform decision-making. In our book, we demonstrate how using tidy data principles can make text mining easier and more effective. Let's mark this happy occasion with an exploration of Stack Overflow text data, and show how the natural language processing techniques we cover in our book can be applied to real-world data to gain insight.

For this analysis, I’ll use Stack Overflow questions from StackSample, a dataset of text from 10% of Stack Overflow questions and answers on programming topics that is freely available on Kaggle. The code that I’m using in this post is available as a kernel on Kaggle, so you can fork it for your own exploration.
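As a rough sketch (not the author's exact kernel), loading the questions into R might look like the following; the Questions.csv file name and the Id/Title/Body columns are assumed from the Kaggle StackSample dataset layout, so adjust them to your own download.

    # Minimal sketch of loading the StackSample questions.
    # File and column names are assumed from the Kaggle dataset layout.
    library(readr)
    library(dplyr)

    questions <- read_csv("Questions.csv") %>%
      select(Id, Title, Body)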

This analysis focuses only on questions posted on Stack Overflow, and uses topic modeling to dig into the text.

What is topic modeling?

Topic modeling is a machine learning method for discovering “topics” that occur in a collection of documents. It is a powerful tool for organizing large collections of raw text. Topic modeling is an unsupervised method, which means that I as the analyst don’t decide ahead of time what the topics will be about; we can find topics within text even if we’re not sure what we’re looking for ahead of time. Topic modeling can be used to discover underlying structure within text. In the context of the kind of topic model I’ll implement (LDA topic modeling),

  • every document is a mixture of topics and

  • every topic is a mixture of words.

Documents can share topics, and topics can share words, in any proportions. In our case for this analysis, each Stack Overflow question is a document. Let’s imagine (for the sake of explanation) that there are two topics, one that is made up of the three words “table”, “select”, and “join” and a second that is made up of the three words “function”, “print”, and “return.” One question might be 100% topic 2, and another question might be 50% topic 1 and 50% topic 2. The statistical modeling process of topic modeling finds the topics in the text dataset we are dealing with, which words contribute to the topics, and which topics contribute to which documents.
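To make that imaginary example concrete, here is a toy illustration (hand-written numbers for explanation only, not estimated from any data) of the two quantities an LDA model works with, expressed as tidy data frames: per-topic word probabilities and per-document topic proportions.

    # Toy illustration of the two imaginary topics described above;
    # the probabilities are made up for explanation, not fit to data.
    library(tibble)

    toy_topic_words <- tribble(
      ~topic, ~term,      ~beta,
      1,      "table",    0.40,
      1,      "select",   0.35,
      1,      "join",     0.25,
      2,      "function", 0.45,
      2,      "print",    0.30,
      2,      "return",   0.25
    )

    toy_question_topics <- tribble(
      ~document,    ~topic, ~gamma,
      "question_A", 2,      1.00,  # 100% topic 2
      "question_B", 1,      0.50,  # 50% topic 1
      "question_B", 2,      0.50   # 50% topic 2
    )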

Modeling Stack Overflow questions

For this blog post, I fit a model with 12 topics to this dataset. The question of how to choose the number of topics in topic modeling is a complicated one, but in this case, 12 topics gives us a good result for exploration. The process of building this topic model also involves cleaning text, removing stop words, and building a document-term matrix, all considerations covered in our book.
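A condensed sketch of that pipeline using the tidytext and topicmodels packages might look like this; the questions data frame with Id and Body columns is the assumed input from the loading step above, and the additional text cleaning the author describes is omitted here.

    # Sketch only: tokenize, drop stop words, build a document-term
    # matrix, and fit a 12-topic LDA model.
    library(dplyr)
    library(tidytext)
    library(topicmodels)

    word_counts <- questions %>%
      unnest_tokens(word, Body) %>%
      anti_join(stop_words, by = "word") %>%
      count(Id, word, sort = TRUE)

    questions_dtm <- word_counts %>%
      cast_dtm(Id, word, n)

    questions_lda <- LDA(questions_dtm, k = 12, control = list(seed = 1234))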

One of the most compelling reasons to adopt tidy data principles when doing topic modeling is that we can easily explore which words contribute the most to which topics, and which topics contribute the most to which documents (questions on Stack Overflow, in this case). This is how we find out what kind of content corresponds to the topics fit by the model. Let’s look at that for these specific questions. Which words are most important for each topic, in this model with 12 topics?
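With a tidy approach, that exploration reduces to tidying the fitted model into per-topic word probabilities (beta) and per-document topic proportions (gamma); a sketch, continuing from the questions_lda object assumed above:

    # Sketch: explore the fitted model with tidytext's tidy() methods.
    library(dplyr)
    library(tidytext)

    # Which words are most important for each topic (beta)?
    top_terms <- tidy(questions_lda, matrix = "beta") %>%
      group_by(topic) %>%
      slice_max(beta, n = 10) %>%
      ungroup() %>%
      arrange(topic, -beta)

    # Which topics contribute most to each question (gamma)?
    question_gamma <- tidy(questions_lda, matrix = "gamma") %>%
      arrange(document, -gamma)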


Link:

https://stackoverflow.blog/2017/07/06/text-mining-stack-overflow-questions/


Original link:

https://m.weibo.cn/1715118170/4126965948253945
