Introduction
Before we start, have a look at the examples below.
You open Google and search for a news article on the ongoing Champions Trophy, and you get hundreds of relevant search results in return.
Nate Silver analysed millions of tweets and correctly predicted the results of 49 out of 50 states in the 2008 U.S. Presidential Elections.
You type a sentence in English into Google Translate and get an equivalent Chinese translation.
So what do the above examples have in common?
You possibly guessed it right – TEXT processing. All three scenarios deal with huge amounts of text to perform different tasks: clustering in the Google search example, classification in the second, and machine translation in the third.
Humans can deal with text quite intuitively, but given that millions of documents are generated every single day, we cannot have humans performing the above three tasks. It is neither scalable nor effective.
So, how do we make today's computers perform clustering, classification, etc. on text data, when we know that they are generally inefficient at handling and processing strings or text for any fruitful output?
Sure, a computer can match two strings and tell you whether they are the same or not. But how do we make a computer tell you about football or Ronaldo when you search for Messi? How do you make a computer understand that “Apple” in “Apple is a tasty fruit” is a fruit that can be eaten and not a company?
The answer to the above questions lies in creating a representation for words that captures their meanings, their semantic relationships, and the different types of contexts they are used in.
All of this is achieved using Word Embeddings – numerical representations of text that computers can handle.
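To get an intuition before the formal treatment, here is a minimal sketch of the idea, assuming the gensim library and its downloadable "glove-wiki-gigaword-50" pre-trained vectors are available (the specific model name and words queried are illustrative assumptions, not part of the original article):

```python
# Sketch: numerical word representations capture relatedness that plain
# string matching cannot. Assumes gensim and internet access to download
# the pre-trained GloVe vectors bundled with gensim's downloader.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # pre-trained 50-d GloVe word vectors

# "messi" and "ronaldo" share almost no characters, yet their vectors are
# close because both words appear in similar (football) contexts.
print(vectors.similarity("messi", "ronaldo"))    # relatively high cosine similarity
print(vectors.similarity("messi", "microwave"))  # much lower similarity

# Nearest neighbours of "messi" tend to be other footballers and football terms.
print(vectors.most_similar("messi", topn=5))
```

This is exactly the kind of behaviour that lets a search engine surface football or Ronaldo when you search for Messi.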
Below, we will see formally what Word Embeddings are, look at their different types, and see how we can actually implement them to perform tasks like returning efficient Google search results.
Table of Contents
What are Word Embeddings?
Different types of Word Embeddings
2.1 Frequency based Embedding
2.1.1 Count Vectors
2.1.2 TF-IDF
2.1.3 Co-Occurrence Matrix
2.2 Prediction based Embedding
2.2.1 CBOW
2.2.2 Skip-Gram
Word Embedding use case scenarios (what can be done using word embeddings? e.g. similarity, odd one out, etc.)
Using pre-trained Word Vectors
Training your own Word Vectors
End Notes
Link:
https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
Original post:
http://m.weibo.cn/1402400261/4115731496708803