Column: 机器学习研究会 (Machine Learning Research Society)

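机器学习研究会 (Machine Learning Research Society) is a student organization under the Peking University Center for Big Data and Machine Learning Innovation. It aims to build a platform where machine learning practitioners can exchange ideas. Besides sharing timely news from the field, the society hosts talks by industry leaders and prominent academics, salon-style sharing sessions with leading researchers, real-data innovation competitions, and other events.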

[Datasets] A Roundup of Commonly Used Datasets in Artificial Intelligence

机器学习研究会 · WeChat official account · AI · 2017-02-13 18:54

Main Text

Summary

Reposted from: 王威廉

It has never been easier to build AI or machine learning-based systems than it is today. The ubiquity of cutting-edge open-source tools such as TensorFlow, Torch, and Spark, coupled with the availability of massive amounts of computation power through AWS, Google Cloud, or other cloud providers, means that you can train cutting-edge models from your laptop over an afternoon coffee.

Though not at the forefront of the AI hype train, the unsung hero of the AI revolution is data — lots and lots of labeled and annotated data, curated with the elbow grease of great research groups and companies who recognize that the democratization of data is a necessary step towards accelerating AI.

However, most products involving machine learning or AI rely heavily on proprietary datasets that are often not released, because keeping the data private provides implicit defensibility.

With that said, it can be hard to sort through which public datasets are worth a look, which are viable for a proof of concept, and which can serve as a product or feature validation step before you collect your own proprietary data.

It’s important to remember that good performance on a dataset doesn’t guarantee that a machine learning system will perform well in real product scenarios. Most people in AI forget that the hardest part of building a new AI solution or product is not the AI or the algorithms; it’s the data collection and labeling. Standard datasets can be used for validation or as a good starting point for building a more tailored solution.
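To make the benchmark-versus-product gap concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration and none of it comes from the datasets listed below: the data is synthetic and one-dimensional, the "model" is a single threshold, and the helper names (`make_data`, `fit_threshold`, `accuracy`) are hypothetical. It only shows how a model tuned on one distribution can look strong on a matching test set and noticeably weaker once the inputs shift.

```python
import random

def make_data(n, pos_mean, neg_mean, seed):
    """Synthetic 1-D binary data: label 1 ~ N(pos_mean, 1), label 0 ~ N(neg_mean, 1)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        mean = pos_mean if label == 1 else neg_mean
        data.append((rng.gauss(mean, 1.0), label))
    return data

def fit_threshold(data):
    """Toy 'model': a decision threshold halfway between the two class means."""
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0

def accuracy(threshold, data):
    """Fraction of points where 'x > threshold' agrees with the label."""
    correct = sum((x > threshold) == (y == 1) for x, y in data)
    return correct / len(data)

# Stand-in for a public benchmark: classes centred at +2 and -2.
benchmark_train = make_data(2000, 2.0, -2.0, seed=0)
benchmark_test = make_data(2000, 2.0, -2.0, seed=1)

# Hypothetical "product" data: same task, but every input is shifted by +2.
product_data = make_data(2000, 4.0, 0.0, seed=2)

threshold = fit_threshold(benchmark_train)
benchmark_acc = accuracy(threshold, benchmark_test)
product_acc = accuracy(threshold, product_data)

print(f"benchmark accuracy: {benchmark_acc:.3f}")
print(f"product accuracy:   {product_acc:.3f}")
```

In this toy setup the threshold learned on the benchmark sits near zero, which is right where the shifted negative class is now centred, so roughly half of the negatives get misclassified. Real product data rarely shifts this cleanly, but the same pattern is why a standard dataset is better treated as a starting point than as proof that a system will hold up in production.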

This week, a few machine learning experts and I were talking about all this. To make your life easier, we’ve collected an (opinionated) list of some open datasets that you can’t afford not to know about in the AI world.

Computer Vision

MNIST
CIFAR-10 & CIFAR-100
ImageNet
LSUN
PASCAL VOC
SVHN
MS COCO
Visual Genome
Labeled Faces in the Wild

Natural Language

Text Classification Datasets
WikiText
Question Pairs
SQuAD
CMU Q/A Dataset
Maluuba Datasets
Billion Words
Common Crawl
bAbI
The Children’s Book Test
Stanford Sentiment Treebank
20 Newsgroups
Reuters
IMDB
UCI’s Spambase

Speech

Most speech recognition datasets are proprietary, since the data holds a lot of value for the company that curates it. Most of the datasets that are publicly available in the field are quite old.