It has never been easier to build AI or machine learning-based systems than it is today. The ubiquity of cutting-edge open-source tools such as TensorFlow, Torch, and Spark, coupled with the availability of massive amounts of computation power through AWS, Google Cloud, or other cloud providers, means that you can train cutting-edge models from your laptop over an afternoon coffee.
Though not at the forefront of the AI hype train, the unsung hero of the AI revolution is data: lots and lots of labeled and annotated data, curated with the elbow grease of great research groups and companies who recognize that the democratization of data is a necessary step towards accelerating AI.
However, most products involving machine learning or AI rely heavily on proprietary datasets that are often not released, as this provides implicit defensibility.
With that said, it can be hard to sort through which public datasets are useful to look at, which are viable for a proof of concept, and which can serve as a product or feature validation step before you collect your own proprietary data.
It’s important to remember that good performance on a dataset doesn’t guarantee a machine learning system will perform well in real product scenarios.
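As a rough illustration of that gap, here is a toy sketch in plain Python. Everything in it is an assumption made for the example: the two synthetic Gaussian classes, the hypothetical helpers `make_split`, `fit_threshold`, and `accuracy`, and the constant `shift` that stands in for whatever differs between a benchmark and real product data. It is not drawn from any of the datasets listed in this post.

```python
import random


def make_split(rng, n_per_class, shift=0.0):
    """Return (feature, label) pairs for two synthetic 1-D Gaussian classes.

    `shift` is a hypothetical stand-in for a change in data distribution,
    such as a different sensor or user population in production.
    """
    data = []
    for label, mean in ((0, 0.0), (1, 4.0)):
        for _ in range(n_per_class):
            data.append((rng.gauss(mean, 1.0) + shift, label))
    return data


def fit_threshold(train):
    """Place the decision threshold midway between the two class means."""
    means = []
    for label in (0, 1):
        values = [x for x, y in train if y == label]
        means.append(sum(values) / len(values))
    return (means[0] + means[1]) / 2.0


def accuracy(threshold, data):
    """Fraction of points where 'x above threshold' matches label 1."""
    correct = sum(1 for x, y in data if (x > threshold) == (y == 1))
    return correct / len(data)


rng = random.Random(0)
train = make_split(rng, 1000)
benchmark_test = make_split(rng, 1000)            # same distribution as training
product_data = make_split(rng, 1000, shift=3.0)   # shifted "real world" data

threshold = fit_threshold(train)
benchmark_acc = accuracy(threshold, benchmark_test)
product_acc = accuracy(threshold, product_data)

print(f"benchmark accuracy: {benchmark_acc:.2f}")
print(f"product accuracy:   {product_acc:.2f}")
```

The classifier scores well on the held-out benchmark split but noticeably worse on the shifted data, even though nothing about the model changed. Real distribution shift is rarely a clean constant offset, so treat this only as a picture of the effect.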
Most people in AI forget that the hardest part of building a new AI solution or product is not the AI or algorithms; it’s the data collection and labeling. Standard datasets can be used as validation or a good starting point for building a more tailored solution.
This week, a few machine learning experts and I were talking about all this. To make your life easier, we’ve collected an (opinionated) list of some open datasets that you can’t afford not to know about in the AI world.
Computer Vision
- MNIST
- CIFAR 10 & CIFAR 100
- ImageNet
- LSUN
- PASCAL VOC
- SVHN
- MS COCO
- Visual Genome
- Labeled Faces in the Wild
Natural Language
- Text Classification Datasets
- WikiText
- Question Pairs
- SQuAD
- CMU Q/A Dataset
- Maluuba Datasets
- Billion Words
- Common Crawl
- bAbI
- The Children’s Book Test
- Stanford Sentiment Treebank
- 20 Newsgroups
- Reuters
- IMDB
- UCI’s Spambase
Speech
Most speech recognition datasets are proprietary, since the data holds a lot of value for the company that curates it. Most datasets available in the field are quite old.