Top 50 Open Source Datasets for AI Projects
Generic Datasets
Kaggle Datasets – https://www.kaggle.com/datasets
Google Dataset Search – https://datasetsearch.research.google.com
OpenML – https://www.openml.org
Awesome Public Datasets – https://github.com/awesomedata/awesome-public-datasets
UCI Machine Learning Repository – https://archive.ics.uci.edu/ml/index.php
Computer Vision Datasets
COCO (Common Objects in Context) – https://cocodataset.org
ImageNet – http://www.image-net.org
Open Images Dataset – https://storage.googleapis.com/openimages/web/index.html
CelebA (Face Attributes) – http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
LFW (Labeled Faces in the Wild) – http://vis-www.cs.umass.edu/lfw
NLP (Natural Language Processing) Datasets
Common Crawl – https://commoncrawl.org
WikiText – https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset
The Pile by EleutherAI – https://pile.eleuther.ai
OpenWebText – https://skylion007.github.io/OpenWebTextCorpus
Healthcare Datasets
MIMIC-III – https://physionet.org/content/mimiciii/1.4
NIH Chest X-ray Dataset – https://nihcc.app.box.com/v/ChestXray-NIHCC
LUNA16 – https://luna16.grand-challenge.org
ADNI (Alzheimer’s Data) – http://adni.loni.usc.edu
BCI Competition Datasets – http://www.bbci.de/competition
Finance and Economics Datasets
Quandl (now Nasdaq Data Link) – https://www.quandl.com
World Bank Open Data – https://data.worldbank.org
FRED – https://fred.stlouisfed.org
Yahoo Finance – https://finance.yahoo.com
Google Finance – https://www.google.com/finance
Geospatial and Satellite Data
Landsat Data – https://landsat.gsfc.nasa.gov/data
OpenStreetMap – https://www.openstreetmap.org
EarthData by NASA – https://earthdata.nasa.gov
Sentinel-2 by ESA – https://scihub.copernicus.eu
Planet Open Data – https://www.planet.com/open-data
Autonomous Driving Datasets
KITTI Dataset – http://www.cvlibs.net/datasets/kitti
Waymo Open Dataset – https://waymo.com/open
nuScenes – https://www.nuscenes.org
ApolloScape – http://apolloscape.auto
Cityscapes – https://www.cityscapes-dataset.com
Audio and Speech Datasets
LibriSpeech – https://www.openslr.org/12
VoxCeleb – https://www.robots.ox.ac.uk/~vgg/data/voxceleb
Mozilla Common Voice – https://commonvoice.mozilla.org
TED-LIUM – https://www.openslr.org/7
Google Speech Commands – https://www.tensorflow.org/datasets/catalog/speech_commands
Reinforcement Learning Environments and Benchmarks
OpenAI Gym – https://gym.openai.com
MuJoCo – https://mujoco.org
DeepMind Control Suite – https://www.deepmind.com/open-source/dm-control
RLBench – https://github.com/stepjam/RLBench
Meta-World – https://meta-world.github.io
Miscellaneous Datasets
20 Newsgroups – http://qwone.com/~jason/20Newsgroups
Amazon Reviews – https://nijianmo.github.io/amazon/index.html
MovieLens – https://grouplens.org/datasets/movielens
Hugging Face Datasets – https://huggingface.co/datasets
LAION-400M – https://laion.ai/blog/laion-400-open-dataset
Top 100 Open Source Datasets – https://medium.com/analytics-vidhya/top-100-open-source-datasets-for-data-science-cd5a8d67cc3d