| Resource | Name | Category | Summary | 
| SLR1 | Yesno | Speech | Sixty recordings of one individual saying yes or no in Hebrew; each recording is eight words long. | 
| SLR2 | OpenFST | Software | A mirror of the OpenFst toolkit | 
| SLR3 | sph2pipe | Software | A mirror of the sph2pipe software | 
| SLR4 | sctk | Software | A mirror of the sctk scoring software | 
| SLR5 | MSU Switchboard transcripts | Text | A mirror of the Mississippi State transcripts and lexicon for Switchboard. | 
| SLR6 | Vystadial | Speech | English and Czech data, mirrored from the Vystadial project | 
| SLR7 | TED-LIUM | Speech | English speech recognition training corpus from TED talks, created by Laboratoire d’Informatique de l’Université du Maine (LIUM) (mirrored here) | 
| SLR8 | Sprakbanken | Text | Danish pronunciation dictionary generated using eSpeak | 
| SLR9 | The AMI pack | Text | Some auxiliary non-speech data used to build AMI systems with Kaldi | 
| SLR10 | SRE Data | Misc | Various files from SRE data that NIST used to host online | 
| SLR11 | LibriSpeech language models, vocabulary and G2P models | Text | Language modelling resources, for use with the LibriSpeech ASR corpus | 
| SLR12 | LibriSpeech ASR corpus | Speech | Large-scale (1000 hours) corpus of read English speech | 
| SLR13 | RWCP Sound Scene Database | Speech + Software | A database of recordings of real-world sounds and measured room impulse responses | 
| SLR14 | BEEP Dictionary | Text | Phonemic transcriptions of over 250,000 English words. (British English pronunciations) | 
| SLR15 | SRE Speaker List | Misc | A list linking speakers across NIST SRE corpora | 
| SLR16 | The AMI Corpus | Speech | Acoustic speech data and meta-data from The AMI corpus. | 
| SLR17 | MUSAN | Audio | A corpus of music, speech, and noise | 
| SLR18 | THCHS-30 | Speech | A Free Chinese Speech Corpus Released by CSLT@Tsinghua University | 
| SLR19 | TED-LIUMv2 | Audio | TED-LIUM corpus release 2, English speech recognition training corpus from TED talks, created by Laboratoire d’Informatique de l’Université du Maine (LIUM) (mirrored here) | 
| SLR20 | Aachen Impulse Response Database | Audio | Aachen Impulse Response database (AIR): a database of room impulse responses (mirrored here) | 
| SLR21 | Spanish Word list | Text | A list of words in Spanish with frequency derived from a large corpus (Spanish Gigaword). | 
| SLR22 | THUYG-20 | Speech | A free Uyghur speech database Released by CSLT@Tsinghua University & Xinjiang University | 
| SLR23 | NIST LRE 2007 Key | Misc | A file containing metadata for the utterances in the LRE 2007 evaluation | 
| SLR24 | Iban | Speech | Iban language text and speech corpora for ASR | 
| SLR25 | ALFFA (African Languages in the Field: speech Fundamentals and Automation) | Speech | Amharic, Swahili and Wolof data, mirrored from the ALFFA git repository | 
| SLR26 | Simulated Room Impulse Response Database | Audio | A database of simulated room impulse responses | 
| SLR27 | Cantab-TEDLIUM Release 1.1 (February 2015) | Text | Cantab Research Language models for the TEDLIUM database | 
| SLR28 | Room Impulse Response and Noise Database | Audio | A database of simulated and real room impulse responses, isotropic and point-source noises. The audio files in this data are all in 16k sampling rate and 16-bit precision. | 
| SLR29 | Sprakbanken_Swe | Text | Swedish pronunciation dictionary | 
| SLR30 | Sinhala TTS | Speech | Sinhalese multi-speaker TTS corpora | 
| SLR31 | Mini LibriSpeech ASR corpus | Speech | Subset of LibriSpeech corpus for purpose of regression testing | 
| SLR32 | High quality TTS data for four South African languages (af, st, tn, xh) | Speech | Multi-speaker TTS data for four South African languages, Afrikaans, Sesotho, Setswana and isiXhosa. | 
| SLR33 | Aishell | Speech | Mandarin data, provided by Beijing Shell Shell Technology Co.,Ltd | 
| SLR34 | Santiago Spanish Lexicon | Text | A pronouncing dictionary for the Spanish language. | 
| SLR35 | Large Javanese ASR training data set | Speech | Javanese ASR training data set containing ~185K utterances. | 
| SLR36 | Large Sundanese ASR training data set | Speech | Sundanese ASR training data set containing ~220K utterances. | 
| SLR37 | High quality TTS data for Bengali languages | Speech | Multi-speaker TTS data for Bangladesh Bengali (bn-BD) and Indian Bengali (bn-IN). | 
| SLR38 | Free ST Chinese Mandarin Corpus | Speech | A free Chinese Mandarin corpus by Surfingtech (www.surfing.ai), containing 102600 utterances from 855 speakers. | 
| SLR39 | Heroico | Speech | Spanish data, mirrored from the LDC | 
| SLR40 | Zeroth-Korean | Speech Corpus for Automatic Speech Recognition | Korean Open-source Speech Corpus for Speech Recognition by Zeroth Project (https://github.com/goodatlas/zeroth) | 
| SLR41 | High quality TTS data for Javanese. | Speech | Multi-speaker TTS data for Javanese (jv-ID) | 
| SLR42 | High quality TTS data for Khmer. | Speech | Multi-speaker TTS data for Khmer (km-KH) | 
| SLR43 | High quality TTS data for Nepali. | Speech | Multi-speaker TTS data for Nepali (ne-NP) | 
| SLR44 | High quality TTS data for Sundanese. | Speech | Multi-speaker TTS data for Sundanese (su-ID) | 
| SLR45 | Free ST American English Corpus | Speech | A free American English corpus by Surfingtech (www.surfing.ai), containing utterances from 10 speakers, each with about 350 utterances. | 
| SLR46 | Tunisian_MSA | Speech | Tunisian Modern Standard Arabic | 
| SLR47 | Primewords Chinese Corpus Set 1 | Speech | Chinese Mandarin corpus released by Shanghai Primewords Co. Ltd. (www.primewords.cn), containing 100 hours of speech data. | 
| SLR48 | MADCAT Arabic data splits | Other | Unofficial data splits (dev/train/test) for the MADCAT Arabic LDC corpus | 
| SLR49 | VoxCeleb Data | Misc | Various files for the VoxCeleb datasets | 
| SLR50 | MADCAT Chinese data splits | Other | Unofficial data splits (dev/train/test) for the MADCAT Chinese LDC corpus | 
| SLR51 | TED-LIUM Release 3 | Speech | TED-LIUM corpus release 3 | 
| SLR52 | Large Sinhala ASR training data set | Speech | Sinhala ASR training data set containing ~185K utterances. | 
| SLR53 | Large Bengali ASR training data set | Speech | Bengali ASR training data set containing ~196K utterances. | 
| SLR54 | Large Nepali ASR training data set | Speech | Nepali ASR training data set containing ~157K utterances. | 
| SLR55 | CLMAD | Text | A Chinese Language Model Adaptation Dataset (CLMAD). | 
| SLR56 | IAM Aachen splits | Other | Aachen data splits (train/test/val) for the IAM dataset. | 
| SLR57 | African Accented French | Speech | Recordings of African Accented French speech. | 
| SLR58 | Pansori-TEDxKR | Speech | Korean speech corpus generated from Korean language TEDx talks | 
| SLR59 | ParlamentParla | Speech | Catalan speech corpus generated from Catalan Parliamentary sessions | 
| SLR60 | LibriTTS corpus | Speech | Large-scale corpus of English speech derived from the original materials of the LibriSpeech corpus | 
| SLR61 | Crowdsourced high-quality Argentinian Spanish speech data set. | Speech | Data set which contains 5739 recordings of native speakers of Spanish | 
| SLR62 | aidatatang_200zh | Speech | A Chinese Mandarin speech corpus by Beijing DataTang Technology Co., Ltd, containing 200 hours of speech data from 600 speakers. The transcription accuracy for each sentence is higher than 98%. | 
| SLR63 | Crowdsourced high-quality Malayalam multi-speaker speech data set. | Speech | Data set which contains recordings of native speakers of Malayalam. | 
| SLR64 | Crowdsourced high-quality Marathi multi-speaker speech data set. | Speech | Data set which contains recordings of native speakers of Marathi | 
| SLR65 | Crowdsourced high-quality Tamil multi-speaker speech data set. | Speech | Data set which contains recordings of native speakers of Tamil. | 
| SLR66 | Crowdsourced high-quality Telugu multi-speaker speech data set. | Speech | Data set which contains recordings of native speakers of Telugu. | 
| SLR67 | TEDx Spanish Corpus | Speech | Spanish data taken from the TEDx Talks | 
| SLR68 | MAGICDATA Mandarin Chinese Read Speech Corpus | Speech | A corpus by Magic Data Technology Co., Ltd., containing 755 hours of scripted read speech data from 1080 native speakers of the Mandarin Chinese spoken in mainland China. The sentence transcription accuracy is higher than 98%. | 
| SLR69 | Crowdsourced high-quality Catalan speech data set. | Speech | Data set which contains recordings of Catalan. | 
| SLR70 | Crowdsourced high-quality Nigerian English speech data set. | Speech | Data set which contains recordings of Nigerian English. | 
| SLR71 | Crowdsourced high-quality Chilean Spanish speech data set. | Speech | Data set which contains recordings of Chilean Spanish. | 
| SLR72 | Crowdsourced high-quality Colombian Spanish speech data set. | Speech | Data set which contains recordings of Colombian Spanish. | 
| SLR73 | Crowdsourced high-quality Peruvian Spanish speech data set. | Speech | Data set which contains recordings of Peruvian Spanish. | 
| SLR74 | Crowdsourced high-quality Puerto Rico Spanish speech data set. | Speech | Data set which contains recordings of Puerto Rico Spanish. | 
| SLR75 | Crowdsourced high-quality Venezuelan Spanish speech data set. | Speech | Data set which contains recordings of Venezuelan Spanish. | 
| SLR76 | Crowdsourced high-quality Basque speech data set. | Speech | Data set which contains recordings of Basque. | 
| SLR77 | Crowdsourced high-quality Galician speech data set. | Speech | Data set which contains recordings of Galician. | 
| SLR78 | Crowdsourced high-quality Gujarati multi-speaker speech data set. | Speech | Data set which contains recordings of native speakers of Gujarati. | 
| SLR79 | Crowdsourced high-quality Kannada multi-speaker speech data set. | Speech | Data set which contains recordings of native speakers of Kannada. | 
| SLR80 | Crowdsourced high-quality Burmese speech data set. | Speech | Data set which contains recordings of Burmese. | 
| SLR81 | Small Audio Clips | Speech | Contains 20 one-second audio clips from various sources, for testing compression algorithms | 
| SLR82 | CN-Celeb | Speech | A Free Chinese Speaker Recognition Corpus Released by CSLT@Tsinghua University | 
| SLR83 | Crowdsourced high-quality UK and Ireland English Dialect speech data set. | Speech | Data set which contains male and female recordings of English from various dialects of the UK and Ireland. | 
| SLR84 | ScribbleLens | Handwriting | Dutch cursive, 16-18th century handwritings, pages and lines, for (un)supervised AI and other research. | 
| SLR85 | HI-MIA | Speech | A far-field text-dependent speaker verification database for AISHELL Speaker Verification Challenge 2019 | 
| SLR86 | Crowdsourced high-quality Yoruba speech data set. | Speech | Data set which contains recordings of Yoruba. | 
| SLR87 | MobvoiHotwords | Speech | Chinese hotwords detection dataset, provided by Mobvoi CO.,LTD | 
| SLR88 | Att-HACK | Speech | French Expressive Speech Database with Social Attitudes | 
| SLR89 | Yoloxóchitl-Mixtec | Speech | Yoloxóchitl Mixtec Speech with Transcription | 
| SLR92 | Puebla-Nahuatl | Speech | Puebla Nahuatl Speech with Transcription | 
| SLR93 | AISHELL-3 | Speech | Mandarin data, provided by Beijing Shell Shell Technology Co., Ltd. | 
| SLR94 | Multilingual LibriSpeech (MLS) | Speech | A large multilingual corpus derived from LibriVox audiobooks | 
| SLR95 | Thorsten Müller (German Neutral-TTS dataset) | Speech | Free single-speaker German dataset (> 23 hours) by Thorsten Müller (voice) and Dominik Kreutz (audio optimization) for TTS training | 
| SLR96 | Russian LibriSpeech (RuLS) | Speech | This dataset is based on LibriVox audiobooks | 
| SLR97 | Deeply Korean read speech corpus | Speech | Pairs of Korean reading the scripts with 3 text sentiments using 3 vocal sentiments. Recorded in 3 types of places, at 3 distinct distances, with 2 types of smartphone. | 
| SLR98 | Deeply parent-child vocal interaction dataset | Speech | The interaction of pairs of parent and child (reading fairy tales, singing children’s songs, conversing, and others). Recorded in 3 types of places, at 3 distinct distances, with 2 types of smartphone. | 
| SLR99 | Deeply Nonverbal Vocalization Dataset | Audio | A human nonverbal vocal sound dataset by Deeply Inc. | 
| SLR100 | Multilingual TEDx | Speech | A multilingual corpus of TEDx talks for speech recognition and translation | 
| SLR101 | speechocean762 | Speech | Pronunciation scoring dataset, labeled independently by five human experts | 
| SLR102 | Kazakh Speech Corpus (KSC) | Speech | A crowdsourced open-source Kazakh speech corpus developed by ISSAI (330 hours) | 
| SLR103 | Multilingual and code-switching ASR Challenge Dataset - sub-task1 | Speech | Datasets for sub-task1 in Multilingual and code-switching ASR challenges for low resource Indian languages - MUCS 2021 (https://navana-tech.github.io/MUCS2021/) | 
| SLR104 | Multilingual and code-switching ASR Challenge Dataset - sub-task2 | Speech | Datasets for sub-task2 in Multilingual and code-switching ASR challenges for low resource Indian languages - MUCS 2021 (https://navana-tech.github.io/MUCS2021/) | 
| SLR105 | nicolingua-0003-west-african-radio-corpus | Speech | West African Radio Corpus | 
| SLR106 | nicolingua-0004-west-african-va-asr-corpus | Speech | West African Virtual Assistant Speech Recognition Corpus | 
| SLR107 | Totonac Resources | Speech | Totonac Speech with Transcription | 
| SLR108 | MediaSpeech | Speech | French, Arabic, Turkish and Spanish media speech datasets | 
| SLR109 | Hi-Fi Multi-Speaker English TTS Dataset (Hi-Fi TTS) | Speech | A multi-speaker English dataset for training text-to-speech models | 
| SLR110 | Thorsten Müller (German Emotional-TTS dataset) | Speech | Free emotional single-speaker German dataset (Neutral, Disgusted, Angry, Amused, Surprised, Sleepy, Drunk, Whispering) by Thorsten Müller (voice) and Dominik Kreutz (audio optimization) for TTS training | 
| SLR111 | AISHELL-4 | Speech | A Free Mandarin Multi-channel Meeting Speech Corpus, provided by Beijing Shell Shell Technology Co.,Ltd | 
| SLR112 | Samromur 21.05 | Speech | Samrómur Icelandic Speech corpus approved for release in May 2021 | 
| SLR113 | SEOUL CORPUS | Speech | The Korean Corpus of Spontaneous Speech (aka, Seoul Corpus), created from the NRF(Korea)-funded project | 
| SLR114 | Golos | Speech | Russian ASR dataset (1240 hours) with trained acoustic and language models | 
| SLR115 | EmoV_DB | Speech | A database of emotional speech intended to be open-sourced and used for synthesis and generation purposes. It contains data for male and female actors in English (https://github.com/numediart/EmoV-DB) | 
| SLR116 | Samrómur Queries 21.12 | Speech | Samrómur Icelandic Speech corpus focused on queries and approved for release in December 2021 | 
| SLR117 | Samrómur Children 21.09 | Speech | Samrómur Icelandic Speech from children (ages 4-17 years) approved for release in September 2021 | 
| SLR118 | 1111 Hours Hindi ASR Challenge | Speech | Datasets for 1111 Hours Hindi ASR Challenge Closed, Self Supervised Closed and Open - 2022 (https://sites.google.com/view/gramvaaniasrchallenge/home) | 
| SLR119 | AliMeeting | Speech | A Free Mandarin Multi-channel Meeting Speech Corpus, provided by Alibaba Group | 
| SLR120 | HI-MIA-CW | Speech | A Free Mandarin Supplemental Speech Corpus to HI-MIA Database, whose contents are negative samples for wake-up words "Hi, Mia". | 
| SLR121 | WenetSpeech | Speech | A 10000+ Hours Multi-Domain Mandarin Corpus for Speech Recognition | 
| SLR122 | Kashmiri Data Corpus | Speech | An audio and text corpus for the Kashmiri language | 
| SLR123 | MAGICDATA Mandarin Chinese Conversational Speech Corpus | Speech | A corpus by Magic Data Technology Co., Ltd., containing 180 hours of richly annotated Mandarin spontaneous conversational speech data. | 
| SLR124 | TIBMD@MUC speech data set | Speech | A Tibetan multi-dialect speech data set (84.33 hours) | 
| SLR125 | Basic LAnguage Resource Kit 1.0 for Faroese | Speech | Faroese Speech corpus approved for release in July 2022 | 
| SLR126 | IISc-MILE Kannada ASR Corpus | Speech | Kannada transcribed speech corpus for ASR | 
| SLR127 | IISc-MILE Tamil ASR Corpus | Speech | Tamil transcribed speech corpus for ASR | 
| SLR128 | Samrómur Unverified 22.07 | Speech | Samrómur Icelandic Speech, 2,200 hours of mostly unverified data approved for release in July 2022 | 
| SLR129 | BibleTTS | Speech | A large, high-fidelity, multilingual, and uniquely African speech corpus | 
| SLR130 | Samrómur L2 22.09 | Speech | Samrómur Icelandic Speech, 150 hours from people with Icelandic as a second language. Approved for release in July 2022 | 
| SLR131 | Samrómur Mimic 22.09 | Speech | Samrómur Icelandic Speech, 66.7 hours of speech where users mimic utterances. Approved in September 2022 | 
| SLR132 | Mohammed | Speech | Arabic speech to text Quran data | 
| SLR133 | XBMU-AMDO31 | Speech | Tibetan Amdo dialect speech data from NLIT, Northwest Minzu University | 
| SLR134 | SASPEECH | Speech | Hebrew speech and transcripts by a single speaker (30 hours) | 
| SLR135 | Libri-Mixed-Speakers | Speech | English audio of simultaneous speakers derived from LibriTTS | 
| SLR136 | EMNS | Speech, text-to-speech, automatic speech recognition | An emotive single-speaker dataset for narrative storytelling. EMNS is a dataset containing transcriptions, emotion, emotion intensity, and description of acted speech. | 
| SLR137 | Silbo Gomero Speech Corpus | Speech | Corpus of the Silbo Gomero whistled language, based on 49 minutes of recordings created by 4 whistlers. | 
| SLR138 | SHALCAS22A | Speech | A Chinese Mandarin corpus by Shanghai Acoustics Laboratory, CAS and Wuxi Sandu Intelligent Technology Co., Ltd. | 
| SLR139 | Audiocite.net | Speech | Spoken dataset of books read in French, initially collected from audiocite.net by the GETALP team for the LeBenchmark project. | 
| SLR140 | Kazakh Speech Dataset (KSD) | Speech | High-quality open source Kazakh speech corpus developed by the Department of Artificial Intelligence and Big Data of Al-Farabi Kazakh National University (554 hours) | 
| SLR141 | LibriTTS-R | Speech | Sound quality improved version of the LibriTTS corpus which is a large-scale corpus of English speech designed for TTS use | 
| SLR142 | The MC Speech Dataset | Speech | Free speech dataset consisting of 24018 short audio clips of a single speaker reading sentences in Polish | 
| SLR143 | Nepali Text-to-Speech Data (Male and Female) | Speech | Nepali speech and corresponding text data in male and female voice | 
| SLR144 | SlideSpeech | Audio-Visual Speech | A Large-scale English Multi-Modal Audio-Visual Corpus, provided by Alibaba Group | 
| SLR145 | LibriSpeech-PC | Text | LibriSpeech text with Punctuation and Capitalization | 
| SLR146 | CML-TTS Dataset | Speech | CML-TTS: A Multilingual Dataset for Speech Synthesis in Low-Resource Languages | 
| SLR147 | Veracruz Orizaba Nahuatl Endangered Language | Speech | Audio corpus of Orizaba (Veracruz) Nahuatl speech (Glottocode: oriz1235; ISO 639-3: nlv) | 
| SLR148 | Tepetzintla Zacatlan Nahuatl Endangered Language | Speech | Audio corpus of Zacatlán-Ahuacatlán-Tepetzintla (Puebla) Nahuatl speech (Glottocode: zaca1241; ISO 639-3: nhi) | 
| SLR149 | Tibetan Greetings | Speech | Selected Tibetan greetings speech data categorized according to the dialectal region. | 
| SLR150 | CHiME-6 | Speech | English multi-channel far field meeting data used in the CHiME-6 Challenge. It is derived from CHiME-5 by fixing some array synchronization errors. | 
| SLR151 | Kallaama | Speech | Wolof, Pulaar and Sereer data | 
| SLR152 | Pragmatic Similarity Judgments | Speech | Judgments of perceived similarity between utterance pairs from dialogs, in English and Spanish. | 
| SLR153 | Yerevan City Magazine | Text | A Free Armenian News Text Corpus, provided by Qaghaki Amsagir LLC (Yerevan City Magazine, evnmag.com) | 
| SLR154 | ArmenianGrqaserAudioBooks | Speech | Cut, segmented, and processed (speech, text) paired data, derived from the Grqaser.org audiobooks | 
| SLR155 | SBCSAE | Speech | The Santa Barbara Corpus of Spoken American English, mirrored from UCSB | 
| SLR156 | SMIIP-TV dataset | Speech | A short-term time-varying speaker verification dataset | 
| SLR157 | Sagalee | Speech | Automatic Speech Recognition Dataset for Oromo Language |