What data we host
We are open to hosting any type of data that's useful for speech recognition and related tasks, that needs a stable URL where it can be downloaded from. We may think more carefully in cases where the data is very large (e.g. tens of gigabytes or more).
Submitting your data
The process of adding data to OpenSLR is as follows. First you might want to quickly check with us whether the data you want to contribute is something we want to host; you can email jtrmal@gmail.com. If we think it's a good idea, you can prepare a .tar.gz file containing a directory with your data in it.
The format of submitted data
The directory that you transfer to us as a .tar.gz file should not contain subdirectories; it should just contain the files you want to host and two special files called info.txt and about.html, whose format we'll explain below. Here is an example of such a directory:
# ls /var/www/openslr/resources/6
about.html  data_voip_cs.tgz  data_voip_en.tgz  info.txt
Note: the .tgz files inside it are the actual files that we're offering for download (and there is no limitation on their names or file-type, except for the no-subdirectories rule). What you would transfer to us is a .tar.gz file containing /var/www/openslr/resources/6, i.e. the four files you see in the listing above. This information is used to automatically populate the web-page at http://www.openslr.org/6/. An example of what the info.txt file looks like is as follows:
root@www:/var/www/openslr# cat /var/www/openslr/resources/6/info.txt
name: Vystadial
summary: English and Czech data, mirrored from the Vystadial project
category: speech
license: Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0 US)
file: data_voip_cs.tgz  Czech speech and transcripts
file: data_voip_en.tgz  English speech and transcripts
alternate_url: https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-4670-6  Czech data
alternate_url: https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-4671-4  English data
This is a plain-text file that will be parsed by PHP scripts on our site. Some of the fields are mandatory and must appear exactly once: the name, summary, category and license fields.
The name field gives the name of your resource, which shouldn't be too long. The summary is a short, sentence-length description of the resource. The category will normally be either "speech", "text" or "software", but it can have other values too. The license line should be concise; it can just summarize the license, which we assume is explained more fully in the download itself or in the about.html file. There may be multiple instances of the file field; each one corresponds to one of the files in the directory you sent us. The text after the filename in the file field is optional; if your resource only contains one file it may not be necessary. The alternate_url field is optional and, if it occurs, may be repeated; the text after the URL is optional.
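If you would like to check an info.txt file before sending it, the rules above can be sketched in a few lines of Python. This is only an illustration: the site's real parser is a PHP script that is not shown here, so the function names parse_info and split_entry are hypothetical, and the choices to ignore blank lines, reject unknown field names, and treat the first whitespace-separated token as the filename or URL are assumptions.

```python
from collections import defaultdict

# Fields described above: four must appear exactly once, two may repeat.
REQUIRED_ONCE = ("name", "summary", "category", "license")
REPEATABLE = ("file", "alternate_url")


def parse_info(text):
    """Parse info.txt text into {field: [values]}; raise ValueError on problems."""
    fields = defaultdict(list)
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        # Split at the first colon only, so URLs in the value stay intact.
        key, sep, value = line.partition(":")
        key = key.strip()
        # Assumption: field names other than the six described are errors.
        if not sep or key not in REQUIRED_ONCE + REPEATABLE:
            raise ValueError(f"line {lineno}: unrecognized line {line!r}")
        fields[key].append(value.strip())
    for key in REQUIRED_ONCE:
        if len(fields[key]) != 1:
            raise ValueError(f"field {key!r} must appear exactly once")
    return dict(fields)


def split_entry(value):
    """Split a file or alternate_url value into (target, optional description)."""
    parts = value.split(None, 1)
    if not parts:
        raise ValueError("empty file or alternate_url value")
    description = parts[1] if len(parts) > 1 else ""
    return parts[0], description
```

Calling parse_info on the contents of the example file would return one value each for name, summary, category and license, and two values each for file and alternate_url; split_entry then separates each filename or URL from its optional description.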
 
The about.html file is generic HTML which will be included in the "about this resource"
 section of the automatically generated webpage.  Just send us a first guess and you can edit it later
 if needed.  In our example, the about.html file looks like this:
This data is transcribed telephone conversation data, in English and Czech.
<p>
The data collection process and development of these training scripts was partly
funded by the Ministry of Education, Youth and Sports of the Czech Republic
under the grant agreement LK11221 and core research funding of Charles
University in Prague.
<p>
You can cite the data using the following BibTeX entry:
<pre>
@inproceedings{korvas_2014,
  title={{Free English and Czech telephone speech corpus shared under the CC-BY-SA 3.0 license}},
  author={Korvas, Mat\v{e}j and Pl\'{a}tek, Ond\v{r}ej and Du\v{s}ek, Ond\v{r}ej and \v{Z}ilka, Luk\'{a}\v{s} and Jur\v{c}\'{i}\v{c}ek, Filip},
  booktitle={Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014)},
  pages={To Appear},
  year={2014},
}
</pre>
Once you have your .tar.gz file containing info.txt, about.html and your
actual data, you can transfer it to us (we'll have to discuss the exact mechanism if it's too big to send by email)
and we'll check it and put it on the site.
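As a rough sketch of the packaging step, the following Python checks the layout rules described earlier (no subdirectories, both special files present) and then writes the .tar.gz. The function names check_layout and make_archive are hypothetical, and placing the files under a single top-level folder named after your directory is an assumption about what is convenient to receive, not a stated requirement.

```python
import tarfile
from pathlib import Path

# The two special files every submitted directory should contain.
SPECIAL_FILES = ("info.txt", "about.html")


def check_layout(directory):
    """Return a list of problems with the directory layout (empty if none)."""
    directory = Path(directory)
    problems = []
    for name in SPECIAL_FILES:
        if not (directory / name).is_file():
            problems.append(f"missing {name}")
    for entry in sorted(directory.iterdir()):
        if entry.is_dir():
            problems.append(f"subdirectory not allowed: {entry.name}")
    return problems


def make_archive(directory, archive_path):
    """Write a gzip-compressed tar of the directory after checking its layout."""
    directory = Path(directory).resolve()
    problems = check_layout(directory)
    if problems:
        raise ValueError("; ".join(problems))
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # Assumption: store the files under one top-level folder
        # named after the directory itself.
        tar.add(str(directory), arcname=directory.name)
    return archive_path
```

For example, make_archive("myresource", "myresource.tar.gz") would raise ValueError if info.txt or about.html were missing, and otherwise produce the archive to send.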