Supported Dataset Formats
CAIMAN-ASR supports reading data from four formats:
| Format | Modes | Description | Docs |
|---|---|---|---|
| JSON | training + validation | All audio as wav or flac files in a single directory hierarchy, with transcripts in JSON file(s) referencing these audio files. | [link] |
| WebDataset | training + validation | Audio `<key>.{flac,wav}` files stored with associated `<key>.txt` transcripts in tar file shards. Format described here | [link] |
| Directories | validation | Audio (wav or flac) files and their respective text transcripts are in two separate directories. | [link] |
| Hugging Face | training (using provided conversion script) + validation | Hugging Face Hub datasets | [link] |
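
As a rough illustration of the WebDataset layout in the table above, the sketch below packs audio files and their transcripts into a single tar shard using only the Python standard library. The helper name `write_shard`, the `(key, audio_path, transcript)` tuple layout, and the shard file name in the usage note are assumptions made for this example and are not part of CAIMAN-ASR; refer to the linked WebDataset documentation for the authoritative requirements.

```python
import io
import tarfile
from pathlib import Path


def write_shard(shard_path, samples):
    """Pack (key, audio_path, transcript) samples into one tar shard.

    Hypothetical helper for illustration only. Each sample becomes two
    tar members that share a key: <key>.flac or <key>.wav, and <key>.txt.
    Assumption: keys contain no dots, since WebDataset-style readers
    usually split member names into key and extension at a dot.
    """
    with tarfile.open(shard_path, "w") as tar:
        for key, audio_path, transcript in samples:
            audio_path = Path(audio_path)
            # Audio member, keeping the original extension in lowercase.
            tar.add(str(audio_path), arcname=f"{key}{audio_path.suffix.lower()}")
            # Transcript member, written from memory as UTF-8 text.
            text = transcript.encode("utf-8")
            info = tarfile.TarInfo(name=f"{key}.txt")
            info.size = len(text)
            tar.addfile(info, io.BytesIO(text))
```

For instance, calling `write_shard("shard-000000.tar", [("utt0001", "audio/utt0001.flac", "hello world")])` (with hypothetical paths) would produce a shard whose members are `utt0001.flac` and `utt0001.txt`.
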
To train on your own proprietary dataset, you will need to arrange for it to be in the WebDataset or JSON format. A worked example of how to do this for the JSON format is provided in `json_format.md`.
The script `hugging_face_to_json.py` converts a Hugging Face dataset to the JSON format; see here for more details.