Supported Dataset Formats

CAIMAN-ASR supports reading data from four formats:

| Format | Modes | Description | Docs |
| --- | --- | --- | --- |
| JSON | training + validation | All audio as wav or flac files in a single directory hierarchy, with transcripts in json file(s) referencing these audio files. | [link] |
| WebDataset | training + validation | Audio <key>.{flac,wav} files stored with associated <key>.txt transcripts in tar file shards. Format described here. | [link] |
| Directories | validation | Audio (wav or flac) files and the respective text transcripts are in two separate directories. | [link] |
| Hugging Face | training (using provided conversion script) + validation | Hugging Face Hub datasets | [link] |

To train on your own proprietary dataset, you will need to arrange for it to be in the WebDataset or JSON format. A worked example of how to do this for the JSON format is provided in json_format.md. The script hugging_face_to_json.py converts a Hugging Face dataset to the JSON format; see here for more details.
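As a rough illustration of the WebDataset layout described above (audio stored as <key>.flac or <key>.wav next to a <key>.txt transcript inside a tar shard), the sketch below packs a few samples into one shard using only the Python standard library. The function name `write_shard`, the shard file name, and the example paths are hypothetical and are not part of CAIMAN-ASR; the linked format documentation remains the reference for exact requirements.

```python
import io
import tarfile
from pathlib import Path


def write_shard(shard_path, samples):
    """Pack (key, audio_path, transcript) triples into a single tar shard.

    Hypothetical helper for illustration only: each sample becomes two tar
    members, <key>.<flac|wav> and <key>.txt, sharing the same key.
    """
    with tarfile.open(shard_path, "w") as tar:
        for key, audio_path, transcript in samples:
            suffix = Path(audio_path).suffix.lower()
            if suffix not in {".flac", ".wav"}:
                raise ValueError(f"unsupported audio type: {audio_path}")

            # Audio member, renamed so that it is stored as <key>.flac or <key>.wav.
            tar.add(str(audio_path), arcname=f"{key}{suffix}")

            # Transcript member, written from memory as <key>.txt.
            text = transcript.encode("utf-8")
            info = tarfile.TarInfo(name=f"{key}.txt")
            info.size = len(text)
            tar.addfile(info, io.BytesIO(text))
```

A call such as `write_shard("shard-000000.tar", [("utt1", "audio/utt1.wav", "hello world")])` would produce a tar file containing `utt1.wav` and `utt1.txt`. Keeping each audio file and its transcript adjacent and under the same key is what lets a tar reader pair them up as one sample.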

Note

If you have a feature request to support training/validation on a different format, please open a GitHub issue.