Supported dataset formats

CAIMAN-ASR supports reading data from four formats:

| Format | Modes | Description | Docs |
|---|---|---|---|
| `JSON` | training + validation | All audio as wav or flac files in a single directory hierarchy with transcripts in json file(s) referencing these audio files. | [link] |
| `Webdataset` | training + validation | Audio `<key>.{flac,wav}` files stored with associated `<key>.txt` transcripts in tar file shards. Format described here | [link] |
| `Directories` | validation | Audio (wav or flac) files and the respective text transcripts are in two separate directories. | [link] |
| `Hugging Face` | validation | Hugging Face Hub datasets; see here for more info. | [link] |

To train on your own proprietary dataset, you will need to convert it to the WebDataset or JSON format. A worked example for the JSON format is provided in json_format.md.
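
For the WebDataset layout, the table above only states that each sample is an audio file `<key>.{flac,wav}` stored next to a `<key>.txt` transcript inside tar shards. The following is a minimal sketch of packing a directory of audio files into such shards using only the Python standard library. It is not part of CAIMAN-ASR's tooling: the function name `write_shards`, the `shard-00000.tar` naming scheme, and the `samples_per_shard` default are all assumptions made for illustration, and the transcripts are assumed to be available as a dictionary keyed by the audio file stem.

```python
import io
import tarfile
from pathlib import Path

AUDIO_SUFFIXES = {".wav", ".flac"}


def write_shards(audio_dir, transcripts, out_dir, samples_per_shard=1000):
    """Pack audio files and their transcripts into tar shards.

    `transcripts` maps an audio file stem (used as the sample key) to its
    transcript text. Returns the list of shard paths written.
    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    audio_dir, out_dir = Path(audio_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Keep only audio files that have a transcript; sort for determinism.
    audio_files = sorted(
        p for p in audio_dir.iterdir()
        if p.suffix.lower() in AUDIO_SUFFIXES and p.stem in transcripts
    )

    shard_paths = []
    for start in range(0, len(audio_files), samples_per_shard):
        # Assumed shard naming scheme: shard-00000.tar, shard-00001.tar, ...
        shard_path = out_dir / f"shard-{start // samples_per_shard:05d}.tar"
        with tarfile.open(shard_path, "w") as tar:
            for audio_path in audio_files[start:start + samples_per_shard]:
                key = audio_path.stem
                # WebDataset-style readers typically split the file name at
                # the first dot, so a dot inside the key would break pairing.
                if "." in key:
                    raise ValueError(f"sample key must not contain '.': {key}")
                # Audio member: <key>.wav or <key>.flac
                tar.add(audio_path, arcname=f"{key}{audio_path.suffix.lower()}")
                # Transcript member: <key>.txt, written right after its audio
                text = transcripts[key].encode("utf-8")
                info = tarfile.TarInfo(name=f"{key}.txt")
                info.size = len(text)
                tar.addfile(info, io.BytesIO(text))
        shard_paths.append(shard_path)
    return shard_paths
```

Calling `write_shards("audio/", {"utt1": "hello world"}, "shards/")` would write `shards/shard-00000.tar` containing `utt1.wav` (or `utt1.flac`) followed by `utt1.txt`. Writing each transcript immediately after its audio file keeps the members of a sample adjacent in the archive, which streaming tar readers generally rely on. Check the WebDataset format documentation linked in the table for any further requirements before training on shards produced this way.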

Note

If you have a feature request to support training/validation on a different format, please open a GitHub issue.