ULCA Hindi ASR Dataset Corpus #7898

muni7085 · 2023-11-16T13:41:53Z

I am working on a Hindi ASR system and am collecting transcribed Hindi speech data for it. In the process I found the ULCA ASR Dataset Corpus on GitHub, but the links seem to be broken. I also found that the NVIDIA RIVA Hindi ASR models are trained on the ULCA Hindi ASR Dataset Corpus. Are there any other sources for this dataset?
Can anyone please help? 🙏

1-800-BAD-CODE · 2023-11-23T16:13:39Z

I'm unaffiliated with NVIDIA and the model you mention, but I'm familiar with this data, so I'll give you my two cents.

You cannot find ULCA for download because it contains explicitly copyrighted material. The authors presumably do not have permission to distribute the data, and they ghost you if you ask about it (Open-Speech-EkStep/ULCA-asr-dataset-corpus#4).

If you are not already aware of it, you may be interested in the Shrutilipi dataset.

Though Shrutilipi is also collected from broadcasts, it differs from ULCA in these ways:

It's collected from a single source (easier to verify whether it's ok for you to use it)
It's collected from a government-affiliated source (less likely to bring litigation/take-down notices)
It's collected in and distributed from India
The authors describe in detail where it came from and how it was collected
The authors use an appropriate license (they license the packaging, not the data)

muni7085 Nov 28, 2023
Author

I appreciate your answer. Thank you for your help.
