Can this be used to mute non speech parts of an audio? #27

Open
orionflame opened this issue Apr 5, 2024 · 3 comments
Comments

@orionflame

Hi,

I have recorded a lot of narration for a tutorial I made, and I am trying to clean up the audio files by removing anything that is not speech, which is mostly throat clearing and the like. Here is a very short sample:

https://www.dropbox.com/scl/fi/kotmse874x4rsi86kr8f8/voice3.mp3?rlkey=l5m56g5axort1ru70goo3rvch&dl=1

I couldn't install this library locally yet because of some dependency errors, so I used the Hugging Face version (time res = 1.6) and got this:

0.0s-6.9s: pretty much everything you could want that occur around the normal vector not
6.9s-13.3s: along it. Keenan Crane is one of the leading
13.3s-17.2s: researchers in computational geometry.

The first thing I noticed is that I said "Keenan" three times. The first two were retakes, so only the last one should remain. You can hear this in the audio. Does this library also de-duplicate words?

For tags I got these:
0.0s-1.6s: Speech, Narration, monologue, Speech synthesizer, Clicking, Male speech, man speaking
1.6s-3.2s: Speech, Narration, monologue, Speech synthesizer, Clicking, Male speech, man speaking
3.2s-4.8s: Speech, Inside, small room, Clicking, Speech synthesizer, Narration, monologue
4.8s-6.4s: Speech, Narration, monologue, Speech synthesizer, Male speech, man speaking
6.4s-8.0s: Speech, Narration, monologue, Clicking, Speech synthesizer, Inside, small room
8.0s-9.6s: Speech, Clicking, Inside, small room
9.6s-11.2s: Speech, Clicking, Inside, small room, Narration, monologue, Male speech, man speaking
11.2s-12.8s: Speech, Speech synthesizer
12.8s-14.4s: Sine wave
14.4s-16.0s: Sine wave, Hum, Chime, White noise, Boiling

How can I use these tags to keep only the speech? I have already written code that uses timestamps to mute the gaps between words. I tried Whisper, but it still kept the coughing and throat-clearing parts.
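One way to act on the tag output above is sketched here: parse each `start-end: tags` line, then zero out every window whose tags do not include the plain `Speech` label. This is only an illustration under some assumptions: the helper names `parse_tag_lines` and `mute_non_speech` are made up for this example, the audio is assumed to be already decoded into a flat list of mono samples, and the 1.6 s windows are coarse, so a cough inside a window that also contains speech will not be removed this way.

```python
import re

# Matches lines such as "12.8s-14.4s: Sine wave, Hum".
LINE_RE = re.compile(r"^\s*([\d.]+)s-([\d.]+)s:\s*(.*)$")


def parse_tag_lines(text):
    """Parse tag lines into (start_sec, end_sec, set_of_comma_separated_tokens)."""
    windows = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # skip anything that is not a tag line
        # Some labels contain commas ("Narration, monologue"), so we only rely
        # on single tokens; "Speech" is a token of its own.
        tokens = {t.strip() for t in m.group(3).split(",")}
        windows.append((float(m.group(1)), float(m.group(2)), tokens))
    return windows


def mute_non_speech(samples, sample_rate, windows, keep_label="Speech"):
    """Return a copy of samples where windows lacking keep_label are zeroed."""
    out = list(samples)
    for start, end, tokens in windows:
        if keep_label in tokens:
            continue
        lo = max(0, int(round(start * sample_rate)))
        hi = min(len(out), int(round(end * sample_rate)))
        for i in range(lo, hi):
            out[i] = 0.0
    return out


# Toy usage: two lines copied from the tag output, a fake 10 Hz "recording".
tags = "11.2s-12.8s: Speech, Speech synthesizer\n12.8s-14.4s: Sine wave"
muted = mute_non_speech([1.0] * 160, 10, parse_tag_lines(tags))
print(sum(1 for s in muted if s == 0.0), "samples muted")
```

In this sample every window up to 12.8 s carries the `Speech` tag, so the tags alone would only mute the tail; they would probably need to be combined with the word timestamps to be useful.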

I tried whisperHallu, but it also had issues: it cut some words off halfway.

All I need is to keep only the speech parts. After that, I will have to figure out a way to remove retakes: sometimes it is a single word, sometimes half a sentence repeated multiple times, but it is always the last take that should be kept.
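For the retake problem, a rough sketch of one possible approach: given a word list with timestamps, look for a phrase that is immediately followed by an identical copy of itself, drop the earlier copy, and repeat until nothing changes. Everything here is an assumption for illustration: the `(text, start, end)` tuple layout, the function names, and the idea that retakes show up as exact back-to-back repeats. It would also remove legitimate repetitions such as "very very", so the dropped spans should be reviewed before muting.

```python
def normalize(word):
    """Lowercase and strip punctuation so "Keenan," matches "keenan"."""
    return "".join(ch for ch in word.lower() if ch.isalnum())


def drop_retakes(words, max_len=8):
    """Remove earlier copies of back-to-back repeated phrases.

    words: list of (text, start_sec, end_sec) tuples in time order.
    Returns (kept_words, dropped_spans), where dropped_spans is a list of
    (start_sec, end_sec) ranges that could be muted.
    """
    kept = list(words)
    dropped = []
    changed = True
    while changed:
        changed = False
        keys = [normalize(w[0]) for w in kept]
        # Longer phrases first, so a repeated half sentence goes as one unit.
        for n in range(min(max_len, len(kept) // 2), 0, -1):
            for i in range(len(kept) - 2 * n + 1):
                if keys[i:i + n] == keys[i + n:i + 2 * n]:
                    dropped.append((kept[i][1], kept[i + n - 1][2]))
                    del kept[i:i + n]  # keep the later take
                    changed = True
                    break
            if changed:
                break
    return kept, dropped


# Toy usage with made-up timestamps.
words = [("Keenan", 0.0, 0.4), ("Keenan", 1.0, 1.4), ("Keenan", 2.0, 2.4),
         ("Crane", 2.4, 2.8), ("is", 2.8, 3.0)]
kept, dropped = drop_retakes(words)
print([w[0] for w in kept], dropped)
```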

Any ideas?

@dgoryeo

dgoryeo commented Aug 10, 2024

Hi @orionflame, did you by any chance find a solution to your question?

@orionflame

> Hi @orionflame, did you by any chance find a solution to your question?

Unfortunately, no. Do you have any leads?

@dgoryeo

dgoryeo commented Aug 11, 2024

I was wondering whether one could distill the 527-class AudioSet label set down to a much smaller list of events, say fewer than 10, to be used with this method:

audio_tag_result = whisper.parse_at_label(result, language='follow_asr', top_k=5, p_threshold=-1, include_class_list=list(range(527)))
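A minimal sketch of how such a shortlist could be turned into indices for `include_class_list`, assuming you have the AudioSet display names in index order (for example, loaded from the official class label file). The helper `build_include_list` is hypothetical, and the six names in the demo are a toy list in a made-up order, not the real AudioSet indexing. Whether `parse_at_label` then behaves as hoped with a short list is not confirmed in this thread.

```python
def build_include_list(display_names, wanted):
    """Map label names to their positions in display_names (index order).

    display_names: all class names, ordered by class index.
    wanted: the small set of labels you care about.
    Raises KeyError if a wanted label is not present.
    """
    index_of = {name: i for i, name in enumerate(display_names)}
    missing = [name for name in wanted if name not in index_of]
    if missing:
        raise KeyError(f"labels not found: {missing}")
    return sorted(index_of[name] for name in wanted)


# Toy list for illustration only; the real list has 527 entries and a
# different order.
toy_names = ["Speech", "Male speech, man speaking", "Cough",
             "Throat clearing", "Sine wave", "Clicking"]
include = build_include_list(toy_names, ["Speech", "Cough", "Throat clearing"])
print(include)

# The result would replace list(range(527)) in the call quoted above:
#   include_class_list=include
```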
