
Feature Request + my code: audio cleanup #108

Open
thiswillbeyourgithub opened this issue Oct 7, 2024 · 7 comments
Comments

@thiswillbeyourgithub (Contributor)

Hi,

I'm a happy user of faster-whisper-server. I mainly use it as a Whisper backend for open-webui and recently opened an issue there to share my code for high-quality audio cleaning that removes silences INSIDE the audio (whereas VAD only handles the start of the audio). As I ran into issues several times with faster-whisper-server where a sentence would get repeated ad nauseam because I stopped talking for more than 30s in a recording, I thought I'd request this feature here and send my code to deduplicate efforts.

Here's the link to the issue: open-webui/open-webui#5972

Here's the content:

Not sure opening a Feature Request for this is okay with you, but I don't have the time to do a PR and I saw you sort of struggled with the audio cleaning :)

(Edit: forgot to mention the reasons why I think this should run in all situations and not just to reduce file size:

  1. Reduced costs
  2. Whisper works in 30s sections. If you hesitate on something during the recording, pause to think, and end up not speaking for more than 30s, Whisper enters a loop where it just repeats strings even if you start talking again later. For some use cases of open-webui this happens a lot (I use it a lot to help me reason through medical school).
  3. Also, given the feature to send audio files directly, having a cleanup function already enabled would help a lot, especially for things like conferences where there can be some silences.

)

In a private repo I have a piece of code that might be useful to open-webui. It uses torchaudio to apply sox effects to the audio and cleans it so that any silence longer than X amount of time gets squished.

That sounds trivial but it's not: pydub's implementation is not scalable (it has lots of performance issues, ESPECIALLY in the silence-related code, which can eat all the CPU, throw OOM, and appear to hang even though it's just super slow), so it's not suitable for open-webui in my opinion.

Sox has none of those issues, but it is so complex and unintuitive that it took me a while to get the parameters working; now (and for over a year) it has been perfect (it can clean hours of audio in seconds). There are some kinks around file formats that may require using soundfile as a dependency too. The short story is that torchaudio does not support all the same formats as pydub, soundfile, etc. On my Linux machine, installing sox inside the Docker container was as simple as apt update && apt install sox.

Anyway, here's the code if you're interested. If you use it, I would appreciate being credited with my GitHub username as the commit author :)

The gist is this:

        # List comes from typing (from typing import List)
        # sox effects applied when loading a sound
        preprocess_sox_effects: List[List[str]] = [
                # normalize audio
                ["norm"],

                # isolate voice frequency
                # -2 is for a steeper filter slope
                # ["highpass", "-1", "100"],
                # ["lowpass", "-1", "3000"],
                # # removes very low frequencies and high ones
                ["highpass", "-2", "50"],
                ["lowpass", "-2", "5000"],

                # max silence should be 1s
                ["silence", "-l", "1", "0", "0.5%", "-1", "1.0", "0.5%"],

                # # remove leading silence
                # ["vad", "-p", "0.2", "-t", "5"],
                # # and trailing silence; this might be unnecessary for split audio
                # ["reverse"],
                # ["vad", "-p", "0.2", "-t", "5"],
                # ["reverse"],

                # add blank sound to help whisper
                ["pad", "0.2@0"],
                ]

But in some situations that was not enough, so I sometimes applied an extra processing pass after the first one. Maybe use it as a last resort if the audio is still too long.
                                                                                   
        # sox effects applied when forcing the processing of a sound
        force_preprocess_sox_effects: List[List[str]] = [
                # normalize audio
                ["norm"],

                # filter for voice
                ["highpass", "-2", "50"],
                ["lowpass", "-2", "5000"],

                # max silence should be 1s
                ["silence", "-l", "1", "0", "2%", "-1", "1.0", "2%"],

                # # remove leading silence
                # ["vad", "-p", "0.2", "-t", "5"],
                # # and trailing silence; this might be unnecessary for split audio
                # ["reverse"],
                # ["vad", "-p", "0.2", "-t", "5"],
                # ["reverse"],

                # add blank sound to help whisper
                ["pad", "0.2@0"],
                ]

Notes: I'm not sure the padding is all that useful, and technically it can waste money. Also, the vad effects seemed redundant with the silence commands. Not sure if normalizing is useful. The core of my contribution is those damned silence arguments; modify them at your own risk.
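For reference, here is an annotated restatement of those silence arguments based on the sox manual (not authoritative; double-check against the sox docs before changing anything):

        # silence [-l] above-periods [duration threshold] [below-periods duration threshold]
        ["silence",
         "-l",     # with -l, long silences are shortened to the duration below instead of removed
         "1",      # above-periods: trim silence from the start until non-silence is detected
         "0",      # duration of non-silence required before trimming stops
         "0.5%",   # threshold: samples below 0.5% of full scale count as silence
         "-1",     # below-periods: a negative value means mid-audio silences are processed too
         "1.0",    # keep at most 1.0 s of each silence
         "0.5%"]   # threshold used to detect silence after speech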

And the gist of the actual processing is this:

        # these imports would normally live at module level
        import soundfile as sf
        import torchaudio
        from pathlib import Path
        from pydub import AudioSegment

        # load from file
        waveform, sample_rate = torchaudio.load(audio_mp3_path)

        # apply the sox effect chain defined above
        waveform, sample_rate = torchaudio.sox_effects.apply_effects_tensor(
                waveform,
                sample_rate,
                shared.preprocess_sox_effects,
                )

        # write to file as wav
        sf.write(str(audio_mp3_path), waveform.numpy().T, sample_rate, format='wav')
        temp = AudioSegment.from_wav(audio_mp3_path)
        new_path = Path(audio_mp3_path).parent / (Path(audio_mp3_path).stem + "_proc" + Path(audio_mp3_path).suffix)
        temp.export(new_path, format="mp3")

I needed those conversions for my own code (I'm using Gradio), but I'm not sure they're needed here, so you might not have to include soundfile.
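If those conversions aren't needed, a torchaudio-only sketch (no soundfile or pydub) could look something like this; squish_silences is just an illustrative name:

        import torchaudio

        def squish_silences(in_path, out_path, effects):
            # load, apply the sox effect chain, and write the cleaned audio
            waveform, sample_rate = torchaudio.load(in_path)
            waveform, sample_rate = torchaudio.sox_effects.apply_effects_tensor(
                    waveform, sample_rate, effects)
            torchaudio.save(out_path, waveform, sample_rate)
            return out_path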

@fedirz (Owner)

fedirz commented Oct 9, 2024

Have you tried using the vad_filter instead?

@thiswillbeyourgithub (Contributor, Author)

thiswillbeyourgithub commented Oct 9, 2024

AFAIK vad_filter is a "voice activity detection" filter, so its job is to crop only the start of the audio, not silence in the middle or at the end. Torchaudio even recommends using VAD, then reverse + VAD + reverse, to crop both ends. But it's not able to remove silence in other places.
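(For context, enabling the built-in VAD filter in faster-whisper looks roughly like this; the model name and parameter values are only illustrative:)

    from faster_whisper import WhisperModel

    model = WhisperModel("large-v3")
    segments, info = model.transcribe(
        "audio.mp3",
        vad_filter=True,  # runs Silero VAD before transcription
        vad_parameters={"min_silence_duration_ms": 500},  # illustrative value
    )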

@cyberluke

cyberluke commented Oct 12, 2024

I think everyone who's using a generic or overly simple implementation of VAD is doing it wrong, because it detects energy (based on amplitude): "Energy in audio processing is typically calculated as the square of the amplitude values."
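(To make that concrete, here is a tiny sketch of the kind of per-frame energy computation being criticized; frame_energy is just an illustrative helper, not code from any of the libraries discussed:)

    import torch

    def frame_energy(waveform, frame_size=400, hop=160):
        # mean squared amplitude per frame, for a mono waveform of shape (samples,)
        frames = waveform.unfold(0, frame_size, hop)  # (n_frames, frame_size)
        return frames.pow(2).mean(dim=-1)

    # a naive energy-based VAD then just thresholds these per-frame values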

Here is a better implementation for Whisper: https://github.com/wavey-ai/mel-spec?tab=readme-ov-file

Basically, you want to implement VAD based on the frequency spectrum; there you can recognize speech much better than just noise vs. quiet!

Also, this implementation can detect quiet between audio parts, not just at the start or end, and it supports streaming for Whisper.

@thiswillbeyourgithub (Contributor, Author)

> I think everyone who's using a generic or overly simple implementation of VAD is doing it wrong, because it detects energy (based on amplitude): "Energy in audio processing is typically calculated as the square of the amplitude values."
>
> Here is a better implementation for Whisper: https://github.com/wavey-ai/mel-spec?tab=readme-ov-file
>
> Basically, you want to implement VAD based on the frequency spectrum; there you can recognize speech much better than just noise vs. quiet!
>
> Also, this implementation can detect quiet between audio parts, not just at the start or end, and it supports streaming for Whisper.

This seems to require significantly more work to set up than just using sox via torchaudio, which may even already be a dependency. But I'm not the owner, and maybe his skills make it easy enough. In any case, I personally have never noticed the lack of precision of the "naive" amplitude-based approach.

@cyberluke

cyberluke commented Oct 12, 2024

@thiswillbeyourgithub What? Please, I did not want to offend anyone.

  1. Feature extraction and complexity analysis
    The mel-spec VAD is super simple; it is just one class with 4 methods, and that includes the constructor: VoiceActivityDetector, here: https://github.com/wavey-ai/mel-spec/blob/main/src/vad.rs

STFT, mel-spec, quantization: all of that is already included in PyTorch (torch, torchaudio).

All of this is already included in faster-whisper, which is a dependency. You can check it here: https://github.com/SYSTRAN/faster-whisper/blob/master/requirements.txt

This is how to do it in PyTorch:

import torch
import torchaudio

# STFT / mel parameters (Whisper-style values at 16 kHz; adjust as needed)
sample_rate = 16000
n_fft = 400
hop_length = 160
win_length = 400

# `waveform` is assumed to be a (channels, samples) tensor, e.g. from torchaudio.load()

# Compute STFT
stft = torch.stft(
    waveform,
    n_fft=n_fft,
    hop_length=hop_length,
    win_length=win_length,
    return_complex=True
)
# Convert to magnitude spectrogram
spectrogram = torch.abs(stft)

# Convert to Mel-scale
n_mels = 128
mel_spectrogram = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=n_fft,
    hop_length=hop_length,
    n_mels=n_mels
)(waveform)
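(A crude sketch, not taken from mel-spec, of how a per-frame speech decision could then be made on the mel spectrogram rather than on raw energy; the bin range and threshold below are arbitrary placeholders:)

    # keep only mel bins that roughly cover speech, then threshold per-frame band energy
    speech_band = mel_spectrogram[..., 5:80, :]               # bin range is arbitrary
    band_energy = speech_band.sum(dim=-2)                     # (..., n_frames)
    noise_floor = band_energy.median(dim=-1, keepdim=True).values
    is_speech = band_energy > 4.0 * noise_floor               # threshold is arbitrary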

They have an example of real-time streaming from a microphone, and they even run it in JavaScript in the browser: https://github.com/wavey-ai/mel-spec/blob/main/examples/browser/worker.js ...but they say they tested it on a MacBook M2. I'm not sure if mobile devices would handle that.

  2. Your solution is not that bad, but I happened to come here because VAD is not working for me. You also mention that when there is more audio coming without silence, it will overload the processing. Your solution might fix the problem with repeating translations, but I think it is only a partial solution.

Why do you apply a low-pass and high-pass filter from 50 Hz to 5000 Hz? That is not the frequency range of the human voice:

> A female voice frequency range covers fairly up to 350 Hz to 17 kHz. Its fundamental frequency is 350 Hz to 3 kHz and harmonics are 3 kHz to 17 kHz. A male voice covers a frequency range of 100 Hz to 8 kHz. The fundamental is 100 Hz to 900 Hz and harmonics are 900 Hz to 8 kHz.

Most low-quality speech compression codecs work at around 8 kHz. But Whisper has been trained on 16 kHz audio, so I think we should not lower the input quality on purpose because it can lose precision for the prediction. But I did not test it; it is just my observation.
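(If the concern is matching what Whisper was trained on, resampling to 16 kHz with torchaudio is one way to do that without band-limiting to 5 kHz; just a sketch, assuming waveform and sample_rate are already loaded:)

    import torchaudio

    # resample to Whisper's expected 16 kHz instead of low-passing at 5 kHz
    resample = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform_16k = resample(waveform)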

Anyway, I'm just talking from my experience as a developer with machine learning and audio processing skills. I will fork it and tune it for the best performance and quality for myself. You guys can decide how feasible you find it.

@cyberluke

Also, if you want to use VAD, you should apply this commit from faster-whisper that fixes a few things and has not been merged to the main branch: SYSTRAN/faster-whisper@2f6790a

@thiswillbeyourgithub (Contributor, Author)

Hi. I'm really sorry it appears I answered without proper knowledge and wasted your time.

> @thiswillbeyourgithub What? Please, I did not want to offend anyone.
>
>   1. Feature extraction and complexity analysis
>     The mel-spec VAD is super simple; it is just one class with 4 methods, and that includes the constructor: VoiceActivityDetector, here: https://github.com/wavey-ai/mel-spec/blob/main/src/vad.rs
>
> STFT, mel-spec, quantization: all of that is already included in PyTorch (torch, torchaudio).
>
> All of this is already included in faster-whisper, which is a dependency. You can check it here: https://github.com/SYSTRAN/faster-whisper/blob/master/requirements.txt
>
> This is how to do it in PyTorch:
>
>     # Compute STFT
>     stft = torch.stft(
>         waveform,
>         n_fft=n_fft,
>         hop_length=hop_length,
>         win_length=win_length,
>         return_complex=True
>     )
>     # Convert to magnitude spectrogram
>     spectrogram = torch.abs(stft)
>
>     # Convert to Mel-scale
>     n_mels = 128
>     mel_spectrogram = torchaudio.transforms.MelSpectrogram(
>         sample_rate=sample_rate,
>         n_fft=n_fft,
>         hop_length=hop_length,
>         n_mels=n_mels
>     )(waveform)
>
> They have an example of real-time streaming from a microphone, and they even run it in JavaScript in the browser: https://github.com/wavey-ai/mel-spec/blob/main/examples/browser/worker.js ...but they say they tested it on a MacBook M2. I'm not sure if mobile devices would handle that.

I only know Python, and 98% of this repo is in Python, so at first glance adding this other code seemed like a more significant endeavor than a couple of lines of Python. But again, I'm not the owner, so please don't assume they're as incompetent as I am!

>   2. Your solution is not that bad, but I happened to come here because VAD is not working for me. You also mention that when there is more audio coming without silence, it will overload the processing. Your solution might fix the problem with repeating translations, but I think it is only a partial solution.
>
> Why do you apply a low-pass and high-pass filter from 50 Hz to 5000 Hz? That is not the frequency range of the human voice:
>
> A female voice frequency range covers fairly up to 350 Hz to 17 kHz. Its fundamental frequency is 350 Hz to 3 kHz and harmonics are 3 kHz to 17 kHz. A male voice covers a frequency range of 100 Hz to 8 kHz. The fundamental is 100 Hz to 900 Hz and harmonics are 900 Hz to 8 kHz.

Whoops, indeed that's totally a mistake on my part. I don't recall exactly where I got those figures. But IIRC it was also because I preferred to deteriorate the quality of the recording but be sure to remove other noises that are also in that range.

> Most low-quality speech compression codecs work at around 8 kHz. But Whisper has been trained on 16 kHz audio, so I think we should not lower the input quality on purpose because it can lose precision for the prediction. But I did not test it; it is just my observation.
>
> Anyway, I'm just talking from my experience as a developer with machine learning and audio processing skills. I will fork it and tune it for the best performance and quality for myself. You guys can decide how feasible you find it.

Thank you for taking the time to explain those aspects to me. Also, I was just passing by and am just a user of this repo, not affiliated with the owner, who might completely disagree with my thoughts on this! And thanks for linking the other PR.
