Hello.
After creating a Docker container by following the tutorial video and the README, I tried Live Transcription of microphone input using ffmpeg, but it did not work properly.
After checking the Docker container logs and running the faster-whisper-server source locally with some modifications, I confirmed that a TimeoutError exception was being raised in audio_receiver in stt.py.
In addition, the websocket connection appears to be forcibly closed by the server at several points in stt.py, including where it executes logger.info(f"Not enough speech in the last {config.inactivity_window_seconds} seconds.").
Transcription of local wav files, i.e. not Live Transcription over websocket, worked normally, and sending wav files (rather than microphone input) to "/v1/audio/translations" over websocket also worked normally.
I'm using Windows 10 Pro, so I ran faster-whisper-server under WSL + Docker Desktop and sent the microphone input stream to /v1/audio/translations using ffmpeg for Windows.
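For reference, the server container is the one described in the README; assuming the image names published there (fedirz/faster-whisper-server, latest-cuda / latest-cpu tags) and the default port 8000, the GPU variant is started from WSL roughly like this:

```
docker run --gpus=all --publish 8000:8000 --volume ~/.cache/huggingface:/root/.cache/huggingface fedirz/faster-whisper-server:latest-cuda
```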
Here's how to reproduce:
1. Download ffmpeg and websocat for Windows.
2. Copy the websocat executable into the ffmpeg\bin folder and rename it to websocat.exe.
3. Run Live Transcription with a command of the following form:
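As an illustration, the command has this shape; the microphone device name is a placeholder for the actual DirectShow device, and the port (8000) and raw audio format (16 kHz, mono, s16le) are assumptions based on the server's defaults:

```
ffmpeg -loglevel quiet -f dshow -i audio="Microphone (Your Device Name)" -ac 1 -ar 16000 -f s16le - | websocat --binary ws://localhost:8000/v1/audio/translations
```

websocat then prints the recognition results returned by the server to stdout.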
The client side then prints an error log (the websocket connection is closed from the server side), and the faster-whisper-server side logs the TimeoutError in audio_receiver described above.
A command of the following kind works normally, which shows that my computer's settings and the faster-whisper-server settings themselves are correct:
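The wav file name here is a placeholder, and the port and audio format are the same assumptions as above; the point is that a local file, rather than live microphone input, is piped to the same websocket endpoint:

```
ffmpeg -loglevel quiet -i test.wav -ac 1 -ar 16000 -f s16le - | websocat --binary ws://localhost:8000/v1/audio/translations
```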
I set up a local Python development environment for faster-whisper-server and ran it from VS Code after changing the following setting values in config.py, which made Live Transcription work for a while:
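The exact values are not reproduced here; as an illustration, the edit is of this kind. min_duration is the setting discussed below (raised to about 5 seconds), and inactivity_window_seconds is shown only because it appears in the log message quoted above; the numbers are examples, not the exact values I used:

```python
# config.py -- illustrative values only, not the exact numbers from my setup
min_duration: float = 5.0                # accumulate roughly 5 s of audio before transcribing,
                                         # instead of ~1-second chunks
inactivity_window_seconds: float = 15.0  # the window behind "Not enough speech in the
                                         # last ... seconds"; a larger window is more tolerant
```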
The min_duration setting was the most important one; without changing it, Live Transcription always failed.
After various tests, I came to the following conclusions:
- Transcription of audio files works normally with any method.
- Live Transcription of microphone input does not work properly, because the websocket connection is forcibly closed on the server side while it validates the data received over the WebSocket.
- Simply increasing some of the setting values does not make Live Transcription work properly.
- I tested with both CPU and GPU, and the issue was the same.
- I think running Live Transcription on multi-core CPUs may be feasible, because the faster whisper-large-v3-turbo model is available.
Here are some of my opinions:
- Setting min_duration to about 5 seconds seems to improve both performance and recognition accuracy. Accumulating and recognizing audio buffers in 1-second units is too short.
- For non-English languages such as Korean, Japanese, and Chinese, recognizing audio buffers that are too short causes the accumulated recognition text to keep changing, which is not good.
- You can simulate a native speaker by opening YouTube on a smartphone, playing any Korean or Japanese news video, and placing the phone next to the microphone.
- When I got Live Transcription to work for a while by changing the settings, the text recognized at the start of the audio input kept accumulating for a long time (this can also be seen in your demo mp4). I suspect that with long microphone input, e.g. more than 30 minutes, the returned recognition result will become extremely long.
- It would be better if the recognition results were returned separated into sentences as much as possible, like the transcription results for audio files; a rough sketch of what I mean follows this list.
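As a rough illustration of the sentence-level segmentation meant in the last point, here is a minimal client-side sketch (a hypothetical helper, not part of faster-whisper-server) that splits the accumulated transcript on Latin and CJK sentence-ending punctuation:

```python
import re

# Split the accumulated transcript returned by the live endpoint into finished
# sentences plus a trailing, still-unfinished fragment.
_SENTENCE_END = re.compile(r"(?<=[.!?。！？])\s*")

def split_sentences(accumulated_text: str) -> tuple[list[str], str]:
    """Return (complete_sentences, unfinished_trailing_fragment)."""
    parts = [p for p in _SENTENCE_END.split(accumulated_text) if p]
    if parts and not parts[-1].endswith((".", "!", "?", "。", "！", "？")):
        return parts[:-1], parts[-1]
    return parts, ""

sentences, fragment = split_sentences("안녕하세요. 오늘의 뉴스입니다. 다음 소식")
print(sentences)  # ['안녕하세요.', '오늘의 뉴스입니다.']
print(fragment)   # '다음 소식'
```

A client could then display only the finished sentences and re-render just the trailing fragment, instead of the entire accumulated string.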
Best regards.