
Live Transcription using websocket does not work. #111

Open
ivoryguard opened this issue Oct 10, 2024 · 3 comments

Comments


ivoryguard commented Oct 10, 2024

Hello.

After creating a docker container following the tutorial video and readme, I tried Live Transcription of microphone input using ffmpeg, but it didn't work properly.

After checking the docker container logs and running the faster-whisper-server source locally with modifications, I confirmed that a TimeoutError exception was occurring in the audio_receiver of stt.py in faster-whisper-server.

Additionally, the server seems to forcibly close the websocket connection at several points in stt.py, for example right after executing logger.info(f"Not enough speech in the last {config.inactivity_window_seconds} seconds.").

Recognition of local wav files (i.e., not Live Transcription over websocket) worked normally, and sending wav files (rather than microphone input) to "/v1/audio/translations" via websocket also worked normally.

I'm using Windows 10 Pro, so I ran faster-whisper-server with WSL + Docker Desktop and sent the microphone input stream to /v1/audio/translations using ffmpeg for Windows.


Here's how to reproduce:

  1. Download ffmpeg and websocat for Windows.
  2. Copy the websocat executable to the ffmpeg\bin folder, renaming it to websocat.exe.
  3. Run Live Transcription with the following command:
ffmpeg.exe -f dshow -i audio="Your Mic Input Device Name" -ac 1 -ar 16k -acodec pcm_s16le -f wav - | websocat --binary ws://localhost:8000/v1/audio/transcriptions
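For reference, the byte rate implied by the ffmpeg flags above can be checked with a few lines. Given the server's 1.0-second default no-data window, the client has to deliver roughly this many bytes per second without stalling:

```python
SAMPLE_RATE = 16_000   # -ar 16k
BYTES_PER_SAMPLE = 2   # -acodec pcm_s16le (16-bit samples)
CHANNELS = 1           # -ac 1

# Raw PCM throughput the stream must sustain.
bytes_per_second = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS
chunk_100ms = bytes_per_second // 10
print(bytes_per_second, chunk_100ms)  # 32000 3200
```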

The client side will display the following error log:

websocat: WebSocketError: I/O failureate= 255.4kbits/s speed=1.05x
websocat: error running
av_interleaved_write_frame(): Invalid argument5kbits/s speed=1.04x
Error writing trailer of pipe:: Invalid argument

The faster-whisper-server side logs the following:

2024-10-10 20:51:32 faster-whisper-server-cuda-1  | INFO:     ('172.18.0.1', 53050) - "WebSocket /v1/audio/transcriptions?response_format=json" [accepted]
2024-10-10 20:51:32 faster-whisper-server-cuda-1  | INFO:     connection open
2024-10-10 20:51:33 faster-whisper-server-cuda-1  | 2024-10-10 11:51:33,684:INFO:faster_whisper_server.routers.stt:audio_receiver:No data received in 1.0 seconds. Closing the connection.
2024-10-10 20:51:33 faster-whisper-server-cuda-1  | 2024-10-10 11:51:33,684:INFO:faster_whisper_server.audio:close:AudioStream closed
2024-10-10 20:51:33 faster-whisper-server-cuda-1  | 2024-10-10 11:51:33,685:INFO:faster_whisper_server.transcriber:audio_transcriber:Audio transcriber finished
2024-10-10 20:51:33 faster-whisper-server-cuda-1  | 2024-10-10 11:51:33,685:INFO:faster_whisper_server.model_manager:_decrement_ref:Model Systran/faster-whisper-large-v3 is idle, scheduling offload in 300s
2024-10-10 20:51:33 faster-whisper-server-cuda-1  | 2024-10-10 11:51:33,685:INFO:faster_whisper_server.routers.stt:transcribe_stream:Closing the connection.

The following command works normally, which indicates that my computer's configuration and the faster-whisper-server settings are correct:

ffmpeg.exe -i audio.wav -ac 1 -ar 16k -acodec pcm_s16le -f wav - | websocat --binary ws://localhost:8000/v1/audio/transcriptions
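One key difference between this command and the live case is timing: a file is piped as fast as the socket allows, while a microphone delivers audio in real time, with gaps. A hypothetical client-side pacer (names are my own, not from the project) that feeds PCM in real-time-sized chunks, which can be used to reproduce live-microphone timing from a file, looks like:

```python
import time
from typing import Iterator


def paced_chunks(pcm: bytes, bytes_per_second: int = 32_000,
                 chunk_seconds: float = 0.1) -> Iterator[bytes]:
    """Yield fixed-size PCM chunks at roughly real-time speed, like a live mic."""
    chunk_bytes = int(bytes_per_second * chunk_seconds)
    for i in range(0, len(pcm), chunk_bytes):
        time.sleep(chunk_seconds)  # pace the stream like live capture would
        yield pcm[i:i + chunk_bytes]


silence = b"\x00" * 9_600  # 0.3 s of 16 kHz mono s16le silence
chunks = list(paced_chunks(silence))
print(len(chunks))  # 3
```

Sending a wav file through a pacer like this should reproduce the disconnect even without a microphone, which makes the problem easier to debug on the server side.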

I set up a local Python development environment for faster-whisper-server and ran it from VS Code after changing the following values in config.py, which made Live Transcription work for a while:

max_no_data_seconds: float = 10.0
min_duration: float = 5.0
max_inactivity_seconds: float = 5
inactivity_window_seconds: float = 6.0

The min_duration setting was the most important, and without changing it, live transcription always failed.


After various tests, I came to the following conclusions:

  1. Transcription of audio files works normally in any method.

  2. Live Transcription of microphone input doesn't work properly because the server forcibly closes the websocket connection while validating the incoming audio data.

  3. Simply increasing some setting values doesn't make Live Transcription work properly.

  4. I tested with both CPU and GPU, and the issue was the same.

  5. I think running Live Transcription on a multi-core CPU may be feasible because the faster whisper-large-v3-turbo model is available.


Here are some of my opinions:

  1. Setting min_duration to about 5 seconds seems to improve both performance and recognition accuracy. Accumulating and recognizing audio in 1-second buffers is too short.

  2. For non-English languages like Korean, Japanese, and Chinese, recognizing too short audio buffers causes the accumulated recognition result text to keep changing, which is not good.

  3. You can simulate a native speaker for testing by playing any Korean or Japanese news video on a smartphone and placing the phone next to the microphone.

  4. When I got Live Transcription working for a while by changing the settings, I saw that the text recognized from the initial audio input kept accumulating for a long time (this is also visible in your demo mp4). I suspect that if you run Live Transcription on microphone input for more than 30 minutes, the returned result will become extremely long.

  5. It would be better if the recognition results were returned split into sentences as much as possible, as with the transcription results for audio files.
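As a client-side workaround for points 4 and 5, the ever-growing transcript could at least be split on terminal punctuation (including CJK full stops). A naive sketch of my own, not project code:

```python
import re


def split_sentences(text: str) -> list[str]:
    # Split after sentence-ending punctuation, covering Western (. ! ?)
    # and CJK full-width (。！？) marks; drop empty fragments.
    parts = re.split(r"(?<=[.!?。！？])\s*", text.strip())
    return [p for p in parts if p]


print(split_sentences("Hello there. 今日は良い天気です。How are you?"))
# → ['Hello there.', '今日は良い天気です。', 'How are you?']
```

A real fix would belong on the server side (segmenting by VAD or by Whisper's own segment boundaries), but even a splitter like this keeps the client's display from becoming one endless line.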

Best regards.


soloHeroo commented Nov 2, 2024

This issue still exists.


RAPHCVR commented Nov 10, 2024

Indeed


gdomod commented Nov 14, 2024

I would also test it with an 8-hour live stream; the response becomes too long after just a few minutes.
