Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify ffdec.py by using Popen.communicate() method #113

Open
Bomme opened this issue Nov 30, 2021 · 4 comments
Open

Simplify ffdec.py by using Popen.communicate() method #113

Bomme opened this issue Nov 30, 2021 · 4 comments

Comments

@Bomme
Copy link
Contributor

Bomme commented Nov 30, 2021

Hi, thanks for this very useful library!
I was looking into ffdec.py since I need faster loading of mp3 and m4a files.
I believe that the module could be improved and simplified by using the Popen.communicate() method. This seems to be the recommended way of retrieving output from a subprocess.

The current implementation only allows to read the data in blocks which is suboptimal since a user might not be able to adapt the block size. (E.g. librosa just calls audio_open() which has no way of setting a block size.)

I did a speed comparison that shows that this way of reading data is slower than it needs to be, especially for large files:
https://gist.github.com/Bomme/d9aee452c8c1e68fb5fac743df6b2a07

If you decide to drop Python 2 support (#112) the timeout handling might be easier.
And for later versions of Python 3 the https://docs.python.org/3/library/subprocess.html#windows-popen-helpers might come in handy.

@sampsyo
Copy link
Member

sampsyo commented Nov 30, 2021

Hi! It could be worth looking into! But there are a few reasons why ffdec works this way now:

  • Streaming. The interface lets clients process the audio while reading it in, rather than loading the entire file into memory first. This is important for handling large files without allocating too much memory.
  • As a kind of corollary, we need to read simultaneously from the stdout and stderr streams. Otherwise, the process could fill up its stderr buffer while we're waiting for data from the stdout buffer, and the interaction would deadlock.

I believe Popen.communicate() only works in a synchronous/blocking style that reads everything until EOF. So I'm not sure we can use that, but maybe there's something else useful in modern subprocess we could rely on?

@Bomme
Copy link
Contributor Author

Bomme commented Dec 1, 2021

Hi! Thanks for the explanation.
Since the queue is infinite, the entire file could get read in the separate thread before the client code gets to call read_data. That's why I assumed it would be a simplification to make this explicit, i.e. loading the file into memory.
Shouldn't the queue size be limited to support real streaming?

@Bomme
Copy link
Contributor Author

Bomme commented Dec 1, 2021

Since my main goal was to speed up the file loading and you plan to release a new version soon, what do you think about raising the block_size? Maybe by setting it to io.DEFAULT_BUFFER_SIZE as a (still) conservative size.

@sampsyo
Copy link
Member

sampsyo commented Dec 1, 2021

Ah, yeah, that's a good point. We probably should be limiting the queue size! It seems tricky to get exactly right, but probably worth a shot nonetheless.

Raising the block size seems like a great idea! Would you mind putting together a small PR for that?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants