On long pages it seems to get stuck #53
Comments
This issue is definitely on my radar. At the moment, the Ollama embeddings API can only handle 1 request at a time. It's a known issue and I believe it's being prioritized (see ollama/ollama#358). Some kind of progress indicator would be nice. Material UI has a Linear determinate progress bar component: https://mui.com/material-ui/react-progress/#linear-determinate. This should be pretty easy to implement (i.e. message passing between background script and main extension). Separately, I'm working on a small refactor to make things a little bit faster: #52
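For what it's worth, a rough sketch of what that could look like, assuming the embedding loop lives in the background script and the UI side is React + MUI (the function and message names here are made up for illustration):

```js
// background script — minimal sketch: embed chunks one by one and
// broadcast progress so the UI can render a determinate progress bar.
async function embedChunksWithProgress(chunks, embedder) {
  const vectors = [];
  for (let i = 0; i < chunks.length; i++) {
    vectors.push(await embedder.embedQuery(chunks[i]));
    chrome.runtime.sendMessage({
      type: "embedding-progress",
      completed: i + 1,
      total: chunks.length,
    });
  }
  return vectors;
}
```

```jsx
// popup — listen for the progress messages and feed them into MUI's
// determinate LinearProgress component.
import { useEffect, useState } from "react";
import LinearProgress from "@mui/material/LinearProgress";

function EmbeddingProgress() {
  const [percent, setPercent] = useState(0);
  useEffect(() => {
    const listener = (msg) => {
      if (msg.type === "embedding-progress") {
        setPercent((100 * msg.completed) / msg.total);
      }
    };
    chrome.runtime.onMessage.addListener(listener);
    return () => chrome.runtime.onMessage.removeListener(listener);
  }, []);
  return <LinearProgress variant="determinate" value={percent} />;
}
```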
I wonder if one could make a little Node.js prototype server for this that simply wraps/proxies to multiple llama.cpp instances using mmap mode. Would be fun to see it purring and get an inkling of future performance.
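For illustration, a minimal round-robin proxy along those lines could be as small as this (the ports and single-file layout are made up; it assumes each backend is its own ollama or llama.cpp server mmap-ing the same weights):

```js
// round-robin-proxy.js — sketch only, not the real ollama_proxy_server.
// Fans incoming requests out across several local backends.
const http = require("http");

// Hypothetical backend ports; each one would be a separate server instance.
const BACKENDS = [11435, 11436, 11437, 11438];
let next = 0;

http
  .createServer((clientReq, clientRes) => {
    const port = BACKENDS[next++ % BACKENDS.length];
    const proxyReq = http.request(
      {
        host: "127.0.0.1",
        port,
        path: clientReq.url,
        method: clientReq.method,
        headers: clientReq.headers,
      },
      (proxyRes) => {
        clientRes.writeHead(proxyRes.statusCode, proxyRes.headers);
        proxyRes.pipe(clientRes);
      }
    );
    proxyReq.on("error", (err) => {
      clientRes.writeHead(502);
      clientRes.end(`proxy error: ${err.message}`);
    });
    clientReq.pipe(proxyReq);
  })
  .listen(11434, () => console.log("round-robin proxy listening on :11434"));
```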
Looks like someone made an Ollama proxy server: I've never tinkered much with the Ollama memory settings (so job well done, Ollama team).
Seems it can use mmap. At least for the /api/generate there is a
Don't know if there's "anything on the table" regardless.
Apparently the embeddings don't use the entire weights, so maybe there's a way. I'm very fuzzy on how those are created. I patched the proxy server to allow CORS, but I'm not having much luck, and suddenly it's giving me crazy Yoda responses :) Maybe there's a way to start a pool of ollama instances used just for the embeddings, which never load the full model weights or allocate too many buffers (ofc mmap doesn't solve everything).
I think fromDocuments/OllamaEmbeddings runs serially anyway, so it may need some wrapper beyond the server proxy.
Hacked the Ollama LangChain code to do things in parallel and sadly I'm not really getting any speedup. Maybe it's due to the thread settings for the servers, or maybe it's already pretty efficient at using the available cores? I was getting around 250ms per embed serially, which went up to around 1.5s per embed in parallel, so even with several requests in flight the overall throughput barely improves. Maybe there is a tiny speedup, shrug, but seemingly not a low-hanging-fruit substantial one.
Probably worth investigating yourself.
@sublimator, thanks for testing out all of that stuff! For reference, how large is the content in your testing (i.e. how much text is on the page)? And about how many embeds before 250ms goes to 1.5s? In my testing, I didn't observe an increase in latency; it seems to be constant throughout the entire sequence of embeds. If there's a comparable Wikipedia article, we can both work off of that for testing.
I was kind of using random pages
Did you use parallel processing though? By default the OllamaEmbeddings class does the requests sequentially. Let me dig it back up.
I hacked the OllamaEmbeddings class (just the compiled code in node_modules):

```js
async _embed(strings) {
  console.log('hack is working!!');
  // Kick off every request without awaiting it, so the calls run
  // concurrently instead of one at a time.
  const embeddings = [];
  for (const prompt of strings) {
    const embedding = this.caller.call(() => this._request(prompt));
    embeddings.push(embedding);
  }
  // Wait for all of the in-flight requests to resolve.
  return await Promise.all(embeddings);
}

async embedDocuments(documents) {
  return this._embed(documents);
}
```
Maybe we should start a branch if we want to look at this seriously, but anyway, here are some more artifacts from my earlier investigation. The ini file I used:
Proxy server hacks (I might not have been using the right Python version, but all I wanted to change was the CORS stuff). And the script I used to launch everything:
```bash
#!/bin/bash
# Start the Default Server
OLLAMA_ORIGINS=* OLLAMA_HOST=0.0.0.0:11442 ollama serve &
# Start Secondary Servers
OLLAMA_ORIGINS=* OLLAMA_HOST=0.0.0.0:11435 ollama serve &
OLLAMA_ORIGINS=* OLLAMA_HOST=0.0.0.0:11436 ollama serve &
OLLAMA_ORIGINS=* OLLAMA_HOST=0.0.0.0:11437 ollama serve &
OLLAMA_ORIGINS=* OLLAMA_HOST=0.0.0.0:11438 ollama serve &
OLLAMA_ORIGINS=* OLLAMA_HOST=0.0.0.0:11439 ollama serve &
OLLAMA_ORIGINS=* OLLAMA_HOST=0.0.0.0:11440 ollama serve &
OLLAMA_ORIGINS=* OLLAMA_HOST=0.0.0.0:11441 ollama serve &
# Put the proxy in front of them on Ollama's default port
ollama_proxy_server --config config.ini --port 11434 -d
```
Enough that there were a lot of embedding requests anyway. This was one of the pages: But I suspect it's grown since I was testing.
I would try hacking a separate pool of servers just for the embeddings, with the proxy running on a non-default port. I'm still not sure WHEN the full model weights are loaded, but from the logging it could very well be lazily, when the generate API call is hit.
I feel like this change (or something similar) will be accepted in LangChainJS. See:
In any case, it didn't seem to help much in the big picture. Maybe you can tweak the threading settings for each ollama instance or something.
I mean for cloud, it definitely would help.
Have been pondering an IPFS global shared embeddings cache. It seems so wasteful for everyone to be doing this separately.
I mean, that might actually work somehow, because the shared embeddings would only contain public data, and then you just need your little local LLM for processing the private queries. I guess it would get complicated by needing to trust the content -> embeddings mapping for a given model, but I suppose they could be signed. Shrug. Want FAST responses, and to waste less energy $SOMEHOW.
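Very hand-wavy, but the lookup side of that idea might look something like this (the cache URL and key scheme here are entirely hypothetical, and signing/verification is left out):

```js
// Sketch of a content-addressed shared embeddings cache lookup.
const crypto = require("crypto");

// Key = hash of (model name + chunk text), so identical public content
// embedded with the same model maps to the same cache entry.
function cacheKey(model, text) {
  return crypto.createHash("sha256").update(`${model}\n${text}`).digest("hex");
}

async function getEmbedding(model, text, embedLocally) {
  const key = cacheKey(model, text);
  try {
    // Hypothetical shared cache endpoint (e.g. an IPFS gateway lookup).
    const res = await fetch(`https://shared-embeddings.example/${model}/${key}`);
    if (res.ok) {
      return await res.json(); // someone already computed (and ideally signed) this one
    }
  } catch {
    // Cache unreachable: fall through to local computation.
  }
  // Cache miss: compute locally, and optionally publish/pin it for others.
  return embedLocally(text);
}
```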
Assume you saw this:
The latest version of Ollama seems to be much more performant when loading/unloading models. Having separate inference and embedding models will feel smoother. I'll close this issue after a couple more updates are implemented (e.g. batch embedding creation).
So much nicer after Nomic anyway :)
|
I'm not sure if this is the same issue, but I've just tried it for the first time and I'm just getting the "generating embeddings..." text seemingly indefinitely. It does progress, but it might as well be indefinite. Edit: I've tried a few small models and various pages, with no impact. I'm running a 3090 and don't have issues elsewhere.
On long pages it seems to halt (e.g. https://news.ycombinator.com/item?id=39190468)
Maybe fixed in new versions
Might be nice to have some indication of the amount of work it's doing
Progress bar or something
I mean you know how many chunks it needs to embed, right?
I don't know the feasibility, but I'm wondering if you can do the embedding in parallel somehow?
I suppose with an mmap'd model shared by multiple processes it could be?
But that's more of an ollama question perhaps?
Thanks