Run Llama3 inference on API server #377

Open
nwatab opened this issue Jan 21, 2025 · 0 comments

nwatab commented Jan 21, 2025

Does anyone know how to run an HTTP server that serves Llama inference? I searched but found no helpful resources on integrating it with an application/WSGI server (e.g., Flask, gunicorn). The Llama3 tutorial uses torchrun, but what it does under the hood seems a bit complicated.

Edited:
An LLM server's throughput is bound by the number of GPUs, so we might not need a WSGI server in most cases.
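For concreteness, here is roughly the kind of setup I mean — a minimal sketch using Flask and the Hugging Face transformers pipeline rather than the torchrun-based reference code (the model name and the /generate endpoint are just illustrative):

```python
# Minimal sketch: one process loads the weights once and serves HTTP requests.
# Assumes transformers + accelerate are installed; model name is illustrative.
from flask import Flask, jsonify, request
from transformers import pipeline

app = Flask(__name__)

# Load the model once at startup so every request reuses the same weights.
generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    device_map="auto",  # place weights on available GPU(s)
)

@app.route("/generate", methods=["POST"])
def generate():
    payload = request.get_json()
    result = generator(
        payload["prompt"],
        max_new_tokens=payload.get("max_new_tokens", 128),
    )
    return jsonify({"completion": result[0]["generated_text"]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```

Run with a single worker (e.g. `gunicorn --workers 1 app:app`) so the weights are loaded only once; since scaling is GPU-bound as noted above, extra WSGI workers would mostly just duplicate model memory.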
