Does anyone know how to run an HTTP server that performs Llama inference? I searched but ended up finding no helpful resources about integrating with an application/WSGI server (e.g. Flask, gunicorn). The Llama3 tutorial uses `torchrun`, but what it does under the hood seems a bit complicated.
Edited:
The scale of an LLM server is limited by the number of GPUs, so we might not need a WSGI server in most cases.
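
For reference, below is roughly the shape of what I'm trying to build: an untested sketch, assuming the `llama` package from this repo is importable, that the checkpoint/tokenizer paths are placeholders to be replaced, and that the script is launched with `torchrun --nproc_per_node 1 server.py` so the distributed environment `torchrun` normally sets up is initialized.

```python
# Hedged sketch: wrap the Llama generator in a small Flask app.
# Paths and hyperparameters below are placeholder assumptions.
from flask import Flask, jsonify, request
from llama import Llama

CKPT_DIR = "Meta-Llama-3-8B-Instruct/"                        # assumed path
TOKENIZER_PATH = "Meta-Llama-3-8B-Instruct/tokenizer.model"   # assumed path

# Build the generator once at startup; Llama.build relies on the
# rank/world-size environment variables that torchrun provides.
generator = Llama.build(
    ckpt_dir=CKPT_DIR,
    tokenizer_path=TOKENIZER_PATH,
    max_seq_len=512,
    max_batch_size=4,
)

app = Flask(__name__)

@app.route("/generate", methods=["POST"])
def generate():
    prompt = request.json["prompt"]
    results = generator.text_completion(
        [prompt],
        max_gen_len=128,
        temperature=0.6,
        top_p=0.9,
    )
    return jsonify({"completion": results[0]["generation"]})

if __name__ == "__main__":
    # Single-threaded on purpose: avoids concurrent requests hitting
    # the one GPU-bound generator at the same time.
    app.run(host="0.0.0.0", port=8000, threaded=False)
```

I'm unsure whether this is the right pattern when the model is sharded across multiple GPUs (multiple `torchrun` ranks), since only rank 0 should presumably bind the HTTP port, which is part of what I'm asking about.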