Proposal: Inference-perf loadgen component to be based on Grafana k6 load testing tool #2
Comments
Examples from the industry: Hugging Face TGI uses k6 for its benchmarking results.
I like the idea of using a well-tested loadgen. But we need to make sure that the core benchmarking library is Python-based and can be used as such if needed. I'm not sure if we can instrument the k6 loadgen via Python, but I would be interested in learning more and discussing the options we have.
Yes, with this proposal the benchmarking library can still be Python-based. There are many reasons to prefer Python for this project (data manipulation, tokenization, reporting, etc.), and k6 can merely bring an underlying set of utilities aimed at simplifying load design and request processing. Such a model would help us leverage the best of both worlds. In many load generation cases, a single node cannot generate or sustain production-grade loads, especially long-context loads with LLMs, and in such cases distributed testing becomes a necessity. We have also seen from the initial project proposal that distributed testing on Kubernetes would be a key differentiating factor; many existing LLM perf tools are lacking in this specific area. A huge benefit of using k6 here is the distributed testing that we get out of the box with minimal lift. There are also extensions for scripting in "Python" if needed. But the key is to leverage the right set of tools.
Created an example PR #8 that showcases how k6 can be leveraged for load generation and runner capabilities while still using Python. The PR also aims to show how benchmarking setup choices like HTTP vs. gRPC and local vs. distributed runs can be easily configured; a rough sketch of that wrapper idea is below.
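A minimal sketch of what such a Python wrapper around k6 could look like. The names (`BenchmarkConfig`, `run_benchmark`, `loadgen.js`) are illustrative assumptions, not the actual contents of PR #8; only the `k6 run` invocation and its `--vus`, `--duration` and `--env` flags are standard k6 usage.

```python
# Hypothetical sketch: a thin Python runner that shells out to k6.
import subprocess
from dataclasses import dataclass


@dataclass
class BenchmarkConfig:
    protocol: str = "http"      # "http" or "grpc"
    distributed: bool = False   # local `k6 run` vs. Kubernetes-based execution
    duration: str = "5m"
    vus: int = 10
    script: str = "loadgen.js"  # k6 test script that issues the model-server requests


def run_benchmark(cfg: BenchmarkConfig) -> None:
    if cfg.distributed:
        # A distributed run would typically be handed off to k6 on Kubernetes
        # rather than executed locally; out of scope for this sketch.
        raise NotImplementedError("distributed execution not shown here")
    cmd = [
        "k6", "run",
        "--vus", str(cfg.vus),
        "--duration", cfg.duration,
        "--env", f"PROTOCOL={cfg.protocol}",
        cfg.script,
    ]
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_benchmark(BenchmarkConfig(protocol="http", vus=20))
```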
I reviewed the proposal and the k6 tool, and found that Python support in k6 can be implemented using the xk6-python extension, which offers Python-like syntax and integrates seamlessly with k6. However, it is important to note that it uses Starlark, a Python dialect rather than full Python, which deviates from our goal of delivering a proper Python library for benchmarking. The lack of a standard Python module system and the inability to use pip-installable packages may limit flexibility and scalability for more complex benchmarking requirements.
Please refer to the linked PR for the implementation thought process here. That pull request shows how the best Python-based packages can be utilised while we build on top of the groundwork laid down by k6. This proposal does not require the xk6-python extension; that would be optional, not mandatory. Performance testing is not a new problem, and we don't need to reinvent the wheel here.
Also of note: the xk6-python package does not seem to be actively maintained. From the top of the repo:
These are important considerations, and I certainly agree that this is not the only way to build a request processor; the proposal is simply one of the easier paths, utilizing an already mature technology at this stage. There may be other requirements and certainly other ways to extend this. My proposal (#2) is to utilize the best available tools for the load generation and request processing parts. With that in mind, design-wise there could be a base class for request processing that we extend to build various request processors: k6 (distributed) could be one implementation, a Python-based processor (possibly Locust) could be another, and the inference-perf project could choose the request processor as a configuration choice at runtime (a sketch of this shape follows below). Such an extensible design will surely bring more collaborators together to build this tool in ways that suit all requirements. (cc @achandrasekar, @terrytangyuan)
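One possible shape for that pluggable design, purely as an illustration; the class and method names (`RequestProcessor`, `build_processor`, etc.) are hypothetical, not part of the project today.

```python
# Illustrative only: a base class plus interchangeable request-processor backends.
from abc import ABC, abstractmethod


class RequestProcessor(ABC):
    """Common interface; concrete processors decide how load is actually driven."""

    @abstractmethod
    def run(self, config: dict) -> None:
        ...


class K6RequestProcessor(RequestProcessor):
    def run(self, config: dict) -> None:
        # Would shell out to `k6 run` (or hand off to k6 on Kubernetes for distributed runs).
        print(f"running k6 with {config}")


class LocustRequestProcessor(RequestProcessor):
    def run(self, config: dict) -> None:
        # Would drive a Locust-based, pure-Python load generator instead.
        print(f"running locust with {config}")


PROCESSORS = {
    "k6": K6RequestProcessor,
    "locust": LocustRequestProcessor,
}


def build_processor(name: str) -> RequestProcessor:
    # The processor is chosen from the benchmark configuration at runtime.
    return PROCESSORS[name]()
```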
Thanks for sending a PR showcasing how this would work! I think the concerns there are valid, as are the advantages of using a distributed load generator like k6. Two of the main goals we started the tool with:
We also have a non-goal: we are not building a generic benchmarking tool around web load testers like k6 or Locust. So, just using k6 or Locust as a black-box tool to send requests is not something we are looking to do. With those in mind, I'd say we keep the core implementation in Python with a loadgen that can send a specific QPS in a Poisson distribution (a quick sketch of such a loadgen is below). But we can have additional extensions like k6 or Locust which can help orchestrate and supplement large-scale / distributed testing, calling into some or all of the Python library as needed. It would be good to explore how that model would look. I think that aligns with the extensible design you have mentioned. Let's try to figure out the details there.
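A back-of-the-envelope sketch of the Poisson-arrival loadgen mentioned above, assuming an async Python core; the function names are placeholders, not the project's actual API.

```python
# Rough sketch: Poisson-distributed request arrivals at an average rate of `qps`.
import asyncio
import random


async def generate_load(send_request, qps: float, duration_s: float) -> None:
    """Fire requests whose inter-arrival times are exponentially distributed,
    which yields Poisson-distributed arrivals at an average rate of `qps`."""
    elapsed = 0.0
    while elapsed < duration_s:
        gap = random.expovariate(qps)      # mean gap = 1 / qps seconds
        await asyncio.sleep(gap)
        elapsed += gap
        asyncio.create_task(send_request())  # fire and forget; don't block on the response
```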
The inference-perf proposal doc describes many vital components for its functioning. This document recommends building some of that capability on top of an existing, mature load-gen tool: k6. Given the current requirements and constraints, a k6-based wrapper design can be hugely beneficial for quickly building and providing the following capabilities from the initial proposal.
Load Generator
The Load Generator is the component that generates different traffic patterns based on user input. k6 can generate fixed or custom load patterns for a defined duration, as deemed necessary for the requirement (see the sketch below).
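A minimal sketch of how the Python side might describe a ramping load pattern and hand it to k6 as an options file. The field names follow k6's `stages` option; the wrapper function itself is hypothetical, and it assumes the options file is passed to k6 via its `--config` flag.

```python
# Hypothetical helper that emits a k6 options file describing a load pattern.
import json


def write_k6_options(path: str, ramp_up: str, steady: str, target_vus: int) -> None:
    options = {
        "stages": [
            {"duration": ramp_up, "target": target_vus},  # ramp up
            {"duration": steady, "target": target_vus},   # hold steady load
            {"duration": "30s", "target": 0},             # ramp down
        ]
    }
    with open(path, "w") as f:
        json.dump(options, f, indent=2)


# The resulting file could then be passed to k6,
# e.g. `k6 run --config options.json loadgen.js`.
write_k6_options("options.json", ramp_up="1m", steady="10m", target_vus=50)
```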
Request Processor
The Request Processor provides a way to support different model servers and their corresponding request payloads with different configurable parameters. k6 supports HTTP- and gRPC-based requests for both direct and distributed testing; an illustrative payload builder follows.
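A hypothetical illustration of the request-processor idea: the same logical request is rendered into different payload shapes for different model servers. The payload formats shown are assumptions for illustration, not a spec.

```python
# Illustrative only: per-server payload construction.
def build_payload(server: str, prompt: str, max_tokens: int) -> dict:
    if server == "openai-compatible":
        # Many model servers expose an OpenAI-style completions endpoint.
        return {"prompt": prompt, "max_tokens": max_tokens}
    if server == "tgi":
        # Hugging Face TGI uses a different request body.
        return {"inputs": prompt, "parameters": {"max_new_tokens": max_tokens}}
    raise ValueError(f"unknown server type: {server}")
```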
Response Processor / Data Collector
The Response Processor / Data Collector component allows us to process the responses and measure the actual performance of the model server in terms of request latency, TPOT, TTFT and throughput. k6 scripting can be leveraged for advanced data/metrics computation (a rough sketch follows).
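A back-of-the-envelope sketch (not the project's actual code) of how TTFT, TPOT, latency and throughput could be derived on the Python side from per-request timestamps collected during a streaming response.

```python
# Hypothetical per-request metric computation from streaming timestamps.
def compute_metrics(start: float, first_token: float, end: float, output_tokens: int) -> dict:
    ttft = first_token - start                      # time to first token
    decode_time = end - first_token                 # time spent producing the remaining tokens
    tpot = decode_time / max(output_tokens - 1, 1)  # time per output token
    return {
        "ttft_s": ttft,
        "tpot_s": tpot,
        "e2e_latency_s": end - start,
        "throughput_tok_per_s": output_tokens / (end - start),
    }
```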
Report Generator / Metrics Exporter
The Report Generator / Metrics Exporter generates a report based on the data collected during benchmarking. It can also export the metrics collected during benchmarking to Prometheus, which can then be consumed by other monitoring or visualization solutions. k6 supports real-time metrics streaming to services like Prometheus, New Relic, etc.
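A small, hypothetical example of how aggregated benchmark metrics could also be exposed from the Python side using the `prometheus_client` package, so they can be scraped alongside (or instead of) k6's own Prometheus output; the metric names and the `report` keys are assumptions.

```python
# Illustrative only: expose aggregated results on /metrics for Prometheus to scrape.
from prometheus_client import Gauge, start_http_server

TTFT_P50 = Gauge("inference_perf_ttft_p50_seconds", "Median time to first token")
THROUGHPUT = Gauge("inference_perf_throughput_tokens_per_second", "Output token throughput")


def export_report(report: dict, port: int = 8000) -> None:
    start_http_server(port)           # start the HTTP endpoint serving /metrics
    TTFT_P50.set(report["ttft_p50"])
    THROUGHPUT.set(report["throughput"])
```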
Key Benefits
Key advantages of building on top of k6: