Proposal: Inference-perf loadgen component to be based on Grafana k6 load testing tool #2

Open
SachinVarghese opened this issue Jan 20, 2025 · 9 comments

@SachinVarghese
Contributor

The inference-perf proposal doc describes many vital components. This issue recommends building some of that capability on top of k6, an existing, mature load-generation tool. Given the current requirements and constraints, a k6-based wrapper design can be hugely beneficial for quickly building and providing the following capabilities from the initial proposal.

Load Generator
The Load Generator is the component that generates different traffic patterns based on user input. k6 can generate a fixed or custom load pattern for a defined duration, as the requirements demand.
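
As a minimal sketch (not from the proposal or PR), a thin Python wrapper could shape a fixed load through k6's standard `--vus` and `--duration` CLI flags; richer patterns such as ramps or arrival rates would instead be declared in the script's `options.scenarios`. The script name here is an illustrative assumption.

```python
import subprocess

# Drive a hypothetical k6 script with 50 virtual users for 5 minutes.
# --vus and --duration are standard k6 CLI flags.
subprocess.run(
    ["k6", "run", "--vus", "50", "--duration", "5m", "loadtest.js"],
    check=True,
)
```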

Request Processor
The Request Processor provides a way to support different model servers and their corresponding request payloads with different configurable parameters. k6 supports HTTP- and gRPC-based requests for both direct and distributed testing.
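
On the Python side, payload construction stays pluggable per model server; a hypothetical builder for an OpenAI-compatible completions request (function and field choices are assumptions for illustration, not from the proposal):

```python
def build_payload(model: str, prompt: str, max_tokens: int) -> dict:
    """Build a model-server-specific request body (OpenAI-style here)."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "stream": True,  # streaming is needed for TTFT/TPOT measurement
    }
```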

Response Processor / Data Collector
The Response Processor / Data Collector component allows us to process the response and measure the actual performance of the model server in terms of request latency, TPOT (time per output token), TTFT (time to first token), and throughput. k6 scripting can be leveraged for advanced data/metrics computation.
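
To make those metrics concrete, here is an illustrative Python helper (not k6 code; the function and field names are assumptions) that computes them from the per-token arrival timestamps of one streaming response:

```python
def latency_metrics(request_start: float, token_times: list[float]) -> dict:
    """Compute TTFT, TPOT, and throughput for one streamed response.

    token_times holds the arrival timestamp of each output token;
    at least one token is assumed to have arrived.
    """
    ttft = token_times[0] - request_start
    # Mean gap between consecutive tokens after the first one.
    tpot = ((token_times[-1] - token_times[0]) / (len(token_times) - 1)
            if len(token_times) > 1 else 0.0)
    total = token_times[-1] - request_start
    return {
        "ttft_s": ttft,
        "tpot_s": tpot,
        "throughput_tokens_per_s": len(token_times) / total,
    }
```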

Report Generator / Metrics Exporter
The Report Generator / Metrics Exporter generates a report based on the data collected during benchmarking. It can also export the metrics collected during benchmarking to Prometheus, where they can be consumed by other monitoring or visualization solutions. k6 supports real-time metrics streaming to services such as Prometheus and New Relic.
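
As a sketch of the batch-report path (reusing the hypothetical script from above), k6's `--summary-export` flag writes the end-of-test summary as JSON, which a Python report generator can then consume; real-time streaming would instead use one of k6's metric outputs:

```python
import json
import subprocess

# Run the test and dump the end-of-test summary to a JSON file.
subprocess.run(
    ["k6", "run", "--summary-export=summary.json", "loadtest.js"],
    check=True,
)

# Aggregate metrics (http_req_duration, etc.) live under "metrics".
with open("summary.json") as f:
    summary = json.load(f)
print(sorted(summary["metrics"]))
```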

Key Benefits

Key advantages of building on top of k6:

  • Existing, mature OSS ecosystem
  • Support for custom load generation patterns
  • Support for HTTP and gRPC request processing
  • Built-in Kubernetes-based distributed testing and an associated k8s operator
  • Real-time metrics collection and export to a variety of data stores
  • Many built-in memory optimizations (such as the ability to discard response bodies)
@SachinVarghese
Contributor Author

An example from the industry: Hugging Face TGI uses k6 for its benchmarking results.

@achandrasekar
Contributor

Like the idea of using a well-tested loadgen. But we need to make sure that the core benchmarking library is Python-based and can be used as such if needed. I'm not sure if we can instrument the k6 loadgen via Python, but I would be interested in learning more and discussing the options we have.

@SachinVarghese
Contributor Author

Yes, with this proposal the benchmarking library can be Python-based. There are many reasons to prefer Python for this project (data manipulation, tokenization, reporting, etc.), and k6 can merely bring an underlying set of utilities aimed at load design and request processing. Such a model would help us leverage the best of both worlds.

In many load generation cases, a single node cannot produce or sustain production-grade loads, especially long-context loads with LLMs, and in such cases distributed testing becomes a necessity. Further, the initial project proposal also identifies distributed testing on Kubernetes as a key differentiating factor, and many existing LLM perf tools are lacking in this specific area. A huge benefit of using k6 here is the distributed testing we get out of the box with minimal lift. There are also extensions for scripting in "Python" if needed, but the key is to leverage the right set of tools.
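
To illustrate that out-of-the-box distributed path, a Python harness could submit a run to the k6-operator. This is a sketch under stated assumptions: it assumes the operator is installed in the cluster, and the resource name, ConfigMap, and parallelism are illustrative, not from the proposal or PR.

```python
import subprocess

# TestRun is the k6-operator custom resource that fans a test out
# across multiple runner pods.
TESTRUN_MANIFEST = """\
apiVersion: k6.io/v1alpha1
kind: TestRun
metadata:
  name: inference-perf-run
spec:
  parallelism: 4            # split the load across 4 k6 runner pods
  script:
    configMap:
      name: inference-test  # ConfigMap holding the k6 script
      file: test.js
"""

# Apply the resource; the operator schedules and coordinates the runners.
subprocess.run(["kubectl", "apply", "-f", "-"],
               input=TESTRUN_MANIFEST, text=True, check=True)
```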

@SachinVarghese
Contributor Author

Created an example PR #8 that showcases how k6 can be leveraged for load generation and runner capabilities while using Python. The PR also showcases the ability to easily configure the benchmarking setup, e.g. HTTP vs. gRPC, local run vs. distributed run.

@vivekk16

I reviewed the proposal and the k6 tool, and found that Python support in k6 can be implemented using the xk6-python extension, which offers Python-like syntax and integrates seamlessly with k6. However, it is important to note that using Starlark, a Python dialect rather than full Python, deviates from our goal of delivering a proper Python library for benchmarking. The lack of a standard Python module system and the inability to use pip-installable packages may limit flexibility and scalability for handling more complex benchmarking requirements.

@SachinVarghese
Contributor Author

Please refer to the linked PR for the implementation thought process here. The pull request shows how the best Python-based packages can be utilized while we build on top of the groundwork laid down by k6. This proposal does not require the xk6-python extension; that would be optional, not mandatory. Performance testing is not a new problem, and we don't need to reinvent the wheel here.

@sjmonson
Contributor

Of note: the xk6-python package does not seem to be actively maintained. From the top of the repo:

xk6-python is not an official k6 extension but was a Grafana Hackathon #10 project! Active development is not planned in the near future. The purpose of making it public is to assess demand (number of stars on the repo).

@SachinVarghese
Contributor Author

Repeating my comment at #8

The core idea for k6 as a request processor is based on its efficiency and distributed-testing capability. With k6 scripts it is possible to create templates for more complex loads, i.e. varying requests, token limits, etc., and to implement token-based metrics capture in such templates. k6 also comes with HTTP and gRPC request-processing capability out of the box.

While these are important considerations, I certainly agree that this is not the only way to build a request processor; the proposal is for one of the easier paths at this stage, utilizing an already mature technology. There may be other requirements, and certainly other ways to extend this. My proposal in #2 is to utilize the best tools for the load generation and request processing parts.

Now, with that in mind, design-wise there could be a base class for request processing that we extend to build various request processors: k6 (distributed) could be one implementation, a Python-based processor (possibly Locust) another, and the inference-perf project could choose the request processor as a configuration choice at runtime (see the sketch below). Such an extensible design would surely bring more collaborators together to build this tool in ways that suit all requirements. (cc @achandrasekar, @terrytangyuan)
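
A rough sketch of that extensible design, with all class and method names as hypothetical placeholders rather than anything from the inference-perf codebase:

```python
from abc import ABC, abstractmethod

class RequestProcessor(ABC):
    """Base class; concrete processors drive the actual load."""

    @abstractmethod
    def run(self, config: dict) -> dict:
        """Execute the benchmark described by config, return raw results."""

class K6RequestProcessor(RequestProcessor):
    def run(self, config: dict) -> dict:
        # Shell out to k6, possibly distributed via the k6-operator.
        raise NotImplementedError

class LocustRequestProcessor(RequestProcessor):
    def run(self, config: dict) -> dict:
        # Pure-Python alternative built on Locust.
        raise NotImplementedError

def get_processor(name: str) -> RequestProcessor:
    # Runtime configuration choice, as suggested above.
    return {"k6": K6RequestProcessor, "locust": LocustRequestProcessor}[name]()
```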

@achandrasekar
Contributor

Thanks for sending a PR showcasing how this would work! I think the concerns there are valid, as are the advantages of using a distributed load generator like k6.

Two of the main goals we started the tool with:

  1. A benchmark-as-code library in Python that can be consumed in part or in full, as needed, by other benchmarking / serving frameworks. We can technically make this work by calling k6 from Python, but as @sjmonson pointed out, we would need to write the majority of the API-calling, tokenization, and other code in JS, which is not ideal or even feasible in some cases. It also makes the individual pieces (model server APIs, streaming request processing, tokenization) harder to consume elsewhere as a library.
  2. Sending different traffic patterns to make it easier to test with autoscaling / load balancing. k6 will help achieve this, but there are alternatives like Locust as well.

We also have an explicit non-goal: building a generic benchmarking tool around web load testers like k6 or Locust. So just using k6 or Locust as a black-box tool to send requests is not something we are looking to do.

So, with those in mind, I'd say we keep the core implementation in Python, with a loadgen that can send a specific QPS following a Poisson distribution. But we can have additional extensions like k6 or Locust to help orchestrate and supplement large-scale / distributed testing, calling into some or all of the Python library as needed. It would be good to explore how that model would look; I think it aligns with the extensible design you have mentioned. Let's try and figure out the details there.
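
A minimal sketch of such a Poisson loadgen in pure Python (the send_request callable and the synchronous dispatch are illustrative assumptions; a real loadgen would fire requests asynchronously):

```python
import random
import time
from typing import Callable

def poisson_loadgen(send_request: Callable[[], None],
                    qps: float, duration_s: float) -> None:
    """Fire requests as a Poisson process with mean rate `qps`."""
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        # Inter-arrival times of a Poisson process are exponentially
        # distributed with mean 1/qps.
        time.sleep(random.expovariate(qps))
        send_request()
```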
