
v0.1 API Review #154

Open · wants to merge 4 commits into v0.0
Conversation

@kfswain (Collaborator) commented Jan 6, 2025:

This PR is not intended to be merged, merely a point of reference for review.

Slides: https://docs.google.com/presentation/d/1gtOJS1YA0Ax8KvsGPrHiyZR2dBoWnlLd9aACyojmk68/edit#slide=id.p

@k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) on Jan 6, 2025
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kfswain
Once this PR has been reviewed and has the lgtm label, please assign nikhita for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) on Jan 6, 2025
@kfswain (Collaborator, Author) commented Jan 6, 2025:

/hold

@k8s-ci-robot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Jan 6, 2025
@mrbobbytables removed their request for review on January 6, 2025 21:13
api/inferencemodel_types.go — two outdated review threads (resolved)
Comment on lines +58 to +61
// The number must be in the range 1 to 65535.
//
// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:Maximum=65535
@candita commented:
If applicable, I suggest reducing this port range to a smaller list of well-known ports that users can rely on for firewall configuration purposes. Also, don't allow overlap with other well-known ports like those used for DNS, HTTP(S), etc.

@kfswain (Collaborator, Author) replied:

Holding off on changing this one, just to gather consensus on what range we should limit it to. But I do agree with the idea.

It's possible that we could start with a small range and relax it as needed, as the other direction would be nigh impossible.

A Member replied:

@candita this is meant to be a reference to a port number on a Pod. I can't think of any reasonable way to limit that, since Kubernetes has likely scaled far enough that there's probably at least one case of each individual port being in use across the many clusters that exist.
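For reference, a minimal sketch of how the quoted validation markers attach to a port field, assuming a field named TargetPortNumber on the pool spec (the surrounding type and field name are not shown in this thread):

```go
type InferencePoolSpec struct {
	// TargetPortNumber is the port the model server pods listen on.
	// The number must be in the range 1 to 65535.
	//
	// +kubebuilder:validation:Minimum=1
	// +kubebuilder:validation:Maximum=65535
	TargetPortNumber int32 `json:"targetPortNumber"`
}
```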

@robscott (Member) commented Jan 9, 2025:

Thanks to everyone for the help reviewing this! Any/all comments are very appreciated.

We also went over this at a high level in a meeting with SIG-Net TLs (and others), walking through some slides as a reference point. Since this is starting as an x-k8s.io API, formal API review is optional, but we're trying to get as much feedback as we can early in the process. Barring any blocking feedback, we're hoping to release v0.1 next week.

Copying SIG-Net TLs and chairs if any have time for additional review cycles.

/cc @aojea @danwinship @MikeZappa87 @shaneutt @thockin

api/inferencepool_types.go — outdated review thread (resolved)
@k8s-ci-robot (Contributor) commented:

@kfswain: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-gateway-api-inference-extension-test-unit-main
Commit: 86b852d
Required: true
Rerun command: /test pull-gateway-api-inference-extension-test-unit-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// InferenceModel is the Schema for the InferenceModels API.
@shaneutt (Member) commented Jan 10, 2025:

InferenceModel appears to be topical, and of particular importance, so it may be one of the first things a newcomer reads when learning these APIs. It may be beneficial to expand the documentation here to explain the "what" and "why" of it more thoroughly.

@kfswain (Collaborator, Author) replied:

Is it a cop-out to reference our site, or the doc proposal that goes into a bit more detail?

I could see a brief blurb being valuable, with a more detailed explanation offloaded.

@kfswain (Collaborator, Author) replied Jan 16, 2025:

We also go into more detail in the spec; maybe it could be as simple as 'a more detailed description is affixed to the InferenceModelSpec field below'?

@shaneutt (Member) replied Jan 17, 2025:

A link may be sufficient. As far as the spec section goes, I was thinking something higher level, but maybe that's OK as well.

I would like to see something more in the documentation here. I trust your judgement in what that is; just please consider what a newcomer might be looking for when they come here, and try to accommodate that. Otherwise, please feel free to consider this comment resolved at your discretion.

@kfswain (Collaborator, Author) replied:

Took a stab! LMKWYT
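For illustration, one shape the expanded documentation could take, sketched against the quoted type; the wording and the Status field here are assumptions, not the merged text:

```go
// InferenceModel is the Schema for the InferenceModels API.
//
// An InferenceModel represents a specific model use case (for example, a
// base model or a LoRA adapter) served from an InferencePool, and is
// managed by the Inference Workload Owner. A more detailed description
// is affixed to the InferenceModelSpec below.
type InferenceModel struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   InferenceModelSpec   `json:"spec,omitempty"`
	Status InferenceModelStatus `json:"status,omitempty"`
}
```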

api/inferencemodel_types.go — review thread (resolved)
Items []InferenceModel `json:"items"`
}

// InferenceModelSpec represents the desired state of a specific model use case. This resource is
A Member commented:

The "model use case" language wasn't immediately clear to me here. Is it fair to just say:

Suggested change:
- // InferenceModelSpec represents the desired state of a specific model use case. This resource is
+ // InferenceModelSpec represents the desired state of an InferenceModel. This resource is

Or are we trying to make some additional distinction?

@kfswain (Collaborator, Author) replied:

I was just trying not to use a circular definition, and to clarify what an InferenceModel is intended to represent. Open to changing it.

A Member replied:

Small thing, please consider resolved at your discretion.

@kfswain (Collaborator, Author) replied:

Adding more high-level documentation to the InferenceModel object allowed me to make the linkage between a model use case and an InferenceModel above. Rewording.

// managed by the "Inference Workload Owner" persona.
//
// The Inference Workload Owner persona is someone that trains, verifies, and
// leverages a large language model from a model frontend, drives the lifecycle
A Member commented:

I think we can assume anyone who's made it here understands "inference", "training", "models", etc., but might it be worth explaining further, or enumerating some examples of, a "model frontend" if we're going to mention that here?

@kfswain (Collaborator, Author) replied:

Tried rewording here.

api/inferencemodel_types.go — two review threads (resolved)
Comment on lines +74 to +78
// Criticality defines how important it is to serve the model compared to other models referencing the same pool.
// The lack of defaulting is intentional, the behavior of not setting criticality future-proofs the API without complicating.
//
// +optional
Criticality *Criticality `json:"criticality,omitempty"`
@shaneutt (Member) commented Jan 10, 2025:

I've been seeing and hearing a lot of discussion about Criticality in terms of linguistics and its place in this API. I can see coordinating the criticality of multiple models becoming a bit confusing as you add more and more, particularly if you're trying to deploy a new model and you have to look at what's out there and make decisions about your new model. For instance, will it be common to get into a weird shape where you have a new model whose criticality really needs to be higher than that of anything that came before it? Then, as part of deploying it, is there an impetus to update (perhaps downgrade) the criticality of a bunch of other models? Would this serve to complicate the job of the "Inference Workload Owner"?

This doesn't mean I'm strictly against it or anything, mind you; I'm just trying to think through how this plays out in the real world. It might help me personally (since I'm very new to this project) to see some of the motivation and user stories that influenced criticality, if someone can spare a link.

@kfswain (Collaborator, Author) replied:

There is some more recent discussion here: https://kubernetes.slack.com/archives/C071WA7R9LY/p1736906518995639

But Criticality has been a hot topic; agreed that it might not be in its final state. Since criticality is used in a load-balancing aspect, we are trying to limit the options so that we have something we can guarantee to support out of the box. We expect iteration in the future as we (hopefully) increase usage.

Would opening an issue about criticality to centralize conversation be acceptable?

A Member replied:

Yes, a TODO issue just to make sure the conversation remains topical and continues is a reasonable deferral at this stage so that we can keep velocity up and test it out in its current state and see what that teaches us. 👍
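For context while that issue is open, a hedged sketch of what a bounded criticality enum could look like under the kubebuilder conventions quoted elsewhere in this PR; the value names are illustrative assumptions, not the merged API:

```go
// Criticality defines how important it is to serve the model compared
// to other models referencing the same pool. Value names below are
// assumptions for illustration.
// +kubebuilder:validation:Enum=Critical;Default;Sheddable
type Criticality string

const (
	// Critical workloads should be shed last under load.
	Critical Criticality = "Critical"
	// Default is the middle tier.
	Default Criticality = "Default"
	// Sheddable workloads may be dropped first under load.
	Sheddable Criticality = "Sheddable"
)
```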


// TargetModels allow multiple versions of a model for traffic splitting.
// If not specified, the target model name is defaulted to the modelName parameter.
// modelName is often in reference to a LoRA adapter.
A Member commented:

It might be worth expanding on this piece in particular in this documentation to help future code readers.

@kfswain (Collaborator, Author) replied:

Done, added examples to more clearly explain the variability.
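As a sketch of the defaulting behavior the quoted doc comment describes (the helper name effectiveTargetModels and the field shapes are hypothetical):

```go
// effectiveTargetModels returns the targets traffic should split across:
// the explicit targetModels when set, otherwise a single target defaulted
// to the spec's modelName, as the doc comment above describes.
func effectiveTargetModels(spec InferenceModelSpec) []TargetModel {
	if len(spec.TargetModels) > 0 {
		return spec.TargetModels
	}
	return []TargetModel{{Name: spec.ModelName, Weight: 1}}
}
```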

api/inferencemodel_types.go — review thread (resolved)
}

// InferencePoolList contains a list of InferencePool.
//
A Member commented:

Similar to the above, it might be nice to provide more documentation about what (and why) InferencePool is.

// Weight is used to determine the proportion of traffic that should be
// sent to this model when multiple target models are specified.
//
// Weight defines the proportion of requests forwarded to the specified
A reviewer commented:

One question: if you have 1 client with HTTP/2 that sends 1000 requests (all requests pipelined over the same connection) and two models weighted 50 and 50, is the result 500 requests for each model?

Is a "request" a connection request, a token, an HTTP request?

@kfswain (Collaborator, Author) replied:

Gateway implementations handle the actual connection (ext-proc just uses gRPC communication with the gateway).

But yes, assuming equal weighting for 2 underlying models, the mathematical probability should be a 50:50 split over a large enough sample pool.

//
// +optional
// +kubebuilder:validation:MaxItems=10
TargetModels []TargetModel `json:"targetModels,omitempty"`
A reviewer commented:

Is targetModels mutable? Can it be updated to add or remove models? If that happens, are the weights recalculated?

@kfswain (Collaborator, Author) replied:

Yep, targetModels can be added/removed.

The weights do recalculate; they are all relative to one another. Link to how the weights are consumed here:

func RandomWeightedDraw(model *v1alpha1.InferenceModel, seed int64) string {
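To make the 50:50 behavior concrete, here is a self-contained sketch of a weighted draw in the spirit of the linked function; it is an illustration under assumed types, not the repository's implementation:

```go
package main

import (
	"fmt"
	"math/rand"
)

// TargetModel mirrors the shape under discussion: a backing model name
// plus a relative weight. Field details are assumptions for illustration.
type TargetModel struct {
	Name   string
	Weight int32
}

// randomWeightedDraw picks a model name with probability proportional to
// its weight by walking cumulative weights past a uniformly random point.
func randomWeightedDraw(models []TargetModel, r *rand.Rand) string {
	var total int32
	for _, m := range models {
		total += m.Weight
	}
	if total <= 0 {
		return ""
	}
	n := r.Int31n(total) // uniform in [0, total)
	for _, m := range models {
		n -= m.Weight
		if n < 0 {
			return m.Name
		}
	}
	return models[len(models)-1].Name // defensive; the loop always returns
}

func main() {
	models := []TargetModel{
		{Name: "llama-base", Weight: 50},
		{Name: "llama-lora-v2", Weight: 50},
	}
	r := rand.New(rand.NewSource(1))
	counts := map[string]int{}
	for i := 0; i < 1000; i++ {
		counts[randomWeightedDraw(models, r)]++
	}
	fmt.Println(counts) // roughly 500 each over a large enough sample
}
```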

@kfswain (Collaborator, Author) commented Jan 16, 2025:

Heya @shaneutt! I made #204 to address comments (or I just commented directly). Thanks for the input!

Labels:
cncf-cla: yes — indicates the PR's author has signed the CNCF CLA.
do-not-merge/hold — indicates that a PR should not merge because someone has issued a /hold command.
size/XL — denotes a PR that changes 500-999 lines, ignoring generated files.
Projects: None yet

8 participants