
Add support for parsing neuron core resource limit and pass it as ray… #2409

Open
wants to merge 1 commit into base: master

Conversation

mounchin

Why are these changes needed?

This PR is needed to parse the neuron core resource limits from the pod spec and pass them to the ray start command in the format it expects.
Expected format: https://docs.ray.io/en/latest/ray-core/scheduling/accelerators.html
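The format Ray documents for custom accelerator resources is a JSON object passed via `--resources`, e.g. `ray start --resources='{"neuron_cores": 4}'`. A minimal sketch of how the limit taken from the container spec could be rendered into that argument (the function name is an assumption for illustration, not the PR's actual code):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildNeuronCoreResourceArg renders a neuron core count into the JSON
// value that `ray start --resources` expects. Hypothetical helper, not
// from the PR diff.
func buildNeuronCoreResourceArg(neuronCores int64) (string, error) {
	resources := map[string]int64{"neuron_cores": neuronCores}
	b, err := json.Marshal(resources)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	arg, _ := buildNeuronCoreResourceArg(4)
	fmt.Println(arg) // {"neuron_cores":4}
}
```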

Related issue number

ray-project/ray#44361

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

rayStartParams["num-gpus"] = strconv.FormatInt(resource.Value(), 10)
// For now, only support one GPU type. Break on first match.
break
} else if resourceKeyString == "aws.amazon.com/neuroncore" && !resource.IsZero() {
Collaborator

Can you put aws.amazon.com/neuroncore into a constant?
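The reviewer's suggestion amounts to hoisting the resource key string into a named constant so it is defined once. A sketch, where the constant name is an assumption rather than what the PR ends up using:

```go
package main

import "fmt"

// NeuronCoreContainerResourceName is the Kubernetes extended resource
// name for AWS neuron cores. The identifier is illustrative only.
const NeuronCoreContainerResourceName = "aws.amazon.com/neuroncore"

func main() {
	resourceKeyString := "aws.amazon.com/neuroncore"
	fmt.Println(resourceKeyString == NeuronCoreContainerResourceName)
}
```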

rayStartParams["num-gpus"] = strconv.FormatInt(resource.Value(), 10)
// For now, only support one GPU type. Break on first match.
break
} else if resourceKeyString == "aws.amazon.com/neuroncore" && !resource.IsZero() {
if err := addNeuronCoresToResourcesIfNotExists(rayStartParams, resource.Value()); err != nil {
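The body of `addNeuronCoresToResourcesIfNotExists` is not shown in this excerpt. An assumed sketch of what such a helper could do, merging the detected count into any user-supplied `resources` start param without overriding a value the user already set:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// addNeuronCoresToResourcesIfNotExists merges neuron_cores into the
// "resources" ray start param as JSON, keeping any user-provided value.
// Assumed implementation; the real helper's body is not in this excerpt.
func addNeuronCoresToResourcesIfNotExists(rayStartParams map[string]string, neuronCores int64) error {
	resources := map[string]interface{}{}
	if existing, ok := rayStartParams["resources"]; ok && existing != "" {
		if err := json.Unmarshal([]byte(existing), &resources); err != nil {
			return err
		}
	}
	// Only set neuron_cores if the user hasn't already specified it.
	if _, ok := resources["neuron_cores"]; !ok {
		resources["neuron_cores"] = neuronCores
	}
	b, err := json.Marshal(resources)
	if err != nil {
		return err
	}
	rayStartParams["resources"] = string(b)
	return nil
}

func main() {
	params := map[string]string{}
	if err := addNeuronCoresToResourcesIfNotExists(params, 2); err != nil {
		panic(err)
	}
	fmt.Println(params["resources"]) // {"neuron_cores":2}
}
```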
@andrewsykim (Collaborator) Sep 28, 2024

This would be the first hardware accelerator (outside GPUs) that we auto-detect in container resources and pass into the ray start command as a custom resource. I can see the appeal of doing this, but it could also mean we end up supporting an ever-growing list of custom resources in KubeRay.

My question here is whether having to specify the custom resource in startParams is a big enough pain point for KubeRay to auto-detect commonly used custom resources.

Collaborator
Thinking about it more, it's probably worth doing for the well-known accelerators at least. Can we generalize the implementation so it's easy to add new accelerators here? cc @ryanaoleary for TPUs
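One way to generalize, along the lines suggested here, is a single table mapping well-known container resource names to the Ray custom resource names documented at https://docs.ray.io/en/latest/ray-core/scheduling/accelerators.html. A sketch, where the TPU entry and all identifier names are illustrative assumptions:

```go
package main

import "fmt"

// containerResourceToRayResource maps Kubernetes extended resource names
// to Ray custom resource names. Entries beyond the neuron core one are
// illustrative; adding a new accelerator would mean adding one entry.
var containerResourceToRayResource = map[string]string{
	"aws.amazon.com/neuroncore": "neuron_cores",
	"google.com/tpu":            "TPU",
}

// rayResourceNameFor returns the Ray resource name for a container
// resource, and whether the resource is a known accelerator.
func rayResourceNameFor(containerResource string) (string, bool) {
	name, ok := containerResourceToRayResource[containerResource]
	return name, ok
}

func main() {
	name, ok := rayResourceNameFor("aws.amazon.com/neuroncore")
	fmt.Println(name, ok)
}
```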

Collaborator

Native support for neuron and other accelerators already exists for Ray's VM cluster-launchers.
See discussion in ray-project/ray#44361

This PR would move KubeRay closer to parity with the VM-based solution.

Collaborator

I agree that the implementation here needs some shape that could generalize to other hardware, again along the same lines as what exists for Ray's VM support.

Successfully merging this pull request may close these issues.

3 participants