
Add support for parsing neuron core resource limit and pass it as ray… #2409

Open
wants to merge 1 commit into base: master

Conversation

mounchin

Why are these changes needed?

This PR is needed to parse the neuron core resource limits from the pod spec and pass them to the ray start command in the format it expects.
Expected format: https://docs.ray.io/en/latest/ray-core/scheduling/accelerators.html
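The format Ray documents for custom accelerator resources is a JSON object passed via `--resources`, e.g. `ray start --resources='{"neuron_cores": 4}'`. A minimal sketch of how the limit taken from the container spec could be rendered into that argument (the function name is an assumption for illustration, not the PR's actual code):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildNeuronCoreResourceArg renders a neuron core count into the JSON
// value that `ray start --resources` expects. Hypothetical helper, not
// from the PR diff.
func buildNeuronCoreResourceArg(neuronCores int64) (string, error) {
	resources := map[string]int64{"neuron_cores": neuronCores}
	b, err := json.Marshal(resources)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	arg, _ := buildNeuronCoreResourceArg(4)
	fmt.Println(arg) // {"neuron_cores":4}
}
```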

Related issue number

ray-project/ray#44361

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

rayStartParams["num-gpus"] = strconv.FormatInt(resource.Value(), 10)
// For now, only support one GPU type. Break on first match.
break
} else if resourceKeyString == "aws.amazon.com/neuroncore" && !resource.IsZero() {
Collaborator

Can you put aws.amazon.com/neuroncore into a constant?
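The reviewer's suggestion amounts to hoisting the resource key string into a named constant so it is defined once. A sketch, where the constant name is an assumption rather than what the PR ends up using:

```go
package main

import "fmt"

// NeuronCoreContainerResourceName is the Kubernetes extended resource
// name for AWS neuron cores. The identifier is illustrative only.
const NeuronCoreContainerResourceName = "aws.amazon.com/neuroncore"

func main() {
	resourceKeyString := "aws.amazon.com/neuroncore"
	fmt.Println(resourceKeyString == NeuronCoreContainerResourceName)
}
```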

rayStartParams["num-gpus"] = strconv.FormatInt(resource.Value(), 10)
// For now, only support one GPU type. Break on first match.
break
} else if resourceKeyString == "aws.amazon.com/neuroncore" && !resource.IsZero() {
if err := addNeuronCoresToResourcesIfNotExists(rayStartParams, resource.Value()); err != nil {
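The body of `addNeuronCoresToResourcesIfNotExists` is not shown in this excerpt. An assumed sketch of what such a helper could do, merging the detected count into any user-supplied `resources` start param without overriding a value the user already set:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// addNeuronCoresToResourcesIfNotExists merges neuron_cores into the
// "resources" ray start param as JSON, keeping any user-provided value.
// Assumed implementation; the real helper's body is not in this excerpt.
func addNeuronCoresToResourcesIfNotExists(rayStartParams map[string]string, neuronCores int64) error {
	resources := map[string]interface{}{}
	if existing, ok := rayStartParams["resources"]; ok && existing != "" {
		if err := json.Unmarshal([]byte(existing), &resources); err != nil {
			return err
		}
	}
	// Only set neuron_cores if the user hasn't already specified it.
	if _, ok := resources["neuron_cores"]; !ok {
		resources["neuron_cores"] = neuronCores
	}
	b, err := json.Marshal(resources)
	if err != nil {
		return err
	}
	rayStartParams["resources"] = string(b)
	return nil
}

func main() {
	params := map[string]string{}
	if err := addNeuronCoresToResourcesIfNotExists(params, 2); err != nil {
		panic(err)
	}
	fmt.Println(params["resources"]) // {"neuron_cores":2}
}
```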
@andrewsykim (Collaborator) Sep 28, 2024

This would be the first hardware accelerator (outside GPUs) that we auto-detect in container resources and pass into the ray start command as a custom resource. I can see the appeal of doing this, but it could also mean we end up supporting an ever-growing list of custom resources in KubeRay.

My question here is whether having to specify the custom resource in startParams is a big enough pain point for KubeRay to auto-detect commonly used custom resources.

Collaborator
Thinking about it more, it's probably worth doing for the well-known accelerators at least. Can we generalize the implementation so it's easy to add new accelerators here? cc @ryanaoleary for TPUs
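One way to generalize, along the lines suggested here, is a single table mapping well-known container resource names to the Ray custom resource names documented at https://docs.ray.io/en/latest/ray-core/scheduling/accelerators.html. A sketch, where the TPU entry and all identifier names are illustrative assumptions:

```go
package main

import "fmt"

// containerResourceToRayResource maps Kubernetes extended resource names
// to Ray custom resource names. Entries beyond the neuron core one are
// illustrative; adding a new accelerator would mean adding one entry.
var containerResourceToRayResource = map[string]string{
	"aws.amazon.com/neuroncore": "neuron_cores",
	"google.com/tpu":            "TPU",
}

// rayResourceNameFor returns the Ray resource name for a container
// resource, and whether the resource is a known accelerator.
func rayResourceNameFor(containerResource string) (string, bool) {
	name, ok := containerResourceToRayResource[containerResource]
	return name, ok
}

func main() {
	name, ok := rayResourceNameFor("aws.amazon.com/neuroncore")
	fmt.Println(name, ok)
}
```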

Collaborator

Native support for neuron and other accelerators already exists for Ray's VM cluster-launchers.
See discussion in ray-project/ray#44361

This PR would move KubeRay closer to parity with the VM-based solution.

Collaborator

I agree that the implementation here needs some shape that could generalize to other hardware, again along the same lines as what exists for Ray's VM support.

Successfully merging this pull request may close these issues.

3 participants