
DEVOPS-424 fix: add explicit GPU resource limit #63

Merged · 1 commit into master · Oct 25, 2023
Conversation

tplessas (Contributor)

The NVIDIA plugin is already updated on both clusters. I tested that these changes work on the review environment by deploying several coref-resolution replicas under the release coref-multitest, which I'll leave up until this PR is merged.

[screenshot: coref-multitest replicas deployed]

Two replicas ended up on the same node, which also allows us to confirm that time-slicing works fine (barring any changes related to memory such as those discussed this morning).

[screenshot: two replicas scheduled on the same node]
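For reference, the explicit limit this PR adds looks roughly like the following deployment fragment. This is a minimal sketch; the actual Helm values and the requested count in the chart may differ:

```yaml
# Hypothetical pod spec fragment illustrating an explicit GPU resource limit.
# With time-slicing enabled, the value must be a whole number (typically 1).
resources:
  limits:
    nvidia.com/gpu: 1  # claim one shared GPU slice from the device plugin
```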

@tplessas tplessas requested a review from a team as a code owner October 25, 2023 10:33
@tplessas tplessas changed the title fix: add explicit GPU resource limit DEVOPS-424 fix: add explicit GPU resource limit Oct 25, 2023
@tplessas tplessas merged commit ef34ae7 into master Oct 25, 2023
1 check passed
@tplessas tplessas deleted the add-gpu-limit branch October 25, 2023 13:15
@anton-delphai (Contributor)

Why do we need that if we set the capacity to a very big number? Isn't that equivalent to not requesting GPUs at all?

@tplessas (Contributor, Author) commented Oct 27, 2023

Time-slicing does not actually do what its name implies: the resource request is only used to claim a generic stake on the GPU, not a proportional share of its compute. This is why the resource limit cannot have a fractional part or be more than 1.

From https://github.com/NVIDIA/k8s-device-plugin#shared-access-to-gpus-with-cuda-time-slicing:

Note: Unlike with "normal" GPU requests, requesting more than one shared GPU does not imply that you will get guaranteed access to a proportional amount of compute power. It only implies that you will get access to a GPU that is shared by other clients (each of which has the freedom to run as many processes on the underlying GPU as they want). Under the hood CUDA will simply give an equal share of time to all of the GPU processes across all of the clients. The failRequestsGreaterThanOne flag is meant to help users understand this subtlety, by treating a request of 1 as an access request rather than an exclusive resource request.

Everything works just as it did before this PR and the cluster update; the only addition is some more k8s bureaucracy to achieve it.
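For context, the device-plugin side of this is configured with a time-slicing config of roughly the following shape, per the NVIDIA k8s-device-plugin README linked above. The replica count here is illustrative, not necessarily what our clusters use:

```yaml
# Illustrative NVIDIA k8s-device-plugin time-slicing config (values are assumptions).
version: v1
sharing:
  timeSlicing:
    failRequestsGreaterThanOne: true  # reject requests > 1, since a larger request grants no extra compute
    resources:
    - name: nvidia.com/gpu
      replicas: 4  # advertise each physical GPU as 4 schedulable slices
```

With this in place, a pod requesting `nvidia.com/gpu: 1` is granted access to a shared GPU rather than exclusive use of one.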
