
[ML] Avoid ModelAssignment deadlock #109684

Merged (11 commits, Jul 5, 2024)
Conversation

prwhelan (Member)

The model loading scheduled thread iterates through the model queue and deploys each model. Rather than block and wait on each deployment, the thread will attach a listener that will either iterate to the next model (if one is in the queue) or reschedule the thread.

This change should not impact:

  1. the iterative nature of the model deployment process - each model is still deployed one at a time, and no additional threads are consumed per model.
  2. the 1s delay between model deployment tries - if a deployment fails but can be retried, the retry is added to the next batch of models that are consumed after the 1s scheduled delay.

Relates #109134
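The non-blocking iteration the PR describes can be sketched as follows. This is a minimal stand-in with hypothetical names, not the actual Elasticsearch code (which uses ActionListener/SubscribableListener and a real thread pool): each completed deployment triggers the next model via its listener, so the scheduled thread never blocks on a deployment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

// Hypothetical sketch of the pattern described above; not the real
// TrainedModelAssignmentNodeService code.
public class ModelLoaderSketch {
    final Queue<String> loadingModels = new ConcurrentLinkedQueue<>();
    final List<String> deployed = new ArrayList<>();

    // Iterate the queue without blocking: each completed deployment
    // invokes its listener, which either moves on to the next queued
    // model or finishes (letting the periodic schedule re-invoke us).
    void loadQueuedModels(Runnable onFinish) {
        String model = loadingModels.poll();
        if (model == null) {
            onFinish.run(); // queue drained; the 1s schedule would reschedule
            return;
        }
        deployAsync(model, ignored -> loadQueuedModels(onFinish));
    }

    // Stand-in for the real asynchronous deployment. In the real code the
    // listener is completed on another thread, and retryable failures are
    // re-queued for the next scheduled batch rather than thrown.
    void deployAsync(String model, Consumer<Void> listener) {
        deployed.add(model);
        listener.accept(null);
    }
}
```

One property worth noting: because each step only runs after the previous listener completes, models are still deployed one at a time and no extra threads are consumed per model, matching the two invariants listed above.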

prwhelan added labels >bug, :ml (Machine learning), Team:ML (Meta label for the ML team) on Jun 13, 2024
prwhelan (Member Author)

@elasticmachine update branch

prwhelan marked this pull request as ready for review June 13, 2024 16:53
elasticsearchmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

elasticsearchmachine (Collaborator)

Hi @prwhelan, I've created a changelog YAML for you.

davidkyle (Member) left a comment:

LGTM

Thanks for working on this, I left a suggestion

// if someone calls stop halfway through, abandon this entire chain
var loadingToRetry = new ConcurrentLinkedDeque<TrainedModelDeploymentTask>();
var deploymentChain = SubscribableListener.<Void>newSucceeded(null);
while (loadingModels.isEmpty() == false) {
davidkyle (Member):

Would it simplify the code to remove the while() loop in favour of just taking the head of the queue and loading the next model on the next scheduled invocation?

var loadingTask = loadingModels.poll();
if (loadingTask == null) {
  onFinish.run();
  return;
}

This function loadQueuedModels() will be called again periodically anyway, so it's an option to let the next invocation handle the next model. Would an error loading 1 model fail any other models in this chain?

In practice there are only a small number of models per ml node due to the memory and CPU demands. In an autoscaling situation when an ml node joins the cluster it may have to load 1 or 2 models, certainly not 10s of models.

prwhelan (Member Author):

Yes, I would very much like to simplify this. If we're okay waiting ~1s between iterations, I'd be happy to remove the while loop.

Now that I look at it, we could even just schedule immediately if the queue is not empty?

> Would an error loading 1 model fail any other models in this chain?

No, because it squelches the error when it calls listener.onResponse so the next iteration will run.

DaveCTurner (Contributor) left a comment:

Nice, I like it

Comment on lines 237 to 238
private static <T> ActionListener<T> thenRun(Runnable runnable) {
    return ActionListener.runAfter(ActionListener.noop(), runnable);
}
DaveCTurner (Contributor):

Why not ActionListener.running()?

prwhelan (Member Author):

We should swap to it. Initially I had this as runBefore so the runnable could safely throw an exception; then I thought failing silently might be confusing and swapped to runAfter, without realizing that is effectively ActionListener.running().
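The equivalence being discussed can be illustrated with a simplified stand-in for Elasticsearch's ActionListener (hypothetical, cut down to two methods): runAfter(noop(), r) and running(r) both execute the runnable on response and on failure, which is why the wrapper above reduces to running().

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for Elasticsearch's ActionListener, just enough to
// show why runAfter(noop(), r) behaves like running(r).
interface Listener<T> {
    void onResponse(T t);
    void onFailure(Exception e);

    // Listener that ignores both outcomes.
    static <T> Listener<T> noop() {
        return new Listener<>() {
            public void onResponse(T t) {}
            public void onFailure(Exception e) {}
        };
    }

    // Delegates, then runs `r` regardless of outcome.
    static <T> Listener<T> runAfter(Listener<T> delegate, Runnable r) {
        return new Listener<>() {
            public void onResponse(T t) { delegate.onResponse(t); r.run(); }
            public void onFailure(Exception e) { delegate.onFailure(e); r.run(); }
        };
    }

    // Runs `r` on either outcome; with a noop delegate, runAfter is the same.
    static <T> Listener<T> running(Runnable r) {
        return new Listener<>() {
            public void onResponse(T t) { r.run(); }
            public void onFailure(Exception e) { r.run(); }
        };
    }
}
```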

Comment on lines 214 to 215
// don't bother calling the listener, the lifecycle will not resume the instance of this class
return;
DaveCTurner (Contributor):

I'd sorta recommend just completing the chain of listeners in this case unless there's a strong argument (e.g. performance) not to do so, and then putting a stopped check in the final listener too. Leaking listeners is a source of super-painful bugs, even at shutdown.
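The shape being recommended can be sketched as follows (hypothetical names, not the PR's actual code): on stop, complete the listener instead of returning without calling it, and put the stopped check in the final listener so the chain never leaks.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Hypothetical sketch of the suggestion: always complete the listener
// chain, even at shutdown, and gate follow-up work in the final listener.
public class ShutdownSafeChain {
    final AtomicBoolean stopped = new AtomicBoolean();

    void step(Consumer<Void> listener) {
        if (stopped.get()) {
            listener.accept(null); // complete the chain rather than leak it
            return;
        }
        // ... do the step's work, then complete:
        listener.accept(null);
    }

    void runChain(Runnable reschedule) {
        step(ignored -> {
            if (stopped.get() == false) { // stopped check in the final listener too
                reschedule.run();
            }
        });
    }
}
```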

prwhelan and others added 2 commits June 17, 2024 15:57
…rence/assignment/TrainedModelAssignmentNodeService.java

Co-authored-by: David Kyle <[email protected]>
prwhelan (Member Author) commented Jul 3, 2024

@elasticmachine update branch

elasticmachine and others added 2 commits July 4, 2024 00:23
… because some users report 1k+ models in a single digit cluster size so startup would take ~3m
davidkyle (Member) left a comment:

Still LGTM

@prwhelan prwhelan merged commit 1aea049 into elastic:main Jul 5, 2024
15 checks passed
Labels: >bug, :ml (Machine learning), Team:ML (Meta label for the ML team), v8.16.0