[ML] Avoid ModelAssignment deadlock #109684

prwhelan · 2024-06-13T15:16:05Z

The model loading scheduled thread iterates through the model queue and deploys each model. Rather than block and wait on each deployment, the thread will attach a listener that will either iterate to the next model (if one is in the queue) or reschedule the thread.

This change should not impact:

1. the iterative nature of the model deployment process - each model is still deployed one at a time, and no additional threads are consumed per model.
2. the 1s delay between model deployment tries - if a deployment fails but can be retried, the retry is added to the next batch of models that are consumed after the 1s scheduled delay.

Relate #109134
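
The listener-driven iteration described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `ChainedLoaderSketch`, `Listener`, and `ModelLoader` are hypothetical stand-ins rather than Elasticsearch types, the 1s reschedule is represented by a plain `Runnable`, and every failure is assumed to be retryable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-ins for illustration; none of these are Elasticsearch types.
// The sketch is single-threaded and makes no attempt at thread safety.
final class ChainedLoaderSketch {
    /** Callback invoked when an asynchronous load completes. */
    interface Listener {
        void onResponse();

        void onFailure(Exception e);
    }

    /** Starts a deployment and returns immediately; the outcome is reported to the listener. */
    interface ModelLoader {
        void load(String modelId, Listener listener);
    }

    private final Queue<String> loadingModels;
    private final List<String> retries = new ArrayList<>();
    private final ModelLoader loader;
    private final Runnable reschedule;

    ChainedLoaderSketch(Queue<String> loadingModels, ModelLoader loader, Runnable reschedule) {
        this.loadingModels = loadingModels;
        this.loader = loader;
        this.reschedule = reschedule;
    }

    /** Deploys queued models one at a time without blocking on any of them. */
    void loadQueuedModels() {
        String modelId = loadingModels.poll();
        if (modelId == null) {
            // Batch drained: queue the retries for the next run, then reschedule (the 1s delay).
            loadingModels.addAll(retries);
            retries.clear();
            reschedule.run();
            return;
        }
        loader.load(modelId, new Listener() {
            @Override
            public void onResponse() {
                loadQueuedModels(); // iterate to the next model, if any
            }

            @Override
            public void onFailure(Exception e) {
                retries.add(modelId); // assumption: every failure is retryable in this sketch
                loadQueuedModels();
            }
        });
    }
}
```

The calling thread only starts a deployment and attaches a listener. The listener either moves on to the next queued model or, once the queue is empty, hands the retries to the next scheduled run.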


prwhelan · 2024-06-13T15:35:11Z

@elasticmachine update branch

elasticsearchmachine · 2024-06-13T16:54:22Z

Pinging @elastic/ml-core (Team:ML)

elasticsearchmachine · 2024-06-13T16:54:22Z

Hi @prwhelan, I've created a changelog YAML for you.

davidkyle

LGTM

Thanks for working on this, I left a suggestion

davidkyle · 2024-06-17T11:36:02Z

.../java/org/elasticsearch/xpack/ml/inference/assignment/TrainedModelAssignmentNodeService.java

+        // if someone calls stop halfway through, abandon this entire chain
+        var loadingToRetry = new ConcurrentLinkedDeque<TrainedModelDeploymentTask>();
+        var deploymentChain = SubscribableListener.<Void>newSucceeded(null);
+        while (loadingModels.isEmpty() == false) {


Would it simplify the code to remove the while() loop in favour of just taking the head of the queue and loading the next model on the next scheduled invocation?

    var loadingTask = loadingModels.poll();
    if (loadingTask == null) {
        onFinish.run();
        return;
    }

The function loadQueuedModels() will be called again periodically anyway, so one option is to let the next invocation handle the next model. Would an error loading 1 model fail any other models in this chain?

In practice there are only a small number of models per ml node due to the memory and CPU demands. In an autoscaling situation when an ml node joins the cluster it may have to load 1 or 2 models, certainly not 10s of models.
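
A minimal sketch of that suggestion, assuming each scheduled invocation handles at most one model. `OneModelPerRunSketch` and the `loadAsync` callback are hypothetical names used for illustration; only `loadingModels`, `loadQueuedModels`, and `onFinish` come from the snippet above, and plain strings stand in for the deployment tasks.

```java
import java.util.Queue;
import java.util.function.BiConsumer;

// Hypothetical sketch: no loop, one model per scheduled invocation.
final class OneModelPerRunSketch {
    /**
     * Takes the head of the queue and starts loading it. If the queue is empty there is
     * nothing to do, so onFinish runs straight away. Otherwise onFinish is handed to the
     * loader, which is expected to run it once the load has completed.
     */
    static void loadQueuedModels(
        Queue<String> loadingModels,
        BiConsumer<String, Runnable> loadAsync,
        Runnable onFinish
    ) {
        String loadingTask = loadingModels.poll();
        if (loadingTask == null) {
            onFinish.run();
            return;
        }
        loadAsync.accept(loadingTask, onFinish);
    }
}
```

With two queued models, three scheduled runs load the first model, then the second, then find the queue empty. `onFinish` runs once per invocation in every case.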

Yes, I would very much like to simplify this. If we're okay waiting for ~1s between iterations, I'd be happy to remove the while loop.

Now that I look at it, we could even just schedule immediately if the queue is not empty?

> Would an error loading 1 model fail any other models in this chain?

No, because it squelches the error when it calls listener.onResponse so the next iteration will run.
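
The two points in this reply can be sketched as follows, under stated assumptions. `SquelchSketch` and its members are hypothetical and are not the code in this PR. The sketch assumes a failure is recorded and then treated like a response so the next iteration still runs, and that the next run is scheduled immediately when the queue is not empty and after the usual 1s otherwise.

```java
import java.util.List;

// Hypothetical sketch of failure squelching and the choice of the next delay.
final class SquelchSketch {
    interface Listener {
        void onResponse();

        void onFailure(Exception e);
    }

    /** Wraps the next step so that a failed load is recorded but never breaks the chain. */
    static Listener squelching(Runnable next, List<Exception> recorded) {
        return new Listener() {
            @Override
            public void onResponse() {
                next.run();
            }

            @Override
            public void onFailure(Exception e) {
                recorded.add(e); // keep the error for logging or retry handling
                next.run();      // then continue exactly as a success would
            }
        };
    }

    /** Delay before the next run: immediate while work remains, otherwise the 1s schedule. */
    static long nextDelayMillis(boolean queueEmpty) {
        return queueEmpty ? 1000L : 0L;
    }
}
```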

DaveCTurner

Nice, I like it

DaveCTurner · 2024-06-17T14:33:36Z

.../java/org/elasticsearch/xpack/ml/inference/assignment/TrainedModelAssignmentNodeService.java

+    private static <T> ActionListener<T> thenRun(Runnable runnable) {
+        return ActionListener.runAfter(ActionListener.noop(), runnable);


Why not ActionListener.running()?

We should swap to it. Initially I had this as runBefore so the runnable could safely throw an exception, then I thought failing silently might be confusing and swapped to runAfter, without noticing that this is effectively ActionListener.running().
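
To illustrate why the two forms behave the same, here is a small stand-in sketch. The `Listener` interface and the `noop`, `runAfter`, and `running` helpers below are simplified, hypothetical re-creations written for this example; they are not the Elasticsearch `ActionListener` implementation and may differ from it in detail, for example in exception handling.

```java
// Hypothetical, simplified stand-ins for the helpers discussed in this thread.
final class ThenRunSketch {
    interface Listener<T> {
        void onResponse(T response);

        void onFailure(Exception e);
    }

    /** A listener that ignores both outcomes. */
    static <T> Listener<T> noop() {
        return new Listener<T>() {
            @Override
            public void onResponse(T response) {}

            @Override
            public void onFailure(Exception e) {}
        };
    }

    /** Completes the delegate first, then always runs the runnable. */
    static <T> Listener<T> runAfter(Listener<T> delegate, Runnable after) {
        return new Listener<T>() {
            @Override
            public void onResponse(T response) {
                try {
                    delegate.onResponse(response);
                } finally {
                    after.run();
                }
            }

            @Override
            public void onFailure(Exception e) {
                try {
                    delegate.onFailure(e);
                } finally {
                    after.run();
                }
            }
        };
    }

    /** Runs the runnable on completion, whether successful or not. */
    static <T> Listener<T> running(Runnable runnable) {
        return new Listener<T>() {
            @Override
            public void onResponse(T response) {
                runnable.run();
            }

            @Override
            public void onFailure(Exception e) {
                runnable.run();
            }
        };
    }
}
```

Because the delegate is a no-op, `runAfter(noop(), runnable)` does nothing except run the runnable on either outcome, which is what `running(runnable)` expresses directly.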

DaveCTurner · 2024-06-17T14:36:34Z

.../java/org/elasticsearch/xpack/ml/inference/assignment/TrainedModelAssignmentNodeService.java

+                    // don't bother calling the listener, the lifecycle will not resume the instance of this class
+                    return;


I'd sorta recommend just completing the chain of listeners in this case unless there's a strong argument (e.g. performance) not to do so, and then putting a stopped check in the final listener too. Leaking listeners is a source of super-painful bugs, even at shutdown.
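
One way to read this recommendation, sketched with hypothetical names (`StopAwareSketch`, `finalListener`, and the `completions` counter are illustrative and not from the PR): on stop, the final listener is still completed rather than dropped, and it checks the stopped flag itself before rescheduling.

```java
import java.util.Queue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical sketch: complete the listener even when stopped, and check the flag at the end.
final class StopAwareSketch {
    private final AtomicBoolean stopped = new AtomicBoolean(false);
    private final AtomicInteger completions = new AtomicInteger();
    private final Queue<String> loadingModels;

    StopAwareSketch(Queue<String> loadingModels) {
        this.loadingModels = loadingModels;
    }

    void stop() {
        stopped.set(true);
    }

    /** Number of times the final listener has been completed (never skipped). */
    int completions() {
        return completions.get();
    }

    void loadQueuedModels(Consumer<String> load, Runnable reschedule) {
        // The final listener always completes, but only reschedules while still running.
        Runnable finalListener = () -> {
            completions.incrementAndGet();
            if (stopped.get() == false) {
                reschedule.run();
            }
        };
        if (stopped.get()) {
            finalListener.run(); // complete the chain instead of returning without a callback
            return;
        }
        String modelId = loadingModels.poll();
        if (modelId != null) {
            load.accept(modelId);
        }
        finalListener.run();
    }
}
```

After `stop()`, a call to `loadQueuedModels` loads nothing and does not reschedule, yet the completion count still increases, so nothing waiting on the listener is left hanging.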


prwhelan · 2024-07-03T14:23:30Z

@elasticmachine update branch


davidkyle

Still LGTM

prwhelan added 2 commits June 13, 2024 09:48

Always run handleFailure on ML threadpool

262ee35

prwhelan added the >bug, :ml, and Team:ML labels Jun 13, 2024

elasticsearchmachine added the v8.15.0 label Jun 13, 2024

elasticmachine and others added 2 commits June 14, 2024 01:35

Merge branch 'main' into fix/109134-3v1

0f06085

Restore thread context after async calls

bf8b9ff

prwhelan marked this pull request as ready for review June 13, 2024 16:53

prwhelan added 2 commits June 13, 2024 12:54

Update docs/changelog/109684.yaml

40f5a92

Allow tests to wait for model loading

dde87e2

davidkyle approved these changes Jun 17, 2024


DaveCTurner reviewed Jun 17, 2024


Remove loop, only load one model at a time

ecfd512

davidkyle reviewed Jun 17, 2024


prwhelan and others added 2 commits June 17, 2024 15:57

Update x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/assignment/TrainedModelAssignmentNodeService.java

Co-authored-by: David Kyle <[email protected]>

6f7609e

Spotless apply

6bb469b

elasticmachine and others added 2 commits July 4, 2024 00:23

Merge branch 'main' into fix/109134-3v1

83fabe5

Rerun immediately if there are still tasks left in the queue, this is because some users report 1k+ models in a single digit cluster size so startup would take ~3m

64e87ef

davidkyle approved these changes Jul 4, 2024


elasticsearchmachine added v8.16.0 and removed v8.15.0 labels Jul 4, 2024

prwhelan merged commit 1aea049 into elastic:main Jul 5, 2024
15 checks passed
