
trace_event: destroy platform before tracing #22938

Closed

Conversation

ofrobots
Contributor

@ofrobots ofrobots commented Sep 18, 2018

For safer shutdown, we should destroy the platform (and platform
threads) before the tracing infrastructure is destroyed. This change
fixes the relative order of NodePlatform disposal and the tracing
agent shutting down, matching the nesting order used at startup.

Make the tracing agent own the tracing controller instead of the platform,
to match the above rationale.

Fixes: #22865
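
As a rough illustration of the intended ordering (hypothetical, simplified types; not the PR's actual code), the platform and the worker threads it owns go down first, and only afterwards is the tracing agent, which now owns the TracingController, destroyed:

// Hypothetical sketch of the teardown ordering described above. The real types
// in core are NodePlatform, tracing::Agent and v8::platform::tracing::TracingController.
#include <memory>

struct TracingController {};

struct Agent {
  // After this change the tracing agent owns the controller.
  std::unique_ptr<TracingController> tracing_controller{new TracingController()};
};

struct Platform {
  explicit Platform(TracingController* tc) : controller(tc) {}
  void Shutdown() { /* join platform worker threads here */ }
  TracingController* controller;  // borrowed from the agent, not owned
};

int main() {
  Agent agent;                                         // startup: agent + controller first
  Platform platform(agent.tracing_controller.get());   // then the platform on top

  platform.Shutdown();  // shutdown: the platform (and its threads) goes down first...
  return 0;             // ...the agent and its controller are destroyed last, on return
}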

This should fix the thread races we have been observing with trace events. I have been running this on the FreeBSD box that was showing flakes in the CI:

[freebsd@test-digitalocean-freebsd11-x64-2 ~/ofrobots]$ tools/test.py -J test/parallel/test-trace-events-fs-sync.js --repeat 1000
[28:25|% 100|+ 1000|-   0]: Done
[freebsd@test-digitalocean-freebsd11-x64-2 ~/ofrobots]$ tools/test.py -J test/parallel/test-trace-events-fs-sync.js --repeat 9999
[290:38|% 100|+ 9999|-   0]: Done

I believe this makes 92b695e unnecessary (but harmless). I can revert that if this change sticks and the CI proves that the flakiness is gone.

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • commit message follows commit guidelines

CI: https://ci.nodejs.org/job/node-test-pull-request/17302/

@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. lib / src Issues and PRs related to general changes in the lib or src directory. labels Sep 18, 2018
@Trott
Member

Trott commented Sep 19, 2018

The Win10 failures are odd, but they have to be unrelated, right?

Windows rebuild: https://ci.nodejs.org/job/node-test-commit-windows-fanned/20869/

@ofrobots
Contributor Author

The Windows tests are consistently failing with:

10:49:26     AssertionError [ERR_ASSERTION]: Expected inputs to be strictly equal:
10:49:26     + actual - expected
10:49:26     
10:49:26     + 3221225477
10:49:26     - 1

3221225477 is 0xC0000005, which is an access violation. This needs investigating.
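
For reference, a tiny standalone check (not part of the test harness) confirming that the decimal status maps to the access-violation code:

// Print the child's exit status in hex; 3221225477 is STATUS_ACCESS_VIOLATION.
#include <cstdint>
#include <cstdio>

int main() {
  std::uint32_t status = 3221225477u;
  std::printf("0x%08X\n", static_cast<unsigned>(status));  // prints 0xC0000005
  return 0;
}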

Member

@mcollina mcollina left a comment


LGTM

@Trott
Member

Trott commented Sep 19, 2018

/ping @nodejs/platform-windows on the access-violation Windows failures that this change apparently causes.

@ofrobots
Contributor Author

I plan to take a look at this once I get some free cycles; today has been quite busy. In the meantime, I would appreciate others' help.

For safer shutdown, we should destroy the platform – and background
threads - before the tracing infrastructure is destroyed. This change
fixes the relative order of NodePlatform disposition and the tracing
agent shutting down. This matches the nesting order for startup.

Make the tracing agent own the tracing controller instead of platform
to match the above.

Fixes: nodejs#22865
@ofrobots ofrobots force-pushed the fix-platform-tracing-shutdown-race branch from b2718fe to db9b5ff Compare September 20, 2018 21:45
@ofrobots
Contributor Author

I have been having a hard time reproducing this failure on my local Windows machine. Even in the CI it seems to be a flaky issue, with different tests failing depending on shutdown timing. Failures typically take the form of a segfault in a child process. If someone could help me grab the crashing stack trace somehow, that would help tremendously.

BTW, it would be a nice feature if our test infrastructure were capable of capturing stack traces on segfaults automatically. It would make root-cause analysis of failures tremendously simpler.

@BridgeAR
Member

@nodejs/build-infra @nodejs/build would it be possible to support what @ofrobots suggested?

BTW, it would be a nice feature if our test infrastructure were capable of capturing stack traces on segfaults automatically. It would make root-cause analysis of failures tremendously simpler.

@@ -48,8 +48,7 @@ using v8::platform::tracing::TraceConfig;
 using v8::platform::tracing::TraceWriter;
 using std::string;
 
-Agent::Agent() {
-  tracing_controller_ = new TracingController();
+Agent::Agent() : tracing_controller_(new TracingController()) {
Contributor


If the TracingController's lifetime is 1:1 with the Agent's, make it a member instead of a pointer.

Contributor Author


At this point, the lifetimes aren't perfectly aligned. After this PR lands, I intend to refactor and merge the Agent and TracingController concepts into a single structure.
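
For clarity, a hypothetical sketch (not code from this PR) of what holding the controller by value, as suggested above, might look like once the lifetimes do line up:

// Assumes the controller's lifetime is exactly 1:1 with the Agent's: hold it by
// value so it is constructed and destroyed with the Agent, with no new/delete.
class TracingController { /* ... */ };

class Agent {
 public:
  TracingController* GetTracingController() { return &tracing_controller_; }

 private:
  TracingController tracing_controller_;  // plain member instead of an owning pointer
};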

@ofrobots
Contributor Author

ofrobots commented Sep 24, 2018

After a lot of flailing (due to my inexperience with Windows), I was able to capture a crash dump. On Unix this would have been as simple as ulimit -c unlimited.

There were lots of false starts, though. For example, the internet suggests that setting some registry keys should enable dump generation; that didn't work for me. There are also instructions on our issue tracker about 'Dumps on Silent Process Exit'; that didn't help either, since it generates a dump every time a child process exits 'silently', which happens frequently.

Ultimately, I added code to node_main.cc that programmatically generates a crash dump; a rough sketch of that kind of handler is below. This worked. Perhaps we can add this permanently (under a flag)? Perhaps node-report already does this and will solve the problem once it is merged into core?
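
Since the snippet itself isn't reproduced above, here is a rough sketch of that kind of handler. The helper name WriteCrashDump and the dump file name are made up for illustration; SetUnhandledExceptionFilter and MiniDumpWriteDump are real Win32/DbgHelp APIs (link against dbghelp.lib):

// Hypothetical sketch: install an unhandled-exception filter that writes a
// minidump, so a crashing child process leaves a .dmp file whose stack trace
// can be inspected in WinDbg or Visual Studio.
#include <windows.h>
#include <dbghelp.h>

static LONG WINAPI WriteCrashDump(EXCEPTION_POINTERS* info) {
  HANDLE file = CreateFileA("node_crash.dmp", GENERIC_WRITE, 0, nullptr,
                            CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
  if (file != INVALID_HANDLE_VALUE) {
    MINIDUMP_EXCEPTION_INFORMATION mei;
    mei.ThreadId = GetCurrentThreadId();
    mei.ExceptionPointers = info;
    mei.ClientPointers = FALSE;
    MiniDumpWriteDump(GetCurrentProcess(), GetCurrentProcessId(), file,
                      MiniDumpNormal, &mei, nullptr, nullptr);
    CloseHandle(file);
  }
  // Fall through to default handling so the process still exits with
  // STATUS_ACCESS_VIOLATION (0xC0000005), as seen in the CI output above.
  return EXCEPTION_CONTINUE_SEARCH;
}

int main() {
  // Installed early (e.g. at the top of node_main.cc), possibly guarded by an
  // environment variable or a command-line flag.
  SetUnhandledExceptionFilter(WriteCrashDump);
  // ... rest of the program ...
  return 0;
}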

Anyway, back to the problem at hand: the tests that are flaking are ones that manually call process.exit. It seems that the background worker threads are still in the process of being initialized when the main thread is shutting down. This path still has some data races in some of the trace event globals.

@ofrobots
Contributor Author

I have verified that this Windows crash is not caused by the change here; it is a long-standing issue with thread timing on Windows that happens to be exposed by this change.

Without my patch, I can reproduce a crash by introducing a manual delay in the background worker thread startup:

static void PlatformWorkerThread(void* data) {
  fprintf(stderr, ""); // write an empty string to stderr, just to introduce a delay.
  TRACE_EVENT_METADATA1("__metadata", "thread_name", "name",
                        "PlatformWorkerThread");
  TaskQueue<Task>* pending_worker_tasks = static_cast<TaskQueue<Task>*>(data);
  while (std::unique_ptr<Task> task = pending_worker_tasks->BlockingPop()) {
    task->Run();
    pending_worker_tasks->NotifyOfCompletion();
  }
}

This results in a segfault in the child process (3221225477 is the decimal form of the access-violation code 0xC0000005).

C:\workspace\ofrobots\test\common\index.js:662
const crashOnUnhandledRejection = (err) => { throw err; };
                                             ^

AssertionError [ERR_ASSERTION]: Expected inputs to be strictly equal:
+ actual - expected

+ 3221225477
- 1
    at execFile.catch.common.mustCall (C:\workspace\ofrobots\test\parallel\test-child-process-promisified.js:47:14)
    at C:\workspace\ofrobots\test\common\index.js:349:15
    at process._tickCallback (internal/process/next_tick.js:68:7)

So far I can make the crash happen on Windows only. The problem is that the background worker is still doing I/O when the main thread calls exit.

I'll open a separate issue for this. This PR is blocked until this can be resolved.

@ofrobots
Contributor Author

Opened issue #23065.

@Trott Trott added the blocked PRs that are blocked by other issues or PRs. label Sep 24, 2018
@Trott
Member

Trott commented Sep 24, 2018

Added blocked label until #23065 gets resolved. 😞

@Trott
Member

Trott commented Oct 6, 2018

#23065 is resolved, so I believe this is now unblocked.

@Trott Trott removed the blocked PRs that are blocked by other issues or PRs. label Oct 6, 2018
Trott pushed a commit to Trott/io.js that referenced this pull request Oct 6, 2018
For safer shutdown, we should destroy the platform – and background
threads - before the tracing infrastructure is destroyed. This change
fixes the relative order of NodePlatform disposition and the tracing
agent shutting down. This matches the nesting order for startup.

Make the tracing agent own the tracing controller instead of platform
to match the above.

Fixes: nodejs#22865

PR-URL: nodejs#22938
Reviewed-By: Eugene Ostroukhov <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Matteo Collina <[email protected]>
@Trott
Member

Trott commented Oct 6, 2018

Landed in 68b3e46

@targos
Member

targos commented Oct 6, 2018

Should this be backported to v10.x-staging? If yes, please follow the guide and raise a backport PR; if not, let me know or add the dont-land-on label.

@ofrobots ofrobots deleted the fix-platform-tracing-shutdown-race branch October 10, 2018 16:56
@ofrobots
Contributor Author

v10.x Backport on #23398

ofrobots added a commit to ofrobots/node that referenced this pull request Oct 13, 2018
ofrobots added a commit that referenced this pull request Oct 15, 2018
jasnell pushed a commit that referenced this pull request Oct 17, 2018
MylesBorins pushed a commit that referenced this pull request Oct 30, 2018
@codebytere codebytere mentioned this pull request Nov 27, 2018
rvagg pushed a commit that referenced this pull request Nov 28, 2018
MylesBorins pushed a commit that referenced this pull request Nov 29, 2018
@codebytere codebytere mentioned this pull request Nov 29, 2018