
trace_event: destroy platform before tracing #22938

Closed

Conversation

ofrobots
Contributor

@ofrobots ofrobots commented Sep 18, 2018

For safer shutdown, we should destroy the platform (and platform
threads) before the tracing infrastructure is destroyed. This change
fixes the relative order of NodePlatform disposal and the tracing
agent shutting down, matching the nesting order used at startup.

Make the tracing agent own the tracing controller instead of the platform,
to match the above rationale.

Fixes: #22865
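
As a rough illustration of the intended ordering (hypothetical, simplified types; not the PR's actual code), the platform and the worker threads it owns go down first, and only afterwards is the tracing agent, which now owns the TracingController, destroyed:

// Hypothetical sketch of the teardown ordering described above. The real types
// in core are NodePlatform, tracing::Agent and v8::platform::tracing::TracingController.
#include <memory>

struct TracingController {};

struct Agent {
  // After this change the tracing agent owns the controller.
  std::unique_ptr<TracingController> tracing_controller{new TracingController()};
};

struct Platform {
  explicit Platform(TracingController* tc) : controller(tc) {}
  void Shutdown() { /* join platform worker threads here */ }
  TracingController* controller;  // borrowed from the agent, not owned
};

int main() {
  Agent agent;                                         // startup: agent + controller first
  Platform platform(agent.tracing_controller.get());   // then the platform on top

  platform.Shutdown();  // shutdown: the platform (and its threads) goes down first...
  return 0;             // ...the agent and its controller are destroyed last, on return
}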

This should fix the thread races we have been observing with trace events. I have been running this on the FreeBSD box that was showing flakes in the CI:

[freebsd@test-digitalocean-freebsd11-x64-2 ~/ofrobots]$ tools/test.py -J test/parallel/test-trace-events-fs-sync.js --repeat 1000
[28:25|% 100|+ 1000|-   0]: Done
[freebsd@test-digitalocean-freebsd11-x64-2 ~/ofrobots]$ tools/test.py -J test/parallel/test-trace-events-fs-sync.js --repeat 9999
[290:38|% 100|+ 9999|-   0]: Done

I believe this makes 92b695e unnecessary (but harmless). I can revert that if this change sticks and the CI proves that the flakiness is gone.

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • commit message follows commit guidelines

CI: https://ci.nodejs.org/job/node-test-pull-request/17302/

@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. lib / src Issues and PRs related to general changes in the lib or src directory. labels Sep 18, 2018
@Trott
Member

Trott commented Sep 19, 2018

The Win10 failures are odd, but they have to be unrelated, right?

Windows rebuild: https://ci.nodejs.org/job/node-test-commit-windows-fanned/20869/

@ofrobots
Contributor Author

The Windows tests are consistently failing with:

10:49:26     AssertionError [ERR_ASSERTION]: Expected inputs to be strictly equal:
10:49:26     + actual - expected
10:49:26     
10:49:26     + 3221225477
10:49:26     - 1

3221225477 is 0xC0000005, which is an access violation. This needs investigating.
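
For reference, a tiny standalone check (not part of the test harness) confirming that the decimal status maps to the access-violation code:

// Print the child's exit status in hex; 3221225477 is STATUS_ACCESS_VIOLATION.
#include <cstdint>
#include <cstdio>

int main() {
  std::uint32_t status = 3221225477u;
  std::printf("0x%08X\n", static_cast<unsigned>(status));  // prints 0xC0000005
  return 0;
}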

Member

@mcollina mcollina left a comment


LGTM

@Trott
Member

Trott commented Sep 19, 2018

/ping @nodejs/platform-windows on the access-violation Windows failures that this change apparently causes.

@ofrobots
Contributor Author

I plan to take a look at this once I get some free cycles; today has been quite busy. In the meantime, I would appreciate others' help.

For safer shutdown, we should destroy the platform – and background
threads - before the tracing infrastructure is destroyed. This change
fixes the relative order of NodePlatform disposition and the tracing
agent shutting down. This matches the nesting order for startup.

Make the tracing agent own the tracing controller instead of platform
to match the above.

Fixes: nodejs#22865
@ofrobots ofrobots force-pushed the fix-platform-tracing-shutdown-race branch from b2718fe to db9b5ff Compare September 20, 2018 21:45
@ofrobots
Contributor Author

I have been having a hard time reproducing this failure on my local Windows machine. Even in the CI it seems to be a flaky issue, with different tests failing depending on shutdown timing. Failures typically take the form of a segfault in a child process. If someone could help me grab the crashing stack trace somehow, that would help tremendously.

BTW, it would be a nice feature if our test infrastructure were capable of capturing stack traces on segfaults automatically. It would make root-cause analysis of failures tremendously simpler.

@BridgeAR
Member

@nodejs/build-infra @nodejs/build would it be possible to support what @ofrobots suggested?

BTW, it would be a nice feature if our test infrastructure were capable of capturing stack traces on segfaults automatically. It would make root-cause analysis of failures tremendously simpler.

@@ -48,8 +48,7 @@ using v8::platform::tracing::TraceConfig;
 using v8::platform::tracing::TraceWriter;
 using std::string;
 
-Agent::Agent() {
-  tracing_controller_ = new TracingController();
+Agent::Agent() : tracing_controller_(new TracingController()) {
Contributor


If the TracingController's lifetime is 1:1 with the Agent's, make it a member instead of a pointer.

Contributor Author


At this point, the lifetimes aren't perfectly aligned. After this PR lands, I intend to refactor and merge the Agent and TracingController concepts into a single structure.
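
For clarity, a hypothetical sketch (not code from this PR) of what holding the controller by value, as suggested above, might look like once the lifetimes do line up:

// Assumes the controller's lifetime is exactly 1:1 with the Agent's: hold it by
// value so it is constructed and destroyed with the Agent, with no new/delete.
class TracingController { /* ... */ };

class Agent {
 public:
  TracingController* GetTracingController() { return &tracing_controller_; }

 private:
  TracingController tracing_controller_;  // plain member instead of an owning pointer
};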

@ofrobots
Contributor Author

ofrobots commented Sep 24, 2018

After a lot of flailing (due to my inexperience with Windows), I was able to capture a crash dump. On Unix this would have been as simple as ulimit -c unlimited.

There were lots of false starts, though. For example, the internet suggests that setting some registry keys should enable dump generation; that didn't work for me. There are also instructions on our issue tracker about 'Dumps on Silent Process Exit'; that didn't help either, since it generates a dump every time a child process exits 'silently', which happens frequently.

Ultimately, I added code to node_main.cc that programmatically generates a crash dump; a rough sketch of that kind of handler is below. This worked. Perhaps we can add this permanently (under a flag)? Perhaps node-report already does this and will solve the problem once it is merged into core?
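
Since the snippet itself isn't reproduced above, here is a rough sketch of that kind of handler. The helper name WriteCrashDump and the dump file name are made up for illustration; SetUnhandledExceptionFilter and MiniDumpWriteDump are real Win32/DbgHelp APIs (link against dbghelp.lib):

// Hypothetical sketch: install an unhandled-exception filter that writes a
// minidump, so a crashing child process leaves a .dmp file whose stack trace
// can be inspected in WinDbg or Visual Studio.
#include <windows.h>
#include <dbghelp.h>

static LONG WINAPI WriteCrashDump(EXCEPTION_POINTERS* info) {
  HANDLE file = CreateFileA("node_crash.dmp", GENERIC_WRITE, 0, nullptr,
                            CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
  if (file != INVALID_HANDLE_VALUE) {
    MINIDUMP_EXCEPTION_INFORMATION mei;
    mei.ThreadId = GetCurrentThreadId();
    mei.ExceptionPointers = info;
    mei.ClientPointers = FALSE;
    MiniDumpWriteDump(GetCurrentProcess(), GetCurrentProcessId(), file,
                      MiniDumpNormal, &mei, nullptr, nullptr);
    CloseHandle(file);
  }
  // Fall through to default handling so the process still exits with
  // STATUS_ACCESS_VIOLATION (0xC0000005), as seen in the CI output above.
  return EXCEPTION_CONTINUE_SEARCH;
}

int main() {
  // Installed early (e.g. at the top of node_main.cc), possibly guarded by an
  // environment variable or a command-line flag.
  SetUnhandledExceptionFilter(WriteCrashDump);
  // ... rest of the program ...
  return 0;
}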

Anyway, back to the problem at hand: the tests that are flaking are ones that manually call process.exit. It seems that the background worker threads are still in the process of being initialized when the main thread is shutting down. This path still has some data races in some of the trace event globals.

@ofrobots
Contributor Author

I have verified that this Windows crash is not caused by the change here; it is a long-standing issue with thread timing on Windows that happens to be exposed by this change.

Without my patch, I can reproduce a crash by introducing a manual delay in the background worker thread startup:

static void PlatformWorkerThread(void* data) {
  fprintf(stderr, ""); // write an empty string to stderr, just to introduce a delay.
  TRACE_EVENT_METADATA1("__metadata", "thread_name", "name",
                        "PlatformWorkerThread");
  TaskQueue<Task>* pending_worker_tasks = static_cast<TaskQueue<Task>*>(data);
  while (std::unique_ptr<Task> task = pending_worker_tasks->BlockingPop()) {
    task->Run();
    pending_worker_tasks->NotifyOfCompletion();
  }
}

This results in a segfault in the child process (3221225477 is the decimal form of the access-violation code 0xC0000005).

C:\workspace\ofrobots\test\common\index.js:662
const crashOnUnhandledRejection = (err) => { throw err; };
                                             ^

AssertionError [ERR_ASSERTION]: Expected inputs to be strictly equal:
+ actual - expected

+ 3221225477
- 1
    at execFile.catch.common.mustCall (C:\workspace\ofrobots\test\parallel\test-child-process-promisified.js:47:14)
    at C:\workspace\ofrobots\test\common\index.js:349:15
    at process._tickCallback (internal/process/next_tick.js:68:7)

So far I can make the crash happen on Windows only. The problem is that the background worker is still doing I/O when the main thread calls exit.

I'll open a separate issue for this. This PR is blocked until this can be resolved.

@ofrobots
Contributor Author

Opened issue #23065.

@Trott Trott added the blocked PRs that are blocked by other issues or PRs. label Sep 24, 2018
@Trott
Member

Trott commented Sep 24, 2018

Added blocked label until #23065 gets resolved. 😞

@Trott
Member

Trott commented Oct 6, 2018

#23065 is resolved, so I believe this is now unblocked.

@Trott Trott removed the blocked PRs that are blocked by other issues or PRs. label Oct 6, 2018
Trott pushed a commit to Trott/io.js that referenced this pull request Oct 6, 2018
For safer shutdown, we should destroy the platform – and background
threads - before the tracing infrastructure is destroyed. This change
fixes the relative order of NodePlatform disposition and the tracing
agent shutting down. This matches the nesting order for startup.

Make the tracing agent own the tracing controller instead of platform
to match the above.

Fixes: nodejs#22865

PR-URL: nodejs#22938
Reviewed-By: Eugene Ostroukhov <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Matteo Collina <[email protected]>
@Trott
Member

Trott commented Oct 6, 2018

Landed in 68b3e46

@targos
Member

targos commented Oct 6, 2018

Should this be backported to v10.x-staging? If yes, please follow the guide and raise a backport PR; if not, let me know or add the dont-land-on label.

@ofrobots ofrobots deleted the fix-platform-tracing-shutdown-race branch October 10, 2018 16:56
@ofrobots
Contributor Author

v10.x Backport on #23398

ofrobots added a commit to ofrobots/node that referenced this pull request Oct 13, 2018
ofrobots added a commit that referenced this pull request Oct 15, 2018
jasnell pushed a commit that referenced this pull request Oct 17, 2018
MylesBorins pushed a commit that referenced this pull request Oct 30, 2018
@codebytere codebytere mentioned this pull request Nov 27, 2018
rvagg pushed a commit that referenced this pull request Nov 28, 2018
MylesBorins pushed a commit that referenced this pull request Nov 29, 2018
@codebytere codebytere mentioned this pull request Nov 29, 2018