Deadlock in BlockingDictionary #175

Closed
johnsusi opened this issue Jan 15, 2020 · 9 comments

@johnsusi

First a big thanks for all your hard work. It is very appreciated!

We are having some issues in a system that publishes a lot of data (lots of small messages, < 1 KB each).

I have narrowed it down to multithreading and BlockingDictionary. My guess is that the issue arises because Remove and waitForSpace take the locks in a different order.
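
To show what I mean by the lock ordering, here is a minimal, self-contained sketch of that pattern. The lock names (dLock/addLock) and the order each method takes them in are my assumption about the internals, not the actual library code; the point is just that two threads taking the same pair of locks in opposite order will end up waiting on each other forever.

    using System;
    using System.Threading;

    // Minimal sketch of a lock-order inversion: thread A takes dLock then addLock,
    // thread B takes addLock then dLock. Once both hold their first lock, each
    // blocks forever waiting for the other's. (Names are assumed, not the real code.)
    class LockOrderInversionSketch
    {
        static readonly object dLock = new object();
        static readonly object addLock = new object();

        static void Main()
        {
            var remover = new Thread(() =>
            {
                lock (dLock)                // "Remove": dLock first...
                {
                    Thread.Sleep(100);      // widen the race window
                    lock (addLock)          // ...then addLock to pulse waiting adders
                    {
                        Monitor.Pulse(addLock);
                    }
                }
            });

            var adder = new Thread(() =>
            {
                lock (addLock)              // "waitForSpace": addLock first...
                {
                    Thread.Sleep(100);
                    lock (dLock)            // ...then dLock to check the count
                    {
                        Console.WriteLine("space available");
                    }
                }
            });

            remover.Start();
            adder.Start();
            remover.Join();                 // never returns: classic deadlock
            adder.Join();
        }
    }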

Here is a small test that, on my computer, hits the deadlock every time, though timing is probably a factor:

        public static void DeadlockTest()
        {
          var options = StanOptions.GetDefaultOptions();
          options.MaxPubAcksInFlight = 100;
          using var connection = new StanConnectionFactory().CreateConnection("test-cluster", "test-client", options);

          var subject = Guid.NewGuid().ToString();
          var payload = new byte[1024];

          long count = 0;

          // Publisher loop: publish as fast as possible with a no-op ack handler.
          Action action = () =>
          {
            while (true)
            {
              connection.Publish(subject, payload, (source, args) =>
              {
              });
              Interlocked.Increment(ref count);
            }
          };

          // Reporter: print the cumulative publish rate once per second.
          Func<Task> report = async () =>
          {
            var sw = new Stopwatch();
            sw.Start();
            while (true)
            {
              await Task.Delay(1000);
              var _count = Interlocked.Read(ref count);
              Console.WriteLine($"Publish {_count} in {sw.ElapsedMilliseconds} ms. {1000*_count/sw.ElapsedMilliseconds} msg/sec.");
            }
          };

          // Two concurrent publishers are enough to hit the deadlock here.
          Task.WaitAll(
            Task.Run(action),
            Task.Run(action),
            report()
          );
        }

The server was run with

nats-streaming-server -p 4222 -st file -dir $PWD/data

on a MacBook Pro and on a Windows 10 PC.

I tried to fix the problem myself, but no progress so far.

@watfordgnf
Collaborator

I'm able to remove the deadlock with the following change:

        internal bool Remove(TKey key, out TValue value, int timeout)
        {
            bool rv = false;
            bool wasAtCapacity = false;

            value = default(TValue);

            lock (dLock)
            {
                if (!finished)
                {
                    // check and wait if empty
                    while (d.Count == 0)
                    {
                        if (timeout < 0)
                        {
                            Monitor.Wait(dLock);
                        }
                        else
                        {
                            if (timeout > 0)
                            {
                                if (Monitor.Wait(dLock, timeout) == false)
                                {
                                    throw new Exception("timeout");
                                }
                            }
                        }
                    }

                    if (!finished)
                    {
                        rv = d.TryGetValue(key, out value);
                    }
                }

                if (rv)
                {
                    wasAtCapacity = d.Count >= maxSize;
                    d.Remove(key);
                }
            }

            // wasAtCapacity is updated inside a lock
            // so move waiting on addLock outside of dLock
            if (wasAtCapacity)
            {
                lock (addLock)
                {
                    Monitor.Pulse(addLock);
                }
            }

            return rv;

        } // Remove

@watfordgnf
Collaborator

I'm evaluating whether other changes may be more appropriate, such as not holding StanConnection.mu when calls to BlockingDictionary.Remove are made.
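
Roughly, the shape I have in mind is something like the sketch below. The class and member names here are stand-ins for illustration (not the actual StanConnection/BlockingDictionary code); the point is just to hold mu only for connection state and let the ack map do its own locking after mu is released.

    using System;
    using System.Collections.Concurrent;

    // Sketch only: what "don't call Remove while mu is held" could look like.
    // ConnectionSketch, processAck and the ConcurrentDictionary stand-in are
    // illustrative, not the real client code.
    class ConnectionSketch
    {
        private readonly object mu = new object();
        private readonly ConcurrentDictionary<string, Action<string, Exception>> pubAckMap =
            new ConcurrentDictionary<string, Action<string, Exception>>();
        private bool closed;

        internal void markClosed()
        {
            lock (mu) closed = true;
        }

        internal void processAck(string guid)
        {
            // Hold mu only long enough to read connection state.
            lock (mu)
            {
                if (closed)
                    return;
            }

            // The removal (and whatever locking it implies internally) happens with
            // mu already released, so no second lock can end up nested inside it.
            if (pubAckMap.TryRemove(guid, out var handler))
                handler?.Invoke(guid, null);
        }
    }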

@ColinSullivan1
Member

@watfordgnf, really appreciate you taking a look at this. Thanks!

watfordgnf added commits to watfordgnf/csharp-nats-streaming that referenced this issue Jan 15, 2020
- pubAckMap's reference cannot change under mu in StanConnection
- …s-io#175
- This resolves a deadlock seen with unexpected connection failures

@johnsusi
Author

Wow that was quick, thanks :)

We solved it temporarily by replacing the wait-for-space call with a simple sleep/retry loop:

               while (!pubAckMap.TryAdd(guidValue, a))
                {
                    var bd = pubAckMap;

                    Monitor.Exit(mu);
                    // Wait for space outside of the lock so 
                    // acks can be removed.
                    // bd.waitForSpace();
                    Thread.Sleep(10);
                    Monitor.Enter(mu);

                    if (nc == null)
                    {
                        throw new StanConnectionClosedException();
                    }
                }

It should not affect performance, since this only happens when we are writing too fast anyway.

Looking forward to trying out your changes.

watfordgnf added commits to watfordgnf/csharp-nats-streaming that referenced this issue Jan 15, 2020
@watfordgnf
Collaborator

watfordgnf commented Jan 15, 2020

@johnsusi I applied a very similar fix as well: the client no longer waits for space in the pubAckMap for longer than the ping interval, which fixes a second deadlock seen with unexpected connection failures. The deadlock you saw is removed by ensuring addLock and dLock are never held along the same path.

Another interesting bit is that pubAckMap.Remove no longer needs protection under mu, which removes a possible point of contention under load.
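
To make the ping-interval bound concrete, here is a minimal sketch of the idea (stand-in names and types, not the code that actually landed): waits for ack space happen in bounded slices, with the connection re-checked between slices, so a dead connection surfaces as an error rather than an indefinite wait.

    using System;
    using System.Threading;

    // Sketch: bounded wait for an in-flight-ack slot. Not the real BlockingDictionary;
    // just the "never wait longer than one ping interval per attempt" pattern.
    class BoundedWaitSketch
    {
        private readonly object addLock = new object();
        private readonly int maxSize;
        private int count;
        private volatile bool connectionClosed;

        public BoundedWaitSketch(int maxSize) => this.maxSize = maxSize;

        // Block until there is room for one more in-flight ack, but never sleep
        // longer than one ping interval before re-checking connection health.
        public void WaitForSpace(int pingIntervalMillis)
        {
            lock (addLock)
            {
                while (count >= maxSize)
                {
                    Monitor.Wait(addLock, pingIntervalMillis);

                    if (connectionClosed)
                        throw new InvalidOperationException("connection closed while waiting for ack space");
                }
                count++;
            }
        }

        // Called when an ack arrives: free a slot and wake one waiting publisher.
        public void ReleaseSpace()
        {
            lock (addLock)
            {
                count--;
                Monitor.Pulse(addLock);
            }
        }

        // Called on connection failure: wake everyone so they can observe the flag.
        public void MarkClosed()
        {
            connectionClosed = true;
            lock (addLock) Monitor.PulseAll(addLock);
        }
    }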

watfordgnf added a commit to watfordgnf/csharp-nats-streaming that referenced this issue Jan 15, 2020
watfordgnf added the bug label Jan 15, 2020
watfordgnf self-assigned this Jan 15, 2020
@johnsusi
Author

Very nice @watfordgnf, it seems my issue is resolved, at least with my simple example. I will try it out with our real system, see if any problems arise, and report back here afterwards.

@ColinSullivan1
Member

@johnsusi, would love the feedback - much appreciated!

@johnsusi
Author

Everything seems to be working now, so a big thumbs up from here.

@ColinSullivan1
Member

Thanks for checking @johnsusi!

ColinSullivan1 pushed a commit that referenced this issue Jan 22, 2020
* Do not hold addLock inside dLock in BlockingDictionary #175
* Remove unnecessary lock in processAck #175
* Do not wait for space indefinitely in pubAckMap #175
* Re-enable TestMaxPubAckInflight #175