
Log Management #1679

Merged — merged 11 commits into moby:master from logbroker on Nov 4, 2016

Conversation

@aluzzardi (Member) commented Oct 22, 2016

This change adds support for log management in the manager, agent and
CLI.

The log broker is currently naive and broadcasts subscriptions to all
agents, which in turn need to perform filtering to figure out if the
subscription is relevant to them.

In the future, the broker should be smarter and dispatch subscriptions
only to concerned agents.

The basic logging functionality works.

Fixes #1332

/cc @stevvooe @aaronlehmann @diogomonica
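For readers skimming the design: a rough sketch of the agent-side filtering described above. The helper names and the exact LogSelector fields (task/service/node ID lists) are assumptions for illustration, not the code added by this PR.

// Sketch only: how an agent might decide whether a broadcast subscription is
// relevant to it. matchesSelector and the selector fields are assumptions.
func subscriptionIsRelevant(sub *api.SubscriptionMessage, running []*api.Task) bool {
	for _, task := range running {
		if matchesSelector(sub.Selector, task) {
			return true
		}
	}
	return false
}

func matchesSelector(sel *api.LogSelector, task *api.Task) bool {
	for _, id := range sel.TaskIDs {
		if id == task.ID {
			return true
		}
	}
	for _, id := range sel.ServiceIDs {
		if id == task.ServiceID {
			return true
		}
	}
	for _, id := range sel.NodeIDs {
		if id == task.NodeID {
			return true
		}
	}
	return false
}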

listeners map[*statusReporterKey]struct{}
secrets *secrets
db *bolt.DB
taskevents *events.Broadcaster
Collaborator:

This needs to be closed once it is no longer being used, otherwise it will leak a goroutine.
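A minimal sketch of what closing it could look like, assuming the worker gains a Close method (which it does later in this PR):

// Sketch: tie the broadcaster's lifetime to the worker so the goroutine
// started by events.NewBroadcaster is not leaked.
func (w *worker) Close() error {
	// events.Broadcaster implements Close; this stops its internal run loop.
	return w.taskevents.Close()
}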

Member Author:

I've updated the LogBroker - however, there's no way to terminate a Worker so it should never leak

return &LogBroker{
broadcaster: events.NewBroadcaster(),
broadcaster: events.NewBroadcaster(),
subscriptions: events.NewBroadcaster(),
Collaborator:

Same with these - they need to be closed when the log broker stops.

Take care when closing them that there are no sinks still attached to the broadcasters; otherwise it can deadlock. We have a wrapper in manager/state/watch that we use to protect against this, which you may find useful here.
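Roughly, the idea of that wrapper is to put a non-blocking queue between the broadcaster and each watcher's channel, so the broadcaster's write loop can never block on a slow or abandoned sink and Close cannot deadlock. A simplified sketch of the pattern (not the actual manager/state/watch code):

// Each watcher gets a channel sink wrapped in an unbounded events.Queue;
// writes to the queue never block, so closing the broadcaster stays safe
// even while watchers are still attached.
type queue struct {
	broadcast *events.Broadcaster
}

func (q *queue) Watch() (eventq chan events.Event, cancel func()) {
	ch := events.NewChannel(0)
	sink := events.Sink(events.NewQueue(ch))
	q.broadcast.Add(sink)
	return ch.C, func() {
		q.broadcast.Remove(sink)
		ch.Close()
		sink.Close()
	}
}

func (q *queue) Publish(item events.Event) {
	q.broadcast.Write(item)
}

func (q *queue) Close() error {
	return q.broadcast.Close()
}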

ctx, cancel := context.WithCancel(ctx)
defer cancel()

log.G(ctx).Infof("Received subscription %s", subscription.ID)
Collaborator:

It's probably a good idea to log the selectors as well.

}

ctx, cancel := context.WithCancel(pctx)
defer cancel()
Collaborator:

What's the cancel useful for? Do any of the function calls below spawn goroutines that outlive this function?

Member Author:

Removed

Collaborator:

> Removed

The WithCancel still seems to be in the code

Member Author:

I think that one in the controller is legit - it cancels out the event stream as soon as we hit a start event

Collaborator:

I get it now, thanks.


func (w *worker) Subscribe(ctx context.Context, subscription *api.SubscriptionMessage) error {
ctx, cancel := context.WithCancel(ctx)
defer cancel()
Collaborator:

What's the cancel useful for? Do any of the function calls below spawn goroutines that outlive this function?

Member Author:

Right - removed the sub-context


Data: parts[1],
}, false); err != nil {
return errors.Wrap(err, "failed publisher log message")
Collaborator:

"failed to publish log message"?

@aluzzardi force-pushed the logbroker branch 2 times, most recently from 860c496 to 23a5c94 on October 22, 2016 01:44
@aluzzardi changed the title from [WIP] Log Management to Log Management on Oct 22, 2016
@aluzzardi force-pushed the logbroker branch 3 times, most recently from f0a77e1 to 78e55a4 on October 22, 2016 02:10
@@ -69,3 +91,9 @@ func TestControllerFlowIntegration(t *testing.T) {
t.Fatalf("expected controller to be closed: %v", err)
}
}

type logPublisherFn func(ctx context.Context, message api.LogMessage, close bool) error
Contributor:

This isn't necessary since we define an exported version in the exec package.

Member Author (@aluzzardi, Oct 24, 2016):

Exec defines a log publisher, not a publish provider

Member Author (@aluzzardi, Oct 24, 2016):

Never mind, this is old code. Fixed

type LogPublisherFunc func(ctx context.Context, message api.LogMessage, close bool) error

// Publish calls the wrapped function.
func (fn LogPublisherFunc) Publish(ctx context.Context, message api.LogMessage, close bool) error {
Contributor:

No need to propagate close here. Just cancel the context.

Member Author:

How?

The only thing that can send a close is agent.Publisher (LogPublisherProvider) since down in the controller nobody has access to PublishLogsRequest, and I don't see how it can watch for context cancellations

for {
// so, message header is 8 bytes, treat as uint64, pull stream off MSB
var header uint64
if err := binary.Read(brd, binary.BigEndian, &header); err != nil {
Contributor:

Most of this should be moved into the adapter.
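For context on the framing being decoded in that loop: the container attach stream multiplexes stdout and stderr using an 8-byte header in which the first byte identifies the stream and the last four bytes give the payload length (big-endian). A hedged sketch of the decode, assuming that stdcopy-style layout:

// Read one multiplexed frame. With the header read as a big-endian uint64,
// the most significant byte is the stream (1=stdout, 2=stderr) and the low
// 32 bits are the payload size.
var header uint64
if err := binary.Read(brd, binary.BigEndian, &header); err != nil {
	return err
}
stream := byte(header >> 56) // MSB: stream descriptor
size := uint32(header)       // low 4 bytes: payload length
payload := make([]byte, size)
if _, err := io.ReadFull(brd, payload); err != nil {
	return err
}
_ = stream // route payload to stdout or stderr based on the descriptor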

@aluzzardi force-pushed the logbroker branch 2 times, most recently from e5ef9ff to 1c75bba on October 24, 2016 19:16
@codecov-io commented Oct 24, 2016

Current coverage is 55.32% (diff: 36.05%)

Merging #1679 into master will decrease coverage by 0.58%

@@             master      #1679   diff @@
==========================================
  Files            96         94     -2   
  Lines         15005      15190   +185   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           8390       8404    +14   
- Misses         5504       5685   +181   
+ Partials       1111       1101    -10   


Powered by Codecov. Last update be97f7f...3058ba9

@aluzzardi force-pushed the logbroker branch 4 times, most recently from 51da25f to ad6d3b1 on October 24, 2016 23:54
@aluzzardi (Member Author):

@aaronlehmann @stevvooe PTAL

@aaronlehmann (Collaborator):

> I've updated the LogBroker - however, there's no way to terminate a Worker so it should never leak

I don't understand. Doesn't docker swarm leave terminate the worker?

@aaronlehmann (Collaborator):

@aluzzardi: Unit tests seemed to hang in CI.

@aluzzardi (Member Author) commented Oct 25, 2016:

> I don't understand. Doesn't docker swarm leave terminate the worker?

It just stops the agent.

EDIT: Rephrasing. When you docker swarm leave, we stop the agent. All that does is stop the agent from going through its main processing loop; the Worker is never cleanly stopped. We'd need to change that.

I could change the Worker interface, add a Stop there, and plumb it into the agent - the problem now is the wrapper. It'd be weird if the agent imported stuff from the manager; any suggestions?

@aluzzardi (Member Author) commented Oct 25, 2016:

> @aluzzardi: Unit tests seemed to hang in CI.

make test works locally :/ I'll try make ci

UPDATE: Found the problem. It only happens with GOMAXPROCS=1 and it's due to the tests. Basically, they rely on the "agent" connecting to the broker before the "client" sends a logs command. In real life that should never happen (the disconnected agent shouldn't have containers matching the selector anyway), but this will be improved in a follow-up once we implement subscription filtering in the manager.

In the meantime, I had to add a sleep - I tried several other approaches, but none had a 100% success rate.

@@ -712,6 +729,7 @@ func (m *Manager) becomeLeader(ctx context.Context) {
// becomeFollower shuts down the subsystems that are only run by the leader.
func (m *Manager) becomeFollower() {
m.dispatcher.Stop()
m.logbroker.Stop()
Collaborator:

It looks like calling m.logbroker.Stop multiple times will panic. Since both becomeFollower and (*Manager).Stop can call m.logbroker.Stop, either logbroker's Stop function should protect against this (as I believe most Stop functions do), or the manager should set m.logbroker to nil after calling Stop and check before each use.
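A minimal sketch of such a guard, assuming a hypothetical mutex and a nil-able cancel function on the broker (not necessarily how the final change implements it):

// Sketch: make Stop idempotent so a second call is a no-op (or an error)
// instead of a panic. mu and cancelAll are hypothetical fields.
func (lb *LogBroker) Stop() error {
	lb.mu.Lock()
	defer lb.mu.Unlock()

	if lb.cancelAll == nil {
		// Already stopped, or never started.
		return errors.New("log broker is not running")
	}
	lb.cancelAll()
	lb.cancelAll = nil

	lb.logQueue.Close()
	lb.subscriptionQueue.Close()
	return nil
}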

@aaronlehmann (Collaborator) left a comment:

Mostly LGTM

Here are the remaining few things that I think should be addressed:

@aluzzardi (Member Author):

@aaronlehmann Thanks! Agreed on all counts, will make the changes

@aaronlehmann (Collaborator):

Oh, interesting, CI found a race. SubscribeLogs uses the queue created in Run without locking. Stop should probably wait for all outstanding RPCs to terminate before returning, and RPCs should error out if the log broker is not currently running. I think I mentioned this before but maybe the comment got lost.
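A hedged sketch of the guard being asked for, using hypothetical running/mu fields on the broker:

// Each RPC would call this before touching the queues created by Run, and
// error out if the broker has been stopped. A sync.WaitGroup incremented on
// RPC entry and waited on in Stop would additionally make Stop block until
// outstanding RPCs finish.
func (lb *LogBroker) checkRunning() error {
	lb.mu.RLock()
	defer lb.mu.RUnlock()
	if !lb.running {
		return grpc.Errorf(codes.Unavailable, "log broker is not running")
	}
	return nil
}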

@aluzzardi (Member Author):

@aaronlehmann Yeah, right, I was looking into that now.

I'm seeing a similar pattern in dispatcher as well (check if running on every RPC).

I'll adopt the same pattern, but I'm wondering if that's repeated work. Shouldn't the manager basically take care of 1) stopping routing traffic to the service and 2) calling Stop, and do the opposite on promotion? That way our services could be "dumber" (in terms of init/deinit logic).

Building on that, we could actually have a Servicer (bad name, I know) interface with Run & Stop, and the manager could simply keep a list of services and perform Run + Mount on startup/promotion and Unmount + Stop on demotion. Then adding a new service such as the broker would just be a matter of appending the object to the list.
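For what it's worth, a sketch of that Servicer idea; this interface is purely illustrative and does not exist in the codebase:

// Hypothetical interface letting the manager treat leader-only subsystems
// (dispatcher, log broker, CA, ...) uniformly on promotion/demotion.
type Servicer interface {
	Run(ctx context.Context) error
	Stop()
}

// On promotion the manager would start every registered service; on demotion
// it would stop them in reverse order. Error handling elided in this sketch.
func startServices(ctx context.Context, services []Servicer) {
	for _, s := range services {
		go s.Run(ctx)
	}
}

func stopServices(services []Servicer) {
	for i := len(services) - 1; i >= 0; i-- {
		services[i].Stop()
	}
}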

@aluzzardi (Member Author):

Regarding the above - I'm also wondering if the broker really needs a Stop function. Wouldn't it be simpler if the demotion code just stopped the actual server altogether, cancelling all active connections?

@aaronlehmann (Collaborator):

How do you stop the server without killing the whole gRPC listener?

@aaronlehmann (Collaborator):

> I'll adopt the same pattern, but I'm wondering if that's repeated work. Shouldn't the manager basically take care of 1) stopping routing traffic to the service and 2) calling Stop, and do the opposite on promotion? That way our services could be "dumber" (in terms of init/deinit logic).

Maybe, but it's not easy. (1) probably involves some codegenned wrappers, unless there's some way I don't know of to "stop traffic" for a service (there is no Unregister function).

@aluzzardi (Member Author):

Just pushed an update addressing the remaining comments (I only ran make test so other things may fail, I'll fix that tomorrow).

There are a bunch of different changes, but the most relevant ones are:

> If worker creates a broadcaster, it needs to have a well-defined lifecycle so the broadcaster's resources can be freed. There are some real-world cases in Docker where multiple workers would be created without a restart in between; for example, after encryption-at-rest lands, each attempt to unlock the manager will create a worker.

@aaronlehmann Added stop safeguards in 9a03850

> https://github.com/docker/swarmkit/pull/1679/files#r85861157 needs to be addressed

This is about the agent leaking resources (the worker's events.Broadcaster) because we don't cleanly shut down the worker.

I've changed the worker interface to add a Close function (75a4aa5) and also moved the manager/state/watch into its own package (ab301df).

@stevvooe Can you verify if the worker interface and agent integration look alright?
@aaronlehmann Could you double check if the leaks are gone?

> I'd like https://github.com/docker/swarmkit/pull/1679/files#r85228089 to be considered before merging.

That's about logs being unary RPCs rather than streams.

I've implemented that in 49e48be

@stevvooe @aaronlehmann Are you happy with how it looks?
@LK4D4 I had to change the codegen wrapper to support client streaming. Turns out, the interface for both Client and Client&Server is the same. Could you check if it looks alright?

@@ -228,7 +235,13 @@ func testLogBrokerEnv(t *testing.T) (context.Context, *LogBroker, *ca.SecurityCo
}
brokerClient := api.NewLogBrokerClient(brokerCc)

go func() {
broker.Run(ctx)
}()
Collaborator:

nit: Don't need a closure.

}))
}

func (lb *LogBroker) publish(log *api.PublishLogsRequest) {
Collaborator:

I think PublishLogs needs to use this.

lb.logQueue.Close()
lb.subscriptionQueue.Close()

return nil
Collaborator:

I think it's good practice for Stop to block until all current RPCs are finished. I don't see anything especially dangerous in the RPCs like interactions with Raft, but we may add something in the future. Also, right now it's theoretically possible for an RPC started before a leader reelection to interact with the new queue created by Run, which feels wrong.

Member Author:

I followed the same pattern as the dispatcher and keymanager - neither of them has a blocking Stop (and they do interact with Raft).

Do you want me to make Stop blocking, or keep it consistent with the rest?

LogSelector selector = 1;

LogSubscriptionOptions options = 2;
}

message SubscribeLogsMessage {
repeated LogMessage messages = 1 [(gogoproto.nullable) = false];
LogMessage message = 1 [(gogoproto.nullable) = false];
Collaborator:

It still seems useful for this to be a repeated field so we can send down multiple messages at a time if we later decide to. That will probably cut down some framing overhead.

// the contents of this message.
bool close = 3;
// Messages is the log message for publishing.
LogMessage message = 2 [(gogoproto.nullable) = false];
Collaborator:

Same here, I think there's a valid use case for sending many log messages at once even though this is a stream.

@aluzzardi (Member Author):

@aaronlehmann Addressed your comments except for blocking Stop (waiting on your response on whether we should do it or not)

@aaronlehmann (Collaborator):

Sorry for the misinformation about blocking Stop. I was sure that the dispatcher did it that way. Turns out it was changed here: 6ec2d00

Let's keep Stop non-blocking for now, to be consistent with dispatcher and ca.

apiOptions := types.ContainerLogsOptions{
Follow: options.Follow,

// TODO(stevvooe): Parse timestamp out of message. This
Contributor:

This comment is no longer valid. We parse below. :)

Sorry for not removing it.

This change adds support for log management in the manager, agent and
CLI.

The log broker is currently naive and broadcasts subscriptions to all
agents, which in turn need to perform filtering to figure out if the
subscription is relevant to them.

In the future, the broker should be smarter and dispatch subscriptions
only to concerned agents.

The basic logging functionality works.

Fixes moby#1332

Signed-off-by: Andrea Luzzardi <[email protected]>
The broker now keeps track of all active subscriptions. When a new agent
joins, it will receive all actives before receiving new subscriptions.

Signed-off-by: Andrea Luzzardi <[email protected]>
@aaronlehmann (Collaborator) left a comment:

LGTM

@aluzzardi merged commit 2eaae1a into moby:master on Nov 4, 2016
@aluzzardi deleted the logbroker branch on November 4, 2016 19:02