
swarm state incorrect after docker daemon crashed in swarm manager #26223

Closed
ypjin opened this issue Sep 1, 2016 · 6 comments
Labels: area/swarm, kind/bug, version/1.12


ypjin commented Sep 1, 2016

This issue was found in the scenario described in #26193.

Output of docker version:

Client:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        
 OS/Arch:      linux/amd64

Output of docker info:
(on the swarm manager)

Containers: 3
 Running: 3
 Paused: 0
 Stopped: 0
Images: 4
Server Version: 1.12.1
Storage Driver: devicemapper
 Pool Name: docker-202:2-33554824-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 560.1 MB
 Data Space Total: 107.4 GB
 Data Space Available: 8.609 GB
 Metadata Space Used: 1.348 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.146 GB
 Thin Pool Minimum Free Space: 10.74 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.107-RHEL7 (2015-10-14)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host null overlay
Swarm: active
 NodeID: f2agzdvh57921758o7titm1d3
 Is Manager: true
 ClusterID: 4bv4z07ozbqm6n0539a0pqzuc
 Managers: 3
 Nodes: 7
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 10.145.54.218
Runtimes: runc
Default Runtime: runc
Security Options: seccomp
Kernel Version: 3.10.0-327.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.2 (Maipo)
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 3.518 GiB
Name: ip-10-145-54-218.us-west-1.compute.internal
ID: EXGE:OE7Y:ITP3:LP4R:TQ7P:5XMB:7DHQ:6HCN:FTYU:OS56:7RRX:6OQX
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
 arrow_cloud_db=true
Insecure Registries:
 127.0.0.0/8

Output of docker node ls:
(on the swarm manager)

ID                           HOSTNAME                                     STATUS  AVAILABILITY  MANAGER STATUS
6ktrpinx4f0cevx15e77n3s93    ip-10-226-3-190.us-west-1.compute.internal   Ready   Active        
78mg1bojlxjij2vcr1fvb8p5f    ip-10-226-29-23.us-west-1.compute.internal   Ready   Active        
7toslo9rc7dvysrvme3ceud96    ip-10-226-23-157.us-west-1.compute.internal  Ready   Active        Reachable
9fpxplu1hzxgandwfgk1u6gpk    ip-10-229-15-178.us-west-1.compute.internal  Ready   Active        
bh5ynqugx8w5qwr7m45c9q9cc    ip-10-229-34-235.us-west-1.compute.internal  Ready   Active        
dthkgyjyij2bpgc0dk4rpz07g    ip-10-145-33-221.us-west-1.compute.internal  Ready   Active        Reachable
f2agzdvh57921758o7titm1d3 *  ip-10-145-54-218.us-west-1.compute.internal  Ready   Active        Leader

Additional environment details (AWS, VirtualBox, physical, etc.):

AWS, Red Hat Enterprise Linux 7.2 (HVM), SSD Volume Type - ami-d1315fb1

Steps to reproduce the issue:
I have a swarm (swarm mode) with three managers: one leader and two non-leader (reachable) managers.

  1. Deploy a service using the service create API (a sketch of how both calls are issued over the Engine API follows these two steps). The options object looks like this:
  var createOpts = {
        "Name": serviceName,
        "TaskTemplate": {
            "ContainerSpec": {
                "Image": imageRepoName,
                "Command": ["/usr/local/bin/start"],
                "Args": [""],
                "Mounts": [
                    {
                        "ReadOnly": false,
                        "Target": "/ctlog",
                        "Type": "volume"
                    }
                ],
                "Env": ["PORT=80"],
                "Labels": {
                    "version" : version
                }
            },
            "LogDriver": {
                "Name": "json-file",
                "Options": {
                    "max-file": "3",
                    "max-size": "10M"
                }
            },
            "Placement": {
                "Constraints": ["engine.labels.something == true"]
            },
            "Resources": {
                "Limits": {
                    "MemoryBytes": containerSize.RAM * 1048576
                },
                "Reservations": {
                    "MemoryBytes": containerSize.RAM * 1048576
                }
            },
            "RestartPolicy": {
                "Condition": "on-failure",
                "Delay": 1,
                "MaxAttempts": 3
            }
        },
        "Mode": {
            "Replicated": {
                "Replicas": num_servers_required
            }
        },
        "UpdateConfig": {
            "Delay": 10,
            "Parallelism": 2,
            "FailureAction": "pause"
        },
        "EndpointSpec": {
            "Ports": [
                {
                    "Protocol": "tcp",
                    //"PublishedPort": 8080,
                    "TargetPort": 80
                }
            ]
        }
    };
  2. Update the service using the service update API. The options object looks like this:
        var updateOpts = {
            "Name": serviceName,
            "TaskTemplate": {
                "ContainerSpec": {
                    "Image": imageRepoName,
                    "Command": ["/usr/local/bin/start"],
                    "Args": [""],
                    "Env": ["PORT=80"]
                },
                "Placement": {
                    "Constraints": ["engine.labels.something == true"]
                },
                "Resources": {
                    "Limits": {
                        "MemoryBytes": containerSize.RAM * 1048576
                    },
                    "Reservations": {
                        "MemoryBytes": containerSize.RAM * 1048576
                    }
                }
            },
            "Mode": {
                "Replicated": {
                    "Replicas": num_servers_required
                }
            },
            "version": data.Version.Index
        };

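For reference, here is a minimal sketch (not my exact code) of how the two calls above can be issued from Node.js against the local Engine API (v1.24) over the unix socket. The engineRequest helper, serviceId, and serviceName are placeholders made up for this sketch; createOpts, updateOpts, and data.Version.Index are the objects and values shown in the steps above.

var http = require('http');

// Generic helper: send a JSON request to the local Docker Engine API socket.
function engineRequest(method, path, body, done) {
    var payload = body ? JSON.stringify(body) : '';
    var req = http.request({
        socketPath: '/var/run/docker.sock',   // default Engine unix socket
        path: '/v1.24' + path,
        method: method,
        headers: {
            'Content-Type': 'application/json',
            'Content-Length': Buffer.byteLength(payload)
        }
    }, function (res) {
        var out = '';
        res.on('data', function (chunk) { out += chunk; });
        res.on('end', function () { done(null, res.statusCode, out); });
    });
    req.on('error', done);
    req.end(payload);
}

// Step 1: create the service with the createOpts object above.
engineRequest('POST', '/services/create', createOpts, function (err, status, body) {
    console.log('create:', status, body);
});

// Step 2: update the service with the updateOpts object above. The Engine API
// takes the current spec version as a query parameter (updateOpts above carries
// it as a "version" field, which is how some client libraries pass it along).
engineRequest('POST', '/services/' + serviceId + '/update?version=' + data.Version.Index,
    updateOpts, function (err, status, body) {
        console.log('update:', status, body);
    });
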
Describe the results you received:
The docker daemon crashed on the swarm manager node.

Aug 31 04:24:39 localhost docker: panic: runtime error: invalid memory address or nil pointer dereference
Aug 31 04:24:39 localhost docker: [signal 0xb code=0x1 addr=0x10 pc=0x14cf17c]
Aug 31 04:24:39 localhost docker: goroutine 777 [running]:
Aug 31 04:24:39 localhost docker: panic(0x1a7edc0, 0xc82000e070)
Aug 31 04:24:39 localhost docker: /usr/local/go/src/runtime/panic.go:481 +0x3e6

Aug 31 04:24:39 localhost docker: github.com/docker/swarmkit/manager/allocator/networkallocator.(*portAllocator).isPortsAllocated(0xc82001fd18, 0xc8209169c0, 0xc821762c00)
Aug 31 04:24:39 localhost docker: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/swarmkit/manager/allocator/networkallocator/portallocator.go:164 +0x4c
Aug 31 04:24:39 localhost docker: github.com/docker/swarmkit/manager/allocator/networkallocator.(*NetworkAllocator).IsServiceAllocated(0xc8208a75f0, 0xc8209169c0, 0xc820dcaf60)
Aug 31 04:24:39 localhost docker: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/swarmkit/manager/allocator/networkallocator/networkallocator.go:309 +0xeb
Aug 31 04:24:39 localhost docker: github.com/docker/swarmkit/manager/allocator.(*Allocator).doNetworkAlloc(0xc8208a6ab0, 0x7f576bb1dad8, 0xc820f9d800, 0x1b39bc0, 0xc820dcaf60)
Aug 31 04:24:39 localhost docker: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/swarmkit/manager/allocator/network.go:318 +0x369
Aug 31 04:24:39 localhost docker: github.com/docker/swarmkit/manager/allocator.(*Allocator).(github.com/docker/swarmkit/manager/allocator.doNetworkAlloc)-fm(0x7f576bb1dad8, 0xc820f9d800, 0x1b39bc0, 0xc820dcaf60)
Aug 31 04:24:39 localhost docker: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/swarmkit/manager/allocator/allocator.go:117 +0x48
Aug 31 04:24:39 localhost docker: github.com/docker/swarmkit/manager/allocator.(*Allocator).run(0xc8208a6ab0, 0x7f576bb1dad8, 0xc820f9d800, 0xc8217bcc60, 0xc8208a7380, 0x1d62578, 0x7, 0xc820e14270, 0xc820e14260)
Aug 31 04:24:39 localhost docker: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/swarmkit/manager/allocator/allocator.go:175 +0x118
Aug 31 04:24:39 localhost docker: github.com/docker/swarmkit/manager/allocator.(*Allocator).Run.func2.1(0xc820e14230, 0xc8208a6ab0, 0xc820e14210, 0xc8217bcc60, 0xc8208a7380, 0x1d62578, 0x7, 0xc820e14270, 0xc820e14260)
Aug 31 04:24:39 localhost docker: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/swarmkit/manager/allocator/allocator.go:142 +0xa9
Aug 31 04:24:39 localhost docker: created by github.com/docker/swarmkit/manager/allocator.(*Allocator).Run.func2
Aug 31 04:24:39 localhost docker: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/swarmkit/manager/allocator/allocator.go:143 +0x13c
Aug 31 04:24:39 localhost systemd: docker.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Aug 31 04:24:39 localhost systemd: Unit docker.service entered failed state.
Aug 31 04:24:39 localhost systemd: docker.service failed.

The panic seems to be caused by the options object for the service update API missing the following:

        "EndpointSpec": {
            "Ports": [
                {
                    "Protocol": "tcp",
                    "TargetPort": 80
                }
            ]
        }
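If that is the trigger, one caller-side workaround I can think of (an assumption, not a confirmed fix) is to carry the existing EndpointSpec over into the update spec, so the manager never receives an update without port information. Here service stands for the object returned by inspecting the service (GET /services/{id}) before the update:

// Hypothetical workaround sketch: reuse the current EndpointSpec in the update.
var safeUpdateOpts = Object.assign({}, updateOpts, {
    "EndpointSpec": service.Spec.EndpointSpec || {
        "Ports": [
            { "Protocol": "tcp", "TargetPort": 80 }
        ]
    }
});
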

But the bigger problem is this: when that manager went down, swarm promoted one of the other managers to be the leader, but the swarm state does not seem to be correct. When I deployed a new service ("docker service create"), the tasks of the service hung in the New state.
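For completeness, the stuck tasks can be listed through the API roughly like this (a sketch reusing the engineRequest helper from above; serviceName is the name of the newly created service):

// List the tasks of the new service and print their state; in this failure
// they stay in "new" instead of progressing to "running".
var filters = encodeURIComponent(JSON.stringify({ service: [serviceName] }));
engineRequest('GET', '/tasks?filters=' + filters, null, function (err, status, body) {
    if (err) return console.error(err);
    JSON.parse(body).forEach(function (task) {
        console.log(task.ID, task.Status.State, task.Status.Message);
    });
});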

Describe the results you expected:

The swarm should continue to work normally after the manager failover.

Additional information you deem important (e.g. issue happens only occasionally):

@icecrime added the area/swarm and kind/bug labels Sep 1, 2016
aaronlehmann (Contributor) commented:

Looks like moby/swarmkit#1481 has been opened to fix the same panic.

@aaronlehmann added this to the 1.12.2 milestone Sep 23, 2016
xiaods (Contributor) commented Sep 26, 2016

This issue can be closed; it is fixed by moby/swarmkit#1481.

thaJeztah (Member) commented:

Vendor PR for 1.12.2 is here: #26765

xiaods (Contributor) commented Sep 26, 2016

@thaJeztah thanks for your update. Waiting for #26765 to be merged.

lbguilherme commented:

#26765 has been merged. This issue can be closed.

thaJeztah (Member) commented:

Ah, yes, thanks for the ping. Closing.
