Vault Crashes OOM #4803

Closed
rshsmi opened this issue Jun 20, 2018 · 29 comments

@rshsmi

rshsmi commented Jun 20, 2018

Describe the bug
When a Vault node in a cluster of 6 with a Consul back-end becomes leader, CPU load and memory usage increase to the point where it crashes because of OOM.

To Reproduce
Steps to reproduce the behavior:
A Vault node becomes leader and crashes with:
fatal error: runtime: out of memory

Expected behavior
The node that becomes leader should keep serving requests without unbounded memory growth or an OOM crash.

Environment:

  • Vault Server Version (retrieve with vault status):
    Key             Value
    ---             -----
    Seal Type       shamir
    Sealed          false
    Total Shares    5
    Threshold       3
    Version         0.10.1
    Cluster Name    vault-cluster-
    Cluster ID
    HA Enabled      true
    HA Cluster      https://
    HA Mode         active

  • Server Operating System/Architecture:

Docker version 18.03.1-ce, build 9ee9f40
4.14.48-coreos-r2

Vault server configuration file(s):

{"storage": {"consul": {"address": ":8500", "path": "vault", "token": "${consul_token}" }}, "listener": { "tcp": { "address": "0.0.0.0:443", "tls_cert_file": "/vault/certs/tls.crt", "tls_key_file": "/vault/certs/tls.key" }}, "disable_cache": "true", "disable_mlock": "true", "ui": "true"}

Additional context
Consul back-end : Consul 1.1.0
Crash log: vault_crash.log
Prometheus Monitoring Graph:
(screenshot: vault)

@jefferai
Member

It sounds like you have a huge number of leases, so when Vault loads them as a node becomes active, you OOM. Probably you need to increase memory to get past the immediate issue, and then figure out why you have so many leases. (Running trace mode on logs will have the lease loading print out a total and a running load count.)

@rshsmi
Author

rshsmi commented Jun 20, 2018

@jefferai - Thanks for your response. The total number of leases we have is around 100. Since I created the ticket we've upgraded to 0.10.2 and changed the log level to TRACE; since the upgrade, the only logs we see are the standard rollback messages.

@jefferai
Member

You could try upgrading to 0.10.3, although I doubt there's anything relevant (but it's a good idea anyways).

Depending on what the issue is, based on your log, 0.10.2 may in fact help. We've fixed issues in 0.10.1 and 0.10.2 related to revocation logic.

Can you get total counts of items in Consul under sys/expire/id and sys/token/id?
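
For reference, a minimal sketch (not from the original thread) of counting those entries with the Consul Go API (github.com/hashicorp/consul/api), assuming the "vault" path prefix from the storage stanza above; an equivalent curl against Consul's KV keys endpoint would work just as well:

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Reads CONSUL_HTTP_ADDR / CONSUL_HTTP_TOKEN from the environment.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// "vault/" is the storage path prefix configured in this issue.
	for _, prefix := range []string{"vault/sys/expire/id/", "vault/sys/token/id/"} {
		// Keys lists KV key names under the prefix without fetching values.
		keys, _, err := client.KV().Keys(prefix, "", nil)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s: %d\n", prefix, len(keys))
	}
}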

@rshsmi
Author

rshsmi commented Jun 20, 2018

Thanks for your suggestions. In fact, I was just reviewing the changes in 0.10.3.
121 under sys/expire/id
59 under sys/token/id

@jefferai
Member

Please let us know if the problem persists under 0.10.2/0.10.3.

@rshsmi
Author

rshsmi commented Jun 20, 2018

The problem persists on 0.10.2.
(screenshot: vault-10-3)

@jefferai
Member

Can you get another crash log from 0.10.3? Since the logic has changed significantly it would be good to know the current stack trace.

Also, is there a reason you are running with disable_cache set to true? Does this issue go away if you remove that config option?

@rshsmi
Author

rshsmi commented Jun 20, 2018

I'm waiting for the 0.10.3 Docker image to be available. disable_cache is set to false (default value)
[DEBUG] expiration: leases collected: num_existing=117
The memory issue still persists.
(screenshot: disable_cache)

@jefferai
Member

{"storage": {"consul": {"address": ":8500", "path": "vault", "token": "${consul_token}" }}, "listener": { "tcp": { "address": "0.0.0.0:443", "tls_cert_file": "/vault/certs/tls.crt", "tls_key_file": "/vault/certs/tls.key" }}, "disable_cache": "true", "disable_mlock": "true", "ui": "true"}

It's set to true...

@rshsmi
Author

rshsmi commented Jun 20, 2018

I changed it to "false" after your last comment, then restarted all the nodes; the graph in my previous comment is from the leader with "disable_cache": "false".

@jefferai jefferai added this to the 0.10.4 milestone Jun 20, 2018
@rshsmi
Author

rshsmi commented Jun 21, 2018

Vault 0.10.2 crash log:
vault-0-10-2.log.txt

@briankassouf
Member

Could you also post the full vault server logs with any sensitive values redacted? It seems the crash is happening while trying to revoke a lease. It would be helpful to see any lease revocation errors from the logs.

@rshsmi
Author

rshsmi commented Jun 22, 2018

@briankassouf - I've been through logs on all nodes, here is what I found:
[ERROR] secrets.system.system_62531041: error retrieving lease: lease_id=/auth/ldap/login/XXXXXX error="failed to read lease entry: decryption failed: cipher: message authentication failed"
If I try to revoke it using the CLI:
vault lease revoke /auth/ldap/login/XXXXXX
the lease is revoked successfully:
[INFO ] expiration: revoked lease: lease_id=auth/ldap/login/

@briankassouf
Member

@ars05 Did anything change with your underlying datastore? Vault being unable to decrypt the data can mean the data changed out from under it. Perhaps from a bad consul restore.

That being said, it shouldn't cause an OOM. The full logs would be helpful to determine what's being loaded into memory and whether you got into some kind of loop attempting to revoke this lease.

@rshsmi
Author

rshsmi commented Jun 25, 2018

@briankassouf - We took a Consul snapshot of an old cluster and then restored it on the new cluster. Please find the full log here: vault.log

@briankassouf
Member

@ars05 I don't see anything suspicious in the log file, what calls are clients making during the time the memory is growing?

@briankassouf
Member

Also, were you ever able to try the 0.10.3 Docker container to see if it was fixed in that version?

@rshsmi
Author

rshsmi commented Jul 2, 2018

@briankassouf - here are the crash log and memory usage graph from 0.10.3 - the log level is set to trace.
vault-0-10-3.log
(screenshot: vault-0-10-3)

@briankassouf
Member

@ars05 Are you still experiencing this OOM, and could you describe the access patterns and which auth/secret mounts you are using?

@briankassouf briankassouf modified the milestones: 0.10.4, 0.10.5 Jul 24, 2018
@rshsmi
Author

rshsmi commented Jul 31, 2018

@briankassouf - this issue still persists. When a node becomes active, the memory usage increases to the point where it crashes with fatal error: runtime: out of memory, and another node in the cluster becomes active and goes through the same cycle.

In total we have 25 secrets engines enabled: 5 AWS, 1 Consul, 5 secrets, 1 SSH, and 13 PKI.
10-15 SRE engineers access Vault (mostly through the UI); AD/LDAP integration is enabled.
In terms of authentication methods, 3 Kubernetes clusters, ldap, approle, and token (enabled by default) are enabled/configured.

@kalafut
Contributor

kalafut commented Jul 31, 2018

@ars05 I just pushed a branch off of 0.10.3 that enables some basic memory profiling: https://github.com/hashicorp/vault/tree/profile-0.10.3 . Are you able to build from source?

This version will save memory stats files in $TMPDIR/vaultprof/ every 5 minutes. If you can run with this version through the event and send us a zip of the vaultprof folder, we should be able to get a better idea of what is using so much memory.
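
As a rough illustration of what such periodic profiling can look like (a hedged sketch in plain Go using runtime/pprof, not the actual code on the profile-0.10.3 branch; the WriteHeapProfiles helper is hypothetical), a goroutine could dump a heap profile to $TMPDIR/vaultprof/ on a timer:

package profiling

import (
	"fmt"
	"os"
	"path/filepath"
	"runtime"
	"runtime/pprof"
	"time"
)

// WriteHeapProfiles writes a heap profile to $TMPDIR/vaultprof/ on every tick
// of the given interval (e.g. 5 minutes) until stop is closed.
// Hypothetical helper for illustration only.
func WriteHeapProfiles(interval time.Duration, stop <-chan struct{}) error {
	dir := filepath.Join(os.TempDir(), "vaultprof")
	if err := os.MkdirAll(dir, 0700); err != nil {
		return err
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return nil
		case <-ticker.C:
			name := filepath.Join(dir, time.Now().UTC().Format("20060102_150405")+".pprof")
			f, err := os.Create(name)
			if err != nil {
				return err
			}
			runtime.GC() // flush up-to-date allocation statistics before dumping
			err = pprof.WriteHeapProfile(f)
			f.Close()
			if err != nil {
				return err
			}
			fmt.Printf("wrote memory profile: filename=%s\n", name)
		}
	}
}

The resulting .pprof files can then be inspected locally with go tool pprof.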

@jefferai jefferai modified the milestones: 0.10.5, 0.11 Aug 13, 2018
@chrishoffman chrishoffman added this to the 0.11 milestone Aug 13, 2018
@chrishoffman chrishoffman removed this from the 0.11 milestone Aug 25, 2018
@chrishoffman chrishoffman added this to the 0.11.1 milestone Aug 25, 2018
@jefferai jefferai modified the milestones: 0.11.1, 0.11.2 Sep 1, 2018
@rshsmi
Author

rshsmi commented Sep 4, 2018

@kalafut - I've built the binary from the profile-0.10.3 branch and deployed it. The configuration is exactly the same as on the other nodes, and here is the log at startup:

==> Vault server configuration:

             Api Address: https://IP_ADDRESS:443
                     Cgo: disabled
         Cluster Address: https://IP_ADDRESS:444
              Listener 1: tcp (addr: "0.0.0.0:443", cluster address: "0.0.0.0:444", max_request_duration: "1m30s", max_request_size: "33554432", tls: "enabled")
               Log Level: info
                   Mlock: supported: true, enabled: false
                 Storage: consul (HA available)
                 Version: Vault v0.10.4
             Version Sha: 8e427f20e3174039d5a49d7e7ab2566ddacf2ce2+CHANGES

==> Vault server started! Log data will stream in below:

2018-09-04T11:07:40.120Z [WARN ] storage.consul: appending trailing forward slash to path
2018-09-04T11:08:23.834Z [INFO ] core: vault is unsealed
2018-09-04T11:08:23.834Z [INFO ] core: entering standby mode
2018-09-04T11:08:23.834Z [INFO ] core: entering standby mode

The node is unsealed but is not active (leader) yet, and I don't get any memory stat files. Is this expected? Does the node need to be active (leader) to produce memory stat files?

@kalafut
Contributor

kalafut commented Sep 4, 2018

@ars05 The profiling will be on regardless of active status. The log message is at the DEBUG level, so you'd need to start the vault binary with -log-level=debug to see a message like this appear in the log stream:

2018-09-04T11:01:22.314-0700 [DEBUG] wrote memory profile: filename=/var/folders/yy/kf718y9x41g40ncf8s29mshh0000gn/T/vaultprof/20180904_180122.pprof

But debug is not required for the profiling to actually occur. Did you check $TMPDIR for a vaultprof folder, and some files therein?

@rshsmi
Author

rshsmi commented Sep 7, 2018

@kalafut - please find the pprof files attached.
vaultprof.zip

@kalafut
Contributor

kalafut commented Sep 12, 2018

@ars05 Thanks, these were pretty useful. Based on them, I'd like to review the structure of some token data in Consul. Can you please export and provide the list of keys under vault/sys/token/parent, e.g.:

curl http://<consul address>/v1/kv/vault/sys/token/parent\?keys > out.json

This list will only contain hashed UUID paths like "vault/sys/token/parent/10e62d82c8982d247b64c8b17c44c24e7b6d4d5e/6da14c90b428e2f7c8140c44c14d836d0e0af4e6". If you wish you can email the output to me directly: [email protected]

@rshsmi
Author

rshsmi commented Sep 13, 2018

@kalafut - Thanks, I sent you an email with the list of keys.

@kalafut
Contributor

kalafut commented Sep 14, 2018

We were able to create an abnormal key state (a parent/child cycle) that results in ever-growing memory use similar to what is in the profiles you sent earlier. The keys you provided had this problem state as well.

A PR (#5335) is up to handle such cycles correctly.
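
To illustrate why such a cycle leads to unbounded memory growth, here is a hypothetical sketch (not Vault's actual code; the children map and gather functions are made up) of walking a parent/child token index like sys/token/parent. Tracking visited IDs is one way to make such a walk cycle-safe; the actual fix in #5335 may differ in detail:

package main

import "fmt"

// children maps a parent token ID to its child token IDs, mirroring keys of
// the form sys/token/parent/<parent>/<child>.
var children = map[string][]string{
	"A": {"B"},
	"B": {"A"}, // abnormal state: B also lists A as a child, forming a cycle
}

// gatherNaive never records what it has already seen, so with a cycle the
// queue and result keep growing and the loop never terminates.
func gatherNaive(root string) []string {
	var out []string
	queue := []string{root}
	for len(queue) > 0 {
		id := queue[0]
		queue = queue[1:]
		for _, c := range children[id] {
			out = append(out, c)
			queue = append(queue, c)
		}
	}
	return out
}

// gatherSafe visits each token ID at most once, so a cycle is traversed once
// and the walk terminates.
func gatherSafe(root string) []string {
	var out []string
	visited := map[string]bool{root: true}
	queue := []string{root}
	for len(queue) > 0 {
		id := queue[0]
		queue = queue[1:]
		for _, c := range children[id] {
			if visited[c] {
				continue
			}
			visited[c] = true
			out = append(out, c)
			queue = append(queue, c)
		}
	}
	return out
}

func main() {
	fmt.Println(gatherSafe("A")) // [B]; gatherNaive("A") would never return
}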

@rshsmi
Author

rshsmi commented Oct 11, 2018

This issue is resolved.
(screenshot: vault-0-11-3)

@kalafut
Contributor

kalafut commented Oct 11, 2018

@ars05 Excellent news!
