
terminate called without an active exception #3500

Closed
yuchsiao opened this issue Jul 24, 2018 · 6 comments

@yuchsiao

yuchsiao commented Jul 24, 2018

Version: xgboost 0.72, compiled with NCCL.

I see this error message deterministically with certain hyperparameters when using "gpu_hist", while the same set of hyperparameters works fine with "hist" (CPU).

terminate called without an active exception
Aborted

Training does not even start the first iteration, and no other error messages are printed; the complete output is just the two lines above.

It seems related to the gpu_hist implementation. This was observed with single-GPU training.

Any ideas?

@RAMitchell
Member

Do you have a script for reproduction?

@yuchsiao
Author

The usage is basic:

import xgboost

# train_features / train_labels / train_weights come from the proprietary dataset described below
dtrain = xgboost.DMatrix(data=train_features, label=train_labels, weight=train_weights)
xgb_params = {
    'tree_method': 'gpu_hist',
    'gpu_id': 0,
    'n_estimators': 700,
    'learning_rate': 0.035,
    'max_depth': 16,
    'min_split_loss': 10,
    'min_child_weight': 100,
    'colsample_bytree': 0.9,
    'reg_lambda': 5,
    'objective': 'binary:logistic'
}
bst = xgboost.train(xgb_params, dtrain)

Unfortunately, I am not able to share the data that encounters the problem, as it is proprietary, but I can mention a few of its characteristics:

  • Labels are soft labels for two classes, i.e. values between 0 and 1.
  • Some features are floating point, some are binary, and some are one-hot encoded. Any of them may contain np.nan as missing values.
  • The data size is small, on the order of 10k rows and tens of features.
  • A subset of the data may be okay. Interestingly, it is possible that every subset in a partition of the 10k rows trains successfully on its own, yet the full 10k-row file triggers the issue.

The crash is deterministic: the same dataset and hyperparameters produce the same crash at the same training step (always at step 0).
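For a rough sense of the shape of the data, here is a synthetic stand-in with the characteristics listed above (soft labels in [0, 1]; float, binary, and one-hot features; np.nan missing values; ~10k rows and a few dozen columns). This is only an illustration of the structure, not the actual data:

import numpy as np
import xgboost

rng = np.random.RandomState(0)
n, n_float, n_binary, n_onehot = 10000, 20, 10, 8

# continuous features with ~5% missing values
float_feats = rng.randn(n, n_float)
float_feats[rng.rand(n, n_float) < 0.05] = np.nan

# binary features
binary_feats = (rng.rand(n, n_binary) < 0.3).astype(float)

# a one-hot encoded categorical with n_onehot categories
onehot_feats = np.eye(n_onehot)[rng.randint(0, n_onehot, size=n)]

features = np.hstack([float_feats, binary_feats, onehot_feats])
labels = rng.rand(n)   # soft labels between 0 and 1
weights = np.ones(n)

dtrain = xgboost.DMatrix(data=features, label=labels, weight=weights)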

@RAMitchell
Member

Thanks. Are you using the version from pip or building from source? There were a few bug fixes recently that would only be in the source version.

Would it be possible to take a minimal sample of data that reproduces the issue and then anonymise it? You could remove column labels and add noise to features.
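For example, something roughly along these lines would be enough (just a sketch, assuming the data sits in a pandas DataFrame df of numeric feature columns plus a label column; df, the column names, and the noise scale are all placeholders here):

import numpy as np
import pandas as pd

def anonymise(df, label_col="label", noise_scale=0.05, n_rows=2000, seed=0):
    # take a small random sample of the rows
    sample = df.sample(n=min(n_rows, len(df)), random_state=seed)
    labels = sample.pop(label_col).to_numpy()
    features = sample.to_numpy(dtype=float)
    # jitter the values that are present; np.nan missing markers stay untouched
    # (binary / one-hot columns could be excluded from the jitter if needed)
    rng = np.random.default_rng(seed)
    mask = ~np.isnan(features)
    features[mask] += rng.normal(0.0, noise_scale, size=mask.sum())
    # drop the original column names in favour of anonymous ones
    out = pd.DataFrame(features, columns=[f"f{i}" for i in range(features.shape[1])])
    out["label"] = labels
    return out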

I can't see how to resolve it without having a reproducible example.

@yuchsiao
Author

That makes sense. Let me look into a viable path for sharing some fragment of the data. The version was built from source just two weeks ago with the NCCL flag enabled.

If you know the exact dates when those fixes were applied, I can check whether my version includes them.

Thank you for looking into the issue! I will try to get back to you shortly.

@RAMitchell
Member

I made a notable fix in #3472

@yuchsiao
Author

yuchsiao commented Jul 28, 2018

Hi @RAMitchell,

Reporting back: the bug fix in #3472 does not resolve this issue.

I figured out a sample snippet that can reproduce the crash:

import numpy as np
import scipy as sp
import scipy.sparse  # ensure the sparse submodule is loaded for sp.sparse.random
import xgboost

N = 10000
d = 50
thres = 0.5
density = 0.10  # the crash appears only at higher densities (see note below)

# random sparse features and random binary labels
features = sp.sparse.random(N, d, density=density)
labels = np.random.rand(N)
labels[labels > thres] = 1
labels[labels <= thres] = 0

dtrain = xgboost.DMatrix(data=features, label=labels)

xgb_params = {
    'tree_method': 'gpu_hist',
#    'n_gpus': 2,
    'n_estimators': 700,
    'learning_rate': 0.035,
    'max_depth': 16,
    'min_split_loss': 10,
    'min_child_weight': 100,
    'colsample_bytree': 0.9,
    'reg_lambda': 5
}

bst = xgboost.train(xgb_params, dtrain)

The complete output is as follows:

[01:27:55] Allocated 0MB on [0] Tesla P100-PCIE-16GB, 15965MB remaining.
[01:27:55] Allocated 1MB on [0] Tesla P100-PCIE-16GB, 15963MB remaining.
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  /Users/bpottangi/xgboost/src/tree/../common/device_helpers.cuh(523): out of memory
Aborted

Note that this happens only when the sparse matrix density is high enough, e.g. 0.1 as in the snippet above. When it is as low as 0.01, the code runs fine. The behavior is the same when using multiple GPUs.
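To bracket the density at which the out-of-memory abort appears, one can sweep the density value; since the abort kills the whole interpreter, each density has to be run in a fresh process. A quick sketch, assuming the snippet above is saved as repro.py and modified to read the density from sys.argv[1] (both of these are my own assumptions, not part of the original snippet):

import subprocess
import sys

# run the repro script once per density and report whether the process survived
for density in (0.01, 0.02, 0.05, 0.10):
    result = subprocess.run([sys.executable, "repro.py", str(density)],
                            capture_output=True, text=True)
    status = "ok" if result.returncode == 0 else "crashed (exit %d)" % result.returncode
    print("density=%.2f: %s" % (density, status))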

Please let me know if this is enough for investigation.

Thanks a lot again for looking into the problem!

@lock lock bot locked as resolved and limited conversation to collaborators Oct 26, 2018