
terminate called without an active exception #3500

Closed
yuchsiao opened this issue Jul 24, 2018 · 6 comments

@yuchsiao

yuchsiao commented Jul 24, 2018

Version: xgboost 0.72, compiled with NCCL.

I see this error message deterministically with certain hyperparameters when using "gpu_hist", while the same set of hyperparameters works fine with "hist" (CPU).

terminate called without an active exception
Aborted

Training does not even start the first iteration, and no other error messages are printed; the complete output is just the two lines above.

It seems related to the gpu_hist implementation. This was observed with single-GPU training.

Any ideas?

@RAMitchell
Member

Do you have a script for reproduction?

@yuchsiao
Author

The usage is basic:

import xgboost

# train_features / train_labels / train_weights come from the proprietary dataset described below
dtrain = xgboost.DMatrix(data=train_features, label=train_labels, weight=train_weights)
xgb_params = {
    'tree_method': 'gpu_hist',
    'gpu_id': 0,
    'n_estimators': 700,
    'learning_rate': 0.035,
    'max_depth': 16,
    'min_split_loss': 10,
    'min_child_weight': 100,
    'colsample_bytree': 0.9,
    'reg_lambda': 5,
    'objective': 'binary:logistic'
}
bst = xgboost.train(xgb_params, dtrain)

Unfortunately, I am not able to share the data that encounters the problem, as it is proprietary, but I can mention a few of its characteristics:

  • Labels are soft labels for two classes, i.e. values between 0 and 1.
  • Some features are floating point, some are binary, and some are one-hot encoded. Any of them may contain np.nan as missing values.
  • The data size is small, on the order of 10k rows and tens of features.
  • A subset of the data may be okay. Interestingly, it is possible that every subset in a partition of the 10k rows trains successfully on its own, yet the full 10k-row file triggers the issue.

The crash is deterministic: the same dataset and hyperparameters produce the same crash at the same training step (always at step 0).
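For a rough sense of the shape of the data, here is a synthetic stand-in with the characteristics listed above (soft labels in [0, 1]; float, binary, and one-hot features; np.nan missing values; ~10k rows and a few dozen columns). This is only an illustration of the structure, not the actual data:

import numpy as np
import xgboost

rng = np.random.RandomState(0)
n, n_float, n_binary, n_onehot = 10000, 20, 10, 8

# continuous features with ~5% missing values
float_feats = rng.randn(n, n_float)
float_feats[rng.rand(n, n_float) < 0.05] = np.nan

# binary features
binary_feats = (rng.rand(n, n_binary) < 0.3).astype(float)

# a one-hot encoded categorical with n_onehot categories
onehot_feats = np.eye(n_onehot)[rng.randint(0, n_onehot, size=n)]

features = np.hstack([float_feats, binary_feats, onehot_feats])
labels = rng.rand(n)   # soft labels between 0 and 1
weights = np.ones(n)

dtrain = xgboost.DMatrix(data=features, label=labels, weight=weights)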

@RAMitchell
Member

Thanks. Are you using the version from pip or building from source? There were a few bug fixes recently that would only be in the source version.

Would it be possible to take a minimal sample of data that reproduces the issue and then anonymise it? You could remove column labels and add noise to features.
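For example, something roughly along these lines would be enough (just a sketch, assuming the data sits in a pandas DataFrame df of numeric feature columns plus a label column; df, the column names, and the noise scale are all placeholders here):

import numpy as np
import pandas as pd

def anonymise(df, label_col="label", noise_scale=0.05, n_rows=2000, seed=0):
    # take a small random sample of the rows
    sample = df.sample(n=min(n_rows, len(df)), random_state=seed)
    labels = sample.pop(label_col).to_numpy()
    features = sample.to_numpy(dtype=float)
    # jitter the values that are present; np.nan missing markers stay untouched
    # (binary / one-hot columns could be excluded from the jitter if needed)
    rng = np.random.default_rng(seed)
    mask = ~np.isnan(features)
    features[mask] += rng.normal(0.0, noise_scale, size=mask.sum())
    # drop the original column names in favour of anonymous ones
    out = pd.DataFrame(features, columns=[f"f{i}" for i in range(features.shape[1])])
    out["label"] = labels
    return out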

I can't see how to resolve it without having a reproducible example.

@yuchsiao
Author

That makes sense. Let me look into a viable path for sharing some fragment of the data. The version was built from source just two weeks ago with the NCCL flag enabled.

If you know the exact dates when those fixes were applied, I can check whether my version includes them.

Thank you for looking into the issue! I will try to get back to you shortly.

@RAMitchell
Member

I made a notable fix in #3472

@yuchsiao
Author

yuchsiao commented Jul 28, 2018

Hi @RAMitchell,

Reporting back: the bug fix in #3472 does not resolve this issue.

I figured out a sample snippet that can reproduce the crash:

import numpy as np
import scipy as sp
import scipy.sparse  # ensure the sparse submodule is loaded for sp.sparse.random
import xgboost

N = 10000
d = 50
thres = 0.5
density = 0.10  # the crash appears only at higher densities (see note below)

# random sparse features and random binary labels
features = sp.sparse.random(N, d, density=density)
labels = np.random.rand(N)
labels[labels > thres] = 1
labels[labels <= thres] = 0

dtrain = xgboost.DMatrix(data=features, label=labels)

xgb_params = {
    'tree_method': 'gpu_hist',
#    'n_gpus': 2,
    'n_estimators': 700,
    'learning_rate': 0.035,
    'max_depth': 16,
    'min_split_loss': 10,
    'min_child_weight': 100,
    'colsample_bytree': 0.9,
    'reg_lambda': 5
}

bst = xgboost.train(xgb_params, dtrain)

The complete output is as follows:

[01:27:55] Allocated 0MB on [0] Tesla P100-PCIE-16GB, 15965MB remaining.
[01:27:55] Allocated 1MB on [0] Tesla P100-PCIE-16GB, 15963MB remaining.
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  /Users/bpottangi/xgboost/src/tree/../common/device_helpers.cuh(523): out of memory
Aborted

Note that this happens only when the sparse matrix density is high enough, e.g. 0.1 as in the snippet above. When it is as low as 0.01, the code runs fine. The behavior is the same when using multiple GPUs.
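To bracket the density at which the out-of-memory abort appears, one can sweep the density value; since the abort kills the whole interpreter, each density has to be run in a fresh process. A quick sketch, assuming the snippet above is saved as repro.py and modified to read the density from sys.argv[1] (both of these are my own assumptions, not part of the original snippet):

import subprocess
import sys

# run the repro script once per density and report whether the process survived
for density in (0.01, 0.02, 0.05, 0.10):
    result = subprocess.run([sys.executable, "repro.py", str(density)],
                            capture_output=True, text=True)
    status = "ok" if result.returncode == 0 else "crashed (exit %d)" % result.returncode
    print("density=%.2f: %s" % (density, status))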

Please let me know if this is enough for investigation.

Thanks a lot again for looking into the problem!

@lock lock bot locked as resolved and limited conversation to collaborators Oct 26, 2018