#install.packages("pacman")
library(pacman)
p_load("devtools","SuperLearner","ranger","CAST","caret","skimr","gbm","xgboost","hexbin")
Please install the 'deeper' R package. You can install the development version of deeper from GitHub with:
library(devtools)
install_github("Alven8816/deeper")
Our newly built R package mainly includes three steps:

1. Use predictModel() or predictModel_parallel() to establish the base models. The tuningModel() function can be used to tune the parameters and identify the best single base model.
2. Use stack_ensemble(), stack_ensemble.fit(), or stack_ensemble_parallel() to stack the meta-models and obtain a DEML model.
3. After the DEML model is established, predict() can be used to predict an unseen data set.

To assess model performance, assess.plot() compares the original observations (y in the test set) with the predictions and returns a scatter plot.
The other functions, including CV.predictModel() (adapted from SuperLearner), CV.predictModel_parallel(), and CV.stack_ensemble_parallel(), can be used to conduct external cross-validation, assessing the base models and the DEML model in each fold.
Note: DEML cannot directly handle missing values; missing-value imputation is recommended before using the DEML model.
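Since DEML requires complete data, a minimal imputation sketch in base R could look like the following. This is only an illustration on a toy data frame (the helper `impute_median` and the example columns are not part of deeper); for real analyses, dedicated packages such as mice or missRanger may be preferable.

```r
# Minimal sketch: replace NAs in each numeric column with that column's median
# before passing the data to predictModel(). 'impute_median' is a hypothetical
# helper, not a deeper function.
impute_median <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
    }
  }
  df
}

df <- data.frame(a = c(1, NA, 3), b = c(NA, 2, 4))
df_complete <- impute_median(df)
```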
This is a basic example which shows you how to use deeper:
library(deeper)
## obtain the example data
data("envir_example")
#knitr::kable(envir_example[1:6,])
summary(envir_example)
80% of the data was used for training and the remainder for testing.
set.seed(1234)
size <-
caret::createDataPartition(y = envir_example$PM2.5, p = 0.8, list = FALSE)
trainset <- envir_example[size, ]
testset <- envir_example[-size, ]
y <- c("PM2.5")
x <- colnames(envir_example[-c(1, 6)]) # except "date" and "PM2.5"
# we need to select the algorithms in advance
SuperLearner::listWrappers()
data(model_list) # get more details about the algorithm
In the 'type' column, "R" means the algorithm can be used for regression or classification; "N" means it can be used for regression but requires the variables to be numeric; "C" means it is used only for classification.
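For instance, the wrappers usable for regression could be filtered from model_list. This sketch assumes model_list is a data frame with a 'type' column as described above; check str(model_list) for its actual structure.

```r
# Hypothetical sketch: keep only wrappers marked "R" or "N", i.e. those
# applicable to a regression task such as predicting PM2.5.
data(model_list)
regression_wrappers <- model_list[model_list$type %in% c("R", "N"), ]
head(regression_wrappers)
```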
#method 1: using deeper package function
ranger <-
tuningModel(
basemodel = 'SL.ranger',
params = list(num.trees = 100),
tune = list(mtry = c(1, 3, 7))
)
# training the model with separate new data set
pred_m2 <-
predictModel(
Y = trainset[, y],
X = trainset[, x],
base_model = c("SL.xgboost", ranger),
cvControl = list(V = 5)
)
# predict the new dataset
pred_m2_new <- deeper::predict(object = pred_m2, newX = testset[, x])
# the base model prediction
head(pred_m2_new$pre_base$library.predict)
The results show the weight, R2, and root-mean-square error (RMSE) of each model. "ranger_1", "ranger_2", and "ranger_3" denote the random forest models with mtry = 1, 3, and 7 respectively.
The results show that mtry = 3 gives the best RF model for this prediction. This method can be used to tune suitable parameters for an algorithm.
## conduct the spatial CV
# Create a list with 7 (folds) elements (each element contains index of rows to be considered on each fold)
indices <-
CAST::CreateSpacetimeFolds(trainset, spacevar = "code", k = 7)
# Rows of validation set on each fold
v_raw <- indices$indexOut
names(v_raw) <- seq_len(7)
pred_m3 <- predictModel_parallel(
Y = trainset[, y],
X = trainset[, x],
base_model = c("SL.xgboost", ranger),
cvControl = list(V = length(v_raw), validRows = v_raw),
#number_cores = 4,
seed = 1
)
## when number_cores is missing, the function prompts the user to set one based on the operating system.
# pred_m3 <- predictModel_parallel(
# Y = trainset[,y],
# X = trainset[,x],
# base_model = c("SL.xgboost",ranger),
# cvControl = list(V = length(v_raw), validRows = v_raw),
# seed = 1
# )
#You have 8 cpu cores, How many cpu core you want to use:
# type the number to continue the process.
# prediction
pred_m3_new <- deeper::predict(object = pred_m3, newX = testset[, x])
head(pred_m3_new$pre_base$library.predict)
# test dataset performance
test_p1 <-
assess.plot(pre = pred_m3_new$pre_base$pred, obs = testset[, y])
print(test_p1$plot)
Our DEML model includes several meta-models built on the trained base models' results. Unlike other ensemble (stacking) methods, the DEML model allows selecting several meta-models, running them in parallel, and ensembling all the meta-models with optimal weights. (This step is optional.)
#Include original feature
pred_stack_10 <-
stack_ensemble_parallel(
object = pred_m3,
Y = trainset[, y],
meta_model = c("SL.ranger", "SL.xgboost", "SL.glmnet"),
original_feature = TRUE,
X = trainset[, x],
number_cores = 4
)
pred_stack_10_new <-
deeper::predict(object = pred_stack_10, newX = testset[, x])
caret::R2(pred = pred_stack_10_new$pre_meta$pred,obs = testset$PM2.5)
caret::RMSE(pred = pred_stack_10_new$pre_meta$pred,obs = testset$PM2.5)
test_p2 <-
assess.plot(pre = pred_stack_10_new$pre_meta$pred, obs = testset[, y])
print(test_p2$plot)
Wenhua Yu, Shanshan Li, Tingting Ye, Rongbin Xu, Jiangning Song, Yuming Guo (2022). Deep ensemble machine learning framework for the estimation of PM2.5 concentrations. Environmental Health Perspectives. https://doi.org/10.1289/EHP9752