With our powers combined! xgboost and pipelearner

drsimonj here to show you how to use xgboost (extreme gradient boosting) models in pipelearner.

Why a post on xgboost and pipelearner?

xgboost is one of the most powerful machine-learning libraries, so there’s a good reason to use it. pipelearner helps to create machine-learning pipelines that make it easy to do cross-fold validation, hyperparameter grid searching, and more. So bringing them together will make for an awesome combination!

The only problem: out of the box, xgboost doesn't play nice with pipelearner. Let's work out how to deal with this.

Setup

To follow this post you’ll need the following packages:

# Install (if necessary)
install.packages(c("xgboost", "tidyverse", "devtools"))
devtools::install_github("drsimonj/pipelearner")

# Attach
library(tidyverse)
library(xgboost)
library(pipelearner)
library(lazyeval)

Our example will be to try and predict whether tumours are cancerous or not using the Wisconsin Breast Cancer data set. Set up as follows:

data_url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'

d <- read_csv(
  data_url,
  col_names = c('id', 'thickness', 'size_uniformity',
                'shape_uniformity', 'adhesion', 'epith_size',
                'nuclei', 'chromatin', 'nucleoli', 'mitoses', 'cancer')) %>% 
  select(-id) %>%                   # Remove id; not useful here
  filter(nuclei != '?') %>%         # Remove records with missing data
  mutate(cancer = cancer == 4) %>%  # Recode 'cancer' as 1 = malignant, 0 = benign
  mutate_all(as.numeric)            # All to numeric; needed for xgboost

d
#> # A tibble: 683 × 10
#>    thickness size_uniformity shape_uniformity adhesion epith_size nuclei
#>        <dbl>           <dbl>            <dbl>    <dbl>      <dbl>  <dbl>
#> 1          5               1                1        1          2      1
#> 2          5               4                4        5          7     10
#> 3          3               1                1        1          2      2
#> 4          6               8                8        1          3      4
#> 5          4               1                1        3          2      1
#> 6          8              10               10        8          7     10
#> 7          1               1                1        1          2     10
#> 8          2               1                2        1          2      1
#> 9          2               1                1        1          2      1
#> 10         4               2                1        1          2      1
#> # ... with 673 more rows, and 4 more variables: chromatin <dbl>,
#> #   nucleoli <dbl>, mitoses <dbl>, cancer <dbl>

pipelearner

pipelearner makes it easy to do lots of routine machine learning tasks. For this example, we'll use pipelearner to perform a grid search of some xgboost hyperparameters.

Grid searching is easy with pipelearner. For detailed instructions, check out my previous post. As a quick reminder, we declare a data frame, machine learning function, formula, and hyperparameters as vectors. Here's an example that would grid search multiple values of minsplit and maxdepth for an rpart decision tree:

pipelearner(d, rpart::rpart, cancer ~ .,
            minsplit = c(2, 4, 6, 8, 10),
            maxdepth = c(2, 3, 4, 5))

The challenge for xgboost:

pipelearner expects a model function that has two arguments: data and formula.
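To make that contract concrete, here's a minimal sketch (my own illustration, not from the original post) of the expected shape. The hypothetical my_learner() below delegates to lm(), which already follows the (formula, data) convention:

# Hypothetical learner: pipelearner invokes it as f(data = ..., formula = ...)
my_learner <- function(data, formula, ...) {
  lm(formula, data = data, ...)  # delegate to a (formula, data) modeller
}

# Any function with this signature slots straight into a pipeline:
pipelearner(d, my_learner, cancer ~ .)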

xgboost

Here’s an xgboost model:

# Prep data (X) and labels (y)
X <- select(d, -cancer) %>% as.matrix()
y <- d$cancer

# Fit the model
fit <- xgboost(X, y, nrounds = 5, objective = "reg:logistic")
#> [1] train-rmse:0.372184 
#> [2] train-rmse:0.288560 
#> [3] train-rmse:0.230171 
#> [4] train-rmse:0.188965 
#> [5] train-rmse:0.158858 

# Examine accuracy
predicted <- as.numeric(predict(fit, X) >= .5)
mean(predicted == y)
#> [1] 0.9838946

Looks like we have a model with 98.39% accuracy on the training data!

Regardless, notice that the first two arguments to xgboost() are a numeric data matrix and a numeric label vector. This is not what pipelearner wants!

Wrapper function to parse data and formula

To make xgboost compatible with pipelearner we need to write a wrapper function that accepts data and formula, and uses these to pass a feature matrix and label vector to xgboost:

pl_xgboost <- function(data, formula, ...) {
  data <- as.data.frame(data)

  # Pull the variable names out of the formula
  X_names <- as.character(f_rhs(formula))
  y_name  <- as.character(f_lhs(formula))

  # Expand '.' to mean all columns except the outcome
  if (X_names == '.') {
    X_names <- names(data)[names(data) != y_name]
  }

  # Convert to the matrix and label vector that xgboost expects
  X <- data.matrix(data[, X_names])
  y <- data[[y_name]]

  xgboost(data = X, label = y, ...)
}
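One caveat worth flagging (my note, not in the original post): as.character(f_rhs(formula)) only copes with a '.' or a single predictor on the right-hand side. For multi-term formulas, all.vars() is one way to extract the names:

# as.character() on a multi-term RHS returns c("+", "thickness", "nuclei"),
# which would break the column lookup above; all.vars() avoids this:
all.vars(f_rhs(cancer ~ thickness + nuclei))
#> [1] "thickness" "nuclei"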

Let’s try it out:

pl_fit <- pl_xgboost(d, cancer ~ ., nrounds = 5, objective = "reg:logistic")
#> [1] train-rmse:0.372184 
#> [2] train-rmse:0.288560 
#> [3] train-rmse:0.230171 
#> [4] train-rmse:0.188965 
#> [5] train-rmse:0.158858 

# Examine accuracy
pl_predicted <- as.numeric(predict(pl_fit, as.matrix(select(d, -cancer))) >= .5)
mean(pl_predicted == y)
#> [1] 0.9838946

Perfect!

Bringing it all together

We can now use pipelearner and pl_xgboost() for easy grid searching. Crossing three values of nrounds with two values each of eta and max_depth gives 3 × 2 × 2 = 12 candidate models:

pl <- pipelearner(d, pl_xgboost, cancer ~ .,
                  nrounds = c(5, 10, 25),
                  eta = c(.1, .3),
                  max_depth = c(4, 6))

fits <- pl %>% learn()
#> [1] train-rmse:0.453832 
#> [2] train-rmse:0.412548 
#> ...

fits
#> # A tibble: 12 × 9
#>    models.id cv_pairs.id train_p               fit target      model
#>        <chr>       <chr>   <dbl>            <list>  <chr>      <chr>
#> 1          1           1       1 <S3: xgb.Booster> cancer pl_xgboost
#> 2         10           1       1 <S3: xgb.Booster> cancer pl_xgboost
#> 3         11           1       1 <S3: xgb.Booster> cancer pl_xgboost
#> 4         12           1       1 <S3: xgb.Booster> cancer pl_xgboost
#> 5          2           1       1 <S3: xgb.Booster> cancer pl_xgboost
#> 6          3           1       1 <S3: xgb.Booster> cancer pl_xgboost
#> 7          4           1       1 <S3: xgb.Booster> cancer pl_xgboost
#> 8          5           1       1 <S3: xgb.Booster> cancer pl_xgboost
#> 9          6           1       1 <S3: xgb.Booster> cancer pl_xgboost
#> 10         7           1       1 <S3: xgb.Booster> cancer pl_xgboost
#> 11         8           1       1 <S3: xgb.Booster> cancer pl_xgboost
#> 12         9           1       1 <S3: xgb.Booster> cancer pl_xgboost
#> # ... with 3 more variables: params <list>, train <list>, test <list>

Looks like all the models learned OK. Let’s write a custom function to extract model accuracy and examine the results:

accuracy <- function(fit, data, target_var) {
  # Convert resample object to data frame
  data <- as.data.frame(data)

  # Get feature matrix and labels
  X <- data %>%
    select(-matches(target_var)) %>% 
    as.matrix()
  y <- data[[target_var]]

  # Obtain predicted class
  y_hat <- as.numeric(predict(fit, X) > .5)

  # Return accuracy
  mean(y_hat == y)
}

results <- fits %>% 
  mutate(
    # Hyperparameters
    nrounds   = map_dbl(params, "nrounds"),
    eta       = map_dbl(params, "eta"),
    max_depth = map_dbl(params, "max_depth"),
    # Accuracy
    accuracy_train = pmap_dbl(list(fit, train, target), accuracy),
    accuracy_test  = pmap_dbl(list(fit, test,  target), accuracy)
  ) %>% 
  # Select columns and order rows
  select(nrounds, eta, max_depth, contains("accuracy")) %>% 
  arrange(desc(accuracy_test), desc(accuracy_train))

results
#> # A tibble: 12 × 5
#>    nrounds   eta max_depth accuracy_train accuracy_test
#>      <dbl> <dbl>     <dbl>          <dbl>         <dbl>
#> 1       25   0.3         6      1.0000000     0.9489051
#> 2       25   0.3         4      1.0000000     0.9489051
#> 3       10   0.3         6      0.9981685     0.9489051
#> 4        5   0.3         6      0.9945055     0.9489051
#> 5       10   0.1         6      0.9945055     0.9489051
#> 6       25   0.1         6      0.9945055     0.9489051
#> 7        5   0.1         6      0.9926740     0.9489051
#> 8       25   0.1         4      0.9890110     0.9489051
#> 9       10   0.3         4      0.9871795     0.9489051
#> 10       5   0.3         4      0.9853480     0.9489051
#> 11      10   0.1         4      0.9853480     0.9416058
#> 12       5   0.1         4      0.9835165     0.9416058

Our top model, which got 94.89% on a test set, had nrounds = 25, eta = 0.3, and max_depth = 6.

Either way, the trick was the wrapper function pl_xgboost() that let us bridge xgboost and pipelearner. Note that this same principle can be used for any other machine learning functions that don’t play nice with pipelearner.
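As an illustration of that principle, here's a sketch (my addition, not from the original post) of the same trick applied to glmnet, which likewise expects a feature matrix and response vector rather than a formula:

# A hypothetical pl_glmnet(), mirroring pl_xgboost() above.
# Assumes the glmnet package is installed.
pl_glmnet <- function(data, formula, ...) {
  data    <- as.data.frame(data)
  y_name  <- as.character(f_lhs(formula))
  X_names <- as.character(f_rhs(formula))
  if (X_names == '.') {
    X_names <- names(data)[names(data) != y_name]
  }
  glmnet::glmnet(x = data.matrix(data[, X_names]), y = data[[y_name]], ...)
}

# e.g. pipelearner(d, pl_glmnet, cancer ~ ., family = "binomial")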

Bonus: bootstrapped cross validation

For those of you who are comfortable, below is a bonus example of using 100 bootstrapped cross-validation samples to examine consistency in the accuracy. It doesn't get much easier than using pipelearner!

results <- pipelearner(d, pl_xgboost, cancer ~ ., nrounds = 25) %>% 
  learn_cvpairs(n = 100) %>% 
  learn() %>% 
  mutate(
    test_accuracy = pmap_dbl(list(fit, test, target), accuracy)
  )
#> [1] train-rmse:0.357471 
#> [2] train-rmse:0.256735 
#> ...

results %>% 
  ggplot(aes(test_accuracy)) +
  geom_histogram(bins = 30) +
  scale_x_continuous(labels = scales::percent) +
  theme_minimal() +
  labs(x = "Accuracy", y = "Number of samples",
       title = "Test accuracy distribution for\n100 bootstrapped samples")
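If you'd rather have a numeric summary than a histogram, something like this (my addition, not in the original post) condenses the same distribution:

# Summarise the bootstrap distribution of test accuracy
results %>% 
  summarise(mean_accuracy = mean(test_accuracy),
            sd_accuracy   = sd(test_accuracy))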

Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow me on Twitter, or email me to get in touch.

If you’d like the code that produced this blog post, check out the GitHub repository.

 

Source: https://drsimonj.svbtle.com/with-our-powers-combined-xgboost-and-pipelearner

Reposted from: https://www.cnblogs.com/payton/p/6375101.html
