*BOSO* is an R package to perform feature selection in linear regression models.

BOSO can be installed from CRAN repository:

`install.packages("BOSO")`

Note that it is necessary to have IBM ILOG CPLEX installed, with a version greater than 12.1, and the *cplexAPI* package, which is used to invoke CPLEX from R environment.

Furthermore, check if packages are installed. Check if bestsubset is installed. If not, source necessary functions from bestsubset.

```
if (!requireNamespace('cplexAPI', quietly = TRUE)) {
testthat::skip('Package cplexAPI not installed (required for this vignette)!\n
Install it from CRAN: https://cran.r-project.org/web/packages/cplexAPI/index.html')
stop('Package cplexAPI not installed (required for this vignette)!\n
Install it from CRAN: https://cran.r-project.org/web/packages/cplexAPI/index.html', call. = FALSE)
}
```

```
if (!requireNamespace('bestsubset', quietly = TRUE)) {
devtools::source_url("https://raw.githubusercontent.com/ryantibs/best-subset/master/bestsubset/R/fs.R")
devtools::source_url("https://raw.githubusercontent.com/ryantibs/best-subset/master/bestsubset/R/lasso.R")
devtools::source_url("https://raw.githubusercontent.com/ryantibs/best-subset/master/bestsubset/R/sim.R")
enlist <- function (...)
{
result <- list(...)
if ((nargs() == 1) & is.character(n <- result[[1]])) {
result <- as.list(seq(n))
names(result) <- n
for (i in n) result[[i]] <- get(i)
} else {
n <- sys.call()
n <- as.character(n)[-1]
if (!is.null(n2 <- names(result))) {
which <- n2 != ""
n[which] <- n2[which]
}
names(result) <- n
}
result
}
} else {require("bestsubset")}
```

```
## â„¹ SHA-1 hash of file is "cc12499ddd2cc538f318f3ee14ad4a3911945a27"
## â„¹ SHA-1 hash of file is "e7c4bf157a3c33b0a11179541be100c0037a4fea"
## â„¹ SHA-1 hash of file is "531fcc6ad06024eae37afaa3e15d83e0ea3ece89"
```

The package is designed to run the BOSO algorithm in one single function, which comprises two steps: block strategy and final problem.

The block strategy is a pre-processing step in which the BOSO MILP (Mixed-Integer Liner Programming) is applied to random subproblems of small dimension, aiming to discard variables for the final problem. The usual block size is 10 variables. This process is done iteratively until convergence.

Final problem applies the BOSO MILP with all the remaining variables from the block strategy.

Figure 1 summarizes the *BOSO* algorithm. An example dataset with 7 features is split into training and validation sets. Green boxes represent optimal selected features for a Specific K value. For example, for K=2, the subset of features that minimizes the validation error is {X3, X6}. The problem of selecting the best subset of features of length K is formulated via mixed-integer quadratic programming (MIQP) and solved using IBM ILOG CPLEX This process is repeated for each K value until an information criterion, in this case the extended Bayesian Information Criterion (eBIC), is not further improved. Minimal eBIC is found in this example for K=2. The final model is derived from Ridge regression with only these two selected variables.

Function `BOSO`

has a number of parameters that can be changed:

*IC*is the information criterion to be used. Options are AIC, BIC and eBIC*IC.blocks*is the information criterion to be used in the block strategy step. By default, it is the same as the*IC*, but it can be changed. Additionally, if*IC*is eBIC,*IC.blocks*will be BIC.*nlambda*The number of lambda values in the final problem. Default value 100.*nlambda.blocks*The number of lambda values in the blocks strategy. Default value 10, in order to alleviate the block strategy.*TH_IC*Threshold for the stopping criterion in K selection. Default value 1e-3. We have a worse model if the IC is at least 0.1% higher than the current solution.

On the other hand, there are some parameters that are identical to the ones found in the `glmnet`

function: *lambda.min.ratio*, *lambda*, *intercept*, *standardize* and *dfmax*.

Finally, some of the parameters are solver-specific:

*Threads*CPLEX parameter that defines the number of cores to be used. Default is 0 (automatic).*timeLimit*CPLEX parameter that defines the time limit per problem provided to CPLEX.Default value is 1e75 (infinite time).*warmstart*warmstart for CPLEX or use a different problem for each k. Default is False.

Given a synthetic data set (see Methods section in the article), we evaluate the capacity of BOSO to extract the features that are related with the response variable. First o, we set the parameters for the generation of synthetic data:

```
# Set some overall simulation parameters
n = 100; # Size of training set
nval = n # Size of validation set
p = 10; # Number of variables
s = 5 # Number of nonzero coefficients
beta.type = 1 # Type of coefficiente structure
rho.vec = 0.35 # Variable autocorrelation level
snr.vec = exp(seq(log(0.05),log(6),length=10)) # Signal-to-noise ratios
nrep = 10 # Number of repetitions for a given setting
seed = 0 # Random number generator seed
```

Benchmarked methods are defined: Forward Stepwise, Lasso, Relaxed lasso and BOSO.

```
# Regression functions: lasso, forward stepwise and BOSO
reg.funs = list()
reg.funs[["Forward stepwise"]] = function(x,y) fs(x,y,intercept=FALSE, max=min(n,p))
reg.funs[["Lasso"]] = function(x,y) lasso(x,y,intercept=FALSE,nlam=50)
reg.funs[["Relaxed lasso"]] = function(x,y) lasso(x,y,intercept=FALSE,
nrelax=10,nlam=50)
reg.funs[["BOSO - AIC"]] <- function(x,y,xval,yval) BOSO(x,y,xval,yval,
IC = "AIC",
nlambda = 100,
Threads = 4,
timeLimit = 60,
intercept = F,
standardize = F,
verbose = 0,
warmstart=T,
seed = 123456)
reg.funs[["BOSO - BIC"]] <- function(x,y,xval,yval) BOSO(x,y,xval,yval,
IC = "BIC",
nlambda = 100,
Threads = 4,
timeLimit = 60,
intercept = F,
standardize = F,
verbose = 0,
warmstart=T,
seed = 123456)
reg.funs[["BOSO - eBIC"]] <- function(x,y,xval,yval) BOSO(x,y,xval,yval,
IC = "eBIC",
nlambda = 100,
Threads = 4,
timeLimit = 60,
intercept = F,
standardize = F,
verbose = 0,
warmstart=T,
seed = 123456)
```

We have adapted the `sim.master`

function from the *bestsubset* library function to assess the different methods.

```
sim.results <- list()
for (i in 1:length(snr.vec)) {
set.seed(i)
sim.results[[i]] <- sim.master(n = n, p = p, nval = nval,
rho=rho.vec, s=s, beta.type=beta.type,
snr = snr.vec[[i]],
reg.funs, nrep=nrep, seed=i,
verbose=T)
}
```

Finally, the next part of the code shows the comparison of BOSO-eBIC with some of the most widely used algorithms in the literature: Lasso, Relaxed lasso and Forward Stepwise.

```
method.nums = c(6,1,2,3)
method.names = c("BOSO - eBIC","Forward stepwise","Lasso","Relaxed lasso")
plot.BOSO <- list(
"F" = plot.from.sim(sim.list = sim.results, what="F", rel.to=NULL, tuning="val",
method.nums=method.nums, method.names=method.names),
"nonzeros" = plot.from.sim(sim.list = sim.results, what="nonzero", rel.to=NULL, tuning="val",
method.nums=method.nums, method.names=method.names),
"fpos" = plot.from.sim(sim.list = sim.results, what="fpos", rel.to=NULL, tuning="val",
method.nums=method.nums, method.names=method.names),
"fneg" = plot.from.sim(sim.list = sim.results, what="fneg", rel.to=NULL, tuning="val",
method.nums=method.nums, method.names=method.names))
```

```
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
```

`ggarrange(plotlist = plot.BOSO, ncol = 2, nrow = 2, common.legend = T, legend = "bottom")`