Zygmunt Zawadzki

FSelectorRcpp is an Rcpp (free of Java/Weka) implementation of FSelector entropy-based feature selection algorithms with a sparse matrix support. It is also equipped with a parallel backend.

```
install.packages('FSelectorRcpp') # stable release version on CRAN
::install_github('mi2-warsaw/FSelectorRcpp') # dev version
devtools# windows users should have Rtools for devtools installation
# https://cran.r-project.org/bin/windows/Rtools/
```

In the modern statistical learning the biggest bottlenecks are computation times of model training procedures and the overfitting. Both are caused by the same issue - the high dimension of explanatory variables space. Researchers have encountered problems with too big sets of features used in machine learning algorithms also in terms of model interpretaion. This motivates applying feature selection algorithms before performing statisical modeling, so that on smaller set of attributes the training time will be shorter, the interpretation might be clearer and the noise from non important features can be avoided. More motivation can be found in (John, Kohavi, and Pfleger 1994).

Many methods were developed to reduce the curse of dimensionality
like **Principal Component Analysis** (Pearson 1901) or **Singular Value
Decomposition** (Eckart and Young
1936) which approximates the variables by smaller number of
combinations of original variables, but this approach is hard to
interpret in the final model.

Sophisticated methods of attribute selection as
**Boruta** algoritm (Kursa and
Rudnicki 2010), genetic algorithms Aziz et
al. (2013) or simulated annealing techniques (Khachaturyan, Semenovsovskaya, and Vainshtein
1981) are known and broadly used but in some cases for those
algorithms computations can take even days, not to mention that datasets
are growing every hour.

Few classification and regression models can reduce redundand
variables during the training phase of statistical learning process,
e.g. **Decision Trees** Breiman et
al. (1984), **LASSO Regularized Generalized Linear
Models** (with cross-validation) (Friedman, Hastie, and Tibshirani 2010) or
**Regularized Support Vector Machine** (Xu, Caramanis, and Mannor 2009), but still
computations starting with full set of explanatory variables are time
consuming and the understaning of the feature selection procedure in
this case is not simple and those methods are sometimes used without the
understanding.

In business applications there appear a need to provide a fast
feature selection that is extremely easy to understand. For such demands
easy methods are prefered. This motivates using simple techniques like
**Entropy Based Feature Selection** (Largeron, Moulin, and Géry 2011), where every
feature can be checked independently so that computations can be
performed in a parallel to shorter the procedure’s time. For this
approach we provide an R interface to **Rcpp**
reimplementation (Dirk Eddelbuettel 2011)
of methods included in **FSelector** package which we also
extended with parallel background and sparse matrix support. This has
significant impact on computations time and can be used on greater
datasets, comparing to **FSelector**. Additionally we
avoided the Weka (Hall et al. 2009)
dependency and we provided faster discretization implementations than
those from **entropy** package, used originally in
**FSelector**.

```
library(magrittr)
library(FSelectorRcpp)
```

A simple entropy based feature selection workflow.
**Information gain** is an easy, linear algorithm that
computes the entropy of a dependent and explanatory variables, and the
conditional entropy of a dependent variable with a respect to each
explanatory variable separately. This simple statistic enables to
calculate the belief of the distribution of a dependent variable when we
only know the distribution of a explanatory variable.

```
information_gain( # Calculate the score for each attribute
formula = Species ~ ., # that is on the right side of the formula.
data = iris, # Attributes must exist in the passed data.
type = "infogain" # Choose the type of a score to be calculated.
%>%
) cut_attrs( # Then take attributes with the highest rank.
k = 2 # For example: 2 attrs with the higehst rank.
%>%
) to_formula( # Create a new formula object with
attrs = ., # the most influencial attrs.
class = "Species"
%>%
) glm(
formula = ., # Use that formula in any classification algorithm.
data = iris,
family = "binomial"
)
```

```
Call: glm(formula = ., family = "binomial", data = iris)
Coefficients:
(Intercept) Petal.Width Petal.Length
-69.45 33.89 17.60
Degrees of Freedom: 149 Total (i.e. Null); 147 Residual
Null Deviance: 191
Residual Deviance: 5.17e-09 AIC: 6
```

Apply a complicated feature selection search engine, that checks
combinations of subsets of the specified attributes’ set, to score the
currently considered subset, depending on the criterion passed with the
`fun`

parameter.

```
<- # Create a scorer function.
evaluator_R2_lm function(
# That takes the currently considered subset of attributes
attributes, # from a specified dataset.
data, dependent =
names(data)[1] # To find features that best describe the dependent variable.
) {summary( # In this situation we take the r.squared statistic
lm( # from the summary of a linear model object.
to_formula( # This is the score to use to choose between considered
# subsets of attributes.
attributes,
dependent
),data = data)
$r.squared
)
}
feature_search( # feature_search work in 2 modes - 'exhaustive' and 'greedy'
attributes =
names(iris)[-1], # It takes attribues and creates combinations of it's subsets.
fun = evaluator_R2_lm, # And it calculates the score of a subset that depends on the
data = iris, # evaluator function passed in the `fun` parameter.
mode = "exhaustive", # exhaustive - means to check all possible
sizes = # attributes' subset combinations
1:length(attributes) # of sizes passed in sizes.
$all )
```

```
Sepal.Width Petal.Length Petal.Width Species values
1 1 0 0 0 0.01382265
2 0 1 0 0 0.7599546
3 0 0 1 0 0.6690277
4 0 0 0 1 0.6187057
```

Aziz, Amira Sayed A., Ahmad Taher Azar, Mostafa A. Salama, Aboul Ella
Hassanien, and Sanaa El Ola Hanfy. 2013. “Genetic Algorithms with
Different Feature Selection Techniques for Anomaly Detectors
Generation.” In *Proceedings of the 2013 Federated Conference
on Computer Science and Information Systems*, edited by M. Paprzycki
M. Ganzha L. Maciaszek, pages 769–774. IEEE.

Breiman, L., J. Friedman, R. Olshen, and C. Stone. 1984. *Classification and Regression Trees*.
Monterey, CA: Wadsworth; Brooks.

Dirk Eddelbuettel, Romain Franccois. 2011. “Rcpp:
Seamless R and C++ Integration.”
*Journal of Statistical Software* 40 (8): 1–18. https://www.jstatsoft.org/v40/i08/.

Eckart, C., and G. Young. 1936. “The Approximation of One Matrix
by Another of Lower Rank.” *Psychometrika* 1 (3): 211–18.
https://doi.org/10.1007/BF02288367.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2010.
“Regularization Paths for Generalized Linear Models via Coordinate
Descent.” *Journal of Statistical Software* 33 (1): 1–22.
https://www.jstatsoft.org/v33/i01/.

Hall, Mark, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter
Reutemann, and Ian H. Witten. 2009. “The WEKA Data Mining
Software: An Update.” *SIGKDD Explor. Newsl.* 11 (1):
10–18. https://doi.org/10.1145/1656274.1656278.

John, George H., Ron Kohavi, and Karl Pfleger. 1994. “Irrelevant
Features and the Subset Selection Problem.” In *MACHINE
LEARNING: PROCEEDINGS OF THE ELEVENTH INTERNATIONAL*, 121–29. Morgan
Kaufmann.

Khachaturyan, A., S. Semenovsovskaya, and B. Vainshtein. 1981.
“The thermodynamic approach to the structure
analysis of crystals.” *Acta Crystallographica Section
A* 37 (5): 742–54.

Kuhn, Max, and Kjell Johnson. 2013. “Applied Predictive
Modeling.” New York, NY: Springer. 2013. https://www.amazon.com/Applied-Predictive-Modeling-Max-Kuhn/dp/1461468485/.

Kursa, Miron B., and Witold R. Rudnicki. 2010. “Feature Selection
with the Boruta Package.” *Journal of Statistical
Software* 36 (11): 1–13. https://www.jstatsoft.org/v36/i11/.

Largeron, Christine, Christophe Moulin, and Mathias Géry. 2011.
“Entropy Based Feature Selection for Text Categorization.”
In *Proceedings of the 2011 ACM Symposium on Applied Computing*,
924–28. SAC ’11. New York, NY, USA: ACM. https://doi.org/10.1145/1982185.1982389.

Pearson, Karl. 1901. “LIII. On Lines and Planes of Closest Fit to
Systems of Points in Space.” *Philosophical Magazine Series
6* 2 (11): 559–72.

Rokach, Lior, and Oded Maimon. 2008. *Data Mining with Decision
Trees: Theroy and Applications*. River Edge, NJ, USA: World
Scientific Publishing Co., Inc.

Xu, Huan, Constantine Caramanis, and Shie Mannor. 2009.
“Robustness and Regularization of Support Vector Machines.”
*J. Mach. Learn. Res.* 10 (December): 1485–1510. https://dl.acm.org/doi/10.5555/1577069.1755834.