`vtreat::prepare`

includes a required argument
`pruneSig`

that (if not NULL) is used to prune variables.
Obviously significance depends on training set size (so is not an
intrinsic property of just the variables) and there are issues of bias
in the estimate (which vtreat attempts to eliminate by estimating
significance of complex sub-model variables on cross-validated or out of
sample data). As always there is a question of what to set a
significance control to.

Our advice is the following pragmatic:

Use variable filtering on wide datasets (datasets with many columns or variables). Most machine learning algorithms can not defend themselves against large numbers of noise variables (including those algorithms that have cross-validation procedures built in). Examples are given here.

As an upper bound think of setting `pruneSig`

below
*1/numberOfColumns*. Setting `pruneSig`

to
*1/numberOfColumns* means that (in expectation) only a constant
number of pure noise variables (variables with no actual relation to the
outcome we are trying to predict) should create columns. This means
(under some assumptions, and in expectation) we expect only a bounded
number of noisy columns to be exposed to downstream statistical and
machine learning algorithms (which they can presumably handle).

As a lower bound think of what sort of good variables get thrown out
at a given setting of `pruneSig`

. For example suppose our
problem is categorization in a data set with *n/2* positive
examples and *n/2* negative examples. Consider the observed
significance of a rare indicator variable that is on *k* times in
training and is only on for positive instances. A random variable that
is on *k* times would achieve this purity with probability \(2^{-k}\), so we expect it to have a
*-log(significance)* in the ballpark of *k*. So a
`pruneSig`

of \(2^{-k}\)
will filter all such variables out (be they good or bad). Thus if you
want levels or indicators that are on only a *z* fraction of the
time on a training set of size *n* you want `pruneSig`

>> \(2^{-z*n}\).

Example:

```
signk <- function(n,k) {
sigTab <- data.frame(y=c(rep(TRUE,n/2),rep(FALSE,n/2)),v=FALSE)
sigTab[seq_len(k),'v'] <- TRUE
vtreat::designTreatmentsC(sigTab,'v','y',TRUE,verbose=FALSE)$scoreFrame[1,'sig']
}
sigTab <- data.frame(k=c(1,2,3,4,5,10,20,50,100))
# If you want to see a rare but perfect indicator of positive class
# that's only on k times out of 1000, this is the lower bound on pruneSig
sigTab$sigEst = vapply(sigTab$k,function(k) signk(1000,k),numeric(1))
sigTab$minusLogSig = -log(sigTab$sigEst) # we expect this to be approximately k
print(sigTab)
```

```
## k sigEst minusLogSig
## 1 1 2.388636e-01 1.431863
## 2 2 9.565153e-02 2.347044
## 3 3 4.119677e-02 3.189395
## 4 4 1.836242e-02 3.997449
## 5 5 8.351092e-03 4.785363
## 6 10 1.863495e-04 8.587887
## 7 20 1.131954e-07 15.994150
## 8 50 2.209988e-17 38.350959
## 9 100 1.952762e-34 77.618649
```

For a data set with 100 variables (and 1000 rows), you might want to
set `pruneSig`

<= 0.01 to limit the number of pure noise
variables that enter the model. Note that this value is smaller than the
lower bounds given above for \(k <
5\). This means that in a data set of this width and length, you
may not be able to detect rare but perfect indicators that occur fewer
than 5 times. You would have a chance of using such rare indicators in a
*catN* or *catB* effects coded variable.

Below we design a data frame with a perfect categorical variable
(completely determines the outcome y) where each level occurs exactly 2
times. The individual levels are insignificant, but we can still extract
a significant *catB* effect coded variable.

```
set.seed(3346)
n <- 1000
k <- 4
d <- data.frame(y=rbinom(n,size=1,prob=0.5)>0)
d$catVarNoise <- rep(paste0('lev',sprintf("%03d",1:floor(n/k))),(k+1))[1:n]
d$catVarPerfect <- paste0(d$catVar,substr(as.character(d$y),1,1))
d <- d[order(d$catVarPerfect),]
head(d)
```

```
## y catVarNoise catVarPerfect
## 1 FALSE lev001 lev001F
## 501 FALSE lev001 lev001F
## 251 TRUE lev001 lev001T
## 751 TRUE lev001 lev001T
## 2 FALSE lev002 lev002F
## 252 FALSE lev002 lev002F
```

```
## [1] "vtreat 1.6.5 inspecting inputs Wed Jun 12 08:51:33 2024"
## [1] "designing treatments Wed Jun 12 08:51:33 2024"
## [1] " have initial level statistics Wed Jun 12 08:51:33 2024"
## [1] " scoring treatments Wed Jun 12 08:51:33 2024"
## [1] "have treatment plan Wed Jun 12 08:51:33 2024"
## [1] "rescoring complex variables Wed Jun 12 08:51:33 2024"
## [1] "done rescoring complex variables Wed Jun 12 08:51:33 2024"
```

```
# Estimate effect significance (not coefficient significance).
estSigGLM <- function(xVar,yVar,numberOfHiddenDegrees=0) {
d <- data.frame(x=xVar,y=yVar,stringsAsFactors = FALSE)
model <- stats::glm(stats::as.formula('y~x'),
data=d,
family=stats::binomial(link='logit'))
delta_deviance <- model$null.deviance - model$deviance
delta_df <- model$df.null - model$df.residual + numberOfHiddenDegrees
pRsq <- 1.0 - model$deviance/model$null.deviance
sig <- stats::pchisq(delta_deviance, delta_df, lower.tail=FALSE)
sig
}
prepD <- vtreat::prepare(treatmentsC,d,pruneSig=c())
```

```
## Warning in prepare.treatmentplan(treatmentsC, d, pruneSig = c()): possibly
## called prepare() on same data frame as
## designTreatments*()/mkCrossFrame*Experiment(), this can lead to over-fit. To
## avoid this, please use mkCrossFrame*Experiment$crossFrame.
```

vtreat produces good variable significances using out of sample simulation (cross frames).

```
## varName rsq sig extraModelDegrees
## 1 catVarNoise_catB 0.001071310 2.231012e-01 249
## 2 catVarPerfect_catP 0.001512584 1.477140e-01 473
## 3 catVarPerfect_catB 0.695865944 1.125597e-211 473
```

For categorical targets we have in the `scoreFrame`

the
`sig`

column is the significance of the single variable
logistic regression using the named variable (plus a constant term), and
the `rsq`

column is the “pseudo-r-squared” or portion of
deviance explained (please see here
for some notes). For numeric targets the `sig`

column is the
significance of the single variable linear regression using the named
variable (plus a constant term), and the `rsq`

column is the
“r-squared” or portion of variance explained (please see here)
for some notes).

Signal carrying complex variables can score as significant, even those composed of rare levels.

```
##
## Call:
## glm(formula = y ~ d$catVarPerfect == "lev001T", family = binomial,
## data = d)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.06014 0.06334 -0.949 0.342
## d$catVarPerfect == "lev001T"TRUE 13.62620 378.59287 0.036 0.971
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1385.5 on 999 degrees of freedom
## Residual deviance: 1382.6 on 998 degrees of freedom
## AIC: 1386.6
##
## Number of Fisher Scoring iterations: 12
```

`## Warning: glm.fit: algorithm did not converge`

`## [1] 2.958641e-303`

```
estSigGLM(prepD$catVarPerfect_catB,prepD$y,
numberOfHiddenDegrees=length(unique(d$catVarPerfect))-1)
```

`## Warning: glm.fit: algorithm did not converge`

`## [1] 3.963376e-90`

Noise variables (those without a relation to outcome) are also scored correctly as long was we account for the degrees of freedom.

```
##
## Call:
## glm(formula = y ~ d$catVarNoise == "lev001", family = binomial,
## data = d)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.05624 0.06340 -0.887 0.375
## d$catVarNoise == "lev001"TRUE 0.05624 1.00201 0.056 0.955
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1385.5 on 999 degrees of freedom
## Residual deviance: 1385.5 on 998 degrees of freedom
## AIC: 1389.5
##
## Number of Fisher Scoring iterations: 3
```

`## [1] 1.223667e-63`

`## [1] 0.07074029`