The **mfp** package is a collection of R (R Core Team 2022) functions targeted at the use
of fractional polynomials (FP) for modelling the influence of continuous
covariates on the outcome in regression models, as introduced by Royston and Altman (1994) and modified by Sauerbrei and Royston (1999). The model may
include binary, categorical or further continuous covariates which are
included in the variable selection process but without need of FP
transformation. It combines backward elimination with a systematic
search for a ‘suitable’ transformation to represent the influence of
each continuous covariate on the outcome. An application of
multivariable fractional polynomials (MFP) in modelling prognostic and
diagnostic factors in breast cancer is given by Sauerbrei and Royston (1999). The stability of
the models selected is investigated in Royston
and Sauerbrei (2003). Briefly, fractional polynomials models are
useful when one wishes to preserve the continuous nature of the
covariates in a regression model, but suspects that some or all of the
relationships may be non-linear. At each step of a `backfitting’
algorithm MFP constructs a fractional polynomial transformation for each
continuous covariate while fixing the current functional forms of the
other covariates. The algorithm terminates when no more covariate is
excluded and the functional forms of the continuous covariates do not
change anymore.

**mfp.object** is an object representing a fitted
**mfp** model. Class **mfp** inherits from
either glm or coxph depending on the type of model fitted. In addition
to the standard glm/coxph components the following components are
included in an **mfp** object:

**x**: the final FP transformations that are contained in the design matrix x. The covariate “z” with 4 df (second-degree FP) has corresponding columns “z.1” and “z.2” in x. A first-degree FP covariate “z” would have one column “z.1”.**powers**: a matrix containing the best FP powers for each covariate. If a covariate has less than two powers NAs will fill the appropriate cell of the matrix.**pvalues**: a matrix containing the P-values from the ‘’closed test procedure’’ together with the best powers chosen. Briefly p.null is the P-value for the test of inclusion (see mfp), p.lin corresponds to the test of nonlinearity and p.FP the test of simplification by comparing first degree (FP1) and second degree (FP2) transformations. The best first degree FP power (power2) and best second degree FP powers (power4.1 and power4.2) are also given. The numbers 2 and 4 describe the corresponding degrees of freedom.**scale**: all covariates are shifted and rescaled before being power transformed if nonpositive values are encountered or the range of values of the covariates is reasonably large. If x’ would be used instead of x where x’ = (x+a)/b the parameters a (shift) and b (scale) are contained in the matrix scale.**df.initial**: a vector containing the degrees of freedom allocated to each covariate corresponding to the degree of FP (m=4 for second degree FP, m=2 for first degree FP).**df.final**: a vector containing the degrees of freedom of each covariate at convergence of the backfitting algorithm (m=4 for second degree FP, m=2 for first degree FP, m=1 for untransformed variable, m=0 if covariate was excluded).**dev**: the deviance of the final model.**dev.lin**: the deviance of the model that uses the linear predictor of untransformed covariates.**dev.null**: the deviance of the null model.**fptable**: the table of the final fp transformations.**fit**: a call of the corresponding glm or cox model using the selected and (possibly) FP transformed variables of the final model.

An **mfp.object** will be created by application of
function **mfp**.

A typical call of an mfp model has the form **response \(\sim\) terms** where
**response** is the (numeric) response vector and
**terms** is a series of terms, separated by \(+\) operators, which specifies a linear
predictor for **response** provided by the
**formula** argument of the function call.

```
library(mfp)
#> Loading required package: survival
str(mfp)
#> function (formula, data, family = gaussian, method = c("efron", "breslow"),
#> subset = NULL, na.action = na.omit, init = NULL, alpha = 0.05, select = 1,
#> maxits = 20, keep = NULL, rescale = FALSE, verbose = FALSE, x = TRUE,
#> y = TRUE)
```

Fractional polynomial terms are indicated by **fp**.

For **binomial** models the response can also be
specified as a **factor**. If a Cox proportional hazards
model is required then the outcome need to be specified using the
**Surv()** notation.

The argument **family** describes the error distribution
and link function to be used in the model. This can be a character
string naming a family function, a family function or the result of a
call to a family function.

Argument **alpha** sets the global FP selection level
for all covariates. Different selection levels for individual covariates
can be chosen by using the **fp** function. The variable
selection level for all covariates is set by **select**.
Values for individual fractional polynomials may be set using the
**fp** function.

The function **fp** defines a fractional polynomial
object for a single input variable.

In addition to **alpha** and **select** the
**scale** argument of the **fp** function
denotes the use of pre-transformation scaling to avoid possible
numerical problems or for variables with non-positive values.

The original Stata implementation of **mfp** uses two
different selection procedures for a single continuous covariate \(x\), a sequential selection procedure and a
closed testing selection procedure (**ra2**, Ambler and Royston (2001)). In the R
implementation only the **ra2** algorithm is used which is
also the default in the Stata and SAS implementations of
**mfp**.

The **ra2** algorithm is described in Ambler and Royston (2001) and Sauerbrei and Royston (2002). It uses a closed
test procedure Marcus, Peritz, and Gabriel
(1976) which maintains approximately the correct Type I error
rate for each component test. The procedure allows the complexity of
candidate models to increase progressively from a prespecified minimum
(a null model) to a prespecified maximum (an FP) according to an ordered
sequence of test results.

The algorithm works as follows:

Perform a 4 df test at the \(\alpha\) level of the best-fitting second-degree FP against the null model. If the test is not significant, drop \(x\) and stop, otherwise continue.

Perform a 3 df test at the \(\alpha\) level of the best-fitting second-degree FP against a straight line. If the test is not significant, stop (the final model is a straight line), otherwise continue.

Perform a 2 df test at the \(\alpha\) level of the best-fitting second-degree FP against the best-fitting first-degree FP. If the test is significant, the final model is the FP with \(m=2\), otherwise the FP with \(m=1\).

The tests in step 1, 2 and 3 are of overall association, non-linearity and between a simpler or more complex FP model, respectively.

We use the dataset **GBSG** which contains data from a
study of the German Breast Cancer Study Group for patients with
node-positive breast cancer.

```
data(GBSG)
str(GBSG)
#> 'data.frame': 686 obs. of 11 variables:
#> $ id : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ htreat : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 1 1 1 ...
#> $ age : int 70 56 58 59 73 32 59 65 80 66 ...
#> $ menostat: Factor w/ 2 levels "1","2": 2 2 2 2 2 1 2 2 2 2 ...
#> $ tumsize : int 21 12 35 17 35 57 8 16 39 18 ...
#> $ tumgrad : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 3 2 2 2 2 ...
#> $ posnodal: int 3 7 9 4 1 24 2 1 30 7 ...
#> $ prm : int 48 61 52 60 26 0 181 192 0 0 ...
#> $ esm : int 66 77 271 29 65 13 0 25 59 3 ...
#> $ rfst : int 1814 2018 712 1807 772 448 2172 2161 471 2014 ...
#> $ cens : int 1 1 1 1 1 1 0 0 1 0 ...
```

The response variable is recurrence free survival time
(**Surv(rfst, cens)**). Complete data on 7 prognostic
factors is available for 686 patients. The median follow-up was about 5
years, 299 events were observed for recurrence free survival time. We
use a Cox proportional hazards regression to model the hazard of
recurrence by the 7 prognostic factors of which 5 are continuous, age of
the patients in years (**age**), tumor size in mm
(**tumsize**), number of positive lymphnodes
(**posnodal**), progesterone receptor in fmol
(**prm**), estrogen receptor in fmol
(**esm**); one is binary, menopausal status
(**menostat**); and one is ordered categorical with three
levels, tumor grade (**tumgrad**). The additional variable
**htreat** describes if a hormonal therapy was applied and
is used as stratification variable.

We use **mfp** to build a model from the initial set of
7 covariates using the backfitting model selection algorithm. We set the
global variable selection level to 0.05 and use the default FP selection
level.

By using **fp()** in the model formula a fractional
polynomial transformation with possibly pre-transformation scaling is
estimated. This is done here for **tumsize**,
**posnodal**, **prm**, and
**esm**. Otherwise a linear form of the unscaled variable
is used, as for **age**. Categorical factors are included
without transformation. Hormonal therapy (**htreat**) was
used as stratification variable.

By **verbose=TRUE** the process of FP and variable
selection is printed.

```
f <- mfp(Surv(rfst, cens) ~ strata(htreat)+age+fp(tumsize)+fp(posnodal)+fp(prm)+fp(esm)
+menostat+tumgrad, family = cox, data = GBSG, select=0.05, verbose=TRUE)
#>
#> Variable Deviance Power(s)
#> ------------------------------------------------
#> Cycle 1
#> posnodal
#> 3135.218
#> 3103.245 1
#> 3081.123 0
#> 3074.213 0.5 3
#>
#> prm
#> 3095.43
#> 3074.213 1
#> 3067.746 0.5
#> 3066.502 -2 0.5
#>
#> tumgrad2
#> 3081.253
#> 3074.213 1
#>
#>
#>
#> tumgrad3
#> 3082.613
#> 3074.213 1
#>
#>
#>
#> tumsize
#> 3075.813
#> 3074.213 1
#> 3072.091 -1
#> 3071.882 -1 3
#>
#> menostat2
#> 3076.922
#> 3075.813 1
#>
#>
#>
#> age
#> 3076.922
#> 3076.922 1
#>
#>
#>
#> esm
#> 3077.795
#> 3076.922 1
#> 3073.627 3
#> 3071.028 -0.5 3
#>
#> Cycle 2
#> posnodal
#> 3152.737
#> 3108.965 1
#> 3085.051 0
#> 3077.795 0.5 3
#>
#> prm
#> 3099.562
#> 3077.795 1
#> 3071.74 0.5
#> 3070.548 0 0.5
#>
#> tumgrad2
#> 3085.024
#> 3077.795 1
#>
#>
#>
#> tumgrad3
#> 3086.686
#> 3077.795 1
#>
#>
#>
#> tumsize
#> 3077.795
#> 3076.471 1
#> 3074.077 -1
#> 3073.759 -0.5 0
#>
#> menostat2
#> 3077.795
#> 3076.973 1
#>
#>
#>
#> age
#> 3077.795
#> 3077.737 1
#>
#>
#>
#>
#> Tansformation
#> shift scale
#> posnodal 0 10
#> prm 1 100
#> tumgrad2 0 1
#> tumgrad3 0 1
#> tumsize 0 10
#> menostat2 0 1
#> age 0 1
#> esm 1 100
#>
#> Fractional polynomials
#> df.initial select alpha df.final power1 power2
#> posnodal 4 0.05 0.05 4 0.5 3
#> prm 4 0.05 0.05 1 1 .
#> tumgrad2 1 0.05 0.05 1 1 .
#> tumgrad3 1 0.05 0.05 1 1 .
#> tumsize 4 0.05 0.05 0 . .
#> menostat2 1 0.05 0.05 0 . .
#> age 1 0.05 0.05 0 . .
#> esm 4 0.05 0.05 0 . .
#>
#>
#> Transformations of covariates:
#> formula
#> age <NA>
#> tumsize <NA>
#> posnodal I((posnodal/10)^0.5)+I((posnodal/10)^3)
#> prm I(((prm+1)/100)^1)
#> esm <NA>
#> menostat <NA>
#> tumgrad tumgrad
#>
#>
#> Deviance table:
#> Resid. Dev
#> Null model 3198.026
#> Linear model 3103.245
#> Final model 3077.795
```

After two cycles the final model is selected. Of the possible input
variables tumor size (**tumsize**), menopausal status
(**menostat**), age and estrogen receptor
(**esm**) were excluded from the model. Only for variable
**posnodal** a nonlinear transformation was chosen.
Prescaling was used for esm, prm and tumsize.

Details of the model fit are given by **summary**.

```
summary(f)
#> Call:
#> coxph(formula = Surv(rfst, cens) ~ strata(htreat) + I((posnodal/10)^0.5) +
#> I((posnodal/10)^3) + I(((prm + 1)/100)^1) + tumgrad, data = GBSG)
#>
#> n= 686, number of events= 299
#>
#> coef exp(coef) se(coef) z Pr(>|z|)
#> I((posnodal/10)^0.5) 1.79086 5.99461 0.21339 8.392 < 2e-16 ***
#> I((posnodal/10)^3) -0.03251 0.96801 0.01334 -2.438 0.01478 *
#> I(((prm + 1)/100)^1) -0.21321 0.80799 0.05381 -3.963 7.41e-05 ***
#> tumgrad2 0.61624 1.85195 0.24877 2.477 0.01324 *
#> tumgrad3 0.74897 2.11482 0.26798 2.795 0.00519 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> exp(coef) exp(-coef) lower .95 upper .95
#> I((posnodal/10)^0.5) 5.995 0.1668 3.9457 9.1075
#> I((posnodal/10)^3) 0.968 1.0330 0.9430 0.9936
#> I(((prm + 1)/100)^1) 0.808 1.2376 0.7271 0.8978
#> tumgrad2 1.852 0.5400 1.1373 3.0157
#> tumgrad3 2.115 0.4729 1.2507 3.5758
#>
#> Concordance= 0.679 (se = 0.016 )
#> Likelihood ratio test= 120.2 on 5 df, p=<2e-16
#> Wald test = 116.4 on 5 df, p=<2e-16
#> Score (logrank) test = 122.9 on 5 df, p=<2e-16
```

Details of the FP transformations are given in the
**fptable** value of the resulting
**mfp.object**.

```
f$fptable
#> df.initial select alpha df.final power1 power2
#> posnodal 4 0.05 0.05 4 0.5 3
#> prm 4 0.05 0.05 1 1 .
#> tumgrad2 1 0.05 0.05 1 1 .
#> tumgrad3 1 0.05 0.05 1 1 .
#> tumsize 4 0.05 0.05 0 . .
#> menostat2 1 0.05 0.05 0 . .
#> age 1 0.05 0.05 0 . .
#> esm 4 0.05 0.05 0 . .
```

The final model uses a second degree fractional polynomial for the
number of positive lymphnodes with powers 0.5 and 3. The estimated
functional from can be visualized using
**predict.mfp**.

```
vizmfp <- predict(f, type = "terms", terms = "posnodal", seq = list(1:50), ref = list(5))
plot(vizmfp$posnodal$variable, exp(vizmfp$posnodal$contrast), type = "n", log = "y",
xlab = "posnodal", ylab = "Hazard Ratio", ylim = c(0.1, 5))
polygon(x = c(vizmfp$posnodal$variable, rev(vizmfp$posnodal$variable)),
y = exp(c(vizmfp$posnodal$contrast - 1.96 * vizmfp$posnodal$stderr,
rev(vizmfp$posnodal$contrast + 1.96 * vizmfp$posnodal$stderr))),
col = "grey", border = NA)
grid()
lines(vizmfp$posnodal$variable, exp(vizmfp$posnodal$contrast), type = "l", col = 4, lwd = 2)
```

The value **fit** of the resulting mfp object can be
used for survival curve estimation of the final model fit.

Ambler, G., and P. Royston. 2001. “Fractional Polynomial Model
Selection Procedures: Investigation of Type I Error
Rate.” *Journal of Statistical Simulation and Computation*
69: 89–108.

Marcus, R., E. Peritz, and K. Gabriel. 1976. “On Closed Test
Procedures with Special Reference to Ordered Analysis of
Variance.” *Biometrika* 76: 655–60.

R Core Team. 2022. *R: A Language and Environment for Statistical
Computing*. Vienna, Austria: R Foundation for Statistical Computing.
https://www.R-project.org/.

Royston, P., and D. G. Altman. 1994. “Regression Using Fractional
Polynomials of Continuous Covariates: Parsimonious Parametric Modelling
(with Discussion).” *Applied Statistics* 43 (3): 429–67.

Royston, P., and W. Sauerbrei. 2003. “Stability of Multivariable
Fractional Polynomial Models with Selection of Variables and
Transformations: A Bootstrap Investigation.” *Statistics in
Medicine* 22: 639–59.

Sauerbrei, W., and P. Royston. 1999. “Building Multivariable
Prognostic and Diagnostic Models: Transformation of the Predictors by
Using Fractional Polynomials.” *Journal of the Royal
Statistical Society (Series A)* 162: 71–94.

———. 2002. “Corrigendum: Building
Multivariable Prognostic and Diagnostic Models: Transformation of the
Predictors by Using Fractional Polynomials.” *Journal of the
Royal Statistical Society (Series A)* 165: 399–400.