Think Globally, Fit Locally (Saul and Roweis 2003)
Modeling spectral data has garnered wide interest in the last four decades. Spectroscopy is the study of the spectral response of a matrix (e.g. soil, plant material, seeds, etc.) when it interacts with electromagnetic radiation. This spectral response relates, directly or indirectly, to a wide range of compositional characteristics (chemical, physical or biological) of the matrix. Therefore, it is possible to develop empirical models that can accurately quantify properties of different matrices. In this respect, quantitative spectroscopy techniques are usually fast, non-destructive and cost-efficient in comparison to the conventional laboratory methods used to analyze such matrices. This has resulted in the development of comprehensive spectral databases for several agricultural products, comprising large numbers of observations. As these databases grow in size, they also grow in complexity. To analyze large and complex spectral data, one must then resort to numerical and statistical tools and methods such as dimensionality reduction and local spectroscopic modeling based on spectral dissimilarity concepts.
The aim of the resemble package is to provide tools to efficiently and
accurately extract meaningful quantitative information from large and complex
spectral databases. The core functionalities of the package include:
To find out how to cite the package, simply type the following and you will get the info you need:
citation(package = "resemble")
##
## To cite resemble in publications use:
##
## Ramirez-Lopez, L., and Stevens, A., and Viscarra Rossel, R., and
## Shen, Z., and Wadoux, A., and Breure, T. (2024). resemble: Regression
## and similarity evaluation for memory-based learning in spectral
## chemometrics. R package Vignette R package version 2.2.3.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{resemble-package,
## title = {resemble: Regression and similarity evaluation for memory-based learning in spectral chemometrics. },
## author = {Leonardo Ramirez-Lopez and Antoine Stevens and Claudio Orellano and Raphael {Viscarra Rossel} and Zefang Shen and Alex Wadoux and Timo Breure},
## publication = {R package Vignette},
## year = {2024},
## note = {R package version 2.2.3},
## url = {https://CRAN.R-project.org/package=resemble},
## }
This vignette uses the soil Near-Infrared (NIR) spectral dataset provided in the
prospectr package (Stevens and Ramirez-Lopez 2024). We use this dataset because
soils are one of the most complex matrices analyzed with NIR spectroscopy. This
spectral dataset/library was used in the challenge by
Pierna and Dardenne (2008). The library contains NIR absorbance spectra of 825
dried and sieved soil observations/samples. These samples were collected from
agricultural fields across the Walloon region in Belgium. The data are in an R
data.frame object which is organized as follows:
Response variables:
Nt (Total Nitrogen in g/kg of dry soil): a numerical variable (values are available for 645 samples and missing for 180 samples).
Ciso (Carbon in g/100 g of dry soil): a numerical variable (values are available for 732 samples and missing for 93 samples).
CEC (Cation Exchange Capacity in meq/100 g of dry soil): a numerical variable (values are available for 447 samples and missing for 378 samples).
Predictor variables: the predictor variables are in a matrix embedded in
the data frame, which can be accessed via NIRsoil$spc. These variables
contain the NIR absorbance spectra of the samples, recorded between
1100 nm and 2498 nm of the electromagnetic spectrum at 2 nm intervals. Each
column name in the matrix of spectra represents a specific wavelength (in nm).
Set: a binary variable that indicates whether the samples belong to the training subset (represented by 1, 618 samples) or to the test subset (represented by 0, 207 samples).
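A data frame that carries a whole matrix as one of its columns can be unfamiliar. The sketch below builds a toy stand-in using base R only, to show how such an object behaves. The object name toy, its values, and its three-sample size are made up for illustration; the column names simply mirror the description above and are not taken from the real dataset.

```r
# Toy stand-in for the structure described above (all values are invented).
wavelengths <- seq(1100, 1106, by = 2)   # four wavelengths on a 2 nm grid

spectra <- matrix(
  c(0.31, 0.32, 0.33, 0.34,
    0.41, 0.42, 0.43, 0.44,
    0.51, 0.52, 0.53, 0.54),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, as.character(wavelengths))
)

# Response and set-indicator columns; NA marks a missing laboratory value.
toy <- data.frame(Nt = c(1.2, NA, 0.8), Set = c(1, 1, 0))

# Assigning a matrix with matching rows stores it as ONE data frame column.
toy$spc <- spectra

dim(toy)       # rows x columns of the data frame (spc counts as one column)
dim(toy$spc)   # rows x wavelengths of the embedded matrix

# The wavelengths are recovered from the column names of the matrix.
as.numeric(colnames(toy$spc))

# Rows flagged as training samples, with their spectra.
toy$spc[toy$Set == 1, , drop = FALSE]
```

Keeping the spectra in a single matrix column means the response variables and the predictors stay aligned row by row when the data frame is subset.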
Load the necessary packages and data:
The dataset can be loaded into R as follows:
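A minimal sketch of these two steps follows, assuming that resemble, prospectr and magrittr are installed. The magrittr package is assumed here only because the pre-processing code further down uses the %>% pipe. The inspection calls at the end are an optional addition for checking the object against the description above.

```r
# Packages assumed to be installed from CRAN.
library(resemble)
library(prospectr)   # provides the NIRsoil dataset and the pre-processing functions
library(magrittr)    # provides the %>% pipe used later on

# Load the dataset into the workspace.
data(NIRsoil)

# Optional checks against the description above.
dim(NIRsoil)         # number of samples and of top-level columns
dim(NIRsoil$spc)     # number of samples and of wavelengths

# Number of samples with a non-missing value for each response variable.
colSums(!is.na(NIRsoil[, c("Nt", "Ciso", "CEC")]))
```

The counts returned by the last call should agree with the numbers of available values listed for Nt, Ciso and CEC.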
This step aims at improving the signal quality of the spectra for quantitative
analysis. In this respect, the following standard methods are applied using the
prospectr package (Stevens and Ramirez-Lopez 2024):
# obtain a numeric vector of the wavelengths at which the spectra are recorded
wavs <- NIRsoil$spc %>% colnames() %>% as.numeric()

# pre-process the spectra:
# - resample them to a resolution of 5 nm (new_res)
# - apply a first-order Savitzky-Golay derivative
#   (first-order polynomial, 5-point window)
new_res <- 5
poly_order <- 1
window <- 5
diff_order <- 1
NIRsoil$spc_p <- NIRsoil$spc %>%
  resample(wav = wavs, new.wav = seq(min(wavs), max(wavs), by = new_res)) %>%
  savitzkyGolay(p = poly_order, w = window, m = diff_order)
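To see what these settings do to the wavelength axis, the grid arithmetic can be worked through in base R, without the spectra. This is only a sketch. It assumes the original grid runs from 1100 nm to 2498 nm in 2 nm steps, as described earlier, and that the Savitzky-Golay filter drops (window - 1) / 2 points at each edge of the spectrum. The second assumption is worth confirming against dim(NIRsoil$spc_p). The names wavs_full, new_wavs, half_window and kept_wavs are illustrative.

```r
# Original wavelength grid as described for the dataset (assumption: 2 nm steps).
wavs_full <- seq(1100, 2498, by = 2)

new_res <- 5   # resampling step in nm
window  <- 5   # Savitzky-Golay window size in points

# Grid produced by the resampling step.
new_wavs <- seq(min(wavs_full), max(wavs_full), by = new_res)

length(wavs_full)   # 700 original wavelengths
length(new_wavs)    # 280 wavelengths after resampling
range(new_wavs)     # 1100 2495 (2498 is not a multiple of 5 nm from 1100)

# Assumption: the filter loses (window - 1) / 2 points on each side.
half_window <- (window - 1) / 2
kept_wavs <- new_wavs[(half_window + 1):(length(new_wavs) - half_window)]

length(kept_wavs)   # 276 wavelengths expected in the filtered spectra
range(kept_wavs)    # 1110 2485
```

Under these assumptions the pre-processed matrix would have 276 columns, so any later step that indexes wavelengths should use the column names of the pre-processed spectra rather than the original wavs vector.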