GCalignR Step by Step

Meinolf Ottensmann, Martin A. Stoffel, Hazel J. Nichols and Joseph I. Hoffman

Introduction

We developed GCalignR primarily as an alignment tool for GC-FID data based on peak retention times, but other types of data that contain retention time information for peaks (e.g. GC-MS) are supported as well. GCalignR implements a fast and objective method to cluster putatively homologous substances prior to multivariate statistical analyses. Using the package's visualisations, the resulting alignments can then be fine-tuned. The input format of GCalignR is a peak list, which comprises retention times and arbitrary variables (e.g. peak height, peak area) that characterise each peak in a dataset.

The implemented algorithm is based purely on retention time data, which is why the quality of the alignment depends strongly on the quality of the raw data and on the parameters used for the initial peak detection. In other words, the more cleanly the peaks were extracted from the chromatograms, the better the alignment will be. GCalignR has been created for situations where the main research interest lies in exploring broader patterns rather than the specific function of a certain chemical, which is unlikely to be determined correctly in all cases when only retention times are used. We also recommend double-checking the resulting alignment against mass-spectrometry data, where available. Furthermore, replicates that were analysed using both GC-MS and GC-FID with identical gas chromatography settings can be aligned together and may be used to identify peaks and validate alignment results.

This vignette gives a quick introduction to using GCalignR. A more detailed description of the background can be found in our manuscript (Ottensmann et al. 2018) and in the second vignette of the package, “GCalignR: How does the Algorithm work?”.

GCalignR workflow in a larger context

In the flow diagram below, we visualise how GCalignR fits into a complete workflow for analysing chemical data. After (1) analysis of the chemical samples with GC-FID, (2) software that is often proprietary is used to extract a list of peaks (retention times, peak areas, and often peak heights and other variables). Steps (3)-(7) are the alignment steps within GCalignR, detailed below. After alignment and normalisation, the output can be used as input for multivariate statistics in other packages such as vegan (8).

Extended Workflow using GCalignR in the analysis of chemical similarity patterns.

Installation

The development version can be downloaded from GitHub with the following code:

install.packages("devtools") 
devtools::install_github("mottensmann/GCalignR", build_vignettes = TRUE) 
library("GCalignR") 

The package documentation can be accessed with:

?GCalignR # documentation

The functions below form the core of GCalignR:

The alignment algorithm

We developed an alignment procedure that involves three sequential steps to align and finally match peaks belonging to putatively homologous substances across samples. The first step is to align each sample to a reference sample while maximising overall similarity through linear shifts of retention times. This procedure is often described in the literature as ‘full alignment’. In the second step, individual peaks are sorted into rows based on close similarity of their retention times, a procedure that is often referred to as ‘partial alignment’. Finally, there is still a chance that homologous peaks are sorted into different, but adjacent, rows in different samples, depending on the variability of their retention times. Consequently, a third step merges rows representing putatively homologous substances.

The alignment algorithm implemented in the align_chromatograms function comprises the following steps (here, a peak list refers to all peaks extracted from a given sample chromatogram):

  1. The first step in the alignment procedure corrects systematic linear shifts in retention times between the peaks of a query sample and a fixed reference. By default, the sample that is on average most similar to all other samples is automatically selected as the reference. Linear shifts, bounded by the user-defined parameter max_linear_shift, are applied to all retention times of a sample to maximise its similarity to the reference.

  2. The second step in the alignment procedure aligns individual peaks across samples by comparing the peak retention times of each sample consecutively with the mean of all previous samples. The parameter max_diff_peak2mean specifies the allowed deviation between the retention time of a peak and the mean of the previous retention times within the same row. If the deviation is larger than allowed, matrix operations are conducted to sort the peaks accordingly.

  3. The third step in the alignment procedure accounts for the fact that a number of homologous peaks will be sorted into multiple rows, which can subsequently be merged. The maximum difference in mean retention time at which two rows are still merged is specified with the min_diff_peak2peak argument.
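The idea behind the first step can be sketched in a few lines of base R. This is a deliberately simplified illustration and not the code used inside align_chromatograms: the helper names count_shared_peaks and best_linear_shift, the step and tol arguments, and the scoring rule (number of sample peaks that fall within tol of a reference peak) are assumptions made for this example. Only max_linear_shift corresponds to a parameter described above.

```r
# Toy illustration of step 1 (NOT the GCalignR implementation).
# Score: number of sample peaks lying within `tol` of any reference peak.
count_shared_peaks <- function(sample_rt, reference_rt, tol = 0.005) {
  sum(vapply(sample_rt,
             function(rt) any(abs(reference_rt - rt) <= tol),
             logical(1)))
}

# Try every shift in [-max_linear_shift, max_linear_shift] and keep the best one.
best_linear_shift <- function(sample_rt, reference_rt,
                              max_linear_shift = 0.05, step = 0.01,
                              tol = 0.005) {
  shifts <- seq(-max_linear_shift, max_linear_shift, by = step)
  scores <- vapply(shifts,
                   function(s) count_shared_peaks(sample_rt + s, reference_rt, tol),
                   numeric(1))
  # If several shifts score equally well, prefer the smallest absolute shift
  best <- which(scores == max(scores))
  shifts[best[which.min(abs(shifts[best]))]]
}

# Made-up retention times: the sample lacks one peak and elutes 0.03 units early
reference_rt <- c(5.10, 7.32, 9.85, 12.40)
sample_rt    <- reference_rt[-2] - 0.03
shift        <- best_linear_shift(sample_rt, reference_rt)
shifted_rt   <- sample_rt + shift   # the same shift is applied to every peak
```

In this toy case the search recovers a shift of about 0.03, which moves all three sample peaks back onto their counterparts in the reference. The key property shared with the procedure described above is that one shift is applied to the whole sample, so the spacing between peaks within a sample is left unchanged.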

Optional steps:

  1. Delete peaks that occur in just one sample by setting the delete_single_peak argument to TRUE

  2. Delete all peaks that occur in negative control samples by passing the names of these samples to the blanks argument
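The effect of the two optional steps can be illustrated on a small, made-up matrix of aligned retention times. This is a sketch in base R under stated assumptions, not the package's internal code: the matrix layout (rows are substances, columns are samples), the use of 0 to mark a missing peak, the sample name blank1, the order of the two filters, and the removal of the blank column after filtering are all choices made for this example.

```r
# Toy aligned retention-time matrix: rows = substances, columns = samples.
# A value of 0 means "no peak" (an assumption of this sketch).
rt <- matrix(c(5.10,  5.11,  0.00, 5.10,
               7.32,  0.00,  0.00, 0.00,
               9.85,  9.86,  9.84, 0.00,
               0.00, 12.40, 12.41, 0.00),
             nrow = 4, byrow = TRUE,
             dimnames = list(paste0("row", 1:4),
                             c("s1", "s2", "s3", "blank1")))

blanks <- "blank1"  # hypothetical name of a negative control sample

# Blank filter: drop every row with a peak in any blank, then drop the blanks
in_blank <- rowSums(rt[, blanks, drop = FALSE] > 0) > 0
rt <- rt[!in_blank, setdiff(colnames(rt), blanks), drop = FALSE]

# Single-peak filter: drop rows in which only one sample has a peak
rt <- rt[rowSums(rt > 0) > 1, , drop = FALSE]
```

Here row1 is removed because the substance also occurs in the blank, and row2 is removed because it is found in a single sample, leaving row3 and row4 for the samples s1 to s3.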

Input data

The statistical analysis of GC-FID or GC-MS data is usually based on the detection of peaks (i.e. substances) within chromatograms instead of using the whole profile. Peaks can be integrated using proprietary software or free programs. The peak data derived from a chromatogram usually contain the retention time of a given peak plus additional information, such as the area under the peak or its height, which is used in the subsequent analysis. GCalignR uses only the retention times (and not the mass spectra, which may not be available, e.g. when using gas chromatography coupled to a flame ionization detector (FID)) to align the peaks across individuals for subsequent chemometric analysis and pattern detection. The simple assumption is that peaks with similar retention times represent the same substances. However, we recommend verifying this assumption by also comparing the mass spectra (if available) of the substances of interest. The input peak list used in GCalignR is a plain text file in which all elements are separated by tabs (sep = "\t") or by any other separator, which then has to be specified with the sep argument (see ?read.table for details on separators). The decimal separator has to be the point.
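The two formatting requirements named above, tab-separated fields and a point as decimal separator, can be checked with base R before handing a file to the package. The snippet below is only a sketch of that round trip: the data frame, its column names time and area, and the single-sample layout are hypothetical and do not reproduce the full layout of a GCalignR input file.

```r
# Hypothetical peak list for a single sample; names and values are made up
peaks <- data.frame(time = c(5.10, 7.32, 9.85),
                    area = c(1200.5, 310.2, 845.0))

path <- tempfile(fileext = ".txt")

# Write with tabs as field separator and the point as decimal separator
write.table(peaks, file = path, sep = "\t", dec = ".",
            quote = FALSE, row.names = FALSE)

# Read the file back using the same separators
peaks_in <- read.table(path, header = TRUE, sep = "\t", dec = ".")
```

If a file was exported with a comma as decimal separator, read.table would import the affected columns as text rather than numbers, so a quick is.numeric() check on the retention time column is a cheap way to spot the problem.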