---
title: "Minifying files with RaMS"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Minifying-files-with-RaMS}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

<!-- This vignette is created from a .Rmd file: please edit that instead. -->

```{r setup, include = FALSE}
options(rmarkdown.html_vignette.check_title = FALSE)
options(tidyverse.quiet = TRUE)
data.table::setDTthreads(2)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  out.width = "80%",
  fig.align = 'center',
  fig.height = 3,
  fig.width = 6.5
)
```

As of version 1.1.0, `RaMS` also has functions that allow irrelevant data to
be removed from the file to reduce file sizes. Like `grabMSdata`, there's one
wrapper function `minifyMSdata` that accepts mzML or mzXML files, plus a vector
of *m/z* values that should either be kept (`mz_include`) or removed 
(`mz_exclude`). The function then opens up the provided MS files and removes
data points in the MS^1^ and MS^2^ spectra that fall outside the accepted bounds.
`mz_include` is useful when only a few masses are of interest, as in targeted
metabolomics. `mz_exclude` is useful when many masses are known to be 
contaminants or interfere with peakpicking/plotting abilities. This minification
can shrink a file over three orders of magnitude, decreasing both
processing time and memory allocation later in the pipeline.

This is also very useful for creating demo MS files - `RaMS` uses these
functions to produce the sample data in `extdata`, with 6 MS files taking up
less than 5 megabytes of disk space. Many other programs provide the ability
to shrink files, but none (known to me) shrink files by excluding *m/z*
values and instead can only remove certain retention times.

Below, we begin with a large MS file containing both MS^1^ and MS^2^ data 
and extract only the data corresponding to
valine/glycine betaine and homarine.

```{r minifyfile, results='hide'}
library(RaMS)
msdata_files <- list.files(
  system.file("extdata", package = "RaMS"), full.names = TRUE, pattern = "mzML"
)[1:4]

initial_filename <- msdata_files[1]
output_filename <- gsub(x=paste0("minified_", basename(initial_filename)), "\\.gz", "")

masses_of_interest <- c(118.0865, 138.0555)
minifyMSdata(files = initial_filename, output_files = output_filename, 
             mz_include = masses_of_interest, ppm = 10, warn = FALSE)
```

Then, when we open the file up (with `RaMS` or other software) we are left with
the data corresponding only to those compounds:

```{r opendata, results='hide'}
init_msdata <- grabMSdata(initial_filename)
msdata <- grabMSdata(output_filename)
```

```{r showmini}
knitr::kable(head(msdata$MS1, 3))
knitr::kable(head(msdata$MS2, 3))
```

Both the TIC and BPC are updated to reflect the smaller file size as well:

```{r TICBPCdiffs}
par(mfrow=c(2, 1), mar=c(2.1, 2.1, 1.1, 0.1))
plot(init_msdata$BPC$rt, init_msdata$BPC$int, type = "l", main = "Initial BPC")
plot(msdata$BPC$rt, msdata$BPC$int, type = "l", main = "New BPC")
```

```{r cleanmini, include=FALSE}
layout(1)
unlink(output_filename)
```

***

The `minifyMSdata` function is vectorized so the exact same syntax can be used for multiple files:

```{r vectorizedmini}
dir.create("mini_mzMLs/")
output_files <- paste0("mini_mzMLs/", basename(msdata_files))
output_files <- gsub(x=output_files, "\\.gz", "")

minifyMSdata(files = msdata_files, output_files = output_files, verbosity = 0,
             mz_include = masses_of_interest, ppm = 10, warn = FALSE)
```

```{r ggplotmini, fig.height=2}
mini_msdata <- grabMSdata(output_files, verbosity = 0)

library(ggplot2)
ggplot(mini_msdata$BPC) + geom_line(aes(x=rt, y=int, color=filename)) + theme_bw()
```

These new files are valid according to the validator provided in MSnbase, which
means that most programs should be able to open them, but this feature is still
experimental and may break on quirky data. If that happens, please feel free to
submit a bug report at https://github.com/wkumler/RaMS/issues.

```{r cleanminimulti, include=FALSE}
unlink("mini_mzMLs", recursive = TRUE)
```

***

As an example of how I use this minification function, here's the code used to
create the minified files in the `\extdata` folder that ships with the package.
This was especially useful because the package can't be more than 5MB but it's
incredibly useful to include some standalone MS data for demos and vignettes
like this one. I don't actually run this code in the vignette itself to save
compilation time but it will run if you test it yourself.

These files originate from the [Ingalls Lab](https://sites.google.com/view/anitra-ingalls) at the University of Washington, USA and are published in the manuscript [**"Metabolic consequences of cobalamin scarcity in diatoms as revealed through metabolomics"**](https://doi.org/10.1016/j.protis.2019.05.004). Files are downloaded from the corresponding [Metabolights repository](https://www.ebi.ac.uk/metabolights/MTBLS703/files).

First, we identify the *m/z* values we'd like to keep in the minified files. For
the demo data, I'll use the Ingalls Lab list of targeted compounds - those we
have authentic standards for.

```{r mzs to include, eval=FALSE}
raw_stans <- read.csv(paste0("https://raw.githubusercontent.com/",
                             "IngallsLabUW/Ingalls_Standards/",
                             "b098927ea0089b6e7a31e1758e7c7eaad5408535/",
                             "Ingalls_Lab_Standards_NEW.csv"))

mzs_to_include <- as.numeric(unique(raw_stans[raw_stans$Fraction1=="HILICPos",]$m.z))
# Include glycine betaine isotopes for README demo
mzs_to_include <- c(mzs_to_include, 119.0899, 119.0835)
```

Then, we download the raw MS data from the online repository into which it's
been deposited.

```{r download from MBLs, eval=FALSE}
if(!dir.exists("vignettes/data"))dir.create("vignettes/data")
base_url <- "ftp://ftp.ebi.ac.uk/pub/databases/metabolights/studies/public/MTBLS703/FILES/"
chosen_files <- paste0(base_url, "170223_Smp_LB12HL_", c("AB", "CD", "EF"), "_pos.mzXML")
new_names <- gsub(x=basename(chosen_files), "170223_Smp_", "")

mapply(download.file, chosen_files, paste0("vignettes/data/", new_names), 
       mode = "wb", method = "libcurl")
```

For MSMS data, we can also demo pulling data from MetabolomicsWorkbench:

```{r download from MW, eval=FALSE}
MW_url <- paste0(
  "https://www.metabolomicsworkbench.org/data/file_extract_7z.php?",
  "A=ST002830_rawdata.zip&F=TCR_081023_Fu_WorkBench%252FS30657.mzXML"
)
download.file(MW_url, destfile = "vignettes/data/S30657.mzXML")
```

Then we can actually perform the minification:

```{r minify, warning=FALSE, eval=FALSE}
library(RaMS)

if(!dir.exists("inst/extdata"))dir.create("inst/extdata", recursive = TRUE)
init_files <- list.files("vignettes/data/", full.names = TRUE)
out_files <- paste0("inst/extdata/", basename(init_files))
minifyMSdata(files = init_files, output_files = out_files, warn = FALSE,
             mz_include = mzs_to_include, ppm = 20)
```

Now we have four minified mzXML files in our inst/extdata folder. However, we'd
like to be able to demo the mzML functionality as well as that of mzXMLs, so
we can use [Proteowizard's](https://proteowizard.sourceforge.io/tools/msconvert.html) 
`msconvert` tool because `RaMS` can't convert between mzML and mzXML or vice 
versa. You'll need to install `msconvert` and add it to your path for this
step.

We also use `msconvert` to trim the files by retention time, keeping data 
between 4 and 15 minutes.

Finally, we gzip the files to get them as small as possible, also using `msconvert`.

```{r msconvertchunk, eval=FALSE}
system("msconvert inst/extdata/*.mzXML -o inst/extdata/temp --noindex")
system("msconvert --mzXML inst/extdata/*.mzXML -o inst/extdata/temp --noindex")
system('msconvert inst/extdata/temp/*.mzML --filter \"scanTime [240,900]\" -o inst/extdata -g')
system('msconvert inst/extdata/temp/*.mzXML --mzXML --filter \"scanTime [240,900]\" -o inst/extdata -g')
```

And then for the last few steps, we again rename the files (since `msconvert`
expands them to their full .raw names) and remove the ones we don't need for
the demos.

```{r renameremove, eval=FALSE}
init_files <- list.files("inst/extdata", full.names = TRUE)
new_names <- paste0("inst/extdata/", gsub(x=init_files, ".*(Smp_|Extracts_)", ""))
file.rename(init_files, new_names)

unlink("inst/extdata/temp", recursive = TRUE)
file.remove(list.files("inst/extdata", pattern = "mzXML$", full.names = TRUE))
file.remove(paste0("inst/extdata/", c("LB12HL_CD.mzXML.gz", "LB12HL_EF.mzXML.gz")))
```

To check that the new files look ok, we can see if we can read them with `RaMS`
and `MSnbase`.

```{r checkquality, eval=FALSE}
MSnbase::readMSData(list.files("inst/extdata", full.names = TRUE)[1], msLevel. = 1)
RaMS::grabMSdata(new_names[1])
```

Finally, remember to clean up the original downloads folder

```{r vigdata cleanup, eval=FALSE}
unlink("vignettes/data", recursive = TRUE)
```

***

README last built on `r Sys.Date()`