--- title: "Quality Control of IFCB Data" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Quality Control of IFCB Data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## Introduction This vignette provides an overview of quality control (QC) methods for Imaging FlowCytobot (IFCB) data using the `iRfcb` package. The package offers tools to analyze Particle Size Distribution (PSD) following Hayashi et al. (2025), verify geographical positions, and integrate contextual data from sources like ferrybox systems. These QC workflows ensure high-quality datasets for phytoplankton and microzooplankton monitoring in marine ecosystems. You'll learn how to: - Set up the `iRfcb` package and `Python` environment. - Analyze particle size distributions for data quality. - Check spatial metadata for proximity to land, basin classification, and missing positions. Follow this tutorial to streamline the QC process and ensure reliable IFCB data. ## Getting Started ### Installation You can install the package from CRAN using: ```{r, eval=FALSE} install.packages("iRfcb") ``` Some functions from the `iRfcb` package used in this tutorial require `Python` to be installed. You can download `Python` from the official website: [python.org/downloads](https://www.python.org/downloads/). The `iRfcb` package can be configured to automatically activate an installed Python virtual environment (venv) upon loading by setting an environment variable. For more details, please refer to the package [README](https://europeanifcbgroup.github.io/iRfcb/). Load the `iRfcb` library: ```{r, eval=FALSE} library(iRfcb) ``` ```{r, include=FALSE} library(iRfcb) ``` ```{r, include=FALSE} library(reticulate) # Define path to virtual environment env_path <- file.path(tempdir(), "iRfcb") # Or your preferred venv path # Install python virtual environment tryCatch({ ifcb_py_install(envname = env_path) }, error = function(e) { warning("Python environment could not be installed.") }) ``` ```{r, echo=FALSE} # Check if Python is available if (!py_available(initialize = TRUE)) { knitr::opts_chunk$set(eval = FALSE) warning("Python is not available. Skipping vignette evaluation.") } else { # List available packages available_packages <- py_list_packages(python = reticulate::py_discover_config()$python) # Check if pandas and matplotlib are available if (!"pandas" %in% available_packages$package || !"matplotlib" %in% available_packages$package) { knitr::opts_chunk$set(eval = FALSE) warning("Required python modules are not available. Skipping vignette evaluation.") } } ``` ### Download Sample Data To get started, download sample data from the [SMHI IFCB Plankton Image Reference Library](https://doi.org/10.17044/scilifelab.25883455.v3) (Torstensson et al. 2024) with the following function: ```{r, eval=FALSE} # Define data directory data_dir <- "data" # Download and extract test data in the data folder ifcb_download_test_data( dest_dir = data_dir, max_retries = 10, sleep_time = 30, verbose = FALSE ) ``` ```{r, include=FALSE} # Define data directory data_dir <- "data" # Download and extract test data in the data folder if (!dir.exists(data_dir)) { # Download and extract test data if the folder does not exist ifcb_download_test_data( dest_dir = data_dir, max_retries = 10, sleep_time = 30, verbose = FALSE ) } ``` ## Particle Size Distribution IFCB data can be quality controlled by analyzing the particle size distribution (PSD) (Hayashi et al. 2025). `iRfcb` uses the code available at [https://github.com/kudelalab/PSD](https://github.com/kudelalab/PSD), which is efficient in detecting samples with bubbles, beads, incomplete runs etc. Before running the PSD quality check, ensure the necessary Python environment is set up and activated: ```{r, eval=FALSE} # Define path to virtual environment env_path <- "~/.virtualenvs/iRfcb" # Or your preferred venv path # Install python virtual environment ifcb_py_install(envname = env_path) # Run PSD quality control psd <- ifcb_psd( feature_folder = "data/features/2023", hdr_folder = "data/data/2023", save_data = FALSE, output_file = NULL, plot_folder = NULL, use_marker = FALSE, start_fit = 10, r_sqr = 0.5, beads = 10 ** 12, bubbles = 150, incomplete = c(1500, 3), missing_cells = 0.7, biomass = 1000, bloom = 5, humidity = 70 ) ``` ```{r, include=FALSE} # Run PSD quality control psd <- ifcb_psd( feature_folder = "data/features/2023", hdr_folder = "data/data/2023", save_data = FALSE, output_file = NULL, plot_folder = NULL, use_marker = FALSE, start_fit = 10, r_sqr = 0.5, beads = 10 ** 12, bubbles = 150, incomplete = c(1500, 3), missing_cells = 0.7, biomass = 1000, bloom = 5, humidity = 70, fea_v = 2, # Use v2 features ) ``` The results can be printed and visualized through plots: ```{r} # Print output from PSD head(psd$fits) head(psd$flags) # Plot PSD of the first sample plot <- ifcb_psd_plot( sample_name = psd$data$sample[1], data = psd$data, fits = psd$fits, start_fit = 10 ) # Print the plot print(plot) ``` ## Geographical QC/QA ### Check If IFCB Is Near Land To determine if the IFCB is near land (i.e. ship in harbor), examine the position data in the `.hdr` files (or from vectors of latitudes and longitudes): ```{r} # Read HDR data and extract GPS position (when available) gps_data <- ifcb_read_hdr_data( "data/data/", gps_only = TRUE, verbose = FALSE # Do not print progress bar ) # Create new column with the results gps_data$near_land <- ifcb_is_near_land( gps_data$gpsLatitude, gps_data$gpsLongitude, distance = 100, # 100 meters from shore shape = NULL # Using the default NE 1:10m Land Polygon ) # Print output head(gps_data) # Alternatively, you can choose to plot the points on a map near_land_plot <- ifcb_is_near_land( gps_data$gpsLatitude, gps_data$gpsLongitude, distance = 2500, # 2500 meters from shore plot = TRUE, ) # Print the plot print(near_land_plot) ``` For more accurate determination, a detailed coastline `.shp` file may be required (e.g. the [EEA Coastline Polygon](https://www.eea.europa.eu/data-and-maps/data/eea-coastline-for-analysis-2/gis-data/eea-coastline-polygon)). Refer to the help pages of `ifcb_is_near_land()` for further information. ### Check Which Sub-Basin an IFCB Sample Is From To identify the specific sub-basin of the Baltic Sea (or using a custom shape-file) from which an IFCB sample was collected, analyze the position data: ```{r} # Define example latitude and longitude vectors latitudes <- c(55.337, 54.729, 56.311, 57.975) longitudes <- c(12.674, 14.643, 12.237, 10.637) # Check in which Baltic sea basin the points are in points_in_the_baltic <- ifcb_which_basin(latitudes, longitudes, shape_file = NULL) # Print output print(points_in_the_baltic) # Plot the points and the basins ifcb_which_basin(latitudes, longitudes, plot = TRUE, shape_file = NULL) ``` This function reads a pre-packaged shapefile of the Baltic Sea, Kattegat, and Skagerrak basins from the `iRfcb` package by default, or a user-supplied shapefile if provided. The shapefiles provided in `iRfcb` originate from [SHARK](https://shark.smhi.se/hamta-data/). ### Check If Positions Are Within the Baltic Sea or Elsewhere This check is useful if only you want to apply a classifier specifically to phytoplankton from the Baltic Sea. ```{r} # Define example latitude and longitude vectors latitudes <- c(55.337, 54.729, 56.311, 57.975) longitudes <- c(12.674, 14.643, 12.237, 10.637) # Check if the points are in the Baltic Sea Basin points_in_the_baltic <- ifcb_is_in_basin(latitudes, longitudes) # Print results print(points_in_the_baltic) # Plot the points and the basin ifcb_is_in_basin(latitudes, longitudes, plot = TRUE) ``` This function reads a land-buffered shapefile of the Baltic Sea Basin from the `iRfcb` package by default, or a user-supplied shapefile if provided. ### Find Missing Positions from RV Svea Ferrybox This function is used by SMHI to collect and match stored ferrybox positions when they are not available in the `.hdr` files. An example ferrybox data file is provided in `iRfcb` with data matching sample **D20220522T000439_IFCB134**. ```{r} # Print available coordinates from .hdr files head(gps_data, 4) # Define path where ferrybox data are located ferrybox_folder <- "data/ferrybox_data" # Get GPS position from ferrybox data positions <- ifcb_get_ferrybox_data(gps_data$timestamp, ferrybox_folder) # Print result head(positions) ``` ### Find Contextual Ferrybox Data from RV Svea The `ifcb_get_ferrybox_data()` function can also be used to extract additional ferrybox parameters, such as temperature (parameter number **8180**) and salinity (parameter number **8181**). ```{r} # Get salinity and temperature from ferrybox data ferrybox_data <- ifcb_get_ferrybox_data(gps_data$timestamp, ferrybox_folder, parameters = c("8180", "8181")) # Print result head(ferrybox_data) ``` This concludes this tutorial for the `iRfcb` package. For additional guides—such as data sharing and MATLAB integration—please refer to the other tutorials available on the project's [webpage](https://europeanifcbgroup.github.io/iRfcb/). See how data pipelines can be constructed using `iRfcb` in the following [Example Project](https://github.com/nodc-sweden/ifcb-data-pipeline). Happy analyzing! ```{r, include=FALSE} # Clean up unlink(data_dir, recursive = TRUE) unlink(env_path, recursive = TRUE) ``` ## Citation ```{r, echo=FALSE} # Print citation citation("iRfcb") ``` ## References - Hayashi, K., Enslein, J., Lie, A., Smith, J., Kudela, R.M., 2025. Using particle size distribution (PSD) to automate imaging flow cytobot (IFCB) data quality in coastal California, USA. International Society for the Study of Harmful Algae. https://doi.org/10.15027/0002041270 - Torstensson, A., Skjevik, A-T., Mohlin, M., Karlberg, M. and Karlson, B. (2024). SMHI IFCB Plankton Image Reference Library. SciLifeLab. Dataset. https://doi.org/10.17044/scilifelab.25883455.v3