---
title: "NetCDF with tidync"
author: "Michael D. Sumner"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    fig_width: 10
    fig_height: 10
vignette: >
  %\VignetteIndexEntry{NetCDF with tidync}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(pillar.subtle = FALSE, pillar.sigfig = 4)
```

The goal of tidync is to ease exploring the contents of a NetCDF source and to simplify the process of data extraction.  

Array dimension expressions virtually *slice* (i.e. *index*) into data arrays with immediate feedback, and ease programming for automating data extraction. No data is read until explicitly requested, and can be pulled directly as an array, or in long-form as a data frame. 

NetCDF is **Network Common Data Form**, a [data model][not-a-format] and an
[API](https://en.wikipedia.org/wiki/Application_programming_interface).

This is a very common, and *very general* way to store and work with scientific
array-based data. NetCDF is defined and provided by
[Unidata](https://www.unidata.ucar.edu/software/netcdf/).

Here we introduce traditional concepts of NetCDF, and show examples with
built-in files and online sources to demonstrate tidync functionality.

# NetCDF

NetCDF is a very widely used file format for storing array-based data as
*variables*. The **space** occupied by a **variable** is defined by its
**dimensions** and their metadata. Dimensions are by definition
*one-dimensional* (i.e. an atomic vector in R of length 1 or more), an array with coordinate metadata, units,
type and interpretation. The **space** of a variable is defined as one or more
of the dimensions in the file. A given variable won't necessarily use all the
available dimensions and no dimensions are mandatory or particularly special.

Some conventions exist to define usage and minimal standards for metadata for particular file schemas, but these are many and varied, and not always adhered to. Tidync does not provide a data interpretation as it is intended for general use, including by tools that create data formats. 

The R community is not particularly strong with use of NetCDF, though it
is common and widely used it pales compared to use in general climate science work, and there the most used tool is the [CDO Climate Data Operators](https://code.mpimet.mpg.de/projects/cdo).  In R the most common
tools used are ncdf4 and raster (which uses ncdf4). 

Both the RNetCDF and ncdf4 packages provide a traditional summary format,
familiar to many NetCDF users as the output of the command line program
[`ncdump`](https://www.unidata.ucar.edu/software/netcdf/workshops/most-recent/utilities/Ncdump.html).


```{r}
ice_file <- system.file("extdata", "ifremer", "20171002.nc", package = "tidync", mustWork = TRUE)
library(RNetCDF)
print.nc(open.nc(ice_file))
```

Using `ncdump` at the command line on a suitable system would yield very similar
output to the print above.

```bash 
ncdump -h /path/to/extdata/ifremer/20171002.nc
```

With the ncdf4 package it's a slightly different approach, but gives the same
result.

```R
print(ncdf4::nc_open(ice_file))
```

Notice how the listing above is organized by *dimension* and then by *variable*.
It's not particularly obvious that some variables are defined within the same
set of dimensions as others.

A NetCDF file is a container for simple array based data structures. There is
limited capacity ^[I.e. it's not possible to query a file for an arbitrary sparse set of values, without constructing a degenerate hyperslab query for each point or reading a hyperslab containing cells not in the set.  Do you know different? Please let me know!] in the formal API for accessing data randomly
within a variable, the primary mechanism is to define offset and stride (start
and count) hyperslab indexes.



## tidync

Tidync aims to ease exploration of the contents of a NetCDF file and provides methods extract arbitrary hyperslabs. These can be used directly in array contexts, or in "long form" database contexts. 

On first contact with the file, the available variables are classified by grid and
dimension.  The "active" grid is the one that queries may be made against, and may be changed with the `activate` function. 


```{r}
library(tidync)
tidync(ice_file)
```

Here we see variables are clearly grouped by the *grid* they exist in, where
grid is a specific (and ordered!) set of dimensions. This allows us to see the
set of variables that implicitly co-exist, they have the same *shape*.  The
first grid "D0,D1,D2" has two variables, *concentration* and *quality_flag*, and
the second "D2" has only one variable *time*. There are no general rules here, a
file might have any number of dimensions and variables, and any variable might
be defined by one or more dimensions.

In this case the D2 grid has only one variable in its single dimension, and it
happens to be a special kind of variable - a "coordinate dimension", as
indicated by the `coord_dim` flag. In the traditional `ncdump` summary above
it's easy to see there's only really one data grid, in `ni,nj,time` that it
holds two variables, and that time is a special coordinate dimension - in
contrast neither `ni` or `nj` have an explicit 1-dimension variable. When there
are many dimensions and/or many variables those patterns are *not* easy to see.

A particular grid was chosen by default, this is the "D0,D1,D2" grid composed of
3 dimensions, generally the largest grid will be chosen as that is *usually* the
target we would be after. To choose a different grid we may nominate it by name,
or by member variable.

By name we choose a grid composed of only one dimension, and the summary print
makes a distinction based on which dimensions are *active*.

```{r activate}
tidync(ice_file) %>% activate("D2")
```

It is also possible to choose a grid by *variable name*. If we choose a variable in the default
grid, then it's no different to the first case.

```{r  NSE-activate}
tidync(ice_file) %>% activate("time")

## choose grid by variable name, which happens to be the default grid here
tidync(ice_file) %>% activate("quality_flag")

## same as the default
tidync(ice_file) %>% activate("D0,D1,D2")
```

The data variables available on a grid can be expanded out as a single data
frame, which all the coordinates copied out - this is not efficient(!) but if we
craft our queries sensibly to read only what we need, it's a very easy way to
explore the data in a file.

The 'hyper_filter' function allows specification of expressions to subset a variable based on each dimension's coordinate values.  

If no expressions are included we are presented with a table containing a row
for each dimension, its extent in coordinates and its length. For convenience we
also assign the activate form to an R variable, though we could just chain the
entire operation without this.

```{r}
concentration <- tidync(ice_file) 
concentration %>% hyper_filter() 
```


By specifying inequality expressions we see an *implicit* subsetting of the
array. Everything so far is implicit to delay any file-based computation
required to actually interact with the file and read from it.

Notice that these are "name = expr" paired expressions, because the right hand
side may be quite general we need the left hand side name to be assured of the
name of the dimension referred to.

```{r}
concentration %>% hyper_filter(nj = nj < 20)
```

We can also use the special internal variable 'index', which will test against
position in the dimension elements '1:length' rather than the values. It's not
different in this case because ni and nj are just position dimensions anyway.
The special 'dplyr' adverbs like 'between' will work.

```{r}
concentration %>% 
  hyper_filter(ni = index < 50, 
               nj = dplyr::between(index, 30, 100))
```

## Data extraction

How to use these idioms to extract actual data?

We can now exercise these variable choice and dimension filters to return actual
data, either in by slicing out a  "slab" in array-form, or as a data frame.

```{r}
hf <- concentration %>% 
  hyper_filter(ni = index > 150, 
               nj = dplyr::between(index, 30, 100))

## as an array
arr <- hf %>% hyper_array()
str(arr)

## as a data frame
hf %>% 
  hyper_tibble() %>% 
  dplyr::filter(!is.na(concentration)) %>% dplyr::distinct(concentration, quality_flag)
```

## Interrogating data by dimensions

The connection object 'hf' is available for determining what is available and how we might cut into it. 'time' interestingly is of length 1, so perhaps adds nothing to the information about this otherwise 2D data set. If we were to open this file in `ncdf4` or `RNetCDF` and wanted to take a subset of the file out, we would have to specify and `start` of 1 and a `count` of ` even though it's completely redundant. 


```{r sea-ice-example}
hf
```


The 'start' and 'count' values reported are directly
useable by the traditional API tools, and in particular by the functions
`ncdf4::ncvar_get()` (`varid`, `start`, `count`), its counterpart in
`RNetCDF::var.get.nc()` and command line tools like CDO.


But, it's a pain to have to know the dimension of the variable and specify every slot. Code in the traditional API that looks like this

```{r eval= FALSE}
## WARNING, pseudocode
var_get(con, variable, start = c(1, 1, 1), count = c(10, 5, 1))
```

Obviously, it should be reasonable to specify only the count of any dimensions that *we don't want the entirety of*. This problem manifests exactly in R arrays generally, we can't provide information only about the dimensions we want, they have to be specified explicitly - even if we mean *all of it*. 

It does not matter what we include in the filter query, it can be all, some or none of the available dimensions, in any order. 

```{r dimension-index,eval=TRUE}
hf %>% hyper_filter(nj = index < 20, ni = ni > 20)

hf %>% hyper_filter(nj = index < 20)
```

The other requirement we have is the ability to automate these task, and so far we have only interacted with the dimensionality information from the print out. For programming, there are functions `hyper_vars()`, `hyper_dims()` and `hyper_grids()` to report on elements in the source. The value of `hyper_vars()` and `hyper_dims()` is restricted to the *active* grid. The function `hyper_grids()` reports all available grids, with the currently active one indicated. The name of the current grid is also available via `active()`. 


```{r}
hyper_vars(hf)
hyper_dims(hf)

## change the active grid
hf %>% activate("D2") %>% 
  hyper_vars()

active(hf)

hf %>% activate("D2") %>%
  active()
```

## Transforms

Under the hood tidync manages the relationship between dimensions and coordinates via *transforms* tables. This is a 1-dimensional mapping between dimension index and its coordinate value (if there's no explicit value it is assumed to be equal to the 1-based index). These tables are available via the `hyper_transforms()` function. 

Each table has columns `<dimension-name>`, `index`, `id`, `name`, `coord_dim` (whether the dimension has explicit coordinates, in the `<dimension-name>` column), and `selected`. The `selected` column records which of the dimension elements is currently requested by a `hyper_filter` query, and by default is set to `TRUE`. Expressions in the filter function work by updating this column. 


```{r}
hyper_transforms(hf)
```

## Future improvements

* support multiple sources at once for lazy read of a virtual grid
* support groups
* support compound types
* allow a group-by function for a polygon layer, against a pair of dimensions to
classify cells
* allow a truly DBI and dplyr level of lazy read, with more filter, select,
mutate and collect idioms
* provide converters to raster format, stars format


[not-a-format]: https://twitter.com/TedHabermann/status/958034585002041344