Introduction
This document delineates the practices that all creators of ecocomDP
formatted datasets should adopt in order for the community to build a
cohesive and interoperable collection. It contains detailed descriptions
of practices, definitions of concepts, and solutions to many common
issues.
Early sections address important considerations for anyone thinking
about converting source datasets to the ecocomDP model, then focus
shifts to an examination of the model components in greater detail.
These shared practices are written with the intention to help simplify
the conversion process.
If you are new to the conversion process, we recommend reading the Getting Started and Concepts sections, reviewing the Create
and Model
Overview vignettes, and referring back to this document as questions
arise. A thorough understanding of the ecocomDP model and some
foundational concepts will greatly simplify the conversion process.
Each ecocomDP dataset (Level-1; L1) is created from a raw source dataset
(Level-0; L0) by a unique conversion script. Inputs are typically from
the APIs of data repositories and monitoring networks, and outputs are a
set of archivable files. The derived ecocomDP dataset is delivered to
users in a consistent format by
read_data()
and the conversion script provides a fully reproducible and automated
routine for updating the derived dataset whenever a new version of the
source data are released.
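For example, a published L1 dataset can be pulled into an R session in a single call. This is a minimal sketch, assuming read_data() accepts the dataset identifier through its id argument; the identifier below is hypothetical.
library(ecocomDP)
# Read a published ecocomDP (L1) dataset as a list of standardized tables
# (the identifier is hypothetical)
d <- read_data(id = "edi.123.1")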
Getting Started
Identifying a candidate L0 dataset
Not all source datasets are good candidates for ecocomDP. Features of
a good candidate include datasets that:
- Are about community surveys, not population surveys. An ecological
community is a multi-species (at least two species) collection of organisms
living in a particular area, whereas a population is a collection of
organisms, all of the same species, that live in the same place.
- Form a long-term observation record, with a suggested minimum of 5
years; this minimum can be relaxed for datasets with exceptionally wide
spatial extents.
- Have ongoing data collection and will eventually form a long-term
record.
- Are accompanied by information-rich metadata written in the Ecological Metadata Language
(EML).
- Have data and metadata that are programmatically accessible from a
trustworthy, stable, and persistent host (i.e. repository or other
web-accessible data provider).
Understanding the L0 dataset
A thorough understanding of the L0 dataset is required before
performing any transformations. To gain this understanding of an L0
dataset, we recommend:
- Reading the abstract, methods, and keywords to gain insights into
the scope, structure, and temporal and spatial scales of the study.
- Reviewing the data object descriptions to understand how data are
related across tables.
- Importing the dataset into R and exploring any data-specific
questions you have (as sketched below).
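A minimal sketch of this exploration step; the URL and column names are hypothetical.
# Read an L0 table directly from its web-accessible host (URL is hypothetical)
l0_counts <- read.csv(
  "https://pasta.lternet.edu/package/data/eml/edi/123/1/abc123",
  stringsAsFactors = FALSE)
# Explore structure, candidate keys, and temporal/spatial extent
str(l0_counts)
summary(l0_counts)
table(l0_counts$site)   # hypothetical column: observations per site
range(l0_counts$date)   # hypothetical column: temporal extent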
Major issues discovered at this point may indicate that the work required to
convert the L0 dataset to the ecocomDP model is not worthwhile.
Resolving issues
After gaining a sufficient understanding of the L0 dataset, you are
ready to assess, and hopefully resolve, any issues that are obvious from
the start. To help draw out these issues, you may want
to create a high-level plan for combining the L0 tables (e.g. will they be
bound row-wise or joined on shared keys?) and for mapping their columns to the
L1 ecocomDP model. Here are some solutions (ordered by priority) for
resolving issues at this stage in the creation process:
Work with the L0 author/manager to fix the issues - Fixing the
issue here both communicates best practices for future data curation and
immediately improves data quality.
- If the L0 dataset contains a column entirely of NAs, the conversion
script should be written so that the L0 variable will populate the
correct L1 table if it is ever filled in a future revision.
- When the structure of the L0 dataset (column names, codes used for
species, etc.) prevents optimal L1 dataset creation, proceed with
writing the conversion script and then provide feedback to the L0
author/manager. Ensure that your conversion script will have the
functionality to handle revisions to the L0 dataset.
- Make use of the
message()
function to alert ecocomDP
script maintainers to sections of code that could be improved by future
L0 dataset revisions.
Modify L0 components - Modifying L0 components is only permitted
in rare cases. This list highlights the L0 components and specific
scenarios in which you should modify them:
- Duplicate column names - Only change L0 column names if they
create conflicts during joins (i.e. two L0 tables share the same column
name) or if they share a name with an L1 table’s column. In either case,
keep the original column name recognizable: append a prefix or suffix so
that the new name is distinct but can still be traced back to the L0
metadata.
- Non-point locations - Currently, only point coordinates are accepted
by the ecocomDP model; use the centroid of bounding areas if presented
with non-point locations.
- Datetime formats - Conversion of datetimes to the ecocomDP standard
(e.g. YYYY-MM-DD hh:mm:ss, or a subset thereof) is required. Always
combine separate Year, Month, Day, and time columns into a single datetime
at the temporal level of observation (e.g. edi.334.2); see the sketch at
the end of this list.
- Missing value codes - Missing value codes that are explicitly
declared in the L0 metadata should be converted to NA. When not declared
but unambiguously interpretable as a missing value (e.g. “” in a
comments field) then convert to NA. If it is ambiguous, don’t modify the
L0 data.
- Different temporal resolution among the L0 tables creates a
joining/stacking issue. To solve it, assign the coarsest temporal
resolution to the datetime field of all the L0 tables, then store the
more precise datetime variables in the observation_ancillary table
(e.g. edi.291.2).
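A minimal sketch of the datetime and missing value modifications described above; the column names and the missing value code are hypothetical, and flat is the intermediate table being assembled.
library(dplyr)
flat <- flat %>%
  # Combine separate Year, Month, and Day columns (hypothetical names) into a
  # single YYYY-MM-DD datetime at the temporal level of observation
  mutate(datetime = format(as.Date(paste(Year, Month, Day, sep = "-")),
                           "%Y-%m-%d")) %>%
  # Convert the missing value code declared in the L0 metadata (here -9999,
  # hypothetical) to NA
  mutate(count = na_if(count, -9999))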
Omit L0 data - ecocomDP is a flexible model but can’t handle
everything. Convert as much as possible and drop the remainder. If
content is dropped, then describe what and why using comments in the
conversion script. Some guidelines for when you should drop content:
- If it is necessary to drop observations (rows), see the section Omitting rows.
- If it can be derived from the preserved content (e.g. year can be
derived from date, a taxon’s common name can be derived from its species
name), then you can drop it from the L1 dataset if it does not fit the
model (i.e. L1 data tables don’t pass validation).
- If it is ancillary information to data in one of the L1 ancillary
tables that does not have a simple one-to-one, row-wise relationship,
then it is too far removed from the core content of the L1 dataset.
(e.g. A temperature sensor collects environmental data that supports the
core observations. This temperature data belongs in the
observation_ancillary table. If maintenance information for the
temperature sensor is included, it could be omitted from the L1 dataset
because it cannot be linked to the temperature observations within the
model.) However, if the data is so important that it shouldn’t be
dropped, then append a suffix to the supporting data’s original variable
name that references the ancillary data (e.g. ancillary data “Site” has
supporting data “lat” and “lon”. The variables “Site”, “lat_Site” and
“lon_Site” all belong in ancillary data).
- Large surveys, sampling campaigns, and experiments will occasionally
include information that is not central to the community ecology aspect
of the L0 dataset (e.g. edi.115.2,
edi.247.3).
See the section Omitting tables to
determine if or how you should omit entire data tables from the L1
dataset.
If the above options don’t solve the issue, then don’t convert it.
There are many more datasets out there in the world to convert!
Omitting rows
When an L0 dataset is really valuable, but issues with the dataset
(e.g. changes in temporal or spatial resolution across observations; edi.251.2)
prevent conversion, the best option may be to convert a subset of the
observations to the ecocomDP format. Follow these steps for omitting
rows from the L0 dataset:
- Append
message(paste0("This L1 dataset is derived from a version of ", source_id, "with omitted rows."))
below the create_eml()
function call in the
create_ecocomDP()
function definition.
- After omitting rows, there may be residual columns that only applied
to the now-omitted rows. You can remove these columns, but do not remove
columns outside of this context.
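A minimal sketch of these steps inside create_ecocomDP(); the filtering condition and column name are hypothetical.
library(dplyr)
# Keep only the observations that fit the model (condition is hypothetical)
flat <- flat %>%
  filter(sampling_resolution == "plot") %>%
  # Drop a residual column that only applied to the omitted rows (hypothetical)
  select(-transect_notes)
# Placed below the create_eml() call in create_ecocomDP(); source_id is the
# function argument
message(paste0("This L1 dataset is derived from a version of ", source_id,
               " with omitted rows."))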
Omitting tables
You may decide that only a subset of the data tables within an L0
dataset are well-suited to the ecocomDP format. In this case you have
the option to omit entire data tables and only convert those that will
fit the model.
When determining which tables to convert and which to omit, first
identify which table(s) contain the “core” observation information. This
will be the backbone of the intermediate “flat” table. Once the flat
table is instantiated around the core observation, determine the shared
keys by which to join the other L0 tables. The following types of
variables are examples of common keys shared between data tables:
- Time (a column for year, month, day, date, or combination).
- Location (a column for site, plot, transect, or combination).
- Taxonomic information (a species, scientific name, common name, or
similar column).
- Other shared variables.
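A minimal sketch of assembling the flat table around the core observation table; the table and key names are hypothetical.
library(dplyr)
# The core observation table is the backbone of the intermediate flat table
flat <- core_counts %>%
  # Add location attributes via a shared location key
  left_join(sites, by = "site_id") %>%
  # Add taxonomic information via a shared taxon key
  left_join(taxa, by = "species_code") %>%
  # Add survey-level measurements via shared time and location keys
  left_join(events, by = c("site_id", "date"))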
If you encounter tables that can’t be joined to the core observation
information, possibly because they focus on a different
time/location/taxon entirely, omit these problematic tables. Apply the
following changes to the conversion script to highlight the table
omission:
- In the “Join and flatten source data” section of the create_ecocomDP
function definition, write a brief comment that highlights and justifies
the decision to trim the L0 tables
(i.e.
# atmospheric_gas_concentrations table did not share a key with the bird_count table and was omitted
).
- Append
message(paste0("This L1 dataset is derived from a trimmed version of ", source_id, "with omitted tables"))
below the create_eml()
function call in the
create_ecocomDP()
function definition.
Creating the Conversion Script
Write a conversion script to create an ecocomDP dataset from a
standard set of minimal inputs (i.e. arguments to the
create_ecocomDP()
function). The conversion script should
have some tolerance (i.e. error handling) for being re-run at a
later time on a changed source dataset. The script should use
functionality that will either handle a revised source dataset or alert
the ecocomDP script maintainers and provide them with enough information
to quickly diagnose and fix any problems.
Currently, only the EDI Data Repository is recognized by the ecocomDP
project. If you would like support extended to your data repository,
then please place a request in the ecocomDP project issue
tracker and we will add the supporting code to index and read
ecocomDP datasets from your repository.
To convert an L0 dataset, implement the following processes in your
conversion script:
- Join source data and relevant content into a single “flat”
table.
- Use
create_*()
functions to parse out the relevant
derived L1 tables.
- Write tables to file with
write_tables()
- Validate L1 tables with
validate_data()
- Create EML metadata with
create_eml()
For details on the processes within the conversion script, see the Create
vignette.
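Below is a minimal skeleton of a conversion script organized around these steps. It is a sketch, not a template: the bodies are placeholders, and the argument details for the create_*(), write_tables(), validate_data(), and create_eml() functions are dataset specific, so consult the Create vignette for the full workflow.
# create_ecocomDP.R -- conversion script skeleton (a sketch, not a template)
library(dplyr)
library(tidyr)
library(ecocomDP)
create_ecocomDP <- function(path, source_id, derived_id, url) {
  # Read source data --------------------------------------------------------
  # Read the L0 tables programmatically from the repository API
  # Join and flatten source data --------------------------------------------
  # Join the L0 tables on shared keys into a single "flat" table
  # Create the L1 tables -----------------------------------------------------
  # Parse the flat table with the create_*() functions (observation,
  # location, taxon, and any ancillary tables the dataset supports)
  # Write tables to file -----------------------------------------------------
  # write_tables(path = path, ...)
  # Validate tables ----------------------------------------------------------
  # validate_data(path = path)
  # Create EML metadata ------------------------------------------------------
  # create_eml(path = path, source_id = source_id, derived_id = derived_id,
  #            url = url, ...)
}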
Basics
Do’s
- R is the only supported language.
- Every ecocomDP dataset must be fully created from the R script named
“create_ecocomDP.R”; this is the conversion script.
- The only code in the conversion script should be the
library()
calls, the main function definition, and
supporting function definitions outside of the
create_ecocomDP()
function.
- Explicitly state the dependencies of the function with a
library()
call for each
(e.g. library(dplyr)
).
- Use the function name “create_ecocomDP”
(i.e.
create_ecocomDP <- function(...) {...}
) and only
use the allowed set of arguments:
path
- Where the ecocomDP tables will be written
source_id
- Identifier of the source dataset
derived_id
- Identifier of the derived dataset
url
- The URL by which the derived tables and metadata
can be accessed by a data repository. This argument is used when
automating the repository publication step, but not used when manually
publishing.
- Use a regex to refer to tables with names that are likely to change
with future revisions (see the sketch after this list).
- Convert missing value codes to NA. This is the ecocomDP
standard.
- Whenever possible, programmatically insert information from the L0
metadata rather than manually copying and pasting. This facilitates a
low-effort transfer from a revised source dataset to the derived
dataset, enabling automated maintenance.
- Comment your code liberally describing “what” and “why” so that
future maintainers don’t have to spend as much time learning the nuances
of the data.
- Use messaging (i.e.
message()
) and section headers
(ctrl-shift-R in RStudio) at the beginning of each major code block to
help maintainers with debugging, should it be needed.
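A minimal sketch of the regex recommendation above; the entity names and the pattern are hypothetical.
# Data entity names as listed in the L0 metadata (hypothetical)
entity_names <- c("bird_counts_2013-2020.csv", "site_descriptions.csv")
# Match the counts table by pattern so the script still works if a future
# revision renames it (e.g. "bird_counts_2013-2022.csv")
counts_name <- entity_names[grepl("^bird_counts_.*\\.csv$", entity_names)]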
Don’ts
- The rule of thumb is to never change the L0 dataset; only
restructure/reformat:
- See Value-Added Features for
acceptable information to add.
- Do not average or modify values. Users (e.g., synthesis working
groups) will perform these calculations for their own needs.
- Don’t change data types if doing so results in loss of data (e.g. coercing
a column of numbers interspersed with character codes to numeric, which
transforms the codes to NA; edi.115.2).
- Don’t index by position; always index columns by name
(e.g.
data$temp
not data[[3]]
). The order of
columns may change in revised L0 datasets, which will cause problems if
indexed by position but not if indexed by name.
- Don’t drop data from the L0. Even redundant information should not
be removed.
- Never call
rm(list = ls())
in the scripts. This will
clear objects from the global environment that are needed by automated
maintenance routines.
- Don’t call the create_ecocomDP() function from within the conversion
script. Otherwise the automated maintenance routine won’t be able to
control the function call (i.e. input arguments and when it runs).
Resolving Issues in the L1 Tables
Refer to this section to resolve specific issues while creating the
L1 tables. For more in-depth descriptions on these tables and their
columns, see the Model
Overview vignette.
observation
Store the core observations being analyzed in this table.
Observations must be linked to a taxon and to a location. Linking to
ancillary observations is optional.
event_id
- If an L0 dataset contains values corresponding to the survey/event
number, always use these values.
- If the L0 dataset does not contain values corresponding to the
survey/event number, group observations by the Frequency of Survey (see
the Frequency of Survey section for more information) and
programmatically assign an event_id, as sketched below. Note: Always
arrange the intermediate “flat” table chronologically before assigning
event_id; event_ids should correspond chronologically to surveys.
- If the Frequency of Survey cannot be determined accurately from the
L0 metadata or from patterns in the L0 data tables, omit the event_id
column entirely.
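A minimal sketch of programmatic event_id assignment, assuming one survey event per sampling date; the column names are hypothetical.
library(dplyr)
flat <- flat %>%
  # Arrange chronologically so event_ids correspond to survey order
  arrange(datetime) %>%
  # One event_id per sampling date; the grouping variable should match the
  # dataset's actual Frequency of Survey
  mutate(event_id = dense_rank(datetime))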
datetime
- If the L0 data tables do not contain a date or datetime column but
instead have start_date and end_date columns, use the date from the
start_date column to assign datetime in the observation table.
Additionally, store start_date and end_date as-is in the
observation_ancillary table.
taxon_id
- If variables (columns) in L0 data tables are individual species,
gather them (e.g. with
tidyr::pivot_longer()
) into a taxon_name column
and a value column, as sketched below. Manually add the variable_name and
unit columns to describe the measurement, using the column description and
units from the L0 metadata, if applicable.
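A minimal sketch of this gather step; the species column names, the variable description, and the unit are hypothetical.
library(dplyr)
library(tidyr)
flat <- flat %>%
  # Gather per-species columns (hypothetical names) into long format
  pivot_longer(cols = c(daphnia_pulex, bosmina_longirostris),
               names_to = "taxon_name",
               values_to = "value") %>%
  # Describe the measurement using the column description and unit from the
  # L0 metadata
  mutate(variable_name = "density",
         unit = "numberPerLiter")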
variable_name
- Every dataset has a core observation. In certain cases, it is
acceptable to have multiple core observation variables.
- If a derivation of abundance/count (areal density, biomass, etc.) is
provided along with raw abundance/counts, both variables constitute core
observations.
- Core observations (e.g. abundance/counts) that are subdivided into
smaller classified groups constitute their own core observations.
- Example: L0 dataset has organism counts spread across columns
“lessThan2mm”, “2to5mm”, and “greaterThan5mm”. These columns should be
preserved as-is, not aggregated into a single count column. (e.g. edi.248.2,
which is classified by size class and method).
- If core observations are missing associated counts, but each row
clearly represents an observation of a single taxon, manually assign
these three columns:
- A variable_name column populated with the value “count” for every
observation.
- A unit column with the value “number” for every observation.
- A value column with the value 1 for every observation.
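A minimal sketch of the presence-only case described above.
library(dplyr)
# Each row is an observation of a single taxon with no associated count, so
# record it as a count of 1
flat <- flat %>%
  mutate(variable_name = "count",
         unit = "number",
         value = 1)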
location
Store identifying information about a place (longitude, latitude,
elevation) in this table. This table is self-referencing so that sites
can be nested.
location_id
- If an L0 dataset contains id values corresponding to the location,
always opt to use these values (i.e. don’t assign your own).
- If an L0 dataset does not contain location identifiers then assign
them by grouping on the location name at the meaningful level of
observation (see Levels of
Observation section).
- If an L0 dataset does not contain location identifiers or names
corresponding to location at the meaningful level of observation and the
entire study occurs at the same location, assign a location_id of 1 and
a logical location_name derived from the L0 metadata to each
observation.
- If an L0 dataset does not contain location identifiers or names
corresponding to location at the meaningful level of observation,
determine a unique constraint among the sampling location attributes
(e.g. latitude, longitude) and use that to create a sampling
location_id.
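A minimal sketch of assigning location_id by grouping on the location name; the column name is hypothetical.
library(dplyr)
# One location_id per distinct location name at the meaningful level of
# observation; ids follow the order in which names first appear
flat <- flat %>%
  mutate(location_id = match(site_name, unique(site_name)))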
location_name
- If an L0 dataset does not contain names corresponding to location at
the level of observation, a name derived from the L0 metadata should be
assigned to each observation.
latitude and longitude
- Each row of the location table should contain a latitude and
longitude value describing a point location. This information should be
in the L0 dataset (e.g. metadata, data tables, or other entities).
Sometimes there is geographic information at a different spatial scale
than the listed locations. Follow these recommendations for determining
the correct coordinates to provide:
- The L0 dataset only has areas/polygons (instead of point locations)
at a location. In such cases, use the centroid of each area (Figure
2A).
- The L0 dataset only has point locations at a finer spatial scale than
a location. In such cases, use the centroid of the constituent coordinates
(e.g. centroid of transects within a site) (Figure 2B).
- The L0 dataset only has point locations at a coarser level than a
location. In such cases, transpose coarser coordinates to the location
(e.g. each plot within a site would get the site’s coordinates) (Figure
2C).
- The L0 dataset only has an area/polygon at a coarser level than a
location. In such cases, use the centroid of the area/polygon as
coordinates for the location (Figure 2D).
- The L0 dataset has accurate point locations for only some of the
locations. In such cases, match the geographic coverage one-to-one with
the locations that have coordinates, and leave the coordinates and
elevations of the remaining locations as NA; the L1 user may be able to
infer coverage of adjacent locations (Figure 2E).
- Coordinates of location vary with each observation (e.g. specific
coordinates where a plankton tow was done within the boundaries of an
established/static station, edi.338.2)
(Figure 2F).
- First, take the average, across the entire dataset, of all observation
coordinates corresponding to each location_name (see the sketch below).
- Second, in the location_ancillary table, list the exact coordinates
for each date by including the datetime field. If this raises non-unique
composite key issues (from
validate_data()
), then see the
observation_ancillary section.
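A minimal sketch of that first averaging step; the column names are hypothetical.
library(dplyr)
# Average the per-observation coordinates for each named location; these
# become the latitude and longitude reported in the location table, while the
# exact per-date coordinates go to location_ancillary
location_coords <- flat %>%
  group_by(location_name) %>%
  summarize(latitude  = mean(latitude, na.rm = TRUE),
            longitude = mean(longitude, na.rm = TRUE),
            .groups = "drop")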