--- title: "Handout to the R package divDyn v0.8.3 for diversity dynamics using fossil sampling data" author: "Adam T. Kocsis, Carl J. Reddin, Wolfgang Kiessling" date: '`r Sys.Date()`' output: pdf_document vignette: > %\VignetteIndexEntry{Handout to the R packare divDyn v0.8.3 for diversity dynamics using fossil sampling data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) library(knitr) library(divDyn) ``` # 1. Introduction The purpose of this vignette is to guide users through the basic capabilities of the ‘divDyn’ package. Fossil occurrence databases, such as the Paleobiology Database (PaleoDB, https://paleobiodb.org/) are readily available to be used in analyses of diversity, extinction and origination patterns (the dynamics of biodiversity), with a certain toolkit that has become standard since the creation of the database. Until now, the implementation of most of these tools have been the responsibilities of individual researchers, with no software package to rely on. This R package intends to fill this gap. ## 1.1. Installation To install this beta version of the package, you must download it either from the CRAN servers or its dedicated GitHub repository (https://github.com/divDyn/r-package/). All minor updates will be posted on GitHub as soon as they are finished, so please check this regularly. The version on CRAN will be lagging for some time, as it takes the servers many days to process everything and updates are expected to be frequent. All questions should be addressed to Adam Kocsis, the creator and maintainer of the package (adam.kocsis@fau.de). Instead of spending it on actual research, a tremendous amount of time was invested in making this piece of software useable and user-friendly. If you use a method implemented in the package in a publication, please cite both its reference(s) and the ‘divDyn’ package itself (Kocsis et al. 2019). # 2. Necessary Data Most functionality in the ‘divDyn’ package assumes that the time dimension is broken down to discrete intervals. Accordingly, most functions are built on two fundamental data structures: a time scale table and an occurrence dataset. ## 2.1. Time scales The workflow presented here is based on the discretization of geological time, which is constrained by stratigraphy. These intervals of time (bins) represent the basic units of the analysis, and their sequence is coded in the time scale table. Even if we develop a geological model that outputs robust estimates in a continuous time axis, the calculation of metrics presented in the package will require discretization. We added implementations of the basic functionalities for continuous time (chapter ‘4.3. Slicing’) as well, but we do not deem it as reliable as using stratigraphic bins for million-year-scale, deep-time analyses. As age estimates are dependent on the different geological ‘time scales’, binning the data can change more than necessary, which can have random effects on the resulting series. In order to demonstrate the workflow of binned analyses, we added an example table to the package. This table represents a somewhat altered form (see below) of the stage-level geological time scale of Gradstein et al. (2020). You can attach this table using the ``data()`` function. ```{r package, echo= TRUE, eval=FALSE} library(divDyn) ``` ```{r dat, echo= TRUE} # attach the time scale object data(stages) ``` Every row in this table represents a bin in the timescale. The most important variable in this table is the slice number (in this case ``num``). This variable links every occurrence to one of the bins. You can gather additional information by typing ``?stages`` to the console. You can visualize the timescale by using the ``plots()`` function included in the package: ```{r tsplot1, echo= TRUE, plot=TRUE, fig.height=5.5, fig.width=7} tsplot(stages, boxes="sys", shading= "series") ``` For easier navigation in the plot, the time dimension can be indicated with three variables: the radiometric dates that serve as coordinates; boxes of intervals under lowest ``ylim`` value of the plot; and vertical shades over the plotting area. The time scale to be plotted can be altered by changing the values of the main argument ``tsdat`` and by providing the appropriate column names for the boxes and shading arguments. In order to use the system (period) names as labels and the stages as shades, just change the function input accordingly (the ``xlim`` values will limit the x axis plot): ```{r tsplot2, echo= TRUE, plot=TRUE, fig.height=5.5, fig.width=7} tsplot(stages, boxes="system", shading="stage", xlim=59:81) ``` The function was designed to enable the highest level of customization. You can customize the distribution of plotting area with the ``xlim`` (accepts exact ages and sequences of bins, see the examples), ``ylim``, ``prop`` and ``gap`` arguments, and the color of the shading. You can also customize the characteristics of the general plotting (calling ``plot()``, the boxes of time slices (calling ``rect()``) and the labels within them (calling the ``text()`` function). You can directly control the arguments of these functions that ``tsplot()`` uses to draw the elements of the timescale by adding the additional arguments as lists to the ``plot.args``, ``boxes.args`` and ``labels.args`` arguments. For instance, if you want your boxes to feature red italic fonts as labels, just add col="red", and font=3 the way you would regularly use them with the text() function, but wrap them up in a list, and assign it to the ‘labels.args’ argument: ```{r tsplot3, echo= TRUE, plot=TRUE, fig.height=5.5, fig.width=7} tsplot(stages, boxes="sys", shading="series", labels.args=list(col="red", font=3), shading.col=c("white", "wheat")) ``` Although you can set it manually with the ``boxes.args`` argument, the time scale plotting function also features colored plotting of the boxes, via the argument ``boxes.col``. The ``stages`` object contains hexadecimal RGB codes for the system (period), series (epoch) and stage (age) entries of the ICS table of stratigraphy in the ``systemCol``, ``seriesCol`` and ``col`` columns, respectively. ```{r tsplot4, echo= TRUE, plot=TRUE, fig.height=5.5, fig.width=7} tsplot(stages, boxes=c("sys"), shading="sys", boxes.col="systemCol") ``` Plotting of multiple levels as boxes is also supported now. Use vectors of column names if you want to add these. ```{r tsplot5, echo= TRUE, plot=TRUE, fig.height=5.5, fig.width=7} tsplot(stages, boxes=c("short","system"), shading="short", xlim=59:69, boxes.col=c("col","systemCol"), labels.args=list(cex=0.5)) ``` Accordingly, the ``boxes.args`` and ``labels.args`` accept hierarchical input that corresponds to the levels of stratigraphic hierarchy. ```{r tsplot6, echo= TRUE, plot=TRUE, fig.height=5.5, fig.width=7} tsplot(stages, boxes=c("short","system"), shading="short", xlim=59:69, boxes.col=c("col","systemCol"), labels.args=list(list(cex=0.5),list(cex=1)), boxes.args=list(list(),list(density=80))) ``` Naturally, you can use other time scale plotting packages such as the ‘geoscale’ package developed by J. Bell (2015). These work great for Phanerozoic scale analyses, but be sure to check the compatibility of the time scale you use for the binning and the time scale used for plotting. As we have experience with problems that stemmed from the incompatibility of analyzing and plotting data, we recommend using your own timescale for plotting. ## 2.2. Occurrence table The occurrence tables contain unary information about the presence of a taxon at a specified locality (can be global). In these tables, each occurrence is represented by a row which has to include a name of the taxon. This data format is similar to the following example: ```{r examp, echo= FALSE, results=TRUE} structure <- data.frame( tax= c("Sp1", "Sp1", "Sp1", "Sp2","Sp3", "Sp2"), bin= c(1,1,2, 2,2,3), locality = c("first", "second", "second", "second", "first", "second")) structure ``` The functions of the package will have to be pointed to this column, by specifying its name in the ``tax`` argument of the function in question. Additional variables can be added that contain specific information about the time and locality of the occurrence, as well as other variables that help with grouping the individual entries (taxon/collection information). The utility of this long format is in its unbounded nature, with the acquisition of newer data points the time and spatial coverage of the dataset can extend without problems. ### 2.2.1. Stratigraphic assignment Most functions rely on processes that subset the data to contain occurrences that represent the same time interval. This column can be specified with setting the ``bin`` argument accordingly. However, to get this column, a number of processes has to be run on the raw data. Although the package already incorporates functions and data to assign downloaded occurrences to stratigraphic bins, those are illustrated in a separate vignette (https://github.com/divDyn/ddPhanero). ### 2.2.2. The example file In order to demonstrate most capabilities of the package we have added a fossil occurrence table of Scleractinian corals that we used in an earlier study (Kiessling and Kocsis, 2015). This subset was downloaded from the PaleoDB and was extended with information on inferred photosymbiotic status, growth types, degree of integration, ecological environment, inferred depth, substrate lithology and latitudinal groups. Additional details are available by typing ``?corals`` to the console. This dataset is embedded in the package and can be attached using the data() function: ```{r corDat, echo= TRUE} data(corals) ``` ```{r tibbleOrNot, echo= FALSE, eval=FALSE} library(tibble) corals <- as_tibble(corals) ``` This dataset was resolved to the 10 my timescale of the PaleoDB (``ten``, now only available through FossilWorks) and the stage-level time scale (variable ``stg``) that is included in the package (see Section 2.1, ``stages``). This latter is the basis of all inference and plotting. The values of the ``stg`` column of the coral table refers to entries in the ``num`` column of the ``stages`` table. Please note that this dataset does not include Holocene occurrences. The occurrences designated with ``stg==95`` are just single entries that include extant genera; therefore all other entries of this subset are missing, except for the variables linked to the taxa. The rest of the occurrences represent actual fossils. ```{r fossils, echo= TRUE, results=TRUE} fossils <- corals[corals$stg!=95,] # the number of occurrences nrow(fossils) ``` # 3. Basic patterns ## 3.1. Ranges We can gain preliminary knowledge by examining the basic patterns of the stratigraphic ranges . Probably the most apparent of these are the stratigraphic ranges of the taxa, which can be easily summarized in the FAD-LAD matrix: ```{r fadlad, echo= TRUE, results=FALSE } fl <- fadlad(fossils, bin="stg", tax="genus") ``` You can also use the occurrences' own age estimate to calculate the ranges, just type in ?fadlad to see some more examples. The ranges of fossil taxa are the primary data quality feedback we can have from the massive amount of fossil occurrences. You can easilly visualize these with the `ranges()` function. To keep things simple, just assign the age (stage) mean age to the occurrences: ```{r ranplot, echo= TRUE, plot=TRUE, fig.height=5.5, fig.width=7} fossils$mid <- stages$mid[fossils$stg] tsplot(stages, shading="series", boxes="sys",xlim=c(260,0)) ranges(fossils, tax="genus", bin="mid") ``` The function automatically selects the taxa that have ranges that fall into the `xlim` values of the open device (you can suppress this if you want to). If you zoom in with the main plotting function, you can see this effect. You can also add the taxon labels by setting the `labs` argument to `TRUE`. ```{r ranplot2, echo= TRUE, plot=TRUE, fig.height=5.5, fig.width=7} tsplot(stages, shading="series", boxes="series",xlim=c(100,81)) ranges(fossils, tax="genus", bin="mid", labs=T, labels.args=list(cex=0.2)) ``` As you can see from the example above (`labs.args` argument), the argumentation of this function works in a similar way to the that of the `tsplot()` function. You can get an even more detailed look if you set the `filt` argument to `"orig"`. This will limit the displayed taxa to those that originated within the interval. `occs=TRUE` will also plot the sampled occurrences on the ranges. ```{r ranplot3, echo= TRUE, plot=TRUE, fig.height=5.5, fig.width=7} tsplot(stages, shading="stage", boxes=c("stage","series"),xlim=c(100,80)) ranges(fossils, tax="genus", bin="mid", labs=T, labels.args=list(cex=0.6), filt="orig", occs=T) ``` This function can also plot the taxa by groups. Here is the same plot, but by separating the taxa based on symbiotic status: ```{r ranplot4, echo= TRUE, plot=TRUE, fig.height=5.5, fig.width=7} tsplot(stages, shading="stage", boxes=c("stage","series"),xlim=c(100,80)) ranges(fossils, tax="genus", bin="mid", labs=T, labels.args=list(cex=0.6), filt="orig", occs=T, group="ecology") ``` You can also plot survivorship curves. The function ``survivors()`` calculates the proportions of survivors from every bin to all the remaining bins. ```{r surv, echo= TRUE, results=FALSE} surv <- survivors(corals, bin="stg", tax="genus") ``` You can plot these values for every cohort to get survivorship curves. As these curves can be thought of as products of an exponential decay, it is customary to log the y axis, which you can do with adding ``log="y"`` to the arguments of the main ``plot()`` function in ``plot.args``: ```{r survTsplotblank, echo=TRUE, plot=TRUE, fig.height=5.5, fig.width=7} # time scale plot tsplot(stages, shading="series", boxes="sys", xlim=c(260,0), ylab="proportion of survivors present", ylim=c(0.01,1),plot.args=list(log="y")) ``` Then the curves can be plotted for every column with ```{r survTsplotShow, echo=TRUE, eval=FALSE} # lines for every cohort for(i in 1:ncol(surv)) lines(stages$mid, surv[,i]) ``` ```{r survTsplot1, echo=FALSE, plot=TRUE, fig.height=5.5, fig.width=7} # time scale plot tsplot(stages, shading="series", boxes="sys", xlim=c(260,0), ylab="proportion of survivors present", ylim=c(0.01,1),plot.args=list(log="y")) # lines for every cohort for(i in 1:ncol(surv)) lines(stages$mid, surv[,i]) ``` The values shown here are also called ‘forward’ survivorship proportions, but you can also plot the ‘backward’ survivorships to see how certain cohorts emerge over geologic time, by setting the ``method`` argument of the survivors() function appropriately: ```{r survTsplot2, echo=FALSE, plot=TRUE, fig.height=5.5, fig.width=7} survFor<-survivors(corals, tax="genus", bin="stg", method="backward") # plot tsplot(stages, shading="series", boxes="sys", xlim=c(260,0), ylab="proportion of constituents present", ylim=c(0.001,1),plot.args=list(log="y")) # for every cohort for(i in 1:ncol(surv)) lines(stages$mid, survFor [,i]) ``` Large synchronous changes in these curves represent times where major changes happened in the history of the group. Major extinctions are apparent on the forward survivorship, whilst major origination episodes show up on the backward survivorship curves. However, ways that are more robust exist to quantify the factors that influence diversity. ## 3.2. Sampling parameters ### 3.2.1. Basic descriptors The patterns of the data are constrained by sampling processes. These can have a direct influence on the patterns of diversity dynamics and therefore should be taken into consideration when the conclusions are drawn from the data. The ``sumstat()`` function calculates basic sampling metrics that characterize the entire dataset. ```{r sampTOT, echo=TRUE, result=TRUE} samp <-sumstat(fossils, tax="genus", bin="stg", coll="collection_no", ref="reference_no", duplicates=FALSE) samp ``` This includes the total number of occurrences, collections, references and statistics of gappiness. You can also calculate most of these basic sampling metrics for every bin with the ``binstat()`` function. ```{r binstat, echo=TRUE, result=FALSE} samp <-binstat(fossils, tax="genus", bin="stg", coll="collection_no", ref="reference_no") ``` The message above indicates that multiple entries of the same genus are present in the same collections. As this is a species-level occurrence dataset, this is understandable. By default, these entries are omitted from the calculations above, but you can toggle this manually by setting the ``duplicates`` argument of binstat() accordingly (``TRUE`` will keep the duplicates, ``FALSE`` will omit them – without notification). The function calculates the basic bin-wise statistics that do not require multi-bin pattern recognition (for instance for Alroy’s (2008) three-timer sampling completeness, which is output by the ``divDyn()`` function). Optionally, additional metrics implemented in the ``indices()`` function can be applied to information in every bin, by adding ``indices=TRUE`` to the function call. ```{r binstatInd, echo=TRUE, result=FALSE} samp <-binstat(fossils, tax="genus", bin="stg", coll="collection_no", ref="reference_no", indices=TRUE) colnames(samp) ``` All results of the output of this function can be plotted then in a straightforward way, by referring to the elements of the data frame. For instance, the ``occs`` element contains the number of sampled occurrences and ``coll`` refers to the number of collections: ```{r occsAndColls, echo=TRUE, result=FALSE, plot=TRUE, fig.height=5.5, fig.width=7} oldPar <- par(mar=c(4,4,2,4)) # basic plot tsplot(stages, shading="series", boxes="sys", xlim=52:95, ylab="number of occurrences", ylim=c(0,3000)) lines(stages$mid[1:94], samp$occs) # the collections (rescaled, other axis) lines(stages$mid[1:94], samp$colls*5, col="blue") axis(4, col="blue",col.ticks="blue",col.axis="blue", at=seq(0,3000,500), labels=seq(0,600,100)) mtext(4, text="number of collections", col="blue", line=2) par(oldPar) ``` The ``binstat()`` function also calculates other basic statistics such as the numbers of references, sampled taxa, single-collection taxa, single-reference taxa, double collection and double reference taxa along with the sampling coverage estimator Good’s *u* (1953), the coverage estimator suggested by Alroy (*u’ *, 2010) that is based on the number of single-reference taxa. ### 3.2.2. Plotting of counts and proportions You can trace the trajectory of the occurrences that have different attributes through the bins by using the ``parts()`` function. This requires only two arguments: vector of bin identifiers and a vector that contains the categories entries. The bin identifier also determines the coordinates along the independent variable (time), so the numerical bin entries have to be replaced by the age estimates. ```{r parts, echo=TRUE, result=FALSE, plot=TRUE, fig.height=5.5, fig.width=7} # numerical ages, as bins fossils$stgMid <- stages$mid[fossils$stg] #plotting tsplot(stages, shading="series", boxes="sys", xlim=52:95, ylab="number of occurences", ylim=c(0,3000)) parts(fossils$stgMid, fossils$bath) ``` If you check out ``?parts``, there will be additional examples using artificial data. In the default case, the category names are plotted where they are the most abundant. This plot can be even nicer if you use opacity (RGBA values for colors) and adding a proper legend. ```{r parts2, echo=TRUE, result=FALSE, plot=TRUE, fig.height=5.5, fig.width=7} cols <- c("#FF0000AA", "#00FF00AA", "#0000FFAA") # reorder too reord <- c("shal","deep","uk") plotnames <-c("shallow", "deep", "unknown") tsplot(stages, shading="series", boxes="sys", xlim=52:95, ylab="the number of occurrences", ylim=c(0,3000)) parts(fossils$stgMid, fossils$bath, col=cols, ord=reord, labs=F) legend("topleft", inset=c(0.01, 0.01), legend= plotnames, fill=cols, bg="white") ``` The dominance of shallow occurrences are even more apparent with proportions. You can use the ``parts`` function to plot these, rather than the counts, by adding ``prop=T`` to the function call: ```{r parts3, echo=TRUE, result=FALSE, plot=TRUE, fig.height=5.5, fig.width=7} tsplot(stages, shading="series", boxes="sys", xlim=52:95, ylab="proportion of occurrences", ylim=c(0,1)) parts(fossils$stgMid, fossils$bath, prop=T, col=cols, ord=reord, labs=F) legend("bottomleft", inset=c(0.01, 0.1), legend= plotnames, fill=cols, bg="white") ``` ### 3.2.3. Single-collection and single reference taxa Single-interval taxa were thought to be byproducts of temporary increases of sampling intensity (Foote and Raup, 1996) in range-based datasets such as Sepkoski’s compendium (2002). This phenomenon urged researchers to develop metrics that ignore these taxa in calculating metrics of biodiversity and evolution. Single-collection taxa can be quickly omitted from the datasets by using the ``omit()`` function. This will return a logical vector, indicating the rows that should be omitted. The ``om`` argument can be set either to omit the single-collection occurrences (``om="coll"``). Single-reference taxa can be omitted in the same way, although the term needs additional clarification if multiple bins exist. Taxa that were only described in a single reference (``om="ref"``). Some taxa appear in multiple bins and in each of them they are described by only one reference. These you can omit with ``om="binref"``. ```{r omit, echo=TRUE, result=TRUE} omitColl <- omit(corals, tax="genus", om="coll", coll="collection_no") omitRef <- omit(corals, tax="genus", om="ref", ref="reference_no") omitBinref <- omit(corals, bin="stg", tax="genus", om="binref", ref="reference_no") # the conserved number of occurrences will be sum(!omitColl) sum(!omitRef) sum(!omitBinref) ``` # 4. Raw diversity dynamics The main calculations of the package are contained in the ``divDyn()`` function. This function calculates patterns of taxon occurrences and stratigraphic ranges to derive estimates over time for richness, origination/extinction rates and sampling probabilities. In order to calculate these you only need an occurrence table with variables including the taxon names (``tax``) and the time identifiers (``bin``). The rest of the information in the table will not be used by the function. ```{r ddFirst, echo=TRUE, result=FALSE} ddFirst<-divDyn(corals, bin="stg", tax="genus", noNAStart=TRUE) ``` The output of the ``divDyn()`` function is a ``data.frame`` class object that contains values for each bins in rows. The first column (``bin`` variable) contains the bin identifiers. The rest of the variables are explained in the documentation. ## 4.1. Diversity (taxonomic richness) with discrete time input You read this because you have interests in the changes of diversity. Calculating time series of diversity does not require additional action than running the basic ``divDyn()`` function. ### 4.1.1.Taxon counts All calculations in this function depend on patterns of occurrences in the bin/taxon matrix. The numbers of taxa that belong to different categories form the basis of the metrics. These categories are explained by Foote (1999) and Alroy (2014) and are available in the help file ``?divDyn``. ### 4.1.2. Different metrics of richness Calculating time series of diversity does not require additional action than running the basic ``divDyn()`` function. In this package, among the traditional range-based methods the range-through (variable ``divRT``) and the boundary-crosser (``divBC``) diversities are implemented. Plotting can be also facilitated by matching the indices of the diversity values in the variables and their numbers. If the bin numbers are positive integers this is turned on by setting the ``noNAStart`` argument to ``FALSE``, which is the default. This change results in ``NA``s and zeros in the final table in rows, which bins are not sampled. ```{r ddRec, echo=TRUE, result=FALSE, plot=TRUE, fig.height=5.5, fig.width=7} # metrics ddRec <-divDyn(corals, bin="stg", tax="genus") # basic plot tsplot(stages, shading="series", boxes="sys", xlim=52:95, ylab="range-through richness (diversity)", ylim=c(0,250)) # lines lines(stages$mid, ddRec$divRT, col="black", lwd=2) ``` This plot shows the trajectory of coral diversity over the post-Permian interval, using raw data and the range-through diversity metric. Besides the 0 values at the start of the time series (no sampled corals), you might also notice a sharp increase of richness in the latest intervals. This could be the result of an effect called the ‘Pull of the Recent’. The ‘Pull of the Recent’ (Raup, 1979) is a smearing phenomenon that arises when range-based methods are calculated from datasets where sampling probability changes abruptly. As we know the Recent much better in comparison to other intervals (sampling probability is much higher, around one), the number of ranges that connect first occurrences of taxa to the extant time interval (``stg == 95``) is much higher than those that link two non-Recent time intervals. The results in spur of diversity as the Recent is approached. On the other hand, omitting the recent interval (effectively decreasing sampling probability to 0) leads to edge effects that result in the opposite phenomenon. The discrepancy between these two approaches can be visualized by the omission of the recent ‘occurrences’: ```{r ddFossils, echo=TRUE, result=FALSE, plot=TRUE, fig.height=5.5, fig.width=7} # metrics dd <-divDyn(fossils, bin="stg", tax="genus") # basic plot tsplot(stages, shading="series", boxes="sys", xlim=52:95, ylab="richness (diversity)", ylim=c(0,250)) # lines lines(stages$mid, ddRec$divRT, col="black", lwd=2) lines(stages$mid[1:94], dd$divRT, col="blue", lwd=2) # legend legend("topleft", legend=c("with recent", "without recent"), col=c("black", "blue"), lwd=c(2,2), bg="white") ``` Over the last couple of decades, a number of metrics were proposed to efficiently express diversity given such distorting effects of incomplete sampling and the discretized time dimension. As these more straightforward approaches that use range interpolations have known issues, occurrence datasets, allow the calculation of more direct estimators, such as sampled-in-bin (SIB, variable ``divSIB``) diversities. ```{r ddDiv, echo=TRUE, result=FALSE, plot=TRUE, fig.height=5.5, fig.width=7} # metrics dd <-divDyn(fossils, bin="stg", tax="genus") # basic plot tsplot(stages, shading="series", boxes="sys", xlim=52:95, ylab="richness (diversity)", ylim=c(0,250)) # lines lines(stages$mid[1:94], dd$divRT, col="red", lwd=2) lines(stages$mid[1:94], dd$divBC, col="blue", lwd=2) lines(stages$mid[1:94], dd$divSIB, col="green", lwd=2) # legend legend("topleft", legend=c("RT", "BC", "SIB"), col=c("red", "blue", "green"), lwd=c(2,2,2), bg="white") ``` Although range-interpolation does not bias the ‘SIB’ metric, it is more affected by changes of sampling intensity sampling. The three-timer sampling completeness is an effective expression of changes in sampling (``samp3t`` variable): ```{r samp3t, echo=TRUE, result=FALSE, plot=TRUE, fig.height=5.5, fig.width=7} # basic plot tsplot(stages, shading="series", boxes="sys", xlim=52:95, ylab="three-timer sampling completeness") # lines lines(stages$mid[1:94], dd$samp3t, col="black", lwd=2) ``` The SIB series can be partially corrected by the three-timer sampling-completeness (Alroy §ref ). Although this can be a convenient correction it also increases the estimation error. Nevertheless, this is the least biased estimator for diversity. ```{r ddCorr, echo=TRUE, result=FALSE, plot=TRUE, fig.height=5.5, fig.width=7} # basic plot tsplot(stages, shading="series", boxes="sys", xlim=52:95, ylab="three-timer sampling completeness", ylim=c(0,250)) lines(stages$mid[1:94], dd$divSIB, col="black", lwd=2) lines(stages$mid[1:94], dd$divCSIB, col="blue", lwd=2) # legend legend("topleft", legend=c("SIB", "corrected SIB" ), col=c("black", "blue"), lwd=c(2,2), bg="white") ``` The ``omit()`` function is also embedded in the ``divDyn()`` function, so that the metrics are calculated after the omission of poorly sampled taxa (switched off by default). You can add this filter, by adding (the appropriate ``om`` arguments to the ``divDyn()`` function call. ## 4.2. Taxonomic rates with discrete time input Analyzing time series of originations and extinctions helps us to describe not just how diversity changed over time, by also why it changed. With taxonomic rates we can decompose changes in diversity to the relative contribution of cladogenetic processes (i.e. origination, the birth of a new taxon) and that of the disappearance of taxa (i.e. extinction, the death of a taxon). In the discrete time model, the most straightforward way to express these contributions is through simple exponential decay models (Raup, 1985). For convenience, these variables are named as ``extTYPE`` and ``oriTYPE`` for extinction and origination rates, respectively. ### 4.2.1. Different metrics of turnover In the package, the proportional rates (``extProp`` and ``oriProp``) and per capita rates (``extPC`` and ``oriPC``) of Foote (1999), the three-timer rates (``ext3t`` and ``ori3t``) of Alroy (2008), their corrected counterparts (``extC3t`` and ``oriC3t``), the gap-filler rates (``extGF`` and ``oriGF``) of Alroy (2014) and the improved second-for-third (``ext2f3`` and ``ori2f3``) of Alroy (2015) are provided. ```{r ddExt, echo=TRUE, result=FALSE, plot=TRUE, fig.height=5.5, fig.width=7} # basic plot tsplot(stages, shading="series", boxes="sys", xlim=52:95, ylab="extinction rates", ylim=c(0,2)) lines(stages$mid[1:94], dd$extPC, col="black", lwd=2) lines(stages$mid[1:94], dd$extGF, col="blue", lwd=2) lines(stages$mid[1:94], dd$ext2f3, col="green", lwd=2) # legend legend("topright", legend=c("per capita", "gap-filler", "second-for-third"), col=c("black", "blue", "green"), lwd=c(2,2,2), bg="white") ``` The documentation of the function (type in console ?divDyn) contains the formulas for the calculation of the rates. With heterogeneous, incomplete sampling the metrics have different properties that can be summarized as follows: * Proportional ‘rates’: These metrics express what proportion of the cohort of taxa disappears until the new interval. * The per capita rates of Foote (1999) use the range-through assumption to establish ranges for the taxa in the dataset. The rate value expresses what proportion of the taxa decayed until the end of the interval. The method is biased by the Signor-Lipps effect and edge effects (Foote 2000). * The ‘three-timer’ and ‘corrected three-timer’ rates (Alroy, 2008) are different estimators of the per capita rates but will converge on them when sampling tends to completeness. They use moving windows around the focal interval to select data that is more relevant to the focal interval. This metric is unbiased by the Signor-Lipps and edge effects. The three-timer sampling completeness can be used to correct this metric, but it will not improve its susceptibility to random sampling error. * The ‘gap-filler’ rates (Alroy, 2014): gap-filler extinction rates are improved versions of the three-timer rates, using four-bin moving windows instead of three. This makes the rates less resistant to random error (more taxa are recognized by the categorizing procedure) and are an improvement on the three-timer rates. The method sometimes results in the seemingly nonsensical negative extinction rates, when the true extinction rates are near 0. * The ‘second-for-third’ rates (Alroy, 2015) are further improvement of ideas the three-timer and gap-filler rates represent. By adding an algorithmic approach the frequency of negative extinction rates are further decreased. Extinction and origination proportions described by Alroy (2015) can also be used (``E2f3`` and ``O2f3``, respectively). ### 4.2.2. Selectivity testing for the per capita rates Although extinction and origination rates can provide information about events and background processes on their own, sometimes it is more useful to assess them in comparison between taxa. The most frequent way of doing this is by splitting a group to subsets and comparing the patterns of turnover in certain intervals between the two groups. In the case of selectivity testing, an ecologically, taxonomically important grouping variable indicates that one of the groups has a higher turnover at a specific point in time. For the sake of convenience in hypothesis testing for selectivity for one state of the variable (e.g. heavily calcified), the available methods assess selectivity between two groups. The general problem of extinction selectivity is the error estimation of the rate values. More data (e.g. more taxa) will lead to more reliable estimates while, splitting the same dataset to two subsets will lead to higher uncertainty in the rate estimates for the two rate values in the selected interval. The selectivity testing incorporates two aspects: (1) the difference between the two rate values, (2) and whether this distance is meaningful or not. In practice, these two criteria can be summarized by whether it is more supported by the data to describe two rate values, or is it better founded to describe just one. These two alternatives are sometimes called as *single-rate model* and the *dual-rate model* (Kiessling and Simpson, 2011; Kiessling and Kocsis, 2015). In the case that a dual rate model is better supported, that means that the difference between the two values of rates is statistically meaningful, and the extinction or origination event selectively affected the group with the higher value. For example, here are the raw origination rates of corals, plotted for the *az* and the *z* group separately and together: ```{r selPrim, echo=TRUE, results=TRUE, plot=TRUE, fig.height=5.5, fig.width=7} # split by ecology (az = azooxanthellate, z = zooxantehllate) z<- fossils[fossils$ecology=="z",] az<- fossils[fossils$ecology=="az",] # calculate diversity dynamics ddZ<-divDyn(z, tax="genus", bin="stg") ddAZ<-divDyn(az, tax="genus", bin="stg") # origination rate plot tsplot(stages, boxes="sys", shading="series", xlim=54:95, ylab="raw per capita originations") lines(stages$mid[1:94], dd$oriPC, lwd=2, lty=1, col="black") lines(stages$mid[1:94], ddZ$oriPC, lwd=2, lty=1, col="blue") lines(stages$mid[1:94], ddAZ$oriPC, lwd=2, lty=2, col="red") legend("topright", inset=c(0.1,0.1), legend=c("total", "z", "az"), lwd=2, lty=c(1,1,2), col=c("black", "blue", "red"), bg="white") ``` The question that arises in this case is whether some intervals can be characterized by actually higher origination rates (for *az* taxa in the Cretaceous), or is this just a sampling artifact. The testing framework is implemented currently only for the per capita extinction rates of Foote (1999) in the ``ratesplit`` function. The function will take the occurrence dataset as an argument and will calculate the rates values similarly to the ``divDyn`` function. The function requires a separator variable for the selection testing (``sel``) that will be used for splitting the dataset into two (this column has to have two possible states i.e. binary code). The implementation of the process for the symbiotic status is: ```{r selFirst, echo=TRUE, results=TRUE, plot=TRUE, fig.height=5.5, fig.width=7} rs<-ratesplit(fossils, sel="ecology", tax="genus", bin="stg") rs ``` The default output of this function by default is a list of two vectors: ``ext`` and ``ori``, extinction rates and origination rates, respectively. Each vector contains the bin numbers where the dual rate model is better supported than the single rate model, where selectivity is plausible. In the example above, selectivity for the extinctions is unlikely and it is supported for bin 57 (Norian), 59 (Hettangian), 75 (Albian) and 81 (Maastrichtian). The statistical testing in this case is based on the number of taxa and can be performed in two different ways, set with the ``method`` argument. The less stringent ``binom`` method implements binomial testing, with a significance value set with the ``alpha`` parameter that defaults to 0.05. ```{r sel2, echo=TRUE, results=TRUE, plot=TRUE, fig.height=5.5, fig.width=7} rsBin95<-ratesplit(fossils, sel="ecology", tax="genus", bin="stg", method="binom") rsBin95 rsBin90<-ratesplit(fossils, sel="ecology", tax="genus", bin="stg", method="binom", alpha=0.1) rsBin90 ``` The more conservative approach is to use model selection criteria for the testing of support for the *single * or *dual rate models*, which involves calculating Akaike Information Criteria and then Akaike weights. This is the default method (``method=AIC``). The ``alpha`` argument in this case depicts the minimum Akaike weight that serves as a threshold for the *dual rate model* to be supported. As the ratio of these weights represent likelihood ratios, by default it is set to 0.89 that roughly represents 8 times higher likelihood for the *dual rate model*. This can be toggled the way it is deemed useful for the question at hand. ```{r sel3, echo=TRUE, results=TRUE, plot=TRUE, fig.height=5.5, fig.width=7} rsAIC0.5<-ratesplit(fossils, sel="ecology", tax="genus", bin="stg", alpha=0.5) rsAIC0.5 ``` In the case above, the 0.5 ``alpha`` value indicates that the dual rate model should be supported if it is more likely than the single rate model. This output, listing supported intervals, is useful for plotting the selectivity values that could be easily visualized by dots. You can add these to the plot of *z* and *az* coral origination rates by running: ```{r selDot, echo=TRUE, eval=FALSE, plot=FALSE} # select rate values in the intervals of selectivity selIntervals<-cbind(ddZ$oriPC, ddAZ$oriPC)[rs$ori,] # which is higher? TRUE: AZ, FALSE: Z groupSelector<-apply(selIntervals, 1, function(x) x[1]