\documentclass[a4paper]{article} \usepackage{a4wide} \setlength{\parskip}{0.7ex plus0.1ex minus0.1ex} \setlength{\parindent}{0em} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% packages for paper %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \usepackage[latin1]{inputenc} \usepackage[round]{natbib} \bibliographystyle{abbrvnat} \usepackage{tabularx} \usepackage{appendix} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% declarations for vignette %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% instead of \usepackage{Sweave} \RequirePackage[T1]{fontenc} \RequirePackage{graphicx,ae,fancyvrb} \IfFileExists{upquote.sty}{\RequirePackage{upquote}}{} \setkeys{Gin}{width=0.8\textwidth} \DefineVerbatimEnvironment{Sinput}{Verbatim}{fontshape=sl} \DefineVerbatimEnvironment{Soutput}{Verbatim}{} \DefineVerbatimEnvironment{Scode}{Verbatim}{fontshape=sl} \newenvironment{Schunk}{}{} %% Set PDF 1.5 and compression, including object compression %% Needed for MiKTeX -- most other distributions default to this \ifx\pdfoutput\undefined \else \ifx\pdfoutput\relax \else \ifnum\pdfoutput>0 % PDF output \pdfminorversion=5 \pdfcompresslevel=9 \pdfobjcompresslevel=2 \fi \fi \fi %% special latex commands \makeatletter \newcommand\code{\bgroup\@makeother\_\@makeother\~\@makeother\$\@codex} \def\@codex#1{{\normalfont\ttfamily\hyphenchar\font=-1 #1}\egroup} \makeatother \let\proglang=\textsf \newcommand{\pkg}[1]{{\fontseries{b}\selectfont #1}} \newcommand{\email}[1]{\href{mailto:#1}{\normalfont\texttt{#1}}} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \author{Elizabeth A. Freeman, Tracey S. Frescino, Gretchen G. Moisen} \title{\pkg{ModelMap}: an \proglang{R} Package for Model Creation and Map Production} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{document} \SweaveOpts{engine=R} %\VignetteIndexEntry{ModelMap: an R package for Model Creation and Map Production} %\VignetteDepends{randomForest, PresenceAbsence, raster} %\VignetteKeywords{species distribution models, random forest, stochastic gradient boosting, map production, raster, R} %\VignettePackage{ModelMap} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \maketitle %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{abstract} The \pkg{ModelMap} package \citep{ModelMap} for \proglang{R} \citep{R} enables user-friendly modeling, validation, and mapping over large geographic areas though a single R function or GUI interface. It constructs predictive models of continuous or discrete responses using Random Forests or Stochastic Gradient Boosting. It validates these models with an independent test set, cross-validation, or (in the case of Random Forest Models) with Out OF Bag (OOB) predictions on the training data. It creates graphs and tables of the model validation diagnostics. It applies these models to GIS image files of predictors to create detailed prediction surfaces. It will handle large predictor files for map making, by reading in the GIS data in sections,thus keeping memory usage reasonable. \end{abstract} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} Maps of tree species presence and silvicultural metrics like basal area are needed throughout the world for a wide variety of forest land management applications. Knowledge of the probable location of certain key species of interest as well as their spatial patterns and associations to other species are vital components to any realistic land management activity. Recently developed modeling techniques such as Random Forest \citep{Breiman01} and Stochastic Gradient Boosting \citep{Friedman01,Friedman02} offer great potential for improving models and increasing map accuracy \citep{evans09, moisen06}. The \proglang{R} software environment offers sophisticated new modeling techniques, but requires advanced programming skills to take full advantage of these capabilities. In addition, spatial data files can be too memory intensive to analyze easily with standard \proglang{R} code. The \pkg{ModelMap} package provides an interface between several existing \proglang{R} packages to automate and simplify the process of model building and map construction. While spatial data is typically manipulated within a Geographic Information System (GIS), the \pkg{ModelMap} package facilitates modeling and mapping extensive spatial data in the \proglang{R} software environment. \pkg{ModelMap} has simple to use GUI prompts for non-programmers, but still has the flexibility to be run at the command line or in batch mode, and the power to take full advantage of sophisticated new modeling techniques. \pkg{ModelMap} uses the \pkg{raster} package to read and predict over GIS raster data. Large maps are read in by row, to keep memory usage reasonable. The current implementation of \pkg{ModelMap} builds predictive models using Random Forests, Quantile Regression Forests, and Conditional Inference Forests. Stochastic Gradient Boosting models are not currently supported. Random Forest models are constructed using the \pkg{randomForest} package \citep{randomForest}. For more information on Quantile Regression Forests and Conditional Inference Forests see the additional vignette, "`Pick Your Flavor of Random Forest". The \pkg{ModelMap} package models both continuous and binary response variables. For binary response, the \pkg{PresenceAbsence} package \citep{PresenceAbsence} package is used for model diagnostics. Random Forest models are built as an ensemble of classification or regression trees \citep{breiman84}. Classification and regression trees are intuitive methods, often described in graphical or biological terms. Typically shown growing upside down, a tree begins at its root. An observation passes down the tree through a series of splits, or nodes, at which a decision is made as to which direction to proceed based on the value of one of the explanatory variables. Ultimately, a terminal node or leaf is reached and predicted response is given. Trees partition the explanatory variables into a series of boxes (the leaves) that contain the most homogeneous collection of outcomes possible. Creating splits is analogous to variable selection in regression. Trees are typically fit via binary recursive partitioning. The term binary refers to the fact that the parent node will always be split into exactly two child nodes. The term recursive is used to indicate that each child node will, in turn, become a parent node, unless it is a terminal node. To start with a single split is made using one explanatory variable. The variable and the location of the split are chosen to minimize the impurity of the node at that point. There are many ways to minimizing the impurity of each node. These are known as splitting rules. Each of the two regions that result from the initial split are then split themselves according to the same criteria, and the tree continues to grow until it is no longer possible to create additional splits or the process is stopped by some user-defined criteria. The tree may then be reduced in size using a process known as pruning. Overviews of classification and regression trees are provided by \citet{death00}, \citet{vayssieres00}, and \citet{moisen08}. While classification and regression trees are powerful methods in and of themselves, much work has been done in the data mining and machine learning fields to improve the predictive ability of these tools by combining separate tree models into what is often called a committee of experts, or ensemble. Random Forests and Stochastic Gradient Boosting are two of these newer techniques that use classification and regression trees as building blocks. \emph{Random Forests} --- In a Random Forests model, a bootstrap sample of the training data is chosen. At the root node, a small random sample of explanatory variables is selected and the best split made using that limited set of variables. At each subsequent node, another small random sample of the explanatory variables is chosen, and the best split made. The tree continues to be grown in this fashion until it reaches the largest possible size, and is left un-pruned. The whole process, starting with a new bootstrap sample, is repeated a large number of times. As in committee models, the final prediction is a (weighted) plurality vote or average from prediction of all the trees in the collection. \emph{Stochastic Gradient Boosting} --- Stochastic gradient boosting is another ensemble technique in which many small classification or regression trees are built sequentially from pseudo-residuals from the previous tree. At each iteration, a tree is built from a random sub-sample of the dataset (selected without replacement) producing an incremental improvement in the model. Ultimately, all the small trees are stacked together as a weighted sum of terms. The overall model accuracy gets progressively better with each additional term. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Package Overview} The \pkg{ModelMap} package for \proglang{R} enables user-friendly modeling, diagnostics, and mapping over large geographic areas though simple R function calls: \code{model.build()}, \code{model.diagnostics()}, and \code{model.mapmake()}. The function \code{model.build()} constructs predictive models of continuous or discrete responses using Random Forests. The function \code{model.diagnostics()} validates these models with an independent test set, cross-validation, or (in the case of Random Forest Models) with Out OF Bag (OOB) predictions on the training data. This function also creates graphs and tables of the basic model validation diagnostics. The functions \code{model.importance.plot()} and \code{model.interaction.plot} provide additional graphical tools to examine the relationships between the predictor variable. The function \code{model.mapmake()} applies the models to GIS image files of predictors to create detailed prediction surfaces. This function will handle large predictor files for map making, by reading in the GIS data in sections, thus keeping memory usage reasonable. The \pkg{raster} package is used to read and write to the GIS image files. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Interactive Model Creation} The \pkg{ModelMap} package can be run in a traditional \proglang{R} command line mode, where all arguments are specified in the function call. However, in a \proglang{Windows} environment, \pkg{ModelMap} can also be used in an interactive, pushbutton mode. If the functions \code{model.build()}, \code{model.diagnostics()}, and \code{model.mapmake()} are called without argument lists, pop up windows ask questions about the type of model, the file locations of the data, response variable, predictors, etc \ldots To provide a record of the options chosen for a particular model and map, a text file is generated each time these functions are called, containing a list of the selected arguments. This paper concentrates on the traditional command line function calls, but does contain some tips on using the GUI prompts. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{File Names} File names in the argument lists for the functions can be provided either as the full path, or as the base name, with the path specified by the folder argument. However, file names in the Raster Look Up Table (the \code{rastLUTfn}, described in section~\ref{sec:rastLUT}) must include the full path. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Training Data} Training and test data can be supplied in two forms. The argument \code{qdata.trainfn} can be either an \proglang{R} data frame containing the training data, or the file name (full path or base name) of the comma separated values (CSV) training data file. If a filename is given, the file must be a comma-delimited text file with column headings. The data frame or CSV file should include columns for both response and predictor variables. In a \proglang{Windows} environment, if \code{qdata.trainfn = NULL} (the default), a GUI interface prompts the user to browse to the training data file. Note: If \code{response.type = "binary"}, any response with a value greater than \code{0} is treated as a presence. If there is a cutoff value where anything below that value is called trace, and treated as an absence, the response variable must be transformed before calling the functions. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Independent Test Set for Model Validation} The argument \code{qdata.testfn} is the file name (full path or base name) of the independent data set for testing (validating) the model's predictions, or alternatively, the \proglang{R} data frame containing the test data. The column headings must be the same as those in the training data (\code{qdatatrainfn}). If no test set is desired (for example, cross-validation will be performed, or RF models with out-of-bag estimation), set \code{qdata.testfn = FALSE}. In a \proglang{Windows} environment, if \code{qdata.testfn = NULL} (default), a prompt will ask the a test set is available, and if so asks the user to browse to the test data file. If no test set is available, the a prompt asks if a proportion of the data should be set aside as an independent test set. If this is desired, the user will be prompted to specify the proportion to set aside as test data, and two new data files will be generated in the output folder. The new file names will be the original data file name with \code{"_train"} and \code{"_test"} pasted on the end. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Missing predictor values} There are three circumstances that can lead to having missing predictor values. First, there are true missing values for predictors within the test set or study area. Second, there are categorical predictors with categories that are present in the test or mapping data but not in the training data. And finally, portions of the mapping rectangle lie outside of the study area. Each of the three cases is handled slightly differently by \pkg{ModelMap}. In the first instance, true mising data values in the test set or within the study area for production mapping could be caused by data collection errors. These are data points or pixels for which you may still need be interested in a prediction based on the other remaining predictors. These missing values should be coded as \code{NA} in the training or test data. In \proglang{Imagine} image files, pixels of the specified \code{NODATA} value will be read into \proglang{R} as \code{NA}. The argument \code{na.action} will determine how these \code{NA} pixels will be treated. For model diagnostics, there are 2 options: (1) \code{na.action = "na.omit"} (the default) where any data point or pixel with any \code{NA} predictors is omitted from the model building process and the diagnostic predictions, or returned as \code{-9999} in the map predictions; (2) \code{na.action = "na.roughfix"} where before making predictions, a missing categorical predictor is replaced with the most common category for that predictor, and a missing continuous predictor is replaced with the median for that predictor. Currently, for map making only one option is available: \code{na.action = "na.omit"}. The second type of missing value occurs when using categorical predictors. There may be cases where a category is found in the validation test set or in the map region that was not present in the training data. This is a particularly common occurrence when using cross-validation on a small dataset. Again, the argument \code{na.action} will determine how these data points or pixels are treated. If \code{na.action = "na.omit"}, no prediction will be made for these locations. For model diagnostics, with \code{na.action = "na.roughfix"} the most common category will be substituted for the unknown category. Again, for map making \code{na.action = "na.omit"} is the only available option. In either instance, a warning will be generated with a list of the categories that were missing from the training data. After examining these categories, you may decide that rather than omitting these locations or substituting the most common category, a better option would be to collapse similar categories into larger groupings. In this case you would need to pre-process your data and run the models and predictions again. The final type of missing predictor occurs when creating maps of non-rectangular study regions. There may be large portions of the rectangle where you have no predictors, and are uninterested in making predictions. The suggested value for the pixels outside the study area is \code{-9999}. These pixels will be ignored, thus saving computing time, and will be exported as \code{NA}. Note: in \proglang{Imagine} image files, if the specified \code{NODATA} is set as \code{-9999}, any \code{-9999} pixels will be read into \proglang{R} as \code{NA}. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{The Model Object} The models built by the \pkg{ModelMap} package are stochastic models. If a seed is not specified (with argument \code{seed}) each function call will result in a slightly different model. The function \code{model.build()} returns the model object. To keep this particular model for use in later \proglang{R} sessions, assign the function output to an \proglang{R} object, then use the functions \code{save()} and \code{load()}. Random Forest is implemented through the \pkg{randomForest} package within \proglang{R}. Random Forest has relatively few user set parameters and is not very sensitive to tuning of these parameters. The number of predictors used to select the splits (the \code{mtry} argument) is the primary user specified parameter that can affect model performance, and the default for \pkg{ModelMap} is to automatically optimize this parameter with the \code{tuneRF()} function from the randomForest package. In most circumstance, Random Forest is less likely to over fit data. For an in depth discussion of the possible penalties of increasing the number of trees (the \code{ntree} argument) see \citet{lin02}. The \pkg{randomForest} package provides two measures to evaluate variable importance. The first is the percent increase in Mean Standard Error (MSE) as each variable is randomly permuted. The second is the increase in node purity from all the splits in the forest based on a particular variable, as measured by the gini criterion \citep{Breiman01}. These importance measures should be used with caution when predictors vary in scale or number of categories \citep{strobl07}. Stochastic gradient boosting is currently disabled in \pkg{ModelMap}. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Spatial Raster Layers} The \pkg{ModelMap} uses the \pkg{raster} package to read spatial rasters into \proglang{R}. The data for predictive mapping in \pkg{ModelMap} should be in the form of pixel-based raster layers representing the predictors in the model. The layers must be a file type recognizeable by the \pkg{raster} package, for example \proglang{ERDAS Imagine} image (single or multi-band) raster data formats, having continuous or categorical data values. For effective model development and accuracy, if there is more than one raster layer, the layers must have the same extent, projection, and pixel size. To speed up processing of predictor layers, \code{model.mapmake()} builds a single raster brick object containing one layer for each predictor in the model. By default, this brick is stored as a temp file and deleted once the map is complete. If the argument \code{keep.predictor.brick = TRUE} then the brick will be saved in a native \pkg{raster} package format file, with the file name constructed by appending \code{'_brick.grd'} to the \code{OUTPUTfn}. The function \code{model.mapmake()} by default outputs an \proglang{ERDAS Imagine} image file of map information suitable to be imported into a GIS. (Note: The file extension of the \code{OUTPPUTfn} argument can be used to specify other file types, see the help file for the \code{writeFormats()} function in the \pkg{raster} package for a list of possible file types and extensions.) Maps can then be imported back into \proglang{R} and view graphically using the \pkg{raster} package. The supplementary materials in \citet{elith08} also contain \proglang{R} code to predict to grids imported from a GIS program, including large grids that need to be imported in pieces. However this code requires pre-processing of the raster data in the GIS software to produce ASCII grids for each layer of data before they can be imported into \proglang{R}. \pkg{ModelMap} simplifies and automates this process, by reading Imagine image files directly, (including multi band images). \pkg{ModelMap} also will verify that the extent of all rasters is identical and will produce informative error messages if this is not true. \pkg{ModelMap} also simplifies working with masked values and missing predictors. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Raster Look Up Table} \label{sec:rastLUT} The Raster Look Up Table (\code{rastLUTfn}) provides the link between the spatial rasters for map production and the column names of the Training and Test datasets. The Raster Look Up Table can be given as an \proglang{R} data frame specified by the argument \code{rastLUTfn} or read in from a CSV file specified by \code{rastLUTfn}. The \code{rastLUTfn} must include 3 columns: (1) the full path and base names of the raster file or files; (2) the column headers from the Training and Test datasets for each predictor; (3) the layer (band) number for each predictor. The names (column 2) must match not only the column headers in Training and Test data sets (\code{qdata.trainfn} and \code{qdata.testfn}), but also the predictor names in the arguments \code{predList} and \code{predFactor}, and the predictor names in \code{model.obj}. In a windows environment, the function \code{build.rastLUT()} may be used to help build the look-up-table with the aid of GUI prompts. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Examples} These examples demonstrate some of the capabilities of the \pkg{ModelMap} package by building three Random Forest models: continuous response, binary response and categorical response. The continuous response variables are percent cover for two species of interest: Pinyon and Sage. The binary response variables are Presence/Absence of these same species.The categorical response is vegetation category. Next, model validation diagnostics are performed with three techniques: an independent test set, Out Of Bag estimation, and cross-validation. Note: in an actual model comparison study, rather than a package demonstration, the models would be compared with the same validation technique, rather than mixing techniques. Finally, spatial maps are produced by applying these models to remote sensing raster layers. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Example dataset} The dataset is from a pilot study in Nevada launched in 2004 involving acquisition and photo-interpretation of large-scale aerial photography, the Nevada Photo-Based Inventory Pilot (NPIP) \citep{Frescino09}. The data files for these examples are included in the \pkg{ModelMap} package instalation in the \proglang{R} library dirrectory. The datasets are under the 'external' then under 'vignetteexamples'. The predictor data set consists of 6 predictor variables: 5 continuous variables, and 1 categorical variable \mbox{(Table \ref{tab:one})}. The predictor layers are 250-meter resolution, pixel-based raster layers including Moderate Resolution Imaging Spectro-radiometer (MODIS) satellite imagery \citep{justice02}, a Landsat Thematic Mapper-based, thematic layer of predicted land cover, National Land Cover Dataset (NLCD) \citep{homer04}, and a topographic layer of elevation from the National Elevation Dataset \citep{gesch02}. The continuous response variables are percent cover of Pinyon and Sage. The binary response variables are presence of Pinyon and Sage. The categorical response variable is the vegetation category: TREE, SHRUB, OTHERVEG, and NONVEG. The MODIS data included 250-meter, 16-day, cloud-free, composites of MODIS imagery for April 6, 2005: visible-red (RED) and near-infrared (NIR) bands and 2 vegetation indices, normalized difference vegetation index (NDVI) and enhanced vegetation index (EVI) \citep{huete02}. The land cover and topographic layers were 30-meter products re-sampled to 250 meter using majority and mean summaries, respectively. The rectangular subset of Nevada chosen for these maps contains a small mountain range surrounded by plains, and was deliberately selected to lie along the edge of the study region to illustrate how \pkg{ModelMap} handles unsampled regions of a rectangle \mbox{(Figure \ref{fig:ExElev})}. \begin{table} \begin{center} \begin{tabular}{|l|l|l|} \hline Name& Type& Description\\ \hline ELEV250& Continuous& 90m NED elevation (ft)\\ & & resampled to 250m, average of 49 points\\ NLCD01\_250& Categorical& National Land Cover Dataset 2001 \\ & & resampled to 250m - min. value of 49 points\\ EVI2005097& Continuous& MODIS Enhanced vegetation index\\ NDV2005097& Continuous& MODIS Normalized difference vegetation index\\ NIR2005097& Continuous& MODIS Band 2 (Near Infrared)\\ RED2005097& Continuous& MODIS Band 1 (Red)\\ \hline \end{tabular} \end{center} \caption{\label{tab:one}Predictor variables} \end{table} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Example 1 - Random Forest - Continuous Response} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Example 1 builds Random Forest models for two continuous response variables: Percent Cover for Pinyon and Percent Cover for Sage. An independent test set is used for model validation. \subsubsection{Set up} After installing the \pkg{ModelMap} package, find the sample datasets from the \proglang{R} istallation and copy them to your working directory. The data consists of five files and is located in the vignette directory of \pkg{ModelMap}, for example, in \verb+C:\R\R-2.15.0\library\ModelMap\vignettes+. There are 5 files: \begin{tabular}{lll} & & VModelMapData.csv\\ & & VModelMapData\_LUT.csv\\ & & VModelMapData\_dem\_ELEVM\_250.img\\ & & VModelMapData\_modis\_STK2005097.img\\ & & VModelMapData\_nlcd\_NLCD01\_250.img\\ \end{tabular} <<Ex1 set options,echo=FALSE>>= options(prompt = "R> ") options(width = 75) options(continue=" ") pdf("Vplots.pdf") @ <<Ex1 set width,echo=false>>= options(width=60) @ Load the \pkg{ModelMap} package. <<Ex1 load package, results=hide>>= library("ModelMap") @ Next define some of the arguments for the models. Specify model type. The choices are \code{"RF"} for Random Forest models, \code{"QRF"} for quantile regression forest models, and \code{"CF"} for conditional inference forest models. See the "Pick your flavor of Random Forest" vignette for further information on QRF and CF models. <<Ex1 Define model type, results=hide>>= model.type <- "RF" @ Define training and test data file names. Note that the arguments \code{qdata.trainfn} and \code{qdata.testfn} will accept either character strings giving the file names of CSV files of data, or the data itself in the form of a data frame. <<Ex1 Define training and test files, results=hide>>= qdatafn <- "VModelMapData.csv" qdata.trainfn <- "VModelMapData_TRAIN.csv" qdata.testfn <- "VModelMapData_TEST.csv" @ Define the output folder. <<Ex1 define folder, results=hide>>= folder <- getwd() @ Split the data into training and test sets. In example 1, an independent test set is used for model validation diagnostics. The function \code{get.test()} randomly divides the original data into training and test sets. This function writes the training and test sets to the folder specified by \code{folder}, under the file names specified by \code{qdata.trainfn} and \code{qdata.testfn}. If the arguments \code{qdata.trainfn} and \code{qdata.testfn} are not included, filenames will be generated by appending \code{"_train"} and \code{"_test"} to \code{qdatafn}. <<Ex1 split training and test, results=hide>>= get.test( proportion.test=0.2, qdatafn=qdatafn, seed=42, folder=folder, qdata.trainfn=qdata.trainfn, qdata.testfn=qdata.testfn) @ Define file names to store model output. This filename will be used for saving the model itself. In addition, since we are not defining other output filenames, the names for other output files will be generated based on \code{MODELfn}. <<Ex1 Define model filenames, results=hide>>= MODELfn.a <- "VModelMapEx1a" MODELfn.b <- "VModelMapEx1b" @ Define the predictors and define which predictors are categorical. Example 1 uses five continuous predictors: the four predictor layers from the MODIS imagery plus the topographic elevation layer. As none of the chosen predictors are categorical set \code{predFactor} to \code{FALSE}. <<Ex1 Define predictors, results=hide>>= predList <- c( "ELEV250", "EVI2005097", "NDV2005097", "NIR2005097", "RED2005097") predFactor <- FALSE @ Define the response variable, and whether it is continuous, binary or categorical. <<Ex1 Define response, results=hide>>= response.name.a <- "PINYON" response.name.b <- "SAGE" response.type <- "continuous" @ Define the seeds for each model. <<Ex1 set seed, results=hide>>= seed.a <- 38 seed.b <- 39 @ Define the column that contains unique identifiers for each data point. These identifiers will be used to label the output file of observed and predicted values when running model validation. <<Ex1 Define Identifier, results=hide>>= unique.rowname <- "ID" @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Model creation} Now create the models. The \code{model.build()} function returns the model object itself. The function also saves a text file listing the values of the arguments from the function call. This file is particularly useful when using the GUI prompts, as otherwise there would be no record of the options used for each model. <<Ex1 Create Model, results=hide>>= model.obj.ex1a <- model.build( model.type=model.type, qdata.trainfn=qdata.trainfn, folder=folder, unique.rowname=unique.rowname, MODELfn=MODELfn.a, predList=predList, predFactor=predFactor, response.name=response.name.a, response.type=response.type, seed=seed.a) model.obj.ex1b <- model.build( model.type=model.type, qdata.trainfn=qdata.trainfn, folder=folder, unique.rowname=unique.rowname, MODELfn=MODELfn.b, predList=predList, predFactor=predFactor, response.name=response.name.b, response.type=response.type, seed=seed.b) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Model Diagnostics} Next make model predictions on an independent test set and run the diagnostics on these predictions. Model predictions on an independent test set are not stochastic, it is not necessary to set the seed. The \code{model.diagnostics()} function returns a data frame of observed and predicted values. This data frame is also saved as a CSV file. This function also runs model diagnostics, and creates graphs and tables of the results. The graphics are saved as files of the file type specified by \code{device.type}. For a continuous response model, the model validation diagnostics graphs are the variable importance plot \mbox{(Figure \ref{fig:Ex1aVarImp} and Figure \ref{fig:Ex1bVarImp})}, and a scatter plot of observed verses predicted values, labeled with the Pearson's and Spearman's correlation coefficients and the slope and intercept of the linear regression line \mbox{(Figure \ref{fig:Ex1aScatterplot} and Figure \ref{fig:Ex1bScatterplot})}. For Random forest models, the model diagnostic graphs also include the out of bag model error as a function of the number of trees\mbox{(Figure \ref{fig:Ex1aErrNtree} and Figure \ref{fig:Ex1bErrNtree})} In example 1, the diagnostic plots are saved as PDF files. These diagnostics show that while the most important predictor variables are similar for both models, the correlation coefficients are considerably higher for the Pinyon percent cover model as compared to the Sage model. <<Ex1 Model Diagnostics, results=hide>>= model.pred.ex1a <- model.diagnostics( model.obj=model.obj.ex1a, qdata.testfn=qdata.testfn, folder=folder, MODELfn=MODELfn.a, unique.rowname=unique.rowname, # Model Validation Arguments prediction.type="TEST", device.type=c("pdf"), cex=1.2) model.pred.ex1b <- model.diagnostics( model.obj=model.obj.ex1b, qdata.testfn=qdata.testfn, folder=folder, MODELfn=MODELfn.b, unique.rowname=unique.rowname, # Model Validation Arguments prediction.type="TEST", device.type=c("pdf"), cex=1.2) @ \begin{figure} \begin{center} \includegraphics{VModelMapEx1a_pred_importance} \end{center} \caption{\label{fig:Ex1aVarImp}Example 1 - Variable importance graph for Pinyon percent cover (RF model).} \end{figure} \begin{figure} \begin{center} \includegraphics{VModelMapEx1b_pred_importance} \end{center} \caption{\label{fig:Ex1bVarImp}Example 1 - Variable importance graph for Sage percent cover (RF model).} \end{figure} \begin{figure} \begin{center} \includegraphics{VModelMapEx1a_pred_scatterplot} \end{center} \caption{\label{fig:Ex1aScatterplot}Example 1 - Observed verses predicted values for Pinyon percent cover (RF model).} \end{figure} \begin{figure} \begin{center} \includegraphics{VModelMapEx1b_pred_scatterplot} \end{center} \caption{\label{fig:Ex1bScatterplot}Example 1 - Observed verses predicted values for Sage percent cover (RF model).} \end{figure} \begin{figure} \begin{center} \includegraphics{VModelMapEx1a_pred_error} \end{center} \caption{\label{fig:Ex1aErrNtree}Example 1 - Out of Bag error as a function of number of trees for Pinyon (RF model).} \end{figure} \begin{figure} \begin{center} \includegraphics{VModelMapEx1b_pred_error} \end{center} \caption{\label{fig:Ex1bErrNtree}Example 1 - Out of Bag error as a function of number of trees for Sage (RF model).} \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Comparing Variable Importance} The \code{model.importance.plot()} function uses a back-to-back barplot to compare variable importance between two models built from the same predictor variables \mbox{(Figure \ref{fig:Ex1CompImp})}. Variable Importance is calculated in various ways depending on the model type, the response type and the importance type. Importance measures are summarized in \mbox{(Table \ref{tab:two})}. \begin{table} \begin{center} \begin{tabular}{|l|l|l|l|} \hline Model& Response& Importance Type & Measured By\\ \hline RF & continuous & 1 - permutation & Percent Increase in MSE \\ RF & binary & 1 - permutation & Mean Decrease in Accuracy \\ RF & categorical& 1 - permutation & Mean Decrease in Accuracy \\ RF & continuous & 2 - node impurity & Residual Sum of Squares \\ RF & binary & 2 - node impurity & Mean Decrease in Gini \\ RF & categorical& 2 - node impurity & Mean Decrease in Gini \\ \hline \end{tabular} \end{center} \caption{\label{tab:two}Importance Measures} \end{table} <<Ex1 Compare Importance Plots, results=hide>>= model.importance.plot( model.obj.1=model.obj.ex1a, model.obj.2=model.obj.ex1b, model.name.1="Pinyon", model.name.2="Sage", sort.by="predList", predList=predList, scale.by="sum", main="Variable Importance", device.type="pdf", PLOTfn="VModelMapEx1CompareImportance", folder=folder) @ \begin{figure} \begin{center} \includegraphics{VModelMapEx1CompareImportance.pdf} \end{center} \caption{\label{fig:Ex1CompImp}Example 1 - Variable Importances for Pinyon verses Sage percent cover models.} \end{figure} The \code{model.importance.plot()} function can also be used to compare the two types of variable importance \mbox{(Figure \ref{fig:Ex1CompImpType})}. <<Ex1CompImpType,include=TRUE,include=TRUE,width=7.5,height=9.5, results=hide>>= opar <- par(mfrow=c(2,1),mar=c(3,3,3,3),oma=c(0,0,3,0)) model.importance.plot( model.obj.1=model.obj.ex1a, model.obj.2=model.obj.ex1a, model.name.1="", model.name.2="", imp.type.1=1, imp.type.2=2, sort.by="predList", predList=predList, scale.by="sum", main="Pinyon", device.type="none", cex=0.9) model.importance.plot( model.obj.1=model.obj.ex1b, model.obj.2=model.obj.ex1b, model.name.1="", model.name.2="", imp.type.1=1, imp.type.2=2, sort.by="predList", predList=predList, scale.by="sum", main="Sage", device.type="none", cex=0.9) mtext("Comparison of Importance Types",side=3,line=0,cex=1.8,outer=TRUE) par(opar) @ \begin{figure} \begin{center} <<Ex1CompImpTypeFig,fig=TRUE, results=hide, echo=FALSE, width=7.5,height=9.5>>= <<Ex1CompImpType>> @ \end{center} \caption{\label{fig:Ex1CompImpType}Example 1 - Variable Importances Types for continuous response models - the relative importance of predictor variables as measured by the mean decrease in accuracy from randomly permuting each predictor as compared to the decrease in node impurities from splitting on the variable.} \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Interaction Plots} The \code{model.interaction.plot()} function provides a diagnostic plot useful in visualizing two-way interactions between predictor variables. Two of the predictor variables from the model are used to produce a grid of possible combinations over the range of both variables. The remaining predictor variables are fixed at either their means (for continuous predictors) or their most common value (for categorical predictors). Model predictions are generated over this grid and plotted as the z axis. The \code{model.interaction.plot()} function was developed from the \code{gbm.perspec} function from the tutorial provided as appendix S3 in \citet{elith08}. The \code{model.interaction.plot()} function provides two graphical options: an image plot, and a 3-D perspective plot. These options are selected by setting \code{plot.type = "image"} or \code{plot.type = "persp"} The \code{x} and \code{y} arguments are used to specify the predictor variables for the X and Y axis. The predictors can be specified by name, or by number, with the numbers referring to the order the variables appear in \code{predList}. <<Ex1 Interaction Plots, results=hide>>= model.interaction.plot( model.obj.ex1a, x="NIR2005097", y="RED2005097", main=response.name.a, plot.type="image", device.type="pdf", MODELfn=MODELfn.a, folder=folder) model.interaction.plot( model.obj.ex1b, x="NIR2005097", y="RED2005097", main=response.name.b, plot.type="image", device.type="pdf", MODELfn=MODELfn.b, folder=folder) model.interaction.plot( model.obj.ex1a, x=1, y=3, main=response.name.a, plot.type="image", device.type="pdf", MODELfn=MODELfn.a, folder=folder) model.interaction.plot( model.obj.ex1b, x=1, y=3, main=response.name.b, plot.type="image", device.type="pdf", MODELfn=MODELfn.b, folder=folder) @ \begin{figure} \begin{center} \includegraphics{VModelMapEx1a_image_NIR2005097_RED2005097} \end{center} \caption{\label{fig:Ex1aIntNirRed}Example 1 - Interaction plot for Pinyon percent cover (RF model), showing interactions between two of the satellite based predictors (NIR2005097 and RED2005097). Image plot, with darker green indicating higher percent cover. Here we can see that predicted Pinyon cover is highest at low values of either NIR or RED. However low values of both predictors does not further raise the predicted cover.} \end{figure} \begin{figure} \begin{center} \includegraphics{VModelMapEx1b_image_NIR2005097_RED2005097} \end{center} \caption{\label{fig:Ex1bIntNirRed}Example 1 - Interaction plot for Sage percent cover (RF model), showing interactions between two of the satellite based predictors (NIR2005097 and RED2005097). Image plot, with darker green indicating higher percent cover. Here we can see that predicted Sage cover is lowest when high values of NIR are combined with RED lying between 2000 and 5000. When NIR is lower than 2900, RED still has an effect on the predicted cover, but the effect is not as strong. } \end{figure} \begin{figure} \begin{center} \includegraphics{VModelMapEx1a_image_ELEV250_NDV2005097.pdf} \end{center} \caption{\label{fig:Ex1aIntElevNdv}Example 1 - Interaction plot for Pinyon percent cover (RF model), showing interactions between elevation and a satellite based predictor (ELEV250 and NDV2005097). Image plot, with darker green indicating higher percent cover. Here we can see that predicted Pinyon cover is highest at elevation greater than 2000m. In addition, high values of NDV slightly increase the predicted cover, but there seems to be little interaction between the two predictors.} \end{figure} \begin{figure} \begin{center} \includegraphics{VModelMapEx1b_image_ELEV250_NDV2005097.pdf} \end{center} \caption{\label{fig:Ex1bIntElevNdv}Example 1 - Interaction plot for Sage percent cover (RF model), showing interactions between elevation and a satellite based predictor (ELEV250 and NDV2005097). Image plot, with darker green indicating higher percent cover. Here we do see an interaction between the two predictors. At low elevations, predicted Sage cover is low throughout the range of NDV, and particularly low at mid-values. At mid elevations, predicted Sage cover is high throughout the range of NDV. At high elevations NDV has a strong influence on predicted Sage cover with high cover tied to low to mid values of NDV.} \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Map production} Before building maps of the responses, examine the predictor variable for elevation \mbox{(Figure \ref{fig:ExElev})}: <<ExElevReadGDAL, results=hide>>= elevfn <- paste(folder,"/VModelMapData_dem_ELEVM_250.img",sep="") mapgrid <- raster(elevfn) @ <<ExElev,include=TRUE,width=4.5,height=6.5, results=hide>>= opar <- par(mar=c(4,4,3,6),xpd=NA,mgp=c(3, 2, .3)) col.ramp<-terrain.colors(101) zlim <- c(1500,maxValue(mapgrid)) legend.label<-rev(pretty(zlim,n=5)) legend.colors<-col.ramp[trunc((legend.label/max(legend.label))*100)+1] legend.label<-paste(legend.label,"m",sep="") legend.label<-paste((7:3)*500,"m") legend.colors<-col.ramp[c(100,75,50,25,1)] image( mapgrid, col = col.ramp, xlab="", ylab="", zlim=zlim, asp=1, bty="n", main="") legend( x=xmax(mapgrid),y=ymax(mapgrid), legend=legend.label, fill=legend.colors, bty="n", cex=1.2) mtext("Elevation of Study Region",side=3,line=1,cex=1.5) par(opar) @ \begin{figure} \begin{center} <<ExElevFig,fig=TRUE, echo=FALSE, width=4.7,height=5.7>>= <<ExElev>> @ \end{center} \caption{\label{fig:ExElev}Elevation of study region. Projection: Universal Transverse Mercator (UTM) Zone 11, Datum: NAD83} \end{figure} Run the function \code{model.mapmake()} to map the response variable over the study area. The \code{model.mapmake()} function can extract information about the model from the \code{model.obj}, so it is not necessary to re-supply the arguments that define the model, such as the type of model, the predictors, etc \ldots{} (Note: If model was created outside of \pkg{ModelMap}, it may be necessary to supply the \code{response.name} argument) Also, unlike model creation, map production is not stochastic, so it is not necessary to set the seed. The \code{model.mapmake()} uses a look up table to associate the predictor variables with the rasters. The function argument \code{rastLUTfn} will accept either a file name of the CSV file containing the table, or the data frame itself. Although in typical user applications the raster look up table must include the full path for predictor rasters, the table provided for the examples will be incomplete when initially downloaded, as the working directory of the user is unknown and will be different on every computer. This needs to be corrected by pasting the full paths to the user's working directory to the first column, using the value from \code{folder} defined above. <<Ex1 update raster LUT, results=hide>>= rastLUTfn <- "VModelMapData_LUT.csv" rastLUTfn <- read.table( rastLUTfn, header=FALSE, sep=",", stringsAsFactors=FALSE) rastLUTfn[,1] <- paste(folder,rastLUTfn[,1],sep="/") @ To produce a map from a raster larger than the memory limits of \proglang{R}, predictions are made one row at a time. Since this is a Random Forest model of a continuous response, the prediction at each pixel is the mean of all the trees. Therefore these individual tree predictions can also be used to map measures of uncertainty such as standard deviation and coefficient of variation for each pixel. To do so, set \code{map.sd = "TRUE"}. To calculate these pixel uncertainty measures, \code{model.map()} must keep all the individual trees in memory, so \code{map.sd = "TRUE"} is much more memory intensive. <<Ex1 Define numrows, results=hide>>= @ <<Ex1 Produce Maps, results=hide>>= model.mapmake( model.obj=model.obj.ex1a, folder=folder, MODELfn=MODELfn.a, rastLUTfn=rastLUTfn, na.action="na.omit", # Mapping arguments map.sd=TRUE) model.mapmake( model.obj=model.obj.ex1b, folder=folder, MODELfn=MODELfn.b, rastLUTfn=rastLUTfn, na.action="na.omit", # Mapping arguments map.sd=TRUE) @ The function \code{model.mapmake()} creates an \proglang{Imagine} image file of map information suitable to be imported into a GIS. As this sample dataset is relatively small, we can also import it into \proglang{R} for display. We need to define a color ramp. For this response variable, zero values will display as white, shading to dark green for high values. <<Ex1 define color sequence, results=hide>>= l <- seq(100,0,length.out=101) c <- seq(0,100,length.out=101) col.ramp <- hcl(h = 120, c = c, l = l) @ Next, we import the data and create the map \mbox{(Figure \ref{fig:Ex1Map})}. From the map, we can see that Pinyon percent cover is higher in the mountains, while Sage percent cover is higher in the foothills at the edges of the mountains. Note that the sample map data was taken from the South Eastern edge of our study region, to illustrate how \pkg{ModelMap} deals with portions of the rectangle that fall outside of the study region. The empty wedge at lower right in the maps is the portion outside the study area. \pkg{ModelMap} uses \code{-9999} for unsampled data. When viewing maps in a GIS, a mask file can be used to hide unsampled regions, or other commands can be used to set the color for \code{-9999} values. Since we know that percent cover can not be negative, we will set \code{zlim} to range from zero to the maximum value found in our map. <<Ex1Map,include=TRUE,width=7.5,height=5.7, results=hide>>= opar <- par(mfrow=c(1,2),mar=c(3,3,2,1),oma=c(0,0,3,4),xpd=NA) mapgrid.a <- raster(paste(MODELfn.a,"_map.img",sep="")) mapgrid.b <- raster(paste(MODELfn.b,"_map.img",sep="")) zlim <- c(0,max(maxValue(mapgrid.a),maxValue(mapgrid.b))) legend.label<-rev(pretty(zlim,n=5)) legend.colors<-col.ramp[trunc((legend.label/max(legend.label))*100)+1] legend.label<-paste(legend.label,"%",sep="") image( mapgrid.a, col=col.ramp, xlab="",ylab="",xaxt="n",yaxt="n", zlim=zlim, asp=1,bty="n",main="") mtext(response.name.a,side=3,line=1,cex=1.2) image( mapgrid.b, col=col.ramp, xlab="",ylab="",xaxt="n",yaxt="n", zlim=zlim, asp=1,bty="n",main="") mtext(response.name.b,side=3,line=1,cex=1.2) legend( x=xmax(mapgrid.b),y=ymax(mapgrid.b), legend=legend.label, fill=legend.colors, bty="n", cex=1.2) mtext("Percent Cover",side=3,line=1,cex=1.5,outer=T) par(opar) @ \begin{figure} \begin{center} <<Ex1MapFig,fig=TRUE,echo=FALSE, width=7.5,height=5.7>>= <<Ex1Map>> @ \end{center} \caption{\label{fig:Ex1Map}Example 1 - Maps of percent cover for Pinyon and Sage (RF models).} \end{figure} Next, we will define color ramps for the standard deviation and the coefficient of variation, and map these uncertainty measures. Often, as the mean increases, so does the standard deviation \citep{zar96}, therefore, a map of the standard deviation of the pixels \mbox{(Figure \ref{fig:Ex1StdevMap})} will look to the naked eye much like the map of the mean. However, mapping the coefficient of variation (dividing the standard deviation of each pixel by the mean of the pixel), can provide a better visualization of spatial regions of higher uncertainty \mbox{(Figure \ref{fig:Ex1CoefvMap})}. In this case, for Pinyon the coefficient of variation is interesting as it is higher in the plains on the upper left portion of the map, where percent cover of Pinyon is lower. <<Ex1 define stdev color sequence, results=hide>>= stdev.ramp <- hcl(h = 15, c = c, l = l) @ <<Ex1StdevMap,include=TRUE,width=7.5,height=5.7, results=hide>>= opar <- par(mfrow=c(1,2),mar=c(3,3,2,1),oma=c(0,0,3,4),xpd=NA) mapgrid.a <- raster(paste(MODELfn.a,"_map_stdev.img",sep="")) mapgrid.b <- raster(paste(MODELfn.b,"_map_stdev.img",sep="")) zlim <- c(0,max(maxValue(mapgrid.a),maxValue(mapgrid.b))) legend.label<-rev(pretty(zlim,n=5)) legend.colors<-stdev.ramp[trunc((legend.label/max(legend.label))*100)+1] legend.label<-paste(legend.label,"%",sep="") image( mapgrid.a, col=stdev.ramp, xlab="",ylab="",xaxt="n",yaxt="n", zlim=zlim, asp=1,bty="n",main="") mtext(response.name.a,side=3,line=1,cex=1.2) image( mapgrid.b, col=stdev.ramp,xlab="",ylab="",xaxt="n",yaxt="n", zlim=zlim, asp=1,bty="n",main="") mtext(response.name.b,side=3,line=1,cex=1.2) legend( x=xmax(mapgrid.b),y=ymax(mapgrid.b), legend=legend.label, fill=legend.colors, bty="n", cex=1.2) mtext("Standard Deviation of Percent Cover",side=3,line=1,cex=1.5,outer=T) par(opar) @ \begin{figure} \begin{center} <<Ex1StdevMapFig,fig=TRUE,echo=FALSE, width=7.5, height=5.7>>= <<Ex1StdevMap>> @ \end{center} \caption{\label{fig:Ex1StdevMap}Example 1 - Map of standard deviation of Random Forest trees at each pixel for Pinyon and Sage (RF models).} \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% <<Ex1 define coefv color sequence, results=hide>>= coefv.ramp <- hcl(h = 70, c = c, l = l) @ <<Ex1CoefvMap,include=TRUE,width=7.5,height=5.7, results=hide>>= opar <- par(mfrow=c(1,2),mar=c(3,3,2,1),oma=c(0,0,3,4),xpd=NA) mapgrid.a <-raster(paste(MODELfn.a,"_map_coefv.img",sep=""),as.image=TRUE) mapgrid.b <- raster(paste(MODELfn.b,"_map_coefv.img",sep=""),as.image=TRUE) zlim <- c(0,max(maxValue(mapgrid.a),maxValue(mapgrid.b))) legend.label<-rev(pretty(zlim,n=5)) legend.colors<-coefv.ramp[trunc((legend.label/max(legend.label))*100)+1] image( mapgrid.a, col=coefv.ramp, xlab="",ylab="",xaxt="n",yaxt="n",zlim=zlim, asp=1,bty="n",main="") mtext(response.name.a,side=3,line=1,cex=1.2) image( mapgrid.b, col=coefv.ramp, xlab="",ylab="",xaxt="n",yaxt="n", zlim=zlim, asp=1,bty="n",main="") mtext(response.name.b,side=3,line=1,cex=1.2) legend( x=xmax(mapgrid.b),y=ymax(mapgrid.b), legend=legend.label, fill=legend.colors, bty="n", cex=1.2) mtext("Coefficient of Variation of Percent Cover",side=3,line=1,cex=1.5,outer=T) par(opar) @ \begin{figure} \begin{center} <<Ex1CoefvMapFig,fig=TRUE,echo=FALSE, width=7.5, height=5.7>>= <<Ex1CoefvMap>> @ \end{center} \caption{\label{fig:Ex1CoefvMap}Example 1 - Map of coefficient of variation of Random Forest trees at each pixel for Pinyon and Sage (RF models).} \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Example 2 - Random Forest - Binary Response} Example 2 builds a binary response model for presence of Pinyon and Sage. A catagorical predictor is added to the model. Out-of-bag estimates are used for model validation. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Set up} Define model type. << Ex2 Define model type, results=hide>>= model.type <- "RF" @ Define data. <<Ex2 Define training and test files, results=hide>>= qdatafn <- "VModelMapData.csv" @ Define \code{folder}. <<Ex2 define folder, results=hide>>= folder <- getwd() @ Define model filenames. << Ex2 Define model filename, results=hide>>= MODELfn.a <- "VModelMapEx2a" MODELfn.b <- "VModelMapEx2b" @ Define the predictors. These are the five continuous predictors from the first example, plus one categorical predictor layer, the thematic layer of predicted land cover classes from the National Land Cover Dataset. The argument \code{predFactor} is used to specify the categorical predictor. << Ex2 Define predictors, results=hide>>= predList <- c( "ELEV250", "NLCD01_250", "EVI2005097", "NDV2005097", "NIR2005097", "RED2005097") predFactor <- c("NLCD01_250") @ Define the data column to use as the response, and if it is continuous, binary or categorical. Since \code{response.type = "binary"} this variable will be automatically translated so that zeros are treated as Absent and any value greater than zero is treated as Present. <<Ex2 Define response, results=hide>>= response.name.a <- "PINYON" response.name.b <- "SAGE" response.type <- "binary" @ Define the seeds for each model. <<Ex2 set seed, results=hide>>= seed.a <- 40 seed.b <- 41 @ Define the column that contains unique identifiers. <<Ex2 Define Identifier, results=hide>>= unique.rowname <- "ID" @ Define raster look up table. <<Ex2 update raster LUT, results=hide>>= rastLUTfn <- "VModelMapData_LUT.csv" rastLUTfn <- read.table( rastLUTfn, header=FALSE, sep=",", stringsAsFactors=FALSE) rastLUTfn[,1] <- paste(folder,rastLUTfn[,1],sep="/") @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Model creation} Create the model. Because Out-Of-Bag predictions will be used for model diagnostics, the full dataset can be used as training data. To do this, set \code{qdata.trainfn <- qdatafn}, \code{qdata.testfn <- FALSE} and \code{v.fold = FALSE}. <<Ex2 Create Model, results=hide>>= model.obj.ex2a <- model.build( model.type=model.type, qdata.trainfn=qdatafn, folder=folder, unique.rowname=unique.rowname, MODELfn=MODELfn.a, predList=predList, predFactor=predFactor, response.name=response.name.a, response.type=response.type, seed=seed.a) model.obj.ex2b <- model.build( model.type=model.type, qdata.trainfn=qdatafn, folder=folder, unique.rowname=unique.rowname, MODELfn=MODELfn.b, predList=predList, predFactor=predFactor, response.name=response.name.b, response.type=response.type, seed=seed.b) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Model Diagnostics} Make Out-Of-Bag model predictions on the training data and run the diagnostics on these predictions. This time, save JPEG, PDF, and PS versions of the diagnostic plots. Out of Bag model predictions for a Random Forest model are not stochastic, so it is not necessary to set the seed. Since this is a binary response model model diagnostics include ROC plots and other threshold selection plots generated by \pkg{PresenceAbsence} \citep{PresenceAbsence, freeman08a} \mbox{(Figure \ref{fig:Ex2aThresh}} and \mbox{Figure \ref{fig:Ex2bThresh})} in addition to the variable importance graph \mbox{(Figure \ref{fig:Ex2aVarImp}} and \mbox{Figure \ref{fig:Ex2bVarImp}).} For binary response models, there are also CSV files of presence-absence thresholds optimized by 12 possible criteria, along with their associated error statistics. For more details on these 12 optimization criteria see \citet{ freeman08a}. Some of these criteria are dependent on user selected parameters. In this example, two of these parameters are specified: required sensitivity (\code{req.sens}) and required specificity (\code{req.spec}). Other user defined parameters, such as False Positive Cost (\code{FPC}) and False Negative Cost (\code{FNC}) are left at the default values. When default values are used for these parameters, \code{model.diagnostics()} will give a warning. In this case: \begin{Schunk} \begin{Soutput} 1: In error.threshold.plot(PRED, opt.methods = optimal.thresholds(), ... : costs assumed to be equal \end{Soutput} \end{Schunk} The variable importance graphs show NLCD was a very important predictor for Pinyon presence, but not an important variable when predicting Sage presence. <<Ex2 Model Diagnostics, results=hide>>= model.pred.ex2a <- model.diagnostics( model.obj=model.obj.ex2a, qdata.trainfn=qdatafn, folder=folder, MODELfn=MODELfn.a, unique.rowname=unique.rowname, # Model Validation Arguments prediction.type="OOB", device.type=c("jpeg","pdf","postscript"), cex=1.2) model.pred.ex2b <- model.diagnostics( model.obj=model.obj.ex2b, qdata.trainfn=qdatafn, folder=folder, MODELfn=MODELfn.b, unique.rowname=unique.rowname, # Model Validation Arguments prediction.type="OOB", device.type=c("jpeg","pdf","postscript"), cex=1.2) @ Take a closer look at the text file of thresholds optimized by multiple criteria. These thresholds are used later to display the mapped predictions, so read this file into \proglang{R} now. <<Ex2 Optimal Threshold Table, results=hide>>= opt.thresh.a <- read.table( paste(MODELfn.a,"_pred_optthresholds.csv",sep=""), header=TRUE, sep=",", stringsAsFactors=FALSE) opt.thresh.a[,-1]<-signif(opt.thresh.a[,-1],2) opt.thresh.b <- read.table( paste(MODELfn.b,"_pred_optthresholds.csv",sep=""), header=TRUE, sep=",", stringsAsFactors=FALSE) opt.thresh.b[,-1]<-signif(opt.thresh.b[,-1],2) pred.prev.a <- read.table( paste(MODELfn.a,"_pred_prevalence.csv",sep=""), header=TRUE, sep=",", stringsAsFactors=FALSE) pred.prev.a[,-1]<-signif(pred.prev.a[,-1],2) pred.prev.b <- read.table( paste(MODELfn.b,"_pred_prevalence.csv",sep=""), header=TRUE, sep=",", stringsAsFactors=FALSE) pred.prev.b[,-1]<-signif(pred.prev.b[,-1],2) @ Optimized thresholds for Pinyon: <<Ex2a Optimal Threshold Table Show>>= opt.thresh.a @ And for Sage: <<Ex2b Optimal Threshold Table Show>>= opt.thresh.b @ Observed and predicted prevalence for Pinyon: <<Ex2a Prevalence Table Show>>= pred.prev.a @ And for Sage: <<Ex2b Prevalence Table Show>>= pred.prev.b @ The model quality graphs show that the model of Pinyon presence is much higher quality than the Sage model. This is illustrated with four plots: a histogram plot, a calibration plot, a ROC plot with it's associated Area Under the Curve (AUC), and an error rate verses threshold plot Pinyon has a double humped histogram plot, with most of the observed presences and absences neatly divided into the two humps. Therefor the optimized threshold values fall between the two humps and neatly divide the data into absences and presences. For Sage, on the other hand, the observed presences and absences are scattered throughout the range of predicted probabilities, and so there is no single threshold that will neatly divide the data into present and absent groups. In this case, the different optimization criteria tend to be widely separated, each representing a different compromise between the error statistics \citep{freeman08b}. Calibration plots provide a goodness-of-fit plot for presence-absence models, as described by \citet{pearce00}, \citet{vaughan05}, and \citet{reineking06}. In a Calibration plot the predicted values are divided into bins, and the observed proportion of each bin is plotted against the predicted value of the bin. For Pinyon, the standard errors for the bins overlap the diagonal, and the bins do not show a bias. For Sage, however, the error bars for the highest and lowest bins do not overlap the diagonal, and there is a bias where low probabilities tend to be over predicted, and high probabilities tend to be under predicted. The ROC plot from a good model will rise steeply to the upper left corner then level off quickly, resulting in an AUC near 1.0. A poor model (i.e. a model that is no better than random assignment) will have a ROC plot lying along the diagonal, with an AUC near 0.5. The Area Under the Curve (AUC) is equivalent to the chance that a randomly chosen plot with an observed value of present will have a predicted probability higher than that of a randomly chosen plot with an observed value of absent. The \pkg{PresenceAbsence} package used to create the model quality graphs for binary response models uses the method from \citet{delong88} to calculate Area Under the Curve (AUC). For these two models, the Area Under the Curve (AUC) for Pinyon is 0.97 and the ROC plot rises steeply, while the AUC for Sage is only 0.70, and the ROC plot is much closer to the diagonal. In the Error Rate verses Threshold plot sensitivity, specificity and Kappa are plotted against all possible values of the threshold \citep{fielding97}. In the graph of Pinyon error rates, sensitivity and specificity cross at a higher value, and also, the error statistics show good values across a broader range of thresholds. The Kappa curve is high and flat topped, indicating that for this model, Kappa will be high across a wide range of thresholds. For Sage, sensitivity and specificity cross at a lower value, and the Kappa curve is so low that it is nearly hidden behind the graph legend. For this model even the most optimal threshold selection will still result in a relatively low Kappa value. \begin{figure} \begin{center} \includegraphics{VModelMapEx2a_pred_thresholdplots} \end{center} \caption{\label{fig:Ex2aThresh}Example 2 - Model quality and threshold selection graphs for Pinyon presence (RF model).} \end{figure} \begin{figure} \begin{center} \includegraphics{VModelMapEx2b_pred_thresholdplots} \end{center} \caption{\label{fig:Ex2bThresh}Example 2 - Model quality and threshold selection graphs for Sage presence (RF model).} \end{figure} \begin{figure} \begin{center} \includegraphics{VModelMapEx2a_pred_importance} \end{center} \caption{\label{fig:Ex2aVarImp} Example 2 - Variable importance graph for Pinyon presence (RF model).} \end{figure} \begin{figure} \begin{center} \includegraphics{VModelMapEx2b_pred_importance} \end{center} \caption{\label{fig:Ex2bVarImp} Example 2 - Variable importance graph for Sage presence (RF model).} \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Comparing Variable Importance} The \code{model.importance.plot()} function uses a back-to-back barplot to compare variable importance between two models built from the same predictor variables \mbox{(Figure \ref{fig:Ex2CompImp})}. <<Ex2 Compare Importance Plots, results=hide>>= model.importance.plot( model.obj.1=model.obj.ex2a, model.obj.2=model.obj.ex2b, model.name.1="Pinyon", model.name.2="Sage", sort.by="predList", predList=predList, scale.by="sum", main="Variable Importance", device.type="pdf", PLOTfn="VModelMapEx2CompareImportance", folder=folder) @ \begin{figure} \begin{center} \includegraphics{VModelMapEx2CompareImportance.pdf} \end{center} \caption{\label{fig:Ex2CompImp}Example 2 - Variable Importances for Pinyon verses Sage presence models.} \end{figure} Because a binary response model is a two-class example of a categorical response model, we can use categorical tools to investigate the class specific variable importances. \mbox{(Figure \ref{fig:Ex2CompImpPA})} compares the relative importance of the predictor variables inr predicting Presences to their importances in predicting Absences. <<Ex2CompImpPA,include=TRUE,include=TRUE,width=7.5,height=7.5, results=hide>>= opar <- par(mfrow=c(2,1),mar=c(3,3,3,3),oma=c(0,0,3,0)) model.importance.plot( model.obj.1=model.obj.ex2a, model.obj.2=model.obj.ex2a, model.name.1="Absence", model.name.2="Presence", class.1="0", class.2="1", sort.by="predList", predList=predList, scale.by="sum", main="Pinyon Variable Importance", device.type="none", cex=0.9) model.importance.plot( model.obj.1=model.obj.ex2b, model.obj.2=model.obj.ex2b, model.name.1="Absence", model.name.2="Presence", class.1="0", class.2="1", sort.by="predList", predList=predList, scale.by="sum", main="Sage Variable Importance", device.type="none", cex=0.9) mtext("Presence-Absence Variable Importance Comparison",side=3,line=0,cex=1.8,outer=TRUE) par(opar) @ \begin{figure} \begin{center} <<Ex2CompImpPAFig,fig=TRUE,results=hide, echo=FALSE, width=7.5,height=7.5>>= <<Ex2CompImpPA>> @ \end{center} \caption{\label{fig:Ex2CompImpPA}Example 2 - Relative variable importances (permutation based) for predicting Presences verses predicting Absences for Pinyon and Sage.} \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Interaction Plots} Here we will look at how the \code{model.interaction.plot()} function works with a factored predictor variable. In image plots the levels of the factored predictor are shown as vertical or horizontal bars across the plot region. In 3-D perspective plots the levels are represented by ribbons across the prediction surface. <<Ex2 Interaction Plots, results=hide>>= model.interaction.plot( model.obj.ex2a, x="ELEV250", y="NLCD01_250", main=response.name.a, plot.type="image", device.type="pdf", MODELfn=MODELfn.a, folder=folder) model.interaction.plot( model.obj.ex2b, x="ELEV250", y="NLCD01_250", main=response.name.b, plot.type="image", device.type="pdf", MODELfn=MODELfn.b, folder=folder) model.interaction.plot( model.obj.ex2a, x="ELEV250", y="NLCD01_250", main=response.name.a, plot.type="persp", device.type="pdf", MODELfn=MODELfn.a, folder=folder, theta=300, phi=55) model.interaction.plot( model.obj.ex2b, x="ELEV250", y="NLCD01_250", main=response.name.b, plot.type="persp", device.type="pdf", MODELfn=MODELfn.b, folder=folder, theta=300, phi=55) @ \begin{figure} \begin{center} \includegraphics{VModelMapEx2a_image_ELEV250_NLCD01_250} \end{center} \caption{\label{fig:Ex2aImageElevNlcd}Example 2 - Interaction plot for Pinyon presence-absence (RF model), showing interactions between elevation and National Land Cover Dataset classes (ELEV250 and NLCD01\_250). Image plot, with darker green indicating higher probability of presence. Here we see that in all NLCD classes predicted Pinyon presence is strongly tied to elevation, with low presence below 2000m, and moderate presence at higher elevations.} \end{figure} \begin{figure} \begin{center} \includegraphics{VModelMapEx2b_image_ELEV250_NLCD01_250} \end{center} \caption{\label{fig:Ex2bImageElevNlcd}Example 2 - Interaction plot for Sage presence-absence (RF model), showing interactions between elevation and National Land Cover Dataset classes (ELEV250 and NLCD01\_250). Image plot, with darker green indicating higher probability of presence. Here we see that predicted Sage presence is influenced by both elevation and NLCD class. In most NLCD classes predicted presence is highest between 1400m and 2000m, with very low presence at lower elevations and low presence at higher elevations. In contrast, in NLCD class 50 while predicted presence drops slightly as elevation increases, it remains quite high all the way to 3000m. } \end{figure} \begin{figure} \begin{center} \includegraphics{VModelMapEx2a_persp_ELEV250_NLCD01_250} \end{center} \caption{\label{fig:Ex2aPerspElevNlcd}Example 2 - Interaction plot for Pinyon presence-absence (RF model), showing interactions between elevation and National Land Cover Dataset classes (ELEV250 and NLCD01\_250). Perspective plot, with probability or presence shown on the Z axis. In the perspective plot (as compared to the image plot) it is easier to see that while predicted Pinyon presence is influenced by both elevation and NLCD class, the shape of the relationship between elevation and presence is similar in all NLCD classes, and while the overall probability is higher in some classes, the curves relating probability to elevation are generally parallel from class to class. Therefore there appears to be little 2-way interaction between these predictors. } \end{figure} \begin{figure} \begin{center} \includegraphics{VModelMapEx2b_persp_ELEV250_NLCD01_250} \end{center} \caption{\label{fig:Ex2bPerspElevNlcd}Example 2 - Interaction plot for Sage presence-absence (RF model), showing interactions between elevation and National Land Cover Dataset classes (ELEV250 and NLCD01\_250). Perspective plot, with probability or presence shown on the Z axis. Here we can see that while all NLCD classes have low predicted Sage presence at low elevation, in NLCD class 50 at mid elevations predicted presence shoots higher than the other classes, and then does not drop as far as the other classes at high elevations. Resulting in a different shape of the elevation verses probability curve for class 50. } \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Map production} The function \code{model.mapmake()} creates ascii text files of map predictions. <<Ex2 Produce Maps, results=hide>>= model.mapmake( model.obj=model.obj.ex2a, folder=folder, MODELfn=MODELfn.a, rastLUTfn=rastLUTfn, na.action="na.omit") model.mapmake( model.obj=model.obj.ex2b, folder=folder, MODELfn=MODELfn.b, rastLUTfn=rastLUTfn, na.action="na.omit") @ When working with categorical predictors, sometimes there are categories in the prediction data (either the test set, or the map data) not found in the training data. In this case, there were three classes for the predictor \code{NLCD01_250} that were not present in the training data. With the default \code{na.action = "na.omit"} the \code{model.mapmake()} function generated the following warnings, and these pixels will show up as blank pixels in the maps. \begin{Schunk} \begin{Soutput} 2: In production.prediction(model.obj = model.obj, rastLUTfn = rastLUTfn, : categorical factored predictor NLCD01_250 contains levels 41, 43, 20 not found in training data 3: In production.prediction(model.obj = model.obj, rastLUTfn = rastLUTfn, : Returning -9999 for data points with levels not found in the training data \end{Soutput} \end{Schunk} Begin by mapping the probability surface, in other words, the probability that the species is present at each grid point \mbox{(Figure \ref{fig:Ex2ProbSurf})}. First Define a color ramp. For this map, pixels with a high probability of presence will display as green, low probability will display as brown, and model uncertainty (probabilities near 50\%) will display as yellow. Notice that the map for Pinyon, is mostly dark green and dark brown, with a thin dividing line of yellow. With a high quality model, most of the pixels are assigned high or low probabilities. The map for Sage, however, is mostly yellow, with only occasional areas of green and brown. With poor quality models, many of the pixels are inderminate, and assigned probabilities near 50\%. <<Ex2 define color sequence, results=hide>>= h=c( seq(10,30,length.out=10), seq(31,40,length.out=10), seq(41,90,length.out=60), seq(91,100,length.out=10), seq(101,110,length.out=10)) l =c( seq(25,40,length.out=10), seq(40,90,length.out=35), seq(90,90,length.out=10), seq(90,40,length.out=35), seq(40,10,length.out=10)) probpres.ramp <- hcl(h = h, c = 80, l = l) @ Import the data and create the map. Since we know that probability of presence can range from zero to one, we will use those values for \code{zlim}. <<Ex2ProbabilitySurface,include=TRUE,width=7.5,height=5.7, results=hide>>= opar <- par(mfrow=c(1,2),mar=c(3,3,2,1),oma=c(0,0,3,4),xpd=NA) mapgrid.a <- raster(paste(MODELfn.a,"_map.img",sep="")) mapgrid.b <- raster(paste(MODELfn.b,"_map.img",sep="")) legend.subset<-c(100,80,60,40,20,1) legend.colors<-probpres.ramp[legend.subset] legend.label<-c("100%"," 80%"," 60%"," 40%"," 20%"," 0%") image( mapgrid.a, col=probpres.ramp, xlab="",ylab="",yaxt="n",main="",zlim=c(0,1), asp=1,bty="n",xaxt="n") mtext(response.name.a,side=3,line=1,cex=1.2) image( mapgrid.b, col=probpres.ramp, xlab="",ylab="",xaxt="n",yaxt="n", zlim=c(0,1), asp=1,bty="n",main="") mtext(response.name.b,side=3,line=1,cex=1.2) legend( x=xmax(mapgrid.b),y=ymax(mapgrid.b), legend=legend.label, fill=legend.colors, bty="n", cex=1.2) mtext("Probability of Presence",side=3,line=1,cex=1.5,outer=T) par(opar) @ \begin{figure} \begin{center} <<Ex2ProbabilitySurfaceFig,fig=TRUE, results=hide, echo=FALSE, width=7.5,height=5.7>>= <<Ex2ProbabilitySurface>> @ \end{center} \caption{\label{fig:Ex2ProbSurf}Example 2 - Probability surface map for presence of Pinyon and Sage (RF models).} \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% To translate the probability surface into a Presence-Absence map it is necessary to select a cutoff threshold. Probabilities below the selected threshold are mapped as absent while probabilities above the threshold are mapped as present. Many criteria that can be used for threshold selection, ranging from the traditional default of 50 percent, to thresholds optimized to maximize Kappa, to thresholds picked to meet certain management criteria. The choice of threshold criteria can have a dramatic effect on the final map. For further discussion on this topic see \citet{freeman08b}. Here are examples of Presence-Absence maps for Pinyon and Sage produced by four different threshold optimization criteria \mbox{(Figures \ref{fig:Ex2aPresMaps}} and \mbox{\ref{fig:Ex2bPresMaps})}. For a high quality model, such as Pinyon, the various threshold optimization criteria tend to result in similar thresholds, and the models tend to be less sensitive to threshold choice, therefore the Presence Absence maps from the four criteria are very similar. Poor quality models, such as this model for Sage, tend to have no single good threshold, as each criteria is represents a different compromise between errors of omission and errors of commission. It is therefore particularly important to carefully match threshold criteria to the intended use of the map. <<Ex2aPresenceMaps,include=TRUE,width=5.4,height=7.5, results=hide>>= opar <- par(mfrow=c(2,2),mar=c(2.5,3,4,1),oma=c(0,0,4,6),xpd=NA) mapgrid <- raster(paste(MODELfn.a,"_map.img",sep="")) criteria <- c("Default","MaxKappa","ReqSens","ReqSpec") criteria.labels<-c("Default","MaxKappa","ReqSens = 0.9","ReqSpec = 0.9") for(i in 1:4){ thresh <- opt.thresh.a$threshold[opt.thresh.a$opt.methods==criteria[i]] presencegrid <- mapgrid v <- getValues(presencegrid) v <- ifelse(v > thresh,1,0) presencegrid <- setValues(presencegrid, v) image( presencegrid, col=c("white","forestgreen"), zlim=c(0,1), asp=1, bty="n", xaxt="n", yaxt="n", main="",xlab="",ylab="") if(i==2){ legend( x=xmax(mapgrid),y=ymax(mapgrid), legend=c("Present","Absent"), fill=c("forestgreen","white"), bty="n", cex=1.2)} mtext(criteria.labels[i],side=3,line=2,cex=1.2) mtext(paste("threshold =",thresh),side=3,line=.5,cex=1) } mtext(MODELfn.a,side=3,line=0,cex=1.2,outer=TRUE) mtext(response.name.a,side=3,line=2,cex=1.5,outer=TRUE) par(opar) @ \begin{figure} \begin{center} <<Ex2aPresenceMapsFig,fig=TRUE,echo=FALSE, width = 5.25 , height = 7.5>>= <<Ex2aPresenceMaps>> @ \end{center} \caption{\label{fig:Ex2aPresMaps}Example 2 - Presence-Absence maps by four different threshold selection criteria for Pinyon (RF model).} \end{figure} <<Ex2bPresenceMaps,include=TRUE,width=5.4,height=7.5, results=hide>>= opar <- par(mfrow=c(2,2),mar=c(2.5,3,4,1),oma=c(0,0,4,6),xpd=NA) mapgrid <- raster(paste(MODELfn.b,"_map.img",sep="")) criteria <- c("Default","MaxKappa","ReqSens","ReqSpec") criteria.labels<-c("Default","MaxKappa","ReqSens = 0.9","ReqSpec = 0.9") for(i in 1:4){ thresh <- opt.thresh.b$threshold[opt.thresh.b$opt.methods==criteria[i]] presencegrid <- mapgrid v <- getValues(presencegrid) v <- ifelse(v > thresh,1,0) presencegrid <- setValues(presencegrid, v) image( presencegrid, col=c("white","forestgreen"), xlab="",ylab="",xaxt="n", yaxt="n", zlim=c(0,1), asp=1,bty="n",main="") if(i==2){ legend( x=xmax(mapgrid),y=ymax(mapgrid), legend=c("Present","Absent"), fill=c("forestgreen","white"), bty="n", cex=1.2)} mtext(criteria.labels[i],side=3,line=2,cex=1.2) mtext(paste("threshold =",thresh),side=3,line=.5,cex=1) } mtext(MODELfn.b,side=3,line=0,cex=1.2,outer=TRUE) mtext(response.name.b,side=3,line=2,cex=1.5,outer=TRUE) par(opar) @ \begin{figure} \begin{center} <<Ex2bPresenceMapsFig,fig=TRUE,echo=FALSE, width = 5.25 , height = 7.5>>= <<Ex2bPresenceMaps>> @ \end{center} \caption{\label{fig:Ex2bPresMaps}Example 2 - Presence-Absence maps by four different threshold selection criteria for Sage (RF model).} \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Example 3 - Random Forest - Categorical Response} Example 3 builds a categorical response model for vegetation category. The response variable consists of four categories: TREE, SHRUB, OTHERVEG, and NONVEG. This model will use the same predictors as Model 2. Out-of-bag estimates are used for model validation. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Set up} Define model type. << Ex3 Define model type, results=hide>>= model.type <- "RF" @ Define data. <<Ex3 Define training and test files, results=hide>>= qdatafn <- "VModelMapData.csv" @ Define \code{folder}. <<Ex3 define folder, results=hide>>= folder <- getwd() @ Define model filenames. << Ex3 Define model filename, results=hide>>= MODELfn <- "VModelMapEx3" @ Define the predictors. These are the five continuous predictors from the first example, plus one categorical predictor layer, the thematic layer of predicted land cover classes from the National Land Cover Dataset. The argument \code{predFactor} is used to specify the categorical predictor. << Ex3 Define predictors, results=hide>>= predList <- c( "ELEV250", "NLCD01_250", "EVI2005097", "NDV2005097", "NIR2005097", "RED2005097") predFactor <- c("NLCD01_250") @ Define the data column to use as the response, and if it is continuous, binary or categorical. <<Ex3 Define response, results=hide>>= response.name <- "VEGCAT" response.type <- "categorical" @ Define the seeds for each model. <<Ex3 set seed, results=hide>>= seed <- 44 @ Define the column that contains unique identifiers. <<Ex3 Define Identifier, results=hide>>= unique.rowname <- "ID" @ Define raster look up table. <<Ex3 update raster LUT, results=hide>>= rastLUTfn <- "VModelMapData_LUT.csv" rastLUTfn <- read.table( rastLUTfn, header=FALSE, sep=",", stringsAsFactors=FALSE) rastLUTfn[,1] <- paste(folder,rastLUTfn[,1],sep="/") @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Model creation} Create the model. Because Out-Of-Bag predictions will be used for model diagnostics, the full dataset can be used as training data. To do this, set \code{qdata.trainfn <- qdatafn}, \code{qdata.testfn <- FALSE} and \code{v.fold = FALSE}. <<Ex3 Create Model, results=hide>>= model.obj.ex3 <- model.build( model.type=model.type, qdata.trainfn=qdatafn, folder=folder, unique.rowname=unique.rowname, MODELfn=MODELfn, predList=predList, predFactor=predFactor, response.name=response.name, response.type=response.type, seed=seed) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Model Diagnostics} Make Out-Of-Bag model predictions on the training data and run the diagnostics on these predictions. Save PDF versions of the diagnostic plots. Out of Bag model predictions for a Random Forest model are not stochastic, so it is not necessary to set the seed. Since this is a categorical response model model diagnostics include a CSV file the observed and predicted values, as well as a CSV file of the confusion matrix and its associated Kappa value and MAUC. <<Ex3 Model Diagnostics, results=hide>>= model.pred.ex3 <- model.diagnostics( model.obj=model.obj.ex3, qdata.trainfn=qdatafn, folder=folder, MODELfn=MODELfn, unique.rowname=unique.rowname, # Model Validation Arguments prediction.type="OOB", device.type="pdf", cex=1.2) @ Take a closer look at the text file output for the confusion matrix. Read this file into \proglang{R} now. <<Ex3 Confusion Matrix Show>>= CMX.CSV <- read.table( paste( MODELfn,"_pred_cmx.csv",sep=""), header=FALSE, sep=",", stringsAsFactors=FALSE) CMX.CSV @ The \pkg{PresenceAbsence} package function \code{Kappa()} is used to calculate Kappa for the confusion matrix. Note that while most of the functions in the \pkg{PresenceAbsence} package are only applicable to binary confusion matrices, the \code{Kappa()} function will work on any size confusion matrix. The \pkg{HandTill2001} package is used to calculate the Multiple class Area under the Curve (MAUC) and described by \cite{hand01}. The text file output of the confusion matrix is designed to be easily interpreted in Excel, but is not very workable for carrying out analysis in R. However, it is relativly easy to calculate the confusion matrix from the CSV file of the observed and predicted values. <<Ex3 Observed and Predicted Show>>= PRED <-read.table( paste( MODELfn,"_pred.csv",sep=""), header=TRUE, sep=",", stringsAsFactors=TRUE) head(PRED) @ For categorical models, this file contains the observed category for each location, the category predicted by majority vote, as well as one column for each category observed in the data, giving the proportion of trees that voted for that category. To calculate the confusion matrix from the file we will use the observed and predicted columns. The \code{read.table()} function will convert columns containing character strings to factors. If the categories had been numerical, the \code{as.factor()} function can be used to convert the columns to factors. Because there may be catergories present in the observed data that are missing from the predictions (and vice versa), to get a symetric confusion matrix it is important to make sure all levels are present in both factors. The following code will work for both numerical and character categories: <<Ex3 Prevalence Table Show>>= # #these lines are needed for numeric categories, redundant for character categories # PRED$pred<-as.factor(PRED$pred) PRED$obs<-as.factor(PRED$obs) # #adjust levels so all values are included in both observed and predicted # LEVELS<-unique(c(levels(PRED$pred),levels(PRED$obs))) PRED$pred<-factor(PRED$pred,levels=LEVELS) PRED$obs<- factor(PRED$obs, levels=LEVELS) # #calculate confusion matrix # CMX<-table( predicted=PRED$pred, observed= PRED$obs) CMX @ To calculate the errors of Omission and Comission: <<Ex 3 Marginals Show>>= CMX.diag <- diag(CMX) CMX.OMISSION <- 1-(CMX.diag/apply(CMX,2,sum)) CMX.COMISSION <- 1-(CMX.diag/apply(CMX,1,sum)) CMX.OMISSION CMX.COMISSION @ To calculate PCC: <<Ex3 PCC Show>>= CMX.PCC <- sum(CMX.diag)/sum(CMX) CMX.PCC @ To calculate Kappa: <<Ex3 Kappa Show>>= CMX.KAPPA <- PresenceAbsence::Kappa(CMX) CMX.KAPPA @ The MAUC is calculated from the category specific predictions (the percent of trees that voted for each category): <<Ex3 MAUC Show>>= VOTE <- HandTill2001::multcap( response = PRED$obs, predicted= as.matrix(PRED[,-c(1,2,3)]) ) MAUC <- HandTill2001::auc(VOTE) MAUC @ Note, both the \pkg{PresenceAbsence} package and the \pkg{HandTill2001} package have functions named \code{auc()}. The \code{::} operator is used to specify that we are calling the \code{auc()} function from the \pkg{HandTill2001} package. As in continuous and binary response models, the \code{model.diagnostics()} function creates a variable importance graph \mbox{(Figure \ref{fig:Ex3VarImp}).} \begin{figure} \begin{center} \includegraphics{VModelMapEx3_pred_importance} \end{center} \caption{\label{fig:Ex3VarImp} Example 3 - Overall variable importance graph for predicting vegetation category.} \end{figure} With categorical response models, the \code{model.diagnostics()} function alse creates category specific variable importance graphs, for example \mbox{(Figure \ref{fig:Ex3VarImpTree}).} \begin{figure} \begin{center} \includegraphics{VModelMapEx3_pred_importance_Category_TREE} \end{center} \caption{\label{fig:Ex3VarImpTree} Example 3 - Category specific variable importance graph for vegetation category "Tree".} \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Comparing Variable Importance} In example 1 the \code{model.importance.plot()} function was used to compare the importance between two continuous Random Forest models, for percent cover of Pinyon and of Sage. Here we will compare the variable importances of the binary models from Example 2 with the categorical model we have just created in example 3 \mbox{(Figure \ref{fig:Ex3Comp}).} We are examining the question ``Are the same predictors important for determining vegetaion category as were important for determining species presence?'' Keep in mind that to use \code{model.importance.plot()} the two models must be built from the same predictor variables. For example, we could not use it to compare the models from Example 1 and Example 3, because in Example 1 \code{"NLCD01_250"} was not included in the predictors. <<Ex3CompImp,include=TRUE,include=TRUE,width=7.5,height=7.5, results=hide>>= opar <- par(mfrow=c(2,1),mar=c(3,3,3,3),oma=c(0,0,3,0)) model.importance.plot( model.obj.1=model.obj.ex2a, model.obj.2=model.obj.ex3, model.name.1="Pinyon", model.name.2="VEGCAT", type.label=FALSE, sort.by="predList", predList=predList, scale.by="sum", main="Pinyon Presence vs VEGCAT", device.type="none", cex=0.9) model.importance.plot( model.obj.1=model.obj.ex2b, model.obj.2=model.obj.ex3, model.name.1="Sage", model.name.2="VEGCAT", type.label=FALSE, sort.by="predList", predList=predList, scale.by="sum", main="Sage Presence vs VEGCAT", device.type="none", cex=0.9) mtext("Variable Importance Comparison",side=3,line=0,cex=1.8,outer=TRUE) par(opar) @ \begin{figure} \begin{center} <<Ex3CompImpFig,fig=TRUE, results=hide, echo=FALSE, width=7.5,height=7.5>>= <<Ex3CompImp>> @ \end{center} \caption{\label{fig:Ex3Comp}Example 3 - Comparison of variable importances between categorical model of vegetation category (VEGCAT) and binary models for Pinyon presence and Sage presence.} \end{figure} With categorical models the \code{model.importance.plot()} function also can be usde to compare the variable importance between categories of the same model. The importance measure used for category specific importance is the relative influence of each variable, calculated by randomly permuting each predictor variable, and looking at the decrease in model accuracy associated with each predictor. <<Ex3CompCatImp,include=TRUE,include=TRUE,width=7.5,height=7.5, results=hide>>= opar <- par(mfrow=c(2,1),mar=c(3,3,3,3),oma=c(0,0,3,0)) model.importance.plot( model.obj.1=model.obj.ex3, model.obj.2=model.obj.ex3, model.name.1="SHRUB", model.name.2="TREE", class.1="SHRUB", class.2="TREE", sort.by="predList", predList=predList, scale.by="sum", main="VEGCAT - SHRUB vs. TREE", device.type="none", cex=0.9) model.importance.plot( model.obj.1=model.obj.ex3, model.obj.2=model.obj.ex3, model.name.1="OTHERVEG", model.name.2="NONVEG", class.1="OTHERVEG", class.2="NONVEG", sort.by="predList", predList=predList, scale.by="sum", main="VEGCAT - OTHERVEG vs. NONVEG", device.type="none", cex=0.9) mtext("Category Specific Variable Importance",side=3,line=0,cex=1.8,outer=TRUE) par(opar) @ \begin{figure} \begin{center} <<Ex3CompCatImpFig,fig=TRUE,results=hide,echo=FALSE, width=7.5,height=7.5>>= <<Ex3CompCatImp>> @ \end{center} \caption{\label{fig:Ex3CompCat}Example 3 - Category specific comparison of variable importances for the model of vegetation category (VEGCAT). National Land Cover DATA (NLCD) and elevation (ELEV) are the most important predictor variables for both TREE and SHRUB categories, though the relative importance of the remote sensing bands differs between these two categories. ELEV is relativly less important for the NONVEG and OTHERVEG categories, while ELEV is important for classifying OTHERVEG but relativly unimportant for the classifying NONVEG. In other words, if the model lost the information contained in NLCD and ELEV, the predictions for TREE and SHRUB categories would suffer, but there would be less of an effect on the prediction accuracy for NONVEG. The predictions for OTHERVEG would suffer if NLCD were removed from the model, but would lose relativly little accuracy if ELEV were removed.} \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Interaction Plots} We will look at how the \code{model.interaction.plot()} function behaves with a categorical response variable. With categorical models, interactions can affect one prediction category, without influencing other categories. For example, if modelling disturbance type, it is possible that landslides might be influenced by an interaction between soil type and slope, while fires might be influenced by both variables individually, but without any interaction. Therefore when calling \code{model.interaction.plot()} it is neccessary to specify a particular category. The function then will graph how the probability of that category varies as a function of the two specified predictor variables. Here we look at the interaction between elevation and land cover class for two of our response categories \mbox{(Figure \ref{fig:Ex3SHRUBElevNlcd}, Figure \ref{fig:Ex3NONVEGElevNlcd} ).} <<Ex3 Interaction Plots, results=hide>>= model.interaction.plot( model.obj.ex3, x="ELEV250", y="NLCD01_250", main=response.name, plot.type="image", device.type="pdf", MODELfn=MODELfn, folder=folder, response.category="SHRUB") model.interaction.plot( model.obj.ex3, x="ELEV250", y="NLCD01_250", main=response.name, plot.type="image", device.type="pdf", MODELfn=MODELfn, folder=folder, response.category="NONVEG") @ \begin{figure} \begin{center} \includegraphics{VModelMapEx3_SHRUB_image_ELEV250_NLCD01_250} \end{center} \caption{\label{fig:Ex3SHRUBElevNlcd}Example 3 - Interaction plot for SHRUB vegetation category (RF model), showing interactions between elevation and National Land Cover Dataset classes (ELEV250 and NLCD01\_250). Image plot, with darker green indicating higher probability of being assigned to the specified category. Here we see the direct effect of NLCD class: SHRUB has a low probability of being assigned to landcover class 42. We also see a direct effect of elevation, with SHRUB having a slightly higher probability of being assigned to middle elevations. There is little evidence of interaction between these two predictors, since relationship of probability to elevation is similar for all land cover classes.} \end{figure} \begin{figure} \begin{center} \includegraphics{VModelMapEx3_NONVEG_image_ELEV250_NLCD01_250} \end{center} \caption{\label{fig:Ex3NONVEGElevNlcd}Example 3 - Interaction plot for NONVEG vegetation category (RF model), showing interactions between elevation and National Land Cover Dataset classes (ELEV250 and NLCD01\_250). Image plot, with darker green indicating higher probability of being assigned to the specified category. NONVEG has a much higher chance of being assigned than SHRUB in all combinations of the two predictor variables, reflecting its higher prevalence in the training data (56\% as opposed to 19\%). NONVEG does show some interaction between landcover class and elevation. In all land cover classes NONVEG has a higher chance of being predicted at low elevations, but while in most land cover classes the probability goes down at elevations above 1400m, in land cover class 42 NONVEG matains a high chance of being predicted to over 2000m. } \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Map production} The function \code{model.mapmake()} creates an ascii text files and an imagine image file of predictions for each map pixel. <<Ex3 Produce Maps, results=hide>>= model.mapmake( model.obj=model.obj.ex3, folder=folder, MODELfn=MODELfn, rastLUTfn=rastLUTfn, na.action="na.omit") @ With categorical models, the \code{model.mapmake()} function outputs a map file, using integer codes for each category, along with a table relating these codes to the original categories. In this example \code{na.action} was set to \code{"na.omit"}, therefore pixels with factored predictors with values not found in the training data will be omited from the map. Take a look at the codes: <<Ex3 Code key Show>>= MAP.CODES<-read.table( paste(MODELfn,"_map_key.csv",sep=""), header=TRUE, sep=",", stringsAsFactors=FALSE) MAP.CODES @ Column one gives the row number for each code. Column two gives the category names. Column three gives the integer codes used to represent each category in the map output. In this example the categories in the training data are character strings, and \code{model.mapmake()} assigned the integers 1 through the Number of categories. If the training data categories were already numeric codes, for example, \code{c(30,42,50,80)}, then \code{model.mapmake()} would keep the original values in the map output, and columns two and three would contain the same values. Next we define a color for each category. The \code{colors()} function will generate a list of the possible color names. Some are quite poetical. <<Ex3 define color sequence, results=hide>>= MAP.CODES$colors<-c("bisque3","springgreen2","paleturquoise1","green4") MAP.CODES @ Import the map output and transform the values of the map output to intergers from 1 to n (the number of map categories). Note that here in example 3, where the categorical responses were character strings, the \code{model.mapmake()} function generated integer codes from 1 to n for the raster output. So for this example this step is not actually neccessary. However, if the categories in the training data had been unevenly spaced numeric codes, for example, \code{c(30,42,50,80)}, then \code{model.mapmake()} would keep these original numeric codes in the raster output. In such cases, to produce a map in R with the \code{image()} function (as opposed to viewing the image in a GIS environment) creating a new raster where the numeric codes are replaced with the numbers 1 to n makes assigning specific colors to each category simpler. <<Ex3 Import data, results=hide>>= mapgrid <- raster(paste(MODELfn,"_map.img",sep="")) integergrid <- mapgrid v <- getValues(mapgrid) v <- MAP.CODES$row[match(v,MAP.CODES$integercode)] integergrid <- setValues(integergrid, v) @ Produce the map \mbox{(Figure \ref{fig:Ex3Map})}. <<Ex3ProductionMaps,include=TRUE,width=7.5,height=10, results=hide>>= opar <- par(mfrow=c(1,1),mar=c(3,3,2,1),oma=c(0,0,3,8),xpd=NA) image( integergrid, col = MAP.CODES$colors, xlab="",ylab="",xaxt="n",yaxt="n", zlim=c(1,nrow(MAP.CODES)), main="",asp=1,bty="n") mtext(response.name,side=3,line=1,cex=1.2) legend( x=xmax(mapgrid),y=ymax(mapgrid), legend=MAP.CODES$category, fill=MAP.CODES$colors, bty="n", cex=1.2) par(opar) @ \begin{figure} \begin{center} <<Ex3ProductionMapsFig,fig=TRUE,echo=FALSE, width=7.5,height=10>>= <<Ex3ProductionMaps>> @ \end{center} \caption{\label{fig:Ex3Map}Example 3 - Map of predicted vegetation category (RF model). The white pixels found just below the center of this map idicate pixels with factored predictor variables that have values not found in the training data.} \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Conclusion} In summary, the \pkg{ModelMap} software package for \proglang{R} creates sophisticated models from training data and validates the models with an independent test set, cross-validation, or in the case of Random Forest Models, with out-of-bag (OOB) predictions on the training data. It creates graphs and tables of the model diagnostics. It applies these models to GIS image files of predictors to create detailed prediction surfaces. It will handle large predictor files for map making, by reading in the GIS data in sections, and output the prediction for each of these sections, before reading the next section. \clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \appendix \appendixpage \addappheadtotoc %\section{Argument List} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{table}[!h] \begin{center} \begin{tabularx}{\textwidth}{ |l|X| } \hline \multicolumn{2}{|c|}{Arguments for \code{model.build()}}\\ \hline \code{model.type} & Model type: \code{"RF"} or \code{"QRF"} or \code{"CF"}.\\ \code{qdata.trainfn} & Filename of the training data file for building model.\\ \code{folder} & Folder for all output.\\ \code{MODELfn} & Filename to save model object.\\ \code{predList} & Predictor short names used to build the model. \\ \code{predFactor} & Predictors from \code{predList} that are factors (i.e categorical).\\ \code{response.name} & Response variable used to build the model.\\ \code{response.type} & Response type: \code{"binary"} or \code{"continuous"}.\\ \code{unique.rowname} & Unique identifier for each row in the training data.\\ \code{seed} & Seed to initialize randomization to build stochastic models.\\ \code{na.action} & Specifies the action to take if there are \code{NA} values in the prediction data\\ \code{keep.data} & Should a copy of the predictor data be included in the model object. Useful if \code{model.interaction.plot} will be used later.\\ \hline \multicolumn{2}{|c|}{Random Forest Models:} \\ \hline \code{ntree} & Number of random forest trees.\\ \code{mtry} & Number of variables to try at each node of Random Forest trees.\\ \code{replace} & Should sampling be done with or without replacement.\\ \code{strata} & A (factor) variable that is used for stratified sampling.\\ \code{sampsize} & For classification, if strata provided, sampling is stratified by strata. For binary response models, if argument strata is not provided then sampling is stratified by presence/absence.\\ \hline \end{tabularx} \end{center} %\caption{\label{tab:oneA}Model Creation Arguments} \end{table} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{table} \begin{center} \begin{tabularx}{\textwidth}{ |l|X| } \hline \multicolumn{2}{|c|}{Arguments for \code{model.diagnostics}} \\ \hline \code{model.obj} & The model object to use for prediction, if the model has been previously created.\\ \code{qdata.trainfn} & Filename of the training data file for building model.\\ \code{qdata.testfn} & Filename of independent data set for testing (validating) model.\\ \code{folder} & Folder for all output.\\ \code{MODELfn} & Filename to save model object.\\ \code{response.name} & Response variable used to build the model.\\ \code{unique.rowname} & Name of column in training and test that uniquely identifies each row .\\ \code{diagnostic.flag} & Name of column in training that indicates subset of data to use for diagnostics.\\ \code{seed} & Seed to initialize randomization to build stochastic models.\\ \code{prediction.type}& Type of prediction to use for model validation: \code{"TEST"}, \code{"CV"}, \code{"OOB"} or \code{"TRAIN"}\\ \code{MODELpredfn} & Filename for output of validation prediction \code{*.csv} file.\\ \code{na.action} & Specifies the action to take if there are \code{NA} values in the prediction data or if there is a level or class of a categorical predictor variable in the validation test set or the mapping data set, but not in the training data set.\\ \code{v.fold} & The number of cross-validation folds.\\ \code{device.type} & Vector of one or more device types for graphical output: \code{"default"}, \code{"jpeg"}, \code{"pdf"}, \code{"postscript"}, \code{"win.metafile"}. \code{"default"} refers to the default graphics device for your computer\\ \code{DIAGNOSTICfn} & Filename for output files from model validation diagnostics.\\ \code{jpeg.res} & Pixels per inch for jpeg output.\\ \code{device.width} & Device width for diagnostic plots in inches.\\ \code{device.height}& Device height for diagnostic plots in inches.\\ \code{cex} & Cex for diagnostic plots.\\ \code{req.sens} & Required sensitivity for threshold optimization for binary response model.\\ \code{req.spec} & Required specificity for threshold optimization for binary response model.\\ \code{FPC} & False Positive Cost for threshold optimization for binary response model.\\ \code{FNC} & False Negative Cost for threshold optimization for binary response model.\\ \hline \end{tabularx} \end{center} %\caption{\label{tab:twoA}Model Validation Arguments} \end{table} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{table} \begin{center} \begin{tabularx}{\textwidth}{ |l|X| } \hline \multicolumn{2}{|c|}{Arguments for \code{model.mapmake()}} \\ \hline \code{model.obj} & The model object to use for prediction, if the model has been previously created.\\ \code{folder} & Folder for all output.\\ \code{MODELfn} & Filename to save model object.\\ \code{rastLUTfn} & Filename of \code{.csv} file for a raster look up table.\\ \code{na.action} & Specifies the action to take if there are \code{NA} values in the prediction data or if there is a level or class of a categorical predictor variable in the validation test set or the mapping data set, but not in the training data set.\\ \code{keep.predictor.brick} & If \code{TRUE} then the raster brick containing the predictors from the model object is saved as a native raster package format file.\\ \code{map.sd} & Should maps of mean, standard deviation, and coefficient of variation of the predictions be produced: \code{TRUE} or \code{FALSE}. Only used if \code{response.type = "continuous"}.\\ \code{OUTPUTfn} & Filename for output raster file for map production. If \code{NULL}, \mbox{\code{"\textit{modelfn}\_map.txt"}}.\\ \hline \end{tabularx} \end{center} %\caption{\label{tab:fiveA}Map Production} \end{table} \clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% <<Remove Rplots pdf, results=hide, echo=FALSE>>= dev.off() file.remove("Rplots.pdf") file.remove("Vplots.pdf") @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %\bibliographystyle{V} \bibliography{VModelMap} \end{document}