---
title: "Example metadata for a Lotka-Volterra population growth forecast"
output: rmarkdown::html_document
vignette: >
  %\VignetteIndexEntry{Example metadata for a Lotka-Volterra population growth forecast}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
#### Code {.tabset}
##### R
```{r setup, message=FALSE}
library(EML)
library(ncdf4)
library(emld)
library(lubridate)
library(tibble)
library(dplyr)
library(tidyr)
library(reshape2)
emld::eml_version("eml-2.2.0")
set.seed(42)
```
# Standard Option 3: summary CSV
Convert the forecast output to a flat-file format (CSV) in which summary statistics of the forecast distribution are saved.
#### Code {.tabset}
##### R
```{r}
dfs <- read.csv("terrestrial-2021-01-01-ISWG.csv")
## calculate summary statistics across ensemble members and output variables
# dfs <- df_combined %>%
#   pivot_longer(cols = 5:6, names_to = "vars") %>%  ## make species a column
#   group_by(time, depth, obs_flag, species, forecast, data_assimilation) %>%
#   summarize(mean = mean(value),
#             sd = sd(value),
#             Conf_interv_02.5 = quantile(value, 0.025),
#             Conf_interv_97.5 = quantile(value, 0.975)) %>%
#   pivot_longer(7:10, names_to = "statistic") %>%   ## make a statistic column
#   pivot_wider(names_from = species, values_from = value) ## back to species-wide
## recode statistics: standard errors and prediction intervals where observed
# dfs$statistic[dfs$obs_flag == 1 & dfs$statistic == "sd"] <- "se"
# dfs$statistic[dfs$obs_flag == 2 & dfs$statistic == "Conf_interv_02.5"] <- "Pred_interv_02.5"
# dfs$statistic[dfs$obs_flag == 2 & dfs$statistic == "Conf_interv_97.5"] <- "Pred_interv_97.5"
# head(dfs)
# write.csv(dfs, file = "logistic-forecast-summary-multi-variable-multi-depth.csv")
```
#### {-}
# Standardized Metadata
### Forecast output entity: dataTable
Let's begin by documenting the metadata of the forecast output data table itself, which we'll do using EML's `dataTable` entity. In this example we're working with format 3 (the summary CSV), though most of the metadata would be identical for all three formats.
To start, the `attributes` table stores the basic metadata about the variable names and units.
#### Code {.tabset}
##### R
```{r}
## define variable names, units, etc.
## in practice, this might be kept in a spreadsheet
attributes <- tibble::tribble(
  ~attributeName,      ~attributeDefinition,                         ~unit,           ~formatString, ~numberType, ~definition,
  "time",              "[dimension]{time}",                          "year",          "YYYY-MM-DD",  NA,          NA,
  "statistic",         "[dimension]{type of statistic}",             "dimensionless", NA,            "character", "Type of statistic",
  "siteID",            "[dimension]{neon site}",                     "dimensionless", NA,            "character", "The 4-digit code for NEON site ID",
  "obs_flag",          "[dimension]{observation error}",             "dimensionless", NA,            "integer",   NA,
  "nee",               "[variable]{net ecosystem exchange}",         "dimensionless", NA,            "real",      NA,
  "le",                "[variable]{latent heat}",                    "dimensionless", NA,            "real",      NA,
  "vswc",              "[variable]{volumetric soil water content}",  "dimensionless", NA,            "real",      NA,
  "forecast",          "[flag]{whether time step was forecast}",     "dimensionless", NA,            "integer",   NA,
  "data_assimilation", "[flag]{whether time step assimilated data}", "dimensionless", NA,            "integer",   NA
)
## note: EML uses a different unit standard than UDUNITS. For now use EML. EFI needs to provide a custom unitList.
attributes
attrList <- set_attributes(attributes,
                           col_classes = c("Date", "character", "character", "numeric", "numeric",
                                           "numeric", "numeric", "numeric", "numeric"))
```
#### {-}
Note that the EFI standard extends the EML `attributeDefinition` to include a `[variable_type]` and a `{variable_definition}`.
The options for [variable_type] include:
* dimension
* variable = output variable
* diagnostic = variable output purely for diagnostic purposes
* observation = data that is or could be compared to an output_variable
* obs_error
* flag
* initial_condition
* driver
* parameter
* random_effect
* process_error
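As a quick illustration of this convention, the extended definition string can be split back into its two parts with a regular expression. The helper below is hypothetical (it is not part of the EFI tools), included only to make the `[variable_type]{variable_definition}` structure concrete:

```r
## Hypothetical helper: split an extended attributeDefinition of the form
## "[variable_type]{variable_definition}" into its two components.
parse_attribute_definition <- function(x) {
  m <- regmatches(x, regexec("^\\[([^]]+)\\]\\{([^}]+)\\}$", x))[[1]]
  list(variable_type = m[2], variable_definition = m[3])
}

parse_attribute_definition("[variable]{net ecosystem exchange}")
## variable_type: "variable"; variable_definition: "net ecosystem exchange"
```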
We then link the attribute information to the information about a particular file.
#### Code {.tabset}
##### R
```{r}
## sets metadata about the file itself (name, file type, size, MD5, etc.)
physical <- set_physical("terrestrial-2021-01-01-ISWG.csv",
                         recordDelimiter = '\n')
## set metadata for the file as a whole
dataTable <- eml$dataTable(
  entityName = "forecast",  ## standard name, used to distinguish this entity from others
  entityDescription = "Forecast of NEE and LE for four NEON sites",
  physical = physical,
  attributeList = attrList)
```
#### {-}
Here `entityName = "forecast"` is a standard name within the EFI standard to allow us to distinguish this entity from metadata about parameters, drivers, etc.
There's a lot more optional terminology that could be exploited here -- for instance, the specification lets us define different missing value codes (and explanations) for each column, and allows us to indicate `precision`, `minimum` and `maximum`.
Note that the `physical` type can document almost any format as well, including NetCDF. A NetCDF file would still document the variables measured in much the same way, regardless of the underlying representation.
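For example, per-column missing-value codes can be carried in the same attributes table. This is a sketch only: the column names below follow the `EML::set_attributes()` interface, but the codes and explanations are made up for illustration:

```r
## Sketch: optional per-attribute columns understood by set_attributes().
## The missing-value codes below are illustrative, not from the actual file.
attributes_extra <- data.frame(
  attributeName               = c("nee", "le"),
  missingValueCode            = c("NA", "NA"),
  missingValueCodeExplanation = c("value could not be computed",
                                  "value could not be computed"),
  stringsAsFactors            = FALSE
)
## these columns would be joined onto `attributes` before calling
## set_attributes(), e.g. dplyr::left_join(attributes, attributes_extra)
```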
### Provenance
Now that we've documented the actual data.frame itself, we can add additional metadata to the record describing our forecast, which is essential for citing, discovering, and interpreting the result. We start with some authorship information.
#### Code {.tabset}
##### R
```{r}
me <- list(list(individualName = list(givenName = "Alex",
surName = "Young")),
list(individualName = list(givenName = "Kathryn",
surName = "Fuller",
id = "https://orcid.org/0000-0001-9100-7635")),
list(individualName = list(givenName = "Elisa",
surName = "Stefaniak",
id = "https://orcid.org/0000-0003-2998-5619")),
list(individualName = list(givenName = "Jenna",
surName = "Zukswert",
id = "https://orcid.org/0000-0003-2246-0406")),
list(individualName = list(givenName = "Laura",
surName = "Super")))
```
#### {-}
### Coverage
Set the temporal coverage. (The commented-out taxonomic coverage below is left over from the original example, which modeled population densities of the cyanobacteria *Gloeotrichia echinulata* and *Anabaena circinalis* in Lake Sunapee, NH, USA; this forecast has no taxonomic coverage.)
#### Code {.tabset}
##### R
```{r}
# taxa <- tibble::tribble(
#   ~Genus,         ~Species,
#   "Gloeotrichia", "echinulata",
#   "Anabaena",     "circinalis")
coverage <- set_coverage(begin = first(dfs$time),
                         end   = last(dfs$time))
coverage <- list(temporalCoverage = coverage$temporalCoverage)
```
#### {-}
### Keywords
Set keywords. We will need to develop an EFI controlled vocabulary.
#### Code {.tabset}
##### R
```{r}
keywordSet <- list(
list(
keywordThesaurus = "EFI controlled vocabulary",
keyword = list("forecast",
"timeseries")
))
```
#### {-}
### Abstract
Our dataset needs an abstract describing what this is all about.
#### Code {.tabset}
##### R
```{r}
abstract_file <- system.file("extdata", "abstract.md", package = "EFIstandards", mustWork = TRUE)
## EML 2.2 supports markdown-formatted abstracts
abstract_text <- list(markdown = paste(readLines(abstract_file), collapse = "\n"))
```
#### {-}
### Dataset
Next, we'll combine these various bits to document the output dataset as a whole
#### Code {.tabset}
##### R
```{r}
dataset <- eml$dataset(
  title = "Summarized historical data as a forecast",
  creator = me,
  contact = me[[2]],  ## contact must have an id
  pubDate = "2021-02-02",
  intellectualRights = "http://www.lternet.edu/data/netpolicy.html.",
  abstract = abstract_text,
  dataTable = dataTable,
  keywordSet = keywordSet,
  coverage = coverage)
```
#### {-}
## Forecast-specific additionalMetadata
The EFI standard uses EML's `additionalMetadata` element. Section 2.1 of the EFI standard describes both the basic elements (2.1.1) and the information about model structure and uncertainty (2.1.2).
#### Code {.tabset}
##### R
```{r}
additionalMetadata <- eml$additionalMetadata(
  metadata = list(
    forecast = list(
      ## Basic elements
      timestep = "1 day",  ## should be udunits-parsable; already in coverage -> temporalCoverage?
      forecast_horizon = "35 days",
      forecast_issue_time = "2021-02-02",
      forecast_iteration_id = "80808080",
      forecast_project_id = "ISWG",
      metadata_standard_version = "0.3",
      model_description = list(
        forecast_model_id = "80808080",
        name = "historical data summary as forecast",
        type = "statistical",
        repository = ""
      ),
      ## MODEL STRUCTURE & UNCERTAINTY CLASSES
      initial_conditions = list(
        ## possible status values: absent, present, data_driven, propagates, assimilates
        status = "absent"
      ),
      drivers = list(status = "absent"),
      parameters = list(status = "absent"),
      random_effects = list(status = "absent"),
      process_error = list(
        status = "propagates",
        propagates = list(
          type = "analytic"
        ),
        complexity = 1,
        covariance = FALSE
      ),
      obs_error = list(status = "absent")
    )  # forecast
  )    # metadata
)      # eml$additionalMetadata
```
#### {-}
Within the `additionalMetadata`, we anticipate the model structure info will be the most challenging to document.
All six model structural elements / uncertainty classes need to be documented, even if just to record that a specific element is absent in the model. The six classes are:
Class | Description
------------------ | -------------------------------------------------------
initial_conditions | Uncertainty in the initialization of state variables (Y)
drivers | Uncertainty in model drivers, covariates, and exogenous scenarios (X)
parameters | Uncertainty in model parameters ($\theta$)
random_effects | Unexplained variability and heterogeneity in model parameters ($\alpha$)
obs_error | Uncertainty in the observations of the output variables (g)
process_error | Dynamic uncertainty in the process model ($\epsilon$)
Each class has a very similar structure, and the only required element in each is `status`, which can take one of the following valid values:
status | Description
----------- | --------------------------------------------------------------
absent | This model does not contain this concept (e.g. a linear regression model doesn't have initial conditions, random effects, or process_error).
present | The model contains this concept (e.g. the model has parameters), but the values used are not derived from data and no uncertainty is represented
data_driven | The model contains this concept and the inputs are data driven but uncertainty in this input is not explicitly propagated into predictions.
propagates | The model propagates uncertainty about this term into forecasts
assimilates | The model iteratively updates this term through data assimilation.
Except for `absent`, this list is ordinal (e.g., `data_driven` implies `present`).
Next, the `<complexity>` tag is recommended but not required; it should list the size/dimension of each uncertainty class at a single location (per the dimensions defined above). For example, the logistic model has two initial conditions (n[1], n[2]), six parameters (r, K, and alpha for each species), and both process and observation error on the two state variables. `absent` statuses do not need to be documented as 0, as this is implied.
The `<covariance>` tag states whether the errors/uncertainties within an uncertainty class are being treated as independent or not.
Finally, the `<propagation>` and `<assimilation>` tags provide additional information about the methods used (see EFI standard doc).
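To make this concrete, here is a sketch of what a richer uncertainty class might look like, using the same list structure as the earlier chunk. All values are invented for illustration (they do not describe this forecast): parameter uncertainty propagated through an ensemble, with the `propagates` sub-list carrying the method details:

```r
## Illustrative only: a parameters class whose uncertainty is propagated
## through a 256-member ensemble (all numbers are made up).
parameters <- list(
  status = "propagates",   ## ordinal: implies present and data_driven
  propagates = list(
    type = "ensemble",     ## "ensemble" or "analytic"
    size = 256             ## number of ensemble members
  ),
  complexity = 6,          ## e.g. r, K, and alpha for each of two species
  covariance = TRUE        ## parameter uncertainties treated as correlated
)
```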
## Assembling and validating the metadata
All we need now is to combine the dataset metadata with the forecast `additionalMetadata`.
#### Code {.tabset}
##### R
```{r}
my_eml <- eml$eml(dataset = dataset,
                  additionalMetadata = additionalMetadata,
                  packageId = "80808080",
                  system = "datetime")  ## system used to generate packageId
```
#### {-}
Once we have finished building our EML metadata, we can confirm it is valid. This will catch any missing elements. (Recall that what is 'required' depends on what you include -- for example, you don't have to document a `dataTable` at all, but if you do, you have to document the "physical" file format it is in (e.g. `csv`) and the attributes and units it uses!)
#### Code {.tabset}
##### R
```{r}
## check base EML
eml_validate(my_eml)
## check that the EML is also a valid EFI forecast
EFIstandards::forecast_validator(my_eml)
```
#### {-}
We are now ready to write out a valid EML document:
#### Code {.tabset}
##### R
```{r}
write_eml(my_eml, "forecast-eml.xml")
```
#### {-}
At this point, we could easily upload this metadata along with the data itself
to a repository (e.g. DataONE) manually or via an API.
We can also generate a JSON-LD version of EML:
#### Code {.tabset}
##### R
```{r}
emld::as_json(as_emld("forecast-eml.xml"), file = "forecast-eml.json")
```
```{r eval=FALSE, echo=FALSE}
# NOTE: the output files in this vignette should be stored in the package.
# To update them, change this chunk to eval=TRUE.
devtools::load_all()
emlname <- system.file("extdata", "forecast-eml.xml", package="EFIstandards")
write_eml(my_eml, emlname)
emldname <- system.file("extdata", "forecast-eml.json", package="EFIstandards")
emld::as_json(as_emld(emlname), file = emldname)
ncpkgfile <- system.file("extdata", ncfname, package="EFIstandards")
# TODO -- add writing the nc file here too
file.copy(ncfname, ncpkgfile, overwrite=TRUE)
```
```{r include=FALSE}
## Cleanup
lapply(list.files(pattern = "[.]nc"), unlink)
lapply(list.files(pattern = "[.]csv"), unlink)
lapply(list.files(pattern = "[.]json"), unlink)
lapply(list.files(pattern = "[.]xml"), unlink)
```
#### {-}