Code
Contributors: An T. Nguyen, Tim Whiteaker
Introduction
This document describes best practices for archiving software, code, or scripts, such as a simulation model, data visualization package, or data manipulation scripts. The intent of these recommendations is to make research based on modeling or software more transparent rather than achieve exact reproducibility, i.e., provide sufficient documentation so that a knowledgeable person can understand algorithms, programming decisions, and their ramifications for the results, rather than run the model and obtain the same results.
Examples of candidate archives for code include CoMSES Net, which focuses on sharing models related to social and ecological sciences, and Zenodo, a popular all-purpose, DOI-minting repository that can conveniently archive a specific version of code in a GitHub repository. Alternatively, code may be archived in the EDI repository, either by itself or as part of a data package. The best practices in this document cover both archiving code in EDI and referencing code archived elsewhere.
While metadata for software may be described in detail using the EML <software> tree, there exists a project called CodeMeta which is specifically designed for software metadata. Therefore, one of the key recommendations in this document is to include a CodeMeta file when archiving software or code in EDI.
Recommendations for data packages
Considerations for archiving software or code
- If it is a model and/or a model-based dataset, please see the best practices for archiving model-based datasets.
- How likely is it that the code will be well maintained into the future? For example, code packages submitted to established code repositories may stay there only while they comply with all testing requirements and may be removed if not well maintained (e.g., the R package repository CRAN). If that commitment to code maintenance is unlikely, such a package should be archived in a repository without maintenance requirements.
- Should the code be archived as a separate package or with the data?
- If the code is used to generate several independent datasets, it should be archived as a separate package.
- Other reasons to separate code and data packages include the software authors wishing to place the code under a different license from that of the associated data, or to obtain a DOI for the code alone.
- If deciding to package code separately, it may be archived on EDI or another repository. If archiving code outside of EDI, see Linking code and data for instructions on how to reference that code from related data packages in EDI.
- In most other cases, it is recommended to archive code and data together for context.
- Large community software packages are usually maintained and available elsewhere. However, they may undergo significant updates and it may make sense to archive the code of a certain version with the data for transparency reasons. Consider whether prior versions of a software package are available wherever that software is distributed.
- When choosing a repository for the code, consider the ease of the archiving process and how well the code can be described. For example, Zenodo offers an easy pathway to archive code that is currently in GitHub, but its metadata requirements are very light. Archiving with EDI following the best practices described herein means creating a CodeMeta file in addition to EML. This takes more effort than archiving with Zenodo, but the code ends up better described, and in a machine-readable way.
Documenting software/code
When describing the code with EML, include the code as an otherEntity in a data package. A well-documented, human-readable text format of the code is preferred, but a zip archive may be used when there are multiple scripts and/or the directory structure is important.
If archiving individual files, then for the <formatName> and <entityType> elements in EML, see recommendations in the Data entities chapter of the EML Best Practices. Some format names are included in examples below. Always check for the most up-to-date version of these names.
Example 1: EML <otherEntity> snippet for a script file.
<otherEntity>
  <entityName>R script to process CTD data</entityName>
  <entityDescription>Annotated RMarkdown script to process, calibrate, and flag raw CTD data.</entityDescription>
  <physical>
    <objectName>BLE_LTER_CTD_QAQC.Rmd</objectName>
    <size unit="byte">9674</size>
    <authentication method="MD5">8547b7a63fcf6c1f0913a5bd7549d9d1</authentication>
    <dataFormat>
      <externallyDefinedFormat>
        <formatName>text/markdown</formatName>
      </externallyDefinedFormat>
    </dataFormat>
  </physical>
  <entityType>R Markdown file</entityType>
</otherEntity>
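The <size> and <authentication> values in an entity's <physical> element can be computed rather than typed by hand. The following is a minimal Python sketch, assuming the file is small enough to read into memory. The function names are hypothetical and are not part of any EDI tooling.

```python
import hashlib
from pathlib import Path


def physical_metadata(data):
    """Return the byte count and MD5 hex digest for a bytes object."""
    return {
        "size_bytes": len(data),                 # value for <size unit="byte">
        "md5": hashlib.md5(data).hexdigest(),    # value for <authentication method="MD5">
    }


def physical_metadata_for_file(path):
    """Read a file from disk and describe it (hypothetical helper)."""
    return physical_metadata(Path(path).read_bytes())


# Demonstration on an in-memory byte string rather than a real script file.
example = physical_metadata(b"hello")
```

For a real entity, call physical_metadata_for_file() on the script itself (for example, BLE_LTER_CTD_QAQC.Rmd) and copy the two values into the EML.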
Software License
It is important to include a license to make it clear how others can use your work. We recommend the Creative Commons “No Rights Reserved” (CC0) public domain dedication, which places the software in the public domain and makes it easiest for end users to adapt and use your work. If a more restrictive license is required, we recommend the Apache License, Version 2.0, a permissive license that allows others to reuse, modify, and redistribute your software.
If a mix of data and code needs to be archived, and they each fall under different licenses, then separating them into different packages is advisable to eliminate ambiguity on which license applies to which portion of a data package. When a license other than a public domain dedication is used, then in addition to specifying the license in the metadata (see the <intellectualRights> element in EML), consider including a copy of the license at the beginning of the code files themselves so that the license is readily apparent to end users who peruse the code.
CodeMeta
Include a CodeMeta JSON file for code archived in EDI. CodeMeta offers a structured, machine-readable summary of the software’s purpose, authorship, and dependencies, improving discoverability and reuse. It complements EML but does not replace it.
Name the CodeMeta file “codemeta.json” and list it as an EML <otherEntity>. The <formatName> should be “application/json”, the <entityType> should be “CodeMeta, version 2.0”, and the <entityDescription> should indicate that this is a CodeMeta file for a given software or script in the data package.
For unnamed projects, e.g., one-off scripts for data processing, analysis, and/or visualization, a CodeMeta file might seem excessive; however, CodeMeta files are simple to generate, and we recommend the bare minimum shown in Example 2. If there are multiple scripts each in their own <otherEntity> tag, we recommend aggregating information about them into one codemeta.json.
For more information on how to create a CodeMeta file, including a tool to create one, see the CodeMeta User Guide.
Example 2: Minimum recommended codemeta.json example for a single script in an unnamed project. The script filename is used for “name”, and license links are from SPDX (often used with CodeMeta).
{
"@context": ["https://doi.org/10.5063/schema/codemeta-2.0",
"http://schema.org"
],
"@type": "SoftwareSourceCode",
"name": "BLE_LTER_CTD_QAQC.Rmd",
"description": "RMarkdown script to calibrate and flag raw CTD data.",
"author": {
"@type": "Person",
"givenName": "Christina",
"familyName": "Bonsell",
"email": "cbonsell@utexas.edu",
"@id": "https://orcid.org/0000-0002-8564-0618"
},
"keywords": ["calibration", "CTD", "RMarkdown"],
"license": "https://spdx.org/licenses/Unlicense",
"dateCreated": "2013-10-19",
"programmingLanguage": {
"@type": "ComputerLanguage",
"name": "R",
"version": "3.6.2",
"url": "https://r-project.org"
}
}
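A file like the one in Example 2 can also be produced from a script. The following is a minimal Python sketch, assuming only the standard library; the function name minimal_codemeta and its parameters are hypothetical, and the values are copied from Example 2 (the author's email and ORCID are left out for brevity).

```python
import json

# The two context URLs used throughout the examples in this document.
CODEMETA_CONTEXT = [
    "https://doi.org/10.5063/schema/codemeta-2.0",
    "http://schema.org",
]


def minimal_codemeta(name, description, author, keywords,
                     license_url, date_created, language):
    """Assemble the bare-minimum fields into a CodeMeta dictionary."""
    return {
        "@context": CODEMETA_CONTEXT,
        "@type": "SoftwareSourceCode",
        "name": name,
        "description": description,
        "author": author,
        "keywords": keywords,
        "license": license_url,
        "dateCreated": date_created,
        "programmingLanguage": language,
    }


record = minimal_codemeta(
    name="BLE_LTER_CTD_QAQC.Rmd",
    description="RMarkdown script to calibrate and flag raw CTD data.",
    author={"@type": "Person", "givenName": "Christina", "familyName": "Bonsell"},
    keywords=["calibration", "CTD", "RMarkdown"],
    license_url="https://spdx.org/licenses/Unlicense",
    date_created="2013-10-19",
    language={"@type": "ComputerLanguage", "name": "R", "version": "3.6.2"},
)

# Serialize with indentation so the file stays human readable.
codemeta_text = json.dumps(record, indent=2)
# To save: pathlib.Path("codemeta.json").write_text(codemeta_text, encoding="utf-8")
```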
Example 3: sample <otherEntity> metadata for example 2’s codemeta.json.
<otherEntity>
  <entityName>CodeMeta file for BLE_LTER_CTD_QAQC.Rmd</entityName>
  <entityDescription>CodeMeta file for annotated RMarkdown script to process, calibrate, and flag raw CTD data.</entityDescription>
  <physical>
    <objectName>codemeta.json</objectName>
    <size unit="byte">729</size>
    <authentication method="MD5">8547b7a63abc6c1f0913a5bd7549d9d1</authentication>
    <dataFormat>
      <externallyDefinedFormat>
        <formatName>application/json</formatName>
      </externallyDefinedFormat>
    </dataFormat>
  </physical>
  <entityType>CodeMeta, version 2.0</entityType>
</otherEntity>
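An <otherEntity> element like the one in Example 3 can be generated programmatically, which avoids mismatched tags. The following is a minimal Python sketch using the standard library's xml.etree.ElementTree. The function name other_entity is hypothetical, the values are copied from Example 3, and the result is a fragment to paste into a full EML document, not a complete or validated EML file.

```python
import xml.etree.ElementTree as ET


def other_entity(entity_name, description, object_name,
                 size_bytes, md5, format_name, entity_type):
    """Build an otherEntity element with the children used in this document."""
    entity = ET.Element("otherEntity")
    ET.SubElement(entity, "entityName").text = entity_name
    ET.SubElement(entity, "entityDescription").text = description

    physical = ET.SubElement(entity, "physical")
    ET.SubElement(physical, "objectName").text = object_name
    ET.SubElement(physical, "size", unit="byte").text = str(size_bytes)
    ET.SubElement(physical, "authentication", method="MD5").text = md5
    data_format = ET.SubElement(physical, "dataFormat")
    external = ET.SubElement(data_format, "externallyDefinedFormat")
    ET.SubElement(external, "formatName").text = format_name

    # entityType comes after physical, as in the examples in this document.
    ET.SubElement(entity, "entityType").text = entity_type
    return entity


codemeta_entity = other_entity(
    entity_name="CodeMeta file for BLE_LTER_CTD_QAQC.Rmd",
    description="CodeMeta file for annotated RMarkdown script to process, "
                "calibrate, and flag raw CTD data.",
    object_name="codemeta.json",
    size_bytes=729,
    md5="8547b7a63abc6c1f0913a5bd7549d9d1",
    format_name="application/json",
    entity_type="CodeMeta, version 2.0",
)
xml_text = ET.tostring(codemeta_entity, encoding="unicode")
```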
Example 4: CodeMeta file for multiple scripts and two authors. The description provides a human-readable list of scripts while hasPart lists the scripts in a more machine-friendly format. For projects with more scripts, the description could be more general and the list of key files could be isolated to hasPart.
{
"@context": [
"https://doi.org/10.5063/schema/codemeta-2.0",
"http://schema.org"
],
"@type": "SoftwareSourceCode",
"name": "BLE_LTER_CTD_QAQC Scripts",
"description": "A set of RMarkdown scripts to calibrate, flag, and visualize raw CTD data. Includes BLE_LTER_CTD_QAQC.Rmd for data cleaning and quality control, and BLE_LTER_CTD_Plotter.Rmd for generating diagnostic plots.",
"author": [
{
"@type": "Person",
"givenName": "Christina",
"familyName": "Bonsell",
"email": "cbonsell@utexas.edu",
"@id": "https://orcid.org/0000-0002-8564-0618"
},
{
"@type": "Person",
"givenName": "Tim",
"familyName": "Whiteaker",
"email": "whiteaker@utexas.edu",
"@id": "https://orcid.org/0000-0002-1940-4158"
}
],
"keywords": ["calibration", "CTD", "RMarkdown", "QAQC"],
"license": "https://spdx.org/licenses/Unlicense",
"dateCreated": "2013-10-19",
"hasPart": [
{
"@type": "SoftwareSourceCode",
"name": "BLE_LTER_CTD_QAQC.Rmd",
"description": "Performs calibration, flagging, and quality control on raw CTD data."
},
{
"@type": "SoftwareSourceCode",
"name": "BLE_LTER_CTD_Plotter.Rmd",
"description": "Generates diagnostic plots for visual inspection of CTD data."
}
],
"programmingLanguage": {
"@type": "ComputerLanguage",
"name": "R",
"version": "3.6.2",
"url": "https://r-project.org"
}
}
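When a package holds many scripts, the hasPart list in Example 4 can be built from a simple mapping of filenames to descriptions. The following is a minimal Python sketch; the function name has_part_entries is hypothetical, and the filenames and descriptions are those from Example 4.

```python
def has_part_entries(scripts):
    """Turn a {filename: description} mapping into CodeMeta hasPart entries."""
    return [
        {
            "@type": "SoftwareSourceCode",
            "name": name,
            "description": description,
        }
        for name, description in scripts.items()
    ]


scripts = {
    "BLE_LTER_CTD_QAQC.Rmd":
        "Performs calibration, flagging, and quality control on raw CTD data.",
    "BLE_LTER_CTD_Plotter.Rmd":
        "Generates diagnostic plots for visual inspection of CTD data.",
}

# The resulting list can be assigned to the "hasPart" key of the CodeMeta record.
has_part = has_part_entries(scripts)
```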
The example below shows some additional metadata you can include. See also the more complete example on the codemetar R package page and the available CodeMeta terms.
Example 5: A more complete CodeMeta example for named projects. Example taken from the codemetar R package with edits for brevity.
{
"@context": ["https://doi.org/10.5063/schema/codemeta-2.0",
"http://schema.org"
],
"@type": "SoftwareSourceCode",
"name": "codemetar: Generate 'CodeMeta' Metadata for R Packages",
"description": "A JSON-LD format for software metadata",
"author": [{
"@type": "Person",
"givenName": "Carl",
"familyName": "Boettiger",
"email": "cboettig@gmail.com",
"@id": "https://orcid.org/0000-0002-1642-628X"
},
{
"@type": "Person",
"givenName": "Maëlle",
"familyName": "Salmon",
"@id": "https://orcid.org/0000-0002-2815-0399"
}
],
"codeRepository": "https://github.com/ropensci/codemetar",
"dateCreated": "2013-10-19",
"license": "https://spdx.org/licenses/GPL-3.0",
"version": "0.1.8",
"programmingLanguage": {
"@type": "ComputerLanguage",
"name": "R",
"version": "3.5.3",
"url": "https://r-project.org"
},
"softwareRequirements": [{
"@type": "SoftwareApplication",
"identifier": "R",
"name": "R",
"version": ">= 3.0.0"
},
{
"@type": "SoftwareApplication",
"identifier": "git2r",
"name": "git2r",
"provider": {
"@id": "https://cran.r-project.org",
"@type": "Organization",
"name": "Comprehensive R Archive Network (CRAN)",
"url": "https://cran.r-project.org"
}
}
],
"keywords": ["metadata", "codemeta", "ropensci"]
}
Metadata to enable reproducibility
When archiving software, include a user guide with installation and usage instructions—especially if the script requires inputs that may not be available to others. When feasible, provide example data and configure the script to run with it by default.
In addition to the code, document other details that support reproducibility, such as dependencies, the operating system, and system locale. This information can be included in the EML metadata under methods/methodStep/description. Tools such as sessionInfo() in R or conda for Python environments can help capture this automatically. You can then include files such as renv.lock, requirements.txt, or environment.yml alongside the code, each as an <otherEntity>.
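For Python scripts, a rough analogue of R's sessionInfo() can be assembled from the standard library. The following is a minimal sketch, assuming Python 3.8 or later (for importlib.metadata); the function name environment_summary and the keys of the returned dictionary are hypothetical choices, not an EDI or EML convention.

```python
import locale
import platform
from importlib import metadata


def environment_summary(packages=()):
    """Collect interpreter, operating system, locale, and package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python_version": platform.python_version(),
        "operating_system": platform.platform(),
        # Passing None queries the current locale setting without changing it.
        "locale": locale.setlocale(locale.LC_CTYPE, None),
        "packages": versions,
    }


# List the packages your script imports; this name is deliberately fictitious.
summary = environment_summary(["a-package-that-is-not-installed-xyz"])
```

The resulting values can be pasted into the methods/methodStep/description text or saved to a small text file archived with the code.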
For named projects, include the software name and version if applicable. For versioning, we recommend Semantic Versioning, and if using GitHub, reference a specific release tag or commit hash. Including a changelog or release notes can also help future users understand how the code has evolved.
Linking code and data
There are a few solutions for providing explicit machine-readable linkages between different entities/packages (the distinction between code/data doesn’t matter too much here). For most cases we recommend the simplest approach, which is to use the methods/methodStep/description element of EML. More advanced users may wish to utilize the other solutions described herein.
Descriptive approach
In the dataset methods/methodStep/description element, include verbal descriptions such as “results.csv was derived from raw_data.csv using script.R” and repeat for all entities. If code and data reside in different packages, be sure to specify that.
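When a package contains many entities, these sentences can be generated from a small lookup so that none are forgotten. The following is a minimal Python sketch; the function name provenance_sentence, its parameters, and the package identifier in the second call are hypothetical.

```python
def provenance_sentence(output, inputs, script, script_package=None):
    """Write one descriptive sentence linking an output to its inputs and script."""
    sentence = f"{output} was derived from {', '.join(inputs)} using {script}"
    if script_package:
        # Say so explicitly when the code resides in a different package.
        sentence += f" (archived in data package {script_package})"
    return sentence + "."


same_package = provenance_sentence("results.csv", ["raw_data.csv"], "script.R")
other_package = provenance_sentence(
    "results.csv", ["raw_data.csv"], "script.R",
    script_package="hypothetical-package-id",
)
```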
The EML dataSource element
Nested under methods/methodStep, <dataSource> elements describe other data packages that serve as sources for the current package. A <dataSource> looks like a mini-EML tree describing the source data. For example, ecocomDP packages list the original packages under <dataSource>. <dataSource> does not describe relationships between entities in the same package, and as far as we know there is no explicit way in EML to do so.
External software
Large community-backed tools or proprietary software such as ArcGIS Pro or Microsoft Excel do not need to be archived. However, if they have had any impact on the final data (e.g., ArcGIS Pro was used to modify spatial rasters), the EML methods section should describe the routines performed. Within the data package, indicate linkage to external software as follows.
- Briefly describe the software/code and its relationship to the data in EML’s methods/methodStep/description element.
- Names of all software used. Include both the common acronym and the full spelling.
- The URL(s) to all models/software used. Stable, persistent URLs pointing to exact version(s) are preferable, rather than generic links such as a project homepage. If the archived model has a DOI, then include a full citation to the model in the methods/methodStep/description text. The exception to this is when referencing tools such as Excel that have achieved global household name status.
- Broadly, the system setup used, if relevant.
- Information on exact versions for all code used (including dependencies). This is important, e.g., ArcGIS Pro 3.5.2 is very different from ArcGIS for Desktop 10.7.1. Different systems have methods to easily generate this information, e.g., a call to sessionInfo() in the R console.
- If applicable, consider archiving the “runfile” as its own data entity within the data package, i.e., the script(s) that set parameters and/or call functions imported from external software.
Example 6: EML method description referring to external software.
<methods>
  <methodStep>
    <description>
      <para>
        The seagrass coverage raster was created in ArcGIS Pro (version
        2.4.3, by Esri) using the IDW geoprocessing tool on
        sampling_points.csv with a power of 2 and the nearest
        12 points.
      </para>
      <para>
        The raster was then refined using the seagrass-refiner package
        with the auto-refine option checked (Smith, 2017).
      </para>
      <para>
        Smith, J. (2017). seagrass-refiner: a package that does the cool
        seagrass stuff, Version 1.2, Zenodo. https://doi.org/this-is/a-fake-doi,
        2017.
      </para>
    </description>
  </methodStep>
</methods>
Resources
CodeMeta website
CodeMeta generator for creating CodeMeta
CodeMeta crosswalks for a number of popular software
CodeMeta terms you can use for describing software
A description of some software licenses