Data entities

An EML document is a container for metadata that describe the digital products of research or data collection activities, which can include a wide range of file types and other data objects. Data entities are a class of EML elements that directly describe these digital objects; six data entity types have been defined for this purpose. This chapter provides thorough guidance on the <dataTable> and <otherEntity> data entity elements because they are very commonly used in EML documents and can describe a wide range of data objects or files. The <spatialRaster> and <spatialVector> entity types are occasionally used for geospatial data objects and are addressed in depth in the spatial data chapter of the “Data Package Design for Special Cases” companion document. The <view> and <storedProcedure> entity types are only briefly described in Appendix A because they describe data products derived from relational databases and are rarely used. To assist with selection of an EML entity type, Table A.1 in Appendix A describes the purpose, example data objects, and metadata features of all six available types.

Every EML data entity has a set of potential child elements in common, called the EntityGroup, for storing general information about the data resource. Some of these are required and some are recommended. In addition, each data entity type has its own type-specific child elements, again with some required and some recommended. Table 10.1 summarizes typical use cases for <dataTable> and <otherEntity>, and their required and recommended child elements. More details about child elements come later in the chapter.

Table 10.1. Required and recommended child elements for <dataTable> and <otherEntity> types. Members of the EntityGroup are given, followed by elements specific to the entity type. This is a subset of Table A.2 in Appendix A.

Entity type: <dataTable>
Typical uses: Tabular data objects with a fixed structure (delimited text files, simple spreadsheet, etc.)
Required and recommended elements:
    EntityGroup
        <entityName> (required)
        <entityDescription>
        <physical>
    <attributeList> (required)
    <numberOfRecords>

Entity type: <otherEntity>
Typical uses: Data objects not described by other standard entity types (images, non-tabular data, binary files, etc.)
Required and recommended elements:
    EntityGroup
        <entityName> (required)
        <entityDescription>
        <physical>
    <entityType> (required)
    <attributeList> (for tabular data)

Context note: Like many other EML elements, each of the entity elements has three attributes: id, scope, and system. EDI automatically fills them in with the actual entity ID within the EDI repository system, scope = “document”, and system = “https://pasta.edirepository.org”. Because these attributes are overwritten upon submission, they should not be used for local information. Use <alternateIdentifier> instead.
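
As a minimal sketch of this advice, a local identifier can be carried in <alternateIdentifier> rather than in the id attribute. The identifier value, its system attribute, and the example.org address are hypothetical placeholders, and the remaining children of <dataTable> are omitted for brevity.

```xml
<dataTable>
  <!-- Hypothetical local identifier; the id, scope, and system attributes
       of dataTable are left for the repository to fill in -->
  <alternateIdentifier system="https://example.org/local-catalog">frs-hab-001</alternateIdentifier>
  <entityName>Arthropod habitat table</entityName>
  <!-- entityDescription, physical, and attributeList omitted in this sketch -->
</dataTable>
```

Note that <alternateIdentifier> is placed before <entityName>, which is the order the EntityGroup expects.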

Recommendations for dataTable

Carefully consider which files or data objects should be described with a <dataTable> element. If at all possible, do not publish tabular data in dated, proprietary, or binary file formats such as Microsoft Excel files. When handling tabular data in such formats, it is preferable to export them to an accessible, open format, such as plain delimited text (e.g., CSV), for publication. The <attributeList> tree is required in <dataTable> entities to describe all attributes, or column variables, in the table. Best practices for <attributeList> are in Chapter 11.

Example 10.1: An example of a <dataTable> entity. The elements in the EntityGroup are shown, along with an abbreviated <attributeList>, which is required for a <dataTable>.

<dataTable>
  <entityName>Arthropod habitat table</entityName>
  <entityDescription>
    habitat description table for the sampling locations
  </entityDescription>
  <physical>
    <objectName>frs-1-arthro-hab.csv</objectName>
    <dataFormat>
      <textFormat>
        <numHeaderLines>1</numHeaderLines>
        <numFooterLines>0</numFooterLines>
        <recordDelimiter>\r</recordDelimiter>
        <numPhysicalLinesPerRecord>1</numPhysicalLinesPerRecord>
        <attributeOrientation>column</attributeOrientation>
        <simpleDelimited>
          <fieldDelimiter>,</fieldDelimiter>
          <quoteCharacter>"</quoteCharacter>
        </simpleDelimited>
      </textFormat>
    </dataFormat>
    <distribution>
      <online>
        <onlineDescription>frs-1 Arthro Habitat Data File</onlineDescription>
        <url function="download">
          http://www.ficstate.edu/lter/data/frs-1-arthro-hab.csv
        </url>
      </online>
    </distribution>
  </physical>
  <attributeList>
    <!-- <attribute> elements omitted here; see Chapter 11 -->
  </attributeList>
</dataTable>
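
Table 10.1 also lists <numberOfRecords> as a recommended child of <dataTable>, which Example 10.1 does not show. The following is a sketch only, assuming a single text column. The attribute name, its definitions, and the record count are hypothetical and are not taken from the example dataset.

```xml
<dataTable>
  <entityName>Arthropod habitat table</entityName>
  <!-- entityDescription and physical as in Example 10.1 -->
  <attributeList>
    <attribute>
      <!-- Hypothetical column; name and definitions are illustrative only -->
      <attributeName>habitat_type</attributeName>
      <attributeDefinition>Habitat category assigned to the sampling location</attributeDefinition>
      <storageType>string</storageType>
      <measurementScale>
        <nominal>
          <nonNumericDomain>
            <textDomain>
              <definition>Free-text habitat category</definition>
            </textDomain>
          </nonNumericDomain>
        </nominal>
      </measurementScale>
    </attribute>
    <!-- one attribute element for each remaining column -->
  </attributeList>
  <!-- Hypothetical count of data records in the table -->
  <numberOfRecords>120</numberOfRecords>
</dataTable>
```

In this sketch <numberOfRecords> follows <attributeList>, matching the order in which Table 10.1 lists the two elements.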

Recommendations for otherEntity

When publishing an <otherEntity>, we recommend using open, non-proprietary file formats to the extent possible, so that the published data objects can be used without restrictions such as the need for an expensive software license. Though <otherEntity> is something of a catch-all data entity type, it requires a free-text <entityType> child element to describe whatever data object is being published. We strongly recommend placing standardized descriptive text here that complements the standardized file format information included in physical/dataFormat/externallyDefinedFormat/formatName. This is demonstrated in Example 10.2, and Table 10.2 provides suggested <entityType> values for a variety of file formats. Further guidance on this, including sources for standard terms, is given below in the “Physical tree” section. Additional information needed to fully describe the file format, data structure, use, or other aspects of the data object can also be provided in the element’s <entityDescription>.

Context note: When archiving HTML documents at EDI, the format name must be “text/html” in order for the document to be accepted.
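
A sketch of how that requirement might look for an HTML document follows. The entity name, file name, and <entityType> value are hypothetical; only the “text/html” format name comes from the note above.

```xml
<otherEntity>
  <entityName>Field site overview page</entityName>
  <physical>
    <objectName>site-overview.html</objectName>
    <dataFormat>
      <externallyDefinedFormat>
        <!-- Format name required by EDI for HTML documents -->
        <formatName>text/html</formatName>
      </externallyDefinedFormat>
    </dataFormat>
    <!-- distribution omitted in this sketch -->
  </physical>
  <!-- Hypothetical entityType value; Table 10.2 lists suggested values -->
  <entityType>HTML document</entityType>
</otherEntity>
```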

Example 10.2: An example of an <otherEntity> element. The recommended elements in the EntityGroup are shown, as is a standardized value in the required <entityType> element. Note the corresponding IANA standard text in the <formatName> element.

<otherEntity>
  <entityName>Field and lab protocol for arthropod sampling</entityName>
  <entityDescription>
    An Adobe PDF (archival PDF/A) document describing the habitat, sampling locations, field collection protocols, and laboratory processing procedures for the FLS arthropod sampling program.
  </entityDescription>
  <physical>
    <objectName>frs-1-arthro-protocol.pdf</objectName>
    <dataFormat>
      <externallyDefinedFormat>
        <formatName>application/pdf</formatName>
      </externallyDefinedFormat>
    </dataFormat>
    <distribution>
      <online>
        <onlineDescription>frs-1 Arthropod Protocol</onlineDescription>
        <url function="download">
          http://www.ficstate.edu/lter/protocols/frs-1-arthro-protocol.pdf
        </url>
      </online>
    </distribution>
  </physical>
  <entityType>Portable Document Format</entityType>
</otherEntity>

EntityGroup elements

In the EntityGroup, the <entityName> element is required, and <entityDescription> and <physical> (including optional <access>) are recommended. Optional additional EntityGroup elements include <alternateIdentifier>, <additionalInfo>, <coverage>, and <methods>. These are briefly described after the required and recommended elements, but for better readability by data users, it is generally recommended that their content be provided at the dataset level and not distributed throughout the EML document.

Entity name (required)

The entity name (<entityName>) is a human-readable name for the data object, whether that is a file, database table, document, or something else. The content should not exceed 100 characters.

Context note: The EDI repository requires that <entityName> elements be unique within the dataset. This element will be displayed on the dataset landing page as the name for the entity.
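
The uniqueness requirement can be illustrated with a short sketch, assuming a dataset with two tables. The second entity name is hypothetical, and all other dataset and entity children are omitted.

```xml
<dataset>
  <!-- title, creator, contact, and other dataset-level elements omitted -->
  <dataTable>
    <entityName>Arthropod habitat table</entityName>
    <!-- remaining children omitted -->
  </dataTable>
  <dataTable>
    <!-- A distinct name, not a repeat of the one above -->
    <entityName>Arthropod pitfall trap counts</entityName>
    <!-- remaining children omitted -->
  </dataTable>
</dataset>
```

Both names stay well under the 100-character limit and describe the content rather than repeating the file name.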

Optional EntityGroup elements

All of the EntityGroup elements below are optional. They may add value at the data entity level under some circumstances, but in general they are more useful at the <dataset> level.

  • Alternate identifier: The primary identifier for a data entity belongs in the @id attribute of the entity element (e.g., <dataTable id="xxx">) and may be filled in by the repository, but adding the <alternateIdentifier> element can accommodate additional identifiers that might be used in a local data management system. It is used similarly to the <alternateIdentifier> element at the <dataset> level, as described in Chapter 3.
  • Annotations: The <annotation> element can be used at the entity level to establish links to semantic ontologies and other external resources that provide context or utility to the entity or its content. See Chapter 7 for more details.
  • Additional information: The <additionalInfo> element is a text field for any material that cannot be characterized by the other elements for the data type.
  • Coverage and methods: Entity-level <coverage> and <methods> elements can provide information on the geographic, taxonomic and temporal coverages, or data collection methods, for the data entity. The general recommendation is that these be placed at the <dataset> level instead, but there may be specific use cases that benefit from entity-level coverage or methods information. See more details about coverages in Chapter 7, and about methods in Chapter 9.
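
The optional elements above might be combined as in the following sketch, which assumes the EML 2.2 element order for the EntityGroup (coverage, then additionalInfo, then annotation). The dates, the note text, the id value, and the example.org value URI are hypothetical placeholders. The id attribute is included on the assumption that EML 2.2 expects an annotated element to carry one; its value may be replaced by the repository.

```xml
<otherEntity id="protocol-pdf">
  <entityName>Field and lab protocol for arthropod sampling</entityName>
  <!-- entityDescription and physical omitted in this sketch -->
  <coverage>
    <temporalCoverage>
      <rangeOfDates>
        <!-- Hypothetical period during which the protocol applied -->
        <beginDate><calendarDate>2015-01-01</calendarDate></beginDate>
        <endDate><calendarDate>2018-12-31</calendarDate></endDate>
      </rangeOfDates>
    </temporalCoverage>
  </coverage>
  <additionalInfo>
    <para>Hypothetical note that does not fit any other element.</para>
  </additionalInfo>
  <annotation>
    <!-- rdf:type as the property; the value URI is a placeholder -->
    <propertyURI label="is a">http://www.w3.org/1999/02/22-rdf-syntax-ns#type</propertyURI>
    <valueURI label="protocol">https://example.org/ontology/Protocol</valueURI>
  </annotation>
  <entityType>Portable Document Format</entityType>
</otherEntity>
```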

XPaths referenced in this chapter

Data table entity: /eml:eml/dataset/dataTable

Other entity: /eml:eml/dataset/otherEntity

Data table entity name: /eml:eml/dataset/dataTable/entityName

Other entity description: /eml:eml/dataset/otherEntity/entityDescription

Data table physical tree: /eml:eml/dataset/dataTable/physical

Object name (filename): /eml:eml/dataset/dataTable/physical/objectName

Format name: …physical/dataFormat/externallyDefinedFormat/formatName

Other entity type: /eml:eml/dataset/otherEntity/entityType

Other entity annotation: /eml:eml/dataset/otherEntity/annotation
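
To show how these XPaths map onto a document, the skeleton below marks where each path lands. It is a sketch only: it assumes the EML 2.2.0 namespace, uses placeholder packageId and system values, and omits required dataset-level elements, so it is not schema-valid as written.

```xml
<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"
         packageId="example.1.1" system="example">
  <dataset>
    <!-- title, creator, contact, and other dataset-level elements omitted -->
    <dataTable>                      <!-- /eml:eml/dataset/dataTable -->
      <entityName>Placeholder table name</entityName>
      <physical>                     <!-- .../dataTable/physical -->
        <objectName>placeholder.csv</objectName>
        <!-- dataFormat and distribution omitted -->
      </physical>
      <!-- attributeList omitted -->
    </dataTable>
    <otherEntity id="placeholder-id"> <!-- /eml:eml/dataset/otherEntity -->
      <entityName>Placeholder document name</entityName>
      <entityDescription>Placeholder description</entityDescription>
      <physical>
        <objectName>placeholder.pdf</objectName>
        <dataFormat>
          <externallyDefinedFormat>
            <formatName>application/pdf</formatName>
          </externallyDefinedFormat>
        </dataFormat>
      </physical>
      <!-- .../otherEntity/annotation would sit here, before entityType -->
      <entityType>Portable Document Format</entityType>
    </otherEntity>
  </dataset>
</eml:eml>
```

The root element carries the eml prefix while its descendants are unprefixed, which is why only the first step of each XPath is written as eml:eml.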