v2 Metadata Overview

Implementation of new metadata features

Posted by Max Burnette on May 06, 2022 · 3 mins read

Development on Clowder v2 is progressing. Recently our core development team had in-depth design discussions regarding the architecture of metadata for files and datasets in v2. We have begun implementing aspects of this architecture and wanted to describe it below.

Metadata Structure

In the database, metadata is composed of 4 pieces of information:

  • A resource, i.e. dataset id or file id + version (see below)
  • An agent that created the metadata, i.e. user (and optionally the extractor they triggered)
  • The contents of the metadata, with arbitrary fields
  • The context of the metadata in JSON-LD terms (i.e. define the fields)

One change in v2 is improved handling predefined metadata fields. Metadata fields can be defined in the database with:

  • Name & data type (e.g. string, int)
  • Flag to allow multiple values for this field or just one per resource
  • Flag to require the field for all objects at a space level
  • A context

This means that users can refer to the metadata fields in their context rather than the more involved provision of links to URIs or JSON schema documents. The specification of data types will also allow the Clowder UI to provide widgets for users adding the metadata via the interface, e.g. date fields providing a calendar widget.

File Versioning

In v2, files are automatically versioned, meaning users can replace files as needed. Older file versions will remain accessible for viewing and download. Because metadata is often generated by extractors based on the contents of a file, changes to a file may necessitate re-running extractors or replacing metadata that is no longer applicable.

To manage this, metadata is associated with a specific file version in v2. When a file is updated to a new version any existing metadata will be carried over, but changes to the metadata will only affect that specific version. Older file versions will retain the old metadata ( permitted users can still modify previous metadata versions).

The intent is that if, for example, a text file was updated with additional contents, re-running the wordcount extractor will generate correct wordcounts for the new version will retaining the correct wordcounts for the old version as well.

** User vs. Extractor metadata **

As briefly mentioned above, the distinction between user and extractor metadata categories is going away. Every piece of metadata will now have a user associated for ownership purposes, even metadata generated by an extractor - in those cases, the user who triggered the extractor will be listed. We intend for this to reduce some of the complexity in metadata handling and conflict detection.

We are adding more features to update and replace metadata. In Clowder v1, running the same extractor multiple times would attach duplicate metadata to the resource unless the extractor itself was coded to avoid this; we are now building in duplication and replacement handling to make it easy to update specific fields or whole metadata objects and avoid inadvertent duplication.