The Need for Identifiers¶

Version: 0.1

License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication

Objectives¶

The aim of the recipe is to provide an introduction to the notion of identifiers, the different kind of identifiers and the kind of infrastructure required to support their use in the context of CFDE and in the context of producing and consuming FAIR data.

Definitions¶

identifier: An identifier is a chain of characters (number, letter, symbol, or any combination of those) which is associated to an entity or a type of entities and which is used to refer to those entities in an indexing system.

unique identifier: A unique identifier (UID) is an identifier which refers to only one instance of thing

locally unique identifier: A locally unique identifier (LUID) is an identifier which is unique to a given context (hence, local) but which may clash if moved out of this specific context. By clash, we mean that if dereferenced in the wrong context, it could return an instance different from the one it pointed to when it was minted in its original context.

globally unique identifier: A globally unique identifier (GUID) is an identifier which is produced such that the probability of the exact same string is extremely low (but not null), hence global. Also known as universally unique identifier (UUID). Their production does not depend on central registration and relies on the generation of a 128-bit string.

import uuid
id = uuid.uuid4() 
print(id)
5b6d0be2-d47f-11e8-9f9d-ccaf789d94a0

persistent identifier: A persistent identifier (PID) is an identifier which provides a long-lasting reference to a digital resource. The key notion introduced is that ofpersistence, which can only exist if a resolution service exists. Well known examples are persistent identifiers are Digital Object Identifiers (DOI), Archive Retrieval Keys (ARK), Open Researcher and Contributor ID(ORCID) or Persistent Uniform Resource Locator (PURL). More examples can be found on the PID forum.

compact identifier: A compact identifier is a unique string consisting of a Prefix (assigned by curator), a colon (‘:’), and an Accession (e.g. local identifier string). The prefix is composed of an optional Provider Code, and an assigned Namespace, separated by a slash (‘/’).

identifier minting service: An identifier minting service is a piece of software infrastructure which allows the production of an identifier. It can be as simple as as UUID generator (see above) or random number generator.

identifier resolution service: An identifier resolution service is a piece of software infrastructure which given an identifier can return a web resource about the entity designated by the identifier. Such service is essential to realize the notion of persistence of identifiers.

The Importance of PIDs in FAIR and in the Context of CFDE¶

Data interoperability as described by FAIR assumes that entities and information about them can be retrieved throughinternet communication protocol. This can thus only be realized by relying on PIDs and resolution service to obtain the resources of interest.

For a number of digital objects (scholarly articles, grants), persistent identifiers such DOI exist and for research organizations and persons, RORID and ORCID are gaining popularity becoming more widely used.

Datasets are another type of digital objects which are being identified by DOI with services such as Figshare or Zenodo.

But interoperability goes far beyond reliable identification of these entities. For research scientists, entities such as:
* Taxonomic Information * Anatomical Information * Diseases and Conditions * Molecular Entities, such as: - Chemicals - Metabolites - Genes - Transcripts - Protein - Lipids * Pathways and Biochemical reactions * Assays * Instruments * Licenses

For all these entities, the NIH CFDE has identified a set of semantics artefacts/ontologies, the classes of which are all identified by PIDs and are used to annotated the datasets which will be represented in the C2M2 model and made available through the DERIVA system.

This specific NIH-CFDE recipe details which ontologies have been selected and how their use will be rolled out as the program progresses.

It is important to notice that for such resources, a central authority is responsible for issuing (minting) those identifiers as well as providing the landing pages for the different content types associated with these kind of PIDs. This works well to address the semantic markup needs that arise in the data curation, extraction, transformation, and load processes ETL. However, there is a need to be able to create resolvable identifiers on demands. For instance, to uniquely identify data files. In order to discussion this specific use case, a specific document is available and details the use of minids.

Refer to the specific recipe on MINIDS to learn how to mint such resolvable identifiers for your resources.

Bibliographic References¶

Klump, J. and Huber, R., 2017. 20 Years of Persistent Identifiers – Which Systems are Here to Stay?. Data Science Journal, 16, p.9. http://doi.org/10.5334/dsj-2017-009
McMurry JA, Juty N, Blomberg N, et al. Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data. PLoS Biol. 2017;15(6):e2001414. Published 2017 Jun 29. http://doi.org/10.1371/journal.pbio.2001414
Martin Fenner, Mercè Crosas, Jeffrey S Grethe, David Kennedy, Henning Herm-jakob, Philippe Rocca-Serra, Gustavo Durand, Robin Berjon, Sebastian Karcher,Maryann Martone, and Tim Clark. 2019. A data citation roadmap for scholarly data repositories. Scientific Data6, 1 (2019), 1–9.http://doi.org/10.1038/s41597-019-0031-8 4.https://www.pidforum.org/
John Kunze. The Entity (N2T) Resolver: low-risk, low-cost persistent identifica-tion.https://hdl.handle.net/1813/3688
Wimalaratne, S., Juty, N., Kunze, J. et al. Uniform resolution of compact identifiers for biomedical data. Sci Data 5, 180029 (2018). https://doi.org/10.1038/sdata.2018.29
Research Organization Registry