The goal of this document is to make the Data Origin more visible in the results of queries executed in the Virtual Observatory. It lists the metadata required to give end users sufficient traceability, so that they can better understand result sets, reuse them, and cite them.
- (Data Origin information) A researcher has data in a VOTable that shows an odd feature. They would now like to talk to the creator of the data to help figure out whether that feature is physics or an artefact.
Requirement: contact information for the producers is present; but let's not make that a MUST: this can be GDPR-relevant data, and it must be possible to leave it out if it is.
The researcher completes their understanding with Data Origin information that is easily accessible from the VOTable, regardless of the service that generated the result, for instance a URL that links to an article. [The information could contain the author, the year of publication, and related resources such as an article or the original data URL]
When the data provided by the service are derived from external resources, or have undergone additional curation, the nature of the external resources and the links to them are available.
For instance, a table published in a journal or by a space agency is also hosted in a data center such as CDS, GAVO, etc. The data curation depends on the data center, which can add associated data, enrich the metadata (e.g. add the filter for a magnitude), or select a subset of the columns. [an advanced serialisation could be based on the DOI vocabulary: "IsVariantFormOf", "IsDerivedFrom", ...]
- (Reproducibility) A researcher revisits work they did six months earlier in an ad-hoc fashion and would now like to reproduce it in a more structured fashion. To do that, they need to know, say, which queries against which services, or perhaps which programs, produced the files. [Requirement: have the request parameters and a service identification (access URL? ivoid?) in the data origin]
- (Citation) While preparing a publication, a researcher would like to properly cite the software and data that went into their results. They now run a program to extract that information from the digital artefacts going into the publication -- perhaps even separated into citations and acknowledgements. [Requirement: The data origin must indicate requests for citation and/or acknowledgement in a machine-readable way, preferably in a way that lets machines generate BibTeX for whatever they specify]
The information allows the researcher to fill in the citation template requested by journals.
Example (American Astronomical Society template):
"we searched optical astrometric data of these sources from the Gaia (Gaia Collaboration et al. 2016) Early Data Release 3 (Gaia Collaboration et al. 2021) via the Gaia archive (Gaia Collaboration 2020)."
- (Workflow) "Give me a bibliography of everything I’ve used in the workflow."
The VOTables resulting from a session contain homogenized metadata that can be merged and compared.
- What else ? ....
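The citation and workflow use cases above could be sketched as follows. This is only an illustration, not a proposed format: the record keys are borrowed from the Dataset-origin metadata names used in this document, the mapping to BibTeX fields is an assumption, and all values are placeholders.

```python
# Hypothetical data-origin records, one per VOTable used in a workflow.
# Key names follow the Dataset-origin metadata names; values are placeholders.
records = [
    {"Publication-id": "doi:10.0000/example.1",
     "Creator": "Example Collaboration",
     "Publication-date": "2021-01-01"},
    {"Publication-id": "doi:10.0000/example.2",
     "Creator": "Doe, J.",
     "Publication-date": "2019-06-15"},
    {"Publication-id": "doi:10.0000/example.1",  # same dataset used twice
     "Creator": "Example Collaboration",
     "Publication-date": "2021-01-01"},
]


def to_bibtex(record, key):
    """Render one record as a BibTeX @misc entry (field mapping is an assumption)."""
    fields = {
        "author": record.get("Creator"),
        "year": (record.get("Publication-date") or "")[:4] or None,
        "howpublished": record.get("Publication-id"),
    }
    body = ",\n".join(
        f"  {name} = {{{value}}}" for name, value in fields.items() if value
    )
    return f"@misc{{{key},\n{body}\n}}"


def bibliography(records):
    """Merge records by Publication-id and return one BibTeX entry per dataset."""
    unique = {}
    for record in records:
        unique.setdefault(record["Publication-id"], record)
    return [
        to_bibtex(rec, f"origin{i}")
        for i, rec in enumerate(unique.values(), start=1)
    ]


entries = bibliography(records)
print("\n\n".join(entries))
```

Merging on the citation identifier is what makes homogenized metadata useful here: the same dataset reached through two different queries yields a single bibliography entry.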
Tracing Data origin can be complex. It depends on the granularity expected.
- A basic approach consists in adding information as key=value pairs. Each pair is independent and does not interact with the others. This approach can generally be serialized in VOTable using the INFO tag.
- An advanced approach consists in a rich serialization that allows pieces of information to interact. This approach requires an advanced VOTable serialization. MIVOT is one answer: it makes it possible to map data to data models such as DatasetDM or Provenance (or last-step provenance).
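As a minimal sketch of the basic approach, the key=value pairs can be written as VOTable INFO elements, which carry a `name` and a `value` attribute. The key names below are taken from the metadata names used in this document; the values, and the choice of attaching the INFO elements to a RESOURCE, are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical key=value pairs; values are placeholders.
origin = {
    "ivoid": "ivo://example.org/some/resource",
    "publisher": "Example Data Center",
    "request_date": "2024-01-01T00:00:00",
}


def origin_to_info_elements(pairs):
    """Return one INFO element per key=value pair."""
    return [
        ET.Element("INFO", {"name": name, "value": value})
        for name, value in pairs.items()
    ]


# Attach the INFO elements to a RESOURCE element and serialize it.
resource = ET.Element("RESOURCE")
resource.extend(origin_to_info_elements(origin))
xml_text = ET.tostring(resource, encoding="unicode")
print(xml_text)
```

Because each pair becomes an independent element, a repeated key (for instance several authors) would simply produce several INFO elements with the same name.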
The following metadata can be repeated and could follow a controlled vocabulary.
- Author: name or ORCID
- Organization: name or URL
- Editor: name or URL
- Journal: name or URL
- Datacenter that provides the result: name or URL
- Contact: email
- Resource identifier: ivoid of the resource(s) hosted by the service that provides the result
- Resource citation: DOI or bibcode of the resource(s) hosted by the datacenter that returns the result
- Original resource identifier: a remote resource that was used to build the result
- Publication date
- Original publication date
- Data version
- Curation level: (controlled vocabulary)
- Operation: operation, such as cutout or add-values, executed by the data center on the original data
- Licence: (original) licences; a machine-readable URI is preferred
- Access protocol: e.g. TAP query, SCS, ...
- Query: e.g. ADQL
- Software version: data center version
- Comment: any additional information in plain text that completes the result
- ...?
Query information makes it possible to link to the registry and to reproduce the query. For queries on evolving datasets, the version or the date must complete the information.
meta-data | Description | Mandatory |
---|---|---|
ivoid | ivoid identifier linking to the registry | yes |
publisher | Data center that provides the VOTable | yes |
version | Dataset version (or release date) | |
service_protocol | Access protocol with version | |
request | Request URL | |
request_post | (POST request) POST arguments (new) | |
request_date | Query execution date | |
contact | Email or URL contact | |
landing_page | Dataset landing page | |
Serialisation example: the INFO tag does the job; see the SCS example.
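On the client side, the query information can be read back from the INFO elements and checked against the two mandatory entries of the table (`ivoid` and `publisher`). The following is a sketch under assumptions: the sample VOTable is a hand-written placeholder, and the helper names `read_origin` and `missing_mandatory` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Entries marked as mandatory in the "Query information" table.
MANDATORY = ("ivoid", "publisher")

# Hand-written placeholder VOTable; values are not real.
SAMPLE = """<VOTABLE version="1.3" xmlns="http://www.ivoa.net/xml/VOTable/v1.3">
  <RESOURCE>
    <INFO name="ivoid" value="ivo://example.org/some/resource"/>
    <INFO name="publisher" value="Example Data Center"/>
    <INFO name="request_date" value="2024-01-01T00:00:00"/>
  </RESOURCE>
</VOTABLE>"""


def read_origin(votable_xml):
    """Collect INFO name/value pairs; values are lists since keys may repeat."""
    root = ET.fromstring(votable_xml)
    origin = {}
    for elem in root.iter():
        # Drop the XML namespace, if any, before comparing the tag name.
        if elem.tag.rsplit("}", 1)[-1] == "INFO":
            origin.setdefault(elem.get("name"), []).append(elem.get("value"))
    return origin


def missing_mandatory(origin):
    """Return the mandatory keys that are absent from the collected pairs."""
    return [key for key in MANDATORY if key not in origin]


origin = read_origin(SAMPLE)
print(origin)
print("missing:", missing_mandatory(origin))
```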
Dataset-origin completes the "Query information".
- Simple case providing a single table in the output (e.g. SCS)
meta-data | Description | Mandatory |
---|---|---|
Publication-id | Dataset identifier that can be used for citation | yes |
Curation-level | Controlled vocabulary | |
Resource-version | Dataset version or last release | |
Rights | Licence URI | |
Rights-type | Licence type (e.g. CC-by, CC-0, private, public) | |
Copyrights | Copyright text | |
Creator | Dataset author(s) or group | |
Publication-ref | Identifier of the original resource, which can be an article or the original data center | |
Editor | Editor name | |
Relation_type | Controlled vocabulary (VOResource: relationshipType?) to specify the relation to the related resource (new) | |
related_resource | Original resource (new) | |
Publication-date | Date of the original publication | |
resource_date | Date of the original resource (new) | |
Publication-id can be prefixed with the identifier type, e.g. bibcode:..., doi:..., ror:...
Serialisation example: the INFO tag does the job; see the SCS example.
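A client could split a prefixed Publication-id into its identifier type and value as sketched below. The list of accepted prefixes is limited to the three mentioned above, the function name `split_identifier` is hypothetical, and the sample identifiers are placeholders.

```python
# Prefixes mentioned in this document; a real list may be longer.
KNOWN_PREFIXES = ("bibcode", "doi", "ror")


def split_identifier(value):
    """Return (identifier type, identifier), or (None, value) if unprefixed."""
    prefix, sep, rest = value.partition(":")
    if sep and prefix.lower() in KNOWN_PREFIXES:
        return prefix.lower(), rest
    # Unknown or missing prefix: return the value untouched.
    return None, value


print(split_identifier("doi:10.0000/example.1"))
print(split_identifier("ivo://example.org/some/resource"))
```

Restricting the split to known prefixes keeps identifiers that contain a colon for other reasons, such as an ivoid, intact.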
- Complex output involving several tables (e.g. TAP query, ObsCore result)
Dataset-origin depends on each table used for the output. Data models like Last-step Provenance or DatasetDM make it possible to gather the metadata.
DatasetDM Example:
meta-data | Description | Mandatory |
---|---|---|
dataset:productType | | |
dataset:productSubType | controlled vocabulary | |
dataset:DataID.datasetDID | dataset ivoid | yes |
dataset:DataID.title | dataset title | |
dataset:DataID.creationType | type of resource | |
dataset:DataID.date | Publication date of original dataset/article | |
dataset:Party.name | (first) author | |
dataset:Curation.publisherDID | data-center identifier (ivoid) | yes |
dataset:Curation.rights | rights text | |
dataset:Curation.releaseDate | Data-center publication date | yes |
party.Organisation.email | Data-center contact | |
dataset:Curation.doi | Dataset DOI | |
dataset:Curation.bibcode | Dataset bibcode | |
Serialisation example: DatasetDM serialisation; see the TAP example
(see also: DatasetDM in TAP (ivoa-talk))
This document describes simple means to declare basic provenance information in the Virtual Observatory.
Stable versions of this document are available through the IVOA document repository.
To build a PDF version of this document, you will need a reasonably complete LaTeX installation, a sufficiently capable make, preferably latexmk and probably rsvg-convert. For further details, see ivoatexDoc.
This document is distributed under CC-BY-SA.