Data Origin in the VO

The goal of this document is to make Data Origin more visible in the results of queries executed in the Virtual Observatory. It lists the metadata required to give end users sufficient traceability, in order to improve their understanding of result sets and to enable reuse and citation.

Use cases

(Data Origin information) A researcher has data in a VOTable that shows an odd feature. They would now like to talk to the creator of the data to help figure out whether that feature is physics or an artefact.

Requirement: contact information for the producers should be present. This should not be a MUST, however: it can be GDPR-relevant data, and it must be possible to leave it out in that case.

The researcher completes their understanding with Data Origin information that is easily accessible from the VOTable, regardless of the service that generated the result, for instance a URL that links to an article. [The information could contain the author, the year of publication, and related resources such as an article or the original data URL]

When the data provided by the service are derived from external resources, or have undergone additional curation, the nature of and the links to the external resources are available.

For instance, a table published in a journal or by a space agency is also hosted in a Data Center such as CDS, GAVO, etc. The data curation depends on the Data Center, which can add associated data, enrich metadata (e.g. add the filter for a magnitude) or make a sub-selection of columns. [An advanced serialisation could be based on the DOI vocabulary: "IsVariantFormOf", "IsDerivedFrom", ...]

(Reproducibility) A researcher revisits work they did six months earlier in an ad-hoc fashion and would now like to reproduce it in a more structured fashion. To do that, they need to know, say, which queries against which services, or perhaps which programs, produced the files. [Requirement: have the request parameters and a service identification (access url? ivoid?) in the data origin]

(Citation) While preparing a publication, a researcher would like to properly cite the software and data that went into their results. They now run a program to extract that information from the digital artefacts going into the publication, perhaps even split into separate citations and acknowledgements. [Requirement: The data origin must indicate requests for citation and/or acknowledgement in a machine-readable way, preferably in a way that lets machines generate BibTeX for whatever they specify]

This information allows the researcher to fill in the citation template requested by journals.

Example (American Astronomical Society template):

"we searched optical astrometric data of these sources from the Gaia (Gaia Collaboration et al. 2016) Early Data Release 3 (Gaia Collaboration et al. 2021) via the Gaia archive (Gaia Collaboration 2020)."
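The citation use case could be served by a small tool along the following lines. This is only a sketch under stated assumptions: the helper `origin_to_bibtex`, the record keys (`creator`, `title`, `year`, `publisher`, `publication_id`) and the example values are hypothetical and are not defined by this document.

```python
def origin_to_bibtex(key, origin):
    """Render a (hypothetical) Data Origin record as a BibTeX @misc entry."""
    fields = {
        "author": origin.get("creator"),
        "title": origin.get("title"),
        "year": origin.get("year"),
        "publisher": origin.get("publisher"),
    }
    # Assumption: an identifier prefixed with "doi:" maps to the BibTeX doi field.
    pub_id = origin.get("publication_id", "")
    if pub_id.startswith("doi:"):
        fields["doi"] = pub_id[len("doi:"):]

    lines = [f"@misc{{{key},"]
    for name, value in fields.items():
        if value:  # skip metadata that the service did not provide
            lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)


# Placeholder values, for illustration only.
entry = origin_to_bibtex(
    "example2021",
    {
        "creator": "Example Collaboration",
        "year": "2021",
        "publication_id": "doi:10.0000/example",
    },
)
print(entry)
```

Fields that are absent from the record are simply omitted, which matches the idea that most Data Origin metadata is optional.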

(Workflow) "Give me a bibliography of everything I’ve used in the workflow."

The VOTables resulting from a session contain homogenized metadata that can be merged and compared.
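As an illustration of the workflow use case, the sketch below merges the citation identifiers found in several results into a single bibliography. The record layout (`ivoid`, `publication_ids`), the function name `merge_bibliography` and all values are assumptions made for this example only.

```python
def merge_bibliography(results):
    """Map each citation identifier to the services (ivoids) that returned it.

    `results` is a list of hypothetical Data Origin records, one per VOTable
    collected during a session.
    """
    merged = {}
    for origin in results:
        for pub_id in origin["publication_ids"]:
            services = merged.setdefault(pub_id, [])
            if origin["ivoid"] not in services:
                services.append(origin["ivoid"])
    return merged


# Placeholder records, for illustration only.
session = [
    {"ivoid": "ivo://example.a/tap", "publication_ids": ["doi:10.0000/one"]},
    {"ivoid": "ivo://example.b/scs",
     "publication_ids": ["doi:10.0000/one", "doi:10.0000/two"]},
]
bibliography = merge_bibliography(session)
for pub_id, services in bibliography.items():
    print(pub_id, "used via", ", ".join(services))
```

Such a merge only works because every service exposes the same metadata names, which is the point of homogenizing them.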

What else ? ....

Metadata expected

Tracing Data origin can be complex. It depends on the granularity expected.

A basic approach consists of adding information as key=value pairs. Each pair is independent, with no interaction with the others. This approach can generally be serialized in VOTable using the INFO tag.
An advanced approach consists of a rich serialization that allows pieces of information to interact. This approach requires an advanced VOTable serialization. MIVOT is one answer: it makes it possible to map data to data models such as DatasetDM or Provenance (or last-step provenance).
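The basic key=value approach might look like the following sketch, which attaches one INFO element per pair to a RESOURCE. It uses only the Python standard library; the function name `add_origin_infos` and the values are placeholders, and the VOTable is deliberately simplified (no namespace, no TABLE).

```python
import xml.etree.ElementTree as ET


def add_origin_infos(resource, origin):
    """Append one INFO element per key=value pair to a VOTable RESOURCE."""
    for name, value in origin.items():
        ET.SubElement(resource, "INFO", name=name, value=value)
    return resource


# Simplified VOTable skeleton (assumption: namespace omitted for brevity).
votable = ET.Element("VOTABLE", version="1.4")
resource = ET.SubElement(votable, "RESOURCE")
add_origin_infos(
    resource,
    {
        "ivoid": "ivo://example.org/cat",    # placeholder identifier
        "publisher": "Example Data Center",  # placeholder name
    },
)
print(ET.tostring(votable, encoding="unicode"))
```

Because each INFO is independent, a client can read any subset of the pairs without understanding the others.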

The following metadata can be repeated and could follow a controlled vocabulary.

Author: name or ORCID
Organization: name or URL
Editor: name or URL
Journal: name or URL
Datacenter that provides the result: name or URL
Contact: email
Resource Identifier: ivoid of resource(s) hosted by the service which provides the result
Resource citation: DOI, bibcode of resource(s) hosted by the datacenter which returns the result
Original resource identifier: a remote resource that was used to build the result
Publication date
Original publication date
Data version
Curation level: (controlled vocabulary)
Operation: operation, such as cutout or add-values, executed by the Data Center on the original data
Licence: (original) licences - a machine-readable URI is preferred
Access protocol: e.g. TAP query, SCS, ...
Query: e.g. ADQL
Software version: Data Center version
Comment: any additional plain-text information that completes the result
...?

Proposal

Query information

Query information makes it possible to link to the registry and to reproduce the query. For queries on evolving datasets, the version or the date must complete the information.

meta-data	Description	Mandatory
ivoid	ivoid identifier to link registry	yes
publisher	Data center that provides the VOTable	yes
version	Dataset version (or release date)
service_protocol	Access protocol with version
request	Request URL
request_post	(POST Request) POST arguments new
request_date	Query execution date
contact	email or URL contact
landing_page	Dataset landing page

Serialisation example: the <INFO> tag does the job. See the SCS example.
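On the client side, the query information could be read back and checked as sketched below. The mandatory names come from the table above; the function name `read_query_info`, the sample document and its values are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

MANDATORY = ("ivoid", "publisher")  # marked "yes" in the table above


def read_query_info(votable_xml):
    """Return (infos, missing): INFO name/value pairs and absent mandatory names."""
    root = ET.fromstring(votable_xml)
    infos = {}
    for elem in root.iter():
        # Compare local names so that namespaced VOTables also match.
        if elem.tag.rsplit("}", 1)[-1] == "INFO" and elem.get("name"):
            infos[elem.get("name")] = elem.get("value")
    missing = [name for name in MANDATORY if name not in infos]
    return infos, missing


# Placeholder document, for illustration only.
SAMPLE = """<VOTABLE version="1.3" xmlns="http://www.ivoa.net/xml/VOTable/v1.3">
  <RESOURCE>
    <INFO name="ivoid" value="ivo://example.org/cat"/>
    <INFO name="request_date" value="2024-01-01T00:00:00"/>
  </RESOURCE>
</VOTABLE>"""

infos, missing = read_query_info(SAMPLE)
print("missing mandatory metadata:", missing)
```

A validator built this way can report incomplete Data Origin without rejecting the result itself.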

Dataset Origin

Dataset-origin completes the "Query information".

Simple case: a single table in the output (e.g. SCS)

meta-data	Description	Mandatory
Publication-id	Dataset identifier that can be used for citation	yes
Curation-level	Controlled vocabulary
Resource-version	Dataset version or last release
Rights	Licence URI
Rights-type	Licence type (eg: CC-by, CC-0, private, public)
Copyrights	Copyright text
Creator	Dataset Author(s) or group
Publication-ref	Identifier of the original resource that can be an article or the origin Data Center
Editor	editor name
Relation_type	controlled vocabulary (VOResource: relationshipType ?) to specify the relation to the related resource new
related_resource	Original resource new
Publication-date	Date of the original publication
resource_date	Date of original resource new

Publication-id can be prefixed with the identifier type, e.g. bibcode:..., doi:..., ror:...
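A client could split such a prefixed identifier as in the sketch below. The list of recognised prefixes is taken from the examples in this section; the function name `split_identifier`, the fallback behaviour for unknown prefixes and the sample values are assumptions.

```python
KNOWN_TYPES = ("bibcode", "doi", "ror")  # prefixes mentioned in this section


def split_identifier(publication_id):
    """Split 'type:value' into (type, value); type is None if not recognised."""
    prefix, sep, rest = publication_id.partition(":")
    if sep and prefix.lower() in KNOWN_TYPES:
        return prefix.lower(), rest
    # Unknown or missing prefix: return the identifier unchanged.
    return None, publication_id


# Placeholder identifiers, for illustration only.
print(split_identifier("doi:10.0000/example"))
print(split_identifier("ivo://example.org/cat"))
```

Returning the untouched string for unknown prefixes keeps identifiers such as ivoids, which also contain a colon, from being split by mistake.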

Serialisation example: the <INFO> tag does the job. See the SCS example.

Complex output involving several tables (eg: TAP query, ObsCore result)

Dataset-origin depends on each table used for the output. Data models such as Last-step Provenance or DatasetDM make it possible to gather the metadata.

DatasetDM Example:

meta-data	Description	Mandatory
dataset:productType
dataset:productSubType	controlled vocabulary
dataset:DataID.datasetDID	dataset ivoid	yes
dataset:DataID.title	dataset title
dataset:DataID.creationType	type of resource
dataset:DataID.date	Publication date of original dataset/article
dataset:Party.name	(first)Author
dataset:Curation.publisherDID	data-center identifier (ivoid)	yes
dataset:Curation.rights	rights text
dataset:Curation.releaseDate	Data-center publication date	yes
party.Organisation.email	Data-center contact
dataset:Curation.doi	Dataset DOI
dataset:Curation.bibcode	Dataset bibcode

Serialisation example: DatasetDM serialisation. See the TAP example.

(See also: DatasetDM in TAP (ivoa-talk))

About

This document describes simple means to declare basic provenance information in the Virtual Observatory.

Stable versions of this document are available through the IVOA document repository.

To build a PDF version of this document, you will need a reasonably complete LaTeX installation, a sufficiently capable make, preferably latexmk and probably rsvg-convert. For further details, see ivoatexDoc.

This document is distributed under CC-BY-SA.
