Adding Syriac dataset from Vienna Winterschool 2024 #174

Open
cmroughan opened this issue Jan 21, 2025 · 1 comment

@cmroughan

Hello! @ephremishac and I would like to submit the following dataset for Syriac: 140 folios of Serto from the 16th century, transcribed as part of the Vienna HTR Winter School.

Here is our dataset YAML file:

schema: https://htr-united.github.io/schema/2023-06-27/schema.json
title: ÖNB Cod. Syr. 1, Ground Truth from HTR Winter School 2024
url: https://github.com/HTR-School-Vienna/2024--Syriac
authors:
 - name: Ephrem
   surname: Aboud Ishac
   orcid: 0000-0003-2943-6556
   roles:
     - project-manager
 - name: Christine
   surname: Roughan
   orcid: 0009-0004-5999-8749
   roles:
     - project-manager
 - name: Ammar
   surname: Awad
   roles:
     - transcriber
 - name: Carlo Emilio
   surname: Biuzzi
   orcid: 0000-0002-6108-3650
   roles:
     - transcriber
 - name: Saranya
   surname: Chandran
   roles:
     - transcriber
 - name: Jennifer
   surname: Griggs
   orcid: 0000-0002-7857-806X
   roles:
     - transcriber
 - name: Polina
   surname: Ivanova
   orcid: 0009-0002-6853-2129
   roles:
     - transcriber
 - name: Branko
   surname: Malešević
   orcid: 0009-0008-2419-6323
   roles:
     - transcriber
 - name: Stefan
   surname: Marić
   orcid: 0009-0008-5129-1932
   roles:
     - transcriber
 - name: Francesca
   surname: Nateri
   roles:
     - transcriber
 - name: Ivan
   surname: Petrov
   orcid: 0000-0003-4386-0097
   roles:
     - transcriber
 - name: Cristina
   surname: Tava
   roles:
     - transcriber
 - name: Maria S.
   surname: Thomas
   orcid: 0009-0008-1416-3499
   roles:
     - transcriber
institutions: []
description: >-
 Ground truth of 140 folios of ÖNB Cod. Syr. 1. This ground truth was produced
 by participants of the Vienna 2024 HTR Winter School, who used Transkribus to
 manually correct a preliminary automatic transcription that had been generated
 using Kraken/eScriptorium.
language:
 - syr
production-software: Transkribus
automatically-aligned: false
script:
 - iso: Syrj
script-type: only-manuscript
time:
 notBefore: '1545'
 notAfter: '1545'
hands:
 count: '1'
 precision: exact
license:
 name: CC-BY 4.0
 url: https://creativecommons.org/licenses/by/4.0/
format: Page-XML
volume:
 - metric: lines
   count: 2869
citation-file-link: https://github.com/HTR-School-Vienna/2024--Syriac/blob/main/CITATION.cff
transcription-guidelines: >-
 The segmentation of the folios followed the SegmOnto vocabulary for annotation
 of regions:


 - MainZone: the main column of text.

 - MainZone-gold: any sections of the main column where the text is written in
 gold block characters, as in the start of the text here. (The - character is a
 substitution for SegmOnto's recommended : character for declaring subtypes,
 since Transkribus did not allow for use of the colon character in the region
 name.)

 - MarginTextZone: any marginal words or phrases, including catchwords. Also
 used for interlinear glosses.

 - NumberingZone: any page or folio numbers.


 The transcription includes spaces, the Syriac letters, some diacritics,
 punctuation, and no vowel dots or markings.


 - Allowed diacritics:
  - Syome
  - Dots over feminine suffix heh
  - Dots in pronouns: above for demonstrative, below for personal
  - Dots in verbs: to distinguish participles and perfects
  - Dots to distinguish homographs
 - Excluded diacritics:
  - Vowel dots
  - Dots of hardening and softening (qushoyo and rukokho)

 Punctuation marks were not normalized, but rather transcribed as they appear
 in the manuscript (. ܆ ܇ : ܀).


 Transkribus's unclear tag was used when readings were uncertain or the text
 was damaged or unclear.
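
For anyone reusing the data, the two conventions above (the `-` separator standing in for SegmOnto's `:`, and the list of excluded diacritics) could be checked with a short script. The following is only a minimal sketch under stated assumptions: the function names are hypothetical and not part of the dataset repository, and the set of excluded code points (Syriac vowel points U+0730 to U+073F, plus qushshaya U+0741 and rukkakha U+0742) is my reading of the guidelines rather than an official list.

```python
# Hypothetical helpers, not shipped with the dataset.

TRANSKRIBUS_SEPARATOR = "-"   # used in the region names of this dataset
SEGMONTO_SEPARATOR = ":"      # separator recommended by SegmOnto for subtypes


def to_segmonto_label(region_name: str) -> str:
    """Map a Transkribus region name such as 'MainZone-gold' to
    'MainZone:gold'. Names without a subtype are returned unchanged."""
    zone, sep, subtype = region_name.partition(TRANSKRIBUS_SEPARATOR)
    return f"{zone}{SEGMONTO_SEPARATOR}{subtype}" if sep else zone


# Assumption: marks the guidelines exclude, expressed as Unicode code points.
# U+0730..U+073F are the Syriac vowel points; U+0741 and U+0742 are the
# dots of hardening and softening (qushshaya and rukkakha).
EXCLUDED_MARKS = set(range(0x0730, 0x0740)) | {0x0741, 0x0742}


def find_excluded_marks(line: str) -> list[str]:
    """Return the code points (as 'U+XXXX') of excluded marks in a line."""
    return [f"U+{ord(ch):04X}" for ch in line if ord(ch) in EXCLUDED_MARKS]


if __name__ == "__main__":
    print(to_segmonto_label("MainZone-gold"))
    print(find_excluded_marks("\u0712\u0730"))  # beth followed by a vowel point
```

Allowed marks such as syome (encoded as the combining diaeresis U+0308) are deliberately left out of `EXCLUDED_MARKS`, so a line containing only permitted diacritics yields an empty list.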
@cmroughan

Following up: the dataset is now available on Zenodo in addition to the GitHub repository. Because the number and size of the image files seem to use up GitHub's monthly bandwidth rather quickly, I would like to update our submission and change the value of url:

  • old version: url: https://github.com/HTR-School-Vienna/2024--Syriac
  • new version: url: https://doi.org/10.5281/zenodo.14714089
