@sliedes sliedes commented Jul 2, 2024

This change adds rudimentary hOCR output support. Notes:

  • Currently it only writes bounding boxes to the hOCR output, not baselines (which hOCR also supports)

  • It doesn't add any semantic layout information; instead, each word is simply emitted as an ocrx_word

  • Some of the metadata could be improved, for example by adding the real image name and perhaps the EasyOCR version number

  • I didn't check whether EasyOCR supports multipage inputs; if it does, this change will certainly break with them

  • I left this comment in the source code; I'm not sure what to do with it (probably shouldn't be enabled by default):

# In order to get a browser-renderable HTML file, you can add this before the closing </body> tag:
#
# <script src="https://unpkg.com/hocrjs"></script>
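
To make the notes above concrete, here is a minimal sketch of the general approach, not the code in this PR. It assumes each recognition result is a `(points, text, confidence)` tuple with `points` being four `[x, y]` corners, which is the shape `readtext` returns as far as I know. The names `bbox_to_rect` and `results_to_hocr`, the `ocr-system` value, and the default image name are all hypothetical. As in the PR, only `bbox` is written; baselines and confidence are ignored.

```python
from html import escape

# Hypothetical template; the PR's real header may differ.
HOCR_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="ocr-system" content="{system}" />
<meta name="ocr-capabilities" content="ocr_page ocrx_word" />
</head>
<body>
<div class="ocr_page" id="page_1" title="image &quot;{image}&quot;">
{words}
</div>
</body>
</html>
"""


def bbox_to_rect(points):
    """Collapse a four-point quadrilateral into an axis-aligned (x0, y0, x1, y1) box."""
    xs = [int(round(p[0])) for p in points]
    ys = [int(round(p[1])) for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def results_to_hocr(results, image_name="unknown", system="easyocr"):
    """Render (points, text, confidence) tuples as a single-page hOCR document.

    Every result becomes one ocrx_word span; no lines or paragraphs are inferred.
    """
    words = []
    for index, (points, text, _confidence) in enumerate(results, start=1):
        x0, y0, x1, y1 = bbox_to_rect(points)
        words.append(
            f'<span class="ocrx_word" id="word_{index}" '
            f'title="bbox {x0} {y0} {x1} {y1}">{escape(text)}</span>'
        )
    return HOCR_TEMPLATE.format(
        system=escape(system),
        image=escape(image_name),
        words="\n".join(words),
    )
```

The single hard-coded `ocr_page` element is also why multipage input would break a design like this: every result lands on `page_1` regardless of which page it came from.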

Other than that, I validated the output with hocr-check from https://github.com/ocropus/hocr-tools and also checked that it validates as XHTML.
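
For a quick sanity check without extra tooling, something like the following sketch can catch the most basic problems. It is not a substitute for hocr-check or a real XHTML validator: it only tests XML well-formedness plus a few structural properties that I am assuming a minimal hOCR file should have. The function name `check_hocr` and the exact set of checks are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://www.w3.org/1999/xhtml}'."""
    return tag.rsplit("}", 1)[-1]


def has_class(element, name):
    """True if the element's class attribute contains the given class name."""
    return name in (element.get("class") or "").split()


def check_hocr(text):
    """Return a list of problems; an empty list means these basic checks passed."""
    try:
        root = ET.fromstring(text.encode("utf-8"))
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    problems = []
    meta_names = {
        el.get("name") for el in root.iter() if local_name(el.tag) == "meta"
    }
    if "ocr-system" not in meta_names:
        problems.append("missing ocr-system meta element")
    if not any(has_class(el, "ocr_page") for el in root.iter()):
        problems.append("no ocr_page element")
    for el in root.iter():
        if has_class(el, "ocrx_word") and "bbox" not in (el.get("title") or ""):
            problems.append(f"ocrx_word without bbox: {el.text!r}")
    return problems
```

Calling `check_hocr(open("out.hocr", encoding="utf-8").read())` (with a hypothetical file name) returns an empty list when none of these checks fail.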
