Authors: Jiewei Qian <qjw@google.com>, Matt Giuca <mgiuca@chromium.org>, Jon Napper <napper@google.com>
## Overview
Handwriting is a widely used input method; one key usage is recognizing text as users draw. This capability already exists on many operating systems (e.g. handwriting input methods), but the web platform today can't tap into it. Developers need to integrate third-party libraries (or cloud services), or develop native apps.

We want to add handwriting recognition capability to the web platform, so developers can use the handwriting recognition features readily available on the operating system.

This document describes our proposal for a Web Platform API for performing on-line handwriting recognition from recorded real-time user inputs (e.g. touch, stylus). The term “on-line” means the API recognizes text as users are drawing it.

## Problem Description
Conceptually, handwriting inputs are drawings. A drawing captures the information required to recreate a pen-tip movement for text recognition purposes. Take the handwritten “WEB” for example (a JavaScript sketch of this structure follows the list):

* A **drawing** consists of multiple ink strokes (e.g. the above letter E consists of three ink strokes).
* An **ink stroke** represents one continuous pen-tip movement that happens across some time period (e.g. from one `touchstart` to its corresponding `touchend` event). The movement trajectory is represented by a series of ink points.
* An **ink point** is an observation of the pen-tip in space and time. It records the timestamp and position of the pen-tip on the writing surface (e.g. a `touchmove` event).
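
As a minimal sketch, a drawing like the one above could be captured in JavaScript as follows. The attribute names (`x`, `y`, `t`) are illustrative assumptions, not part of this proposal's surface:

```js
// A drawing is a list of strokes; a stroke is a list of ink points
// (one continuous pen-tip movement, e.g. touchstart..touchend).
// Each ink point records a position and a timestamp in milliseconds.
const drawing = [
  [{ x: 10, y: 10, t: 0 }, { x: 10, y: 50, t: 120 }],   // stroke 1
  [{ x: 10, y: 10, t: 400 }, { x: 40, y: 10, t: 470 }], // stroke 2
];
```
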
We want the handwriting API to enable web developers to fully utilize the capabilities available in common handwriting recognition libraries. The recognizer will need to:
* Accept a vector representation of a drawing (described in the above picture).
* Recognize texts as users are writing, in real time (each recognition should take at most a few hundred milliseconds).
* Not rely on the Internet (a note-taking website should still work in flight mode), though the recognizer can make use of cloud services if available.
* Return the most likely written text as a string.
* Allow web developers to control or fine-tune the recognizer. For example, allow developers to specify the language (an English recognizer won’t recognize Chinese characters).
* Offer an extensible way to add support for new features, in order to utilize the latest features available on the underlying libraries.
* Provide a way for developers to query feature support, so developers can decide if the recognizer should be used in their app (a sketch follows this list).
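
To sketch how these requirements might look to a web developer. All names here, such as `createHandwritingRecognizer`, are hypothetical placeholders; the concrete API shape is defined later in this explainer:

```js
// Hypothetical sketch only: method names are placeholders, not the
// normative API surface proposed in this document.
async function recognize() {
  // Query feature support, so the app can fall back (e.g. to a bundled
  // JS/WASM recognizer) or hide its handwriting UI.
  if (!navigator.createHandwritingRecognizer) return null;

  // Fine-tune the recognizer: an English model won't recognize Chinese.
  const recognizer = await navigator.createHandwritingRecognizer({
    languages: ['en'],
  });

  // Real-time use: feed strokes as the user writes, then ask for text.
  const drawing = recognizer.startDrawing({ alternatives: 3 });
  drawing.addStroke([
    { x: 10, y: 10, t: 0 },   // ink points, as sketched above
    { x: 10, y: 50, t: 120 },
  ]);
  const result = await drawing.getPrediction(); // well under a second
  return result.text;                           // most likely text, as a string
}
```
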
To satisfy common use cases, the recognizer also needs to:
* Return a list of alternatives (candidates) for the text.
* Rank alternatives based on likelihood of being correct.
* Return segmentation results for each character (or word), so clients know which strokes (and points) make up a character. One use case: in note-taking apps, users can select recognized text and delete all of the corresponding strokes.

Non-goals:
* Design an API to recognize text in static images (off-line recognition). That is optical character recognition, which is better aligned with the Shape Detection API.
* Deliver consistent output across all platforms. This is very difficult unless we implement a common (publicly owned) algorithm. Therefore, we allow the same drawing to yield different outputs, but we aim for a common output structure (e.g. what a given attribute means, and which attributes must be present).
## Existing APIs
A **prediction result** is a JavaScript object. It _must_ contain the `text` attribute.

The prediction result _may_ contain (if the implementation chooses to support; an example follows the list):
* `alternatives`: A list of JavaScript objects, where each object has a `text` field. These are the next best predictions (alternatives), in decreasing confidence order, up to a maximum of `alternatives` (given in hints) strings. For example, the first string is the second best prediction (the best being the prediction result itself).
* `segmentationResult`: [TODO] Come up with a way to represent text segmentation.
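
For illustration, a prediction result for a drawing of the word “hello” might look like this (values are made up):

```js
// Illustrative shape of a prediction result, following the description above.
const predictionResult = {
  text: 'hello',        // required: the best prediction
  alternatives: [       // optional: next best predictions, best first
    { text: 'hallo' },  // the second best prediction
    { text: 'hells' },  // the third best prediction
  ],
  // segmentationResult: ...  (optional; representation is still TODO)
};
```
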
### Segmentation result
[TODO] Come up with a way to represent grapheme set segmentation.

[TODO] Per-character (grapheme set) segmentation vs. word (phrase) segmentation. Or make this configurable?
### Why not use Web Assembly?

Web Assembly would not allow the use of more advanced proprietary handwriting libraries (e.g. those available on the operating system). Web developers would also need to manage distribution (and updates) of such libraries, which might take several megabytes per update.

A Web API can do the same task more efficiently (better models, zero distribution cost, faster computation). This topic was previously discussed for the Shape Detection API and the Text-to-Speech API.
### Why not use Shape Detection?
Handwriting (in this proposal) includes temporal information (how the shape is drawn, pixel by pixel). We believe this additional temporal information distinguishes handwriting recognition from shape detection.

If we take out the temporal information, the task becomes optical character recognition (given a photo of written characters). This is a different task, and indeed fits within the scope of shape detection.
### Grapheme clusters vs. Unicode code points
A grapheme cluster is the minimal unit used in writing; it represents a visual shape. Unicode code points, on the other hand, are a computer's internal representation; they represent meaning. The two concepts aren't fully correlated.

A Unicode combining mark is represented by a single code point. Combining marks modify other characters, and aren't used by themselves. This creates a problem when we need to distinguish between shape and meaning. For example, the letter a (U+0061) and the grave accent combining mark (U+0300) combine to à. The letter न (U+0928) and the combining mark ि (U+093F) combine to नि.

Handwriting recognition concerns both shape (input) and meaning (output), and it's important to distinguish between the two. For example, when requesting that the recognizer only recognize certain characters, grapheme clusters should be used.
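
This distinction is observable in JavaScript today, which makes for a handy illustration:

```js
const s = 'a\u0300'; // letter a (U+0061) + combining grave accent (U+0300), rendered as à

console.log([...s].length); // 2 (two Unicode code points: the "meaning" view)

// Intl.Segmenter splits a string into grapheme clusters (the "shape" view).
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
console.log([...segmenter.segment(s)].length); // 1 (a single grapheme cluster)
```
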
### Ranking vs Score
It's very common to use a score to assess alternative texts; most machine learning algorithms produce one internally. However, exposing such a score is not a good idea for the Web.

We expect different browser vendors to offer varying recognizer implementations, which will inevitably make their scores incomparable.

Because the score is an implementation detail of a machine learning model, its meaning changes whenever the model changes. Therefore, scores are not comparable unless everyone uses the same model.

Thus, we choose ranking instead of a score. Ranking still gives some indication of which alternative is better, while avoiding the scenario where web developers misunderstand the score's implications and try to compare scores across different libraries, or filter results based on them.
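
Under this design, clients consume alternatives by position rather than by numeric score. A small sketch, reusing the illustrative `predictionResult` from earlier:

```js
// Candidates are consumed by rank (array order), not by thresholding a
// score: index 0 of alternatives is the second-best prediction overall.
const candidates = [
  predictionResult.text,
  ...(predictionResult.alternatives ?? []).map((a) => a.text),
];
const topThree = candidates.slice(0, 3); // e.g. populate a suggestion picker
```
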