unescape: Decode \u escaped characters for surrogate pairs correctly #9799

Draft
wants to merge 4 commits into
base: master
from

Conversation

cosmo0920
Contributor

@cosmo0920 cosmo0920 commented Jan 6, 2025

Currently, we ignore surrogate pairs when unescaping \u escapes in Unicode representations.
To handle this, the two \uXXXX escapes of a surrogate pair need to be decoded together into a single code point.
Note that when JSON is created, the decoded character is encoded as a \uXXXX representation again.
When msgpack is created, this unescaping operation takes effect.

Closes #9712.
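
For illustration only, here is a minimal sketch of the surrogate pair arithmetic described above. It is not the code in this PR; the helper names (`is_high_surrogate`, `is_low_surrogate`, `combine_surrogates`, `encode_utf8`) are hypothetical and are not part of Fluent Bit's API. It assumes the two 16-bit values have already been parsed from the `\uXXXX` escapes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not Fluent Bit functions. */

/* High (leading) surrogates occupy U+D800..U+DBFF. */
int is_high_surrogate(uint32_t cu)
{
    return cu >= 0xD800 && cu <= 0xDBFF;
}

/* Low (trailing) surrogates occupy U+DC00..U+DFFF. */
int is_low_surrogate(uint32_t cu)
{
    return cu >= 0xDC00 && cu <= 0xDFFF;
}

/* Combine a valid high/low pair into one code point above U+FFFF. */
uint32_t combine_surrogates(uint32_t hi, uint32_t lo)
{
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}

/*
 * Encode a code point as UTF-8 into out (at least 4 bytes).
 * Returns the number of bytes written, or 0 if cp is a lone
 * surrogate or lies beyond U+10FFFF.
 */
size_t encode_utf8(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return 0;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With the pair from the test input, `combine_surrogates(0xD83E, 0xDD17)` gives U+1F917, which `encode_utf8` writes as the four bytes `F0 9F A4 97`.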


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
$ bin/fluent-bit -i stdin -o stdout

and send {"text": "\ud83e\udd17"} in the same terminal.

  • Debug log output from testing the change
Fluent Bit v4.0.0
* Copyright (C) 2015-2024 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

______ _                  _    ______ _ _             ___  _____ 
|  ___| |                | |   | ___ (_) |           /   ||  _  |
| |_  | |_   _  ___ _ __ | |_  | |_/ /_| |_  __   __/ /| || |/' |
|  _| | | | | |/ _ \ '_ \| __| | ___ \ | __| \ \ / / /_| ||  /| |
| |   | | |_| |  __/ | | | |_  | |_/ / | |_   \ V /\___  |\ |_/ /
\_|   |_|\__,_|\___|_| |_|\__| \____/|_|\__|   \_/     |_(_)___/ 


[2025/01/06 18:17:45] [ info] Configuration:
[2025/01/06 18:17:45] [ info]  flush time     | 1.000000 seconds
[2025/01/06 18:17:45] [ info]  grace          | 5 seconds
[2025/01/06 18:17:45] [ info]  daemon         | 0
[2025/01/06 18:17:45] [ info] ___________
[2025/01/06 18:17:45] [ info]  inputs:
[2025/01/06 18:17:45] [ info]      stdin
[2025/01/06 18:17:45] [ info] ___________
[2025/01/06 18:17:45] [ info]  filters:
[2025/01/06 18:17:45] [ info] ___________
[2025/01/06 18:17:45] [ info]  outputs:
[2025/01/06 18:17:45] [ info]      stdout.0
[2025/01/06 18:17:45] [ info] ___________
[2025/01/06 18:17:45] [ info]  collectors:
[2025/01/06 18:17:45] [ info] [fluent bit] version=4.0.0, commit=09214ebc7b, pid=74663
[2025/01/06 18:17:45] [debug] [engine] coroutine stack size: 36864 bytes (36.0K)
[2025/01/06 18:17:45] [ info] [storage] ver=1.2.0, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2025/01/06 18:17:45] [ info] [simd    ] NEON
[2025/01/06 18:17:45] [ info] [cmetrics] version=0.9.9
[2025/01/06 18:17:45] [ info] [ctraces ] version=0.5.7
[2025/01/06 18:17:45] [ info] [input:stdin:stdin.0] initializing
[2025/01/06 18:17:45] [ info] [input:stdin:stdin.0] storage_strategy='memory' (memory only)
[2025/01/06 18:17:45] [debug] [stdin:stdin.0] created event channels: read=25 write=26
[2025/01/06 18:17:45] [debug] [input:stdin:stdin.0] buf_size=16000
[2025/01/06 18:17:45] [debug] [stdout:stdout.0] created event channels: read=28 write=29
[2025/01/06 18:17:45] [ info] [sp] stream processor started
[2025/01/06 18:17:45] [ info] [output:stdout:stdout.0] worker #0 started
{"text": "\ud83e\udd17"}
[2025/01/06 18:17:47] [debug] [task] created task=0x6000032ec000 id=0 OK
[2025/01/06 18:17:47] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1736155066.742995000, {}], {"text"=>"🤗"}]
[2025/01/06 18:17:47] [debug] [out flush] cb_destroy coro_id=0
[2025/01/06 18:17:47] [debug] [task] destroy task=0x6000032ec000 (task_id=0)
^C[2025/01/06 18:17:48] [engine] caught signal (SIGINT)
[2025/01/06 18:17:48] [ info] [output:stdout:stdout.0] thread worker #0 stopping...
[2025/01/06 18:17:48] [ info] [output:stdout:stdout.0] thread worker #0 stopped
  • Attached Valgrind output that shows no leaks or memory corruption was found
==80122== 
==80122== HEAP SUMMARY:
==80122==     in use at exit: 0 bytes in 0 blocks
==80122==   total heap usage: 2,999 allocs, 2,999 frees, 1,341,761 bytes allocated
==80122== 
==80122== All heap blocks were freed -- no leaks are possible
==80122== 
==80122== For lists of detected and suppressed errors, rerun with: -s
==80122== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@cosmo0920 cosmo0920 force-pushed the cosmo0920-decode-slash-u-escaped-characters branch from 27a363e to b758796 Compare January 6, 2025 10:10
@cosmo0920 cosmo0920 force-pushed the cosmo0920-decode-slash-u-escaped-characters branch from b758796 to dfe15aa Compare January 6, 2025 10:45
@cosmo0920 cosmo0920 force-pushed the cosmo0920-decode-slash-u-escaped-characters branch from e02c52c to b4c023e Compare January 6, 2025 10:58
@vit-zikmund

vit-zikmund commented Jan 6, 2025

Thanks for following up on this so quickly, @cosmo0920!
I see you're struggling with the error cases there. Since the unescaping function is expected to return the number of processed bytes, wouldn't it be better to stick to that and, in the error cases, set the returned ch to the replacement character (ch = L'\uFFFD'), as I suggested in the footnote of my issue comment?

On the other hand, staying strict and rejecting that sequence is likely much better for the user, who won't suddenly find magic replacements in their data.
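
To make the two options concrete, here is a rough sketch of both policies behind one flag. It is an assumption-laden illustration, not the PR's implementation: `parse_hex4`, `decode_u_escape`, and the `strict` parameter are hypothetical names, and the real unescape function may have a different signature. In lenient mode the function always reports the bytes it consumed and substitutes U+FFFD for a broken surrogate; in strict mode it rejects the sequence with -1.

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu

/* Hypothetical helper: parse exactly four hex digits; 0 on success, -1 otherwise. */
int parse_hex4(const char *s, size_t len, uint32_t *out)
{
    uint32_t v = 0;
    size_t i;

    if (len < 4) {
        return -1;
    }
    for (i = 0; i < 4; i++) {
        char c = s[i];
        v <<= 4;
        if (c >= '0' && c <= '9')      v |= (uint32_t) (c - '0');
        else if (c >= 'a' && c <= 'f') v |= (uint32_t) (c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') v |= (uint32_t) (c - 'A' + 10);
        else return -1;
    }
    *out = v;
    return 0;
}

/*
 * Hypothetical decoder for an escape starting at in[0] == '\\', in[1] == 'u'.
 * Returns the number of input bytes consumed (6 or 12) and stores the code
 * point in *cp. Returns -1 for a malformed escape, or for a broken surrogate
 * pair when strict is non-zero.
 */
int decode_u_escape(const char *in, size_t len, int strict, uint32_t *cp)
{
    uint32_t hi;
    uint32_t lo;

    if (len < 6 || in[0] != '\\' || in[1] != 'u' ||
        parse_hex4(in + 2, len - 2, &hi) != 0) {
        return -1;
    }

    if (hi >= 0xD800 && hi <= 0xDBFF) {
        /* High surrogate: a \uDC00..\uDFFF escape must follow immediately. */
        if (len >= 12 && in[6] == '\\' && in[7] == 'u' &&
            parse_hex4(in + 8, len - 8, &lo) == 0 &&
            lo >= 0xDC00 && lo <= 0xDFFF) {
            *cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
            return 12;
        }
        if (strict) {
            return -1;              /* reject the sequence */
        }
        *cp = REPLACEMENT_CHAR;     /* substitute and keep going */
        return 6;
    }

    if (hi >= 0xDC00 && hi <= 0xDFFF) {
        /* Low surrogate with no preceding high surrogate. */
        if (strict) {
            return -1;
        }
        *cp = REPLACEMENT_CHAR;
        return 6;
    }

    *cp = hi;                       /* ordinary BMP code point */
    return 6;
}
```

The trade-off is the one raised above: the lenient path keeps the "bytes processed" contract simple, while the strict path avoids silently rewriting user data.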

@cosmo0920 cosmo0920 force-pushed the cosmo0920-decode-slash-u-escaped-characters branch from b4c023e to daee871 Compare January 8, 2025 04:12
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
@cosmo0920 cosmo0920 force-pushed the cosmo0920-decode-slash-u-escaped-characters branch from 7f20877 to 8febcae Compare January 8, 2025 05:41
@cosmo0920 cosmo0920 force-pushed the cosmo0920-decode-slash-u-escaped-characters branch from 154dfee to 8febcae Compare January 8, 2025 11:16
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect parsing of escaped characters from higher unicode planes in a JSON string
2 participants