Allow constructing a SubString{String} with codeunit indexing, even if substring isn't valid UTF-8. #58048

NHDaly · 2025-04-08T17:15:10Z

In Julia, we are allowed to construct strings with non-UTF-8 data. Per the docstring for String:

while they are interpreted as being UTF-8 encoded, they can be composed of any byte sequence. Use isvalid to validate that the underlying byte sequence is valid as UTF-8.

Julia provides a series of functions for indexing a string via codeunits, such as codeunit(str, i) -> UInt8 and codeunits(str::String) -> Base.CodeUnits.
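
For illustration, a minimal sketch of these codeunit-level functions on a string that is not valid UTF-8. The variable names are only for this example, and the string is the same three-byte value used elsewhere in this issue:

```julia
# One stray continuation byte followed by the two-byte encoding of Ψ.
s = "\xa8\xce\xa8"

ncodeunits(s)          # number of bytes in the string
codeunit(s, 1)         # the first byte, as a UInt8
bytes = codeunits(s)   # Base.CodeUnits wrapper over the same bytes, no copy
length(s)              # number of characters, counting the stray byte as one
isvalid(s)             # whether the bytes form valid UTF-8
```

Indexing through codeunit and codeunits never consults character boundaries, which is why it works on arbitrary byte sequences.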

However, we cannot use SubString{String} to build a view over a string holding non-UTF-8 data when the bounds of the view are given as codeunit indices.
This is surprising, since the underlying struct appears architected to support it:

julia> dump(view("\xa8\xce\xa8", 1:1))
SubString{String}
  string: String "\xa8Ψ"
  offset: Int64 0
  ncodeunits: Int64 1

but we cannot construct it, since the default constructor has been replaced with one taking a start and end character offset.

struct SubString{T<:AbstractString} <: AbstractString
    string::T
    offset::Int
    ncodeunits::Int

    function SubString{T}(s::T, i::Int, j::Int) where T<:AbstractString
        i ≤ j || return new(s, 0, 0)
        @boundscheck begin
            checkbounds(s, i:j)
            @inbounds isvalid(s, i) || string_index_err(s, i)
            @inbounds isvalid(s, j) || string_index_err(s, j)
        end
        return new(s, i-1, nextind(s,j)-i)
    end
end
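
To make the effect of this inner constructor concrete, here is a small sketch of its two relevant behaviors on the string from above: an end index that starts a multi-byte character is extended to cover the whole character, and an end index that falls inside a character is rejected. The names v and err are only for this example:

```julia
s = "\xa8\xce\xa8"

# Index 2 starts the character Ψ, so nextind extends the view over both of its bytes.
v = SubString(s, 1, 2)
ncodeunits(v)

# Index 3 is the second byte of Ψ, so it is not a valid character index
# and the bounds check in the constructor throws.
err = try
    SubString(s, 1, 3)
catch e
    e
end
err isa StringIndexError
```

Neither call can produce a view covering exactly the first two bytes.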

Can we provide an additional function that constructs a SubString{String} from an offset and ncodeunits, so that a SubString need not refer to valid UTF-8 data?


vtjnash · 2025-04-08T18:34:28Z

Can you comment on why you would need to construct a SubString which isn't a sub-slice of the parent string? Simple local testing seems to suggest it works just fine with malformed data, as long as your slices are well-formed:

julia> SubString("α\x80\x80\x80", 1, 3)
"α\x80"

NHDaly · 2025-04-08T19:01:08Z

Imagine that I am storing non-UTF-8 data in the string, the bytes are "\xa8\xce\xa8", and I want a SubString of that string as a view of the first 2 bytes. There is no way to get it without copying, since the current API indexes by character indices:

julia> s = "\xa8\xce\xa8"
"\xa8Ψ"

julia> ncodeunits(SubString{String}(s, 1, 2))
3

You can get it via a copy, like this, of course, but this is no longer a SubString of the original string:

julia> SubString{String}(String(codeunits(s)[1:2]))
"\xa8\xce"

julia> ncodeunits(SubString{String}(String(codeunits(s)[1:2])))
2

jakobnissen · 2025-04-08T19:05:36Z

I'm skeptical of this, but interested in the use case. If you want the first 2 bytes, independent of whether these two bytes actually correspond to characters, in what sense do you really want a string? That is, if you want the first two bytes, don't you actually want view(codeunits(s), 1:2)?
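
As a sketch of that suggestion, using the three-byte string from the comment above (the name bytes is only for this example): the result is a byte-vector view into the original string's data, not a string.

```julia
s = "\xa8\xce\xa8"

# A SubArray over the string's code units; the bytes are not copied.
bytes = view(codeunits(s), 1:2)

collect(bytes)                    # the first two bytes of s
bytes isa AbstractVector{UInt8}   # it behaves like a byte vector
bytes isa AbstractString          # but it is not a string type
```

This avoids the copy, at the cost of losing the AbstractString interface on the result.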

NHDaly · 2025-04-08T19:39:43Z

Yeah, I asked my colleague the same thing, but the issue is that we are using the same data structure to store both UTF-8 strings and non-UTF-8 strings, and it's up to the caller to know whether or not they can call the UTF-8-specific functions.

Which, I will note, is exactly the same choice made by julia's String type. You can store either utf-8 data or non-utf-8 data in a String, and you simply need to know (or check isvalid) whether your data is UTF-8.

(The context for this is that we are implementing the storage for strings in our database engine, which has the same loose definition of string data as Julia: a string can contain arbitrary data, but if it is utf-8, you can use the corresponding functions for it. So we want to support treating this data as an <: AbstractString for the cases where it is valid utf-8, and we also want to support indexing it by codeunit for the cases where it isn't, again just like String.

However, as a performance optimization in some contexts, we are loading a series of strings together from disk as one giant string, and then using SubStrings to refer to the individual strings. In such cases, we want the individual strings to behave like AbstractString. We just want to avoid copying the data if we can, to make it more efficient.)

jakobnissen · 2025-04-08T19:47:18Z

Julia does allow reading non-UTF8 in as strings, but it has never allowed slicing or indexing at invalid character indexes, even when constructing a normal String. It seems like if we allow constructing a view at invalid indices, we'd also have to allow slicing and indexing at invalid indices.
I see the use case. But perhaps it would be better served by reading the data in as an array, then using StringViews.jl to construct the views? After all, this use case doesn't exactly sound like "utf8 by convention, but we don't error on invalid strings" - this sounds like reading binary data and then selectively choosing to interpret some of these bytes as strings.

NHDaly · 2025-04-08T19:56:34Z

Interesting suggestion. Possibly that would be a better fit, we should look into it.

I think that we are something like "utf8 by convention, but if you store invalid strings loaded from binary data, that's okay too, as long as you accept that you might get errors on the functions that expect utf8 strings," which I thought was more like the Julia philosophy too. But maybe that's wrong?

> Julia does allow reading non-UTF8 in as strings, but it has never allowed slicing or indexing at invalid character indexes, even when constructing a normal String. It seems like if we allow constructing a view at invalid indices, we'd also have to allow slicing and indexing at invalid indices.

I guess that julia does support indexing strings via codeunits, through the codeunits(s) interface. But you're right that turning that into a sliced string is a bit of gymnastics:

julia> String(codeunits(s)[1:2])
"\xa8\xce"

I'm basically asking if we can have the same thing for String Views. Something like:

substring_view_codeunits(s, 1, 2)

or maybe just

SubString{String}(codeunits(s), 1:2)

or something?

Seelengrab · 2025-04-10T06:25:06Z

> Imagine that I am storing non utf-8 data in the string,

Maybe I'm missing something, but why not use a Memory for this? What advantage does using a String (and thus SubString{String}) have here? It doesn't sound like you're treating that data as an actual String in its own sense.

NHDaly · 2025-04-10T15:29:01Z

My point is that sometimes we are treating the data like a utf8 string. We want to be able to call string functions on it like uppercase, length, etc, which do assume utf8, but we also want to support byte-based functionality if it isn't utf8. Again, I will note that this is exactly the same flexibility that Julia String offers.

vtjnash · 2025-04-10T15:32:12Z

I think the main concern with that is if you haven't put any separators in your data (even at least \0), then those functions will return wrong answers and corrupt your data in this edge case. If you do have any sort of separator in the data, then SubString construction would have been okay.

jakobnissen · 2025-04-10T16:03:51Z

Maybe I'm dense (pun intended) but I still don't quite understand. You want to use String as a generic byte storage for a buffer that contains both UTF8 and non-UTF8.
Presumably, the substrings should only point to the UTF8 part of the data. That implies the start and end indices of the substring are also valid, so the current constructor shouldn't be a problem?
Here is an example:

julia> data = rand(UInt8, 500)
       s = "rødgrød med fløde"
       data[100:100+ncodeunits(s)-1] .= codeunits(s)
       str = String(data)
       view(str, 100:prevind(str, 100+ncodeunits(s)))
"rødgrød med fløde"

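A deterministic variant of the same idea, as a sketch: the buffer is filled with 0xff (a byte that never occurs in valid UTF-8) instead of random bytes, and the field accesses on the SubString are only there to show that the view shares the parent string rather than copying it. The name v is an assumption for this example.

```julia
s = "rødgrød med fløde"

# A 500-byte buffer of invalid UTF-8 with the valid text embedded at byte 100.
data = fill(0xff, 500)
data[100:100+ncodeunits(s)-1] .= codeunits(s)
str = String(data)

# Both bounds are valid character indices, so the existing constructor accepts them.
v = view(str, 100:prevind(str, 100 + ncodeunits(s)))

v == s             # the view compares equal to the embedded text
v.string === str   # the view refers to the parent string itself
v.offset           # byte offset of the view within the parent
isvalid(v)         # the view is valid UTF-8
isvalid(str)       # the parent as a whole is not
```
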
NHDaly added the "strings" and "design" labels on Apr 8, 2025.
