Allow constructing a SubString{String} with codeunit indexing, even if substring isn't valid UTF-8. #58048

Open
NHDaly opened this issue Apr 8, 2025 · 10 comments
Labels
design Design of APIs or of the language itself strings "Strings!"

Comments

@NHDaly
Member

NHDaly commented Apr 8, 2025

In Julia, we are allowed to construct strings with non-UTF-8 data. Per the docstring for String:

while they are interpreted as being UTF-8 encoded, they can be composed of any byte sequence. Use isvalid to validate that the underlying byte sequence is valid as UTF-8.

Julia provides a series of functions for indexing a string via codeunits, such as codeunit(str, i) -> UInt8 and codeunits(str::Str) -> Base.CodeUnits.

However, we cannot use SubString{String} to build a view over a string holding non-UTF-8 data when the bounds of the view are given as code unit indices.
This is surprising, since the underlying struct appears architected to support it:

julia> dump(view("\xa8\xce\xa8", 1:1))
SubString{String}
  string: String "\xa8Ψ"
  offset: Int64 0
  ncodeunits: Int64 1

but we cannot construct it, since the default constructor has been replaced with one taking a start and end character offset.

struct SubString{T<:AbstractString} <: AbstractString
    string::T
    offset::Int
    ncodeunits::Int

    function SubString{T}(s::T, i::Int, j::Int) where T<:AbstractString
        i ≤ j || return new(s, 0, 0)
        @boundscheck begin
            checkbounds(s, i:j)
            @inbounds isvalid(s, i) || string_index_err(s, i)
            @inbounds isvalid(s, j) || string_index_err(s, j)
        end
        return new(s, i-1, nextind(s,j)-i)
    end
end

Can we provide an additional function to allow constructing a SubString{String} via offset and ncodeunits, allowing a SubString to not refer to a valid utf-8 string?
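
To make the constructor's arithmetic concrete, here is a minimal sketch using the same three-byte string as the dump example. The variable names are only illustrative. It shows that the stored length comes from nextind(s, j) - i, so an end index on the first byte of a multi-byte character pulls in the whole character, and an end index on a continuation byte is rejected.

```julia
s = "\xa8\xce\xa8"   # bytes: 0xa8, then 0xce 0xa8 (the UTF-8 encoding of Ψ)

# Ending at index 1 covers exactly one code unit, as in the dump output.
v1 = SubString(s, 1, 1)

# Index 2 is the first byte of Ψ, so nextind(s, 2) == 4 and the view
# spans 4 - 1 = 3 code units rather than the 2 that were asked for.
v2 = SubString(s, 1, 2)

# Index 3 is a continuation byte, so it is not accepted as an end index.
rejected = try
    SubString(s, 1, 3)
    false
catch err
    err isa StringIndexError
end

println((ncodeunits(v1), ncodeunits(v2), rejected))
```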

@NHDaly NHDaly added strings "Strings!" design Design of APIs or of the language itself labels Apr 8, 2025
@vtjnash
Member

vtjnash commented Apr 8, 2025

Can you comment on why you would need to construct a SubString which isn't a sub-slice of the parent string? Simple local testing seems to suggest it works just fine with malformed data, as long as your slices are well-formed

julia> SubString("α\x80\x80\x80", 1, 3)
"α\x80"

@NHDaly
Member Author

NHDaly commented Apr 8, 2025

Imagine that I am storing non utf-8 data in the string, and I want to get a SubString of that string, as a view to the first 2 bytes, and the bytes are "\xa8\xce\xa8". There is no way to get it without copying, since the current API is indexing based on character indexes:

julia> s = "\xa8\xce\xa8"
"\xa8Ψ"

julia> ncodeunits(SubString{String}(s, 1, 2))
3

You can get it via a copy, like this, of course, but this is no longer a SubString of the original string:

julia> SubString{String}(String(codeunits(s)[1:2]))
"\xa8\xce"

julia> ncodeunits(SubString{String}(String(codeunits(s)[1:2])))
2

@jakobnissen
Member

I'm skeptical of this, but interested in the use case. If you want the first 2 bytes, independent of whether these two bytes actually correspond to characters, in what sense do you really want a string? That is, if you want the first two bytes, don't you actually want view(codeunits(s), 1:2)?
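
For reference, a small sketch of what that suggestion looks like in practice, reusing the string from the earlier comment. The result is a byte view, not an AbstractString, which is the trade-off under discussion.

```julia
s = "\xa8\xce\xa8"

# A non-copying view of the first two code units of s.
bytes = view(codeunits(s), 1:2)

println(bytes isa SubArray)   # a view, not a String
println(length(bytes))
```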

@NHDaly
Member Author

NHDaly commented Apr 8, 2025

Yeah, I asked my colleague the same thing, but the issue is that we are using the same data structure to store both UTF-8 strings and non-UTF-8 strings, and it's up to the caller to know whether or not the UTF-8-specific functions can be called.

Which, I will note, is exactly the same choice made by julia's String type. You can store either utf-8 data or non-utf-8 data in a String, and you simply need to know (or check isvalid) whether your data is UTF-8.

(The context for this is that we are implementing the storage for strings in our database engine, which has the same loose definition of string data as Julia: a string can contain arbitrary data, but if it is utf-8, you can use the corresponding functions for it. So we want to support treating this data as an <: AbstractString for the cases where it is valid utf-8, and we also want to support indexing it by codeunit for the cases where it isn't, again just like String.

However, as a performance optimization in some contexts, we are loading a series of strings together from disk as one giant string, and then using SubStrings to refer to the individual strings. In such cases, we want the individual strings to behave like AbstractString. We just want to avoid copying the data if we can, to make it more efficient.)
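
As a rough sketch of the batching scheme described above, assuming every record is valid UTF-8 (the records and variable names here are made up for illustration): the records are packed into one String and each one is exposed as a SubString over its code unit range.

```julia
pieces = ["rød", "grød", "fløde"]   # hypothetical records, all valid UTF-8
giant = join(pieces)                # one backing string for the whole batch

views = SubString{String}[]
start = 1
for p in pieces
    stop = start + ncodeunits(p) - 1    # last code unit of this record
    # SubString expects the index of the last character, not the last byte,
    # so step back from the first code unit of the following record.
    push!(views, SubString(giant, start, prevind(giant, stop + 1)))
    global start = stop + 1
end

println(views)
```

This only works because each record boundary is also a valid character boundary in the packed string, which is the property that stops holding for arbitrary bytes.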

@jakobnissen
Member

Julia does allow reading non-UTF8 in as strings, but it has never allowed slicing or indexing at invalid character indexes, even when constructing a normal String. It seems like if we allow constructing a view at invalid indices, we'd also have to allow slicing and indexing at invalid indices.
I see the use case. But perhaps it would be better served by reading the data in as an array, then using StringViews.jl to construct the views? After all, this use case doesn't exactly sound like "utf8 by convention, but we don't error on invalid strings" - this sounds like reading binary data and then selectively choosing to interpret some of these bytes as strings.

@NHDaly
Member Author

NHDaly commented Apr 8, 2025

Interesting suggestion. Possibly that would be a better fit, we should look into it.

I think that we are something like "utf8 by convention, but if you store invalid strings loaded from binary data, that's okay too, as long as you accept that you might get errors on the functions that expect utf8 strings," which I thought was more like the Julia philosophy too. But maybe that's wrong?

> Julia does allow reading non-UTF8 in as strings, but it has never allowed slicing or indexing at invalid character indexes, even when constructing a normal String. It seems like if we allow constructing a view at invalid indices, we'd also have to allow slicing and indexing at invalid indices.

I guess that julia does support indexing strings via codeunits, through the codeunits(s) interface. But you're right that turning that into a sliced string is a bit of gymnastics:

julia> String(codeunits(s)[1:2])
"\xa8\xce"

I'm basically asking if we can have the same thing for String Views. Something like:

substring_view_codeunits(s, 1, 2)

or maybe just

SubString{String}(codeunits(s), 1:2)

or something?
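
To illustrate the shape of that request without touching Base, here is a hypothetical stand-in. CodeUnitSlice and slice_bytes are invented names, not an existing or proposed API, and the type is deliberately not an AbstractString. It stores the same (offset, ncodeunits) pair as SubString but takes its bounds in code units.

```julia
# Hypothetical type: a code-unit-indexed slice of a String.
struct CodeUnitSlice
    parent::String
    offset::Int
    ncodeunits::Int

    function CodeUnitSlice(s::String, i::Int, j::Int)
        i <= j || return new(s, 0, 0)
        checkbounds(codeunits(s), i:j)   # byte bounds only, no character check
        return new(s, i - 1, j - i + 1)
    end
end

# Non-copying access to the bytes covered by the slice.
slice_bytes(v::CodeUnitSlice) =
    view(codeunits(v.parent), (v.offset + 1):(v.offset + v.ncodeunits))

s = "\xa8\xce\xa8"
v = CodeUnitSlice(s, 1, 2)
println(v.ncodeunits)
println(slice_bytes(v))
```

Unlike SubString(s, 1, 2), this covers exactly two code units, and a copy is only needed if the bytes must be materialized as a String.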

@Seelengrab
Contributor

> Imagine that I am storing non utf-8 data in the string,

Maybe I'm missing something, but why not use a Memory for this? What advantage does using a String (and thus SubString{String}) have here? It doesn't sound like you're treating that data as an actual String in its own sense.

@NHDaly
Member Author

NHDaly commented Apr 10, 2025

My point is that sometimes we are treating the data like a utf8 string. We want to be able to call string functions on it like uppercase, length, etc, which do assume utf8, but we also want to support byte-based functionality if it isn't utf8. Again, I will note that this is exactly the same flexibility that Julia String offers.
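
A minimal sketch of that dual usage with a plain String, where summarize is a hypothetical helper and not part of any API: character-level functions are used when the data validates as UTF-8, and code unit functions otherwise.

```julia
# Hypothetical helper: pick text or byte semantics per value.
function summarize(s::String)
    if isvalid(s)
        return (kind = :text, size = length(s))       # count characters
    else
        return (kind = :bytes, size = ncodeunits(s))  # count raw code units
    end
end

println(summarize("fløde"))
println(summarize("\xa8\xce\xa8"))
```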

@vtjnash
Member

vtjnash commented Apr 10, 2025

I think the main concern with that is that if you haven't put any separators in your data (not even \0), then those functions will return wrong answers and corrupt your data in this edge case. If you do have any sort of separator in the data, then SubString construction would have been okay.
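
One way to picture this edge case, using two hypothetical one-byte records: packed back to back, the bytes fuse into a single character, so a view of the first record silently grows. With a \0 separator the original boundaries remain valid indices.

```julia
a = "\xce"   # hypothetical record ending in a UTF-8 lead byte
b = "\xa8"   # hypothetical record starting with a continuation byte

# Without a separator the two bytes decode as one character (Ψ).
packed = a * b
widened = SubString(packed, 1, 1)   # meant to be record a only

# With a separator each record keeps its own character boundaries.
separated = a * "\0" * b
first_view = SubString(separated, 1, 1)
second_view = SubString(separated, 3, 3)

println((length(packed), ncodeunits(widened)))
println((ncodeunits(first_view), ncodeunits(second_view)))
```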

@jakobnissen
Member

jakobnissen commented Apr 10, 2025

Maybe I'm dense (pun intended) but I still don't quite understand. You want to use String as a generic byte storage for a buffer that contains both UTF8 and non-UTF8.
Presumably, the substrings should only point to the UTF8 part of the data. That implies the start and end indices of the substring are also valid, so the current constructor shouldn't be a problem?
Here is an example:

julia> data = rand(UInt8, 500)
       s = "rødgrød med fløde"
       data[100:100+ncodeunits(s)-1] .= codeunits(s)
       str = String(data)
       view(str, 100:prevind(str, 100+ncodeunits(s)))
"rødgrød med fløde"


4 participants