Allow constructing a SubString{String} with codeunit indexing, even if substring isn't valid UTF-8. #58048
Can you comment on why you would need to construct a SubString which isn't a sub-slice of the parent string? Simple local testing seems to suggest it works just fine with malformed data, as long as your slices are well-formed
Imagine that I am storing non-UTF-8 data in the string, and I want to get a SubString of that string, as a view to the first 2 bytes, and the bytes are

julia> s = "\xa8\xce\xa8"
"\xa8Ψ"
julia> ncodeunits(SubString{String}(s, 1, 2))
3

You can get it via a copy, like this, of course, but this is no longer a SubString of the original string:

julia> SubString{String}(String(codeunits(s)[1:2]))
"\xa8\xce"
julia> ncodeunits(SubString{String}(String(codeunits(s)[1:2])))
2
I'm skeptical of this, but interested in the use case. If you want the first 2 bytes, independent of whether these two bytes actually correspond to characters, in what sense do you really want a string? That is, if you want the first two bytes, don't you actually want
Yeah, I asked my colleague the same thing, but the issue is that we are using the same data structure both to store UTF-8 strings and non-UTF-8 strings, and it's up to the caller to know whether or not you can call the UTF-8-specific functions. Which, I will note, is exactly the same choice made by Julia's String type. You can store either UTF-8 data or non-UTF-8 data in a String, and you simply need to know (or check

(The context for this is that we are implementing the storage for strings in our database engine, which has the same loose definition of string data as Julia: a string can contain arbitrary data, but if it is UTF-8, you can use the corresponding functions for it. So we want to support treating this data as an

However, as a performance optimization in some contexts, we are loading a series of strings together from disk as one giant string, and then using SubStrings to refer to the individual strings. In such cases, we want the individual strings to behave like AbstractString. We just want to avoid copying the data if we can, to make it more efficient.)
Julia does allow reading non-UTF-8 data in as strings, but it has never allowed slicing or indexing at invalid character indices, even when constructing a normal String. It seems like if we allow constructing a view at invalid indices, we'd also have to allow slicing and indexing at invalid indices.
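
The restriction described in the comment above can be checked directly. This is a minimal sketch reusing the string s from earlier in the thread; the helper name caught is hypothetical and exists only for this illustration. It assumes current Julia behavior, where both slicing and SubString construction validate the start and end indices:

```julia
# Hypothetical helper for this illustration: run f and return the
# exception it throws, or nothing if it completes normally.
function caught(f)
    try
        f()
        return nothing
    catch err
        return err
    end
end

s = "\xa8\xce\xa8"   # byte 1 is a stray continuation byte; bytes 2-3 encode 'Ψ'

# Index 2 starts 'Ψ'; index 3 falls in the middle of it.
@assert isvalid(s, 2)
@assert !isvalid(s, 3)

# Slicing and view construction both reject the invalid end index.
@assert caught(() -> s[1:3]) isa StringIndexError
@assert caught(() -> SubString(s, 1, 3)) isa StringIndexError
```

So the existing constructor is consistent with ordinary String slicing: neither accepts an index that lands inside a character.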
Interesting suggestion. Possibly that would be a better fit, we should look into it. I think that we are something like "UTF-8 by convention, but if you store invalid strings loaded from binary data, that's okay too, as long as you accept that you might get errors on the functions that expect UTF-8 strings," which I thought was more like the Julia philosophy too. But maybe that's wrong?

I guess that Julia does support indexing strings via codeunits, through the

julia> String(codeunits(s)[1:2])
"\xa8\xce"

I'm basically asking if we can have the same thing for String views. Something like: substring_view_codeunits(s, 1, 2) or maybe just SubString{String}(codeunits(s), 1:2) or something?
Maybe I'm missing something, but why not use a
My point is that sometimes we are treating the data like a UTF-8 string. We want to be able to call string functions on it like uppercase, length, etc., which do assume UTF-8, but we also want to support byte-based functionality if it isn't UTF-8. Again, I will note that this is exactly the same flexibility that Julia's String offers.
I think the main concern with that is if you haven't put any separators in your data (even at least
Maybe I'm dense (pun intended) but I still don't quite understand. You want to use String as generic byte storage for a buffer that contains both UTF-8 and non-UTF-8 data.

julia> data = rand(UInt8, 500)
s = "rødgrød med fløde"
data[100:100+ncodeunits(s)-1] .= codeunits(s)
str = String(data)
view(str, 100:prevind(str, 100+ncodeunits(s)))
"rødgrød med fløde"
In Julia, we are allowed to construct strings with non-UTF-8 data. Per the docstring for String:
Julia provides a series of functions for indexing a string via codeunits, such as codeunit(str, i) -> UInt8 and codeunits(str::Str) -> Base.CodeUnits.
However, we cannot use SubString{String} to build a view over a string that indexes non-UTF-8 data by codeunits.
This is surprising, since the underlying struct appears architected to support it:
but we cannot construct it, since the default constructor has been replaced with one taking a start and end character offset.
Can we provide an additional function to allow constructing a SubString{String} via offset and ncodeunits, allowing a SubString to not refer to a valid UTF-8 string?
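
For the byte-oriented half of this request, a zero-copy view is already expressible with existing functions. The following is a minimal sketch, not the proposed API: it combines codeunits with view, and the result is an AbstractVector{UInt8} rather than an AbstractString, so functions such as uppercase or length (in characters) do not apply to it.

```julia
s = "\xa8\xce\xa8"

# codeunits(s) wraps the string without copying; view() adds a range on top.
bytes = view(codeunits(s), 1:2)
@assert bytes == [0xa8, 0xce]
@assert length(bytes) == 2

# For comparison, the character-indexed view widens to cover all of 'Ψ'.
@assert ncodeunits(SubString(s, 1, 2)) == 3

# Materializing a String from the byte view works, but copies the two bytes.
copied = String(bytes)
@assert ncodeunits(copied) == 2
```

What the issue asks for is the missing combination: the exact two-byte extent of the byte view, with the AbstractString interface of SubString.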