[SPARK-53347][PROTOBUF] Fix proto deserialization to allow false and null values #52108
+47
−0
What changes were proposed in this pull request?
This PR fixes SPARK-53347: from_protobuf() incorrectly deserializes google.protobuf.BoolValue fields set to false as null.
In Protobuf 3, primitive types (e.g. bool) have no field presence, but wrapper types such as google.protobuf.BoolValue are messages that can distinguish between:
unset → null
false → false
true → true
The existing Spark Protobuf deserializer dropped the distinction for BoolValue(false) and returned null.
This patch updates ProtobufDeserializer so that BoolValue(false) is correctly deserialized as false, while keeping the existing semantics for all other types.
Why are the changes needed?
Without this patch, Spark users cannot distinguish between false and null when reading Protobuf data using from_protobuf().
This breaks correctness in queries where boolean fields are optional but explicitly set to false.
Correct handling is critical for applications that rely on distinguishing unset from false values (e.g. feature flags, filters, optional booleans in business logic).
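To make the query impact concrete, here is a minimal plain-Scala sketch (no Spark involved). It assumes three input messages, one with the flag set to true, one set to false, and one unset, and models a nullable column as Option[Boolean]. The object name FlagFilterSketch and its members are hypothetical and exist only for illustration.

```scala
// Illustrative model only; this is not Spark code.
object FlagFilterSketch {
  // Column values as a reader would see them. None stands for SQL NULL.
  // Before the patch, the explicit false came back as null.
  val beforePatch: Seq[Option[Boolean]] = Seq(Some(true), None, None)
  // After the patch, the explicit false is preserved.
  val afterPatch: Seq[Option[Boolean]] = Seq(Some(true), Some(false), None)

  // Mimics a predicate such as `flag = false`: NULL never satisfies it.
  def countExplicitlyFalse(column: Seq[Option[Boolean]]): Int =
    column.count(_.contains(false))
}
```

Under these assumptions, a filter on explicitly false flags matches no rows before the patch and one row after it.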
Does this PR introduce any user-facing change?
Yes.
Previously, parsing a message that contained a false boolean value with:
df.select(from_protobuf($"bytes", "BoolWrapper", desc)).show()
produced null for that field.
After the patch, the same query returns false.
The behaviour when the value is true remains unchanged.
How was this patch tested?
Added a new unit test in ProtobufFunctionsSuite:
Ensures BoolValue.of(true) → true
Ensures BoolValue.of(false) → false
Ensures absent field → null
Ran the existing ProtobufFunctionsSuite and related Protobuf tests to confirm no regressions.
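The three cases covered by the new test can be summarised with a small plain-Scala model. This is a sketch of the expected mapping only, not the test added to ProtobufFunctionsSuite and not the ProtobufDeserializer implementation; the case class and the toColumn helper are hypothetical stand-ins.

```scala
// Illustrative model of the expected mapping; names are hypothetical.
object BoolValueCasesSketch {
  // Stand-in for a message with a single optional BoolValue field:
  // None means the field is absent, Some(v) means it is present with value v.
  final case class BoolWrapper(flag: Option[Boolean])

  // Presence alone decides between null and a value;
  // the wrapped value (including false) is passed through unchanged.
  def toColumn(msg: BoolWrapper): java.lang.Boolean =
    msg.flag.map(b => java.lang.Boolean.valueOf(b)).orNull
}
```

The design point is that the null check depends only on whether the wrapper message is present, never on whether its inner value equals the type's default.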
Was this patch authored or co-authored using generative AI tooling?
Yes, with the help of Genie.
Generated-by Genie