-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix reading of proto Uint32Value #3113
base: master
Are you sure you want to change the base?
Fix reading of proto Uint32Value #3113
Conversation
@@ -260,7 +260,7 @@ private <T> Builder<? extends Builder<?, GroupBuilder<T>>, GroupBuilder<T>> addF | |||
return builder.primitive(INT32, getRepetition(descriptor)); | |||
} | |||
if (messageType.equals(UInt32Value.getDescriptor())) { | |||
return builder.primitive(INT64, getRepetition(descriptor)); | |||
return builder.primitive(INT32, getRepetition(descriptor)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An unsigned 32-bit integer might overflow a signed 32-bit integer, right? It would be good to test this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just added a test using the min and max value of Integer and doing a round trip to make sure we get it back. It looks a bit weird because uint32 are represented as signed 32 bits integer in java, but it works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the test really test the out of bounds case. The Integer.MAX_VALUE
is 2^31-1 = 2147483647
, but the unsigned integer could go up to 2^32-1 = 4294967295
. Maybe the original author @mwong38 has an opinion on this (I'm not a protobuf expert myself).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's my understanding.
uint32
is not a java concept. Therefore a protobuf field of type uint32
, in the java generated code, will be treated as signed 32 bit integer. What java calls an int
.
Something like:
message TestMessage {
uint32 value = 1;
}
Will generate setters and getters that use an int
:
public Builder setValue(int value) // ....
However if I set the value to a negative value, for example builder.setValue(-1)
and call toString
, it will correctly interpret it as 4294967295.
Again what's important to note is that the underlying representation of a uint32 protobuf type in java is (32bit) int. Say I am to store, from a language that support unsigned integers (eg python), the value of a uint32
to 4294967295
. Then if I read it in java I would get -1
. But if I serialize the message again and pass it back to C or python, I would get back 4294967295
which is what I want.
I've extended the test to illustrate the behaviour better (adding calls to toString
). Also I've added -1
on top of Integer.MIN_VALUE
and Integer.MAX_VALUE
to correctly test the full spectrum of unsigned integer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fine and a good way to test it. Java doesn't support unsigned int, so if you're forced to convert to signed, this is the best way. The underlying 32-bit machine representation will almost certainly be the same. (I say "almost" baring whatever little/big endian weirdness -- but even then the conversion to/from Protobuf should be consistent.) So I think we're good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks for the excellent example. But why wouldn't we just handle this as an INT64 in Java, sacrificing some more memory, but see 4294967295
instead of -1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Fokko I think the Uint32Value, which is a basically a nullable uint32, should use the same physical type as a non nullable uint32 (give or take the nullable flag). In this case we should use 32 bits integer to be consistent.
Also even if we were to use int64 for the underlying parquet storage, it won't be exposed to the user as the API for Uint32Value is provided by the protobuf library and uses int
in java.
Rationale for this change
Fix for #3112
What changes are included in this PR?
addLong
implementation of theProtoUInt32ValueConverter
Are these changes tested?
There was extensive test coverage of this feature for round trip reading. The main change is the physical / underlying type change from int32 to int64 which is tested.
There are however no test for backward compatibility.
Are there any user-facing changes?
The parquet file generated from protobuf using Uint32Value will now use int32 instead of int64.
Related MR:
Closes #3112