Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix reading of proto Uint32Value #3113

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

0x26res
Copy link

@0x26res 0x26res commented Jan 6, 2025

Rationale for this change

Fix for #3112

What changes are included in this PR?

  • Uint32Value primitive / physical type switched to INT32 (from INT64)
  • Reading value for UInt32Value now uses int instead of long. In order to be backward compatible with previous parquet files (that were using long), I've left the addLong implementation of the ProtoUInt32ValueConverter

Are these changes tested?

There was extensive test coverage of this feature for round trip reading. The main change is the physical / underlying type change from int32 to int64 which is tested.

There are however no test for backward compatibility.

Are there any user-facing changes?

The parquet file generated from protobuf using Uint32Value will now use int32 instead of int64.

Related MR:

Closes #3112

@@ -260,7 +260,7 @@ private <T> Builder<? extends Builder<?, GroupBuilder<T>>, GroupBuilder<T>> addF
return builder.primitive(INT32, getRepetition(descriptor));
}
if (messageType.equals(UInt32Value.getDescriptor())) {
return builder.primitive(INT64, getRepetition(descriptor));
return builder.primitive(INT32, getRepetition(descriptor));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An unsigned 32-bit integer might overflow a signed 32-bit integer, right? It would be good to test this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just added a test using the min and max value of Integer and doing a round trip to make sure we get it back. It looks a bit weird because uint32 are represented as signed 32 bits integer in java, but it works.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the test really test the out of bounds case. The Integer.MAX_VALUE is 2^31-1 = 2147483647, but the unsigned integer could go up to 2^32-1 = 4294967295. Maybe the original author @mwong38 has an opinion on this (I'm not a protobuf expert myself).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's my understanding.

uint32 is not a java concept. Therefore a protobuf field of type uint32, in the java generated code, will be treated as signed 32 bit integer. What java calls an int.

Something like:

message TestMessage {
  uint32 value = 1;
}

Will generate setters and getters that use an int:

 public Builder setValue(int value) // ....

However if I set the value to a negative value, for example builder.setValue(-1) and call toString, it will correctly interpret it as 4294967295.

Again what's important to note is that the underlying representation of a uint32 protobuf type in java is (32bit) int. Say I am to store, from a language that support unsigned integers (eg python), the value of a uint32 to 4294967295. Then if I read it in java I would get -1. But if I serialize the message again and pass it back to C or python, I would get back 4294967295 which is what I want.

I've extended the test to illustrate the behaviour better (adding calls to toString). Also I've added -1 on top of Integer.MIN_VALUE and Integer.MAX_VALUE to correctly test the full spectrum of unsigned integer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine and a good way to test it. Java doesn't support unsigned int, so if you're forced to convert to signed, this is the best way. The underlying 32-bit machine representation will almost certainly be the same. (I say "almost" baring whatever little/big endian weirdness -- but even then the conversion to/from Protobuf should be consistent.) So I think we're good.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks for the excellent example. But why wouldn't we just handle this as an INT64 in Java, sacrificing some more memory, but see 4294967295 instead of -1?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fokko I think the Uint32Value, which is a basically a nullable uint32, should use the same physical type as a non nullable uint32 (give or take the nullable flag). In this case we should use 32 bits integer to be consistent.

Also even if we were to use int64 for the underlying parquet storage, it won't be exposed to the user as the API for Uint32Value is provided by the protobuf library and uses int in java.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fail to read google.protobuf.UInt32Value from parquet
3 participants