-
Notifications
You must be signed in to change notification settings - Fork 483
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ORC-1458: Reduce HDFS NameNode getFileInfo RPC #1767
base: main
Are you sure you want to change the base?
Conversation
This reverts commit 4e659d8.
|
||
public class OrcInputStreamUtil { | ||
|
||
private static final String DFS_CLASS = "org.apache.hadoop.hdfs.DFSInputStream"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we reflect this when Apache ORC can access this directly?
$ javap -cp hadoop-client-api-3.3.6.jar org.apache.hadoop.hdfs.DFSInputStream | head -n3
Compiled from "DFSInputStream.java"
public class org.apache.hadoop.hdfs.DFSInputStream extends org.apache.hadoop.fs.FSInputStream implements org.apache.hadoop.fs.ByteBufferReadable,org.apache.hadoop.fs.CanSetDropBehind,org.apache.hadoop.fs.CanSetReadahead,org.apache.hadoop.fs.HasEnhancedByteBufferAccess,org.apache.hadoop.fs.CanUnbuffer,org.apache.hadoop.fs.StreamCapabilities,org.apache.hadoop.fs.ByteBufferPositionedReadable {
public static boolean tcpReadsDisabledForTesting;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
org.apache.hadoop.hdfs.DFSInputStream.shortCircuitForbidden
Method is not public.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before ORC-1430, we could not access hdfs-related classes in the core module, so I used reflection.
Now I have done some refactoring, but the method of shortCircuitForbidden
is not public, so I still need to use reflection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a high-risky way which Apache Hadoop community is also not recommended.
May I ask if you have any reference in ASF communities which adopts this hack, @cxzl25 ?
Gentle ping, @cxzl25 . |
@Hexiaoqiao Can you help review this PR? |
Thanks involve me here. I have left my concerns here, I would like to reassert: I believe this could reduce load to NameNode when using ORC on HDFS but I am worried it involve any potential issues from HDFS view. Back to this PR, IIUC this is used only for extract footer, right? Technically, for one immutable file, the result from Thanks again. |
What changes were proposed in this pull request?
If the file in HDFS is in a completed state, avoid calling the HDFS getFileInfo RPC.
Provide
orc.file.length.fast
configuration to enable this behavior.Why are the changes needed?
Now reading an ORC file in HDFS will generate at least one
open
andgetFileInfo
RPC. This optimization can remove getFileInfo RPC as much as possible, improve ORC reading efficiency, and reduce the number of HDFS RPCs.How was this patch tested?
The production environment has been running stably for several months
Was this patch authored or co-authored using generative AI tooling?
No