[Join] Inline and parallelize tbb in getAllTableColumnFragments. #616
Conversation
Force-pushed from 6cfba25 to 40fd548
Force-pushed from ab0e572 to eb8042f
Force-pushed from 1e7b776 to 32b48b4
Will be rebased over #623. Done.
Force-pushed from 13a559f to 93f9e17
I was wrong here; it's an interference of two issues.
Force-pushed from 93f9e17 to 235508d
Should be separated into two PRs; there is also a new CPU implementation. Some overall details in #574 (comment).
Benchmark status (with non-lazy): ./_launcher/solution.R --solution=pyhdk --task=join --nrow=1e8
[1] "./pyhdk/join-pyhdk.py"
# join-pyhdk.py
pyhdk data_name: J1_1e8_NA_0_0
loading datasets J1_1e8_NA_0_0, J1_1e8_1e2_0_0, J1_1e8_1e5_0_0, J1_1e8_1e8_0_0
Using fragment size 32000000
100000000
100
100000
100000000
joining...
(89997128, 9)
(89997128, 9)
(89995511, 11)
(89995511, 11)
(100000000, 11)
(100000000, 11)
(89995511, 11)
(89995511, 11)
(90000000, 13)
(90000000, 13)
joining finished, took 73s
on_disk question run time_sec
1 FALSE small inner on int 1 2.227
2 FALSE small inner on int 2 1.914
3 FALSE medium inner on int 1 2.315
4 FALSE medium inner on int 2 2.167
5 FALSE medium outer on int 1 2.257
6 FALSE medium outer on int 2 2.475
7 FALSE medium inner on factor 1 2.227
8 FALSE medium inner on factor 2 2.240
9 FALSE big inner on int 1 4.320
10 FALSE big inner on int 2 3.970
./_launcher/solution.R --solution=pyhdk --task=join --nrow=1e8
[1] "./pyhdk/join-pyhdk.py"
# join-pyhdk.py
pyhdk data_name: J1_1e8_NA_0_0
loading datasets J1_1e8_NA_0_0, J1_1e8_1e2_0_0, J1_1e8_1e5_0_0, J1_1e8_1e8_0_0
Using fragment size 4000000
100000000
100
100000
100000000
joining...
(89997128, 9)
(89997128, 9)
(89995511, 11)
(89995511, 11)
(100000000, 11)
(100000000, 11)
(89995511, 11)
(89995511, 11)
(90000000, 13)
(90000000, 13)
joining finished, took 59s
on_disk question run time_sec
1 FALSE small inner on int 1 1.077
2 FALSE small inner on int 2 0.765
3 FALSE medium inner on int 1 1.063
4 FALSE medium inner on int 2 1.017
5 FALSE medium outer on int 1 0.704
6 FALSE medium outer on int 2 0.726
7 FALSE medium inner on factor 1 1.056
8 FALSE medium inner on factor 2 1.004
9 FALSE big inner on int 1 1.506
10 FALSE big inner on int 2 1.293
Force-pushed from b81d607 to a6dbaa4
export FRAGMENT_SIZE=4000000
/localdisk/dmitriim/benchmarks/db-benchmark ⑂master* $ ./_launcher/solution.R --solution=pyhdk --task=join --nrow=1e7
[1] "./pyhdk/join-pyhdk.py"
# join-pyhdk.py
pyhdk data_name: J1_1e7_NA_0_0
loading datasets J1_1e7_NA_0_0, J1_1e7_1e1_0_0, J1_1e7_1e4_0_0, J1_1e7_1e7_0_0
Using fragment size 4000000
10000000
10
10000
10000000
joining...
(8998860, 9)
(8998860, 9)
(8998412, 11)
(8998412, 11)
(10000000, 11)
(10000000, 11)
(8998412, 11)
(8998412, 11)
(9000000, 13)
(9000000, 13)
joining finished, took 9s
on_disk question run time_sec
1 FALSE small inner on int 1 0.346
2 FALSE small inner on int 2 0.304
3 FALSE medium inner on int 1 0.409
4 FALSE medium inner on int 2 0.373
5 FALSE medium outer on int 1 0.274
6 FALSE medium outer on int 2 0.284
7 FALSE medium inner on factor 1 0.354
8 FALSE medium inner on factor 2 0.320
9 FALSE big inner on int 1 0.774
10 FALSE big inner on int 2 0.482
./_launcher/solution.R --solution=pyhdk --task=join --nrow=1e8
[1] "./pyhdk/join-pyhdk.py"
# join-pyhdk.py
pyhdk data_name: J1_1e8_NA_0_0
loading datasets J1_1e8_NA_0_0, J1_1e8_1e2_0_0, J1_1e8_1e5_0_0, J1_1e8_1e8_0_0
Using fragment size 4000000
100000000
100
100000
100000000
joining...
(89997128, 9)
(89997128, 9)
(89995511, 11)
(89995511, 11)
(100000000, 11)
(100000000, 11)
(89995511, 11)
(89995511, 11)
(90000000, 13)
(90000000, 13)
joining finished, took 64s
on_disk question run time_sec
1 FALSE small inner on int 1 1.009
2 FALSE small inner on int 2 0.903
3 FALSE medium inner on int 1 0.970
4 FALSE medium inner on int 2 0.971
5 FALSE medium outer on int 1 0.754
6 FALSE medium outer on int 2 0.840
7 FALSE medium inner on factor 1 1.086
8 FALSE medium inner on factor 2 0.999
9 FALSE big inner on int 1 1.501
10 FALSE big inner on int 2 1.338
./_launcher/solution.R --solution=pyhdk --task=join --nrow=1e9
[1] "./pyhdk/join-pyhdk.py"
# join-pyhdk.py
pyhdk data_name: J1_1e9_NA_0_0
loading datasets J1_1e9_NA_0_0, J1_1e9_1e3_0_0, J1_1e9_1e6_0_0, J1_1e9_1e9_0_0
Using fragment size 4000000
1000000000
1000
1000000
1000000000
joining...
(899999033, 9)
[thread 984394 also had an error][thread 983253 also had an error][thread 984245 also had an error][thread 986622 also had an error]
[thread 988259 also had an error][thread 985551 also had an error]
[thread 982395 also had an error]
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f7723a7c2d6, pid=858290, tid=984071
#
# JRE version: OpenJDK Runtime Environment (20.0) (build 20-internal-adhoc..src)
# Java VM: OpenJDK 64-Bit Server VM (20-internal-adhoc..src, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C [libResultSet.so+0xbb2d6] ResultSet::getRowAt[abi:cxx11](unsigned long, bool, bool, bool, std::vector<bool, std::allocator<bool> > const&) const+0x1d6
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /localdisk/dmitriim/benchmarks/db-benchmark/core.858290)
#
# An error report file with more information is saved as:
# /localdisk/dmitriim/benchmarks/db-benchmark/hs_err_pid858290.log
#
# If you would like to submit a bug report, please visit:
# https://bugreport.java.com/bugreport/crash.jsp
#
Aborted (core dumped)
Looks like (see #616 (comment)) large joins are failing in all backends.
It looks like it's simply because the machine doesn't have enough memory for such a big join. It has just 160 GB of memory, and IIUC each of the joined tables is 50 GB.
Force-pushed from a6dbaa4 to 3e26fa2
Could you please explain which redundant copies you actually remove? To me, it looks like some inlining plus parallelization using TBB.
The varlen part of the new code looks completely dysfunctional: the size to copy in write_ptrs would be some negative int cast to size_t and should cause a SEGFAULT on memcpy. Is it ever actually triggered in our tests? I feel like we don't actually support whole-column fetch for varlen data.
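For illustration, a minimal standalone sketch of that failure mode (hypothetical names, not the PR's code): a negative int size is implicitly widened to size_t by memcpy, yielding an enormous copy length.

#include <cstddef>
#include <cstring>

// Hypothetical sketch: if the computed varlen byte count underflows to a
// negative int, the conversion to size_t turns it into a value near 2^64,
// so memcpy reads far past the source buffer (SIGSEGV).
void copy_varlen(char* dst, const char* src, int begin_offset, int end_offset) {
  int bytes = end_offset - begin_offset;  // can go negative for varlen data
  std::memcpy(dst, src + begin_offset, static_cast<size_t>(bytes));
}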
python/pyhdk/hdk.py (outdated)
@@ -2899,6 +2899,9 @@ def if_then_else(self, cond, true_val, false_val):
        """
        return self._builder.if_then_else(cond, true_val, false_val)

    def clear_cache(self):
Please avoid unrelated changes.
@@ -73,6 +73,7 @@ struct JoinColumnIterator {
  DEVICE FORCE_INLINE JoinColumnIterator& operator++() {
    index += step;
    index_inside_chunk += step;
    // this loop is made to find index_of_chunk by total index of element
Please avoid unrelated changes.
const auto fragments_it = all_tables_fragments.find({db_id, table_id});
CHECK(fragments_it != all_tables_fragments.end());
What is this vertical space added for?
For visual indentation; will remove.
auto merged_results =
    ColumnarResults::mergeResults(executor_->row_set_mem_owner_, column_frags);
const auto& fragment = (*fragments)[frag_id];
const auto& rows_in_frag = fragment.getNumTuples();
A reference here looks inappropriate.
  total_row_count += rows_in_frag;
}

const auto& type_width = col_info->type->size();
Why reference?
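To illustrate (a generic sketch, not the PR's code): if type->size() returns a small scalar by value, the reference merely binds to a temporary via lifetime extension and suggests a saved copy that never existed.

struct Type {
  int size() const { return 8; }  // returns by value: the result is a temporary
};

void example(const Type& type) {
  const auto& type_width_ref = type.size();  // legal (lifetime extension), but misleading
  const auto type_width = type.size();       // clearer: just copy the int
  (void)type_width_ref;
  (void)type_width;
}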
  valid_fragments.push_back(frag_id);
}

if (write_ptrs.empty()) {
Maybe make an empty table check right after the row count computation?
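A hedged sketch of that idea (hypothetical function shape, generic types): compute the row count first and bail out immediately on an empty table, so write_ptrs never needs the late emptiness check.

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: detect the empty table right after the row count is
// computed, before any write buffers are allocated.
const int8_t* fetch_all_fragments(const std::vector<size_t>& rows_per_frag) {
  size_t total_row_count = 0;
  for (size_t rows : rows_per_frag) {
    total_row_count += rows;
  }
  if (total_row_count == 0) {
    return nullptr;  // empty table: skip allocation, fetch, and merge
  }
  // ... allocate write buffers, fetch fragments, merge ...
  return nullptr;  // placeholder for the merged buffer
}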
Okay, maybe the current change is a simple parallelization case. Originally we had:
Currently the ColumnarResults c-tor is already fixed, fetchBuffer should be zero-copy, and the merge is inlined and parallelized.
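As a rough illustration of the inlined, parallelized merge (a sketch with hypothetical names, not the PR's exact code): every fragment's destination offset in the merged buffer is known up front, so the per-fragment copies are independent and can run under tbb::parallel_for.

#include <tbb/parallel_for.h>

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical sketch: each entry pairs a destination inside the merged
// buffer with one fragment's source data; the copies do not overlap.
struct FragmentCopy {
  int8_t* dst;        // write position inside the merged buffer
  const int8_t* src;  // fragment chunk data
  size_t num_bytes;   // rows_in_frag * type_width
};

void merge_fragments(const std::vector<FragmentCopy>& copies) {
  tbb::parallel_for(size_t(0), copies.size(), [&](size_t i) {
    std::memcpy(copies[i].dst, copies[i].src, copies[i].num_bytes);
  });
}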
Force-pushed from 3e26fa2 to e342fae
Force-pushed from ba5af91 to a501ccd
This version is good overall! I suggest a few clean-ups though.
@@ -239,6 +240,11 @@ const int8_t* ColumnFetcher::getAllTableColumnFragments(
  int db_id = col_info->db_id;
  int table_id = col_info->table_id;
  int col_id = col_info->column_id;
  if (col_info->type->isString() || col_info->type->isArray()) {
This should probably be a CHECK instead. It is an internal error, not some input error useful for the user.
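A sketch of the suggested change, using the CHECK macro already seen elsewhere in this diff:

// Internal invariant, not a user-facing condition: varlen whole-column
// fetch is unsupported, so assert it away instead of branching.
CHECK(!col_info->type->isString() && !col_info->type->isArray());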
size_t total_row_count = 0;
for (size_t frag_id = 0; frag_id < frag_count; ++frag_id) {
  if (executor_->getConfig().exec.interrupt.enable_non_kernel_time_query_interrupt &&
The interruption check in this loop doesn't make much sense because we don't do any actual data fetch here. I suggest moving it out of the loop or removing it completely.
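A sketch of that suggestion (the exact interrupt-check call is elided, since it is not visible in this hunk): perform the check once, then run the plain counting loop.

// One interruption check is enough; the loop only sums row counts and
// fetches no data.
if (executor_->getConfig().exec.interrupt.enable_non_kernel_time_query_interrupt) {
  // ... single interruption check here ...
}
size_t total_row_count = 0;
for (size_t frag_id = 0; frag_id < frag_count; ++frag_id) {
  total_row_count += (*fragments)[frag_id].getNumTuples();
}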
  raw_write_ptrs.emplace_back(write_ptrs[i].first);
}

std::unique_ptr<ColumnarResults> merged_results(new ColumnarResults( |
The vector ColumnarResults::ColumnarResults gets is not a pointer per fragment; it is a pointer per column (ColumnarResults can store multiple columns, but not multiple fragments). So you are actually supposed to pass a vector with a single pointer here. Your version works because the first vector element has a correct pointer and the other elements are simply not used. But this code is still confusing, so let's fix it.
I don't understand. Do you mean that we are using only the data from the first fragment? If so, why are we fetching them all?
Ah, sorry, now I get it.
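To restate the point as a minimal sketch (hypothetical variable names; the real constructor call is in the hunk above):

// The buffer vector given to ColumnarResults is indexed per column, not per
// fragment. With a single merged column, it should hold exactly one pointer.
std::vector<const int8_t*> column_buffers;
column_buffers.push_back(merged_buffer);  // the single merged column
// ... construct ColumnarResults from column_buffers (one entry per column),
// rather than pushing one entry per fragment ...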
Force-pushed from a501ccd to 05b98f3
This commit refactors and simplifies method `getAllTableColumnFragments`. Also, some parallelization is added.
Partially resolves: #574
Signed-off-by: Dmitrii Makarenko <[email protected]>
Force-pushed from 05b98f3 to 8c4a252