Join Two Big Tables Fast
after research via playing with the queries and doing online research, I am turning to you for input. Looking forward to your replies! I am trying to write this in a general way as I am looking for general ideas on how to do this, not for exact statements. If this question is received poorly, I am happy to rework it, let me know. Here we go:
I have two tables:
- Table 1 has 10 MIO records. It has only one column:
- Column A: A unique ID (a text e.g. '5uz1g09aksmsd')
- Table 2 has 300 MIO records. It has two columns:
- Column A: A unique ID (a text e.g. '5uz1g09aksmsd')
- Column B: A number (e.g. 32.5432)
There are no indices etc. yet, not even a PK as both tables have just been created via CTAS. Both tables have been analysed.
I am now looking for a fast query (the prep-work e.g. index creation can take time, no problem) to create a third table that contains the 10 MIO rows of Table 1, and Column B of Table 2 if a match is found via Column A (for 99% a match will be found). To make it clearer, a simple CTAS could be:
create table3 as
select t1.a,
(select t2.b from table2 t2 where t2.a = t1.a and rownum = 1)
-- the rownum = 1 is to show Oracle that there can only be one match
from table1 t1;
This is not as fast as it can be (I hope). What are your ideas to make it faster? Creating indices? Which ones? What kind of join would you want to see in the execution plan? A hash join? I already found
- create table nologging
- parallel query execution and parallel table creation
But I am interested in the specifics of how the perfect execution plan for this would look given that we are allowed to alter the system (e.g. create an index). In particular I am asking for 11gR2 but 12c comments are also very welcome.
oracle
|
show 1 more comment
after research via playing with the queries and doing online research, I am turning to you for input. Looking forward to your replies! I am trying to write this in a general way as I am looking for general ideas on how to do this, not for exact statements. If this question is received poorly, I am happy to rework it, let me know. Here we go:
I have two tables:
- Table 1 has 10 MIO records. It has only one column:
- Column A: A unique ID (a text e.g. '5uz1g09aksmsd')
- Table 2 has 300 MIO records. It has two columns:
- Column A: A unique ID (a text e.g. '5uz1g09aksmsd')
- Column B: A number (e.g. 32.5432)
There are no indices etc. yet, not even a PK as both tables have just been created via CTAS. Both tables have been analysed.
I am now looking for a fast query (the prep-work e.g. index creation can take time, no problem) to create a third table that contains the 10 MIO rows of Table 1, and Column B of Table 2 if a match is found via Column A (for 99% a match will be found). To make it clearer, a simple CTAS could be:
create table3 as
select t1.a,
(select t2.b from table2 t2 where t2.a = t1.a and rownum = 1)
-- the rownum = 1 is to show Oracle that there can only be one match
from table1 t1;
This is not as fast as it can be (I hope). What are your ideas to make it faster? Creating indices? Which ones? What kind of join would you want to see in the execution plan? A hash join? I already found
- create table nologging
- parallel query execution and parallel table creation
But I am interested in the specifics of how the perfect execution plan for this would look given that we are allowed to alter the system (e.g. create an index). In particular I am asking for 11gR2 but 12c comments are also very welcome.
oracle
is there a foreign key from table 1 to table 2?
– Boneist
Nov 20 '18 at 16:16
@Boneist Thanks for your question. I updated my question to make this clearer. There is no foreign key (yet, but we can create one if you think this speeds things up considerably).
– Peter
Nov 20 '18 at 16:27
1
If I were you, given there are a small number of columns, it might be worth storing the table 1 and 2 data in Index organised tables, rather than standard heap tables (e.g.create table_1 (column_a primary key) organization index as select ...
). Note that would require primary keys to be specified, but I don't think that's an issue here.
– Boneist
Nov 20 '18 at 16:42
@Boneist Thank you for your suggestion. I will give IOTs a try tomorrow. Where do you think the benefit will come from? Table 1 as IOT or Table2 as IOT? Or is there an even greater benefit when both are IOT? Creating indexes (A on T1, A,B on T2) should yield the same result in terms of query speed, as no table access is needed after index access? How can an index on T1 help, as we need all rows anyways? What would a 'perfect non-parallel all-rows' execution plan look like in your opinion?
– Peter
Nov 20 '18 at 17:35
1
Simplest way ist to use a hash join, with efficient hardware use parallel query option. Nologging and parallel table creation would have less effect as the base of the elapsed time is in the join (simple test). As the tables are newly created and contains only the relevant column, indexes are of no use for hash join in your case (they would be greater that the table).
– Marmite Bomber
Nov 20 '18 at 18:07
|
show 1 more comment
after research via playing with the queries and doing online research, I am turning to you for input. Looking forward to your replies! I am trying to write this in a general way as I am looking for general ideas on how to do this, not for exact statements. If this question is received poorly, I am happy to rework it, let me know. Here we go:
I have two tables:
- Table 1 has 10 MIO records. It has only one column:
- Column A: A unique ID (a text e.g. '5uz1g09aksmsd')
- Table 2 has 300 MIO records. It has two columns:
- Column A: A unique ID (a text e.g. '5uz1g09aksmsd')
- Column B: A number (e.g. 32.5432)
There are no indices etc. yet, not even a PK as both tables have just been created via CTAS. Both tables have been analysed.
I am now looking for a fast query (the prep-work e.g. index creation can take time, no problem) to create a third table that contains the 10 MIO rows of Table 1, and Column B of Table 2 if a match is found via Column A (for 99% a match will be found). To make it clearer, a simple CTAS could be:
create table3 as
select t1.a,
(select t2.b from table2 t2 where t2.a = t1.a and rownum = 1)
-- the rownum = 1 is to show Oracle that there can only be one match
from table1 t1;
This is not as fast as it can be (I hope). What are your ideas to make it faster? Creating indices? Which ones? What kind of join would you want to see in the execution plan? A hash join? I already found
- create table nologging
- parallel query execution and parallel table creation
But I am interested in the specifics of how the perfect execution plan for this would look given that we are allowed to alter the system (e.g. create an index). In particular I am asking for 11gR2 but 12c comments are also very welcome.
oracle
after research via playing with the queries and doing online research, I am turning to you for input. Looking forward to your replies! I am trying to write this in a general way as I am looking for general ideas on how to do this, not for exact statements. If this question is received poorly, I am happy to rework it, let me know. Here we go:
I have two tables:
- Table 1 has 10 MIO records. It has only one column:
- Column A: A unique ID (a text e.g. '5uz1g09aksmsd')
- Table 2 has 300 MIO records. It has two columns:
- Column A: A unique ID (a text e.g. '5uz1g09aksmsd')
- Column B: A number (e.g. 32.5432)
There are no indices etc. yet, not even a PK as both tables have just been created via CTAS. Both tables have been analysed.
I am now looking for a fast query (the prep-work e.g. index creation can take time, no problem) to create a third table that contains the 10 MIO rows of Table 1, and Column B of Table 2 if a match is found via Column A (for 99% a match will be found). To make it clearer, a simple CTAS could be:
create table3 as
select t1.a,
(select t2.b from table2 t2 where t2.a = t1.a and rownum = 1)
-- the rownum = 1 is to show Oracle that there can only be one match
from table1 t1;
This is not as fast as it can be (I hope). What are your ideas to make it faster? Creating indices? Which ones? What kind of join would you want to see in the execution plan? A hash join? I already found
- create table nologging
- parallel query execution and parallel table creation
But I am interested in the specifics of how the perfect execution plan for this would look given that we are allowed to alter the system (e.g. create an index). In particular I am asking for 11gR2 but 12c comments are also very welcome.
oracle
oracle
edited Nov 20 '18 at 16:55
Peter
asked Nov 20 '18 at 15:39


PeterPeter
3739
3739
is there a foreign key from table 1 to table 2?
– Boneist
Nov 20 '18 at 16:16
@Boneist Thanks for your question. I updated my question to make this clearer. There is no foreign key (yet, but we can create one if you think this speeds things up considerably).
– Peter
Nov 20 '18 at 16:27
1
If I were you, given there are a small number of columns, it might be worth storing the table 1 and 2 data in Index organised tables, rather than standard heap tables (e.g.create table_1 (column_a primary key) organization index as select ...
). Note that would require primary keys to be specified, but I don't think that's an issue here.
– Boneist
Nov 20 '18 at 16:42
@Boneist Thank you for your suggestion. I will give IOTs a try tomorrow. Where do you think the benefit will come from? Table 1 as IOT or Table2 as IOT? Or is there an even greater benefit when both are IOT? Creating indexes (A on T1, A,B on T2) should yield the same result in terms of query speed, as no table access is needed after index access? How can an index on T1 help, as we need all rows anyways? What would a 'perfect non-parallel all-rows' execution plan look like in your opinion?
– Peter
Nov 20 '18 at 17:35
1
Simplest way ist to use a hash join, with efficient hardware use parallel query option. Nologging and parallel table creation would have less effect as the base of the elapsed time is in the join (simple test). As the tables are newly created and contains only the relevant column, indexes are of no use for hash join in your case (they would be greater that the table).
– Marmite Bomber
Nov 20 '18 at 18:07
|
show 1 more comment
is there a foreign key from table 1 to table 2?
– Boneist
Nov 20 '18 at 16:16
@Boneist Thanks for your question. I updated my question to make this clearer. There is no foreign key (yet, but we can create one if you think this speeds things up considerably).
– Peter
Nov 20 '18 at 16:27
1
If I were you, given there are a small number of columns, it might be worth storing the table 1 and 2 data in Index organised tables, rather than standard heap tables (e.g.create table_1 (column_a primary key) organization index as select ...
). Note that would require primary keys to be specified, but I don't think that's an issue here.
– Boneist
Nov 20 '18 at 16:42
@Boneist Thank you for your suggestion. I will give IOTs a try tomorrow. Where do you think the benefit will come from? Table 1 as IOT or Table2 as IOT? Or is there an even greater benefit when both are IOT? Creating indexes (A on T1, A,B on T2) should yield the same result in terms of query speed, as no table access is needed after index access? How can an index on T1 help, as we need all rows anyways? What would a 'perfect non-parallel all-rows' execution plan look like in your opinion?
– Peter
Nov 20 '18 at 17:35
1
Simplest way ist to use a hash join, with efficient hardware use parallel query option. Nologging and parallel table creation would have less effect as the base of the elapsed time is in the join (simple test). As the tables are newly created and contains only the relevant column, indexes are of no use for hash join in your case (they would be greater that the table).
– Marmite Bomber
Nov 20 '18 at 18:07
is there a foreign key from table 1 to table 2?
– Boneist
Nov 20 '18 at 16:16
is there a foreign key from table 1 to table 2?
– Boneist
Nov 20 '18 at 16:16
@Boneist Thanks for your question. I updated my question to make this clearer. There is no foreign key (yet, but we can create one if you think this speeds things up considerably).
– Peter
Nov 20 '18 at 16:27
@Boneist Thanks for your question. I updated my question to make this clearer. There is no foreign key (yet, but we can create one if you think this speeds things up considerably).
– Peter
Nov 20 '18 at 16:27
1
1
If I were you, given there are a small number of columns, it might be worth storing the table 1 and 2 data in Index organised tables, rather than standard heap tables (e.g.
create table_1 (column_a primary key) organization index as select ...
). Note that would require primary keys to be specified, but I don't think that's an issue here.– Boneist
Nov 20 '18 at 16:42
If I were you, given there are a small number of columns, it might be worth storing the table 1 and 2 data in Index organised tables, rather than standard heap tables (e.g.
create table_1 (column_a primary key) organization index as select ...
). Note that would require primary keys to be specified, but I don't think that's an issue here.– Boneist
Nov 20 '18 at 16:42
@Boneist Thank you for your suggestion. I will give IOTs a try tomorrow. Where do you think the benefit will come from? Table 1 as IOT or Table2 as IOT? Or is there an even greater benefit when both are IOT? Creating indexes (A on T1, A,B on T2) should yield the same result in terms of query speed, as no table access is needed after index access? How can an index on T1 help, as we need all rows anyways? What would a 'perfect non-parallel all-rows' execution plan look like in your opinion?
– Peter
Nov 20 '18 at 17:35
@Boneist Thank you for your suggestion. I will give IOTs a try tomorrow. Where do you think the benefit will come from? Table 1 as IOT or Table2 as IOT? Or is there an even greater benefit when both are IOT? Creating indexes (A on T1, A,B on T2) should yield the same result in terms of query speed, as no table access is needed after index access? How can an index on T1 help, as we need all rows anyways? What would a 'perfect non-parallel all-rows' execution plan look like in your opinion?
– Peter
Nov 20 '18 at 17:35
1
1
Simplest way ist to use a hash join, with efficient hardware use parallel query option. Nologging and parallel table creation would have less effect as the base of the elapsed time is in the join (simple test). As the tables are newly created and contains only the relevant column, indexes are of no use for hash join in your case (they would be greater that the table).
– Marmite Bomber
Nov 20 '18 at 18:07
Simplest way ist to use a hash join, with efficient hardware use parallel query option. Nologging and parallel table creation would have less effect as the base of the elapsed time is in the join (simple test). As the tables are newly created and contains only the relevant column, indexes are of no use for hash join in your case (they would be greater that the table).
– Marmite Bomber
Nov 20 '18 at 18:07
|
show 1 more comment
1 Answer
1
active
oldest
votes
Start with the simplest possibility using this query
create table c as
select /*+ parallel(6) */ a.a, b.b
from a
left outer join b
on a.a = b.a
;
It will use a hash join, adjust parallel degree based on your hardware setup.
The expected execution plan is as follows
--------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------------
| 0 | CREATE TABLE STATEMENT | | 261M| 6238M| 56799 (1)| 00:03:48 | | | |
| 1 | PX COORDINATOR | | | | | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10002 | 261M| 6238M| 31984 (1)| 00:02:08 | Q1,02 | P->S | QC (RAND) |
| 3 | LOAD AS SELECT | C | | | | | Q1,02 | PCWP | |
|* 4 | HASH JOIN OUTER | | 261M| 6238M| 31984 (1)| 00:02:08 | Q1,02 | PCWP | |
| 5 | PX RECEIVE | | 7849K| 44M| 726 (1)| 00:00:03 | Q1,02 | PCWP | |
| 6 | PX SEND HASH | :TQ10000 | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | P->P | HASH |
| 7 | PX BLOCK ITERATOR | | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | PCWC | |
| 8 | TABLE ACCESS STORAGE FULL| A | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | PCWP | |
| 9 | PX RECEIVE | | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,02 | PCWP | |
| 10 | PX SEND HASH | :TQ10001 | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | P->P | HASH |
| 11 | PX BLOCK ITERATOR | | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | PCWC | |
| 12 | TABLE ACCESS STORAGE FULL| B | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | PCWP | |
--------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("A"."A"="B"."A"(+))
My test with synthetic data of your size is 35 seconds.
TYVM! I understand that you say that a hash join (not nested loop and not sort merge) is the way to go. I understand you also see no benefit in using indexes for the query speed here (i guess you mean it would do a FFS, which would be the same as the FTS on the table). May I ask: Could it be that your query loses about 50% speed (guessing here) by not telling Oracle that it can stop for a t1 row after it found the first match in t2 for that t1 row? I tried to tell Oracle that with my rownum = 1, but I guess a unique constraint or similar could do the trick? Do you see potential in this?
– Peter
Nov 20 '18 at 19:24
Definitively no in your case. Has join probes the smaller table and scans the other one. No need to stop after first match. And if you would have dups in theB
table the join would return duplicated records. What you address is query usingIN
orNOT EXISTS
which leads tosemi
oranti
join with exact optimization on stopping after first row you mentioned.
– Marmite Bomber
Nov 20 '18 at 19:34
I think I understand. It's only doing one loop through table B anyway, so 'stopping' in scanning B does not help/work. You helped me a lot, thank you. The last bit I am missing now I think is the question why an index on table A (the one that is hashed and probed) does not speed things up. Two questions I guess: Does the hashing already include putting the records in order? Can't the index save the time of hashing (and/or sorting if it does that), as the index can be probed the same way as the hash-table? TY!
– Peter
Nov 20 '18 at 19:57
The smaler table is hashed (not probed as I noted above and I can't edit it any more), which means not ordering but a direct key access. Basically both tables must be processed completely, i.e. all rows. Here is index a problem, not a help. FULL TABLE SCAN rules:)
– Marmite Bomber
Nov 20 '18 at 20:15
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53396506%2fjoin-two-big-tables-fast%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
Start with the simplest possibility using this query
create table c as
select /*+ parallel(6) */ a.a, b.b
from a
left outer join b
on a.a = b.a
;
It will use a hash join, adjust parallel degree based on your hardware setup.
The expected execution plan is as follows
--------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------------
| 0 | CREATE TABLE STATEMENT | | 261M| 6238M| 56799 (1)| 00:03:48 | | | |
| 1 | PX COORDINATOR | | | | | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10002 | 261M| 6238M| 31984 (1)| 00:02:08 | Q1,02 | P->S | QC (RAND) |
| 3 | LOAD AS SELECT | C | | | | | Q1,02 | PCWP | |
|* 4 | HASH JOIN OUTER | | 261M| 6238M| 31984 (1)| 00:02:08 | Q1,02 | PCWP | |
| 5 | PX RECEIVE | | 7849K| 44M| 726 (1)| 00:00:03 | Q1,02 | PCWP | |
| 6 | PX SEND HASH | :TQ10000 | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | P->P | HASH |
| 7 | PX BLOCK ITERATOR | | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | PCWC | |
| 8 | TABLE ACCESS STORAGE FULL| A | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | PCWP | |
| 9 | PX RECEIVE | | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,02 | PCWP | |
| 10 | PX SEND HASH | :TQ10001 | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | P->P | HASH |
| 11 | PX BLOCK ITERATOR | | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | PCWC | |
| 12 | TABLE ACCESS STORAGE FULL| B | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | PCWP | |
--------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("A"."A"="B"."A"(+))
My test with synthetic data of your size is 35 seconds.
TYVM! I understand that you say that a hash join (not nested loop and not sort merge) is the way to go. I understand you also see no benefit in using indexes for the query speed here (i guess you mean it would do a FFS, which would be the same as the FTS on the table). May I ask: Could it be that your query loses about 50% speed (guessing here) by not telling Oracle that it can stop for a t1 row after it found the first match in t2 for that t1 row? I tried to tell Oracle that with my rownum = 1, but I guess a unique constraint or similar could do the trick? Do you see potential in this?
– Peter
Nov 20 '18 at 19:24
Definitively no in your case. Has join probes the smaller table and scans the other one. No need to stop after first match. And if you would have dups in theB
table the join would return duplicated records. What you address is query usingIN
orNOT EXISTS
which leads tosemi
oranti
join with exact optimization on stopping after first row you mentioned.
– Marmite Bomber
Nov 20 '18 at 19:34
I think I understand. It's only doing one loop through table B anyway, so 'stopping' in scanning B does not help/work. You helped me a lot, thank you. The last bit I am missing now I think is the question why an index on table A (the one that is hashed and probed) does not speed things up. Two questions I guess: Does the hashing already include putting the records in order? Can't the index save the time of hashing (and/or sorting if it does that), as the index can be probed the same way as the hash-table? TY!
– Peter
Nov 20 '18 at 19:57
The smaler table is hashed (not probed as I noted above and I can't edit it any more), which means not ordering but a direct key access. Basically both tables must be processed completely, i.e. all rows. Here is index a problem, not a help. FULL TABLE SCAN rules:)
– Marmite Bomber
Nov 20 '18 at 20:15
add a comment |
Start with the simplest possibility using this query
create table c as
select /*+ parallel(6) */ a.a, b.b
from a
left outer join b
on a.a = b.a
;
It will use a hash join, adjust parallel degree based on your hardware setup.
The expected execution plan is as follows
--------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------------
| 0 | CREATE TABLE STATEMENT | | 261M| 6238M| 56799 (1)| 00:03:48 | | | |
| 1 | PX COORDINATOR | | | | | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10002 | 261M| 6238M| 31984 (1)| 00:02:08 | Q1,02 | P->S | QC (RAND) |
| 3 | LOAD AS SELECT | C | | | | | Q1,02 | PCWP | |
|* 4 | HASH JOIN OUTER | | 261M| 6238M| 31984 (1)| 00:02:08 | Q1,02 | PCWP | |
| 5 | PX RECEIVE | | 7849K| 44M| 726 (1)| 00:00:03 | Q1,02 | PCWP | |
| 6 | PX SEND HASH | :TQ10000 | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | P->P | HASH |
| 7 | PX BLOCK ITERATOR | | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | PCWC | |
| 8 | TABLE ACCESS STORAGE FULL| A | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | PCWP | |
| 9 | PX RECEIVE | | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,02 | PCWP | |
| 10 | PX SEND HASH | :TQ10001 | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | P->P | HASH |
| 11 | PX BLOCK ITERATOR | | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | PCWC | |
| 12 | TABLE ACCESS STORAGE FULL| B | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | PCWP | |
--------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("A"."A"="B"."A"(+))
My test with synthetic data of your size is 35 seconds.
TYVM! I understand that you say that a hash join (not nested loop and not sort merge) is the way to go. I understand you also see no benefit in using indexes for the query speed here (i guess you mean it would do a FFS, which would be the same as the FTS on the table). May I ask: Could it be that your query loses about 50% speed (guessing here) by not telling Oracle that it can stop for a t1 row after it found the first match in t2 for that t1 row? I tried to tell Oracle that with my rownum = 1, but I guess a unique constraint or similar could do the trick? Do you see potential in this?
– Peter
Nov 20 '18 at 19:24
Definitively no in your case. Has join probes the smaller table and scans the other one. No need to stop after first match. And if you would have dups in theB
table the join would return duplicated records. What you address is query usingIN
orNOT EXISTS
which leads tosemi
oranti
join with exact optimization on stopping after first row you mentioned.
– Marmite Bomber
Nov 20 '18 at 19:34
I think I understand. It's only doing one loop through table B anyway, so 'stopping' in scanning B does not help/work. You helped me a lot, thank you. The last bit I am missing now I think is the question why an index on table A (the one that is hashed and probed) does not speed things up. Two questions I guess: Does the hashing already include putting the records in order? Can't the index save the time of hashing (and/or sorting if it does that), as the index can be probed the same way as the hash-table? TY!
– Peter
Nov 20 '18 at 19:57
The smaler table is hashed (not probed as I noted above and I can't edit it any more), which means not ordering but a direct key access. Basically both tables must be processed completely, i.e. all rows. Here is index a problem, not a help. FULL TABLE SCAN rules:)
– Marmite Bomber
Nov 20 '18 at 20:15
add a comment |
Start with the simplest possibility using this query
create table c as
select /*+ parallel(6) */ a.a, b.b
from a
left outer join b
on a.a = b.a
;
It will use a hash join, adjust parallel degree based on your hardware setup.
The expected execution plan is as follows
--------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------------
| 0 | CREATE TABLE STATEMENT | | 261M| 6238M| 56799 (1)| 00:03:48 | | | |
| 1 | PX COORDINATOR | | | | | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10002 | 261M| 6238M| 31984 (1)| 00:02:08 | Q1,02 | P->S | QC (RAND) |
| 3 | LOAD AS SELECT | C | | | | | Q1,02 | PCWP | |
|* 4 | HASH JOIN OUTER | | 261M| 6238M| 31984 (1)| 00:02:08 | Q1,02 | PCWP | |
| 5 | PX RECEIVE | | 7849K| 44M| 726 (1)| 00:00:03 | Q1,02 | PCWP | |
| 6 | PX SEND HASH | :TQ10000 | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | P->P | HASH |
| 7 | PX BLOCK ITERATOR | | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | PCWC | |
| 8 | TABLE ACCESS STORAGE FULL| A | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | PCWP | |
| 9 | PX RECEIVE | | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,02 | PCWP | |
| 10 | PX SEND HASH | :TQ10001 | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | P->P | HASH |
| 11 | PX BLOCK ITERATOR | | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | PCWC | |
| 12 | TABLE ACCESS STORAGE FULL| B | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | PCWP | |
--------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("A"."A"="B"."A"(+))
My test with synthetic data of your size is 35 seconds.
Start with the simplest possibility using this query
create table c as
select /*+ parallel(6) */ a.a, b.b
from a
left outer join b
on a.a = b.a
;
It will use a hash join, adjust parallel degree based on your hardware setup.
The expected execution plan is as follows
--------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------------
| 0 | CREATE TABLE STATEMENT | | 261M| 6238M| 56799 (1)| 00:03:48 | | | |
| 1 | PX COORDINATOR | | | | | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10002 | 261M| 6238M| 31984 (1)| 00:02:08 | Q1,02 | P->S | QC (RAND) |
| 3 | LOAD AS SELECT | C | | | | | Q1,02 | PCWP | |
|* 4 | HASH JOIN OUTER | | 261M| 6238M| 31984 (1)| 00:02:08 | Q1,02 | PCWP | |
| 5 | PX RECEIVE | | 7849K| 44M| 726 (1)| 00:00:03 | Q1,02 | PCWP | |
| 6 | PX SEND HASH | :TQ10000 | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | P->P | HASH |
| 7 | PX BLOCK ITERATOR | | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | PCWC | |
| 8 | TABLE ACCESS STORAGE FULL| A | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | PCWP | |
| 9 | PX RECEIVE | | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,02 | PCWP | |
| 10 | PX SEND HASH | :TQ10001 | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | P->P | HASH |
| 11 | PX BLOCK ITERATOR | | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | PCWC | |
| 12 | TABLE ACCESS STORAGE FULL| B | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | PCWP | |
--------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("A"."A"="B"."A"(+))
My test with synthetic data of your size is 35 seconds.
answered Nov 20 '18 at 18:17


Marmite BomberMarmite Bomber
7,70431033
7,70431033
TYVM! I understand that you say that a hash join (not nested loop and not sort merge) is the way to go. I understand you also see no benefit in using indexes for the query speed here (i guess you mean it would do a FFS, which would be the same as the FTS on the table). May I ask: Could it be that your query loses about 50% speed (guessing here) by not telling Oracle that it can stop for a t1 row after it found the first match in t2 for that t1 row? I tried to tell Oracle that with my rownum = 1, but I guess a unique constraint or similar could do the trick? Do you see potential in this?
– Peter
Nov 20 '18 at 19:24
Definitively no in your case. Has join probes the smaller table and scans the other one. No need to stop after first match. And if you would have dups in theB
table the join would return duplicated records. What you address is query usingIN
orNOT EXISTS
which leads tosemi
oranti
join with exact optimization on stopping after first row you mentioned.
– Marmite Bomber
Nov 20 '18 at 19:34
I think I understand. It's only doing one loop through table B anyway, so 'stopping' in scanning B does not help/work. You helped me a lot, thank you. The last bit I am missing now I think is the question why an index on table A (the one that is hashed and probed) does not speed things up. Two questions I guess: Does the hashing already include putting the records in order? Can't the index save the time of hashing (and/or sorting if it does that), as the index can be probed the same way as the hash-table? TY!
– Peter
Nov 20 '18 at 19:57
The smaler table is hashed (not probed as I noted above and I can't edit it any more), which means not ordering but a direct key access. Basically both tables must be processed completely, i.e. all rows. Here is index a problem, not a help. FULL TABLE SCAN rules:)
– Marmite Bomber
Nov 20 '18 at 20:15
add a comment |
TYVM! I understand that you say that a hash join (not nested loop and not sort merge) is the way to go. I understand you also see no benefit in using indexes for the query speed here (i guess you mean it would do a FFS, which would be the same as the FTS on the table). May I ask: Could it be that your query loses about 50% speed (guessing here) by not telling Oracle that it can stop for a t1 row after it found the first match in t2 for that t1 row? I tried to tell Oracle that with my rownum = 1, but I guess a unique constraint or similar could do the trick? Do you see potential in this?
– Peter
Nov 20 '18 at 19:24
Definitively no in your case. Has join probes the smaller table and scans the other one. No need to stop after first match. And if you would have dups in theB
table the join would return duplicated records. What you address is query usingIN
orNOT EXISTS
which leads tosemi
oranti
join with exact optimization on stopping after first row you mentioned.
– Marmite Bomber
Nov 20 '18 at 19:34
I think I understand. It's only doing one loop through table B anyway, so 'stopping' in scanning B does not help/work. You helped me a lot, thank you. The last bit I am missing now I think is the question why an index on table A (the one that is hashed and probed) does not speed things up. Two questions I guess: Does the hashing already include putting the records in order? Can't the index save the time of hashing (and/or sorting if it does that), as the index can be probed the same way as the hash-table? TY!
– Peter
Nov 20 '18 at 19:57
The smaler table is hashed (not probed as I noted above and I can't edit it any more), which means not ordering but a direct key access. Basically both tables must be processed completely, i.e. all rows. Here is index a problem, not a help. FULL TABLE SCAN rules:)
– Marmite Bomber
Nov 20 '18 at 20:15
TYVM! I understand that you say that a hash join (not nested loop and not sort merge) is the way to go. I understand you also see no benefit in using indexes for the query speed here (i guess you mean it would do a FFS, which would be the same as the FTS on the table). May I ask: Could it be that your query loses about 50% speed (guessing here) by not telling Oracle that it can stop for a t1 row after it found the first match in t2 for that t1 row? I tried to tell Oracle that with my rownum = 1, but I guess a unique constraint or similar could do the trick? Do you see potential in this?
– Peter
Nov 20 '18 at 19:24
TYVM! I understand that you say that a hash join (not nested loop and not sort merge) is the way to go. I understand you also see no benefit in using indexes for the query speed here (i guess you mean it would do a FFS, which would be the same as the FTS on the table). May I ask: Could it be that your query loses about 50% speed (guessing here) by not telling Oracle that it can stop for a t1 row after it found the first match in t2 for that t1 row? I tried to tell Oracle that with my rownum = 1, but I guess a unique constraint or similar could do the trick? Do you see potential in this?
– Peter
Nov 20 '18 at 19:24
Definitively no in your case. Has join probes the smaller table and scans the other one. No need to stop after first match. And if you would have dups in the
B
table the join would return duplicated records. What you address is query using IN
or NOT EXISTS
which leads to semi
or anti
join with exact optimization on stopping after first row you mentioned.– Marmite Bomber
Nov 20 '18 at 19:34
Definitively no in your case. Has join probes the smaller table and scans the other one. No need to stop after first match. And if you would have dups in the
B
table the join would return duplicated records. What you address is query using IN
or NOT EXISTS
which leads to semi
or anti
join with exact optimization on stopping after first row you mentioned.– Marmite Bomber
Nov 20 '18 at 19:34
I think I understand. It's only doing one loop through table B anyway, so 'stopping' in scanning B does not help/work. You helped me a lot, thank you. The last bit I am missing now I think is the question why an index on table A (the one that is hashed and probed) does not speed things up. Two questions I guess: Does the hashing already include putting the records in order? Can't the index save the time of hashing (and/or sorting if it does that), as the index can be probed the same way as the hash-table? TY!
– Peter
Nov 20 '18 at 19:57
I think I understand. It's only doing one loop through table B anyway, so 'stopping' in scanning B does not help/work. You helped me a lot, thank you. The last bit I am missing now I think is the question why an index on table A (the one that is hashed and probed) does not speed things up. Two questions I guess: Does the hashing already include putting the records in order? Can't the index save the time of hashing (and/or sorting if it does that), as the index can be probed the same way as the hash-table? TY!
– Peter
Nov 20 '18 at 19:57
The smaler table is hashed (not probed as I noted above and I can't edit it any more), which means not ordering but a direct key access. Basically both tables must be processed completely, i.e. all rows. Here is index a problem, not a help. FULL TABLE SCAN rules:)
– Marmite Bomber
Nov 20 '18 at 20:15
The smaler table is hashed (not probed as I noted above and I can't edit it any more), which means not ordering but a direct key access. Basically both tables must be processed completely, i.e. all rows. Here is index a problem, not a help. FULL TABLE SCAN rules:)
– Marmite Bomber
Nov 20 '18 at 20:15
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53396506%2fjoin-two-big-tables-fast%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
is there a foreign key from table 1 to table 2?
– Boneist
Nov 20 '18 at 16:16
@Boneist Thanks for your question. I updated my question to make this clearer. There is no foreign key (yet, but we can create one if you think this speeds things up considerably).
– Peter
Nov 20 '18 at 16:27
1
If I were you, given there are a small number of columns, it might be worth storing the table 1 and 2 data in Index organised tables, rather than standard heap tables (e.g.
create table_1 (column_a primary key) organization index as select ...
). Note that would require primary keys to be specified, but I don't think that's an issue here.– Boneist
Nov 20 '18 at 16:42
@Boneist Thank you for your suggestion. I will give IOTs a try tomorrow. Where do you think the benefit will come from? Table 1 as IOT or Table2 as IOT? Or is there an even greater benefit when both are IOT? Creating indexes (A on T1, A,B on T2) should yield the same result in terms of query speed, as no table access is needed after index access? How can an index on T1 help, as we need all rows anyways? What would a 'perfect non-parallel all-rows' execution plan look like in your opinion?
– Peter
Nov 20 '18 at 17:35
1
Simplest way ist to use a hash join, with efficient hardware use parallel query option. Nologging and parallel table creation would have less effect as the base of the elapsed time is in the join (simple test). As the tables are newly created and contains only the relevant column, indexes are of no use for hash join in your case (they would be greater that the table).
– Marmite Bomber
Nov 20 '18 at 18:07