Join Two Big Tables Fast












1















after research via playing with the queries and doing online research, I am turning to you for input. Looking forward to your replies! I am trying to write this in a general way as I am looking for general ideas on how to do this, not for exact statements. If this question is received poorly, I am happy to rework it, let me know. Here we go:



I have two tables:




  • Table 1 has 10 MIO records. It has only one column:


    • Column A: A unique ID (a text e.g. '5uz1g09aksmsd')



  • Table 2 has 300 MIO records. It has two columns:


    • Column A: A unique ID (a text e.g. '5uz1g09aksmsd')

    • Column B: A number (e.g. 32.5432)




There are no indices etc. yet, not even a PK as both tables have just been created via CTAS. Both tables have been analysed.



I am now looking for a fast query (the prep-work e.g. index creation can take time, no problem) to create a third table that contains the 10 MIO rows of Table 1, and Column B of Table 2 if a match is found via Column A (for 99% a match will be found). To make it clearer, a simple CTAS could be:



create table3 as
select t1.a,
(select t2.b from table2 t2 where t2.a = t1.a and rownum = 1)
-- the rownum = 1 is to show Oracle that there can only be one match
from table1 t1;


This is not as fast as it can be (I hope). What are your ideas to make it faster? Creating indices? Which ones? What kind of join would you want to see in the execution plan? A hash join? I already found




  • create table nologging

  • parallel query execution and parallel table creation


But I am interested in the specifics of how the perfect execution plan for this would look given that we are allowed to alter the system (e.g. create an index). In particular I am asking for 11gR2 but 12c comments are also very welcome.










share|improve this question

























  • is there a foreign key from table 1 to table 2?

    – Boneist
    Nov 20 '18 at 16:16











  • @Boneist Thanks for your question. I updated my question to make this clearer. There is no foreign key (yet, but we can create one if you think this speeds things up considerably).

    – Peter
    Nov 20 '18 at 16:27






  • 1





    If I were you, given there are a small number of columns, it might be worth storing the table 1 and 2 data in Index organised tables, rather than standard heap tables (e.g. create table_1 (column_a primary key) organization index as select ...). Note that would require primary keys to be specified, but I don't think that's an issue here.

    – Boneist
    Nov 20 '18 at 16:42











  • @Boneist Thank you for your suggestion. I will give IOTs a try tomorrow. Where do you think the benefit will come from? Table 1 as IOT or Table2 as IOT? Or is there an even greater benefit when both are IOT? Creating indexes (A on T1, A,B on T2) should yield the same result in terms of query speed, as no table access is needed after index access? How can an index on T1 help, as we need all rows anyways? What would a 'perfect non-parallel all-rows' execution plan look like in your opinion?

    – Peter
    Nov 20 '18 at 17:35






  • 1





    Simplest way ist to use a hash join, with efficient hardware use parallel query option. Nologging and parallel table creation would have less effect as the base of the elapsed time is in the join (simple test). As the tables are newly created and contains only the relevant column, indexes are of no use for hash join in your case (they would be greater that the table).

    – Marmite Bomber
    Nov 20 '18 at 18:07


















1















after research via playing with the queries and doing online research, I am turning to you for input. Looking forward to your replies! I am trying to write this in a general way as I am looking for general ideas on how to do this, not for exact statements. If this question is received poorly, I am happy to rework it, let me know. Here we go:



I have two tables:




  • Table 1 has 10 MIO records. It has only one column:


    • Column A: A unique ID (a text e.g. '5uz1g09aksmsd')



  • Table 2 has 300 MIO records. It has two columns:


    • Column A: A unique ID (a text e.g. '5uz1g09aksmsd')

    • Column B: A number (e.g. 32.5432)




There are no indices etc. yet, not even a PK as both tables have just been created via CTAS. Both tables have been analysed.



I am now looking for a fast query (the prep-work e.g. index creation can take time, no problem) to create a third table that contains the 10 MIO rows of Table 1, and Column B of Table 2 if a match is found via Column A (for 99% a match will be found). To make it clearer, a simple CTAS could be:



create table3 as
select t1.a,
(select t2.b from table2 t2 where t2.a = t1.a and rownum = 1)
-- the rownum = 1 is to show Oracle that there can only be one match
from table1 t1;


This is not as fast as it can be (I hope). What are your ideas to make it faster? Creating indices? Which ones? What kind of join would you want to see in the execution plan? A hash join? I already found




  • create table nologging

  • parallel query execution and parallel table creation


But I am interested in the specifics of how the perfect execution plan for this would look given that we are allowed to alter the system (e.g. create an index). In particular I am asking for 11gR2 but 12c comments are also very welcome.










share|improve this question

























  • is there a foreign key from table 1 to table 2?

    – Boneist
    Nov 20 '18 at 16:16











  • @Boneist Thanks for your question. I updated my question to make this clearer. There is no foreign key (yet, but we can create one if you think this speeds things up considerably).

    – Peter
    Nov 20 '18 at 16:27






  • 1





    If I were you, given there are a small number of columns, it might be worth storing the table 1 and 2 data in Index organised tables, rather than standard heap tables (e.g. create table_1 (column_a primary key) organization index as select ...). Note that would require primary keys to be specified, but I don't think that's an issue here.

    – Boneist
    Nov 20 '18 at 16:42











  • @Boneist Thank you for your suggestion. I will give IOTs a try tomorrow. Where do you think the benefit will come from? Table 1 as IOT or Table2 as IOT? Or is there an even greater benefit when both are IOT? Creating indexes (A on T1, A,B on T2) should yield the same result in terms of query speed, as no table access is needed after index access? How can an index on T1 help, as we need all rows anyways? What would a 'perfect non-parallel all-rows' execution plan look like in your opinion?

    – Peter
    Nov 20 '18 at 17:35






  • 1





    Simplest way ist to use a hash join, with efficient hardware use parallel query option. Nologging and parallel table creation would have less effect as the base of the elapsed time is in the join (simple test). As the tables are newly created and contains only the relevant column, indexes are of no use for hash join in your case (they would be greater that the table).

    – Marmite Bomber
    Nov 20 '18 at 18:07
















1












1








1








after research via playing with the queries and doing online research, I am turning to you for input. Looking forward to your replies! I am trying to write this in a general way as I am looking for general ideas on how to do this, not for exact statements. If this question is received poorly, I am happy to rework it, let me know. Here we go:



I have two tables:




  • Table 1 has 10 MIO records. It has only one column:


    • Column A: A unique ID (a text e.g. '5uz1g09aksmsd')



  • Table 2 has 300 MIO records. It has two columns:


    • Column A: A unique ID (a text e.g. '5uz1g09aksmsd')

    • Column B: A number (e.g. 32.5432)




There are no indices etc. yet, not even a PK as both tables have just been created via CTAS. Both tables have been analysed.



I am now looking for a fast query (the prep-work e.g. index creation can take time, no problem) to create a third table that contains the 10 MIO rows of Table 1, and Column B of Table 2 if a match is found via Column A (for 99% a match will be found). To make it clearer, a simple CTAS could be:



create table3 as
select t1.a,
(select t2.b from table2 t2 where t2.a = t1.a and rownum = 1)
-- the rownum = 1 is to show Oracle that there can only be one match
from table1 t1;


This is not as fast as it can be (I hope). What are your ideas to make it faster? Creating indices? Which ones? What kind of join would you want to see in the execution plan? A hash join? I already found




  • create table nologging

  • parallel query execution and parallel table creation


But I am interested in the specifics of how the perfect execution plan for this would look given that we are allowed to alter the system (e.g. create an index). In particular I am asking for 11gR2 but 12c comments are also very welcome.










share|improve this question
















after research via playing with the queries and doing online research, I am turning to you for input. Looking forward to your replies! I am trying to write this in a general way as I am looking for general ideas on how to do this, not for exact statements. If this question is received poorly, I am happy to rework it, let me know. Here we go:



I have two tables:




  • Table 1 has 10 MIO records. It has only one column:


    • Column A: A unique ID (a text e.g. '5uz1g09aksmsd')



  • Table 2 has 300 MIO records. It has two columns:


    • Column A: A unique ID (a text e.g. '5uz1g09aksmsd')

    • Column B: A number (e.g. 32.5432)




There are no indices etc. yet, not even a PK as both tables have just been created via CTAS. Both tables have been analysed.



I am now looking for a fast query (the prep-work e.g. index creation can take time, no problem) to create a third table that contains the 10 MIO rows of Table 1, and Column B of Table 2 if a match is found via Column A (for 99% a match will be found). To make it clearer, a simple CTAS could be:



create table3 as
select t1.a,
(select t2.b from table2 t2 where t2.a = t1.a and rownum = 1)
-- the rownum = 1 is to show Oracle that there can only be one match
from table1 t1;


This is not as fast as it can be (I hope). What are your ideas to make it faster? Creating indices? Which ones? What kind of join would you want to see in the execution plan? A hash join? I already found




  • create table nologging

  • parallel query execution and parallel table creation


But I am interested in the specifics of how the perfect execution plan for this would look given that we are allowed to alter the system (e.g. create an index). In particular I am asking for 11gR2 but 12c comments are also very welcome.







oracle






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 20 '18 at 16:55







Peter

















asked Nov 20 '18 at 15:39









PeterPeter

3739




3739













  • is there a foreign key from table 1 to table 2?

    – Boneist
    Nov 20 '18 at 16:16











  • @Boneist Thanks for your question. I updated my question to make this clearer. There is no foreign key (yet, but we can create one if you think this speeds things up considerably).

    – Peter
    Nov 20 '18 at 16:27






  • 1





    If I were you, given there are a small number of columns, it might be worth storing the table 1 and 2 data in Index organised tables, rather than standard heap tables (e.g. create table_1 (column_a primary key) organization index as select ...). Note that would require primary keys to be specified, but I don't think that's an issue here.

    – Boneist
    Nov 20 '18 at 16:42











  • @Boneist Thank you for your suggestion. I will give IOTs a try tomorrow. Where do you think the benefit will come from? Table 1 as IOT or Table2 as IOT? Or is there an even greater benefit when both are IOT? Creating indexes (A on T1, A,B on T2) should yield the same result in terms of query speed, as no table access is needed after index access? How can an index on T1 help, as we need all rows anyways? What would a 'perfect non-parallel all-rows' execution plan look like in your opinion?

    – Peter
    Nov 20 '18 at 17:35






  • 1





    Simplest way ist to use a hash join, with efficient hardware use parallel query option. Nologging and parallel table creation would have less effect as the base of the elapsed time is in the join (simple test). As the tables are newly created and contains only the relevant column, indexes are of no use for hash join in your case (they would be greater that the table).

    – Marmite Bomber
    Nov 20 '18 at 18:07





















  • is there a foreign key from table 1 to table 2?

    – Boneist
    Nov 20 '18 at 16:16











  • @Boneist Thanks for your question. I updated my question to make this clearer. There is no foreign key (yet, but we can create one if you think this speeds things up considerably).

    – Peter
    Nov 20 '18 at 16:27






  • 1





    If I were you, given there are a small number of columns, it might be worth storing the table 1 and 2 data in Index organised tables, rather than standard heap tables (e.g. create table_1 (column_a primary key) organization index as select ...). Note that would require primary keys to be specified, but I don't think that's an issue here.

    – Boneist
    Nov 20 '18 at 16:42











  • @Boneist Thank you for your suggestion. I will give IOTs a try tomorrow. Where do you think the benefit will come from? Table 1 as IOT or Table2 as IOT? Or is there an even greater benefit when both are IOT? Creating indexes (A on T1, A,B on T2) should yield the same result in terms of query speed, as no table access is needed after index access? How can an index on T1 help, as we need all rows anyways? What would a 'perfect non-parallel all-rows' execution plan look like in your opinion?

    – Peter
    Nov 20 '18 at 17:35






  • 1





    Simplest way ist to use a hash join, with efficient hardware use parallel query option. Nologging and parallel table creation would have less effect as the base of the elapsed time is in the join (simple test). As the tables are newly created and contains only the relevant column, indexes are of no use for hash join in your case (they would be greater that the table).

    – Marmite Bomber
    Nov 20 '18 at 18:07



















is there a foreign key from table 1 to table 2?

– Boneist
Nov 20 '18 at 16:16





is there a foreign key from table 1 to table 2?

– Boneist
Nov 20 '18 at 16:16













@Boneist Thanks for your question. I updated my question to make this clearer. There is no foreign key (yet, but we can create one if you think this speeds things up considerably).

– Peter
Nov 20 '18 at 16:27





@Boneist Thanks for your question. I updated my question to make this clearer. There is no foreign key (yet, but we can create one if you think this speeds things up considerably).

– Peter
Nov 20 '18 at 16:27




1




1





If I were you, given there are a small number of columns, it might be worth storing the table 1 and 2 data in Index organised tables, rather than standard heap tables (e.g. create table_1 (column_a primary key) organization index as select ...). Note that would require primary keys to be specified, but I don't think that's an issue here.

– Boneist
Nov 20 '18 at 16:42





If I were you, given there are a small number of columns, it might be worth storing the table 1 and 2 data in Index organised tables, rather than standard heap tables (e.g. create table_1 (column_a primary key) organization index as select ...). Note that would require primary keys to be specified, but I don't think that's an issue here.

– Boneist
Nov 20 '18 at 16:42













@Boneist Thank you for your suggestion. I will give IOTs a try tomorrow. Where do you think the benefit will come from? Table 1 as IOT or Table2 as IOT? Or is there an even greater benefit when both are IOT? Creating indexes (A on T1, A,B on T2) should yield the same result in terms of query speed, as no table access is needed after index access? How can an index on T1 help, as we need all rows anyways? What would a 'perfect non-parallel all-rows' execution plan look like in your opinion?

– Peter
Nov 20 '18 at 17:35





@Boneist Thank you for your suggestion. I will give IOTs a try tomorrow. Where do you think the benefit will come from? Table 1 as IOT or Table2 as IOT? Or is there an even greater benefit when both are IOT? Creating indexes (A on T1, A,B on T2) should yield the same result in terms of query speed, as no table access is needed after index access? How can an index on T1 help, as we need all rows anyways? What would a 'perfect non-parallel all-rows' execution plan look like in your opinion?

– Peter
Nov 20 '18 at 17:35




1




1





Simplest way ist to use a hash join, with efficient hardware use parallel query option. Nologging and parallel table creation would have less effect as the base of the elapsed time is in the join (simple test). As the tables are newly created and contains only the relevant column, indexes are of no use for hash join in your case (they would be greater that the table).

– Marmite Bomber
Nov 20 '18 at 18:07







Simplest way ist to use a hash join, with efficient hardware use parallel query option. Nologging and parallel table creation would have less effect as the base of the elapsed time is in the join (simple test). As the tables are newly created and contains only the relevant column, indexes are of no use for hash join in your case (they would be greater that the table).

– Marmite Bomber
Nov 20 '18 at 18:07














1 Answer
1






active

oldest

votes


















2














Start with the simplest possibility using this query



create table c as 
select /*+ parallel(6) */ a.a, b.b
from a
left outer join b
on a.a = b.a
;


It will use a hash join, adjust parallel degree based on your hardware setup.



The expected execution plan is as follows



    --------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------------
| 0 | CREATE TABLE STATEMENT | | 261M| 6238M| 56799 (1)| 00:03:48 | | | |
| 1 | PX COORDINATOR | | | | | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10002 | 261M| 6238M| 31984 (1)| 00:02:08 | Q1,02 | P->S | QC (RAND) |
| 3 | LOAD AS SELECT | C | | | | | Q1,02 | PCWP | |
|* 4 | HASH JOIN OUTER | | 261M| 6238M| 31984 (1)| 00:02:08 | Q1,02 | PCWP | |
| 5 | PX RECEIVE | | 7849K| 44M| 726 (1)| 00:00:03 | Q1,02 | PCWP | |
| 6 | PX SEND HASH | :TQ10000 | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | P->P | HASH |
| 7 | PX BLOCK ITERATOR | | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | PCWC | |
| 8 | TABLE ACCESS STORAGE FULL| A | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | PCWP | |
| 9 | PX RECEIVE | | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,02 | PCWP | |
| 10 | PX SEND HASH | :TQ10001 | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | P->P | HASH |
| 11 | PX BLOCK ITERATOR | | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | PCWC | |
| 12 | TABLE ACCESS STORAGE FULL| B | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | PCWP | |
--------------------------------------------------------------------------------------------------------------------------


Predicate Information (identified by operation id):
---------------------------------------------------

4 - access("A"."A"="B"."A"(+))


My test with synthetic data of your size is 35 seconds.






share|improve this answer
























  • TYVM! I understand that you say that a hash join (not nested loop and not sort merge) is the way to go. I understand you also see no benefit in using indexes for the query speed here (i guess you mean it would do a FFS, which would be the same as the FTS on the table). May I ask: Could it be that your query loses about 50% speed (guessing here) by not telling Oracle that it can stop for a t1 row after it found the first match in t2 for that t1 row? I tried to tell Oracle that with my rownum = 1, but I guess a unique constraint or similar could do the trick? Do you see potential in this?

    – Peter
    Nov 20 '18 at 19:24











  • Definitively no in your case. Has join probes the smaller table and scans the other one. No need to stop after first match. And if you would have dups in the Btable the join would return duplicated records. What you address is query using IN or NOT EXISTS which leads to semior anti join with exact optimization on stopping after first row you mentioned.

    – Marmite Bomber
    Nov 20 '18 at 19:34













  • I think I understand. It's only doing one loop through table B anyway, so 'stopping' in scanning B does not help/work. You helped me a lot, thank you. The last bit I am missing now I think is the question why an index on table A (the one that is hashed and probed) does not speed things up. Two questions I guess: Does the hashing already include putting the records in order? Can't the index save the time of hashing (and/or sorting if it does that), as the index can be probed the same way as the hash-table? TY!

    – Peter
    Nov 20 '18 at 19:57











  • The smaler table is hashed (not probed as I noted above and I can't edit it any more), which means not ordering but a direct key access. Basically both tables must be processed completely, i.e. all rows. Here is index a problem, not a help. FULL TABLE SCAN rules:)

    – Marmite Bomber
    Nov 20 '18 at 20:15











Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53396506%2fjoin-two-big-tables-fast%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









2














Start with the simplest possibility using this query



create table c as 
select /*+ parallel(6) */ a.a, b.b
from a
left outer join b
on a.a = b.a
;


It will use a hash join, adjust parallel degree based on your hardware setup.



The expected execution plan is as follows



    --------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------------
| 0 | CREATE TABLE STATEMENT | | 261M| 6238M| 56799 (1)| 00:03:48 | | | |
| 1 | PX COORDINATOR | | | | | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10002 | 261M| 6238M| 31984 (1)| 00:02:08 | Q1,02 | P->S | QC (RAND) |
| 3 | LOAD AS SELECT | C | | | | | Q1,02 | PCWP | |
|* 4 | HASH JOIN OUTER | | 261M| 6238M| 31984 (1)| 00:02:08 | Q1,02 | PCWP | |
| 5 | PX RECEIVE | | 7849K| 44M| 726 (1)| 00:00:03 | Q1,02 | PCWP | |
| 6 | PX SEND HASH | :TQ10000 | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | P->P | HASH |
| 7 | PX BLOCK ITERATOR | | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | PCWC | |
| 8 | TABLE ACCESS STORAGE FULL| A | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | PCWP | |
| 9 | PX RECEIVE | | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,02 | PCWP | |
| 10 | PX SEND HASH | :TQ10001 | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | P->P | HASH |
| 11 | PX BLOCK ITERATOR | | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | PCWC | |
| 12 | TABLE ACCESS STORAGE FULL| B | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | PCWP | |
--------------------------------------------------------------------------------------------------------------------------


Predicate Information (identified by operation id):
---------------------------------------------------

4 - access("A"."A"="B"."A"(+))


My test with synthetic data of your size is 35 seconds.






share|improve this answer
























  • TYVM! I understand that you say that a hash join (not nested loop and not sort merge) is the way to go. I understand you also see no benefit in using indexes for the query speed here (i guess you mean it would do a FFS, which would be the same as the FTS on the table). May I ask: Could it be that your query loses about 50% speed (guessing here) by not telling Oracle that it can stop for a t1 row after it found the first match in t2 for that t1 row? I tried to tell Oracle that with my rownum = 1, but I guess a unique constraint or similar could do the trick? Do you see potential in this?

    – Peter
    Nov 20 '18 at 19:24











  • Definitively no in your case. Has join probes the smaller table and scans the other one. No need to stop after first match. And if you would have dups in the Btable the join would return duplicated records. What you address is query using IN or NOT EXISTS which leads to semior anti join with exact optimization on stopping after first row you mentioned.

    – Marmite Bomber
    Nov 20 '18 at 19:34













  • I think I understand. It's only doing one loop through table B anyway, so 'stopping' in scanning B does not help/work. You helped me a lot, thank you. The last bit I am missing now I think is the question why an index on table A (the one that is hashed and probed) does not speed things up. Two questions I guess: Does the hashing already include putting the records in order? Can't the index save the time of hashing (and/or sorting if it does that), as the index can be probed the same way as the hash-table? TY!

    – Peter
    Nov 20 '18 at 19:57











  • The smaler table is hashed (not probed as I noted above and I can't edit it any more), which means not ordering but a direct key access. Basically both tables must be processed completely, i.e. all rows. Here is index a problem, not a help. FULL TABLE SCAN rules:)

    – Marmite Bomber
    Nov 20 '18 at 20:15
















2














Start with the simplest possibility using this query



create table c as 
select /*+ parallel(6) */ a.a, b.b
from a
left outer join b
on a.a = b.a
;


It will use a hash join, adjust parallel degree based on your hardware setup.



The expected execution plan is as follows



    --------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------------
| 0 | CREATE TABLE STATEMENT | | 261M| 6238M| 56799 (1)| 00:03:48 | | | |
| 1 | PX COORDINATOR | | | | | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10002 | 261M| 6238M| 31984 (1)| 00:02:08 | Q1,02 | P->S | QC (RAND) |
| 3 | LOAD AS SELECT | C | | | | | Q1,02 | PCWP | |
|* 4 | HASH JOIN OUTER | | 261M| 6238M| 31984 (1)| 00:02:08 | Q1,02 | PCWP | |
| 5 | PX RECEIVE | | 7849K| 44M| 726 (1)| 00:00:03 | Q1,02 | PCWP | |
| 6 | PX SEND HASH | :TQ10000 | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | P->P | HASH |
| 7 | PX BLOCK ITERATOR | | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | PCWC | |
| 8 | TABLE ACCESS STORAGE FULL| A | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | PCWP | |
| 9 | PX RECEIVE | | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,02 | PCWP | |
| 10 | PX SEND HASH | :TQ10001 | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | P->P | HASH |
| 11 | PX BLOCK ITERATOR | | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | PCWC | |
| 12 | TABLE ACCESS STORAGE FULL| B | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | PCWP | |
--------------------------------------------------------------------------------------------------------------------------


Predicate Information (identified by operation id):
---------------------------------------------------

4 - access("A"."A"="B"."A"(+))


My test with synthetic data of your size is 35 seconds.






share|improve this answer
























  • TYVM! I understand that you say that a hash join (not nested loop and not sort merge) is the way to go. I understand you also see no benefit in using indexes for the query speed here (i guess you mean it would do a FFS, which would be the same as the FTS on the table). May I ask: Could it be that your query loses about 50% speed (guessing here) by not telling Oracle that it can stop for a t1 row after it found the first match in t2 for that t1 row? I tried to tell Oracle that with my rownum = 1, but I guess a unique constraint or similar could do the trick? Do you see potential in this?

    – Peter
    Nov 20 '18 at 19:24











  • Definitively no in your case. Has join probes the smaller table and scans the other one. No need to stop after first match. And if you would have dups in the Btable the join would return duplicated records. What you address is query using IN or NOT EXISTS which leads to semior anti join with exact optimization on stopping after first row you mentioned.

    – Marmite Bomber
    Nov 20 '18 at 19:34













  • I think I understand. It's only doing one loop through table B anyway, so 'stopping' in scanning B does not help/work. You helped me a lot, thank you. The last bit I am missing now I think is the question why an index on table A (the one that is hashed and probed) does not speed things up. Two questions I guess: Does the hashing already include putting the records in order? Can't the index save the time of hashing (and/or sorting if it does that), as the index can be probed the same way as the hash-table? TY!

    – Peter
    Nov 20 '18 at 19:57











  • The smaler table is hashed (not probed as I noted above and I can't edit it any more), which means not ordering but a direct key access. Basically both tables must be processed completely, i.e. all rows. Here is index a problem, not a help. FULL TABLE SCAN rules:)

    – Marmite Bomber
    Nov 20 '18 at 20:15














2












2








2







Start with the simplest possibility using this query



create table c as 
select /*+ parallel(6) */ a.a, b.b
from a
left outer join b
on a.a = b.a
;


It will use a hash join, adjust parallel degree based on your hardware setup.



The expected execution plan is as follows



    --------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------------
| 0 | CREATE TABLE STATEMENT | | 261M| 6238M| 56799 (1)| 00:03:48 | | | |
| 1 | PX COORDINATOR | | | | | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10002 | 261M| 6238M| 31984 (1)| 00:02:08 | Q1,02 | P->S | QC (RAND) |
| 3 | LOAD AS SELECT | C | | | | | Q1,02 | PCWP | |
|* 4 | HASH JOIN OUTER | | 261M| 6238M| 31984 (1)| 00:02:08 | Q1,02 | PCWP | |
| 5 | PX RECEIVE | | 7849K| 44M| 726 (1)| 00:00:03 | Q1,02 | PCWP | |
| 6 | PX SEND HASH | :TQ10000 | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | P->P | HASH |
| 7 | PX BLOCK ITERATOR | | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | PCWC | |
| 8 | TABLE ACCESS STORAGE FULL| A | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | PCWP | |
| 9 | PX RECEIVE | | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,02 | PCWP | |
| 10 | PX SEND HASH | :TQ10001 | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | P->P | HASH |
| 11 | PX BLOCK ITERATOR | | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | PCWC | |
| 12 | TABLE ACCESS STORAGE FULL| B | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | PCWP | |
--------------------------------------------------------------------------------------------------------------------------


Predicate Information (identified by operation id):
---------------------------------------------------

4 - access("A"."A"="B"."A"(+))


My test with synthetic data of your size is 35 seconds.






share|improve this answer













Start with the simplest possibility using this query



create table c as 
select /*+ parallel(6) */ a.a, b.b
from a
left outer join b
on a.a = b.a
;


It will use a hash join, adjust parallel degree based on your hardware setup.



The expected execution plan is as follows



    --------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------------
| 0 | CREATE TABLE STATEMENT | | 261M| 6238M| 56799 (1)| 00:03:48 | | | |
| 1 | PX COORDINATOR | | | | | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10002 | 261M| 6238M| 31984 (1)| 00:02:08 | Q1,02 | P->S | QC (RAND) |
| 3 | LOAD AS SELECT | C | | | | | Q1,02 | PCWP | |
|* 4 | HASH JOIN OUTER | | 261M| 6238M| 31984 (1)| 00:02:08 | Q1,02 | PCWP | |
| 5 | PX RECEIVE | | 7849K| 44M| 726 (1)| 00:00:03 | Q1,02 | PCWP | |
| 6 | PX SEND HASH | :TQ10000 | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | P->P | HASH |
| 7 | PX BLOCK ITERATOR | | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | PCWC | |
| 8 | TABLE ACCESS STORAGE FULL| A | 7849K| 44M| 726 (1)| 00:00:03 | Q1,00 | PCWP | |
| 9 | PX RECEIVE | | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,02 | PCWP | |
| 10 | PX SEND HASH | :TQ10001 | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | P->P | HASH |
| 11 | PX BLOCK ITERATOR | | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | PCWC | |
| 12 | TABLE ACCESS STORAGE FULL| B | 261M| 4741M| 31149 (1)| 00:02:05 | Q1,01 | PCWP | |
--------------------------------------------------------------------------------------------------------------------------


Predicate Information (identified by operation id):
---------------------------------------------------

4 - access("A"."A"="B"."A"(+))


My test with synthetic data of your size is 35 seconds.







share|improve this answer












share|improve this answer



share|improve this answer










answered Nov 20 '18 at 18:17









Marmite BomberMarmite Bomber

7,70431033




7,70431033













  • TYVM! I understand that you say that a hash join (not nested loop and not sort merge) is the way to go. I understand you also see no benefit in using indexes for the query speed here (i guess you mean it would do a FFS, which would be the same as the FTS on the table). May I ask: Could it be that your query loses about 50% speed (guessing here) by not telling Oracle that it can stop for a t1 row after it found the first match in t2 for that t1 row? I tried to tell Oracle that with my rownum = 1, but I guess a unique constraint or similar could do the trick? Do you see potential in this?

    – Peter
    Nov 20 '18 at 19:24











  • Definitively no in your case. Has join probes the smaller table and scans the other one. No need to stop after first match. And if you would have dups in the Btable the join would return duplicated records. What you address is query using IN or NOT EXISTS which leads to semior anti join with exact optimization on stopping after first row you mentioned.

    – Marmite Bomber
    Nov 20 '18 at 19:34













  • I think I understand. It's only doing one loop through table B anyway, so 'stopping' in scanning B does not help/work. You helped me a lot, thank you. The last bit I am missing now I think is the question why an index on table A (the one that is hashed and probed) does not speed things up. Two questions I guess: Does the hashing already include putting the records in order? Can't the index save the time of hashing (and/or sorting if it does that), as the index can be probed the same way as the hash-table? TY!

    – Peter
    Nov 20 '18 at 19:57











  • The smaler table is hashed (not probed as I noted above and I can't edit it any more), which means not ordering but a direct key access. Basically both tables must be processed completely, i.e. all rows. Here is index a problem, not a help. FULL TABLE SCAN rules:)

    – Marmite Bomber
    Nov 20 '18 at 20:15



















  • TYVM! I understand that you say that a hash join (not nested loop and not sort merge) is the way to go. I understand you also see no benefit in using indexes for the query speed here (i guess you mean it would do a FFS, which would be the same as the FTS on the table). May I ask: Could it be that your query loses about 50% speed (guessing here) by not telling Oracle that it can stop for a t1 row after it found the first match in t2 for that t1 row? I tried to tell Oracle that with my rownum = 1, but I guess a unique constraint or similar could do the trick? Do you see potential in this?

    – Peter
    Nov 20 '18 at 19:24











  • Definitively no in your case. Has join probes the smaller table and scans the other one. No need to stop after first match. And if you would have dups in the Btable the join would return duplicated records. What you address is query using IN or NOT EXISTS which leads to semior anti join with exact optimization on stopping after first row you mentioned.

    – Marmite Bomber
    Nov 20 '18 at 19:34













  • I think I understand. It's only doing one loop through table B anyway, so 'stopping' in scanning B does not help/work. You helped me a lot, thank you. The last bit I am missing now I think is the question why an index on table A (the one that is hashed and probed) does not speed things up. Two questions I guess: Does the hashing already include putting the records in order? Can't the index save the time of hashing (and/or sorting if it does that), as the index can be probed the same way as the hash-table? TY!

    – Peter
    Nov 20 '18 at 19:57











  • The smaler table is hashed (not probed as I noted above and I can't edit it any more), which means not ordering but a direct key access. Basically both tables must be processed completely, i.e. all rows. Here is index a problem, not a help. FULL TABLE SCAN rules:)

    – Marmite Bomber
    Nov 20 '18 at 20:15

















TYVM! I understand that you say that a hash join (not nested loop and not sort merge) is the way to go. I understand you also see no benefit in using indexes for the query speed here (i guess you mean it would do a FFS, which would be the same as the FTS on the table). May I ask: Could it be that your query loses about 50% speed (guessing here) by not telling Oracle that it can stop for a t1 row after it found the first match in t2 for that t1 row? I tried to tell Oracle that with my rownum = 1, but I guess a unique constraint or similar could do the trick? Do you see potential in this?

– Peter
Nov 20 '18 at 19:24





TYVM! I understand that you say that a hash join (not nested loop and not sort merge) is the way to go. I understand you also see no benefit in using indexes for the query speed here (i guess you mean it would do a FFS, which would be the same as the FTS on the table). May I ask: Could it be that your query loses about 50% speed (guessing here) by not telling Oracle that it can stop for a t1 row after it found the first match in t2 for that t1 row? I tried to tell Oracle that with my rownum = 1, but I guess a unique constraint or similar could do the trick? Do you see potential in this?

– Peter
Nov 20 '18 at 19:24













Definitively no in your case. Has join probes the smaller table and scans the other one. No need to stop after first match. And if you would have dups in the Btable the join would return duplicated records. What you address is query using IN or NOT EXISTS which leads to semior anti join with exact optimization on stopping after first row you mentioned.

– Marmite Bomber
Nov 20 '18 at 19:34







Definitively no in your case. Has join probes the smaller table and scans the other one. No need to stop after first match. And if you would have dups in the Btable the join would return duplicated records. What you address is query using IN or NOT EXISTS which leads to semior anti join with exact optimization on stopping after first row you mentioned.

– Marmite Bomber
Nov 20 '18 at 19:34















I think I understand. It's only doing one loop through table B anyway, so 'stopping' in scanning B does not help/work. You helped me a lot, thank you. The last bit I am missing now I think is the question why an index on table A (the one that is hashed and probed) does not speed things up. Two questions I guess: Does the hashing already include putting the records in order? Can't the index save the time of hashing (and/or sorting if it does that), as the index can be probed the same way as the hash-table? TY!

– Peter
Nov 20 '18 at 19:57





I think I understand. It's only doing one loop through table B anyway, so 'stopping' in scanning B does not help/work. You helped me a lot, thank you. The last bit I am missing now I think is the question why an index on table A (the one that is hashed and probed) does not speed things up. Two questions I guess: Does the hashing already include putting the records in order? Can't the index save the time of hashing (and/or sorting if it does that), as the index can be probed the same way as the hash-table? TY!

– Peter
Nov 20 '18 at 19:57













The smaler table is hashed (not probed as I noted above and I can't edit it any more), which means not ordering but a direct key access. Basically both tables must be processed completely, i.e. all rows. Here is index a problem, not a help. FULL TABLE SCAN rules:)

– Marmite Bomber
Nov 20 '18 at 20:15





The smaler table is hashed (not probed as I noted above and I can't edit it any more), which means not ordering but a direct key access. Basically both tables must be processed completely, i.e. all rows. Here is index a problem, not a help. FULL TABLE SCAN rules:)

– Marmite Bomber
Nov 20 '18 at 20:15


















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53396506%2fjoin-two-big-tables-fast%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

MongoDB - Not Authorized To Execute Command

in spring boot 2.1 many test slices are not allowed anymore due to multiple @BootstrapWith

Npm cannot find a required file even through it is in the searched directory