Join Two Big Tables Fast

after research via playing with the queries and doing online research, I am turning to you for input. Looking forward to your replies! I am trying to write this in a general way as I am looking for general ideas on how to do this, not for exact statements. If this question is received poorly, I am happy to rework it, let me know. Here we go:

I have two tables:

Table 1 has 10 MIO records. It has only one column:
- Column A: A unique ID (a text e.g. '5uz1g09aksmsd')

Table 2 has 300 MIO records. It has two columns:
- Column A: A unique ID (a text e.g. '5uz1g09aksmsd')
- Column B: A number (e.g. 32.5432)

There are no indices etc. yet, not even a PK as both tables have just been created via CTAS. Both tables have been analysed.

I am now looking for a fast query (the prep-work e.g. index creation can take time, no problem) to create a third table that contains the 10 MIO rows of Table 1, and Column B of Table 2 if a match is found via Column A (for 99% a match will be found). To make it clearer, a simple CTAS could be:

create table3 as

select t1.a, 

    (select t2.b from table2 t2 where t2.a = t1.a and rownum = 1)

    -- the rownum = 1 is to show Oracle that there can only be one match

from table1 t1;

This is not as fast as it can be (I hope). What are your ideas to make it faster? Creating indices? Which ones? What kind of join would you want to see in the execution plan? A hash join? I already found

create table nologging

parallel query execution and parallel table creation

But I am interested in the specifics of how the perfect execution plan for this would look given that we are allowed to alter the system (e.g. create an index). In particular I am asking for 11gR2 but 12c comments are also very welcome.

edited Nov 20 '18 at 16:55

asked Nov 20 '18 at 15:39

Peter

3739

is there a foreign key from table 1 to table 2?

– Boneist
Nov 20 '18 at 16:16

@Boneist Thanks for your question. I updated my question to make this clearer. There is no foreign key (yet, but we can create one if you think this speeds things up considerably).

– Peter
Nov 20 '18 at 16:27

1

If I were you, given there are a small number of columns, it might be worth storing the table 1 and 2 data in Index organised tables, rather than standard heap tables (e.g. create table_1 (column_a primary key) organization index as select ...). Note that would require primary keys to be specified, but I don't think that's an issue here.

– Boneist
Nov 20 '18 at 16:42

@Boneist Thank you for your suggestion. I will give IOTs a try tomorrow. Where do you think the benefit will come from? Table 1 as IOT or Table2 as IOT? Or is there an even greater benefit when both are IOT? Creating indexes (A on T1, A,B on T2) should yield the same result in terms of query speed, as no table access is needed after index access? How can an index on T1 help, as we need all rows anyways? What would a 'perfect non-parallel all-rows' execution plan look like in your opinion?

– Peter
Nov 20 '18 at 17:35

1

Simplest way ist to use a hash join, with efficient hardware use parallel query option. Nologging and parallel table creation would have less effect as the base of the elapsed time is in the join (simple test). As the tables are newly created and contains only the relevant column, indexes are of no use for hash join in your case (they would be greater that the table).

– Marmite Bomber
Nov 20 '18 at 18:07

|
show 1 more comment

I have two tables:

Table 1 has 10 MIO records. It has only one column:
- Column A: A unique ID (a text e.g. '5uz1g09aksmsd')

Table 2 has 300 MIO records. It has two columns:
- Column A: A unique ID (a text e.g. '5uz1g09aksmsd')
- Column B: A number (e.g. 32.5432)

There are no indices etc. yet, not even a PK as both tables have just been created via CTAS. Both tables have been analysed.

create table3 as

select t1.a, 

    (select t2.b from table2 t2 where t2.a = t1.a and rownum = 1)

    -- the rownum = 1 is to show Oracle that there can only be one match

from table1 t1;

create table nologging

parallel query execution and parallel table creation

edited Nov 20 '18 at 16:55

asked Nov 20 '18 at 15:39

Peter

3739

is there a foreign key from table 1 to table 2?

– Boneist
Nov 20 '18 at 16:16

@Boneist Thanks for your question. I updated my question to make this clearer. There is no foreign key (yet, but we can create one if you think this speeds things up considerably).

– Peter
Nov 20 '18 at 16:27

1

If I were you, given there are a small number of columns, it might be worth storing the table 1 and 2 data in Index organised tables, rather than standard heap tables (e.g. create table_1 (column_a primary key) organization index as select ...). Note that would require primary keys to be specified, but I don't think that's an issue here.

– Boneist
Nov 20 '18 at 16:42

@Boneist Thank you for your suggestion. I will give IOTs a try tomorrow. Where do you think the benefit will come from? Table 1 as IOT or Table2 as IOT? Or is there an even greater benefit when both are IOT? Creating indexes (A on T1, A,B on T2) should yield the same result in terms of query speed, as no table access is needed after index access? How can an index on T1 help, as we need all rows anyways? What would a 'perfect non-parallel all-rows' execution plan look like in your opinion?

– Peter
Nov 20 '18 at 17:35

1

Simplest way ist to use a hash join, with efficient hardware use parallel query option. Nologging and parallel table creation would have less effect as the base of the elapsed time is in the join (simple test). As the tables are newly created and contains only the relevant column, indexes are of no use for hash join in your case (they would be greater that the table).

– Marmite Bomber
Nov 20 '18 at 18:07

|
show 1 more comment

I have two tables:

Table 1 has 10 MIO records. It has only one column:
- Column A: A unique ID (a text e.g. '5uz1g09aksmsd')

Table 2 has 300 MIO records. It has two columns:
- Column A: A unique ID (a text e.g. '5uz1g09aksmsd')
- Column B: A number (e.g. 32.5432)

There are no indices etc. yet, not even a PK as both tables have just been created via CTAS. Both tables have been analysed.

create table3 as

select t1.a, 

    (select t2.b from table2 t2 where t2.a = t1.a and rownum = 1)

    -- the rownum = 1 is to show Oracle that there can only be one match

from table1 t1;

create table nologging

parallel query execution and parallel table creation

edited Nov 20 '18 at 16:55

asked Nov 20 '18 at 15:39

Peter

3739

I have two tables:

Table 1 has 10 MIO records. It has only one column:
- Column A: A unique ID (a text e.g. '5uz1g09aksmsd')

Table 2 has 300 MIO records. It has two columns:
- Column A: A unique ID (a text e.g. '5uz1g09aksmsd')
- Column B: A number (e.g. 32.5432)

There are no indices etc. yet, not even a PK as both tables have just been created via CTAS. Both tables have been analysed.

create table3 as

select t1.a, 

    (select t2.b from table2 t2 where t2.a = t1.a and rownum = 1)

    -- the rownum = 1 is to show Oracle that there can only be one match

from table1 t1;

create table nologging

parallel query execution and parallel table creation

oracle

edited Nov 20 '18 at 16:55

asked Nov 20 '18 at 15:39

Peter

3739

edited Nov 20 '18 at 16:55

asked Nov 20 '18 at 15:39

Peter

3739

edited Nov 20 '18 at 16:55

asked Nov 20 '18 at 15:39

Peter

3739

asked Nov 20 '18 at 15:39

Peter

3739

asked Nov 20 '18 at 15:39

Peter

3739

is there a foreign key from table 1 to table 2?

– Boneist
Nov 20 '18 at 16:16

@Boneist Thanks for your question. I updated my question to make this clearer. There is no foreign key (yet, but we can create one if you think this speeds things up considerably).

– Peter
Nov 20 '18 at 16:27

1

If I were you, given there are a small number of columns, it might be worth storing the table 1 and 2 data in Index organised tables, rather than standard heap tables (e.g. create table_1 (column_a primary key) organization index as select ...). Note that would require primary keys to be specified, but I don't think that's an issue here.

– Boneist
Nov 20 '18 at 16:42

@Boneist Thank you for your suggestion. I will give IOTs a try tomorrow. Where do you think the benefit will come from? Table 1 as IOT or Table2 as IOT? Or is there an even greater benefit when both are IOT? Creating indexes (A on T1, A,B on T2) should yield the same result in terms of query speed, as no table access is needed after index access? How can an index on T1 help, as we need all rows anyways? What would a 'perfect non-parallel all-rows' execution plan look like in your opinion?

– Peter
Nov 20 '18 at 17:35

1

Simplest way ist to use a hash join, with efficient hardware use parallel query option. Nologging and parallel table creation would have less effect as the base of the elapsed time is in the join (simple test). As the tables are newly created and contains only the relevant column, indexes are of no use for hash join in your case (they would be greater that the table).

– Marmite Bomber
Nov 20 '18 at 18:07

|
show 1 more comment

is there a foreign key from table 1 to table 2?

– Boneist
Nov 20 '18 at 16:16

@Boneist Thanks for your question. I updated my question to make this clearer. There is no foreign key (yet, but we can create one if you think this speeds things up considerably).

– Peter
Nov 20 '18 at 16:27

1

If I were you, given there are a small number of columns, it might be worth storing the table 1 and 2 data in Index organised tables, rather than standard heap tables (e.g. create table_1 (column_a primary key) organization index as select ...). Note that would require primary keys to be specified, but I don't think that's an issue here.

– Boneist
Nov 20 '18 at 16:42

@Boneist Thank you for your suggestion. I will give IOTs a try tomorrow. Where do you think the benefit will come from? Table 1 as IOT or Table2 as IOT? Or is there an even greater benefit when both are IOT? Creating indexes (A on T1, A,B on T2) should yield the same result in terms of query speed, as no table access is needed after index access? How can an index on T1 help, as we need all rows anyways? What would a 'perfect non-parallel all-rows' execution plan look like in your opinion?

– Peter
Nov 20 '18 at 17:35

1

Simplest way ist to use a hash join, with efficient hardware use parallel query option. Nologging and parallel table creation would have less effect as the base of the elapsed time is in the join (simple test). As the tables are newly created and contains only the relevant column, indexes are of no use for hash join in your case (they would be greater that the table).

– Marmite Bomber
Nov 20 '18 at 18:07

is there a foreign key from table 1 to table 2?

– Boneist
Nov 20 '18 at 16:16

@Boneist Thanks for your question. I updated my question to make this clearer. There is no foreign key (yet, but we can create one if you think this speeds things up considerably).

– Peter
Nov 20 '18 at 16:27

If I were you, given there are a small number of columns, it might be worth storing the table 1 and 2 data in Index organised tables, rather than standard heap tables (e.g. create table_1 (column_a primary key) organization index as select ...). Note that would require primary keys to be specified, but I don't think that's an issue here.

– Boneist
Nov 20 '18 at 16:42

@Boneist Thank you for your suggestion. I will give IOTs a try tomorrow. Where do you think the benefit will come from? Table 1 as IOT or Table2 as IOT? Or is there an even greater benefit when both are IOT? Creating indexes (A on T1, A,B on T2) should yield the same result in terms of query speed, as no table access is needed after index access? How can an index on T1 help, as we need all rows anyways? What would a 'perfect non-parallel all-rows' execution plan look like in your opinion?

– Peter
Nov 20 '18 at 17:35

Simplest way ist to use a hash join, with efficient hardware use parallel query option. Nologging and parallel table creation would have less effect as the base of the elapsed time is in the join (simple test). As the tables are newly created and contains only the relevant column, indexes are of no use for hash join in your case (they would be greater that the table).

– Marmite Bomber
Nov 20 '18 at 18:07

|
show 1 more comment

1 Answer
1

active

oldest

votes

Start with the simplest possibility using this query

create table c as 

select /*+ parallel(6) */ a.a, b.b 

from a 

left outer join b

on a.a = b.a

;

It will use a hash join, adjust parallel degree based on your hardware setup.

The expected execution plan is as follows

    --------------------------------------------------------------------------------------------------------------------------

| Id  | Operation                        | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |

--------------------------------------------------------------------------------------------------------------------------

|   0 | CREATE TABLE STATEMENT           |          |   261M|  6238M| 56799   (1)| 00:03:48 |        |      |            |

|   1 |  PX COORDINATOR                  |          |       |       |            |          |        |      |            |

|   2 |   PX SEND QC (RANDOM)            | :TQ10002 |   261M|  6238M| 31984   (1)| 00:02:08 |  Q1,02 | P->S | QC (RAND)  |

|   3 |    LOAD AS SELECT                | C        |       |       |            |          |  Q1,02 | PCWP |            |

|*  4 |     HASH JOIN OUTER              |          |   261M|  6238M| 31984   (1)| 00:02:08 |  Q1,02 | PCWP |            |

|   5 |      PX RECEIVE                  |          |  7849K|    44M|   726   (1)| 00:00:03 |  Q1,02 | PCWP |            |

|   6 |       PX SEND HASH               | :TQ10000 |  7849K|    44M|   726   (1)| 00:00:03 |  Q1,00 | P->P | HASH       |

|   7 |        PX BLOCK ITERATOR         |          |  7849K|    44M|   726   (1)| 00:00:03 |  Q1,00 | PCWC |            |

|   8 |         TABLE ACCESS STORAGE FULL| A        |  7849K|    44M|   726   (1)| 00:00:03 |  Q1,00 | PCWP |            |

|   9 |      PX RECEIVE                  |          |   261M|  4741M| 31149   (1)| 00:02:05 |  Q1,02 | PCWP |            |

|  10 |       PX SEND HASH               | :TQ10001 |   261M|  4741M| 31149   (1)| 00:02:05 |  Q1,01 | P->P | HASH       |

|  11 |        PX BLOCK ITERATOR         |          |   261M|  4741M| 31149   (1)| 00:02:05 |  Q1,01 | PCWC |            |

|  12 |         TABLE ACCESS STORAGE FULL| B        |   261M|  4741M| 31149   (1)| 00:02:05 |  Q1,01 | PCWP |            |

--------------------------------------------------------------------------------------------------------------------------





Predicate Information (identified by operation id):

---------------------------------------------------



   4 - access("A"."A"="B"."A"(+))

My test with synthetic data of your size is 35 seconds.

answered Nov 20 '18 at 18:17

Marmite Bomber

7,70431033

TYVM! I understand that you say that a hash join (not nested loop and not sort merge) is the way to go. I understand you also see no benefit in using indexes for the query speed here (i guess you mean it would do a FFS, which would be the same as the FTS on the table). May I ask: Could it be that your query loses about 50% speed (guessing here) by not telling Oracle that it can stop for a t1 row after it found the first match in t2 for that t1 row? I tried to tell Oracle that with my rownum = 1, but I guess a unique constraint or similar could do the trick? Do you see potential in this?

– Peter
Nov 20 '18 at 19:24

Definitively no in your case. Has join probes the smaller table and scans the other one. No need to stop after first match. And if you would have dups in the Btable the join would return duplicated records. What you address is query using IN or NOT EXISTS which leads to semior anti join with exact optimization on stopping after first row you mentioned.

– Marmite Bomber
Nov 20 '18 at 19:34

I think I understand. It's only doing one loop through table B anyway, so 'stopping' in scanning B does not help/work. You helped me a lot, thank you. The last bit I am missing now I think is the question why an index on table A (the one that is hashed and probed) does not speed things up. Two questions I guess: Does the hashing already include putting the records in order? Can't the index save the time of hashing (and/or sorting if it does that), as the index can be probed the same way as the hash-table? TY!

– Peter
Nov 20 '18 at 19:57

The smaler table is hashed (not probed as I noted above and I can't edit it any more), which means not ordering but a direct key access. Basically both tables must be processed completely, i.e. all rows. Here is index a problem, not a help. FULL TABLE SCAN rules:)

– Marmite Bomber
Nov 20 '18 at 20:15

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53396506%2fjoin-two-big-tables-fast%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

Start with the simplest possibility using this query

create table c as 

select /*+ parallel(6) */ a.a, b.b 

from a 

left outer join b

on a.a = b.a

;

It will use a hash join, adjust parallel degree based on your hardware setup.

The expected execution plan is as follows

    --------------------------------------------------------------------------------------------------------------------------

| Id  | Operation                        | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |

--------------------------------------------------------------------------------------------------------------------------

|   0 | CREATE TABLE STATEMENT           |          |   261M|  6238M| 56799   (1)| 00:03:48 |        |      |            |

|   1 |  PX COORDINATOR                  |          |       |       |            |          |        |      |            |

|   2 |   PX SEND QC (RANDOM)            | :TQ10002 |   261M|  6238M| 31984   (1)| 00:02:08 |  Q1,02 | P->S | QC (RAND)  |

|   3 |    LOAD AS SELECT                | C        |       |       |            |          |  Q1,02 | PCWP |            |

|*  4 |     HASH JOIN OUTER              |          |   261M|  6238M| 31984   (1)| 00:02:08 |  Q1,02 | PCWP |            |

|   5 |      PX RECEIVE                  |          |  7849K|    44M|   726   (1)| 00:00:03 |  Q1,02 | PCWP |            |

|   6 |       PX SEND HASH               | :TQ10000 |  7849K|    44M|   726   (1)| 00:00:03 |  Q1,00 | P->P | HASH       |

|   7 |        PX BLOCK ITERATOR         |          |  7849K|    44M|   726   (1)| 00:00:03 |  Q1,00 | PCWC |            |

|   8 |         TABLE ACCESS STORAGE FULL| A        |  7849K|    44M|   726   (1)| 00:00:03 |  Q1,00 | PCWP |            |

|   9 |      PX RECEIVE                  |          |   261M|  4741M| 31149   (1)| 00:02:05 |  Q1,02 | PCWP |            |

|  10 |       PX SEND HASH               | :TQ10001 |   261M|  4741M| 31149   (1)| 00:02:05 |  Q1,01 | P->P | HASH       |

|  11 |        PX BLOCK ITERATOR         |          |   261M|  4741M| 31149   (1)| 00:02:05 |  Q1,01 | PCWC |            |

|  12 |         TABLE ACCESS STORAGE FULL| B        |   261M|  4741M| 31149   (1)| 00:02:05 |  Q1,01 | PCWP |            |

--------------------------------------------------------------------------------------------------------------------------





Predicate Information (identified by operation id):

---------------------------------------------------



   4 - access("A"."A"="B"."A"(+))

My test with synthetic data of your size is 35 seconds.

answered Nov 20 '18 at 18:17

Marmite Bomber

7,70431033

TYVM! I understand that you say that a hash join (not nested loop and not sort merge) is the way to go. I understand you also see no benefit in using indexes for the query speed here (i guess you mean it would do a FFS, which would be the same as the FTS on the table). May I ask: Could it be that your query loses about 50% speed (guessing here) by not telling Oracle that it can stop for a t1 row after it found the first match in t2 for that t1 row? I tried to tell Oracle that with my rownum = 1, but I guess a unique constraint or similar could do the trick? Do you see potential in this?

– Peter
Nov 20 '18 at 19:24

Definitively no in your case. Has join probes the smaller table and scans the other one. No need to stop after first match. And if you would have dups in the Btable the join would return duplicated records. What you address is query using IN or NOT EXISTS which leads to semior anti join with exact optimization on stopping after first row you mentioned.

– Marmite Bomber
Nov 20 '18 at 19:34

I think I understand. It's only doing one loop through table B anyway, so 'stopping' in scanning B does not help/work. You helped me a lot, thank you. The last bit I am missing now I think is the question why an index on table A (the one that is hashed and probed) does not speed things up. Two questions I guess: Does the hashing already include putting the records in order? Can't the index save the time of hashing (and/or sorting if it does that), as the index can be probed the same way as the hash-table? TY!

– Peter
Nov 20 '18 at 19:57

The smaler table is hashed (not probed as I noted above and I can't edit it any more), which means not ordering but a direct key access. Basically both tables must be processed completely, i.e. all rows. Here is index a problem, not a help. FULL TABLE SCAN rules:)

– Marmite Bomber
Nov 20 '18 at 20:15

add a comment |

Start with the simplest possibility using this query

create table c as 

select /*+ parallel(6) */ a.a, b.b 

from a 

left outer join b

on a.a = b.a

;

It will use a hash join, adjust parallel degree based on your hardware setup.

The expected execution plan is as follows

    --------------------------------------------------------------------------------------------------------------------------

| Id  | Operation                        | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |

--------------------------------------------------------------------------------------------------------------------------

|   0 | CREATE TABLE STATEMENT           |          |   261M|  6238M| 56799   (1)| 00:03:48 |        |      |            |

|   1 |  PX COORDINATOR                  |          |       |       |            |          |        |      |            |

|   2 |   PX SEND QC (RANDOM)            | :TQ10002 |   261M|  6238M| 31984   (1)| 00:02:08 |  Q1,02 | P->S | QC (RAND)  |

|   3 |    LOAD AS SELECT                | C        |       |       |            |          |  Q1,02 | PCWP |            |

|*  4 |     HASH JOIN OUTER              |          |   261M|  6238M| 31984   (1)| 00:02:08 |  Q1,02 | PCWP |            |

|   5 |      PX RECEIVE                  |          |  7849K|    44M|   726   (1)| 00:00:03 |  Q1,02 | PCWP |            |

|   6 |       PX SEND HASH               | :TQ10000 |  7849K|    44M|   726   (1)| 00:00:03 |  Q1,00 | P->P | HASH       |

|   7 |        PX BLOCK ITERATOR         |          |  7849K|    44M|   726   (1)| 00:00:03 |  Q1,00 | PCWC |            |

|   8 |         TABLE ACCESS STORAGE FULL| A        |  7849K|    44M|   726   (1)| 00:00:03 |  Q1,00 | PCWP |            |

|   9 |      PX RECEIVE                  |          |   261M|  4741M| 31149   (1)| 00:02:05 |  Q1,02 | PCWP |            |

|  10 |       PX SEND HASH               | :TQ10001 |   261M|  4741M| 31149   (1)| 00:02:05 |  Q1,01 | P->P | HASH       |

|  11 |        PX BLOCK ITERATOR         |          |   261M|  4741M| 31149   (1)| 00:02:05 |  Q1,01 | PCWC |            |

|  12 |         TABLE ACCESS STORAGE FULL| B        |   261M|  4741M| 31149   (1)| 00:02:05 |  Q1,01 | PCWP |            |

--------------------------------------------------------------------------------------------------------------------------





Predicate Information (identified by operation id):

---------------------------------------------------



   4 - access("A"."A"="B"."A"(+))

My test with synthetic data of your size is 35 seconds.

answered Nov 20 '18 at 18:17

Marmite Bomber

7,70431033

TYVM! I understand that you say that a hash join (not nested loop and not sort merge) is the way to go. I understand you also see no benefit in using indexes for the query speed here (i guess you mean it would do a FFS, which would be the same as the FTS on the table). May I ask: Could it be that your query loses about 50% speed (guessing here) by not telling Oracle that it can stop for a t1 row after it found the first match in t2 for that t1 row? I tried to tell Oracle that with my rownum = 1, but I guess a unique constraint or similar could do the trick? Do you see potential in this?

– Peter
Nov 20 '18 at 19:24

Definitively no in your case. Has join probes the smaller table and scans the other one. No need to stop after first match. And if you would have dups in the Btable the join would return duplicated records. What you address is query using IN or NOT EXISTS which leads to semior anti join with exact optimization on stopping after first row you mentioned.

– Marmite Bomber
Nov 20 '18 at 19:34

I think I understand. It's only doing one loop through table B anyway, so 'stopping' in scanning B does not help/work. You helped me a lot, thank you. The last bit I am missing now I think is the question why an index on table A (the one that is hashed and probed) does not speed things up. Two questions I guess: Does the hashing already include putting the records in order? Can't the index save the time of hashing (and/or sorting if it does that), as the index can be probed the same way as the hash-table? TY!

– Peter
Nov 20 '18 at 19:57

The smaler table is hashed (not probed as I noted above and I can't edit it any more), which means not ordering but a direct key access. Basically both tables must be processed completely, i.e. all rows. Here is index a problem, not a help. FULL TABLE SCAN rules:)

– Marmite Bomber
Nov 20 '18 at 20:15

add a comment |

Start with the simplest possibility using this query

create table c as 

select /*+ parallel(6) */ a.a, b.b 

from a 

left outer join b

on a.a = b.a

;

It will use a hash join, adjust parallel degree based on your hardware setup.

The expected execution plan is as follows

    --------------------------------------------------------------------------------------------------------------------------

| Id  | Operation                        | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |

--------------------------------------------------------------------------------------------------------------------------

|   0 | CREATE TABLE STATEMENT           |          |   261M|  6238M| 56799   (1)| 00:03:48 |        |      |            |

|   1 |  PX COORDINATOR                  |          |       |       |            |          |        |      |            |

|   2 |   PX SEND QC (RANDOM)            | :TQ10002 |   261M|  6238M| 31984   (1)| 00:02:08 |  Q1,02 | P->S | QC (RAND)  |

|   3 |    LOAD AS SELECT                | C        |       |       |            |          |  Q1,02 | PCWP |            |

|*  4 |     HASH JOIN OUTER              |          |   261M|  6238M| 31984   (1)| 00:02:08 |  Q1,02 | PCWP |            |

|   5 |      PX RECEIVE                  |          |  7849K|    44M|   726   (1)| 00:00:03 |  Q1,02 | PCWP |            |

|   6 |       PX SEND HASH               | :TQ10000 |  7849K|    44M|   726   (1)| 00:00:03 |  Q1,00 | P->P | HASH       |

|   7 |        PX BLOCK ITERATOR         |          |  7849K|    44M|   726   (1)| 00:00:03 |  Q1,00 | PCWC |            |

|   8 |         TABLE ACCESS STORAGE FULL| A        |  7849K|    44M|   726   (1)| 00:00:03 |  Q1,00 | PCWP |            |

|   9 |      PX RECEIVE                  |          |   261M|  4741M| 31149   (1)| 00:02:05 |  Q1,02 | PCWP |            |

|  10 |       PX SEND HASH               | :TQ10001 |   261M|  4741M| 31149   (1)| 00:02:05 |  Q1,01 | P->P | HASH       |

|  11 |        PX BLOCK ITERATOR         |          |   261M|  4741M| 31149   (1)| 00:02:05 |  Q1,01 | PCWC |            |

|  12 |         TABLE ACCESS STORAGE FULL| B        |   261M|  4741M| 31149   (1)| 00:02:05 |  Q1,01 | PCWP |            |

--------------------------------------------------------------------------------------------------------------------------





Predicate Information (identified by operation id):

---------------------------------------------------



   4 - access("A"."A"="B"."A"(+))

My test with synthetic data of your size is 35 seconds.

answered Nov 20 '18 at 18:17

Marmite Bomber

7,70431033

Start with the simplest possibility using this query

create table c as 

select /*+ parallel(6) */ a.a, b.b 

from a 

left outer join b

on a.a = b.a

;

It will use a hash join, adjust parallel degree based on your hardware setup.

The expected execution plan is as follows

    --------------------------------------------------------------------------------------------------------------------------

| Id  | Operation                        | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |

--------------------------------------------------------------------------------------------------------------------------

|   0 | CREATE TABLE STATEMENT           |          |   261M|  6238M| 56799   (1)| 00:03:48 |        |      |            |

|   1 |  PX COORDINATOR                  |          |       |       |            |          |        |      |            |

|   2 |   PX SEND QC (RANDOM)            | :TQ10002 |   261M|  6238M| 31984   (1)| 00:02:08 |  Q1,02 | P->S | QC (RAND)  |

|   3 |    LOAD AS SELECT                | C        |       |       |            |          |  Q1,02 | PCWP |            |

|*  4 |     HASH JOIN OUTER              |          |   261M|  6238M| 31984   (1)| 00:02:08 |  Q1,02 | PCWP |            |

|   5 |      PX RECEIVE                  |          |  7849K|    44M|   726   (1)| 00:00:03 |  Q1,02 | PCWP |            |

|   6 |       PX SEND HASH               | :TQ10000 |  7849K|    44M|   726   (1)| 00:00:03 |  Q1,00 | P->P | HASH       |

|   7 |        PX BLOCK ITERATOR         |          |  7849K|    44M|   726   (1)| 00:00:03 |  Q1,00 | PCWC |            |

|   8 |         TABLE ACCESS STORAGE FULL| A        |  7849K|    44M|   726   (1)| 00:00:03 |  Q1,00 | PCWP |            |

|   9 |      PX RECEIVE                  |          |   261M|  4741M| 31149   (1)| 00:02:05 |  Q1,02 | PCWP |            |

|  10 |       PX SEND HASH               | :TQ10001 |   261M|  4741M| 31149   (1)| 00:02:05 |  Q1,01 | P->P | HASH       |

|  11 |        PX BLOCK ITERATOR         |          |   261M|  4741M| 31149   (1)| 00:02:05 |  Q1,01 | PCWC |            |

|  12 |         TABLE ACCESS STORAGE FULL| B        |   261M|  4741M| 31149   (1)| 00:02:05 |  Q1,01 | PCWP |            |

--------------------------------------------------------------------------------------------------------------------------





Predicate Information (identified by operation id):

---------------------------------------------------



   4 - access("A"."A"="B"."A"(+))

My test with synthetic data of your size is 35 seconds.

answered Nov 20 '18 at 18:17

Marmite Bomber

7,70431033

answered Nov 20 '18 at 18:17

Marmite Bomber

7,70431033

answered Nov 20 '18 at 18:17

Marmite Bomber

7,70431033

answered Nov 20 '18 at 18:17

Marmite Bomber

7,70431033

TYVM! I understand that you say that a hash join (not nested loop and not sort merge) is the way to go. I understand you also see no benefit in using indexes for the query speed here (i guess you mean it would do a FFS, which would be the same as the FTS on the table). May I ask: Could it be that your query loses about 50% speed (guessing here) by not telling Oracle that it can stop for a t1 row after it found the first match in t2 for that t1 row? I tried to tell Oracle that with my rownum = 1, but I guess a unique constraint or similar could do the trick? Do you see potential in this?

– Peter
Nov 20 '18 at 19:24

Definitively no in your case. Has join probes the smaller table and scans the other one. No need to stop after first match. And if you would have dups in the Btable the join would return duplicated records. What you address is query using IN or NOT EXISTS which leads to semior anti join with exact optimization on stopping after first row you mentioned.

– Marmite Bomber
Nov 20 '18 at 19:34

I think I understand. It's only doing one loop through table B anyway, so 'stopping' in scanning B does not help/work. You helped me a lot, thank you. The last bit I am missing now I think is the question why an index on table A (the one that is hashed and probed) does not speed things up. Two questions I guess: Does the hashing already include putting the records in order? Can't the index save the time of hashing (and/or sorting if it does that), as the index can be probed the same way as the hash-table? TY!

– Peter
Nov 20 '18 at 19:57

The smaler table is hashed (not probed as I noted above and I can't edit it any more), which means not ordering but a direct key access. Basically both tables must be processed completely, i.e. all rows. Here is index a problem, not a help. FULL TABLE SCAN rules:)

– Marmite Bomber
Nov 20 '18 at 20:15

add a comment |

TYVM! I understand that you say that a hash join (not nested loop and not sort merge) is the way to go. I understand you also see no benefit in using indexes for the query speed here (i guess you mean it would do a FFS, which would be the same as the FTS on the table). May I ask: Could it be that your query loses about 50% speed (guessing here) by not telling Oracle that it can stop for a t1 row after it found the first match in t2 for that t1 row? I tried to tell Oracle that with my rownum = 1, but I guess a unique constraint or similar could do the trick? Do you see potential in this?

– Peter
Nov 20 '18 at 19:24

Definitively no in your case. Has join probes the smaller table and scans the other one. No need to stop after first match. And if you would have dups in the Btable the join would return duplicated records. What you address is query using IN or NOT EXISTS which leads to semior anti join with exact optimization on stopping after first row you mentioned.

– Marmite Bomber
Nov 20 '18 at 19:34

I think I understand. It's only doing one loop through table B anyway, so 'stopping' in scanning B does not help/work. You helped me a lot, thank you. The last bit I am missing now I think is the question why an index on table A (the one that is hashed and probed) does not speed things up. Two questions I guess: Does the hashing already include putting the records in order? Can't the index save the time of hashing (and/or sorting if it does that), as the index can be probed the same way as the hash-table? TY!

– Peter
Nov 20 '18 at 19:57

The smaler table is hashed (not probed as I noted above and I can't edit it any more), which means not ordering but a direct key access. Basically both tables must be processed completely, i.e. all rows. Here is index a problem, not a help. FULL TABLE SCAN rules:)

– Marmite Bomber
Nov 20 '18 at 20:15

TYVM! I understand that you say that a hash join (not nested loop and not sort merge) is the way to go. I understand you also see no benefit in using indexes for the query speed here (i guess you mean it would do a FFS, which would be the same as the FTS on the table). May I ask: Could it be that your query loses about 50% speed (guessing here) by not telling Oracle that it can stop for a t1 row after it found the first match in t2 for that t1 row? I tried to tell Oracle that with my rownum = 1, but I guess a unique constraint or similar could do the trick? Do you see potential in this?

– Peter
Nov 20 '18 at 19:24

Definitively no in your case. Has join probes the smaller table and scans the other one. No need to stop after first match. And if you would have dups in the Btable the join would return duplicated records. What you address is query using IN or NOT EXISTS which leads to semior anti join with exact optimization on stopping after first row you mentioned.

– Marmite Bomber
Nov 20 '18 at 19:34

I think I understand. It's only doing one loop through table B anyway, so 'stopping' in scanning B does not help/work. You helped me a lot, thank you. The last bit I am missing now I think is the question why an index on table A (the one that is hashed and probed) does not speed things up. Two questions I guess: Does the hashing already include putting the records in order? Can't the index save the time of hashing (and/or sorting if it does that), as the index can be probed the same way as the hash-table? TY!

– Peter
Nov 20 '18 at 19:57

The smaler table is hashed (not probed as I noted above and I can't edit it any more), which means not ordering but a direct key access. Basically both tables must be processed completely, i.e. all rows. Here is index a problem, not a help. FULL TABLE SCAN rules:)

– Marmite Bomber
Nov 20 '18 at 20:15

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

Search This Blog

Ufyukyu