How to join datasets with values between other values?

I have a use case where I need to join 2 data-frames.

ID view

ID  BookTime
1   2
1   5
2   8
2   3
3   4

FareRule view

Start End Fare
1     3   10
3     6   20
6     10  25

The output is the result of a join that looks up each BookTime from the ID table. The Fare comes from the FareRule row whose window (from Start up to, but not including, End) contains that BookTime.

ID  FareDue
1   10
1   20
2   25
2   20
3   20

I am creating views from these data-frames and using CROSS JOIN to join them. But a CROSS JOIN is expensive, so is there a better way to join them?

SELECT
    ID,
    Fare AS FareDue
FROM
    ID
CROSS JOIN
    FareRule
WHERE
    BookTime >= Start
    AND BookTime < End

edited Nov 21 '18 at 11:01 by Jacek Laskowski

asked Nov 20 '18 at 19:49 by Amaan Khan

Possible duplicate of How to improve broadcast Join speed in Spark

– user10465355
Nov 20 '18 at 19:55

Why are you using cross join? From the looks of it you should be getting the same result with a regular join.

– Vitaliy
Nov 21 '18 at 6:34

Like other SQL query engines, Spark has a query optimizer, and there is no difference between CROSS JOIN (or comma) with WHERE vs INNER JOIN ON. Read about the general idea of relational query optimization/implementation anywhere. (Including dozens of published academic textbooks free online in pdf.) Read about the optimizer Catalyst and writing optimal Spark SQL queries anywhere. (Many implementers who authored the ACM paper "Spark SQL: Relational Data Processing in Spark" work at Databricks.) PS What did you learn from googling about Spark SQL query optimization before you considered asking?

– philipxy
Nov 28 '18 at 21:13
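
The claim in the comment above can be checked by printing the plans of both query forms. This is only a sketch: the object name FareJoinPlans and the local SparkSession settings are assumptions for illustration, and the Start and End columns are backtick-quoted in case a given Spark version treats END as a keyword. The plan text itself depends on the Spark version, so it is printed rather than assumed here.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object; only the tables and queries come from the question.
object FareJoinPlans {
  val spark = SparkSession.builder()
    .master("local[*]")          // assumption: local run for experimentation
    .appName("fare-join-plans")
    .getOrCreate()
  import spark.implicits._

  // Register the two sample tables from the question as temporary views.
  Seq((1, 2), (1, 5), (2, 8), (2, 3), (3, 4))
    .toDF("ID", "BookTime")
    .createOrReplaceTempView("ID")
  Seq((1, 3, 10), (3, 6, 20), (6, 10, 25))
    .toDF("Start", "End", "Fare")
    .createOrReplaceTempView("FareRule")

  // The original form: CROSS JOIN filtered by WHERE.
  val crossJoin = spark.sql(
    """SELECT ID, Fare AS FareDue
      |FROM ID CROSS JOIN FareRule
      |WHERE BookTime >= `Start` AND BookTime < `End`""".stripMargin)

  // The same range condition written as INNER JOIN ... ON.
  val innerJoin = spark.sql(
    """SELECT ID, Fare AS FareDue
      |FROM ID INNER JOIN FareRule
      |ON BookTime >= `Start` AND BookTime < `End`""".stripMargin)

  def main(args: Array[String]): Unit = {
    // Print both plans so they can be compared side by side.
    crossJoin.explain()
    innerJoin.explain()
  }
}
```

Both queries return the same five rows for the sample data; whether the plans are identical is what the explain() output lets you verify on your own Spark version.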

apache-spark apache-spark-sql


1 Answer
-1

Given the following datasets:

val id = Seq((1, 2), (1, 5), (2, 8), (2, 3), (3, 4)).toDF("ID", "BookTime")

scala> id.show
+---+--------+
| ID|BookTime|
+---+--------+
|  1|       2|
|  1|       5|
|  2|       8|
|  2|       3|
|  3|       4|
+---+--------+

val fareRule = Seq((1,3,10), (3,6,20), (6,10,25)).toDF("start", "end", "fare")

scala> fareRule.show
+-----+---+----+
|start|end|fare|
+-----+---+----+
|    1|  3|  10|
|    3|  6|  20|
|    6| 10|  25|
+-----+---+----+

You can simply join them using a between expression.

val q = id.join(fareRule).where('BookTime between('start, 'end)).select('id, 'fare)

scala> q.show
+---+----+
| id|fare|
+---+----+
|  1|  10|
|  1|  20|
|  2|  25|
|  2|  10|
|  2|  20|
|  3|  20|
+---+----+

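Note that between is inclusive on both the lower and the upper bound, which is why BookTime 3 matches two fare rules in the output above (ID 2 gets both 10 and 20). You may want to replace between with explicit comparisons so that the boundary is exclusive on one side.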
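
The adjustment to a one-sided exclusive boundary could look like the sketch below, which replaces between with the same half-open condition the question uses (BookTime >= start and BookTime < end). The object name HalfOpenFareJoin and the local SparkSession settings are assumptions for illustration. The broadcast hint is also an addition that is not part of the answer: it is only reasonable if the fare table is small enough to fit in memory on each executor.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Hypothetical wrapper object; the datasets are the ones from the answer.
object HalfOpenFareJoin {
  val spark = SparkSession.builder()
    .master("local[*]")          // assumption: local run for experimentation
    .appName("half-open-fare-join")
    .getOrCreate()
  import spark.implicits._

  val id = Seq((1, 2), (1, 5), (2, 8), (2, 3), (3, 4)).toDF("ID", "BookTime")
  val fareRule = Seq((1, 3, 10), (3, 6, 20), (6, 10, 25)).toDF("start", "end", "fare")

  // Half-open window [start, end): a BookTime equal to a boundary
  // belongs to exactly one fare rule.
  val q = id
    .join(broadcast(fareRule),   // assumption: fareRule is small
      $"BookTime" >= $"start" && $"BookTime" < $"end")
    .select($"ID", $"fare".as("FareDue"))

  def main(args: Array[String]): Unit = q.show()
}
```

With the sample data this yields one fare per booking (five rows), so ID 2 with BookTime 3 gets only the fare 20 instead of both 10 and 20.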

answered Nov 21 '18 at 11:00 by Jacek Laskowski
