How to join datasets with values between other values?
I have a use case where I need to join two data-frames.
ID view
ID BookTime
1 2
1 5
2 8
2 3
3 4
FareRule view
Start End Fare
1 3 10
3 6 20
6 10 25
The output is the result of a join that checks, for each BookTime in the ID table, which FareRule window it falls into (between Start and End); the matching Fare becomes FareDue.
ID FareDue
1 10
1 20
2 25
2 20
3 20
I am creating views out of these data-frames and using CROSS JOIN to join them. But CROSS JOIN is expensive, so is there a better way to join them?
SELECT
  ID,
  Fare AS FareDue
FROM
  ID
CROSS JOIN
  FareRule
WHERE
  BookTime >= Start
  AND BookTime < End
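For reference, the WHERE clause above expresses a half-open interval lookup (Start <= BookTime < End). A minimal plain-Python sketch of the intended mapping (an illustration of the semantics, not Spark code):

```python
# Rows from the two views in the question.
id_rows = [(1, 2), (1, 5), (2, 8), (2, 3), (3, 4)]   # (ID, BookTime)
fare_rules = [(1, 3, 10), (3, 6, 20), (6, 10, 25)]   # (Start, End, Fare)

# Half-open lookup: each BookTime matches at most one rule when the
# rules partition the range, so no row is duplicated.
result = [(i, fare)
          for i, t in id_rows
          for start, end, fare in fare_rules
          if start <= t < end]
print(result)  # [(1, 10), (1, 20), (2, 25), (2, 20), (3, 20)]
```

This reproduces the expected output table, including BookTime = 3 matching only the (3, 6) rule because the upper bound is exclusive.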
apache-spark apache-spark-sql
Possible duplicate of How to improve broadcast Join speed in Spark
– user10465355
Nov 20 '18 at 19:55
Why are you using cross join? From the looks of it you should be getting the same result with a regular join.
– Vitaliy
Nov 21 '18 at 6:34
Like other SQL query engines, Spark has a query optimizer, and there is no difference between CROSS JOIN (or comma) with WHERE vs INNER JOIN ON. Read about the general idea of relational query optimization/implementation anywhere (including dozens of published academic textbooks free online in PDF). Read about the Catalyst optimizer and writing optimal Spark SQL queries anywhere. (Many of the implementers who authored the ACM paper "Spark SQL: Relational Data Processing in Spark" work at Databricks.) PS What did you learn from googling re Spark SQL query optimization before you considered asking?
– philipxy
Nov 28 '18 at 21:13
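As an aside, the equivalence between CROSS JOIN plus WHERE and INNER JOIN ON can be checked on any SQL engine. A small sketch using Python's built-in sqlite3 (not Spark, just to show both forms return the same rows; `"End"` is quoted because END is a reserved word in SQLite):

```python
import sqlite3

# In-memory tables mirroring the ID and FareRule views from the question.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE ID(ID INTEGER, BookTime INTEGER);
    INSERT INTO ID VALUES (1,2),(1,5),(2,8),(2,3),(3,4);
    CREATE TABLE FareRule(Start INTEGER, "End" INTEGER, Fare INTEGER);
    INSERT INTO FareRule VALUES (1,3,10),(3,6,20),(6,10,25);
""")

cross = con.execute("""
    SELECT ID, Fare AS FareDue
    FROM ID CROSS JOIN FareRule
    WHERE BookTime >= Start AND BookTime < "End"
    ORDER BY ID, FareDue
""").fetchall()

inner = con.execute("""
    SELECT ID, Fare AS FareDue
    FROM ID INNER JOIN FareRule
      ON BookTime >= Start AND BookTime < "End"
    ORDER BY ID, FareDue
""").fetchall()

print(cross == inner)  # True: both forms produce the same rows
```

A cost-based optimizer such as Catalyst plans both forms the same way once the predicate is pushed into the join.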
edited Nov 21 '18 at 11:01 – Jacek Laskowski
asked Nov 20 '18 at 19:49 – Amaan Khan
1 Answer
Given the following datasets:
val id = Seq((1, 2), (1, 5), (2, 8), (2, 3), (3, 4)).toDF("ID", "BookTime")
scala> id.show
+---+--------+
| ID|BookTime|
+---+--------+
| 1| 2|
| 1| 5|
| 2| 8|
| 2| 3|
| 3| 4|
+---+--------+
val fareRule = Seq((1,3,10), (3,6,20), (6,10,25)).toDF("start", "end", "fare")
scala> fareRule.show
+-----+---+----+
|start|end|fare|
+-----+---+----+
| 1| 3| 10|
| 3| 6| 20|
| 6| 10| 25|
+-----+---+----+
You can simply join them together using a between expression:
val q = id.join(fareRule).where('BookTime between('start, 'end)).select('id, 'fare)
scala> q.show
+---+----+
| id|fare|
+---+----+
| 1| 10|
| 1| 20|
| 2| 25|
| 2| 10|
| 2| 20|
| 3| 20|
+---+----+
You may want to adjust between so the boundaries are exclusive on one side: by default, between is inclusive of both the lower and upper bound, which is why BookTime = 3 matches two fare rules in the output above (the extra | 2| 10| row).
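To illustrate the inclusivity point, a plain-Python sketch of the two predicates (an illustration of the semantics, not Spark code):

```python
# Rows from the two views in the question.
id_rows = [(1, 2), (1, 5), (2, 8), (2, 3), (3, 4)]   # (ID, BookTime)
fare_rules = [(1, 3, 10), (3, 6, 20), (6, 10, 25)]   # (start, end, fare)

def lookup(match):
    """Join each (ID, BookTime) against every rule the predicate accepts."""
    return [(i, fare)
            for i, t in id_rows
            for start, end, fare in fare_rules
            if match(t, start, end)]

inclusive = lookup(lambda t, s, e: s <= t <= e)  # like between: both bounds inclusive
half_open = lookup(lambda t, s, e: s <= t < e)   # exclusive upper bound

print(inclusive)  # BookTime = 3 matches both (1, 3) and (3, 6) -> extra (2, 10) row
print(half_open)  # each BookTime matches exactly one rule
```

With inclusive bounds, BookTime = 3 falls in both the (1, 3) and (3, 6) windows, producing the duplicate fare; the half-open check yields exactly the output table from the question.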
answered Nov 21 '18 at 11:00 – Jacek Laskowski