Tez vs Spark - huge performance differences
I'm using HDP 2.6.4 and am seeing huge performance differences between Spark SQL and Hive on Tez. Here's a simple query on a table of ~95 M rows:
SELECT DT, SUM(1) FROM mydata GROUP BY DT
DT is the partition column, a string that marks the date.
In the Spark shell, with 15 executors, 10 GB of memory for the driver and 15 GB per executor, the query runs in 10-15 seconds.
When run from Hive (via beeline), the same query ran for 500+ seconds. (!!!)
To make things worse, the Hive application takes significantly more resources than the Spark shell session I ran the job in.
UPDATE: It finished: 1 row selected (672.152 seconds)
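For reference, the Spark side of the comparison would have been launched roughly like this (a sketch based only on the resources stated above; any other session options are assumed defaults):

```shell
# Launch a Spark shell matching the resources described above:
spark-shell --num-executors 15 --driver-memory 10g --executor-memory 15g

# Then, inside the shell:
#   scala> spark.sql("SELECT DT, SUM(1) FROM mydata GROUP BY DT").show()
```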
More information about the environment:
Only one queue is used, with the capacity scheduler
The job runs under my own user; we use Kerberos with LDAP
AM Resource: 4096 MB
tez.runtime.compress enabled, with Snappy
Data is in Parquet format, no compression applied
tez.task.resource.memory 6134 MB
tez.counters.max 10000
tez.counters.max.groups 3000
tez.runtime.io.sort.mb 8110 MB
tez.runtime.pipelined.sorter.sort.threads 2
tez.runtime.shuffle.fetch.buffer.percent 0.6
tez.runtime.shuffle.memory.limit.percent 0.25
tez.runtime.unordered.output.buffer.size-mb 460 MB
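A quick sanity check from beeline, before digging into the Tez settings above, might look like this (a sketch; the property values shown are illustrative, not a confirmed fix for this case):

```sql
-- In Hive, SET with no value prints the current setting for the session:
SET hive.execution.engine;               -- expect: tez
SET hive.vectorized.execution.enabled;   -- check whether vectorization is on

-- Then enable vectorization for the session and re-run the query:
SET hive.vectorized.execution.enabled=true;
SELECT DT, SUM(1) FROM mydata GROUP BY DT;
```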
Published comparisons between Spark and Tez usually come out in roughly the same terms, but I'm seeing dramatic differences.
What should be the first thing to check?
Thanks
apache-spark hive apache-spark-sql hortonworks-data-platform apache-tez
Try to find what exactly is running slow: mappers or reducers; check the slow containers' logs. How many mappers and reducers are running? Tez configuration also matters, so as it stands the question is too broad. It also seems you are not using partition statistics for the query; such a simple query should run fast. Better to use count(*) instead of sum(1).
– leftjoin
2 days ago
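The count(*)/statistics suggestion above can be sketched as follows (assuming gathering statistics on the table is permitted; with hive.compute.query.using.stats enabled, Hive can answer such counts from metadata instead of scanning the data):

```sql
-- Gather per-partition statistics, then let Hive serve counts from metadata:
ANALYZE TABLE mydata PARTITION (DT) COMPUTE STATISTICS;
SET hive.compute.query.using.stats=true;
SELECT DT, COUNT(*) FROM mydata GROUP BY DT;
```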
thank you, but this is just a sample query... it's pretty much like this for other types, too. i'll go into other things and update as I find things out
– hummingBird
2 days ago
What is the file format -- CSV, Avro, ORC, Parquet? Compressed? Does Hive run the query as hive on a different queue than your personal Spark session uses? Tez container size? Etc...
– Samson Scharfrichter
yesterday
@SamsonScharfrichter added more info to question... What else could be important?
– hummingBird
yesterday
hive.vectorized.execution.enabled?
– Samson Scharfrichter
yesterday
asked 2 days ago by hummingBird, edited yesterday