Tez vs Spark - huge performance differences











I'm using HDP 2.6.4 and am seeing huge differences between Spark SQL and Hive on Tez. Here's a simple query on a table of ~95M rows:



SELECT DT, SUM(1) FROM mydata GROUP BY DT


DT is the partition column, a string that holds the date.



In spark-shell, with 15 executors, 10G of memory for the driver and 15G per executor, the query runs in 10-15 seconds.



When running on Hive (from beeline), the query runs (actually, is still running) for 500+ seconds. (!!!)
To make things worse, this application takes significantly more resources than the spark-shell session I ran the job in.



UPDATE: It finished: 1 row selected (672.152 seconds)
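For scale, the two timings quoted above can be turned into a slowdown factor. This is only arithmetic on the numbers in the question (672.152 seconds versus 10-15 seconds); the variable and function names are illustrative, not part of any tool.

```python
# Timings quoted in the question (seconds).
hive_on_tez_s = 672.152           # beeline: "1 row selected (672.152 seconds)"
spark_sql_range_s = (10.0, 15.0)  # spark-shell: "10-15 seconds"


def slowdown(slow_s: float, fast_s: float) -> float:
    """Return how many times longer the slow run took than the fast one."""
    return slow_s / fast_s


# The smallest factor compares against the slowest Spark run, and vice versa.
low = slowdown(hive_on_tez_s, spark_sql_range_s[1])
high = slowdown(hive_on_tez_s, spark_sql_range_s[0])
print(f"Hive on Tez took {low:.0f}x to {high:.0f}x longer than Spark SQL")
```

A gap of this size is well beyond what differences in engine design alone would normally explain, which is why the configuration details that follow matter.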



More information about the environment:




  • Only one queue used, with capacity scheduler


  • The job runs under my own user. We use Kerberos with LDAP


  • AM Resource: 4096 MB


  • using tez.runtime.compress with Snappy


  • data is in Parquet format, no compression applied


  • tez.task.resource.memory 6134 MB


  • tez.counters.max 10000


  • tez.counters.max.groups 3000


  • tez.runtime.io.sort.mb 8110 MB


  • tez.runtime.pipelined.sorter.sort.threads 2


  • tez.runtime.shuffle.fetch.buffer.percent 0.6


  • tez.runtime.shuffle.memory.limit.percent 0.25


  • tez.runtime.unordered.output.buffer.size-mb 460 MB
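One relationship in the list above is easy to check mechanically: whether any buffer setting is larger than the task memory it has to fit in. The sketch below only compares the numbers as listed in the question. Whether Tez honours, caps, or scales such a request at runtime is not established here, so treat a hit as something to look into rather than a diagnosis. The helper name is hypothetical.

```python
# Values copied from the environment list in the question (all in MB).
settings_mb = {
    "tez.task.resource.memory": 6134,
    "tez.runtime.io.sort.mb": 8110,
    "tez.runtime.unordered.output.buffer.size-mb": 460,
}


def buffers_exceeding_task_memory(settings: dict) -> dict:
    """Return the buffer settings whose size is larger than the task memory."""
    task_mb = settings["tez.task.resource.memory"]
    return {
        name: size
        for name, size in settings.items()
        if name != "tez.task.resource.memory" and size > task_mb
    }


oversized = buffers_exceeding_task_memory(settings_mb)
for name, size in oversized.items():
    ratio = size / settings_mb["tez.task.resource.memory"]
    print(f"{name} = {size} MB is {ratio:.2f}x the task memory")
```

With the values from the question, the only setting flagged is tez.runtime.io.sort.mb, which is listed as larger than tez.task.resource.memory.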



Comparisons between Spark and Tez usually put the two on roughly equal terms, but I'm seeing dramatic differences.



What should be the first thing to check?



Thanks










  • Try to find out what exactly is running slowly: mappers or reducers. Check the logs of the slow containers. How many mappers and reducers are running? The Tez configuration is also important. As it stands, the question is too broad. It also seems you are not using partition statistics to compute the query. Such a simple query should run fast. Better to use count(*) instead of sum(1).
    – leftjoin
    2 days ago










  • Thank you, but this is just a sample query... It's pretty much like this for other types of queries, too. I'll look into other things and update as I find things out.
    – hummingBird
    2 days ago










  • What is the file format -- CSV, AVRO, ORC, Parquet? Compressed? Does Hive run the query as hive on a different queue than your personal Spark session uses? Tez container size? Etc...
    – Samson Scharfrichter
    yesterday










  • @SamsonScharfrichter added more info to question... What else could be important?
    – hummingBird
    yesterday










  • hive.vectorized.execution.enabled?
    – Samson Scharfrichter
    yesterday
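On the count(*) versus sum(1) suggestion in the comments: the two aggregates return the same per-group counts, so switching between them does not change the result. The sketch below shows this with SQLite from Python's standard library as a stand-in for Hive. The table contents are made-up sample rows, and the example says nothing about how either form performs on Hive or Tez.

```python
import sqlite3

# In-memory SQLite table standing in for the Hive table `mydata`.
# The rows below are invented sample data, not taken from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mydata (DT TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO mydata VALUES (?, ?)",
    [("2018-11-01", 1), ("2018-11-01", 2), ("2018-11-02", 3)],
)

# Same grouping, two ways of counting rows per DT.
sum_rows = conn.execute(
    "SELECT DT, SUM(1) FROM mydata GROUP BY DT ORDER BY DT"
).fetchall()
count_rows = conn.execute(
    "SELECT DT, COUNT(*) FROM mydata GROUP BY DT ORDER BY DT"
).fetchall()

print(sum_rows)
print(count_rows)
conn.close()
```

Both queries return one row per DT value with the number of rows in that group.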















apache-spark hive apache-spark-sql hortonworks-data-platform apache-tez






asked 2 days ago by hummingBird, edited yesterday