Tez vs Spark - huge performance differences
I'm using HDP 2.6.4 and am seeing huge performance differences between Spark SQL and Hive on Tez. Here's a simple query on a table of ~95 M rows:
SELECT DT, SUM(1) FROM mydata GROUP BY DT
DT is the partition column, a string that marks the date.
In the Spark shell, with 15 executors, 10 GB of memory for the driver and 15 GB per executor, the query runs in 10-15 seconds.
When run from Hive (via beeline), the same query ran for 500+ seconds. (!!!)
To make things worse, the Hive application takes significantly more resources than the Spark shell session I ran the job in.
UPDATE: It finished: 1 row selected (672.152 seconds)
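For reference, the Spark side of the comparison would have been launched roughly like this (a sketch based only on the resources stated above; any other session options are assumed defaults):

```shell
# Launch a Spark shell matching the resources described above:
spark-shell --num-executors 15 --driver-memory 10g --executor-memory 15g

# Then, inside the shell:
#   scala> spark.sql("SELECT DT, SUM(1) FROM mydata GROUP BY DT").show()
```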
More information about the environment:
Only one queue is used, with the capacity scheduler
The job runs under my own user; we use Kerberos with LDAP
AM Resource: 4096 MB
tez.runtime.compress enabled, with Snappy
Data is in Parquet format, no compression applied
tez.task.resource.memory 6134 MB
tez.counters.max 10000
tez.counters.max.groups 3000
tez.runtime.io.sort.mb 8110 MB
tez.runtime.pipelined.sorter.sort.threads 2
tez.runtime.shuffle.fetch.buffer.percent 0.6
tez.runtime.shuffle.memory.limit.percent 0.25
tez.runtime.unordered.output.buffer.size-mb 460 MB
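A quick sanity check from beeline, before digging into the Tez settings above, might look like this (a sketch; the property values shown are illustrative, not a confirmed fix for this case):

```sql
-- In Hive, SET with no value prints the current setting for the session:
SET hive.execution.engine;               -- expect: tez
SET hive.vectorized.execution.enabled;   -- check whether vectorization is on

-- Then enable vectorization for the session and re-run the query:
SET hive.vectorized.execution.enabled=true;
SELECT DT, SUM(1) FROM mydata GROUP BY DT;
```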
Published comparisons between Spark and Tez usually come out in roughly the same terms, but I'm seeing dramatic differences.
What should be the first thing to check?
Thanks
apache-spark hive apache-spark-sql hortonworks-data-platform apache-tez
Try to find what exactly is running slow: mappers or reducers; check the slow containers' logs. How many mappers and reducers are running? Tez configuration also matters, so as it stands the question is too broad. It also seems you are not using partition statistics for the query; such a simple query should run fast. Better to use count(*) instead of sum(1).
– leftjoin
2 days ago
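The count(*)/statistics suggestion above can be sketched as follows (assuming gathering statistics on the table is permitted; with hive.compute.query.using.stats enabled, Hive can answer such counts from metadata instead of scanning the data):

```sql
-- Gather per-partition statistics, then let Hive serve counts from metadata:
ANALYZE TABLE mydata PARTITION (DT) COMPUTE STATISTICS;
SET hive.compute.query.using.stats=true;
SELECT DT, COUNT(*) FROM mydata GROUP BY DT;
```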
thank you, but this is just a sample query... it's pretty much like this for other types, too. i'll go into other things and update as I find things out
– hummingBird
2 days ago
What is the file format -- CSV, Avro, ORC, Parquet? Compressed? Does Hive run the query as hive on a different queue than your personal Spark session uses? Tez container size? Etc...
– Samson Scharfrichter
yesterday
@SamsonScharfrichter added more info to question... What else could be important?
– hummingBird
yesterday
hive.vectorized.execution.enabled?
– Samson Scharfrichter
yesterday
asked 2 days ago by hummingBird, edited yesterday