Hadoop multinode cluster too slow. How do I increase speed of data processing?
I have a 6-node cluster - 5 DN and 1 NN. All have 32 GB RAM. All slaves have an 8.7 TB HDD; the DN has a 1.1 TB HDD. Here are the links to my core-site.xml, hdfs-site.xml, and yarn-site.xml.
After running an MR job, I checked my RAM usage, which is shown below:
Namenode:
free -g
total used free shared buff/cache available
Mem: 31 7 15 0 8 22
Swap: 31 0 31
Datanodes:
Slave1:
free -g
total used free shared buff/cache available
Mem: 31 6 6 0 18 24
Swap: 31 3 28
Slave2:
total used free shared buff/cache available
Mem: 31 2 4 0 24 28
Swap: 31 1 30
Likewise, the other slaves have similar RAM usage. Even if only a single job is running, any other submitted jobs enter the ACCEPTED state and wait for the first job to finish before they start.
Here is the ps output for the JAR that I submitted to execute the MR job:
/opt/jdk1.8.0_77//bin/java -Dproc_jar -Xmx1000m
-Dhadoop.log.dir=/home/hduser/hadoop/logs -Dyarn.log.dir=/home/hduser/hadoop/logs
-Dhadoop.log.file=yarn.log -Dyarn.log.file=yarn.log
-Dyarn.home.dir= -Dyarn.id.str= -Dhadoop.root.logger=INFO,console
-Dyarn.root.logger=INFO,console -Dyarn.policy.file=hadoop-policy.xml
-Dhadoop.log.dir=/home/hduser/hadoop/logs -Dyarn.log.dir=/home/hduser/hadoop/logs
-Dhadoop.log.file=yarn.log -Dyarn.log.file=yarn.log
-Dyarn.home.dir=/home/hduser/hadoop -Dhadoop.home.dir=/home/hduser/hadoop
-Dhadoop.root.logger=INFO,console -Dyarn.root.logger=INFO,console
-classpath --classpath of jars
org.apache.hadoop.util.RunJar abc.jar abc.mydriver2 /raw_data /mr_output/02
Are there any settings that I can change or add to allow multiple jobs to run simultaneously and speed up the current data processing? I am using Hadoop 2.5.2. The cluster is in a PROD environment and I cannot take it down to upgrade the Hadoop version.
EDIT 1: I started a new MR job with 362 GB of data, and still the RAM usage is around 8 GB with 22 GB of RAM free. Here is my job submission command:
nohup yarn jar abc.jar def.mydriver1 /raw_data /mr_output/01 &
Here is some more information:
18/11/22 14:09:07 INFO input.FileInputFormat: Total input paths to process : 130363
18/11/22 14:09:10 INFO mapreduce.JobSubmitter: number of splits:130372
Are there additional memory parameters we can use when submitting the job to make memory usage more efficient?
hadoop cluster-computing yarn hadoop2
asked Nov 21 '18 at 6:09, edited Nov 22 '18 at 10:18 – Rishabh Dixit
3 Answers
I believe you can edit the mapred-default.xml. The parameters you are looking for are:
- mapreduce.job.running.map.limit
- mapreduce.job.running.reduce.limit
0 (probably what it is set to at the moment) means UNLIMITED.
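A minimal sketch of what that could look like (site-specific overrides normally go in mapred-site.xml rather than mapred-default.xml; the values below are illustrative, and these two properties were only added in later Hadoop 2.x releases, so they may not be honored on 2.5.2):

<!-- mapred-site.xml: illustrative per-job concurrency caps -->
<configuration>
  <!-- Cap how many map tasks of a single job may run at once (0 = unlimited) -->
  <property>
    <name>mapreduce.job.running.map.limit</name>
    <value>20</value>
  </property>
  <!-- Cap how many reduce tasks of a single job may run at once (0 = unlimited) -->
  <property>
    <name>mapreduce.job.running.reduce.limit</name>
    <value>10</value>
  </property>
</configuration>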
Looking at your memory, 32 GB per machine seems too small. What CPU/cores do you have? I would expect quad CPU / 16 cores minimum per machine.
answered Nov 21 '18 at 7:15 – Tim Seed
I do not have a mapred-default.xml in $HADOOP_HOME/etc/hadoop/. If my memory is too small per machine, then why does the free command show a lot of free memory? Here is the CPU used on each machine - Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz - and nproc gives me 24 cores per machine.
– Rishabh Dixit, Nov 21 '18 at 9:09
How long do your MapReduce jobs take? I was running 30 DNs, each with 128 GB of memory, and we dropped MapReduce for Spark/HBase due to speed. Have you tried using Hive? Are the results the same? Phoenix (from Hortonworks) was also good.
– Tim Seed, Nov 21 '18 at 9:23
Yes, I am using Hive 1.2.1 stable. After submitting a long query I get similar output: Launching Job 7 out of 21; Number of reduce tasks determined at compile time: 1; In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number>; In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number>; In order to set a constant number of reducers: set mapreduce.job.reduces=<number>; Hadoop job information for Stage-14: number of mappers: 380; number of reducers: 487.
– Rishabh Dixit, Nov 21 '18 at 9:28
I have an estimate for the MR job based on input data size: 300 GB of input data takes around 24 hours to process. And free shows similar memory consumption even when Hive queries with the above mapper and reducer counts are running.
– Rishabh Dixit, Nov 21 '18 at 9:30
Based on your yarn-site.xml, your yarn.scheduler.minimum-allocation-mb setting of 10240 is too high. It effectively means you have at best 18 vcores available. That might be the right setting for a cluster with tons of memory, but for 32 GB it's way too large. Drop it to 1 or 2 GB.
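A minimal sketch of the relevant yarn-site.xml entries, assuming roughly 28 GB of each 32 GB slave is handed to the NodeManager (values are illustrative, not taken from your linked files):

<!-- yarn-site.xml: illustrative values for a 32 GB node -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>   <!-- smallest container YARN will allocate: 1 GB instead of 10 GB -->
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>28672</value>  <!-- memory each NodeManager may hand out to containers -->
</property>
<!-- per-task requests then come from mapred-site.xml, e.g. mapreduce.map.memory.mb / mapreduce.reduce.memory.mb -->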
Remember, an HDFS block is what each mapper typically consumes, so 1-2 GB of memory for 128 MB of data sounds more reasonable. The added benefit is that you could have up to 180 vcores available, which will process jobs 10x faster than 18 vcores.
answered Nov 21 '18 at 17:01 – tk421
Do I need to restart the cluster every time I make changes in the YARN config files, or will just changing the XML file make my next job run with the new value of yarn.scheduler.minimum-allocation-mb? I have made it 1 GB, which sounds reasonable. I am yet to execute the job with the new settings. I will let you know how this fares. Thank you for your time!
– Rishabh Dixit, Nov 22 '18 at 7:29
I started a new MR job with 362 GB of data and still the RAM usage is around 8 GB and 22 GB of RAM is free. Here is my job submission command: nohup yarn jar abc.jar def.mydriver1 /raw_data /mr_output/01 &. Here is some more information: 18/11/22 14:09:07 INFO input.FileInputFormat: Total input paths to process : 130363 / 18/11/22 14:09:10 INFO mapreduce.JobSubmitter: number of splits:130372. Are there additional memory parameters we can use when submitting the job to make memory usage more efficient?
– Rishabh Dixit, Nov 22 '18 at 10:11
You need to restart all the YARN services when you change yarn-site.xml. You should look at the YARN UI (RM_URL/cluster/apps) to see what your usage is. There are probably more settings you need to tune.
– tk421, Nov 22 '18 at 17:43
To give you an idea of how a 4-node cluster with 32 cores and 128 GB of RAM per node is set up:
For Tez: divide RAM by cores to get the maximum Tez container size.
So in my case: 128/32 = 4 GB.
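The TEZ: and YARN: settings blocks from the original answer are not reproduced here; purely as a hypothetical sketch of that 4 GB-per-container rule (property names from standard tez-site.xml / yarn-site.xml, values are assumptions, not the poster's actual settings):

<!-- tez-site.xml: hypothetical values following the RAM/cores = 4 GB rule -->
<property>
  <name>tez.am.resource.memory.mb</name>
  <value>4096</value>   <!-- Tez ApplicationMaster container -->
</property>
<property>
  <name>tez.task.resource.memory.mb</name>
  <value>4096</value>   <!-- each Tez task container -->
</property>

<!-- yarn-site.xml: hypothetical values for a 128 GB node -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>122880</value> <!-- ~120 GB for containers, leaving headroom for the OS and daemons -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>4096</value>   <!-- cap single containers at the Tez container size -->
</property>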
answered Dec 18 '18 at 15:50, edited Dec 18 '18 at 16:05 – Petro