Hadoop multi-node cluster too slow. How do I increase the speed of data processing?

I have a 6-node cluster: 5 DataNodes (DN) and 1 NameNode (NN). All nodes have 32 GB RAM. All slaves have an 8.7 TB HDD; the NN has a 1.1 TB HDD. Here is the link to my core-site.xml, hdfs-site.xml, and yarn-site.xml.



After running an MR job, I checked my RAM usage, which is shown below:



Namenode:

free -g
              total        used        free      shared  buff/cache   available
Mem:             31           7          15           0           8          22
Swap:            31           0          31


Datanodes:

Slave1:

free -g
              total        used        free      shared  buff/cache   available
Mem:             31           6           6           0          18          24
Swap:            31           3          28


Slave2:

              total        used        free      shared  buff/cache   available
Mem:             31           2           4           0          24          28
Swap:            31           1          30


Likewise, the other slaves show similar RAM usage. Even when only a single job is running, any additionally submitted jobs enter the ACCEPTED state and wait for the first job to finish before they start.



Here is the ps output for the JAR that I submitted to execute the MR job:



/opt/jdk1.8.0_77//bin/java -Dproc_jar -Xmx1000m 
-Dhadoop.log.dir=/home/hduser/hadoop/logs -Dyarn.log.dir=/home/hduser/hadoop/logs
-Dhadoop.log.file=yarn.log -Dyarn.log.file=yarn.log
-Dyarn.home.dir= -Dyarn.id.str= -Dhadoop.root.logger=INFO,console
-Dyarn.root.logger=INFO,console -Dyarn.policy.file=hadoop-policy.xml
-Dhadoop.log.dir=/home/hduser/hadoop/logs -Dyarn.log.dir=/home/hduser/hadoop/logs
-Dhadoop.log.file=yarn.log -Dyarn.log.file=yarn.log
-Dyarn.home.dir=/home/hduser/hadoop -Dhadoop.home.dir=/home/hduser/hadoop
-Dhadoop.root.logger=INFO,console -Dyarn.root.logger=INFO,console
-classpath --classpath of jars
org.apache.hadoop.util.RunJar abc.jar abc.mydriver2 /raw_data /mr_output/02


Are there any settings that I can change or add to allow multiple jobs to run simultaneously and speed up the current data processing? I am using Hadoop 2.5.2. The cluster is in a PROD environment and I cannot take it down to upgrade the Hadoop version.



EDIT 1: I started a new MR job with 362 GB of data, and still the RAM usage is around 8 GB with 22 GB of RAM free. Here is my job submission command:



nohup yarn jar abc.jar def.mydriver1 /raw_data /mr_output/01 &


Here is some more information:



18/11/22 14:09:07 INFO input.FileInputFormat: Total input paths to process : 130363
18/11/22 14:09:10 INFO mapreduce.JobSubmitter: number of splits:130372


Are there additional memory parameters that we can use when submitting the job to achieve more efficient memory usage?










hadoop cluster-computing yarn hadoop2

asked Nov 21 '18 at 6:09 by Rishabh Dixit (edited Nov 22 '18 at 10:18)


3 Answers




















I believe you can edit mapred-default.xml.



The parameters you are looking for are:




          • mapreduce.job.running.map.limit

          • mapreduce.job.running.reduce.limit


0 (probably what it is set to at the moment) means UNLIMITED.



Looking at your memory, 32 GB per machine seems too small.



What CPU/cores do you have? I would expect a quad CPU / 16 cores minimum per machine.
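
For concreteness, here is a minimal sketch of that suggestion as a mapred-site.xml override. The property names are real MapReduce keys, but the limit values are illustrative assumptions, and these per-job limits were added in later Hadoop 2.x releases, so they may not be available on 2.5.2:

<!-- mapred-site.xml: cap how many map/reduce tasks a single job may run
     at once, leaving containers free for other jobs (values illustrative) -->
<property>
  <name>mapreduce.job.running.map.limit</name>
  <value>100</value>   <!-- 0, the default, means unlimited -->
</property>
<property>
  <name>mapreduce.job.running.reduce.limit</name>
  <value>50</value>
</property>

Note that site-specific overrides normally belong in mapred-site.xml; mapred-default.xml ships inside the Hadoop JARs, which is why it does not appear under $HADOOP_HOME/etc/hadoop/ (see the first comment below).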






answered Nov 21 '18 at 7:15 by Tim Seed
• I do not have a mapred-default.xml in $HADOOP_HOME/etc/hadoop/. If my memory is too small per machine, then why does the free command show a lot of free memory? Here is the CPU used on each machine: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, and the nproc command gives me 24 cores per machine.

            – Rishabh Dixit
            Nov 21 '18 at 9:09













• How long do your MapReduce jobs take? I was running 30 DNs, each with 128 GB of memory, and we dropped MapReduce for Spark/HBase due to speed. Have you tried using Hive? Are the results the same? Phoenix (from Hortonworks) was also good.

            – Tim Seed
            Nov 21 '18 at 9:23











• Yes, I am using Hive 1.2.1 stable. After submitting a long query I get similar output:
  Launching Job 7 out of 21
  Number of reduce tasks determined at compile time: 1
  In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number>
  In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number>
  In order to set a constant number of reducers: set mapreduce.job.reduces=<number>
  Hadoop job information for Stage-14: number of mappers: 380; number of reducers: 487

            – Rishabh Dixit
            Nov 21 '18 at 9:28











• I have an estimate of MR job duration based on input data size: 300 GB of input data takes around 24 hours to process. And the free command shows similar memory consumption even when Hive queries with the above-mentioned mapper and reducer counts are running.

            – Rishabh Dixit
            Nov 21 '18 at 9:30





















Based on your yarn-site.xml, your yarn.scheduler.minimum-allocation-mb setting of 10240 is too high. This effectively means you have, at best, 18 vcores available. This might be the right setting for a cluster where you have tons of memory, but for 32 GB it's way too large. Drop it to 1 or 2 GB.



Remember, an HDFS block is what each mapper typically consumes. So 1-2 GB of memory for 128 MB of data sounds more reasonable. The added benefit is that you could have up to 180 vcores available, which will process jobs 10x faster than 18 vcores.
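
As a sketch of the suggested change (only the 10240 figure comes from the question's yarn-site.xml; the replacement value is the 1 GB suggested above): with roughly 30 GB of NodeManager memory per slave, a 10240 MB minimum allocation fits only about 3 containers per node, while 1024 MB fits about 30, which is where the roughly 10x figure comes from.

<!-- yarn-site.xml: shrink the smallest container YARN will grant.
     Actual container counts also depend on yarn.nodemanager.resource.memory-mb. -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>

As noted in the comments below, the YARN services must be restarted for yarn-site.xml changes to take effect.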






answered Nov 21 '18 at 17:01 by tk421
• Do I need to restart the cluster every time I make changes in the YARN config files, or will just changing the XML file make my next job run with the new value of yarn.scheduler.minimum-allocation-mb? I have made it 1 GB, which sounds reasonable. I am yet to execute the job with the new settings. I will let you know how this fares. Thank you for your time!

            – Rishabh Dixit
            Nov 22 '18 at 7:29











• I started a new MR job with 362 GB of data and still the RAM usage is around 8 GB, with 22 GB of RAM free. Here is my job submission command: nohup yarn jar abc.jar def.mydriver1 /raw_data /mr_output/01 &. Here is some more information:
  18/11/22 14:09:07 INFO input.FileInputFormat: Total input paths to process : 130363
  18/11/22 14:09:10 INFO mapreduce.JobSubmitter: number of splits:130372
  Are there additional memory parameters that we can use when submitting the job to achieve more efficient memory usage?

            – Rishabh Dixit
            Nov 22 '18 at 10:11













• You need to restart all the YARN services when you change yarn-site.xml. You should look at the YARN UI (RM_URL/cluster/apps) to see what your usage is. There are probably more settings you need to tune.

            – tk421
            Nov 22 '18 at 17:43



















To give you an idea of how a 4-node cluster with 32 cores and 128 GB RAM per node is set up:



For Tez: divide RAM by cores to get the max Tez container size. So in my case: 128/32 = 4 GB.



TEZ: [screenshot of the Tez configuration]

YARN: [screenshot of the YARN configuration]
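
The screenshots did not survive the page capture, so here is a hedged sketch of the kind of properties that arithmetic maps to. The keys are standard Hive-on-Tez and YARN settings, but the values below are assumptions derived only from the 128/32 = 4 GB rule above, not from the original screenshots:

<!-- hive-site.xml (Hive on Tez): max container size = RAM / cores
     = 128 GB / 32 = 4096 MB (assumed from the rule above) -->
<property>
  <name>hive.tez.container.size</name>
  <value>4096</value>
</property>

<!-- yarn-site.xml: allow YARN to grant containers up to that size -->
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>4096</value>
</property>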






answered Dec 18 '18 at 15:50 by Petro (edited Dec 18 '18 at 16:05)