Hadoop multi-node cluster too slow. How do I increase the speed of data processing?

I have a 6-node cluster: 5 DataNodes (DN) and 1 NameNode (NN). All nodes have 32 GB RAM. All slaves have an 8.7 TB HDD; the NN has a 1.1 TB HDD. Here is the link to my core-site.xml, hdfs-site.xml, and yarn-site.xml.



After running an MR job, I checked my RAM usage, which is shown below:



Namenode:

free -g
              total        used        free      shared  buff/cache   available
Mem:             31           7          15           0           8          22
Swap:            31           0          31


Datanodes:

Slave1:

free -g
              total        used        free      shared  buff/cache   available
Mem:             31           6           6           0          18          24
Swap:            31           3          28


Slave2:

              total        used        free      shared  buff/cache   available
Mem:             31           2           4           0          24          28
Swap:            31           1          30


Likewise, the other slaves show similar RAM usage. Even when only a single job is running, any additionally submitted jobs enter the ACCEPTED state and wait for the first job to finish before they start.



Here is the ps output for the JAR that I submitted to execute the MR job:



/opt/jdk1.8.0_77//bin/java -Dproc_jar -Xmx1000m 
-Dhadoop.log.dir=/home/hduser/hadoop/logs -Dyarn.log.dir=/home/hduser/hadoop/logs
-Dhadoop.log.file=yarn.log -Dyarn.log.file=yarn.log
-Dyarn.home.dir= -Dyarn.id.str= -Dhadoop.root.logger=INFO,console
-Dyarn.root.logger=INFO,console -Dyarn.policy.file=hadoop-policy.xml
-Dhadoop.log.dir=/home/hduser/hadoop/logs -Dyarn.log.dir=/home/hduser/hadoop/logs
-Dhadoop.log.file=yarn.log -Dyarn.log.file=yarn.log
-Dyarn.home.dir=/home/hduser/hadoop -Dhadoop.home.dir=/home/hduser/hadoop
-Dhadoop.root.logger=INFO,console -Dyarn.root.logger=INFO,console
-classpath --classpath of jars
org.apache.hadoop.util.RunJar abc.jar abc.mydriver2 /raw_data /mr_output/02


Are there any settings that I can change or add to allow multiple jobs to run simultaneously and speed up the current data processing? I am using Hadoop 2.5.2. The cluster is in a PROD environment and I cannot take it down to upgrade the Hadoop version.



EDIT 1: I started a new MR job with 362 GB of data, and still the RAM usage is around 8 GB with 22 GB of RAM free. Here is my job submission command:



nohup yarn jar abc.jar def.mydriver1 /raw_data /mr_output/01 &


Here is some more information:



18/11/22 14:09:07 INFO input.FileInputFormat: Total input paths to process : 130363
18/11/22 14:09:10 INFO mapreduce.JobSubmitter: number of splits:130372


Are there additional memory parameters that we can use when submitting the job to achieve more efficient memory usage?










hadoop cluster-computing yarn hadoop2

asked Nov 21 '18 at 6:09 by Rishabh Dixit (edited Nov 22 '18 at 10:18)


3 Answers




















I believe you can edit mapred-default.xml.



The parameters you are looking for are:




          • mapreduce.job.running.map.limit

          • mapreduce.job.running.reduce.limit


0 (probably what it is set to at the moment) means UNLIMITED.



Looking at your memory, 32 GB per machine seems too small.



What CPU/cores do you have? I would expect a quad CPU / 16 cores minimum per machine.
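
For concreteness, here is a minimal sketch of that suggestion as a mapred-site.xml override. The property names are real MapReduce keys, but the limit values are illustrative assumptions, and these per-job limits were added in later Hadoop 2.x releases, so they may not be available on 2.5.2:

<!-- mapred-site.xml: cap how many map/reduce tasks a single job may run
     at once, leaving containers free for other jobs (values illustrative) -->
<property>
  <name>mapreduce.job.running.map.limit</name>
  <value>100</value>   <!-- 0, the default, means unlimited -->
</property>
<property>
  <name>mapreduce.job.running.reduce.limit</name>
  <value>50</value>
</property>

Note that site-specific overrides normally belong in mapred-site.xml; mapred-default.xml ships inside the Hadoop JARs, which is why it does not appear under $HADOOP_HOME/etc/hadoop/ (see the first comment below).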






answered Nov 21 '18 at 7:15 by Tim Seed
• I do not have a mapred-default.xml in $HADOOP_HOME/etc/hadoop/. If my memory is too small per machine, then why does the free command show a lot of free memory? Here is the CPU used on each machine: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, and the nproc command gives me 24 cores per machine.

            – Rishabh Dixit
            Nov 21 '18 at 9:09













• How long do your MapReduce jobs take? I was running 30 DNs, each with 128 GB of memory, and we dropped MapReduce for Spark/HBase due to speed. Have you tried using Hive? Are the results the same? Phoenix (from Hortonworks) was also good.

            – Tim Seed
            Nov 21 '18 at 9:23











• Yes, I am using Hive 1.2.1 stable. After submitting a long query I get similar output:
  Launching Job 7 out of 21
  Number of reduce tasks determined at compile time: 1
  In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number>
  In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number>
  In order to set a constant number of reducers: set mapreduce.job.reduces=<number>
  Hadoop job information for Stage-14: number of mappers: 380; number of reducers: 487

            – Rishabh Dixit
            Nov 21 '18 at 9:28











• I have an estimate of MR job duration based on input data size: 300 GB of input data takes around 24 hours to process. And the free command shows similar memory consumption even when Hive queries with the above-mentioned mapper and reducer counts are running.

            – Rishabh Dixit
            Nov 21 '18 at 9:30





















Based on your yarn-site.xml, your yarn.scheduler.minimum-allocation-mb setting of 10240 is too high. This effectively means you have, at best, 18 vcores available. This might be the right setting for a cluster where you have tons of memory, but for 32 GB it's way too large. Drop it to 1 or 2 GB.



Remember, an HDFS block is what each mapper typically consumes. So 1-2 GB of memory for 128 MB of data sounds more reasonable. The added benefit is that you could have up to 180 vcores available, which will process jobs 10x faster than 18 vcores.
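
As a sketch of the suggested change (only the 10240 figure comes from the question's yarn-site.xml; the replacement value is the 1 GB suggested above): with roughly 30 GB of NodeManager memory per slave, a 10240 MB minimum allocation fits only about 3 containers per node, while 1024 MB fits about 30, which is where the roughly 10x figure comes from.

<!-- yarn-site.xml: shrink the smallest container YARN will grant.
     Actual container counts also depend on yarn.nodemanager.resource.memory-mb. -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>

As noted in the comments below, the YARN services must be restarted for yarn-site.xml changes to take effect.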






answered Nov 21 '18 at 17:01 by tk421
• Do I need to restart the cluster every time I make changes in the YARN config files, or will just changing the XML file make my next job run with the new value of yarn.scheduler.minimum-allocation-mb? I have made it 1 GB, which sounds reasonable. I am yet to execute the job with the new settings. I will let you know how this fares. Thank you for your time!

            – Rishabh Dixit
            Nov 22 '18 at 7:29











• I started a new MR job with 362 GB of data and still the RAM usage is around 8 GB, with 22 GB of RAM free. Here is my job submission command: nohup yarn jar abc.jar def.mydriver1 /raw_data /mr_output/01 &. Here is some more information:
  18/11/22 14:09:07 INFO input.FileInputFormat: Total input paths to process : 130363
  18/11/22 14:09:10 INFO mapreduce.JobSubmitter: number of splits:130372
  Are there additional memory parameters that we can use when submitting the job to achieve more efficient memory usage?

            – Rishabh Dixit
            Nov 22 '18 at 10:11













• You need to restart all the YARN services when you change yarn-site.xml. You should look at the YARN UI (RM_URL/cluster/apps) to see what your usage is. There are probably more settings you need to tune.

            – tk421
            Nov 22 '18 at 17:43



















To give you an idea of how a 4-node cluster with 32 cores and 128 GB RAM per node is set up:



For Tez: divide RAM by cores to get the max Tez container size. So in my case: 128/32 = 4 GB.



TEZ: [screenshot of the Tez configuration]

YARN: [screenshot of the YARN configuration]
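
The screenshots did not survive the page capture, so here is a hedged sketch of the kind of properties that arithmetic maps to. The keys are standard Hive-on-Tez and YARN settings, but the values below are assumptions derived only from the 128/32 = 4 GB rule above, not from the original screenshots:

<!-- hive-site.xml (Hive on Tez): max container size = RAM / cores
     = 128 GB / 32 = 4096 MB (assumed from the rule above) -->
<property>
  <name>hive.tez.container.size</name>
  <value>4096</value>
</property>

<!-- yarn-site.xml: allow YARN to grant containers up to that size -->
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>4096</value>
</property>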






answered Dec 18 '18 at 15:50 by Petro (edited Dec 18 '18 at 16:05)