In HiveQL, what is the most elegant/performant way of calculating an average value if some of the data is implicit?



























In HiveQL, what is the most elegant and performant way of calculating an average value when there are gaps in the data, with implicit repeated values between them? For example, consider a table with the following data:



+----------+----------+---------+
| Employee | Date     | Balance |
+----------+----------+---------+
| John     | 20181029 | 1800.2  |
| John     | 20181105 | 2937.74 |
| John     | 20181106 | 3000    |
| John     | 20181110 | 1500    |
| John     | 20181119 | -755.5  |
| John     | 20181120 | -800    |
| John     | 20181121 | 1200    |
| John     | 20181122 | -400    |
| John     | 20181123 | -900    |
| John     | 20181202 | -1300   |
+----------+----------+---------+


If I calculate a simple average over the November rows, I get ~722.78, but the average should take into account that the days not shown carry the same balance as the previous record. In the data above, for example, John had 1800.2 from 20181101 through 20181104.



Assuming that the table always has exactly one row per date/balance change, and given that I cannot change how this data is stored (and probably shouldn't, since writing rows for days with unchanged balances would waste storage), I've been tinkering with computing the average from a select with subqueries over all the days in the queried month, returning NULL for the absent days and then using a case expression to pick up the balance from the previous available date in reverse order, all to avoid writing temporary tables.










hadoop hiveql






asked Jan 2 at 19:45 by lolzorsp
1 Answer






Step 1: Original Data

The first step is to recreate a table with the original data. Let's say the original table is called daily_employee_balance.

use default;
drop table if exists daily_employee_balance;
create table if not exists daily_employee_balance (
  employee_id string,
  employee string,
  iso_date date,
  balance double
);

Insert sample data into the original table daily_employee_balance:



insert into table daily_employee_balance values
('103','John','2018-10-25',1800.2),
('103','John','2018-10-29',1125.7),
('103','John','2018-11-05',2937.74),
('103','John','2018-11-06',3000),
('103','John','2018-11-10',1500),
('103','John','2018-11-19',-755.5),
('103','John','2018-11-20',-800),
('103','John','2018-11-21',1200),
('103','John','2018-11-22',-400),
('103','John','2018-11-23',-900),
('103','John','2018-12-02',-1300);


Step 2: Dimension Table

You will need a dimension table holding a calendar (a table with all possible dates); call it dimension_date. Keeping a calendar table is a common industry standard, and you can probably download sample data for one from the internet.

use default;
drop table if exists dimension_date;
create external table dimension_date(
  date_id int,
  iso_date string,
  year string,
  month string,
  month_desc string,
  end_of_month_flg string
);

Insert some sample data for the entire month of November 2018:



insert into table dimension_date values
(6880,'2018-11-01','2018','2018-11','November','N'),
(6881,'2018-11-02','2018','2018-11','November','N'),
(6882,'2018-11-03','2018','2018-11','November','N'),
(6883,'2018-11-04','2018','2018-11','November','N'),
(6884,'2018-11-05','2018','2018-11','November','N'),
(6885,'2018-11-06','2018','2018-11','November','N'),
(6886,'2018-11-07','2018','2018-11','November','N'),
(6887,'2018-11-08','2018','2018-11','November','N'),
(6888,'2018-11-09','2018','2018-11','November','N'),
(6889,'2018-11-10','2018','2018-11','November','N'),
(6890,'2018-11-11','2018','2018-11','November','N'),
(6891,'2018-11-12','2018','2018-11','November','N'),
(6892,'2018-11-13','2018','2018-11','November','N'),
(6893,'2018-11-14','2018','2018-11','November','N'),
(6894,'2018-11-15','2018','2018-11','November','N'),
(6895,'2018-11-16','2018','2018-11','November','N'),
(6896,'2018-11-17','2018','2018-11','November','N'),
(6897,'2018-11-18','2018','2018-11','November','N'),
(6898,'2018-11-19','2018','2018-11','November','N'),
(6899,'2018-11-20','2018','2018-11','November','N'),
(6900,'2018-11-21','2018','2018-11','November','N'),
(6901,'2018-11-22','2018','2018-11','November','N'),
(6902,'2018-11-23','2018','2018-11','November','N'),
(6903,'2018-11-24','2018','2018-11','November','N'),
(6904,'2018-11-25','2018','2018-11','November','N'),
(6905,'2018-11-26','2018','2018-11','November','N'),
(6906,'2018-11-27','2018','2018-11','November','N'),
(6907,'2018-11-28','2018','2018-11','November','N'),
(6908,'2018-11-29','2018','2018-11','November','N'),
(6909,'2018-11-30','2018','2018-11','November','Y');
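
As an aside, the calendar rows do not have to be typed by hand: a month of dates can be generated in Hive itself. Below is a minimal sketch using the posexplode trick; the hard-coded date_id offset and November bounds simply mirror the sample rows above, and the FROM-less subquery assumes Hive 0.13 or later.

-- split(space(29), ' ') yields an array of 30 empty strings; posexplode
-- numbers them 0..29, and date_add turns each offset into a date
insert into table dimension_date
select
  6880 + pe.pos as date_id,
  cast(date_add('2018-11-01', pe.pos) as string) as iso_date,
  '2018' as year,
  '2018-11' as month,
  'November' as month_desc,
  case when pe.pos = 29 then 'Y' else 'N' end as end_of_month_flg
from (select 1 as dummy) t  -- any one-row source works here
lateral view posexplode(split(space(29), ' ')) pe as pos, val;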


Step 3: Fact Table

Create a fact table from the original table. In normal practice, you ingest the data into HDFS/Hive, then process the raw data and build a table of historical data that you keep inserting into incrementally. You can look into data warehousing for the proper definitions, but I will call this a fact table: f_employee_balance.

This re-creates the original table with the missing dates filled in, populating each missing balance with the earlier known balance. Note that because the calendar here covers only November, dates before the first November record (2018-11-01 through 2018-11-04) fall back to a balance of 0; to carry the last October balance forward instead, as the question intends, the calendar would need to reach back to the previous record.



-- inner query builds all possible (employee, date) combinations;
-- the outer left join fills in the missing dates and balances
drop table if exists f_employee_balance;
create table f_employee_balance
stored as orc tblproperties ("orc.compress"="SNAPPY") as
select
  q1.employee_id,
  q1.iso_date,
  nvl(last_value(r.balance, true)  -- dates before the first record get balance 0
        over (partition by q1.employee_id
              order by q1.iso_date
              rows between unbounded preceding and current row),
      0) as balance,
  month,
  year
from (
  select distinct
    r.employee_id,
    d.iso_date as iso_date,
    d.month,
    d.year
  from daily_employee_balance r, dimension_date d
) q1
left outer join daily_employee_balance r
  on (q1.employee_id = r.employee_id) and (q1.iso_date = r.iso_date);
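
To make the carry-forward concrete, the November rows of f_employee_balance for John work out as follows (hand-derived from the sample data above):

2018-11-01 to 2018-11-04  ->     0.00  (no earlier record within the calendar)
2018-11-05                ->  2937.74
2018-11-06 to 2018-11-09  ->  3000.00  (11-07 to 11-09 carried from 11-06)
2018-11-10 to 2018-11-18  ->  1500.00  (11-11 to 11-18 carried from 11-10)
2018-11-19 to 2018-11-23  ->  the recorded daily values
2018-11-24 to 2018-11-30  ->  -900.00  (carried from 11-23)

These 30 values sum to 20482.24, so the November average comes out to about 682.74.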


Step 4: Analytics

The query below will give you the true average by month:



select employee_id, monthly_avg, month, year
from (
  select
    employee_id,
    row_number() over (partition by employee_id, year, month) as row_num,
    avg(balance) over (partition by employee_id, year, month) as monthly_avg,
    month,
    year
  from f_employee_balance
) q1
where row_num = 1
order by year, month;


Step 5: Conclusion

You could have combined steps 3 and 4, which would save you from creating the extra table; a sketch of that combined form follows. In the big data world, you don't worry much about spending extra disk space or development time: you can easily add another disk or node and automate the process using workflows. For more information, look into data warehousing concepts and Hive analytical queries.
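
For reference, here is a minimal sketch of that combined form, folding the step 3 forward-fill in as a subquery of the step 4 average and using a plain group by in place of the row_number trick:

-- forward-fill in a subquery, then average per employee and month,
-- with no intermediate f_employee_balance table
select employee_id, avg(balance) as monthly_avg, month, year
from (
  select
    q1.employee_id, q1.iso_date, q1.month, q1.year,
    nvl(last_value(r.balance, true)
          over (partition by q1.employee_id
                order by q1.iso_date
                rows between unbounded preceding and current row),
        0) as balance
  from (
    select distinct r.employee_id, d.iso_date, d.month, d.year
    from daily_employee_balance r, dimension_date d
  ) q1
  left outer join daily_employee_balance r
    on q1.employee_id = r.employee_id and q1.iso_date = r.iso_date
) filled
group by employee_id, month, year
order by year, month;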






answered Jan 3 at 19:10 by gkc123