Spark compression when writing to external Hive table




















I'm inserting into an external Hive Parquet table from Spark 2.1 (using df.write.insertInto(...)). By setting e.g.

spark.sql("SET spark.sql.parquet.compression.codec=GZIP")

I can switch between SNAPPY, GZIP and uncompressed. I can verify that the file size (and the filename ending) is influenced by these settings. I get a file named e.g.

part-00000-5efbfc08-66fe-4fd1-bebb-944b34689e70.gz.parquet
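
In the non-partitioned case, the write is essentially the following (a minimal sketch; test_unpartitioned is a hypothetical table name, df and the spark session are assumed):

spark.sql("SET spark.sql.parquet.compression.codec=GZIP")
df.write.insertInto("test_unpartitioned")  // output file ends in .gz.parquet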




However, if I work with a partitioned Hive table, this setting has no effect: the file size is always the same. In addition, the filename is always

part-00000
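
The partitioned insert that produces this looks like (a sketch; insertInto resolves columns by position, so the partition column goes last):

spark.sql("SET spark.sql.parquet.compression.codec=GZIP")
df.select("const", "x", "year").write.insertInto("test")
// file is named part-00000; its size does not change with the codec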




Now, how can I change (or at least verify) the compression codec of the Parquet files in the partitioned case?

My table is:



CREATE EXTERNAL TABLE `test`(`const` string, `x` int)
PARTITIONED BY (`year` int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
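
On the "verify" part: a Parquet file records its compression codec per column chunk in the footer, so it can be inspected regardless of the filename, e.g. with parquet-tools meta <file>, or programmatically. A sketch using the parquet-hadoop API bundled with Spark (the file path is a hypothetical placeholder):

import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader

// Read one part file's footer and print the codec of every column chunk.
val meta = ParquetFileReader.readFooter(
  new Configuration(), new Path("hdfs:///path/to/test/year=2018/part-00000"))
for (block <- meta.getBlocks.asScala; col <- block.getColumns.asScala)
  println(s"${col.getPath}: ${col.getCodec}")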









apache-spark hive parquet






asked Jan 3 at 14:03 by Raphael Roth, edited Jan 10 at 8:18
























1 Answer
































As you create an external table, I would proceed like this:

First, write your Parquet dataset with the required compression:

df.write
  .partitionBy("year")
  .option("compression", "<gzip|snappy|none>")
  .parquet("<parquet_file_path>")

You can check the codec as before via the file extension. Then you can create your external table as follows:



          CREATE EXTERNAL TABLE `test`(`const` string, `x` int)
          PARTITIONED BY (`year` int)
          STORED AS PARQUET
          LOCATION '<parquet_file_path>';


If the external table already exists in Hive, you just need to run the following to refresh it:



          MSCK REPAIR TABLE test;





answered Jan 15 at 15:12 by Nonontb
























          • I was asking about inserting into an existing table

            – Raphael Roth
            Jan 15 at 17:09











• Using SaveMode.Append would add new files with the new data, and MSCK REPAIR TABLE will refresh the table. But OK, I don't use the insertInto method

            – Nonontb
            Jan 15 at 20:42
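
A sketch of the append flow this comment describes (SaveMode.Append plus a metastore refresh; the path placeholder and table name are taken from the answer, df and spark are assumed):

import org.apache.spark.sql.SaveMode

// Write new partition data as plain Parquet files under the table location,
// then let Hive discover any new partition directories.
df.write
  .mode(SaveMode.Append)
  .partitionBy("year")
  .option("compression", "gzip")
  .parquet("<parquet_file_path>")

// Requires Hive support in the session; can also be run directly in Hive.
spark.sql("MSCK REPAIR TABLE test")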











