Writing a Java RDD to multiple folders with varying schemas




















If I have a Spark DataFrame then, when writing, I can very easily partition the data across multiple folders using the partitionBy() method of the DataFrameWriter.



But my starting point is actually a JavaRDD. It is straightforward to convert a JavaRDD to a DataFrame using the createDataFrame(JavaRDD<T> rdd, Class<T> beanClass) method of the SparkSession class.



For this to work, beanClass must be a Java bean class. So far so good; this works for me.
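
For context, the write I am doing at the moment looks roughly like this (spark, rdd, MyBean and the output path are placeholders; the real names differ):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// Convert the JavaRDD of beans to a DataFrame via the bean class...
Dataset<Row> df = spark.createDataFrame(rdd, MyBean.class);

// ...and write it out, partitioned into folders by the two partition columns.
df.write()
  .mode(SaveMode.Overwrite)
  .partitionBy("partitionFieldOne", "partitionFieldTwo")
  .parquet("/some/base/path");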



Now, my bean class has the following fields, with these example values:



String partitionFieldOne = "a";
String partitionFieldTwo = "b";
String id = "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa";
String data = "field1: value1, field2: value2, field3: value3, ...";
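
For completeness, the bean itself is nothing more than a plain class along these lines (trimmed; the real field names differ). The getters, setters and no-argument constructor are there because the bean-based createDataFrame relies on them:

// Trimmed illustration of the bean; Serializable because the objects live in a JavaRDD.
public class MyBean implements java.io.Serializable {
    private String partitionFieldOne;
    private String partitionFieldTwo;
    private String id;
    private String data;

    public MyBean() {}

    public String getPartitionFieldOne() { return partitionFieldOne; }
    public void setPartitionFieldOne(String v) { partitionFieldOne = v; }
    public String getPartitionFieldTwo() { return partitionFieldTwo; }
    public void setPartitionFieldTwo(String v) { partitionFieldTwo = v; }
    public String getId() { return id; }
    public void setId(String v) { id = v; }
    public String getData() { return data; }
    public void setData(String v) { data = v; }
}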


Now, the issue I am having is as follows: when writing in this partitioned fashion, all bean objects written to the same folder have the same set of fields in the data String (field1a, field2a, field3a, etc.), but a bean object written to a different folder will have a different set of fields in its data String (field1b, field2b, field3b, etc.).



This means that I could save considerable disk space if I were able to include these field names as part of a schema. At present, all data points written to one file (I'm writing to Parquet, but I'm sure the question holds for other formats) have the same set of fields, so half of the information in the data String (the field names) is repeated for every data point.



The names of these fields (they are actually longs representing timestamps) only become known at run time, and they will evolve over time, so I cannot create a family of Java bean classes, one for each set of fields that might exist.
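
As far as I can tell, a schema can at least be assembled at run time with StructType instead of a bean class, along these lines (discoverFieldNames() is a placeholder for however I end up obtaining the names):

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Build a schema from field names that are only known at run time.
List<String> runtimeFieldNames = discoverFieldNames(); // placeholder
List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("id", DataTypes.StringType, false));
for (String name : runtimeFieldNames) {
    fields.add(DataTypes.createStructField(name, DataTypes.StringType, true));
}
StructType schema = DataTypes.createStructType(fields);

But as far as I understand it, a single DataFrame still has exactly one schema, which is what leads to the question below.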



What I want to know is whether it is possible to write in a partitioned fashion while having a different schema for each individual partition.



I know that I could partition my RDD (in this paragraph I am using the word partition in the usual Spark sense, where I control which data points of the RDD end up on which nodes), and then use the foreachPartition() method of JavaRDD to write each file into each folder separately, but I would be interested to know whether the DataFrameWriter allows varying schemas for different write partitions.
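
To make that fallback concrete, this is the sort of manual loop I have in mind (sketch only, untested; it filters per key rather than literally using foreachPartition, and schemaFor(), rowFor(), spark, rdd, MyBean and basePath are placeholders):

import java.util.List;
import scala.Tuple2;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructType;

// Collect the distinct (partitionFieldOne, partitionFieldTwo) pairs...
List<Tuple2<String, String>> keys = rdd
    .map(b -> new Tuple2<>(b.getPartitionFieldOne(), b.getPartitionFieldTwo()))
    .distinct()
    .collect();

// ...and write each group separately, each with its own run-time schema.
for (Tuple2<String, String> key : keys) {
    JavaRDD<MyBean> group = rdd.filter(b ->
        b.getPartitionFieldOne().equals(key._1()) && b.getPartitionFieldTwo().equals(key._2()));

    StructType schema = schemaFor(group);                   // placeholder: schema built from this group's field names
    JavaRDD<Row> rows = group.map(b -> rowFor(b, schema));  // placeholder: split the data String into Row values

    spark.createDataFrame(rows, schema)
         .write()
         .parquet(basePath + "/partitionFieldOne=" + key._1() + "/partitionFieldTwo=" + key._2());
}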



Please let me know if I have explained this badly.










      apache-spark rdd






asked Jan 3 at 16:23 by SiLaf























