USQL Files vs. Managed Tables - how data is stored physically?





.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ height:90px;width:728px;box-sizing:border-box;
}







3















I am quite new to ADL and USQL. I went through quite a lot of documentation and presentations but I am afraid that I am still lacking many answers.



Simplifying a bit, I have a large set of data (with daily increments) but it contains information about many different clients in one file. In most cases the data will be analysed for one client (one report = one client), but I would like to keep the possibility to do a cross-client analysis (much less common scenario). I am aware of the importance of correctly partitioning this data (probably keeping one client data together makes most sense). I was looking into two scenarios:




  1. I will partition the data myself by splitting the files into folder-file structure where I can have full control over how it is done, how big the files are etc.

  2. I will use managed tables and set up table partitioning


I am now looking into pros and cons of both scenarios. Some things that come to my mind are:




  1. The ability to compress data in scenario #1 (at the cost of performance of course)

  2. The ability to build a much more granular security model by using files and ADL security (e.g. give access only to one client's data)

  3. On the other hand, using the tables is much more comfortable as I will be dealing with just one data source and will not have to worry about extracting correct files, only about the correct filters in a query - in theory USQL should do the rest

  4. I would expect that the tables will offer better performance


One very important factor I wanted to investigate, before making my decision, is how the data is stored physically when using tables and partitions. I have read the documentation and I found a statement that confused me (https://docs.microsoft.com/en-us/u-sql/ddl/tables):



First we can read that:



"U-SQL tables are backed by files. Each table partition is mapped to its own file" - this seems to make perfect sense. I would assume that if I set up partitioning by client I would end up with the same scenario as doing the partitioning myself. Fantastic! U-SQL will do all the work for me! Or.. will it not?



Later we can read that:



"..., and each INSERT statement adds an additional file (unless a table is rebuilt with ALTER TABLE REBUILD)."



Now this makes things more complicated. If I read it correctly, this means that if I will never rebuild a table I will have my data stored physically in exactly the same way as the original raw files and thus experience bad performance. I did some experiments and it seemed to work this way. Unfortunately, I was not able to match the files with partitions as the guids were different (the .ss files in the store had different guids than partitions in usql views) so this is just my guess.



Therefore I have several questions:




  1. Is there some documentation explaining in more detail how TABLE REBUILD works?

  2. What is the performance of TABLE REBUILD? Will it work better than my idea of appending (extract -> union all -> output) just the files that need to be appended?

  3. How can I monitor the size of my partitions? In my case (running local, have not checked it online yet) the guids of files and partitions in the store do not match even after REBUILD (they do for the DB, schema and table)

  4. Is there any documentation explaining in more details how .ss files are created?

  5. Which of the scenarios would you take and why?


Many thanks for your help,



Jakub



EDIT: I did some more tests and it only made it more intriguing.




  1. I took a sample of 7 days of data

  2. I‌ created a table partitioned by date

  3. I created 8 partitions - one for each day + one default

  4. I imported data from the 7 days - as a result, in the catalogue I got 8 files corresponding (probably) to my partitions

  5. I imported the same file once a gain - as a result, in the catalogue I got 16 files (1 per partition per import - the sizes of the files matched exactly)

  6. To be tripple sure I did it once again and got 24 files (again 1 per partition per import, sizes match)

  7. I did the TABLE REBUILD - ended up again with 8 files (8 partitions) - makes sense

  8. I imported the file once again - ended up having 16 files (sizes don't match so I guess I got 8 files for the partition and 8 files for the import - 1 per partition)

  9. I did the TABLE REBUILD - edned up again with 8 files - size still growing - still makes sense, but... this is where it gets funny

  10. I then imported another file containing only 2 days of data
    I ended up with... nope, you didn't guess! - 16 files. So I got 8 files with the large partitions, 2 larger files with new import for 2 days and 6 very small files

  11. Being even more intrigued I ran the TABLE REBUILD

  12. I endeed up with 8 files (for each partitions) but... they were all just recently modified


Conclusion? If I am not mistaken, this looks like the rebuild will actually touch all my files no matter what I just inserted. If this is the case, it means that the whole scenario will become more and more expensive over time as the data grows. Is there anyone who could please explain I am wrong?










share|improve this question































    3















    I am quite new to ADL and USQL. I went through quite a lot of documentation and presentations but I am afraid that I am still lacking many answers.



    Simplifying a bit, I have a large set of data (with daily increments) but it contains information about many different clients in one file. In most cases the data will be analysed for one client (one report = one client), but I would like to keep the possibility to do a cross-client analysis (much less common scenario). I am aware of the importance of correctly partitioning this data (probably keeping one client data together makes most sense). I was looking into two scenarios:




    1. I will partition the data myself by splitting the files into folder-file structure where I can have full control over how it is done, how big the files are etc.

    2. I will use managed tables and set up table partitioning


    I am now looking into pros and cons of both scenarios. Some things that come to my mind are:




    1. The ability to compress data in scenario #1 (at the cost of performance of course)

    2. The ability to build a much more granular security model by using files and ADL security (e.g. give access only to one client's data)

    3. On the other hand, using the tables is much more comfortable as I will be dealing with just one data source and will not have to worry about extracting correct files, only about the correct filters in a query - in theory USQL should do the rest

    4. I would expect that the tables will offer better performance


    One very important factor I wanted to investigate, before making my decision, is how the data is stored physically when using tables and partitions. I have read the documentation and I found a statement that confused me (https://docs.microsoft.com/en-us/u-sql/ddl/tables):



    First we can read that:



    "U-SQL tables are backed by files. Each table partition is mapped to its own file" - this seems to make perfect sense. I would assume that if I set up partitioning by client I would end up with the same scenario as doing the partitioning myself. Fantastic! U-SQL will do all the work for me! Or.. will it not?



    Later we can read that:



    "..., and each INSERT statement adds an additional file (unless a table is rebuilt with ALTER TABLE REBUILD)."



    Now this makes things more complicated. If I read it correctly, this means that if I will never rebuild a table I will have my data stored physically in exactly the same way as the original raw files and thus experience bad performance. I did some experiments and it seemed to work this way. Unfortunately, I was not able to match the files with partitions as the guids were different (the .ss files in the store had different guids than partitions in usql views) so this is just my guess.



    Therefore I have several questions:




    1. Is there some documentation explaining in more detail how TABLE REBUILD works?

    2. What is the performance of TABLE REBUILD? Will it work better than my idea of appending (extract -> union all -> output) just the files that need to be appended?

    3. How can I monitor the size of my partitions? In my case (running local, have not checked it online yet) the guids of files and partitions in the store do not match even after REBUILD (they do for the DB, schema and table)

    4. Is there any documentation explaining in more details how .ss files are created?

    5. Which of the scenarios would you take and why?


    Many thanks for your help,



    Jakub



    EDIT: I did some more tests and it only made it more intriguing.




    1. I took a sample of 7 days of data

    2. I‌ created a table partitioned by date

    3. I created 8 partitions - one for each day + one default

    4. I imported data from the 7 days - as a result, in the catalogue I got 8 files corresponding (probably) to my partitions

    5. I imported the same file once a gain - as a result, in the catalogue I got 16 files (1 per partition per import - the sizes of the files matched exactly)

    6. To be tripple sure I did it once again and got 24 files (again 1 per partition per import, sizes match)

    7. I did the TABLE REBUILD - ended up again with 8 files (8 partitions) - makes sense

    8. I imported the file once again - ended up having 16 files (sizes don't match so I guess I got 8 files for the partition and 8 files for the import - 1 per partition)

    9. I did the TABLE REBUILD - edned up again with 8 files - size still growing - still makes sense, but... this is where it gets funny

    10. I then imported another file containing only 2 days of data
      I ended up with... nope, you didn't guess! - 16 files. So I got 8 files with the large partitions, 2 larger files with new import for 2 days and 6 very small files

    11. Being even more intrigued I ran the TABLE REBUILD

    12. I endeed up with 8 files (for each partitions) but... they were all just recently modified


    Conclusion? If I am not mistaken, this looks like the rebuild will actually touch all my files no matter what I just inserted. If this is the case, it means that the whole scenario will become more and more expensive over time as the data grows. Is there anyone who could please explain I am wrong?










    share|improve this question



























      3












      3








      3


      1






      I am quite new to ADL and USQL. I went through quite a lot of documentation and presentations but I am afraid that I am still lacking many answers.



      Simplifying a bit, I have a large set of data (with daily increments) but it contains information about many different clients in one file. In most cases the data will be analysed for one client (one report = one client), but I would like to keep the possibility to do a cross-client analysis (much less common scenario). I am aware of the importance of correctly partitioning this data (probably keeping one client data together makes most sense). I was looking into two scenarios:




      1. I will partition the data myself by splitting the files into folder-file structure where I can have full control over how it is done, how big the files are etc.

      2. I will use managed tables and set up table partitioning


      I am now looking into pros and cons of both scenarios. Some things that come to my mind are:




      1. The ability to compress data in scenario #1 (at the cost of performance of course)

      2. The ability to build a much more granular security model by using files and ADL security (e.g. give access only to one client's data)

      3. On the other hand, using the tables is much more comfortable as I will be dealing with just one data source and will not have to worry about extracting correct files, only about the correct filters in a query - in theory USQL should do the rest

      4. I would expect that the tables will offer better performance


      One very important factor I wanted to investigate, before making my decision, is how the data is stored physically when using tables and partitions. I have read the documentation and I found a statement that confused me (https://docs.microsoft.com/en-us/u-sql/ddl/tables):



      First we can read that:



      "U-SQL tables are backed by files. Each table partition is mapped to its own file" - this seems to make perfect sense. I would assume that if I set up partitioning by client I would end up with the same scenario as doing the partitioning myself. Fantastic! U-SQL will do all the work for me! Or.. will it not?



      Later we can read that:



      "..., and each INSERT statement adds an additional file (unless a table is rebuilt with ALTER TABLE REBUILD)."



      Now this makes things more complicated. If I read it correctly, this means that if I will never rebuild a table I will have my data stored physically in exactly the same way as the original raw files and thus experience bad performance. I did some experiments and it seemed to work this way. Unfortunately, I was not able to match the files with partitions as the guids were different (the .ss files in the store had different guids than partitions in usql views) so this is just my guess.



      Therefore I have several questions:




      1. Is there some documentation explaining in more detail how TABLE REBUILD works?

      2. What is the performance of TABLE REBUILD? Will it work better than my idea of appending (extract -> union all -> output) just the files that need to be appended?

      3. How can I monitor the size of my partitions? In my case (running local, have not checked it online yet) the guids of files and partitions in the store do not match even after REBUILD (they do for the DB, schema and table)

      4. Is there any documentation explaining in more details how .ss files are created?

      5. Which of the scenarios would you take and why?


      Many thanks for your help,



      Jakub



      EDIT: I did some more tests and it only made it more intriguing.




      1. I took a sample of 7 days of data

      2. I‌ created a table partitioned by date

      3. I created 8 partitions - one for each day + one default

      4. I imported data from the 7 days - as a result, in the catalogue I got 8 files corresponding (probably) to my partitions

      5. I imported the same file once a gain - as a result, in the catalogue I got 16 files (1 per partition per import - the sizes of the files matched exactly)

      6. To be tripple sure I did it once again and got 24 files (again 1 per partition per import, sizes match)

      7. I did the TABLE REBUILD - ended up again with 8 files (8 partitions) - makes sense

      8. I imported the file once again - ended up having 16 files (sizes don't match so I guess I got 8 files for the partition and 8 files for the import - 1 per partition)

      9. I did the TABLE REBUILD - edned up again with 8 files - size still growing - still makes sense, but... this is where it gets funny

      10. I then imported another file containing only 2 days of data
        I ended up with... nope, you didn't guess! - 16 files. So I got 8 files with the large partitions, 2 larger files with new import for 2 days and 6 very small files

      11. Being even more intrigued I ran the TABLE REBUILD

      12. I endeed up with 8 files (for each partitions) but... they were all just recently modified


      Conclusion? If I am not mistaken, this looks like the rebuild will actually touch all my files no matter what I just inserted. If this is the case, it means that the whole scenario will become more and more expensive over time as the data grows. Is there anyone who could please explain I am wrong?










      share|improve this question
















      I am quite new to ADL and USQL. I went through quite a lot of documentation and presentations but I am afraid that I am still lacking many answers.



      Simplifying a bit, I have a large set of data (with daily increments) but it contains information about many different clients in one file. In most cases the data will be analysed for one client (one report = one client), but I would like to keep the possibility to do a cross-client analysis (much less common scenario). I am aware of the importance of correctly partitioning this data (probably keeping one client data together makes most sense). I was looking into two scenarios:




      1. I will partition the data myself by splitting the files into folder-file structure where I can have full control over how it is done, how big the files are etc.

      2. I will use managed tables and set up table partitioning


      I am now looking into pros and cons of both scenarios. Some things that come to my mind are:




      1. The ability to compress data in scenario #1 (at the cost of performance of course)

      2. The ability to build a much more granular security model by using files and ADL security (e.g. give access only to one client's data)

      3. On the other hand, using the tables is much more comfortable as I will be dealing with just one data source and will not have to worry about extracting correct files, only about the correct filters in a query - in theory USQL should do the rest

      4. I would expect that the tables will offer better performance


      One very important factor I wanted to investigate, before making my decision, is how the data is stored physically when using tables and partitions. I have read the documentation and I found a statement that confused me (https://docs.microsoft.com/en-us/u-sql/ddl/tables):



      First we can read that:



      "U-SQL tables are backed by files. Each table partition is mapped to its own file" - this seems to make perfect sense. I would assume that if I set up partitioning by client I would end up with the same scenario as doing the partitioning myself. Fantastic! U-SQL will do all the work for me! Or.. will it not?



      Later we can read that:



      "..., and each INSERT statement adds an additional file (unless a table is rebuilt with ALTER TABLE REBUILD)."



      Now this makes things more complicated. If I read it correctly, this means that if I will never rebuild a table I will have my data stored physically in exactly the same way as the original raw files and thus experience bad performance. I did some experiments and it seemed to work this way. Unfortunately, I was not able to match the files with partitions as the guids were different (the .ss files in the store had different guids than partitions in usql views) so this is just my guess.



      Therefore I have several questions:




      1. Is there some documentation explaining in more detail how TABLE REBUILD works?

      2. What is the performance of TABLE REBUILD? Will it work better than my idea of appending (extract -> union all -> output) just the files that need to be appended?

      3. How can I monitor the size of my partitions? In my case (running local, have not checked it online yet) the guids of files and partitions in the store do not match even after REBUILD (they do for the DB, schema and table)

      4. Is there any documentation explaining in more details how .ss files are created?

      5. Which of the scenarios would you take and why?


      Many thanks for your help,



      Jakub



      EDIT: I did some more tests and it only made it more intriguing.




      1. I took a sample of 7 days of data

      2. I‌ created a table partitioned by date

      3. I created 8 partitions - one for each day + one default

      4. I imported data from the 7 days - as a result, in the catalogue I got 8 files corresponding (probably) to my partitions

      5. I imported the same file once a gain - as a result, in the catalogue I got 16 files (1 per partition per import - the sizes of the files matched exactly)

      6. To be tripple sure I did it once again and got 24 files (again 1 per partition per import, sizes match)

      7. I did the TABLE REBUILD - ended up again with 8 files (8 partitions) - makes sense

      8. I imported the file once again - ended up having 16 files (sizes don't match so I guess I got 8 files for the partition and 8 files for the import - 1 per partition)

      9. I did the TABLE REBUILD - edned up again with 8 files - size still growing - still makes sense, but... this is where it gets funny

      10. I then imported another file containing only 2 days of data
        I ended up with... nope, you didn't guess! - 16 files. So I got 8 files with the large partitions, 2 larger files with new import for 2 days and 6 very small files

      11. Being even more intrigued I ran the TABLE REBUILD

      12. I endeed up with 8 files (for each partitions) but... they were all just recently modified


      Conclusion? If I am not mistaken, this looks like the rebuild will actually touch all my files no matter what I just inserted. If this is the case, it means that the whole scenario will become more and more expensive over time as the data grows. Is there anyone who could please explain I am wrong?







      azure-data-lake u-sql






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Jan 3 at 14:40







      Jakub Krupa

















      asked Jan 3 at 13:02









      Jakub KrupaJakub Krupa

      185




      185
























          1 Answer
          1






          active

          oldest

          votes


















          1














          Microsoft have recently released a whitepaper called "U-SQL Performance Optimization" which you should read. It includes detailed notes on distribution, hashing v round-robin and partitioning.






          share|improve this answer
























            Your Answer






            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "1"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54022849%2fusql-files-vs-managed-tables-how-data-is-stored-physically%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            1 Answer
            1






            active

            oldest

            votes








            1 Answer
            1






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            1














            Microsoft have recently released a whitepaper called "U-SQL Performance Optimization" which you should read. It includes detailed notes on distribution, hashing v round-robin and partitioning.






            share|improve this answer




























              1














              Microsoft have recently released a whitepaper called "U-SQL Performance Optimization" which you should read. It includes detailed notes on distribution, hashing v round-robin and partitioning.






              share|improve this answer


























                1












                1








                1







                Microsoft have recently released a whitepaper called "U-SQL Performance Optimization" which you should read. It includes detailed notes on distribution, hashing v round-robin and partitioning.






                share|improve this answer













                Microsoft have recently released a whitepaper called "U-SQL Performance Optimization" which you should read. It includes detailed notes on distribution, hashing v round-robin and partitioning.







                share|improve this answer












                share|improve this answer



                share|improve this answer










                answered Jan 25 at 10:33









                wBobwBob

                6,20631122




                6,20631122
































                    draft saved

                    draft discarded




















































                    Thanks for contributing an answer to Stack Overflow!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54022849%2fusql-files-vs-managed-tables-how-data-is-stored-physically%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    MongoDB - Not Authorized To Execute Command

                    Npm cannot find a required file even through it is in the searched directory

                    in spring boot 2.1 many test slices are not allowed anymore due to multiple @BootstrapWith