Partitioning data for a timestamp query












0














I have paritioned data on s3 I would like to access via spectrum. The current format file structure is similar to: s3://bucket/dir/year=2018/month=11/day=19/hour=12/file.parquet



I partitioned the data using glue, by parsing a field I use for timestamps, ts. Most queries I will do will be on the ts field, as they are timestamp range queries that are more granular than daily(may span multiple days, or less than one day, but time is often involved.



How would I go about creating hourly(preferred, daily would work if needed) partitions on my data so when I query the ts(or another timestamp) field, it will access the partitions correctly. If needed I can recreate my data with different paritions. Most examples/docs just bucket data daily, and use the date field in the query.



I would be happy to provide more information if needed.



Thank you!



Example query would be something like:



SELECT * FROM spectrum.data
WHERE ts between '2018-11-19 17:30:00' AND '2018-11-20 04:45:00'









share|improve this question
























  • please share your example query?
    – Jon Scott
    Nov 20 '18 at 18:23










  • Updated the original post, let me know if anything is unclear.
    – Eumcoz
    Nov 20 '18 at 19:11
















0














I have paritioned data on s3 I would like to access via spectrum. The current format file structure is similar to: s3://bucket/dir/year=2018/month=11/day=19/hour=12/file.parquet



I partitioned the data using glue, by parsing a field I use for timestamps, ts. Most queries I will do will be on the ts field, as they are timestamp range queries that are more granular than daily(may span multiple days, or less than one day, but time is often involved.



How would I go about creating hourly(preferred, daily would work if needed) partitions on my data so when I query the ts(or another timestamp) field, it will access the partitions correctly. If needed I can recreate my data with different paritions. Most examples/docs just bucket data daily, and use the date field in the query.



I would be happy to provide more information if needed.



Thank you!



Example query would be something like:



SELECT * FROM spectrum.data
WHERE ts between '2018-11-19 17:30:00' AND '2018-11-20 04:45:00'









share|improve this question
























  • please share your example query?
    – Jon Scott
    Nov 20 '18 at 18:23










  • Updated the original post, let me know if anything is unclear.
    – Eumcoz
    Nov 20 '18 at 19:11














0












0








0







I have paritioned data on s3 I would like to access via spectrum. The current format file structure is similar to: s3://bucket/dir/year=2018/month=11/day=19/hour=12/file.parquet



I partitioned the data using glue, by parsing a field I use for timestamps, ts. Most queries I will do will be on the ts field, as they are timestamp range queries that are more granular than daily(may span multiple days, or less than one day, but time is often involved.



How would I go about creating hourly(preferred, daily would work if needed) partitions on my data so when I query the ts(or another timestamp) field, it will access the partitions correctly. If needed I can recreate my data with different paritions. Most examples/docs just bucket data daily, and use the date field in the query.



I would be happy to provide more information if needed.



Thank you!



Example query would be something like:



SELECT * FROM spectrum.data
WHERE ts between '2018-11-19 17:30:00' AND '2018-11-20 04:45:00'









share|improve this question















I have paritioned data on s3 I would like to access via spectrum. The current format file structure is similar to: s3://bucket/dir/year=2018/month=11/day=19/hour=12/file.parquet



I partitioned the data using glue, by parsing a field I use for timestamps, ts. Most queries I will do will be on the ts field, as they are timestamp range queries that are more granular than daily(may span multiple days, or less than one day, but time is often involved.



How would I go about creating hourly(preferred, daily would work if needed) partitions on my data so when I query the ts(or another timestamp) field, it will access the partitions correctly. If needed I can recreate my data with different paritions. Most examples/docs just bucket data daily, and use the date field in the query.



I would be happy to provide more information if needed.



Thank you!



Example query would be something like:



SELECT * FROM spectrum.data
WHERE ts between '2018-11-19 17:30:00' AND '2018-11-20 04:45:00'






amazon-redshift amazon-redshift-spectrum






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 20 '18 at 19:11







Eumcoz

















asked Nov 19 '18 at 17:44









EumcozEumcoz

1,5351229




1,5351229












  • please share your example query?
    – Jon Scott
    Nov 20 '18 at 18:23










  • Updated the original post, let me know if anything is unclear.
    – Eumcoz
    Nov 20 '18 at 19:11


















  • please share your example query?
    – Jon Scott
    Nov 20 '18 at 18:23










  • Updated the original post, let me know if anything is unclear.
    – Eumcoz
    Nov 20 '18 at 19:11
















please share your example query?
– Jon Scott
Nov 20 '18 at 18:23




please share your example query?
– Jon Scott
Nov 20 '18 at 18:23












Updated the original post, let me know if anything is unclear.
– Eumcoz
Nov 20 '18 at 19:11




Updated the original post, let me know if anything is unclear.
– Eumcoz
Nov 20 '18 at 19:11












1 Answer
1






active

oldest

votes


















0














Spectrum is not so intuitive. You probably will need to convert timestamp to year, month, day ...



And than do something like WHERE (year > x AND year < y) AND (month > x1 AND month < x2) AND ...



Looks ugly.



You can consider doing something else :




  1. s3://bucket/dir/date=2018-11-19/time=17:30:00/file.parquet


In that case your query will be more simple



WHERE ( date < '2018-11-19' AND date > '2018-11-17') AND ( time < '17:30:00' AND time > '17:20:00')



OR using BETWEEN
https://docs.aws.amazon.com/redshift/latest/dg/r_range_condition.html






share|improve this answer





















    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53380050%2fpartitioning-data-for-a-timestamp-query%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    0














    Spectrum is not so intuitive. You probably will need to convert timestamp to year, month, day ...



    And than do something like WHERE (year > x AND year < y) AND (month > x1 AND month < x2) AND ...



    Looks ugly.



    You can consider doing something else :




    1. s3://bucket/dir/date=2018-11-19/time=17:30:00/file.parquet


    In that case your query will be more simple



    WHERE ( date < '2018-11-19' AND date > '2018-11-17') AND ( time < '17:30:00' AND time > '17:20:00')



    OR using BETWEEN
    https://docs.aws.amazon.com/redshift/latest/dg/r_range_condition.html






    share|improve this answer


























      0














      Spectrum is not so intuitive. You probably will need to convert timestamp to year, month, day ...



      And than do something like WHERE (year > x AND year < y) AND (month > x1 AND month < x2) AND ...



      Looks ugly.



      You can consider doing something else :




      1. s3://bucket/dir/date=2018-11-19/time=17:30:00/file.parquet


      In that case your query will be more simple



      WHERE ( date < '2018-11-19' AND date > '2018-11-17') AND ( time < '17:30:00' AND time > '17:20:00')



      OR using BETWEEN
      https://docs.aws.amazon.com/redshift/latest/dg/r_range_condition.html






      share|improve this answer
























        0












        0








        0






        Spectrum is not so intuitive. You probably will need to convert timestamp to year, month, day ...



        And than do something like WHERE (year > x AND year < y) AND (month > x1 AND month < x2) AND ...



        Looks ugly.



        You can consider doing something else :




        1. s3://bucket/dir/date=2018-11-19/time=17:30:00/file.parquet


        In that case your query will be more simple



        WHERE ( date < '2018-11-19' AND date > '2018-11-17') AND ( time < '17:30:00' AND time > '17:20:00')



        OR using BETWEEN
        https://docs.aws.amazon.com/redshift/latest/dg/r_range_condition.html






        share|improve this answer












        Spectrum is not so intuitive. You probably will need to convert timestamp to year, month, day ...



        And than do something like WHERE (year > x AND year < y) AND (month > x1 AND month < x2) AND ...



        Looks ugly.



        You can consider doing something else :




        1. s3://bucket/dir/date=2018-11-19/time=17:30:00/file.parquet


        In that case your query will be more simple



        WHERE ( date < '2018-11-19' AND date > '2018-11-17') AND ( time < '17:30:00' AND time > '17:20:00')



        OR using BETWEEN
        https://docs.aws.amazon.com/redshift/latest/dg/r_range_condition.html







        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered Dec 11 '18 at 8:44









        yuriy polonskiyyuriy polonskiy

        112




        112






























            draft saved

            draft discarded




















































            Thanks for contributing an answer to Stack Overflow!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.





            Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


            Please pay close attention to the following guidance:


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53380050%2fpartitioning-data-for-a-timestamp-query%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            MongoDB - Not Authorized To Execute Command

            How to fix TextFormField cause rebuild widget in Flutter

            Npm cannot find a required file even through it is in the searched directory