Split string by byte length in Python

I have a situation where I need to insert more than 4000 characters into an Oracle VARCHAR and was advised against using CLOB. The proposed solution was to split the text across 2 columns of 4000 characters each, since 8000 should be enough. I made the code dynamic so it could handle X number of columns for reuse. It worked great and passed testing, until it was deployed and someone copied and pasted from a Microsoft product: the pasted text contained multi-byte characters, so the function generated chunks of more than 4000 bytes and the insert broke. I hadn't considered Unicode.



I tried several ideas before settling on this one: start with 4000 characters, and if the byte length is over 4000, remove a character and check the byte length again. It works, but I wonder if there is a better solution. The function also changes the column names from 'column' to 'column1', 'column2', ...etc.



text = data[key]
index = 1
while text:
    length = 4000
    # Trim one character at a time until the chunk fits in 4000 bytes.
    while len(text[0:length].encode('utf-8')) > 4000:
        length -= 1
    data['{}{}'.format(key, index)] = text[0:length]
    text = text[length:]
    index += 1
del data[key]
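
For context, the mismatch comes from UTF-8 using more than one byte for non-ASCII characters, so character count and byte count diverge. A minimal illustration (the string is arbitrary):

s = 'é' * 4000                    # 4000 characters
print(len(s))                     # 4000
print(len(s.encode('utf-8')))     # 8000 bytes: too big for a VARCHAR2(4000 BYTE) column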

python oracle

asked Nov 19 '18 at 17:37 by rtaft

    This question/answer has a similar solution to yours, but it may have some other features you would find helpful
    – G. Anderson
    Nov 19 '18 at 17:41

2 Answers

Check whether the advice you were given against CLOBs is current, or whether it was based on old information about accessing LOBs using locators.



The best practice for "small" CLOBs in cx_Oracle is to represent them as strings: your code will be simple and still efficient. See the example https://github.com/oracle/python-cx_Oracle/blob/master/samples/ReturnLobsAsStrings.py
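
Along the lines of that sample, returning CLOBs as strings is done with an output type handler; a minimal sketch (the connection details are placeholders):

import cx_Oracle

def output_type_handler(cursor, name, default_type, size, precision, scale):
    # Ask the driver to fetch CLOB columns as plain Python strings.
    if default_type == cx_Oracle.CLOB:
        return cursor.var(cx_Oracle.LONG_STRING, arraysize=cursor.arraysize)

connection = cx_Oracle.connect("user", "password", "dsn")  # placeholder credentials
connection.outputtypehandler = output_type_handler
# CLOB values in query results now arrive as ordinary str objects.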



Another solution is to use a recent version of Oracle DB that supports 32K VARCHAR2.






answered Nov 20 '18 at 5:46 by Christopher Jones

  • He had many reasons from how it's stored, how it's backed up, to a list of problems he encountered over the years with LOBs. Maybe that is different now, idk, but he isn't budging. I like the 32K idea, but the DB is setup for 4k.
    – rtaft
    Nov 28 '18 at 15:59

I ended up combining G. Anderson's link with my code. It's more efficient in that it doesn't re-encode the text for every length check.



encoded_text = data[key].encode('utf-8')
index = 1
while encoded_text:
    length = min(4000, len(encoded_text))
    if len(encoded_text) > 4000:
        # UTF-8 continuation bytes have the form 10xxxxxx, so back up
        # until the cut point is not in the middle of a character.
        # (Python 3: indexing bytes yields an int.)
        while (encoded_text[length] & 0xc0) == 0x80:
            length -= 1
    data['{}{}'.format(key, index)] = encoded_text[:length].decode('utf-8')
    encoded_text = encoded_text[length:]
    index += 1
del data[key]
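
As a quick check of the continuation-byte test itself (a throwaway example):

b = 'é'.encode('utf-8')         # b'\xc3\xa9', two bytes for one character
print((b[0] & 0xc0) == 0x80)    # False: 0xc3 starts a character
print((b[1] & 0xc0) == 0x80)    # True:  0xa9 is a continuation byte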


I also toyed with the idea of using encode('unicode-escape') to get around the Unicode issue, but that could potentially more than double my string length.
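
For example, comparing byte counts for the two encodings (the characters are arbitrary):

print(len('é'.encode('utf-8')))             # 2
print(len('é'.encode('unicode-escape')))    # 4  (b'\\xe9')
print(len('😀'.encode('utf-8')))            # 4
print(len('😀'.encode('unicode-escape')))   # 10 (b'\\U0001f600')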






answered Nov 28 '18 at 16:05 by rtaft