Split string by byte length in Python

I have a situation where I need to insert more than 4000 characters into an Oracle VARCHAR and was advised against using CLOB. The proposed solution was to split the text across 2 columns of 4000 characters each, since 8000 should be enough. I made the code dynamic so it could handle X number of columns for reuse. It worked great and passed testing, until it was deployed and someone copied and pasted from a Microsoft product: the pasted text contained multi-byte characters, so the function generated chunks of more than 4000 bytes and the insert broke. I hadn't considered Unicode.



I tried several ideas before settling on this one: start with 4000 characters, and if the byte length is over 4000, remove a character and check the byte length again. It works, but I wonder if there is a better solution. The function also changes the column names from 'column' to 'column1', 'column2', ...etc.



text = data[key]
index = 1
while text:
    length = 4000
    # Trim one character at a time until the chunk fits in 4000 bytes.
    while len(text[0:length].encode('utf-8')) > 4000:
        length -= 1
    data['{}{}'.format(key, index)] = text[0:length]
    text = text[length:]
    index += 1
del data[key]
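
For context, the mismatch comes from UTF-8 using more than one byte for non-ASCII characters, so character count and byte count diverge. A minimal illustration (the string is arbitrary):

s = 'é' * 4000                    # 4000 characters
print(len(s))                     # 4000
print(len(s.encode('utf-8')))     # 8000 bytes: too big for a VARCHAR2(4000 BYTE) column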

python oracle

asked Nov 19 '18 at 17:37 by rtaft

    This question/answer has a similar solution to yours, but it may have some other features you would find helpful
    – G. Anderson
    Nov 19 '18 at 17:41

2 Answers

Check whether the advice you were given against CLOBs is current, or whether it was based on old information about accessing LOBs using locators.



The best practice for "small" CLOBs in cx_Oracle is to represent them as strings: your code will be simple and still efficient. See the example https://github.com/oracle/python-cx_Oracle/blob/master/samples/ReturnLobsAsStrings.py
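
Along the lines of that sample, returning CLOBs as strings is done with an output type handler; a minimal sketch (the connection details are placeholders):

import cx_Oracle

def output_type_handler(cursor, name, default_type, size, precision, scale):
    # Ask the driver to fetch CLOB columns as plain Python strings.
    if default_type == cx_Oracle.CLOB:
        return cursor.var(cx_Oracle.LONG_STRING, arraysize=cursor.arraysize)

connection = cx_Oracle.connect("user", "password", "dsn")  # placeholder credentials
connection.outputtypehandler = output_type_handler
# CLOB values in query results now arrive as ordinary str objects.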



Another solution is to use a recent version of Oracle DB that supports 32K VARCHAR2.






answered Nov 20 '18 at 5:46 by Christopher Jones

  • He had many reasons from how it's stored, how it's backed up, to a list of problems he encountered over the years with LOBs. Maybe that is different now, idk, but he isn't budging. I like the 32K idea, but the DB is setup for 4k.
    – rtaft
    Nov 28 '18 at 15:59

I ended up combining G. Anderson's link with my code. It's more efficient in that it doesn't re-encode the text for every length check.



encoded_text = data[key].encode('utf-8')
index = 1
while encoded_text:
    length = min(4000, len(encoded_text))
    if len(encoded_text) > 4000:
        # UTF-8 continuation bytes have the form 10xxxxxx, so back up
        # until the cut point is not in the middle of a character.
        # (Python 3: indexing bytes yields an int.)
        while (encoded_text[length] & 0xc0) == 0x80:
            length -= 1
    data['{}{}'.format(key, index)] = encoded_text[:length].decode('utf-8')
    encoded_text = encoded_text[length:]
    index += 1
del data[key]
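
As a quick check of the continuation-byte test itself (a throwaway example):

b = 'é'.encode('utf-8')         # b'\xc3\xa9', two bytes for one character
print((b[0] & 0xc0) == 0x80)    # False: 0xc3 starts a character
print((b[1] & 0xc0) == 0x80)    # True:  0xa9 is a continuation byte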


I also toyed with the idea of using encode('unicode-escape') to get around the Unicode issue, but that could potentially more than double my string length.
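
For example, comparing byte counts for the two encodings (the characters are arbitrary):

print(len('é'.encode('utf-8')))             # 2
print(len('é'.encode('unicode-escape')))    # 4  (b'\\xe9')
print(len('😀'.encode('utf-8')))            # 4
print(len('😀'.encode('unicode-escape')))   # 10 (b'\\U0001f600')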






answered Nov 28 '18 at 16:05 by rtaft