Split string by byte length in Python
I need to insert more than 4000 characters into an Oracle VARCHAR column and was advised against using a CLOB. The proposed solution was to split the text across 2 columns of 4000 each, on the assumption that 8000 would be enough. I made the code dynamic so it could handle any number of columns for reuse. It worked, passed testing, and so on, until it was deployed and someone pasted in text copied from a Microsoft product. It broke because the function produced a chunk of more than 4000 bytes: I hadn't considered Unicode.
I tried several ideas before settling on one where I start with 4000 characters and, if the byte length is over 4000, remove a character and check the byte length again. It works, but I wonder if there is a better solution. The function also changes the column names from 'column' to 'column1', 'column2', and so on.
    text = data[key]
    index = 1
    while text:
        length = 4000
        # Drop one character at a time until the chunk fits in 4000 bytes
        while len(text[0:length].encode('utf-8')) > 4000:
            length -= 1
        data['{}{}'.format(key, index)] = text[0:length]
        text = text[length:]
        index += 1
    del data[key]
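For readers wondering why 4000 characters can exceed 4000 bytes, here is a minimal sketch. The sample string is made up for illustration, and it assumes the text is stored as UTF-8; curly quotes and dashes of the kind word processors insert each take 3 bytes in that encoding.

```python
# Hypothetical sample: curly quotes and an en dash, as often pasted from a word processor.
text = "\u201cquoted\u201d \u2013 text"

char_count = len(text)                  # number of Unicode code points
byte_count = len(text.encode("utf-8"))  # number of bytes in the UTF-8 encoding

print(char_count, byte_count)  # 15 21
```

The three non-ASCII characters account for the 6 extra bytes, so a slice of exactly 4000 characters can overflow a 4000-byte column as soon as it contains one of them.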
python oracle
asked Nov 19 '18 at 17:37
rtaft
This question/answer has a similar solution to yours, but it may have some other features you would find helpful
– G. Anderson
Nov 19 '18 at 17:41
2 Answers
Check whether the advice you were given against CLOBs is current, or whether it is based on old information about accessing LOBs through locators.
The best practice for "small" CLOBs in cx_Oracle is to represent them as strings: your code will be simple and still efficient. See the example at https://github.com/oracle/python-cx_Oracle/blob/master/samples/ReturnLobsAsStrings.py
Another solution is to use a recent version of Oracle DB that supports 32K VARCHAR2 (12c or later, with MAX_STRING_SIZE set to EXTENDED).
answered Nov 20 '18 at 5:46


Christopher Jones
He had many reasons, from how it's stored and how it's backed up to a list of problems he has encountered over the years with LOBs. Maybe that is different now, I don't know, but he isn't budging. I like the 32K idea, but the DB is set up for 4K.
– rtaft
Nov 28 '18 at 15:59
I ended up combining the approach from G. Anderson's link with my code. It's more efficient because it encodes the text once instead of re-encoding for every length check.
    encoded_text = data[key].encode('utf-8')
    index = 1
    while encoded_text:
        length = min(4000, len(encoded_text))
        if len(encoded_text) > 4000:
            # Back up while the cut would land on a UTF-8 continuation byte (10xxxxxx)
            while (encoded_text[length] & 0xc0) == 0x80:
                length -= 1
        data['{}{}'.format(key, index)] = encoded_text[:length].decode('utf-8')
        encoded_text = encoded_text[length:]
        index += 1
    del data[key]
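The same idea can be wrapped up for reuse. The following is only a sketch under stated assumptions: the names `split_utf8` and `split_into_columns` are hypothetical, the code assumes Python 3 (where indexing a `bytes` object yields an integer), and it assumes the text encodes to valid UTF-8.

```python
def split_utf8(text, max_bytes=4000):
    """Yield pieces of text whose UTF-8 encoding is at most max_bytes long."""
    if max_bytes < 4:
        # A single UTF-8 character can need up to 4 bytes.
        raise ValueError("max_bytes must be at least 4")
    encoded = text.encode("utf-8")
    start = 0
    while start < len(encoded):
        end = min(start + max_bytes, len(encoded))
        # Back up while the byte at the cut is a continuation byte (10xxxxxx),
        # so that no multi-byte character is split in two.
        while end < len(encoded) and (encoded[end] & 0xC0) == 0x80:
            end -= 1
        yield encoded[start:end].decode("utf-8")
        start = end


def split_into_columns(data, key, max_bytes=4000):
    """Replace data[key] with key1, key2, ... each holding one piece."""
    pieces = split_utf8(data.pop(key), max_bytes)
    for index, piece in enumerate(pieces, start=1):
        data["{}{}".format(key, index)] = piece
    return data


row = split_into_columns({"id": 7, "note": "\u00e9" * 3000}, "note")
print(sorted(row))  # ['id', 'note1', 'note2']
```

Using byte offsets into a single encoded buffer avoids both the repeated encoding of the original loop and the repeated slicing of the remaining bytes.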
I also toyed with the idea of using encode('unicode-escape') to get around the Unicode issue, but that could potentially more than double my string length.
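A small sketch of that growth, using a made-up sample string: with `unicode-escape`, each curly quote becomes a six-byte `\uXXXX` sequence, compared with three bytes in UTF-8.

```python
# Hypothetical sample: a word wrapped in curly quotes.
text = "\u201cquoted\u201d"

escaped = text.encode("unicode-escape")  # pure-ASCII bytes with \uXXXX escapes
utf8 = text.encode("utf-8")

print(len(text), len(utf8), len(escaped))  # 8 12 18
```

The escaped form is plain ASCII, so character count and byte count agree, but the text takes more room and has to be decoded with `unicode-escape` again when it is read back.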
answered Nov 28 '18 at 16:05
rtaft