Python: avoiding nested for-loop NLP edition; any lib-support?
I am trying to make a user-query-based autosuggest. I have a bunch of aggregated queries like:
QUERY            COUNT
"harry potter"     100
"iron man"          93
"harry pott"        32
"harr pott"          5
with around 200,000 rows. As you can see, some users make heavy use of prefix search, typing only the first few letters of each word. In the example, those queries should be aggregated into the full "harry potter" row.
Now, assuming that the majority of users search with full words, I think I can do that aggregation efficiently (avoiding a nested for-loop over the whole index) in the following way:
I sort the tokens in each query alphabetically and generate a "first_token" map like:
"h" "harry potter"
"ha" "harry potter"
"har" "harry potter"
"harr" "harry potter"
"harry" "harry potter"
and, correspondingly, a "second_token" map and so forth:
"p" "harry potter"
"po" "harry potter"
"pot" "harry potter"
"pott" "harry potter"
"potte" "harry potter"
"potter" "harry potter"
Then I iterate from top to bottom, and for each entry like "harr pott" I check whether "first_token" and "second_token" both contain an entry pointing to the same document, e.g. "harry potter", where that document is not identical to the original ("harr pott") and has a higher count; if so, I aggregate the entry into that document. The runtime of this should be O(index_size * max_number_of_tokens).
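To make the approach concrete, here is a minimal sketch of this prefix-map aggregation in plain Python (ordinary dicts standing in for tries); the sample data and helper names are made up for illustration:

    from collections import defaultdict

    # Aggregated query counts (illustrative sample of the 200,000 rows).
    counts = {
        "harry potter": 100,
        "iron man": 93,
        "harry pott": 32,
        "harr pott": 5,
    }

    # One prefix map per token position: prefix -> set of full queries.
    prefix_maps = defaultdict(lambda: defaultdict(set))
    for query in counts:
        for pos, token in enumerate(sorted(query.split())):
            for i in range(1, len(token) + 1):
                prefix_maps[pos][token[:i]].add(query)

    def aggregate(counts, prefix_maps):
        merged = dict(counts)
        for query, count in counts.items():
            tokens = sorted(query.split())
            # Queries whose tokens (position by position) start with this query's tokens.
            candidates = set.intersection(
                *(prefix_maps[pos].get(tok, set()) for pos, tok in enumerate(tokens))
            )
            candidates.discard(query)
            # Only merge into a higher-scoring query with the same number of tokens.
            better = [c for c in candidates
                      if counts[c] > count and len(c.split()) == len(tokens)]
            if better:
                target = max(better, key=counts.get)
                merged[target] = merged.get(target, 0) + merged.pop(query)
        return merged

    print(aggregate(counts, prefix_maps))
    # {'harry potter': 137, 'iron man': 93}

The lookup phase checks each query once per token, in line with the O(index_size * max_number_of_tokens) estimate above; building the prefix maps additionally costs one entry per character of each token.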
Now I was wondering whether there is any Python library that would make implementing all of this easier. Coming from Java/JS, I am not very familiar with Python yet; I just know it has lots of tools for NLP.
Can anything in NLTK or similar help me? I think there should be at least a tool for vectorizing strings. Perhaps with that, the "starts-with" operation could be done as a simple lookup without building the trie maps manually?
python python-3.x nlp nltk
Try trie structures =)
– alvas
Jan 2 at 2:40
@alvas Yup, that's what I was describing. In fact I don't even need full tries; they can start with the smallest tokens I have (around 3 characters) instead of single letters. My question is whether there is a module for doing just that.
– Phil
Jan 2 at 7:02
Maybe pypi.org/project/datrie ?
– alvas
Jan 2 at 9:25
@alvas ah yes, thanks :)
– Phil
Jan 2 at 11:26
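For reference, here is a minimal sketch of prefix lookups with the datrie package mentioned above; the alphabet and sample data are illustrative, and the exact API should be double-checked against the datrie documentation:

    import string
    import datrie  # pip install datrie

    # datrie requires the allowed key characters to be declared up front.
    trie = datrie.Trie(string.ascii_lowercase + " ")

    # Key each full query by (for example) its first token; the trie then
    # answers "starts-with" lookups directly instead of via hand-built maps.
    trie["harry"] = "harry potter"
    trie["iron"] = "iron man"

    print(trie.has_keys_with_prefix("harr"))  # True
    print(trie.values("harr"))                # ['harry potter']
    print(trie.keys("ha"))                    # ['harry']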
1 Answer
Autosuggest and many other search-specific features are handled well by Lucene. You can try its Python implementation, PyLucene.
Alternatively, if you want an answer limited to the specifics of the question you asked, try the ngram module for Python. Details here
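For illustration, here is a minimal sketch of character n-gram matching written in plain Python rather than with the ngram package itself (whose API may differ); the 0.5 threshold is an arbitrary choice:

    def ngrams(text, n=3):
        # Character n-grams, padded so short strings still yield grams.
        padded = f" {text} "
        return {padded[i:i + n] for i in range(len(padded) - n + 1)}

    def similarity(a, b, n=3):
        # Jaccard overlap of the two strings' n-gram sets.
        ga, gb = ngrams(a, n), ngrams(b, n)
        return len(ga & gb) / len(ga | gb)

    queries = ["harry potter", "iron man", "harry pott", "harr pott"]

    # Map each query to its most similar longer query, if it is similar enough.
    for q in queries:
        longer = [other for other in queries if len(other) > len(q)]
        if not longer:
            continue
        score, best = max((similarity(q, other), other) for other in longer)
        if score > 0.5:  # arbitrary threshold, tune on real data
            print(f"{q!r} -> {best!r} (similarity {score:.2f})")
    # 'harry pott' -> 'harry potter' (similarity 0.69)
    # 'harr pott' -> 'harry pott' (similarity 0.58)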
Hi, and thanks for the reply. I am using ElasticSearch, which is built on top of Lucene. Lucene does a great job of providing a quick suggest function, which in newer versions is built with tries, as far as I know. The problem is creating the suggest data in the first place; that's what I am trying to do right now. But I will check out the ngram module; it might indeed be just what I am looking for. Thanks again!
– Phil
Jan 1 at 14:20
If you are already using ElasticSearch and your issue is around creating the database for suggestions, check out stackoverflow.com/questions/22882927/…
– rishi
Jan 1 at 16:36
No, I can work with ElasticSearch just fine; my issue is data cleaning.
– Phil
Jan 1 at 21:28