Python: avoiding nested for-loop NLP edition; any lib-support?



























I am trying to build a user-query-based autosuggest. I have a bunch of aggregated queries like:



QUERY           COUNT
"harry potter"    100
"iron man"         93
"harry pott"       32
"harr pott"         5


with around 200,000 rows. As you can see, some users rely heavily on prefixed search, typing only the first letters of each word. Those queries in the example should be aggregated into the full "harry potter" row.



Now, assuming that the majority of users search with full words, I think I can do that aggregation efficiently (avoiding a nested for-loop over the whole index) in the following way:



I sort the tokens in each query alphabetically and generate a map "first_token" like:



"h"         "harry potter"
"ha" "harry potter"
"har" "harry potter"
"harr" "harry potter"
"harry" "harry potter"


and, respectively, a "second_token" map, and so forth...



"p"         "harry potter"
"po" "harry potter"
"pot" "harry potter"
"pott" "harry potter"
"potte" "harry potter"
"potter" "harry potter"


Then I iterate from top to bottom, and for each element like "harr pott" I check whether there is an element in both "first_token" and "second_token" whose value is the same document, e.g. "harry potter". If that document is not identical to the original ("harr pott") and has a higher score, I aggregate the original into it. The runtime of this should be O(index_size * max_number_of_tokens).
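A minimal sketch of that two-pass idea in plain Python. Names like `prefix_maps` are illustrative, and each map keeps only the single best-scoring query per prefix, which is a simplification of the scheme above:

```python
from collections import defaultdict

rows = [("harry potter", 100), ("iron man", 93),
        ("harry pott", 32), ("harr pott", 5)]

def tokens(query):
    # Sort tokens alphabetically so token positions line up across queries.
    return sorted(query.split())

# One map per token position: prefix -> (count, query) of the best-scoring query.
prefix_maps = defaultdict(dict)
for query, count in rows:
    for pos, tok in enumerate(tokens(query)):
        for i in range(1, len(tok) + 1):
            best = prefix_maps[pos].get(tok[:i])
            if best is None or count > best[0]:
                prefix_maps[pos][tok[:i]] = (count, query)

# Aggregate each query into the common higher-scoring query, if any.
aggregated = defaultdict(int)
for query, count in rows:
    hits = [prefix_maps[pos].get(tok)
            for pos, tok in enumerate(tokens(query))]
    target = query
    # All token prefixes must resolve to the same, better-scoring document.
    if all(hits) and len({q for _, q in hits}) == 1:
        best_count, best_query = hits[0]
        if best_query != query and best_count > count:
            target = best_query
    aggregated[target] += count

print(dict(aggregated))  # {'harry potter': 137, 'iron man': 93}
```

Both passes touch each query's prefixes once, matching the O(index_size * max_number_of_tokens) estimate (times max token length for the prefix enumeration).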



Now I was wondering whether there is any Python library that can make implementing all of this easier. Coming from Java/JS, I am not very familiar with Python yet; I just know it has lots of tools for NLP.



Can anything in NLTK or similar help me? I think there should at least be a tool for vectorizing strings. Perhaps with that you could do the "starts-with" operation as a simple lookup, without generating trie maps manually?



































  • Try trie structures =)

    – alvas
    Jan 2 at 2:40











  • @alvas yup, that's what I was describing. In fact I don't even need full tries; they can start with the smallest tokens I have (around 3 characters) instead of single letters. My question is whether there is a module for doing just that

    – Phil
    Jan 2 at 7:02











  • Maybe pypi.org/project/datrie ?

    – alvas
    Jan 2 at 9:25











  • @alvas ah yes, thanks :)

    – Phil
    Jan 2 at 11:26
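For reference on the datrie suggestion above: that package builds the prefix structure for you (its `Trie` supports prefix queries such as `keys(prefix)`). As a dependency-free illustration of the same "starts-with as a lookup" idea, you can also bisect a sorted list of queries; the data below is the example from the question:

```python
import bisect

queries = sorted(["harry potter", "iron man", "harry pott", "harr pott"])

def starts_with(prefix):
    # All keys >= prefix, and < prefix followed by a sentinel character
    # that sorts after any real character in the alphabet.
    lo = bisect.bisect_left(queries, prefix)
    hi = bisect.bisect_left(queries, prefix + "\uffff")
    return queries[lo:hi]

print(starts_with("harry"))  # ['harry pott', 'harry potter']
```

This gives O(log n) prefix lookups without materialising every prefix, at the cost of keeping the key list sorted.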
















python python-3.x nlp nltk






asked Jan 1 at 9:54 by Phil
edited Jan 1 at 10:51


1 Answer
Autosuggest and many other search-specific features are handled well by Lucene. You can try its Python port, PyLucene.

Alternatively, if you want an answer limited to the specifics of the question you asked, try out the ngram module for Python.
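The trigram idea behind such a module can be sketched with the standard library alone. This mirrors the similarity search an n-gram index performs; it is a hypothetical illustration, not the ngram package's actual API:

```python
def trigrams(s):
    s = "  " + s + " "  # pad so short strings still produce grams
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a, b):
    # Jaccard similarity over character trigrams.
    ga, gb = trigrams(a), trigrams(b)
    return len(ga & gb) / len(ga | gb)

queries = ["harry potter", "iron man", "harry pott", "harr pott"]
target = "harr pott"
best = max((q for q in queries if q != target),
           key=lambda q: similarity(target, q))
print(best)  # harry pott
```

Note that unlike the prefix-map approach in the question, n-gram similarity also tolerates typos, not just truncated words.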






answered Jan 1 at 11:01 by rishi
  • Hi and thanks for the reply. I am using ElasticSearch which is built on top of Lucene. Lucene is doing a great job of providing a quick suggest function, which in the newer versions is built with tries, as far as I know. The problem is to create suggest data in the first place; that's what I try to do right now. But I will check out the ngram module; it might indeed be just what I am looking for. Thanks again!

    – Phil
    Jan 1 at 14:20













  • If you are already using ElasticSearch and your issue is around creating the database for suggestions, check out stackoverflow.com/questions/22882927/…

    – rishi
    Jan 1 at 16:36











  • No, I can work with ElasticSearch just fine, my issue is data-cleaning

    – Phil
    Jan 1 at 21:28










