Python: avoiding nested for-loop NLP edition; any lib-support?



























I am trying to build a user-query-based autosuggest. I have a bunch of aggregated queries like:



QUERY           COUNT
"harry potter"    100
"iron man"         93
"harry pott"       32
"harr pott"         5


with around 200,000 rows. As you can see, some users rely heavily on prefixed search, typing only the first letters of each word. Those queries in the example should be aggregated into the full "harry potter" row.



Now, assuming that the majority of users search with full words, I think I can do that aggregation efficiently (avoiding a nested for-loop over the whole index) in the following way:



I sort the tokens in each query alphabetically and generate a map "first_token" like:



"h"         "harry potter"
"ha" "harry potter"
"har" "harry potter"
"harr" "harry potter"
"harry" "harry potter"


and, respectively, a "second_token" map, and so forth...



"p"         "harry potter"
"po" "harry potter"
"pot" "harry potter"
"pott" "harry potter"
"potte" "harry potter"
"potter" "harry potter"


Then I iterate from top to bottom, and for each element like "harr pott" I check whether there is an element in both "first_token" and "second_token" whose value is the same document, e.g. "harry potter". If that document is not identical to the original ("harr pott") and has a higher score, I aggregate the original into it. The runtime of this should be O(index_size * max_number_of_tokens).
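A minimal sketch of that two-pass idea in plain Python. Names like `prefix_maps` are illustrative, and each map keeps only the single best-scoring query per prefix, which is a simplification of the scheme above:

```python
from collections import defaultdict

rows = [("harry potter", 100), ("iron man", 93),
        ("harry pott", 32), ("harr pott", 5)]

def tokens(query):
    # Sort tokens alphabetically so token positions line up across queries.
    return sorted(query.split())

# One map per token position: prefix -> (count, query) of the best-scoring query.
prefix_maps = defaultdict(dict)
for query, count in rows:
    for pos, tok in enumerate(tokens(query)):
        for i in range(1, len(tok) + 1):
            best = prefix_maps[pos].get(tok[:i])
            if best is None or count > best[0]:
                prefix_maps[pos][tok[:i]] = (count, query)

# Aggregate each query into the common higher-scoring query, if any.
aggregated = defaultdict(int)
for query, count in rows:
    hits = [prefix_maps[pos].get(tok)
            for pos, tok in enumerate(tokens(query))]
    target = query
    # All token prefixes must resolve to the same, better-scoring document.
    if all(hits) and len({q for _, q in hits}) == 1:
        best_count, best_query = hits[0]
        if best_query != query and best_count > count:
            target = best_query
    aggregated[target] += count

print(dict(aggregated))  # {'harry potter': 137, 'iron man': 93}
```

Both passes touch each query's prefixes once, matching the O(index_size * max_number_of_tokens) estimate (times max token length for the prefix enumeration).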



Now I was wondering whether there is any Python library that can make implementing all of this easier. Coming from Java/JS, I am not very familiar with Python yet; I just know it has lots of tools for NLP.



Can anything in NLTK or similar help me? I think there should at least be a tool for vectorizing strings. Perhaps with that you could do the "starts-with" operation as a simple lookup, without generating trie maps manually?



































  • Try trie structures =)

    – alvas
    Jan 2 at 2:40











  • @alvas yup, that's what I was describing. In fact I don't even need full tries; they can start with the smallest tokens I have (around 3 characters) instead of single letters. My question is whether there is a module for doing just that

    – Phil
    Jan 2 at 7:02











  • Maybe pypi.org/project/datrie ?

    – alvas
    Jan 2 at 9:25











  • @alvas ah yes, thanks :)

    – Phil
    Jan 2 at 11:26
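For reference on the datrie suggestion above: that package builds the prefix structure for you (its `Trie` supports prefix queries such as `keys(prefix)`). As a dependency-free illustration of the same "starts-with as a lookup" idea, you can also bisect a sorted list of queries; the data below is the example from the question:

```python
import bisect

queries = sorted(["harry potter", "iron man", "harry pott", "harr pott"])

def starts_with(prefix):
    # All keys >= prefix, and < prefix followed by a sentinel character
    # that sorts after any real character in the alphabet.
    lo = bisect.bisect_left(queries, prefix)
    hi = bisect.bisect_left(queries, prefix + "\uffff")
    return queries[lo:hi]

print(starts_with("harry"))  # ['harry pott', 'harry potter']
```

This gives O(log n) prefix lookups without materialising every prefix, at the cost of keeping the key list sorted.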
















python python-3.x nlp nltk






asked Jan 1 at 9:54 by Phil
edited Jan 1 at 10:51


1 Answer
Autosuggest and many other search-specific features are handled well by Lucene. You can try its Python port, PyLucene.

Alternatively, if you want an answer limited to the specifics of the question you asked, try out the ngram module for Python.
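The trigram idea behind such a module can be sketched with the standard library alone. This mirrors the similarity search an n-gram index performs; it is a hypothetical illustration, not the ngram package's actual API:

```python
def trigrams(s):
    s = "  " + s + " "  # pad so short strings still produce grams
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a, b):
    # Jaccard similarity over character trigrams.
    ga, gb = trigrams(a), trigrams(b)
    return len(ga & gb) / len(ga | gb)

queries = ["harry potter", "iron man", "harry pott", "harr pott"]
target = "harr pott"
best = max((q for q in queries if q != target),
           key=lambda q: similarity(target, q))
print(best)  # harry pott
```

Note that unlike the prefix-map approach in the question, n-gram similarity also tolerates typos, not just truncated words.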






answered Jan 1 at 11:01 by rishi
  • Hi and thanks for the reply. I am using ElasticSearch which is built on top of Lucene. Lucene is doing a great job of providing a quick suggest function, which in the newer versions is built with tries, as far as I know. The problem is to create suggest data in the first place; that's what I try to do right now. But I will check out the ngram module; it might indeed be just what I am looking for. Thanks again!

    – Phil
    Jan 1 at 14:20













  • If you are already using ElasticSearch and your issue is around creating the database for suggestions, check out stackoverflow.com/questions/22882927/…

    – rishi
    Jan 1 at 16:36











  • No, I can work with ElasticSearch just fine, my issue is data-cleaning

    – Phil
    Jan 1 at 21:28










