How to filter a batch of unlabeled article data by word similarity with a weak model in Python?
I have 9000 unlabeled article samples that I want to label with a binary class (0 and 1).
Additionally, I have 500 labeled samples belonging to the positive class (label=1) and no samples for the negative class (label=0).
I know it's impossible to label the 9000 samples with 0 and 1 using a model trained only on the 500 positive samples.
So I would like to implement a "similarity" approach: classify the 9000 samples based on their word similarity to the 500 positive samples, extract the similar ones and label them 1, and label the rest of the 9000 as class 0.
So the question is: is this kind of filtering possible? If so, how can I filter by word similarity in Python?
Thank you for your answer; I hope I find a solution. :)
python filter scikit-learn deep-learning nlp
asked Nov 20 '18 at 3:41 by thomi dhia · edited Nov 20 '18 at 8:56 by Roberto
It depends on what you consider 'similar' data. You could build a trie of the labeled data, and then compare your unlabeled data against it, marking all data that is 'different enough' from the trie data as 0. – DMarczak, Nov 20 '18 at 3:47
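For what it's worth, one literal reading of this suggestion is a character trie over the vocabulary of the positive articles, with each unlabeled article scored by how much of its vocabulary the trie covers. A rough sketch follows; the tokenization, the dict-of-dicts trie layout, the 0.5 cutoff, and the names positive_articles/unlabeled_articles are all illustrative assumptions, not anything from the thread.

    END = '$'  # marker: "a complete word ends here"

    def trie_insert(root, word):
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True

    def trie_contains(root, word):
        node = root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return END in node

    # build the trie from the 500 positive (label=1) texts
    root = {}
    for article in positive_articles:
        for word in article.lower().split():
            trie_insert(root, word)

    def coverage(article):
        # fraction of this article's words that appear in the positive vocabulary
        words = article.lower().split()
        hits = sum(trie_contains(root, w) for w in words)
        return hits / max(len(words), 1)

    # articles whose vocabulary overlaps the positive corpus enough get label 1
    labels = [1 if coverage(a) >= 0.5 else 0 for a in unlabeled_articles]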
1 Answer
Yes, it is possible. You can use doc2vec (I suggest the gensim library for Python) to build a vector space from the words in your 500 positive documents. With that representation you can query the similarity between a new sample (one of your 9000) and your corpus of 500 positive samples. If the similarity is "similar enough", you can label the sample as 1.
For a nice tutorial and code, refer to:
https://markroxor.github.io/gensim/static/notebooks/doc2vec-IMDB.html
You can skip the "Predictive Evaluation Methods" section; probably the most interesting section for you is "Do close documents seem more related than distant ones?".
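As a concrete starting point, here is a minimal doc2vec sketch against the current gensim 4.x API (Doc2Vec, model.dv, infer_vector): train on the 500 positive articles, infer a vector for each unlabeled article, and label it by cosine similarity to the centroid of the positive class. The names positive_articles/unlabeled_articles and the 0.4 threshold are illustrative assumptions.

    import numpy as np
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # positive_articles: list of 500 label=1 strings (assumed name)
    # unlabeled_articles: list of 9000 unlabeled strings (assumed name)
    train_docs = [TaggedDocument(doc.lower().split(), [i])
                  for i, doc in enumerate(positive_articles)]

    model = Doc2Vec(vector_size=100, min_count=2, epochs=30, workers=4)
    model.build_vocab(train_docs)
    model.train(train_docs, total_examples=model.corpus_count, epochs=model.epochs)

    # unit-normalized centroid of the positive class in the learned space
    centroid = np.mean([model.dv[i] for i in range(len(train_docs))], axis=0)
    centroid /= np.linalg.norm(centroid)

    def label_article(article, threshold=0.4):
        # threshold is a guess; tune it by inspecting a sample of results
        vec = model.infer_vector(article.lower().split())
        sim = float(np.dot(vec, centroid) / np.linalg.norm(vec))
        return 1 if sim >= threshold else 0

    labels = [label_article(a) for a in unlabeled_articles]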
EDIT (in reply to the comment): yes, I used this code some time ago (I don't remember whether I hit errors). The implementation I used is below; note that I ran it on a machine with 8 cores.
import multiprocessing

from gensim.models import Word2Vec
from gensim.models.doc2vec import TaggedDocument
from sklearn import utils
from tqdm import tqdm

def labelize_tweets_ug(tweets, label):
    # tweets is a pandas Series; each document gets a tag like 'all_0', 'all_1', ...
    result = []
    prefix = label
    for i, t in zip(tweets.index, tweets):
        result.append(TaggedDocument(t.split(), [prefix + '_%s' % i]))
    return result

# all_x is a pandas Series of tweets
all_x_w2v = labelize_tweets_ug(all_x, 'all')

cores = multiprocessing.cpu_count()
# CBOW word2vec (sg=0); note that `size` was renamed `vector_size` in gensim 4.x
model_ug_cbow = Word2Vec(sg=0, size=100, negative=5, window=2, min_count=2,
                         workers=cores, alpha=0.065, min_alpha=0.065)
model_ug_cbow.build_vocab([x.words for x in tqdm(all_x_w2v)])

# 30 manual epochs, decaying the learning rate by hand after each pass
for epoch in range(30):
    model_ug_cbow.train(utils.shuffle([x.words for x in tqdm(all_x_w2v)]),
                        total_examples=len(all_x_w2v), epochs=1)
    model_ug_cbow.alpha -= 0.002
    model_ug_cbow.min_alpha = model_ug_cbow.alpha
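The snippet above stops after training the word vectors. A minimal sketch of the remaining step, going from the trained model to 0/1 labels by averaging each article's word vectors and thresholding cosine similarity against the positive-class centroid; the names positive_texts/unlabeled_texts, the averaging scheme, and the 0.5 threshold are assumptions for illustration, not part of the original code.

    import numpy as np

    def doc_vector(model, text):
        # average the vectors of the words the model knows; an illustrative choice
        words = [w for w in text.split() if w in model.wv]
        if not words:
            return None
        return np.mean([model.wv[w] for w in words], axis=0)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # centroid of the 500 positive texts (assumed name: positive_texts)
    pos_vecs = [v for v in (doc_vector(model_ug_cbow, t) for t in positive_texts)
                if v is not None]
    centroid = np.mean(pos_vecs, axis=0)

    # label the 9000 unlabeled texts (assumed name: unlabeled_texts)
    labels = []
    for text in unlabeled_texts:
        v = doc_vector(model_ug_cbow, text)
        sim = cosine(v, centroid) if v is not None else -1.0
        labels.append(1 if sim >= 0.5 else 0)  # threshold needs tuning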
answered Nov 20 '18 at 7:20 by Roberto · edited Nov 21 '18 at 6:24
In the model-training part, train_model.train(doc_list) raises an error: it needs the arguments total_examples and epochs. When I passed total_examples=x.corpus_count and epochs=x.epochs with x set to train_model (corpus_count is 100000 and epochs is 5), it ran, but very slowly, and crashed partway through. Have you tried this code? – thomi dhia, Nov 20 '18 at 23:26
@thomidhia I edited my answer to reply to your comment. – Roberto, Nov 21 '18 at 6:25
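For reference, the error described in this comment goes away once train() gets its bookkeeping arguments from the model itself; the usual single-call form in gensim 3.x+ looks like this (assuming build_vocab() has already been run on train_model, as in the comment):

    train_model.train(doc_list,
                      total_examples=train_model.corpus_count,
                      epochs=train_model.epochs)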