How can I filter a bunch of non-labeled article data using my weak model, based on word similarity, in Python?












I have 9000 samples of non-labeled articles that I want to label with a binary class (0 and 1).

Additionally, I have 500 labeled samples belonging to the positive class (label=1) and no samples for the negative class (label=0).

I know it's impossible to label the 9000 samples with 0 and 1 using a model trained only on the 500 positive samples.

So I would like to implement a "similarity" approach: classify the 9000 samples based on their word similarity to the 500 positive samples, extract the similar documents from the 9000 and label them 1, and label the rest of the 9000 as class 0.

So the question: is it possible to filter the data this way? If so, how can I filter it by word similarity in Python?

Thank you for your answer; I hope I find a solution. :)










python filter scikit-learn deep-learning nlp






asked Nov 20 '18 at 3:41 by thomi dhia, edited Nov 20 '18 at 8:56 by Roberto








  • It depends on what you consider to be 'similar' data. You could build a trie of the labeled data, and then compare your unlabeled data against it, marking all data that is 'different enough' from the trie data as 0.
    – DMarczak, Nov 20 '18 at 3:47
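
A minimal sketch of that trie suggestion, assuming 'similar' is approximated by vocabulary overlap with the positive documents (positive_docs, candidate_article, and the 0.6 cutoff are hypothetical placeholders to adapt):

def build_trie(words):
    # Index each word in a nested-dict character trie
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = True  # end-of-word marker
    return root

def contains(trie, word):
    # True only if this exact word was inserted into the trie
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return '$' in node

positive_trie = build_trie(w for doc in positive_docs for w in doc.lower().split())

def overlap_score(article, trie):
    # Fraction of the article's words found in the positive vocabulary
    words = article.lower().split()
    return sum(contains(trie, w) for w in words) / len(words) if words else 0.0

# Articles 'different enough' from the positive data get class 0
label = 1 if overlap_score(candidate_article, positive_trie) >= 0.6 else 0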














1 Answer
































Yes, it is possible. You can use doc2vec (I suggest the gensim library for Python) to build a vector space from the words in your 500 positive documents. Using that representation, you can query the similarity between a new sample (one of your 9000 samples) and your corpus (the 500 positive samples). If you consider the similarity "similar enough", you can label the sample as 1.
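
As a rough sketch of that idea (this assumes the current gensim 4.x API; positive_texts and unlabeled_texts are hypothetical lists of raw article strings, and the 0.4 cutoff is an arbitrary threshold you would tune on a held-out sample):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tag each of the 500 positive articles so doc2vec learns one vector per document
tagged = [TaggedDocument(words=text.lower().split(), tags=[str(i)])
          for i, text in enumerate(positive_texts)]

model = Doc2Vec(vector_size=100, min_count=2, epochs=30)
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

labels = []
for text in unlabeled_texts:
    # Infer a vector for the unseen article and compare it with the positives
    vec = model.infer_vector(text.lower().split())
    _, top_sim = model.dv.most_similar([vec], topn=1)[0]  # closest positive doc
    labels.append(1 if top_sim >= 0.4 else 0)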



For a nice tutorial and code, refer to:
https://markroxor.github.io/gensim/static/notebooks/doc2vec-IMDB.html

You can skip the "Predictive Evaluation Methods" section; probably the most interesting section for you is "Do close documents seem more related than distant ones?"



EDIT (answer to the comment): Yes, I used the code some time ago (I don't remember whether I ran into errors). The implementation I used is below. Please note that I ran it on a machine with 8 cores.



import multiprocessing

from gensim.models import Word2Vec
from gensim.models.doc2vec import TaggedDocument
from sklearn import utils
from tqdm import tqdm

def labelize_tweets_ug(tweets, label):
    # Wrap each document in a TaggedDocument with a unique tag such as 'all_0'
    result = []
    prefix = label
    for i, t in zip(tweets.index, tweets):
        result.append(TaggedDocument(t.split(), [prefix + '_%s' % i]))
    return result

# all_x is a pandas Series of documents (tweets, in my case)
all_x_w2v = labelize_tweets_ug(all_x, 'all')
cores = multiprocessing.cpu_count()

# CBOW Word2Vec (sg=0); note that gensim 4.x renamed 'size' to 'vector_size'
model_ug_cbow = Word2Vec(sg=0, size=100, negative=5, window=2, min_count=2,
                         workers=cores, alpha=0.065, min_alpha=0.065)
model_ug_cbow.build_vocab([x.words for x in tqdm(all_x_w2v)])

# 30 manual passes, shuffling the corpus and decaying the learning rate each time
for epoch in range(30):
    model_ug_cbow.train(utils.shuffle([x.words for x in tqdm(all_x_w2v)]),
                        total_examples=len(all_x_w2v), epochs=1)
    model_ug_cbow.alpha -= 0.002
    model_ug_cbow.min_alpha = model_ug_cbow.alpha
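
Once a model like the one above is trained, one possible way to turn it into the 0/1 filter you describe is to compare word sets directly (a sketch against the gensim 3.x API used above; in gensim 4.x, test membership with model_ug_cbow.wv.key_to_index instead of model_ug_cbow.wv.vocab; positive_article, candidate_article, and the 0.5 cutoff are hypothetical):

def known_words(text, model):
    # n_similarity raises a KeyError on out-of-vocabulary words,
    # so keep only the tokens the model has actually seen
    return [w for w in text.split() if w in model.wv.vocab]

positive_words = known_words(positive_article, model_ug_cbow)
candidate_words = known_words(candidate_article, model_ug_cbow)

# Cosine similarity between the mean word vectors of the two texts
score = model_ug_cbow.wv.n_similarity(positive_words, candidate_words)
label = 1 if score >= 0.5 else 0  # arbitrary cutoff to tune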





answered Nov 20 '18 at 7:20 by Roberto, edited Nov 21 '18 at 6:24
  • In the training part, train_model.train(doc_list) raises an error: it needs the arguments total_examples=x.corpus_count and epochs=x.epochs. When I set x to train_model (corpus_count is 100000 and epochs is 5), it runs but very slowly, and it fails in the middle of the process. Have you tried this code?
    – thomi dhia, Nov 20 '18 at 23:26













  • @thomidhia I edited my answer to reply to your comment.
    – Roberto, Nov 21 '18 at 6:25










