How can I filter a bunch of non-labeled article data using my weak model, based on word similarity, in Python?












I have 9000 samples of non-labeled articles that I want to label with a binary class (0 and 1).

Additionally, I have 500 labeled samples belonging to the positive class (label=1) and no samples for the negative class (label=0).

I know it's impossible to label the 9000 samples with 0 and 1 using a model trained only on the 500 positive samples.

So I would like to implement a "similarity" approach: classify the 9000 samples based on their word similarity to the 500 positive samples, extract the similar documents from the 9000 and label them 1, and label the rest of the 9000 as class 0.

So the question: is it possible to filter the data this way? If so, how can I filter it by word similarity in Python?

Thank you for your answer; I hope I find a solution. :)










python filter scikit-learn deep-learning nlp






asked Nov 20 '18 at 3:41 by thomi dhia, edited Nov 20 '18 at 8:56 by Roberto








  • It depends on what you consider to be 'similar' data. You could build a trie of the labeled data, and then compare your unlabeled data against it, marking all data that is 'different enough' from the trie data as 0.
    – DMarczak, Nov 20 '18 at 3:47
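
A minimal sketch of that trie suggestion, assuming 'similar' is approximated by vocabulary overlap with the positive documents (positive_docs, candidate_article, and the 0.6 cutoff are hypothetical placeholders to adapt):

def build_trie(words):
    # Index each word in a nested-dict character trie
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = True  # end-of-word marker
    return root

def contains(trie, word):
    # True only if this exact word was inserted into the trie
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return '$' in node

positive_trie = build_trie(w for doc in positive_docs for w in doc.lower().split())

def overlap_score(article, trie):
    # Fraction of the article's words found in the positive vocabulary
    words = article.lower().split()
    return sum(contains(trie, w) for w in words) / len(words) if words else 0.0

# Articles 'different enough' from the positive data get class 0
label = 1 if overlap_score(candidate_article, positive_trie) >= 0.6 else 0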














1 Answer
































Yes, it is possible. You can use doc2vec (I suggest the gensim library for Python) to build a vector space from the words in your 500 positive documents. Using that representation, you can query the similarity between a new sample (one of your 9000 samples) and your corpus (the 500 positive samples). If you consider the similarity "similar enough", you can label the sample as 1.
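
As a rough sketch of that idea (this assumes the current gensim 4.x API; positive_texts and unlabeled_texts are hypothetical lists of raw article strings, and the 0.4 cutoff is an arbitrary threshold you would tune on a held-out sample):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tag each of the 500 positive articles so doc2vec learns one vector per document
tagged = [TaggedDocument(words=text.lower().split(), tags=[str(i)])
          for i, text in enumerate(positive_texts)]

model = Doc2Vec(vector_size=100, min_count=2, epochs=30)
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

labels = []
for text in unlabeled_texts:
    # Infer a vector for the unseen article and compare it with the positives
    vec = model.infer_vector(text.lower().split())
    _, top_sim = model.dv.most_similar([vec], topn=1)[0]  # closest positive doc
    labels.append(1 if top_sim >= 0.4 else 0)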



For a nice tutorial and code, refer to:
https://markroxor.github.io/gensim/static/notebooks/doc2vec-IMDB.html

You can skip the "Predictive Evaluation Methods" section; probably the most interesting section for you is "Do close documents seem more related than distant ones?"



EDIT (answer to the comment): Yes, I used the code some time ago (I don't remember whether I ran into errors). The implementation I used is below. Please note that I ran it on a machine with 8 cores.



import multiprocessing

from gensim.models import Word2Vec
from gensim.models.doc2vec import TaggedDocument
from sklearn import utils
from tqdm import tqdm

def labelize_tweets_ug(tweets, label):
    # Wrap each document in a TaggedDocument with a unique tag such as 'all_0'
    result = []
    prefix = label
    for i, t in zip(tweets.index, tweets):
        result.append(TaggedDocument(t.split(), [prefix + '_%s' % i]))
    return result

# all_x is a pandas Series of documents (tweets, in my case)
all_x_w2v = labelize_tweets_ug(all_x, 'all')
cores = multiprocessing.cpu_count()

# CBOW Word2Vec (sg=0); note that gensim 4.x renamed 'size' to 'vector_size'
model_ug_cbow = Word2Vec(sg=0, size=100, negative=5, window=2, min_count=2,
                         workers=cores, alpha=0.065, min_alpha=0.065)
model_ug_cbow.build_vocab([x.words for x in tqdm(all_x_w2v)])

# 30 manual passes, shuffling the corpus and decaying the learning rate each time
for epoch in range(30):
    model_ug_cbow.train(utils.shuffle([x.words for x in tqdm(all_x_w2v)]),
                        total_examples=len(all_x_w2v), epochs=1)
    model_ug_cbow.alpha -= 0.002
    model_ug_cbow.min_alpha = model_ug_cbow.alpha
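
Once a model like the one above is trained, one possible way to turn it into the 0/1 filter you describe is to compare word sets directly (a sketch against the gensim 3.x API used above; in gensim 4.x, test membership with model_ug_cbow.wv.key_to_index instead of model_ug_cbow.wv.vocab; positive_article, candidate_article, and the 0.5 cutoff are hypothetical):

def known_words(text, model):
    # n_similarity raises a KeyError on out-of-vocabulary words,
    # so keep only the tokens the model has actually seen
    return [w for w in text.split() if w in model.wv.vocab]

positive_words = known_words(positive_article, model_ug_cbow)
candidate_words = known_words(candidate_article, model_ug_cbow)

# Cosine similarity between the mean word vectors of the two texts
score = model_ug_cbow.wv.n_similarity(positive_words, candidate_words)
label = 1 if score >= 0.5 else 0  # arbitrary cutoff to tune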





answered Nov 20 '18 at 7:20 by Roberto, edited Nov 21 '18 at 6:24
  • In the training part, train_model.train(doc_list) raises an error: it needs the arguments total_examples=x.corpus_count and epochs=x.epochs. When I set x to train_model (corpus_count is 100000 and epochs is 5), it runs but very slowly, and it fails in the middle of the process. Have you tried this code?
    – thomi dhia, Nov 20 '18 at 23:26













  • @thomidhia I edited my answer to reply to your comment.
    – Roberto, Nov 21 '18 at 6:25










