Nutch URLs not fetched
I am trying to crawl some URLs from a local website on this domain:
https://foo.foofoo.com
However, some specific URLs, like the ones below, never get fetched. Nutch generates them for fetching but then skips them:
https://foo.foofoo.com/foo/foo/foo/foo-a-foo-foofoo-foo-foo-foofoo-foo-foofoo
https://foo.foofoo.com/foo/00550000006yDdKAAU/foofoo/foo-foo-foo-foofoo-foo-foo
https://foo.foofoo.com/foo/foo/foo/foofoo-foo-foofoo-foo-foo/foofoo-a-foo-foofoo-foofoo?foo_id=foo-fi-and-foo-fafoo-fa
Only a few URLs, like this one, get fetched:
https://foo.foofoo.com/en/foofoo
Here is my regex-urlfilter file, which I use to fetch only the English pages:
-^(file|ftp|mailto):
-^https?://foo.foofoo.com/(de|ja|fr|es-MX|pt-BR)
+^https?://foo.foofoo.com
Any ideas, please?
java regex filter web-crawler nutch
What depth do you use?
– Quent
yesterday
Try this:
-^(?:https?://)?foo.foofoo.com/(?:de|ja|fr|es-MX|pt-BR)
+^(?:https?://)?foo.foofoo.com(?:/.*|.*)
– Quent
yesterday
Still the same, but I think your version is better written than mine. As for the depth, it is 10.
– Oppa pi
yesterday
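To check whether the filter rules from the question could be what rejects those URLs, here is a minimal Java sketch. It is not Nutch's own code. It assumes the rules are tried from top to bottom, that the first rule whose pattern is found in the URL decides, and that a URL matching no rule is rejected. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Sketch only: imitates "first matching rule decides" filtering.
// This is an assumption about the behaviour, not Nutch's implementation.
class UrlFilterSketch {

    // Rules copied from the question, in file order.
    static final String[] RULES = {
        "-^(file|ftp|mailto):",
        "-^https?://foo.foofoo.com/(de|ja|fr|es-MX|pt-BR)",
        "+^https?://foo.foofoo.com"
    };

    // Returns true if the first rule found in the URL starts with '+'.
    // A URL that matches no rule at all is rejected.
    static boolean accepts(String url) {
        for (String rule : RULES) {
            boolean accept = rule.charAt(0) == '+';
            Pattern pattern = Pattern.compile(rule.substring(1));
            if (pattern.matcher(url).find()) {
                return accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] urls = {
            "https://foo.foofoo.com/foo/foo/foo/foo-a-foo-foofoo-foo-foo-foofoo-foo-foofoo",
            "https://foo.foofoo.com/foo/00550000006yDdKAAU/foofoo/foo-foo-foo-foofoo-foo-foo",
            "https://foo.foofoo.com/en/foofoo",
            "https://foo.foofoo.com/de/foofoo"   // hypothetical German page
        };
        for (String url : urls) {
            System.out.println(accepts(url) + "  " + url);
        }
    }
}
```

Under these assumptions the URLs from the question are accepted and only the hypothetical `/de/` page is rejected, so the rules themselves do not look like the reason the pages are skipped. One side effect worth knowing: the language rule has no trailing `/`, so a path that merely begins with one of the codes (for example a hypothetical `/free`) would be rejected as well.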
asked yesterday by Oppa pi (edited yesterday)
1 Answer
After removing some plugins that I do not need for my use case, everything worked again. The plugins I removed are nutch-extensionpoints, parse-text and query-(basic|site|url).
Cool. For me, the following works, so maybe you do not need to remove as many:
protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata|more|tika)|query-(basic|site|url)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)
– Quent
yesterday
I see. Thank you @Quent ! :)
– Oppa pi
yesterday
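The list quoted in the comment above looks like a value for Nutch's `plugin.includes` property, which is a regular expression over plugin names. As a rough sketch of which plugins such a value would select, the Java snippet below tests a few plugin ids against it. It assumes a plugin is enabled only when its whole id matches the expression. That assumption and the class name are mine, not something stated in the thread.

```java
import java.util.regex.Pattern;

// Sketch only: checks plugin ids against the expression from the comment.
// Assumption: an id must match the whole expression to be enabled.
class PluginIncludesSketch {

    // Value quoted in the comment, split across lines for readability.
    static final Pattern INCLUDES = Pattern.compile(
        "protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)"
        + "|index-(basic|anchor|metadata|more|tika)|query-(basic|site|url)"
        + "|summary-basic|scoring-opic|indexer-solr"
        + "|urlnormalizer-(pass|regex|basic)");

    static boolean enabled(String pluginId) {
        return INCLUDES.matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] ids = {
            "parse-html", "parse-text", "query-basic",
            "nutch-extensionpoints", "urlfilter-regex"
        };
        for (String id : ids) {
            System.out.println(id + " -> " + enabled(id));
        }
    }
}
```

With that assumption, `parse-text` and `nutch-extensionpoints` are not selected by the quoted value, while the `query-*` plugins are. This fits the comment's point that the query plugins did not have to be removed.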
answered yesterday by Oppa pi