nutch urls not fetched























I am trying to crawl some URLs from a local website on this domain:



https://foo.foofoo.com


However, Nutch does not fetch certain URLs such as the ones below. It generates them for fetching, but then skips them:



https://foo.foofoo.com/foo/foo/foo/foo-a-foo-foofoo-foo-foo-foofoo-foo-foofoo
https://foo.foofoo.com/foo/00550000006yDdKAAU/foofoo/foo-foo-foo-foofoo-foo-foo
https://foo.foofoo.com/foo/foo/foo/foofoo-foo-foofoo-foo-foo/foofoo-a-foo-foofoo-foofoo?foo_id=foo-fi-and-foo-fafoo-fa


Only a few URLs, like this one, actually get fetched:



https://foo.foofoo.com/en/foofoo


Here is my regex-urlfilter file, which I use to fetch only English pages:



-^(file|ftp|mailto):
-^https?://foo.foofoo.com/(de|ja|fr|es-MX|pt-BR)
+^https?://foo.foofoo.com
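To check whether these rules could be what rejects the skipped URLs, here is a minimal Python sketch that simulates the filter. It is not Nutch code. It assumes that rules are tried top to bottom, that the first rule whose pattern is found in the URL decides the outcome, and that a URL matching no rule is rejected. The function name `accepts` is hypothetical.

```python
import re

# Rules copied from the regex-urlfilter file above, as (sign, pattern) pairs.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"^https?://foo.foofoo.com/(de|ja|fr|es-MX|pt-BR)"),
    ("+", r"^https?://foo.foofoo.com"),
]


def accepts(url, rules=RULES):
    """Return True if the first matching rule is a '+' rule.

    Assumption: first match wins, and a URL matching no rule is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


urls = [
    "https://foo.foofoo.com/foo/00550000006yDdKAAU/foofoo/foo-foo-foo-foofoo-foo-foo",
    "https://foo.foofoo.com/en/foofoo",
    "https://foo.foofoo.com/de/foofoo",
]
for url in urls:
    print(accepts(url), url)
```

Under these assumptions the skipped URLs pass the filter just like the `/en/` one, which would suggest the cause lies somewhere other than these three rules.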


Any ideas, please?






























  • What depth do you use?
    – Quent
    yesterday










  • Try this:
      -^(?:https?://)?foo.foofoo.com/(?:de|ja|fr|es-MX|pt-BR)
      +^(?:https?://)?foo.foofoo.com(?:/.*|.*)
    – Quent
    yesterday












  • Still the same, though I think your version is better written than mine. As for depth, it is 10.
    – Oppa pi
    yesterday






















java regex filter web-crawler nutch














asked yesterday by Oppa pi (edited yesterday)
















1 Answer






























After I removed some plugins that were unnecessary for my use case, everything worked again. The plugins I removed were nutch-extensionpoints, parse-text and query-(basic|site|url).
























  • Cool. For what it's worth, I use the following and it works, so maybe you do not need to remove as many: protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata|more|tika)|query-(basic|site|url)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)
    – Quent
    yesterday










  • I see. Thank you @Quent ! :)
    – Oppa pi
    yesterday
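The plugin list quoted in the comment above reads like a value for Nutch's `plugin.includes` property, which is a regular expression over plugin ids. The following minimal Python sketch shows which plugin ids such an expression would select. It assumes the expression must match the whole plugin id, and it does not model how Nutch resolves dependencies between plugins. The helper name `is_included` is hypothetical.

```python
import re

# Expression quoted in the comment above; assumed to be a plugin.includes value.
PLUGIN_INCLUDES = (
    r"protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)"
    r"|index-(basic|anchor|metadata|more|tika)|query-(basic|site|url)"
    r"|summary-basic|scoring-opic|indexer-solr"
    r"|urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_id, includes=PLUGIN_INCLUDES):
    """Return True if the whole plugin id matches the expression.

    Assumption: a partial match (e.g. a prefix) does not count.
    """
    return re.fullmatch(includes, plugin_id) is not None


for plugin_id in ["parse-html", "parse-text", "nutch-extensionpoints", "query-basic"]:
    print(plugin_id, is_included(plugin_id))
```

With this expression, parse-text and nutch-extensionpoints are not selected, while query-basic still is, so the list differs from the set of plugins removed in the answer.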




















answered yesterday by Oppa pi



