How to speed up the Nifi streaming logs to Kafka












I'm new to NiFi and am trying to read files and push them to Kafka. From some basic reading, I'm able to do that with the following flow: [screenshot of the flow]



With this flow I can achieve 0.5 million records/sec, each record about 100 KB. I would like to reach 2 million records/sec. Throughput from the ListFile and FetchFile processors through the SplitText processors is great, but it levels off at PublishKafka.



So the bottleneck is clearly PublishKafka. How do I improve its performance? Should I tune something on the Kafka side, or on the NiFi PublishKafka side?



Can someone help me with this? Thanks.










  • Where is the bottleneck in your flow?

    – daggett
    Nov 22 '18 at 7:37











  • @daggett Well, that's exactly what I'd like to know. Is it something to do with the splits or with PublishKafka?

    – srikanth
    Nov 22 '18 at 8:29











  • Provide a high load to your flow and check the queues between processors. Wherever a large amount of files/bytes is queued, the downstream processor is the bottleneck.

    – daggett
    Nov 22 '18 at 10:04











  • @daggett I have done some tests and updated the question. Please check.

    – srikanth
    Nov 22 '18 at 11:42
















apache-kafka apache-nifi







edited Nov 22 '18 at 11:41 by srikanth

asked Nov 22 '18 at 7:16 by srikanth














1 Answer

You can try using the record-oriented processors, i.e. the PublishKafkaRecord_1.0 processor.



Your flow then becomes:



1. ListFile
2. FetchFile
3. PublishKafkaRecord_1.0 (configure more than one concurrent task)


With this flow you don't use the SplitText processors at all; instead, you define RecordReader/Writer controller services on the PublishKafkaRecord processor.
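As a rough sketch of what that configuration might look like (the property names come from the PublishKafkaRecord_1.0 processor; the broker, topic, and reader/writer values below are placeholder assumptions to adapt, not recommendations):

```properties
# PublishKafkaRecord_1.0 — Properties tab (values are placeholders)
Kafka Brokers      = broker1:9092,broker2:9092,broker3:9092
Topic Name         = logs
Record Reader      = CSVReader             # controller service matching the input format
Record Writer      = JsonRecordSetWriter   # controller service for the format published to Kafka
Delivery Guarantee = Guarantee Replicated Delivery  # or "Best Effort" if raw throughput matters most

# Scheduling tab
Concurrent Tasks   = 4   # more than one, per the flow above
```

Because the processor reads and publishes whole record sets, it avoids creating one flowfile per line, which is typically where the SplitText-based flow loses throughput.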



In addition, you can distribute the load across the cluster by using Remote Process Groups.



Flow:



1. ListFile
2. RemoteProcessGroup
3. FetchFile
4. PublishKafkaRecord_1.0 (in the Scheduling tab, keep more than one concurrent task)


Refer to this link for more details on designing/configuring the above flow.



Starting with NiFi 1.8, we no longer need a RemoteProcessGroup to distribute the load, because connections (relationships) can be configured to load-balance across the cluster.



Refer to this and the NIFI-5516 links for more details on these additions in NiFi 1.8.
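Separately, on the Kafka side of the question: batching and compression usually matter most for producer throughput. NiFi's Kafka publish processors pass dynamic (user-added) properties through to the underlying Kafka producer client, so a sketch of values to experiment with might be (all numbers here are assumptions to tune against your own cluster, not recommendations):

```properties
# Dynamic properties on the PublishKafka(Record) processor,
# forwarded to the Kafka producer client (starting points to tune)
batch.size       = 262144    # bytes per partition batch; Kafka's default is 16384
linger.ms        = 50        # wait up to 50 ms to fill batches before sending
compression.type = snappy    # modest CPU cost for much smaller network payloads
acks             = 1         # leader-only ack; use "all" if durability beats throughput
buffer.memory    = 67108864  # total producer-side buffer (64 MB)
```

Larger batches plus a small linger amortize the per-request overhead that tends to cap per-record publish rates.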






answered Nov 22 '18 at 18:45 by Shu
  • That's great info. Testing this out; I will post my update. Thanks.

    – srikanth
    Nov 26 '18 at 7:40











  • I don't see a difference by introducing RemoteProcessGroup. How different is ListFiles->FetchFiles->Kafka from ListFiles->RemoteProcessGroup->FetchFiles->Kafka, when FetchFiles and ListFiles are configured to run on all nodes instead of the primary?

    – srikanth
    Nov 26 '18 at 11:25













  • @srikanth, 1. You need to run all List processors on the primary node only, so right now all flowfiles are on the primary node. 2. To distribute the work across all nodes we need to use a RemoteProcessGroup (RPG); if we don't use an RPG, all the work is done by the primary node even if the Fetch processors run on all nodes. 3. By using an RPG we distribute the work across the cluster, and if you increase Concurrent Tasks you get the maximum performance out of the NiFi cluster.

    – Shu
    Nov 27 '18 at 2:52











  • I have a 3-node NiFi cluster (say A, B, C), with B as primary. I configured the RPG with A:8080/nifi, ListFile on the primary node and FetchFile on 'All nodes'. When I run it, I only get data from one node at a time: on the 1st run data is read from A, on the 2nd run from B, on the 3rd run from C (I start and stop the flow each time). Is this the way it should work?

    – srikanth
    Nov 27 '18 at 7:33













  • @srikanth, yes, that's the expected behavior of an RPG. Even though you configured the RPG with node A's details, the RPG distributes work to all nodes in the cluster based on how much load each node has at the time, balancing dynamically. community.hortonworks.com/articles/16120/…

    – Shu
    Nov 30 '18 at 14:03












