How to speed up NiFi streaming of logs to Kafka
I'm new to NiFi and trying to read files and push them to Kafka. From some basic reading, I'm able to do that with the following flow.
With this flow I'm able to achieve 0.5 million records/sec, each 100 KB in size. I would like to catch up to a rate of 2 million records/sec. Data moves quickly from the ListFile and FetchFile processors through the SplitText processors, but throughput settles at PublishKafka.
So clearly the bottleneck is PublishKafka. How do I improve this performance? Should I tune something on the Kafka end or on the NiFi PublishKafka end?
Can someone help me with this? Thanks.
apache-kafka apache-nifi
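For context on the Kafka side of the question: PublishKafka builds a standard Kafka producer under the hood, and dynamic properties on the processor are passed through to it, so the usual producer throughput settings are the ones in play. A minimal plain-Java sketch of those settings, assuming a hypothetical broker address and topic name:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ThroughputTunedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Throughput-oriented producer settings. PublishKafka exposes equivalents
        // through its own properties (Delivery Guarantee, Compression Type) and
        // dynamic properties (e.g. batch.size, linger.ms, buffer.memory).
        props.put(ProducerConfig.ACKS_CONFIG, "1");                  // don't wait for all replicas
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy"); // cheaper network/disk I/O
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 262144);         // larger batches per partition
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);              // wait briefly to fill batches
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 67108864L);   // 64 MB send buffer

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("logs", "one ~100 KB log record")); // placeholder topic
        }
    }
}
```

On the NiFi side the matching levers are the processor's Delivery Guarantee, Compression Type and Concurrent Tasks settings, plus dynamic properties for any other producer config.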
Where is the bottleneck in your flow? – daggett Nov 22 '18 at 7:37
@daggett Well, that's the point I would like to know. Something to do with the splits or PublishKafka? – srikanth Nov 22 '18 at 8:29
Provide a high load to your flow and check the queues between processors. Wherever a larger amount of files/bytes is queued, the target processor is the bottleneck. – daggett Nov 22 '18 at 10:04
@daggett I have done some tests. Updated the question. Please check. – srikanth Nov 22 '18 at 11:42
1 Answer
You can try using record-oriented processors, i.e. the PublishKafkaRecord_1.0 processor.
Your flow will then be:
1. ListFile
2. FetchFile
3. PublishKafkaRecord_1.0 //Configure with more than one concurrent task
With this flow we are not going to use any SplitText processors; instead, define RecordReader/Writer controller services in the PublishKafkaRecord processor.
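As a rough illustration of what the record-oriented approach buys (this is not NiFi code, just a plain-Java analogue with a placeholder file path and topic): every line of a file flows through one batched producer, rather than first being split into one flowfile per line and published individually.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;
import java.util.stream.Stream;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class RecordOrientedPublish {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             Stream<String> lines = Files.lines(Paths.get("/data/logs/app.log"))) { // placeholder path
            // Every record of the file goes through the same producer, which batches
            // them per partition; there is no per-record split step, which is what
            // dropping SplitText achieves inside NiFi.
            lines.forEach(line -> producer.send(new ProducerRecord<>("logs", line))); // placeholder topic
        }
    }
}
```

Inside NiFi, the RecordReader/Writer controller services (for example a CSV or JSON reader) play the role of that per-line loop within a single PublishKafkaRecord task.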
In addition, you can distribute the load by using Remote Process Groups.
Flow:
1. ListFile
2. RemoteProcessGroup
3. FetchFile
4. PublishKafkaRecord_1.0 //In the Scheduling tab keep more than one concurrent task
Refer to this link for more details on designing/configuring the above flow.
Starting from NiFi 1.8 we no longer need to use a RemoteProcessGroup to distribute the load, as we can configure Connections (relationships) to load-balance instead.
Refer to this and the NiFi-5516 links for more details on these additions in NiFi 1.8.
That's great info. Testing this out; I will post my update. Thanks. – srikanth Nov 26 '18 at 7:40
I don't see a difference from introducing RemoteProcessGroup. How is ListFile -> FetchFile -> Kafka different from ListFile -> RemoteProcessGroup -> FetchFile -> Kafka, when FetchFile and ListFile are configured to run on all nodes instead of the primary? – srikanth Nov 26 '18 at 11:25
@srikanth, 1. You need to run all List processors on the primary node only, so right now all flowfiles are on the primary node. 2. To distribute the work across all nodes we need to use a RemoteProcessGroup (RPG); if we don't use an RPG, all the work is done by the primary node only, even if you are running Fetch processors on all nodes. 3. By using an RPG we distribute the work across the cluster, and if you increase the Concurrent Tasks we get the maximum performance out of the NiFi cluster. – Shu Nov 27 '18 at 2:52
I have a 3-node NiFi cluster (say A, B, C), with B as primary. I configured the RPG with A:8080/nifi, ListFile on the primary node and FetchFile on 'All nodes'. When I run, I only get data from one node at a time: on the 1st run data is read from A, on the 2nd run from B, on the 3rd run from C (I start and stop the flow each time). Is this the way it should work? – srikanth Nov 27 '18 at 7:33
@srikanth, yes, that's the expected behavior from an RPG. Even though you have configured the RPG with node A's details, the RPG will still distribute the work to all the nodes in the cluster based on how much load is on each node at the time, distributing the load dynamically. community.hortonworks.com/articles/16120/… – Shu Nov 30 '18 at 14:03