Cassandra sequential repair does not repair all nodes on one run?



























Two days ago, I issued a full sequential repair on one node of my 5-node Cassandra cluster, for a single table, using the command below.



nodetool repair -full -seq -tr <keyspace> <table> > <logfile>


The node on which the command was issued was repaired properly, as can be inferred from the output of the command below:



nodetool cfstats -H <keyspace.columnFamily>


The same, however, cannot be said of the other nodes: for them the repaired percentage is a seemingly random value, significantly lower.



I am not sure what is happening here. It looks as if the only node that was repaired for the keyspace and column family was the node on which the repair command was issued. Any guesses as to what might be going on, or how to investigate the issue properly?



Thanks!


































































Tags: cassandra, datastax, scylla














asked Jan 2 at 12:56 by Naman Gupta, edited Jan 2 at 22:12 by Nadav Har'El
























          1 Answer
































You said your cluster has 5 nodes, but not which replication factor (RF) you are using for your table, so I'll assume you used the common RF=3. With RF=3, each piece of data is replicated 3 times across the five nodes.



The key point you have missed is that in such a setup, no single node contains all of the data. How much of the total data does one node hold? Let's do some simple math: if the amount of actual data inserted into the table is X, then the total amount of data stored by the cluster is 3*X (since RF=3, there are three copies of each piece of data). This total is spread across 5 nodes, so each node holds (3*X)/5, i.e., 3/5 of X.
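As a quick sanity check, here is that arithmetic as a short Python sketch (the 5 nodes and RF=3 are the assumed values from above):

# Back-of-the-envelope math: how much of the unique data one node holds,
# assuming tokens are distributed evenly across the cluster.
nodes = 5   # cluster size (from the question)
rf = 3      # assumed replication factor

per_node_fraction = rf / nodes   # each node stores rf/nodes of the data X
print(per_node_fraction)         # 0.6, i.e., 3/5 of X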



When you start a repair on one specific node, it repairs only the data that this node holds, i.e., as we just calculated, 3/5 of the total data. For each piece of data held by this node, the repair compares it against the copies held by the other replicas, finds the inconsistencies, and fixes all of those copies. So when the repair is over, all of the data on the node that initiated it has been repaired. On the other nodes, however, not all of the data has been repaired, just the parts that intersect with the data on the initiating node. That intersection should be roughly 3/5 * 3/5, or 36%, of each other node's data (of course everything is distributed randomly, so you're likely to get a number close to 36% but not exactly 36%).
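Extending the sketch above, the expected overlap with another node comes out the same way (again assuming RF=3 and random, even placement):

# The initiating node holds 3/5 of the data, and roughly 3/5 of any other
# node's data falls inside that slice, so a single-node repair touches
# about 3/5 * 3/5 of each other node's data.
nodes = 5
rf = 3

coverage = rf / nodes
overlap = coverage * coverage
print(f"{overlap:.0%}")   # 36%; real clusters will scatter around this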



So, as you have probably realized by now, this means that "nodetool repair" is not a cluster-wide operation. If you start it on one node, it is only guaranteed to repair all the data on that one node, and may repair less on the other nodes. You must therefore run the repair on each one of the nodes, separately.
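A minimal sketch of that procedure, assuming each node is reachable over SSH with nodetool on its path; the hostnames are placeholders, and the keyspace/table placeholders are the same unfilled ones from the question:

import subprocess

# Hypothetical hostnames; substitute the addresses of your 5 nodes.
hosts = ["node1", "node2", "node3", "node4", "node5"]

for host in hosts:
    # Run the same full sequential repair on every node, one at a time.
    # After all five runs, every node's data has been fully repaired.
    subprocess.run(
        ["ssh", host, "nodetool repair -full -seq <keyspace> <table>"],
        check=True,
    )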



Now you may be asking: since repairing node 1 also repaired 36% of node 2, wouldn't it be a waste to also fully repair node 2, given that we already did 36% of the work? Indeed, it would. For this reason Cassandra has a repair option, "-pr" ("primary range"), which ensures that only one of the 3 replicas of each piece of data repairs it. With RF=3, "nodetool repair -pr" will be three times faster than without "-pr". You still need to run it separately on each of the nodes, and when all nodes finish, your data will be 100% repaired on all nodes.
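Applied to the loop above, this just means adding "-pr" to every run; the same sketch with that change (placeholders as before):

import subprocess

hosts = ["node1", "node2", "node3", "node4", "node5"]

for host in hosts:
    # With -pr, each node repairs only the token ranges it owns as the
    # primary replica, so across the five runs every range is repaired
    # exactly once. Skipping a node would leave its primary ranges
    # unrepaired, so all nodes must still be covered.
    subprocess.run(
        ["ssh", host, "nodetool repair -full -pr <keyspace> <table>"],
        check=True,
    )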



All of this is fairly inconvenient, and it is also hard to recover from transient failures during a long repair. This is why both commercial Cassandra offerings, from DataStax and ScyllaDB, include a separate repair tool that is more convenient than "nodetool repair": it makes sure the entire cluster is repaired as efficiently as possible, and it recovers from transient problems without redoing the lengthy repair process from the beginning.








































answered Jan 2 at 16:36 by Nadav Har'El































