How to fill NA in R for quasi-same row?












0















I'm looking for a way to fillNA in duplicated() rows. There are totally same rows and at one time there is a NA, so I decide to fill this one by value of complete row but I don't see how to deal with it.



Using the duplicated() function, I could have a data frame like that:



 df <- data.frame(
Year = rnorm(5),
hour = rnorm(5),
LOT = rnorm(5),
S123_AA = c('ABF4576','ABF4576','ABF4576','ABF4576','ABF4576'),
S135_AA = c('ABF5403',NA,'ABF5403','ABF5403','ABF5403'),
S13_BB = c('BF50343','BF50343','BF50343','BF50343',NA),
S1763_BB = c('AA3489','AA3489','AA3489','AA3489','AA3489'),
S173_BB = c('BQA0478','BQA0478','BQA0478','BQA0478','BQA0478'),
S234543 = c('AD4352','AD4352','AD4352','AD4352','AD4352'),
S1265UU5 = c('AZERTY', 'AZERTY', 'AZERTY', 'AZERTY','AZERTY')
)


The rows are similar, so how could I feel the NA by the value of the preceding raw (which is not an NA) ? There is no complete.cases()rows.










share|improve this question





























    0















    I'm looking for a way to fillNA in duplicated() rows. There are totally same rows and at one time there is a NA, so I decide to fill this one by value of complete row but I don't see how to deal with it.



    Using the duplicated() function, I could have a data frame like that:



     df <- data.frame(
    Year = rnorm(5),
    hour = rnorm(5),
    LOT = rnorm(5),
    S123_AA = c('ABF4576','ABF4576','ABF4576','ABF4576','ABF4576'),
    S135_AA = c('ABF5403',NA,'ABF5403','ABF5403','ABF5403'),
    S13_BB = c('BF50343','BF50343','BF50343','BF50343',NA),
    S1763_BB = c('AA3489','AA3489','AA3489','AA3489','AA3489'),
    S173_BB = c('BQA0478','BQA0478','BQA0478','BQA0478','BQA0478'),
    S234543 = c('AD4352','AD4352','AD4352','AD4352','AD4352'),
    S1265UU5 = c('AZERTY', 'AZERTY', 'AZERTY', 'AZERTY','AZERTY')
    )


    The rows are similar, so how could I feel the NA by the value of the preceding raw (which is not an NA) ? There is no complete.cases()rows.










    share|improve this question



























      0












      0








      0








      I'm looking for a way to fillNA in duplicated() rows. There are totally same rows and at one time there is a NA, so I decide to fill this one by value of complete row but I don't see how to deal with it.



      Using the duplicated() function, I could have a data frame like that:



       df <- data.frame(
      Year = rnorm(5),
      hour = rnorm(5),
      LOT = rnorm(5),
      S123_AA = c('ABF4576','ABF4576','ABF4576','ABF4576','ABF4576'),
      S135_AA = c('ABF5403',NA,'ABF5403','ABF5403','ABF5403'),
      S13_BB = c('BF50343','BF50343','BF50343','BF50343',NA),
      S1763_BB = c('AA3489','AA3489','AA3489','AA3489','AA3489'),
      S173_BB = c('BQA0478','BQA0478','BQA0478','BQA0478','BQA0478'),
      S234543 = c('AD4352','AD4352','AD4352','AD4352','AD4352'),
      S1265UU5 = c('AZERTY', 'AZERTY', 'AZERTY', 'AZERTY','AZERTY')
      )


      The rows are similar, so how could I feel the NA by the value of the preceding raw (which is not an NA) ? There is no complete.cases()rows.










      share|improve this question
















      I'm looking for a way to fillNA in duplicated() rows. There are totally same rows and at one time there is a NA, so I decide to fill this one by value of complete row but I don't see how to deal with it.



      Using the duplicated() function, I could have a data frame like that:



       df <- data.frame(
      Year = rnorm(5),
      hour = rnorm(5),
      LOT = rnorm(5),
      S123_AA = c('ABF4576','ABF4576','ABF4576','ABF4576','ABF4576'),
      S135_AA = c('ABF5403',NA,'ABF5403','ABF5403','ABF5403'),
      S13_BB = c('BF50343','BF50343','BF50343','BF50343',NA),
      S1763_BB = c('AA3489','AA3489','AA3489','AA3489','AA3489'),
      S173_BB = c('BQA0478','BQA0478','BQA0478','BQA0478','BQA0478'),
      S234543 = c('AD4352','AD4352','AD4352','AD4352','AD4352'),
      S1265UU5 = c('AZERTY', 'AZERTY', 'AZERTY', 'AZERTY','AZERTY')
      )


      The rows are similar, so how could I feel the NA by the value of the preceding raw (which is not an NA) ? There is no complete.cases()rows.







      r duplicates na






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Jan 2 at 22:16







      Alex Germain

















      asked Jan 2 at 15:42









      Alex GermainAlex Germain

      798




      798
























          3 Answers
          3






          active

          oldest

          votes


















          1














          You could loop through the data and find the first none NA value and replace the NA values with that value



          # Loop through the data
          for(c in 1:ncol(df)) {
          vals <- df[,c]
          noneNA <- vals[!is.na(vals)][1]
          vals[is.na(vals)] <- noneNA
          df[,c] <- vals
          }


          Or alternatively you could review your data element by element and take a none NA value from either above or below the relevant cell using nested for loops.



          for(c in 1:ncol(df)) {
          for(r in 1:nrow(df)) {
          if (is.na(df[r,c])) {
          nearVals <- df[c(r-1, r+1),c]
          noneNA <- nearVals[!is.na(nearVals)][1]
          df[r,c] <- noneNA
          }
          }
          }





          share|improve this answer


























          • Thanks for your answer, my problem is that there is no complete.cases() rows. The value could be taken on the previous or on the third following rows

            – Alex Germain
            Jan 2 at 22:15













          • I updated my answer based on your comment.

            – MatAff
            Jan 2 at 22:36











          • Thanks for your help, I just change the (r-1, r+1) part. Do you know any function to 'replace' this subset in a full data frame ? here it's only similar row and I want to reintegrate it in the full df given 5 or more keys for a row.

            – Alex Germain
            Jan 2 at 23:58











          • The code here loops through the full dataset. The statement df[r,c] <- ... overwrites the value currently in the data frame. What else are you looking to replace?

            – MatAff
            Jan 3 at 3:01











          • My df here is a subset of a full other one. Here it's just similar rows.

            – Alex Germain
            Jan 3 at 10:19



















          1














          reading your question made me think of an imputation problem for the dataframe.



          In other terms you need to fill the NAs with some sort of value to be able to "save" records in the dataframe. The simplest way is to select the value of a particular column by searching the mean (when dealing with cardinal values) or the mode (when dealing with categorical values) [you may also execute a regression, but I guess it's a more complex method].



          In this case we may choose the mode replacement because the attributes are categorical. By running your code we obtain the dataframe df:



                   Year       hour         LOT S123_AA S135_AA  S13_BB S1763_BB S173_BB S234543 S1265UU5
          1 -0.32837526 0.7930541 -1.10954824 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
          2 0.55379245 -0.7320060 -0.95088434 ABF4576 <NA> BF50343 AA3489 BQA0478 AD4352 AZERTY
          3 0.36442118 0.9920967 -0.07345038 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
          4 -0.02546781 -0.1127828 -1.78241434 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
          5 1.92550340 -1.0531371 0.88318695 ABF4576 ABF5403 <NA> AA3489 BQA0478 AD4352 AZERTY


          We can then create a function to calculate the mode of a particular column:



          getmode <- function(v) {
          uniqv <- unique(v)
          uniqv[which.max(tabulate(match(v, uniqv)))]
          }


          And then use it to fill the missing values. Below the code to impute the missing values for the column S135_AA (I created a new dataframe named workdf) :



          workdf <- df
          workdf[is.na(workdf$S135_AA),c('S135_AA')] <- getmode(workdf[,'S135_AA'])


          This is the output where you can see that the column S135_AA NAs took the most recurring value of the colum:



                   Year       hour         LOT S123_AA S135_AA  S13_BB S1763_BB S173_BB S234543 S1265UU5
          1 -0.32837526 0.7930541 -1.10954824 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
          2 0.55379245 -0.7320060 -0.95088434 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
          3 0.36442118 0.9920967 -0.07345038 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
          4 -0.02546781 -0.1127828 -1.78241434 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
          5 1.92550340 -1.0531371 0.88318695 ABF4576 ABF5403 <NA> AA3489 BQA0478 AD4352 AZERTY


          If your objective was data cleaning I guess that you should use an imputation method to deal with it.






          share|improve this answer
























          • Thanks for your really explicit answer. Just a question, here it's a subset where all rows was similar, but is there anyway to deal with NA on the full DF ? where there is several different group of 'same' raws ? How the mode could work on it ?

            – Alex Germain
            Jan 3 at 0:04











          • As an example: you're considering the situation where the S135_AA contained different values and not only "ABF5403" ? In that case the mode will take the most frequent value and will impute it. The fact is that from your starting dataframe you have NAs, so if you want to keep a particular data point you have to make a choice on a value to attribute to them, but only one value can be imputed. Otherwise you'll have to drop that data point.

            – alessio
            Jan 3 at 0:13











          • Ok I see, I will take a look to missForest() to see if it can be helpful or not in m case

            – Alex Germain
            Jan 3 at 0:24






          • 1





            You may also have a look at kNN for imputing missing values and, if you wish to have a nice visualization plot for reports, at vis_miss.

            – alessio
            Jan 3 at 0:36



















          0














          You can do the following:



          library(zoo)

          # get cols with missing values
          na_cols <- names(df)[colSums(is.na(df)) > 0]

          # fill the missing value backwards
          for (i in na_cols){
          df[[i]] <- na.locf(df[[i]])
          }





          share|improve this answer
























            Your Answer






            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "1"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54009176%2fhow-to-fill-na-in-r-for-quasi-same-row%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            3 Answers
            3






            active

            oldest

            votes








            3 Answers
            3






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            1














            You could loop through the data and find the first none NA value and replace the NA values with that value



            # Loop through the data
            for(c in 1:ncol(df)) {
            vals <- df[,c]
            noneNA <- vals[!is.na(vals)][1]
            vals[is.na(vals)] <- noneNA
            df[,c] <- vals
            }


            Or alternatively you could review your data element by element and take a none NA value from either above or below the relevant cell using nested for loops.



            for(c in 1:ncol(df)) {
            for(r in 1:nrow(df)) {
            if (is.na(df[r,c])) {
            nearVals <- df[c(r-1, r+1),c]
            noneNA <- nearVals[!is.na(nearVals)][1]
            df[r,c] <- noneNA
            }
            }
            }





            share|improve this answer


























            • Thanks for your answer, my problem is that there is no complete.cases() rows. The value could be taken on the previous or on the third following rows

              – Alex Germain
              Jan 2 at 22:15













            • I updated my answer based on your comment.

              – MatAff
              Jan 2 at 22:36











            • Thanks for your help, I just change the (r-1, r+1) part. Do you know any function to 'replace' this subset in a full data frame ? here it's only similar row and I want to reintegrate it in the full df given 5 or more keys for a row.

              – Alex Germain
              Jan 2 at 23:58











            • The code here loops through the full dataset. The statement df[r,c] <- ... overwrites the value currently in the data frame. What else are you looking to replace?

              – MatAff
              Jan 3 at 3:01











            • My df here is a subset of a full other one. Here it's just similar rows.

              – Alex Germain
              Jan 3 at 10:19
















            1














            You could loop through the data and find the first none NA value and replace the NA values with that value



            # Loop through the data
            for(c in 1:ncol(df)) {
            vals <- df[,c]
            noneNA <- vals[!is.na(vals)][1]
            vals[is.na(vals)] <- noneNA
            df[,c] <- vals
            }


            Or alternatively you could review your data element by element and take a none NA value from either above or below the relevant cell using nested for loops.



            for(c in 1:ncol(df)) {
            for(r in 1:nrow(df)) {
            if (is.na(df[r,c])) {
            nearVals <- df[c(r-1, r+1),c]
            noneNA <- nearVals[!is.na(nearVals)][1]
            df[r,c] <- noneNA
            }
            }
            }





            share|improve this answer


























            • Thanks for your answer, my problem is that there is no complete.cases() rows. The value could be taken on the previous or on the third following rows

              – Alex Germain
              Jan 2 at 22:15













            • I updated my answer based on your comment.

              – MatAff
              Jan 2 at 22:36











            • Thanks for your help, I just change the (r-1, r+1) part. Do you know any function to 'replace' this subset in a full data frame ? here it's only similar row and I want to reintegrate it in the full df given 5 or more keys for a row.

              – Alex Germain
              Jan 2 at 23:58











            • The code here loops through the full dataset. The statement df[r,c] <- ... overwrites the value currently in the data frame. What else are you looking to replace?

              – MatAff
              Jan 3 at 3:01











            • My df here is a subset of a full other one. Here it's just similar rows.

              – Alex Germain
              Jan 3 at 10:19














            1












            1








            1







            You could loop through the data and find the first none NA value and replace the NA values with that value



            # Loop through the data
            for(c in 1:ncol(df)) {
            vals <- df[,c]
            noneNA <- vals[!is.na(vals)][1]
            vals[is.na(vals)] <- noneNA
            df[,c] <- vals
            }


            Or alternatively you could review your data element by element and take a none NA value from either above or below the relevant cell using nested for loops.



            for(c in 1:ncol(df)) {
            for(r in 1:nrow(df)) {
            if (is.na(df[r,c])) {
            nearVals <- df[c(r-1, r+1),c]
            noneNA <- nearVals[!is.na(nearVals)][1]
            df[r,c] <- noneNA
            }
            }
            }





            share|improve this answer















            You could loop through the data and find the first none NA value and replace the NA values with that value



            # Loop through the data
            for(c in 1:ncol(df)) {
            vals <- df[,c]
            noneNA <- vals[!is.na(vals)][1]
            vals[is.na(vals)] <- noneNA
            df[,c] <- vals
            }


            Or alternatively you could review your data element by element and take a none NA value from either above or below the relevant cell using nested for loops.



            for(c in 1:ncol(df)) {
            for(r in 1:nrow(df)) {
            if (is.na(df[r,c])) {
            nearVals <- df[c(r-1, r+1),c]
            noneNA <- nearVals[!is.na(nearVals)][1]
            df[r,c] <- noneNA
            }
            }
            }






            share|improve this answer














            share|improve this answer



            share|improve this answer








            edited Jan 2 at 22:35

























            answered Jan 2 at 19:21









            MatAffMatAff

            605619




            605619













            • Thanks for your answer, my problem is that there is no complete.cases() rows. The value could be taken on the previous or on the third following rows

              – Alex Germain
              Jan 2 at 22:15













            • I updated my answer based on your comment.

              – MatAff
              Jan 2 at 22:36











            • Thanks for your help, I just change the (r-1, r+1) part. Do you know any function to 'replace' this subset in a full data frame ? here it's only similar row and I want to reintegrate it in the full df given 5 or more keys for a row.

              – Alex Germain
              Jan 2 at 23:58











            • The code here loops through the full dataset. The statement df[r,c] <- ... overwrites the value currently in the data frame. What else are you looking to replace?

              – MatAff
              Jan 3 at 3:01











            • My df here is a subset of a full other one. Here it's just similar rows.

              – Alex Germain
              Jan 3 at 10:19



















            • Thanks for your answer, my problem is that there is no complete.cases() rows. The value could be taken on the previous or on the third following rows

              – Alex Germain
              Jan 2 at 22:15













            • I updated my answer based on your comment.

              – MatAff
              Jan 2 at 22:36











            • Thanks for your help, I just change the (r-1, r+1) part. Do you know any function to 'replace' this subset in a full data frame ? here it's only similar row and I want to reintegrate it in the full df given 5 or more keys for a row.

              – Alex Germain
              Jan 2 at 23:58











            • The code here loops through the full dataset. The statement df[r,c] <- ... overwrites the value currently in the data frame. What else are you looking to replace?

              – MatAff
              Jan 3 at 3:01











            • My df here is a subset of a full other one. Here it's just similar rows.

              – Alex Germain
              Jan 3 at 10:19

















            Thanks for your answer, my problem is that there is no complete.cases() rows. The value could be taken on the previous or on the third following rows

            – Alex Germain
            Jan 2 at 22:15







            Thanks for your answer, my problem is that there is no complete.cases() rows. The value could be taken on the previous or on the third following rows

            – Alex Germain
            Jan 2 at 22:15















            I updated my answer based on your comment.

            – MatAff
            Jan 2 at 22:36





            I updated my answer based on your comment.

            – MatAff
            Jan 2 at 22:36













            Thanks for your help, I just change the (r-1, r+1) part. Do you know any function to 'replace' this subset in a full data frame ? here it's only similar row and I want to reintegrate it in the full df given 5 or more keys for a row.

            – Alex Germain
            Jan 2 at 23:58





            Thanks for your help, I just change the (r-1, r+1) part. Do you know any function to 'replace' this subset in a full data frame ? here it's only similar row and I want to reintegrate it in the full df given 5 or more keys for a row.

            – Alex Germain
            Jan 2 at 23:58













            The code here loops through the full dataset. The statement df[r,c] <- ... overwrites the value currently in the data frame. What else are you looking to replace?

            – MatAff
            Jan 3 at 3:01





            The code here loops through the full dataset. The statement df[r,c] <- ... overwrites the value currently in the data frame. What else are you looking to replace?

            – MatAff
            Jan 3 at 3:01













            My df here is a subset of a full other one. Here it's just similar rows.

            – Alex Germain
            Jan 3 at 10:19





            My df here is a subset of a full other one. Here it's just similar rows.

            – Alex Germain
            Jan 3 at 10:19













            1














            reading your question made me think of an imputation problem for the dataframe.



            In other terms you need to fill the NAs with some sort of value to be able to "save" records in the dataframe. The simplest way is to select the value of a particular column by searching the mean (when dealing with cardinal values) or the mode (when dealing with categorical values) [you may also execute a regression, but I guess it's a more complex method].



            In this case we may choose the mode replacement because the attributes are categorical. By running your code we obtain the dataframe df:



                     Year       hour         LOT S123_AA S135_AA  S13_BB S1763_BB S173_BB S234543 S1265UU5
            1 -0.32837526 0.7930541 -1.10954824 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
            2 0.55379245 -0.7320060 -0.95088434 ABF4576 <NA> BF50343 AA3489 BQA0478 AD4352 AZERTY
            3 0.36442118 0.9920967 -0.07345038 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
            4 -0.02546781 -0.1127828 -1.78241434 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
            5 1.92550340 -1.0531371 0.88318695 ABF4576 ABF5403 <NA> AA3489 BQA0478 AD4352 AZERTY


            We can then create a function to calculate the mode of a particular column:



            getmode <- function(v) {
            uniqv <- unique(v)
            uniqv[which.max(tabulate(match(v, uniqv)))]
            }


            And then use it to fill the missing values. Below the code to impute the missing values for the column S135_AA (I created a new dataframe named workdf) :



            workdf <- df
            workdf[is.na(workdf$S135_AA),c('S135_AA')] <- getmode(workdf[,'S135_AA'])


            This is the output where you can see that the column S135_AA NAs took the most recurring value of the colum:



                     Year       hour         LOT S123_AA S135_AA  S13_BB S1763_BB S173_BB S234543 S1265UU5
            1 -0.32837526 0.7930541 -1.10954824 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
            2 0.55379245 -0.7320060 -0.95088434 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
            3 0.36442118 0.9920967 -0.07345038 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
            4 -0.02546781 -0.1127828 -1.78241434 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
            5 1.92550340 -1.0531371 0.88318695 ABF4576 ABF5403 <NA> AA3489 BQA0478 AD4352 AZERTY


            If your objective was data cleaning I guess that you should use an imputation method to deal with it.






            share|improve this answer
























            • Thanks for your really explicit answer. Just a question, here it's a subset where all rows was similar, but is there anyway to deal with NA on the full DF ? where there is several different group of 'same' raws ? How the mode could work on it ?

              – Alex Germain
              Jan 3 at 0:04











            • As an example: you're considering the situation where the S135_AA contained different values and not only "ABF5403" ? In that case the mode will take the most frequent value and will impute it. The fact is that from your starting dataframe you have NAs, so if you want to keep a particular data point you have to make a choice on a value to attribute to them, but only one value can be imputed. Otherwise you'll have to drop that data point.

              – alessio
              Jan 3 at 0:13











            • Ok I see, I will take a look to missForest() to see if it can be helpful or not in m case

              – Alex Germain
              Jan 3 at 0:24






            • 1





              You may also have a look at kNN for imputing missing values and, if you wish to have a nice visualization plot for reports, at vis_miss.

              – alessio
              Jan 3 at 0:36
















            1














            reading your question made me think of an imputation problem for the dataframe.



            In other terms you need to fill the NAs with some sort of value to be able to "save" records in the dataframe. The simplest way is to select the value of a particular column by searching the mean (when dealing with cardinal values) or the mode (when dealing with categorical values) [you may also execute a regression, but I guess it's a more complex method].



            In this case we may choose the mode replacement because the attributes are categorical. By running your code we obtain the dataframe df:



                     Year       hour         LOT S123_AA S135_AA  S13_BB S1763_BB S173_BB S234543 S1265UU5
            1 -0.32837526 0.7930541 -1.10954824 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
            2 0.55379245 -0.7320060 -0.95088434 ABF4576 <NA> BF50343 AA3489 BQA0478 AD4352 AZERTY
            3 0.36442118 0.9920967 -0.07345038 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
            4 -0.02546781 -0.1127828 -1.78241434 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
            5 1.92550340 -1.0531371 0.88318695 ABF4576 ABF5403 <NA> AA3489 BQA0478 AD4352 AZERTY


            We can then create a function to calculate the mode of a particular column:



            getmode <- function(v) {
            uniqv <- unique(v)
            uniqv[which.max(tabulate(match(v, uniqv)))]
            }


            And then use it to fill the missing values. Below the code to impute the missing values for the column S135_AA (I created a new dataframe named workdf) :



            workdf <- df
            workdf[is.na(workdf$S135_AA),c('S135_AA')] <- getmode(workdf[,'S135_AA'])


            This is the output where you can see that the column S135_AA NAs took the most recurring value of the colum:



                     Year       hour         LOT S123_AA S135_AA  S13_BB S1763_BB S173_BB S234543 S1265UU5
            1 -0.32837526 0.7930541 -1.10954824 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
            2 0.55379245 -0.7320060 -0.95088434 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
            3 0.36442118 0.9920967 -0.07345038 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
            4 -0.02546781 -0.1127828 -1.78241434 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
            5 1.92550340 -1.0531371 0.88318695 ABF4576 ABF5403 <NA> AA3489 BQA0478 AD4352 AZERTY


            If your objective was data cleaning I guess that you should use an imputation method to deal with it.






            share|improve this answer
























            • Thanks for your really explicit answer. Just a question, here it's a subset where all rows was similar, but is there anyway to deal with NA on the full DF ? where there is several different group of 'same' raws ? How the mode could work on it ?

              – Alex Germain
              Jan 3 at 0:04











            • As an example: you're considering the situation where the S135_AA contained different values and not only "ABF5403" ? In that case the mode will take the most frequent value and will impute it. The fact is that from your starting dataframe you have NAs, so if you want to keep a particular data point you have to make a choice on a value to attribute to them, but only one value can be imputed. Otherwise you'll have to drop that data point.

              – alessio
              Jan 3 at 0:13











            • Ok I see, I will take a look to missForest() to see if it can be helpful or not in m case

              – Alex Germain
              Jan 3 at 0:24






            • 1





              You may also have a look at kNN for imputing missing values and, if you wish to have a nice visualization plot for reports, at vis_miss.

              – alessio
              Jan 3 at 0:36














            1












            1








            1







            reading your question made me think of an imputation problem for the dataframe.



            In other terms you need to fill the NAs with some sort of value to be able to "save" records in the dataframe. The simplest way is to select the value of a particular column by searching the mean (when dealing with cardinal values) or the mode (when dealing with categorical values) [you may also execute a regression, but I guess it's a more complex method].



            In this case we may choose the mode replacement because the attributes are categorical. By running your code we obtain the dataframe df:



                     Year       hour         LOT S123_AA S135_AA  S13_BB S1763_BB S173_BB S234543 S1265UU5
            1 -0.32837526 0.7930541 -1.10954824 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
            2 0.55379245 -0.7320060 -0.95088434 ABF4576 <NA> BF50343 AA3489 BQA0478 AD4352 AZERTY
            3 0.36442118 0.9920967 -0.07345038 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
            4 -0.02546781 -0.1127828 -1.78241434 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
            5 1.92550340 -1.0531371 0.88318695 ABF4576 ABF5403 <NA> AA3489 BQA0478 AD4352 AZERTY


            We can then create a function to calculate the mode of a particular column:



            getmode <- function(v) {
            uniqv <- unique(v)
            uniqv[which.max(tabulate(match(v, uniqv)))]
            }


            And then use it to fill the missing values. Below the code to impute the missing values for the column S135_AA (I created a new dataframe named workdf) :



            workdf <- df
            workdf[is.na(workdf$S135_AA),c('S135_AA')] <- getmode(workdf[,'S135_AA'])


            This is the output where you can see that the column S135_AA NAs took the most recurring value of the colum:



                     Year       hour         LOT S123_AA S135_AA  S13_BB S1763_BB S173_BB S234543 S1265UU5
            1 -0.32837526 0.7930541 -1.10954824 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
            2 0.55379245 -0.7320060 -0.95088434 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
            3 0.36442118 0.9920967 -0.07345038 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
            4 -0.02546781 -0.1127828 -1.78241434 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
            5 1.92550340 -1.0531371 0.88318695 ABF4576 ABF5403 <NA> AA3489 BQA0478 AD4352 AZERTY


            If your objective was data cleaning I guess that you should use an imputation method to deal with it.






            share|improve this answer













            reading your question made me think of an imputation problem for the dataframe.



            In other terms you need to fill the NAs with some sort of value to be able to "save" records in the dataframe. The simplest way is to select the value of a particular column by searching the mean (when dealing with cardinal values) or the mode (when dealing with categorical values) [you may also execute a regression, but I guess it's a more complex method].



            In this case we may choose the mode replacement because the attributes are categorical. By running your code we obtain the dataframe df:



                     Year       hour         LOT S123_AA S135_AA  S13_BB S1763_BB S173_BB S234543 S1265UU5
            1 -0.32837526 0.7930541 -1.10954824 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
            2 0.55379245 -0.7320060 -0.95088434 ABF4576 <NA> BF50343 AA3489 BQA0478 AD4352 AZERTY
            3 0.36442118 0.9920967 -0.07345038 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
            4 -0.02546781 -0.1127828 -1.78241434 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
            5 1.92550340 -1.0531371 0.88318695 ABF4576 ABF5403 <NA> AA3489 BQA0478 AD4352 AZERTY


            We can then create a function to calculate the mode of a particular column:



            getmode <- function(v) {
            uniqv <- unique(v)
            uniqv[which.max(tabulate(match(v, uniqv)))]
            }


            And then use it to fill the missing values. Below the code to impute the missing values for the column S135_AA (I created a new dataframe named workdf) :



            workdf <- df
            workdf[is.na(workdf$S135_AA),c('S135_AA')] <- getmode(workdf[,'S135_AA'])


            This is the output where you can see that the column S135_AA NAs took the most recurring value of the colum:



                     Year       hour         LOT S123_AA S135_AA  S13_BB S1763_BB S173_BB S234543 S1265UU5
            1 -0.32837526 0.7930541 -1.10954824 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
            2 0.55379245 -0.7320060 -0.95088434 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
            3 0.36442118 0.9920967 -0.07345038 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
            4 -0.02546781 -0.1127828 -1.78241434 ABF4576 ABF5403 BF50343 AA3489 BQA0478 AD4352 AZERTY
            5 1.92550340 -1.0531371 0.88318695 ABF4576 ABF5403 <NA> AA3489 BQA0478 AD4352 AZERTY


            If your objective was data cleaning I guess that you should use an imputation method to deal with it.







            share|improve this answer












            share|improve this answer



            share|improve this answer










            answered Jan 2 at 23:58









            alessioalessio

            35916




            35916













            • Thanks for your really explicit answer. Just a question, here it's a subset where all rows was similar, but is there anyway to deal with NA on the full DF ? where there is several different group of 'same' raws ? How the mode could work on it ?

              – Alex Germain
              Jan 3 at 0:04











            • As an example: you're considering the situation where the S135_AA contained different values and not only "ABF5403" ? In that case the mode will take the most frequent value and will impute it. The fact is that from your starting dataframe you have NAs, so if you want to keep a particular data point you have to make a choice on a value to attribute to them, but only one value can be imputed. Otherwise you'll have to drop that data point.

              – alessio
              Jan 3 at 0:13











            • Ok I see, I will take a look to missForest() to see if it can be helpful or not in m case

              – Alex Germain
              Jan 3 at 0:24






            • 1





              You may also have a look at kNN for imputing missing values and, if you wish to have a nice visualization plot for reports, at vis_miss.

              – alessio
              Jan 3 at 0:36



















            • Thanks for your really explicit answer. Just a question, here it's a subset where all rows was similar, but is there anyway to deal with NA on the full DF ? where there is several different group of 'same' raws ? How the mode could work on it ?

              – Alex Germain
              Jan 3 at 0:04











            • As an example: you're considering the situation where the S135_AA contained different values and not only "ABF5403" ? In that case the mode will take the most frequent value and will impute it. The fact is that from your starting dataframe you have NAs, so if you want to keep a particular data point you have to make a choice on a value to attribute to them, but only one value can be imputed. Otherwise you'll have to drop that data point.

              – alessio
              Jan 3 at 0:13











            • Ok I see, I will take a look to missForest() to see if it can be helpful or not in m case

              – Alex Germain
              Jan 3 at 0:24






            • 1





              You may also have a look at kNN for imputing missing values and, if you wish to have a nice visualization plot for reports, at vis_miss.

              – alessio
              Jan 3 at 0:36

















            Thanks for your really explicit answer. Just a question, here it's a subset where all rows was similar, but is there anyway to deal with NA on the full DF ? where there is several different group of 'same' raws ? How the mode could work on it ?

            – Alex Germain
            Jan 3 at 0:04





            Thanks for your really explicit answer. Just a question, here it's a subset where all rows was similar, but is there anyway to deal with NA on the full DF ? where there is several different group of 'same' raws ? How the mode could work on it ?

            – Alex Germain
            Jan 3 at 0:04













            As an example: you're considering the situation where the S135_AA contained different values and not only "ABF5403" ? In that case the mode will take the most frequent value and will impute it. The fact is that from your starting dataframe you have NAs, so if you want to keep a particular data point you have to make a choice on a value to attribute to them, but only one value can be imputed. Otherwise you'll have to drop that data point.

            – alessio
            Jan 3 at 0:13





            As an example: you're considering the situation where the S135_AA contained different values and not only "ABF5403" ? In that case the mode will take the most frequent value and will impute it. The fact is that from your starting dataframe you have NAs, so if you want to keep a particular data point you have to make a choice on a value to attribute to them, but only one value can be imputed. Otherwise you'll have to drop that data point.

            – alessio
            Jan 3 at 0:13













            Ok I see, I will take a look to missForest() to see if it can be helpful or not in m case

            – Alex Germain
            Jan 3 at 0:24





            Ok I see, I will take a look to missForest() to see if it can be helpful or not in m case

            – Alex Germain
            Jan 3 at 0:24




            1




            1





            You may also have a look at kNN for imputing missing values and, if you wish to have a nice visualization plot for reports, at vis_miss.

            – alessio
            Jan 3 at 0:36





            You may also have a look at kNN for imputing missing values and, if you wish to have a nice visualization plot for reports, at vis_miss.

            – alessio
            Jan 3 at 0:36











            0














            You can do the following:



            library(zoo)

            # get cols with missing values
            na_cols <- names(df)[colSums(is.na(df)) > 0]

            # fill the missing value backwards
            for (i in na_cols){
            df[[i]] <- na.locf(df[[i]])
            }





            share|improve this answer




























              0














              You can do the following:



              library(zoo)

              # get cols with missing values
              na_cols <- names(df)[colSums(is.na(df)) > 0]

              # fill the missing value backwards
              for (i in na_cols){
              df[[i]] <- na.locf(df[[i]])
              }





              share|improve this answer


























                0












                0








                0







                You can do the following:



                library(zoo)

                # get cols with missing values
                na_cols <- names(df)[colSums(is.na(df)) > 0]

                # fill the missing value backwards
                for (i in na_cols){
                df[[i]] <- na.locf(df[[i]])
                }





                share|improve this answer













                You can do the following:



                library(zoo)

                # get cols with missing values
                na_cols <- names(df)[colSums(is.na(df)) > 0]

                # fill the missing value backwards
                for (i in na_cols){
                df[[i]] <- na.locf(df[[i]])
                }






                share|improve this answer












                share|improve this answer



                share|improve this answer










                answered Jan 2 at 22:26









                YOLOYOLO

                5,5631425




                5,5631425






























                    draft saved

                    draft discarded




















































                    Thanks for contributing an answer to Stack Overflow!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54009176%2fhow-to-fill-na-in-r-for-quasi-same-row%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    'app-layout' is not a known element: how to share Component with different Modules

                    android studio warns about leanback feature tag usage required on manifest while using Unity exported app?

                    WPF add header to Image with URL pettitions [duplicate]