Extract row indices where duplicated pairs exist and the values of these rows which are different

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ height:90px;width:728px;box-sizing:border-box;
}

I have the dataframe below:

master <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8),C=c(5,2,5,7,7,5,7,9,7,8),D=c(1,2,5,3,7,5,9,6,7,0))

As you can see I have 4 columns A,B,C,D. What I want to achieve is to create a new dataframe which will include the duplicated pair-rows between A and B, the index of rows where this duplication happens and the column names that make those rows different (C,D,or C and D) in a third column. To make my request more clear I display an example with master2 instead of master which includes only A and B

master2 <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8))

and then with:

library(data.table)

setDT(master2)



master2[master2[, .N, by=names(master2)][ N > 1L ], on=names(master2), 

  .(N, locs = .(.I)), by=.EACHI]

I get:

#    A B N locs

# 1: 2 3 2  3,4

# 2: 4 6 2  7,8

So I want this logic implemented to the master dataframe and also add another column named "Different" with the column names that make those rows different. If the rows are identical to everything then the new column with the column names that differ should take as value "nothing".If it is possible to add another column with the initial position of the "Different"column. It will be 3 for C and 4 for D?

The desired output shpuld be something like:

# A tibble: 2 x 4

# Groups:   A [?]

#     A     B     n locs  different position

#   <dbl> <dbl> <int> <chr> <chr>    <int>

#1     2     3     2 3, 4  C, D     3,4

#2     4     6     2 7, 8  C, D     3,4

edited Jan 3 at 19:12

asked Jan 3 at 16:55

firmo23

880517

1

Why is it 3, 4 for 2nd row?

– akrun
Jan 3 at 19:24

because it is different by C and D which are in the 3rd and 4th position of the master dataset

– firmo23
Jan 4 at 12:36

add a comment |

I have the dataframe below:

master <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8),C=c(5,2,5,7,7,5,7,9,7,8),D=c(1,2,5,3,7,5,9,6,7,0))

master2 <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8))

and then with:

library(data.table)

setDT(master2)



master2[master2[, .N, by=names(master2)][ N > 1L ], on=names(master2), 

  .(N, locs = .(.I)), by=.EACHI]

I get:

#    A B N locs

# 1: 2 3 2  3,4

# 2: 4 6 2  7,8

The desired output shpuld be something like:

# A tibble: 2 x 4

# Groups:   A [?]

#     A     B     n locs  different position

#   <dbl> <dbl> <int> <chr> <chr>    <int>

#1     2     3     2 3, 4  C, D     3,4

#2     4     6     2 7, 8  C, D     3,4

edited Jan 3 at 19:12

asked Jan 3 at 16:55

firmo23

880517

1

Why is it 3, 4 for 2nd row?

– akrun
Jan 3 at 19:24

because it is different by C and D which are in the 3rd and 4th position of the master dataset

– firmo23
Jan 4 at 12:36

add a comment |

I have the dataframe below:

master <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8),C=c(5,2,5,7,7,5,7,9,7,8),D=c(1,2,5,3,7,5,9,6,7,0))

master2 <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8))

and then with:

library(data.table)

setDT(master2)



master2[master2[, .N, by=names(master2)][ N > 1L ], on=names(master2), 

  .(N, locs = .(.I)), by=.EACHI]

I get:

#    A B N locs

# 1: 2 3 2  3,4

# 2: 4 6 2  7,8

The desired output shpuld be something like:

# A tibble: 2 x 4

# Groups:   A [?]

#     A     B     n locs  different position

#   <dbl> <dbl> <int> <chr> <chr>    <int>

#1     2     3     2 3, 4  C, D     3,4

#2     4     6     2 7, 8  C, D     3,4

edited Jan 3 at 19:12

asked Jan 3 at 16:55

firmo23

880517

I have the dataframe below:

master <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8),C=c(5,2,5,7,7,5,7,9,7,8),D=c(1,2,5,3,7,5,9,6,7,0))

master2 <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8))

and then with:

library(data.table)

setDT(master2)



master2[master2[, .N, by=names(master2)][ N > 1L ], on=names(master2), 

  .(N, locs = .(.I)), by=.EACHI]

I get:

#    A B N locs

# 1: 2 3 2  3,4

# 2: 4 6 2  7,8

The desired output shpuld be something like:

# A tibble: 2 x 4

# Groups:   A [?]

#     A     B     n locs  different position

#   <dbl> <dbl> <int> <chr> <chr>    <int>

#1     2     3     2 3, 4  C, D     3,4

#2     4     6     2 7, 8  C, D     3,4

edited Jan 3 at 19:12

asked Jan 3 at 16:55

firmo23

880517

edited Jan 3 at 19:12

asked Jan 3 at 16:55

firmo23

880517

edited Jan 3 at 19:12

asked Jan 3 at 16:55

firmo23

880517

asked Jan 3 at 16:55

firmo23

880517

asked Jan 3 at 16:55

firmo23

880517

1

Why is it 3, 4 for 2nd row?

– akrun
Jan 3 at 19:24

because it is different by C and D which are in the 3rd and 4th position of the master dataset

– firmo23
Jan 4 at 12:36

add a comment |

1

Why is it 3, 4 for 2nd row?

– akrun
Jan 3 at 19:24

because it is different by C and D which are in the 3rd and 4th position of the master dataset

– firmo23
Jan 4 at 12:36

Why is it 3, 4 for 2nd row?

– akrun
Jan 3 at 19:24

because it is different by C and D which are in the 3rd and 4th position of the master dataset

– firmo23
Jan 4 at 12:36

add a comment |

1 Answer
1

active

oldest

votes

If we need the row index, then create a sequence column ('rn'), grouped by the columns of interest, keep only groups that have number of rows greater than 1, summarise to get the number of rows (n()) as well as the pasteed index of the sequence of rows for the group. Regarding the 'different' column, it is not entirely clear about the logic. Here, is one implemented based on the occurrence of different values within the same group of 'A' and 'B' with case_when

library(tidyverse)

master %>%

   mutate(rn = row_number()) %>% 

   group_by(A, B) %>% 

   filter(n() > 1) %>% 

   summarise(n = n(), 

            locs =  toString(rn),

            Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D', 

                                  n_distinct(C) > 1 ~ 'C', 

                                  n_distinct(D) > 1 ~ 'D', 

                                  TRUE ~ 'Same')) 

# A tibble: 2 x 4

# Groups:   A [?]

#     A     B     n locs  different

#   <dbl> <dbl> <int> <chr> <chr>    

#1     2     3     2 3, 4  C, D     

#2     4     6     2 7, 8  C, D

Update

Based on the comments to include 'position'

master %>%

    mutate(rn = row_number()) %>% 

    group_by(A, B) %>% 

    filter(n() > 1) %>% 

    mutate(position = toString(rn[!(duplicated(paste(C, D))| 

              duplicated(paste(C, D), fromLast = TRUE))])) %>%

    summarise(n = n(), 

             locs =  toString(rn),

             Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D', 

                                   n_distinct(C) > 1 ~ 'C', 

                                   n_distinct(D) > 1 ~ 'D', 

                                   TRUE ~ 'Same'),

             position = first(position))

edited Jan 3 at 19:27

answered Jan 3 at 18:06

akrun

422k13209285

1

great answer tnx!

– firmo23
Jan 3 at 18:35

is it possible to add another column with the initial position of the "Different"column. It will be 3 for C and 4 for D?

– firmo23
Jan 3 at 18:49

1

In this case, it is the same as locs. With a different example, it may become more clear about the intended output

– akrun
Jan 3 at 18:50

I edited properly

– firmo23
Jan 3 at 19:13

Now the position is same as the locs but in this case I want the column position. Do u want me to create a new Q since the original is answered?

– firmo23
Jan 4 at 12:50

|
show 1 more comment

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54026582%2fextract-row-indices-where-duplicated-pairs-exist-and-the-values-of-these-rows-wh%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

library(tidyverse)

master %>%

   mutate(rn = row_number()) %>% 

   group_by(A, B) %>% 

   filter(n() > 1) %>% 

   summarise(n = n(), 

            locs =  toString(rn),

            Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D', 

                                  n_distinct(C) > 1 ~ 'C', 

                                  n_distinct(D) > 1 ~ 'D', 

                                  TRUE ~ 'Same')) 

# A tibble: 2 x 4

# Groups:   A [?]

#     A     B     n locs  different

#   <dbl> <dbl> <int> <chr> <chr>    

#1     2     3     2 3, 4  C, D     

#2     4     6     2 7, 8  C, D

Update

Based on the comments to include 'position'

master %>%

    mutate(rn = row_number()) %>% 

    group_by(A, B) %>% 

    filter(n() > 1) %>% 

    mutate(position = toString(rn[!(duplicated(paste(C, D))| 

              duplicated(paste(C, D), fromLast = TRUE))])) %>%

    summarise(n = n(), 

             locs =  toString(rn),

             Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D', 

                                   n_distinct(C) > 1 ~ 'C', 

                                   n_distinct(D) > 1 ~ 'D', 

                                   TRUE ~ 'Same'),

             position = first(position))

edited Jan 3 at 19:27

answered Jan 3 at 18:06

akrun

422k13209285

1

great answer tnx!

– firmo23
Jan 3 at 18:35

is it possible to add another column with the initial position of the "Different"column. It will be 3 for C and 4 for D?

– firmo23
Jan 3 at 18:49

1

In this case, it is the same as locs. With a different example, it may become more clear about the intended output

– akrun
Jan 3 at 18:50

I edited properly

– firmo23
Jan 3 at 19:13

Now the position is same as the locs but in this case I want the column position. Do u want me to create a new Q since the original is answered?

– firmo23
Jan 4 at 12:50

|
show 1 more comment

library(tidyverse)

master %>%

   mutate(rn = row_number()) %>% 

   group_by(A, B) %>% 

   filter(n() > 1) %>% 

   summarise(n = n(), 

            locs =  toString(rn),

            Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D', 

                                  n_distinct(C) > 1 ~ 'C', 

                                  n_distinct(D) > 1 ~ 'D', 

                                  TRUE ~ 'Same')) 

# A tibble: 2 x 4

# Groups:   A [?]

#     A     B     n locs  different

#   <dbl> <dbl> <int> <chr> <chr>    

#1     2     3     2 3, 4  C, D     

#2     4     6     2 7, 8  C, D

Update

Based on the comments to include 'position'

master %>%

    mutate(rn = row_number()) %>% 

    group_by(A, B) %>% 

    filter(n() > 1) %>% 

    mutate(position = toString(rn[!(duplicated(paste(C, D))| 

              duplicated(paste(C, D), fromLast = TRUE))])) %>%

    summarise(n = n(), 

             locs =  toString(rn),

             Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D', 

                                   n_distinct(C) > 1 ~ 'C', 

                                   n_distinct(D) > 1 ~ 'D', 

                                   TRUE ~ 'Same'),

             position = first(position))

edited Jan 3 at 19:27

answered Jan 3 at 18:06

akrun

422k13209285

1

great answer tnx!

– firmo23
Jan 3 at 18:35

is it possible to add another column with the initial position of the "Different"column. It will be 3 for C and 4 for D?

– firmo23
Jan 3 at 18:49

1

In this case, it is the same as locs. With a different example, it may become more clear about the intended output

– akrun
Jan 3 at 18:50

I edited properly

– firmo23
Jan 3 at 19:13

Now the position is same as the locs but in this case I want the column position. Do u want me to create a new Q since the original is answered?

– firmo23
Jan 4 at 12:50

|
show 1 more comment

library(tidyverse)

master %>%

   mutate(rn = row_number()) %>% 

   group_by(A, B) %>% 

   filter(n() > 1) %>% 

   summarise(n = n(), 

            locs =  toString(rn),

            Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D', 

                                  n_distinct(C) > 1 ~ 'C', 

                                  n_distinct(D) > 1 ~ 'D', 

                                  TRUE ~ 'Same')) 

# A tibble: 2 x 4

# Groups:   A [?]

#     A     B     n locs  different

#   <dbl> <dbl> <int> <chr> <chr>    

#1     2     3     2 3, 4  C, D     

#2     4     6     2 7, 8  C, D

Update

Based on the comments to include 'position'

master %>%

    mutate(rn = row_number()) %>% 

    group_by(A, B) %>% 

    filter(n() > 1) %>% 

    mutate(position = toString(rn[!(duplicated(paste(C, D))| 

              duplicated(paste(C, D), fromLast = TRUE))])) %>%

    summarise(n = n(), 

             locs =  toString(rn),

             Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D', 

                                   n_distinct(C) > 1 ~ 'C', 

                                   n_distinct(D) > 1 ~ 'D', 

                                   TRUE ~ 'Same'),

             position = first(position))

edited Jan 3 at 19:27

answered Jan 3 at 18:06

akrun

422k13209285

library(tidyverse)

master %>%

   mutate(rn = row_number()) %>% 

   group_by(A, B) %>% 

   filter(n() > 1) %>% 

   summarise(n = n(), 

            locs =  toString(rn),

            Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D', 

                                  n_distinct(C) > 1 ~ 'C', 

                                  n_distinct(D) > 1 ~ 'D', 

                                  TRUE ~ 'Same')) 

# A tibble: 2 x 4

# Groups:   A [?]

#     A     B     n locs  different

#   <dbl> <dbl> <int> <chr> <chr>    

#1     2     3     2 3, 4  C, D     

#2     4     6     2 7, 8  C, D

Update

Based on the comments to include 'position'

master %>%

    mutate(rn = row_number()) %>% 

    group_by(A, B) %>% 

    filter(n() > 1) %>% 

    mutate(position = toString(rn[!(duplicated(paste(C, D))| 

              duplicated(paste(C, D), fromLast = TRUE))])) %>%

    summarise(n = n(), 

             locs =  toString(rn),

             Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D', 

                                   n_distinct(C) > 1 ~ 'C', 

                                   n_distinct(D) > 1 ~ 'D', 

                                   TRUE ~ 'Same'),

             position = first(position))

edited Jan 3 at 19:27

answered Jan 3 at 18:06

akrun

422k13209285

edited Jan 3 at 19:27

answered Jan 3 at 18:06

akrun

422k13209285

answered Jan 3 at 18:06

akrun

422k13209285

answered Jan 3 at 18:06

akrun

422k13209285

1

great answer tnx!

– firmo23
Jan 3 at 18:35

is it possible to add another column with the initial position of the "Different"column. It will be 3 for C and 4 for D?

– firmo23
Jan 3 at 18:49

1

In this case, it is the same as locs. With a different example, it may become more clear about the intended output

– akrun
Jan 3 at 18:50

I edited properly

– firmo23
Jan 3 at 19:13

Now the position is same as the locs but in this case I want the column position. Do u want me to create a new Q since the original is answered?

– firmo23
Jan 4 at 12:50

|
show 1 more comment

1

great answer tnx!

– firmo23
Jan 3 at 18:35

is it possible to add another column with the initial position of the "Different"column. It will be 3 for C and 4 for D?

– firmo23
Jan 3 at 18:49

1

In this case, it is the same as locs. With a different example, it may become more clear about the intended output

– akrun
Jan 3 at 18:50

I edited properly

– firmo23
Jan 3 at 19:13

Now the position is same as the locs but in this case I want the column position. Do u want me to create a new Q since the original is answered?

– firmo23
Jan 4 at 12:50

great answer tnx!

– firmo23
Jan 3 at 18:35

is it possible to add another column with the initial position of the "Different"column. It will be 3 for C and 4 for D?

– firmo23
Jan 3 at 18:49

In this case, it is the same as locs. With a different example, it may become more clear about the intended output

– akrun
Jan 3 at 18:50

I edited properly

– firmo23
Jan 3 at 19:13

Now the position is same as the locs but in this case I want the column position. Do u want me to create a new Q since the original is answered?

– firmo23
Jan 4 at 12:50

|
show 1 more comment

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

Search This Blog

Ufyukyu