Extract row indices where duplicated pairs exist and the values of these rows which are different





.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ height:90px;width:728px;box-sizing:border-box;
}







1















I have the dataframe below:



master <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8),C=c(5,2,5,7,7,5,7,9,7,8),D=c(1,2,5,3,7,5,9,6,7,0))


As you can see I have 4 columns A,B,C,D. What I want to achieve is to create a new dataframe which will include the duplicated pair-rows between A and B, the index of rows where this duplication happens and the column names that make those rows different (C,D,or C and D) in a third column. To make my request more clear I display an example with master2 instead of master which includes only A and B



master2 <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8))


and then with:



library(data.table)
setDT(master2)

master2[master2[, .N, by=names(master2)][ N > 1L ], on=names(master2),
.(N, locs = .(.I)), by=.EACHI]


I get:



#    A B N locs
# 1: 2 3 2 3,4
# 2: 4 6 2 7,8


So I want this logic implemented to the master dataframe and also add another column named "Different" with the column names that make those rows different. If the rows are identical to everything then the new column with the column names that differ should take as value "nothing".If it is possible to add another column with the initial position of the "Different"column. It will be 3 for C and 4 for D?



The desired output shpuld be something like:



# A tibble: 2 x 4
# Groups: A [?]
# A B n locs different position
# <dbl> <dbl> <int> <chr> <chr> <int>
#1 2 3 2 3, 4 C, D 3,4
#2 4 6 2 7, 8 C, D 3,4









share|improve this question




















  • 1





    Why is it 3, 4 for 2nd row?

    – akrun
    Jan 3 at 19:24











  • because it is different by C and D which are in the 3rd and 4th position of the master dataset

    – firmo23
    Jan 4 at 12:36


















1















I have the dataframe below:



master <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8),C=c(5,2,5,7,7,5,7,9,7,8),D=c(1,2,5,3,7,5,9,6,7,0))


As you can see I have 4 columns A,B,C,D. What I want to achieve is to create a new dataframe which will include the duplicated pair-rows between A and B, the index of rows where this duplication happens and the column names that make those rows different (C,D,or C and D) in a third column. To make my request more clear I display an example with master2 instead of master which includes only A and B



master2 <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8))


and then with:



library(data.table)
setDT(master2)

master2[master2[, .N, by=names(master2)][ N > 1L ], on=names(master2),
.(N, locs = .(.I)), by=.EACHI]


I get:



#    A B N locs
# 1: 2 3 2 3,4
# 2: 4 6 2 7,8


So I want this logic implemented to the master dataframe and also add another column named "Different" with the column names that make those rows different. If the rows are identical to everything then the new column with the column names that differ should take as value "nothing".If it is possible to add another column with the initial position of the "Different"column. It will be 3 for C and 4 for D?



The desired output shpuld be something like:



# A tibble: 2 x 4
# Groups: A [?]
# A B n locs different position
# <dbl> <dbl> <int> <chr> <chr> <int>
#1 2 3 2 3, 4 C, D 3,4
#2 4 6 2 7, 8 C, D 3,4









share|improve this question




















  • 1





    Why is it 3, 4 for 2nd row?

    – akrun
    Jan 3 at 19:24











  • because it is different by C and D which are in the 3rd and 4th position of the master dataset

    – firmo23
    Jan 4 at 12:36














1












1








1








I have the dataframe below:



master <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8),C=c(5,2,5,7,7,5,7,9,7,8),D=c(1,2,5,3,7,5,9,6,7,0))


As you can see I have 4 columns A,B,C,D. What I want to achieve is to create a new dataframe which will include the duplicated pair-rows between A and B, the index of rows where this duplication happens and the column names that make those rows different (C,D,or C and D) in a third column. To make my request more clear I display an example with master2 instead of master which includes only A and B



master2 <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8))


and then with:



library(data.table)
setDT(master2)

master2[master2[, .N, by=names(master2)][ N > 1L ], on=names(master2),
.(N, locs = .(.I)), by=.EACHI]


I get:



#    A B N locs
# 1: 2 3 2 3,4
# 2: 4 6 2 7,8


So I want this logic implemented to the master dataframe and also add another column named "Different" with the column names that make those rows different. If the rows are identical to everything then the new column with the column names that differ should take as value "nothing".If it is possible to add another column with the initial position of the "Different"column. It will be 3 for C and 4 for D?



The desired output shpuld be something like:



# A tibble: 2 x 4
# Groups: A [?]
# A B n locs different position
# <dbl> <dbl> <int> <chr> <chr> <int>
#1 2 3 2 3, 4 C, D 3,4
#2 4 6 2 7, 8 C, D 3,4









share|improve this question
















I have the dataframe below:



master <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8),C=c(5,2,5,7,7,5,7,9,7,8),D=c(1,2,5,3,7,5,9,6,7,0))


As you can see I have 4 columns A,B,C,D. What I want to achieve is to create a new dataframe which will include the duplicated pair-rows between A and B, the index of rows where this duplication happens and the column names that make those rows different (C,D,or C and D) in a third column. To make my request more clear I display an example with master2 instead of master which includes only A and B



master2 <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8))


and then with:



library(data.table)
setDT(master2)

master2[master2[, .N, by=names(master2)][ N > 1L ], on=names(master2),
.(N, locs = .(.I)), by=.EACHI]


I get:



#    A B N locs
# 1: 2 3 2 3,4
# 2: 4 6 2 7,8


So I want this logic implemented to the master dataframe and also add another column named "Different" with the column names that make those rows different. If the rows are identical to everything then the new column with the column names that differ should take as value "nothing".If it is possible to add another column with the initial position of the "Different"column. It will be 3 for C and 4 for D?



The desired output shpuld be something like:



# A tibble: 2 x 4
# Groups: A [?]
# A B n locs different position
# <dbl> <dbl> <int> <chr> <chr> <int>
#1 2 3 2 3, 4 C, D 3,4
#2 4 6 2 7, 8 C, D 3,4






r






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Jan 3 at 19:12







firmo23

















asked Jan 3 at 16:55









firmo23firmo23

880517




880517








  • 1





    Why is it 3, 4 for 2nd row?

    – akrun
    Jan 3 at 19:24











  • because it is different by C and D which are in the 3rd and 4th position of the master dataset

    – firmo23
    Jan 4 at 12:36














  • 1





    Why is it 3, 4 for 2nd row?

    – akrun
    Jan 3 at 19:24











  • because it is different by C and D which are in the 3rd and 4th position of the master dataset

    – firmo23
    Jan 4 at 12:36








1




1





Why is it 3, 4 for 2nd row?

– akrun
Jan 3 at 19:24





Why is it 3, 4 for 2nd row?

– akrun
Jan 3 at 19:24













because it is different by C and D which are in the 3rd and 4th position of the master dataset

– firmo23
Jan 4 at 12:36





because it is different by C and D which are in the 3rd and 4th position of the master dataset

– firmo23
Jan 4 at 12:36












1 Answer
1






active

oldest

votes


















1














If we need the row index, then create a sequence column ('rn'), grouped by the columns of interest, keep only groups that have number of rows greater than 1, summarise to get the number of rows (n()) as well as the pasteed index of the sequence of rows for the group. Regarding the 'different' column, it is not entirely clear about the logic. Here, is one implemented based on the occurrence of different values within the same group of 'A' and 'B' with case_when



library(tidyverse)
master %>%
mutate(rn = row_number()) %>%
group_by(A, B) %>%
filter(n() > 1) %>%
summarise(n = n(),
locs = toString(rn),
Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D',
n_distinct(C) > 1 ~ 'C',
n_distinct(D) > 1 ~ 'D',
TRUE ~ 'Same'))
# A tibble: 2 x 4
# Groups: A [?]
# A B n locs different
# <dbl> <dbl> <int> <chr> <chr>
#1 2 3 2 3, 4 C, D
#2 4 6 2 7, 8 C, D


Update



Based on the comments to include 'position'



master %>%
mutate(rn = row_number()) %>%
group_by(A, B) %>%
filter(n() > 1) %>%
mutate(position = toString(rn[!(duplicated(paste(C, D))|
duplicated(paste(C, D), fromLast = TRUE))])) %>%
summarise(n = n(),
locs = toString(rn),
Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D',
n_distinct(C) > 1 ~ 'C',
n_distinct(D) > 1 ~ 'D',
TRUE ~ 'Same'),
position = first(position))





share|improve this answer





















  • 1





    great answer tnx!

    – firmo23
    Jan 3 at 18:35











  • is it possible to add another column with the initial position of the "Different"column. It will be 3 for C and 4 for D?

    – firmo23
    Jan 3 at 18:49






  • 1





    In this case, it is the same as locs. With a different example, it may become more clear about the intended output

    – akrun
    Jan 3 at 18:50











  • I edited properly

    – firmo23
    Jan 3 at 19:13











  • Now the position is same as the locs but in this case I want the column position. Do u want me to create a new Q since the original is answered?

    – firmo23
    Jan 4 at 12:50












Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54026582%2fextract-row-indices-where-duplicated-pairs-exist-and-the-values-of-these-rows-wh%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









1














If we need the row index, then create a sequence column ('rn'), grouped by the columns of interest, keep only groups that have number of rows greater than 1, summarise to get the number of rows (n()) as well as the pasteed index of the sequence of rows for the group. Regarding the 'different' column, it is not entirely clear about the logic. Here, is one implemented based on the occurrence of different values within the same group of 'A' and 'B' with case_when



library(tidyverse)
master %>%
mutate(rn = row_number()) %>%
group_by(A, B) %>%
filter(n() > 1) %>%
summarise(n = n(),
locs = toString(rn),
Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D',
n_distinct(C) > 1 ~ 'C',
n_distinct(D) > 1 ~ 'D',
TRUE ~ 'Same'))
# A tibble: 2 x 4
# Groups: A [?]
# A B n locs different
# <dbl> <dbl> <int> <chr> <chr>
#1 2 3 2 3, 4 C, D
#2 4 6 2 7, 8 C, D


Update



Based on the comments to include 'position'



master %>%
mutate(rn = row_number()) %>%
group_by(A, B) %>%
filter(n() > 1) %>%
mutate(position = toString(rn[!(duplicated(paste(C, D))|
duplicated(paste(C, D), fromLast = TRUE))])) %>%
summarise(n = n(),
locs = toString(rn),
Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D',
n_distinct(C) > 1 ~ 'C',
n_distinct(D) > 1 ~ 'D',
TRUE ~ 'Same'),
position = first(position))





share|improve this answer





















  • 1





    great answer tnx!

    – firmo23
    Jan 3 at 18:35











  • is it possible to add another column with the initial position of the "Different"column. It will be 3 for C and 4 for D?

    – firmo23
    Jan 3 at 18:49






  • 1





    In this case, it is the same as locs. With a different example, it may become more clear about the intended output

    – akrun
    Jan 3 at 18:50











  • I edited properly

    – firmo23
    Jan 3 at 19:13











  • Now the position is same as the locs but in this case I want the column position. Do u want me to create a new Q since the original is answered?

    – firmo23
    Jan 4 at 12:50
















1














If we need the row index, then create a sequence column ('rn'), grouped by the columns of interest, keep only groups that have number of rows greater than 1, summarise to get the number of rows (n()) as well as the pasteed index of the sequence of rows for the group. Regarding the 'different' column, it is not entirely clear about the logic. Here, is one implemented based on the occurrence of different values within the same group of 'A' and 'B' with case_when



library(tidyverse)
master %>%
mutate(rn = row_number()) %>%
group_by(A, B) %>%
filter(n() > 1) %>%
summarise(n = n(),
locs = toString(rn),
Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D',
n_distinct(C) > 1 ~ 'C',
n_distinct(D) > 1 ~ 'D',
TRUE ~ 'Same'))
# A tibble: 2 x 4
# Groups: A [?]
# A B n locs different
# <dbl> <dbl> <int> <chr> <chr>
#1 2 3 2 3, 4 C, D
#2 4 6 2 7, 8 C, D


Update



Based on the comments to include 'position'



master %>%
mutate(rn = row_number()) %>%
group_by(A, B) %>%
filter(n() > 1) %>%
mutate(position = toString(rn[!(duplicated(paste(C, D))|
duplicated(paste(C, D), fromLast = TRUE))])) %>%
summarise(n = n(),
locs = toString(rn),
Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D',
n_distinct(C) > 1 ~ 'C',
n_distinct(D) > 1 ~ 'D',
TRUE ~ 'Same'),
position = first(position))





share|improve this answer





















  • 1





    great answer tnx!

    – firmo23
    Jan 3 at 18:35











  • is it possible to add another column with the initial position of the "Different"column. It will be 3 for C and 4 for D?

    – firmo23
    Jan 3 at 18:49






  • 1





    In this case, it is the same as locs. With a different example, it may become more clear about the intended output

    – akrun
    Jan 3 at 18:50











  • I edited properly

    – firmo23
    Jan 3 at 19:13











  • Now the position is same as the locs but in this case I want the column position. Do u want me to create a new Q since the original is answered?

    – firmo23
    Jan 4 at 12:50














1












1








1







If we need the row index, then create a sequence column ('rn'), grouped by the columns of interest, keep only groups that have number of rows greater than 1, summarise to get the number of rows (n()) as well as the pasteed index of the sequence of rows for the group. Regarding the 'different' column, it is not entirely clear about the logic. Here, is one implemented based on the occurrence of different values within the same group of 'A' and 'B' with case_when



library(tidyverse)
master %>%
mutate(rn = row_number()) %>%
group_by(A, B) %>%
filter(n() > 1) %>%
summarise(n = n(),
locs = toString(rn),
Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D',
n_distinct(C) > 1 ~ 'C',
n_distinct(D) > 1 ~ 'D',
TRUE ~ 'Same'))
# A tibble: 2 x 4
# Groups: A [?]
# A B n locs different
# <dbl> <dbl> <int> <chr> <chr>
#1 2 3 2 3, 4 C, D
#2 4 6 2 7, 8 C, D


Update



Based on the comments to include 'position'



master %>%
mutate(rn = row_number()) %>%
group_by(A, B) %>%
filter(n() > 1) %>%
mutate(position = toString(rn[!(duplicated(paste(C, D))|
duplicated(paste(C, D), fromLast = TRUE))])) %>%
summarise(n = n(),
locs = toString(rn),
Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D',
n_distinct(C) > 1 ~ 'C',
n_distinct(D) > 1 ~ 'D',
TRUE ~ 'Same'),
position = first(position))





share|improve this answer















If we need the row index, then create a sequence column ('rn'), grouped by the columns of interest, keep only groups that have number of rows greater than 1, summarise to get the number of rows (n()) as well as the pasteed index of the sequence of rows for the group. Regarding the 'different' column, it is not entirely clear about the logic. Here, is one implemented based on the occurrence of different values within the same group of 'A' and 'B' with case_when



library(tidyverse)
master %>%
mutate(rn = row_number()) %>%
group_by(A, B) %>%
filter(n() > 1) %>%
summarise(n = n(),
locs = toString(rn),
Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D',
n_distinct(C) > 1 ~ 'C',
n_distinct(D) > 1 ~ 'D',
TRUE ~ 'Same'))
# A tibble: 2 x 4
# Groups: A [?]
# A B n locs different
# <dbl> <dbl> <int> <chr> <chr>
#1 2 3 2 3, 4 C, D
#2 4 6 2 7, 8 C, D


Update



Based on the comments to include 'position'



master %>%
mutate(rn = row_number()) %>%
group_by(A, B) %>%
filter(n() > 1) %>%
mutate(position = toString(rn[!(duplicated(paste(C, D))|
duplicated(paste(C, D), fromLast = TRUE))])) %>%
summarise(n = n(),
locs = toString(rn),
Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D',
n_distinct(C) > 1 ~ 'C',
n_distinct(D) > 1 ~ 'D',
TRUE ~ 'Same'),
position = first(position))






share|improve this answer














share|improve this answer



share|improve this answer








edited Jan 3 at 19:27

























answered Jan 3 at 18:06









akrunakrun

422k13209285




422k13209285








  • 1





    great answer tnx!

    – firmo23
    Jan 3 at 18:35











  • is it possible to add another column with the initial position of the "Different"column. It will be 3 for C and 4 for D?

    – firmo23
    Jan 3 at 18:49






  • 1





    In this case, it is the same as locs. With a different example, it may become more clear about the intended output

    – akrun
    Jan 3 at 18:50











  • I edited properly

    – firmo23
    Jan 3 at 19:13











  • Now the position is same as the locs but in this case I want the column position. Do u want me to create a new Q since the original is answered?

    – firmo23
    Jan 4 at 12:50














  • 1





    great answer tnx!

    – firmo23
    Jan 3 at 18:35











  • is it possible to add another column with the initial position of the "Different"column. It will be 3 for C and 4 for D?

    – firmo23
    Jan 3 at 18:49






  • 1





    In this case, it is the same as locs. With a different example, it may become more clear about the intended output

    – akrun
    Jan 3 at 18:50











  • I edited properly

    – firmo23
    Jan 3 at 19:13











  • Now the position is same as the locs but in this case I want the column position. Do u want me to create a new Q since the original is answered?

    – firmo23
    Jan 4 at 12:50








1




1





great answer tnx!

– firmo23
Jan 3 at 18:35





great answer tnx!

– firmo23
Jan 3 at 18:35













is it possible to add another column with the initial position of the "Different"column. It will be 3 for C and 4 for D?

– firmo23
Jan 3 at 18:49





is it possible to add another column with the initial position of the "Different"column. It will be 3 for C and 4 for D?

– firmo23
Jan 3 at 18:49




1




1





In this case, it is the same as locs. With a different example, it may become more clear about the intended output

– akrun
Jan 3 at 18:50





In this case, it is the same as locs. With a different example, it may become more clear about the intended output

– akrun
Jan 3 at 18:50













I edited properly

– firmo23
Jan 3 at 19:13





I edited properly

– firmo23
Jan 3 at 19:13













Now the position is same as the locs but in this case I want the column position. Do u want me to create a new Q since the original is answered?

– firmo23
Jan 4 at 12:50





Now the position is same as the locs but in this case I want the column position. Do u want me to create a new Q since the original is answered?

– firmo23
Jan 4 at 12:50




















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54026582%2fextract-row-indices-where-duplicated-pairs-exist-and-the-values-of-these-rows-wh%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

MongoDB - Not Authorized To Execute Command

How to fix TextFormField cause rebuild widget in Flutter

in spring boot 2.1 many test slices are not allowed anymore due to multiple @BootstrapWith