Extract row indices where duplicated pairs exist and the values of these rows which are different
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ height:90px;width:728px;box-sizing:border-box;
}
I have the dataframe below:
master <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8),C=c(5,2,5,7,7,5,7,9,7,8),D=c(1,2,5,3,7,5,9,6,7,0))
As you can see I have 4 columns A,B,C,D. What I want to achieve is to create a new dataframe which will include the duplicated pair-rows between A and B, the index of rows where this duplication happens and the column names that make those rows different (C,D,or C and D) in a third column. To make my request more clear I display an example with master2 instead of master which includes only A and B
master2 <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8))
and then with:
library(data.table)
setDT(master2)
master2[master2[, .N, by=names(master2)][ N > 1L ], on=names(master2),
.(N, locs = .(.I)), by=.EACHI]
I get:
# A B N locs
# 1: 2 3 2 3,4
# 2: 4 6 2 7,8
So I want this logic implemented to the master dataframe and also add another column named "Different" with the column names that make those rows different. If the rows are identical to everything then the new column with the column names that differ should take as value "nothing".If it is possible to add another column with the initial position of the "Different"column. It will be 3 for C and 4 for D?
The desired output shpuld be something like:
# A tibble: 2 x 4
# Groups: A [?]
# A B n locs different position
# <dbl> <dbl> <int> <chr> <chr> <int>
#1 2 3 2 3, 4 C, D 3,4
#2 4 6 2 7, 8 C, D 3,4
r
add a comment |
I have the dataframe below:
master <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8),C=c(5,2,5,7,7,5,7,9,7,8),D=c(1,2,5,3,7,5,9,6,7,0))
As you can see I have 4 columns A,B,C,D. What I want to achieve is to create a new dataframe which will include the duplicated pair-rows between A and B, the index of rows where this duplication happens and the column names that make those rows different (C,D,or C and D) in a third column. To make my request more clear I display an example with master2 instead of master which includes only A and B
master2 <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8))
and then with:
library(data.table)
setDT(master2)
master2[master2[, .N, by=names(master2)][ N > 1L ], on=names(master2),
.(N, locs = .(.I)), by=.EACHI]
I get:
# A B N locs
# 1: 2 3 2 3,4
# 2: 4 6 2 7,8
So I want this logic implemented to the master dataframe and also add another column named "Different" with the column names that make those rows different. If the rows are identical to everything then the new column with the column names that differ should take as value "nothing".If it is possible to add another column with the initial position of the "Different"column. It will be 3 for C and 4 for D?
The desired output shpuld be something like:
# A tibble: 2 x 4
# Groups: A [?]
# A B n locs different position
# <dbl> <dbl> <int> <chr> <chr> <int>
#1 2 3 2 3, 4 C, D 3,4
#2 4 6 2 7, 8 C, D 3,4
r
1
Why is it 3, 4 for 2nd row?
– akrun
Jan 3 at 19:24
because it is different by C and D which are in the 3rd and 4th position of the master dataset
– firmo23
Jan 4 at 12:36
add a comment |
I have the dataframe below:
master <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8),C=c(5,2,5,7,7,5,7,9,7,8),D=c(1,2,5,3,7,5,9,6,7,0))
As you can see I have 4 columns A,B,C,D. What I want to achieve is to create a new dataframe which will include the duplicated pair-rows between A and B, the index of rows where this duplication happens and the column names that make those rows different (C,D,or C and D) in a third column. To make my request more clear I display an example with master2 instead of master which includes only A and B
master2 <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8))
and then with:
library(data.table)
setDT(master2)
master2[master2[, .N, by=names(master2)][ N > 1L ], on=names(master2),
.(N, locs = .(.I)), by=.EACHI]
I get:
# A B N locs
# 1: 2 3 2 3,4
# 2: 4 6 2 7,8
So I want this logic implemented to the master dataframe and also add another column named "Different" with the column names that make those rows different. If the rows are identical to everything then the new column with the column names that differ should take as value "nothing".If it is possible to add another column with the initial position of the "Different"column. It will be 3 for C and 4 for D?
The desired output shpuld be something like:
# A tibble: 2 x 4
# Groups: A [?]
# A B n locs different position
# <dbl> <dbl> <int> <chr> <chr> <int>
#1 2 3 2 3, 4 C, D 3,4
#2 4 6 2 7, 8 C, D 3,4
r
I have the dataframe below:
master <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8),C=c(5,2,5,7,7,5,7,9,7,8),D=c(1,2,5,3,7,5,9,6,7,0))
As you can see I have 4 columns A,B,C,D. What I want to achieve is to create a new dataframe which will include the duplicated pair-rows between A and B, the index of rows where this duplication happens and the column names that make those rows different (C,D,or C and D) in a third column. To make my request more clear I display an example with master2 instead of master which includes only A and B
master2 <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8))
and then with:
library(data.table)
setDT(master2)
master2[master2[, .N, by=names(master2)][ N > 1L ], on=names(master2),
.(N, locs = .(.I)), by=.EACHI]
I get:
# A B N locs
# 1: 2 3 2 3,4
# 2: 4 6 2 7,8
So I want this logic implemented to the master dataframe and also add another column named "Different" with the column names that make those rows different. If the rows are identical to everything then the new column with the column names that differ should take as value "nothing".If it is possible to add another column with the initial position of the "Different"column. It will be 3 for C and 4 for D?
The desired output shpuld be something like:
# A tibble: 2 x 4
# Groups: A [?]
# A B n locs different position
# <dbl> <dbl> <int> <chr> <chr> <int>
#1 2 3 2 3, 4 C, D 3,4
#2 4 6 2 7, 8 C, D 3,4
r
r
edited Jan 3 at 19:12
firmo23
asked Jan 3 at 16:55
firmo23firmo23
880517
880517
1
Why is it 3, 4 for 2nd row?
– akrun
Jan 3 at 19:24
because it is different by C and D which are in the 3rd and 4th position of the master dataset
– firmo23
Jan 4 at 12:36
add a comment |
1
Why is it 3, 4 for 2nd row?
– akrun
Jan 3 at 19:24
because it is different by C and D which are in the 3rd and 4th position of the master dataset
– firmo23
Jan 4 at 12:36
1
1
Why is it 3, 4 for 2nd row?
– akrun
Jan 3 at 19:24
Why is it 3, 4 for 2nd row?
– akrun
Jan 3 at 19:24
because it is different by C and D which are in the 3rd and 4th position of the master dataset
– firmo23
Jan 4 at 12:36
because it is different by C and D which are in the 3rd and 4th position of the master dataset
– firmo23
Jan 4 at 12:36
add a comment |
1 Answer
1
active
oldest
votes
If we need the row index, then create a sequence column ('rn'), grouped by the columns of interest, keep only groups that have number of rows greater than 1, summarise to get the number of rows (n()) as well as the pasteed index of the sequence of rows for the group. Regarding the 'different' column, it is not entirely clear about the logic. Here, is one implemented based on the occurrence of different values within the same group of 'A' and 'B' with case_when
library(tidyverse)
master %>%
mutate(rn = row_number()) %>%
group_by(A, B) %>%
filter(n() > 1) %>%
summarise(n = n(),
locs = toString(rn),
Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D',
n_distinct(C) > 1 ~ 'C',
n_distinct(D) > 1 ~ 'D',
TRUE ~ 'Same'))
# A tibble: 2 x 4
# Groups: A [?]
# A B n locs different
# <dbl> <dbl> <int> <chr> <chr>
#1 2 3 2 3, 4 C, D
#2 4 6 2 7, 8 C, D
Update
Based on the comments to include 'position'
master %>%
mutate(rn = row_number()) %>%
group_by(A, B) %>%
filter(n() > 1) %>%
mutate(position = toString(rn[!(duplicated(paste(C, D))|
duplicated(paste(C, D), fromLast = TRUE))])) %>%
summarise(n = n(),
locs = toString(rn),
Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D',
n_distinct(C) > 1 ~ 'C',
n_distinct(D) > 1 ~ 'D',
TRUE ~ 'Same'),
position = first(position))
1
great answer tnx!
– firmo23
Jan 3 at 18:35
is it possible to add another column with the initial position of the "Different"column. It will be 3 for C and 4 for D?
– firmo23
Jan 3 at 18:49
1
In this case, it is the same aslocs. With a different example, it may become more clear about the intended output
– akrun
Jan 3 at 18:50
I edited properly
– firmo23
Jan 3 at 19:13
Now the position is same as the locs but in this case I want the column position. Do u want me to create a new Q since the original is answered?
– firmo23
Jan 4 at 12:50
|
show 1 more comment
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54026582%2fextract-row-indices-where-duplicated-pairs-exist-and-the-values-of-these-rows-wh%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
If we need the row index, then create a sequence column ('rn'), grouped by the columns of interest, keep only groups that have number of rows greater than 1, summarise to get the number of rows (n()) as well as the pasteed index of the sequence of rows for the group. Regarding the 'different' column, it is not entirely clear about the logic. Here, is one implemented based on the occurrence of different values within the same group of 'A' and 'B' with case_when
library(tidyverse)
master %>%
mutate(rn = row_number()) %>%
group_by(A, B) %>%
filter(n() > 1) %>%
summarise(n = n(),
locs = toString(rn),
Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D',
n_distinct(C) > 1 ~ 'C',
n_distinct(D) > 1 ~ 'D',
TRUE ~ 'Same'))
# A tibble: 2 x 4
# Groups: A [?]
# A B n locs different
# <dbl> <dbl> <int> <chr> <chr>
#1 2 3 2 3, 4 C, D
#2 4 6 2 7, 8 C, D
Update
Based on the comments to include 'position'
master %>%
mutate(rn = row_number()) %>%
group_by(A, B) %>%
filter(n() > 1) %>%
mutate(position = toString(rn[!(duplicated(paste(C, D))|
duplicated(paste(C, D), fromLast = TRUE))])) %>%
summarise(n = n(),
locs = toString(rn),
Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D',
n_distinct(C) > 1 ~ 'C',
n_distinct(D) > 1 ~ 'D',
TRUE ~ 'Same'),
position = first(position))
1
great answer tnx!
– firmo23
Jan 3 at 18:35
is it possible to add another column with the initial position of the "Different"column. It will be 3 for C and 4 for D?
– firmo23
Jan 3 at 18:49
1
In this case, it is the same aslocs. With a different example, it may become more clear about the intended output
– akrun
Jan 3 at 18:50
I edited properly
– firmo23
Jan 3 at 19:13
Now the position is same as the locs but in this case I want the column position. Do u want me to create a new Q since the original is answered?
– firmo23
Jan 4 at 12:50
|
show 1 more comment
If we need the row index, then create a sequence column ('rn'), grouped by the columns of interest, keep only groups that have number of rows greater than 1, summarise to get the number of rows (n()) as well as the pasteed index of the sequence of rows for the group. Regarding the 'different' column, it is not entirely clear about the logic. Here, is one implemented based on the occurrence of different values within the same group of 'A' and 'B' with case_when
library(tidyverse)
master %>%
mutate(rn = row_number()) %>%
group_by(A, B) %>%
filter(n() > 1) %>%
summarise(n = n(),
locs = toString(rn),
Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D',
n_distinct(C) > 1 ~ 'C',
n_distinct(D) > 1 ~ 'D',
TRUE ~ 'Same'))
# A tibble: 2 x 4
# Groups: A [?]
# A B n locs different
# <dbl> <dbl> <int> <chr> <chr>
#1 2 3 2 3, 4 C, D
#2 4 6 2 7, 8 C, D
Update
Based on the comments to include 'position'
master %>%
mutate(rn = row_number()) %>%
group_by(A, B) %>%
filter(n() > 1) %>%
mutate(position = toString(rn[!(duplicated(paste(C, D))|
duplicated(paste(C, D), fromLast = TRUE))])) %>%
summarise(n = n(),
locs = toString(rn),
Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D',
n_distinct(C) > 1 ~ 'C',
n_distinct(D) > 1 ~ 'D',
TRUE ~ 'Same'),
position = first(position))
1
great answer tnx!
– firmo23
Jan 3 at 18:35
is it possible to add another column with the initial position of the "Different"column. It will be 3 for C and 4 for D?
– firmo23
Jan 3 at 18:49
1
In this case, it is the same aslocs. With a different example, it may become more clear about the intended output
– akrun
Jan 3 at 18:50
I edited properly
– firmo23
Jan 3 at 19:13
Now the position is same as the locs but in this case I want the column position. Do u want me to create a new Q since the original is answered?
– firmo23
Jan 4 at 12:50
|
show 1 more comment
If we need the row index, then create a sequence column ('rn'), grouped by the columns of interest, keep only groups that have number of rows greater than 1, summarise to get the number of rows (n()) as well as the pasteed index of the sequence of rows for the group. Regarding the 'different' column, it is not entirely clear about the logic. Here, is one implemented based on the occurrence of different values within the same group of 'A' and 'B' with case_when
library(tidyverse)
master %>%
mutate(rn = row_number()) %>%
group_by(A, B) %>%
filter(n() > 1) %>%
summarise(n = n(),
locs = toString(rn),
Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D',
n_distinct(C) > 1 ~ 'C',
n_distinct(D) > 1 ~ 'D',
TRUE ~ 'Same'))
# A tibble: 2 x 4
# Groups: A [?]
# A B n locs different
# <dbl> <dbl> <int> <chr> <chr>
#1 2 3 2 3, 4 C, D
#2 4 6 2 7, 8 C, D
Update
Based on the comments to include 'position'
master %>%
mutate(rn = row_number()) %>%
group_by(A, B) %>%
filter(n() > 1) %>%
mutate(position = toString(rn[!(duplicated(paste(C, D))|
duplicated(paste(C, D), fromLast = TRUE))])) %>%
summarise(n = n(),
locs = toString(rn),
Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D',
n_distinct(C) > 1 ~ 'C',
n_distinct(D) > 1 ~ 'D',
TRUE ~ 'Same'),
position = first(position))
If we need the row index, then create a sequence column ('rn'), grouped by the columns of interest, keep only groups that have number of rows greater than 1, summarise to get the number of rows (n()) as well as the pasteed index of the sequence of rows for the group. Regarding the 'different' column, it is not entirely clear about the logic. Here, is one implemented based on the occurrence of different values within the same group of 'A' and 'B' with case_when
library(tidyverse)
master %>%
mutate(rn = row_number()) %>%
group_by(A, B) %>%
filter(n() > 1) %>%
summarise(n = n(),
locs = toString(rn),
Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D',
n_distinct(C) > 1 ~ 'C',
n_distinct(D) > 1 ~ 'D',
TRUE ~ 'Same'))
# A tibble: 2 x 4
# Groups: A [?]
# A B n locs different
# <dbl> <dbl> <int> <chr> <chr>
#1 2 3 2 3, 4 C, D
#2 4 6 2 7, 8 C, D
Update
Based on the comments to include 'position'
master %>%
mutate(rn = row_number()) %>%
group_by(A, B) %>%
filter(n() > 1) %>%
mutate(position = toString(rn[!(duplicated(paste(C, D))|
duplicated(paste(C, D), fromLast = TRUE))])) %>%
summarise(n = n(),
locs = toString(rn),
Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D',
n_distinct(C) > 1 ~ 'C',
n_distinct(D) > 1 ~ 'D',
TRUE ~ 'Same'),
position = first(position))
edited Jan 3 at 19:27
answered Jan 3 at 18:06
akrunakrun
422k13209285
422k13209285
1
great answer tnx!
– firmo23
Jan 3 at 18:35
is it possible to add another column with the initial position of the "Different"column. It will be 3 for C and 4 for D?
– firmo23
Jan 3 at 18:49
1
In this case, it is the same aslocs. With a different example, it may become more clear about the intended output
– akrun
Jan 3 at 18:50
I edited properly
– firmo23
Jan 3 at 19:13
Now the position is same as the locs but in this case I want the column position. Do u want me to create a new Q since the original is answered?
– firmo23
Jan 4 at 12:50
|
show 1 more comment
1
great answer tnx!
– firmo23
Jan 3 at 18:35
is it possible to add another column with the initial position of the "Different"column. It will be 3 for C and 4 for D?
– firmo23
Jan 3 at 18:49
1
In this case, it is the same aslocs. With a different example, it may become more clear about the intended output
– akrun
Jan 3 at 18:50
I edited properly
– firmo23
Jan 3 at 19:13
Now the position is same as the locs but in this case I want the column position. Do u want me to create a new Q since the original is answered?
– firmo23
Jan 4 at 12:50
1
1
great answer tnx!
– firmo23
Jan 3 at 18:35
great answer tnx!
– firmo23
Jan 3 at 18:35
is it possible to add another column with the initial position of the "Different"column. It will be 3 for C and 4 for D?
– firmo23
Jan 3 at 18:49
is it possible to add another column with the initial position of the "Different"column. It will be 3 for C and 4 for D?
– firmo23
Jan 3 at 18:49
1
1
In this case, it is the same as
locs. With a different example, it may become more clear about the intended output– akrun
Jan 3 at 18:50
In this case, it is the same as
locs. With a different example, it may become more clear about the intended output– akrun
Jan 3 at 18:50
I edited properly
– firmo23
Jan 3 at 19:13
I edited properly
– firmo23
Jan 3 at 19:13
Now the position is same as the locs but in this case I want the column position. Do u want me to create a new Q since the original is answered?
– firmo23
Jan 4 at 12:50
Now the position is same as the locs but in this case I want the column position. Do u want me to create a new Q since the original is answered?
– firmo23
Jan 4 at 12:50
|
show 1 more comment
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54026582%2fextract-row-indices-where-duplicated-pairs-exist-and-the-values-of-these-rows-wh%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown

1
Why is it 3, 4 for 2nd row?
– akrun
Jan 3 at 19:24
because it is different by C and D which are in the 3rd and 4th position of the master dataset
– firmo23
Jan 4 at 12:36