Is there a variable that contains the current row for filtering a subset in R?
I want to filter a large data frame that contains latitude and longitude columns. I want to use distHaversine(), which computes the distance between two points given their latitude and longitude, to filter out measurements that are far away from a city.
The function expects two vectors, a reference point and the point to compare, each containing two values (lat, lon).
Is there a generic variable that refers to the current row, so that I can take lat and lon from my data frame, like distHaversine(c(8.682127, 50.110922), c([i,lat], [i,lon]))?
My current workaround is to filter by concrete values of latitude and longitude.
Thanks for the help :)
Passing the lat and lon columns directly leads to an error, because the function calculates the distance for one point, not for a whole set. So I need to pass one row at a time to this function.
Evaluation error: Wrong length for a vector, should be 2.
library(geosphere)
library(readr)
ff <- function(x, pos) subset(x, distHaversine(c(8.682127, 50.110922), c(lat, lon))<60000, select= c(lat, lon, timestamp, value ))
yy <- readr::read_csv2_chunked("data.csv", DataFrameCallback$new(ff),
chunk_size = 100000, col_names = TRUE)
Edit: for some reason, lat and lon are stored as integers, not doubles. I noticed that and divide by 1000 for the calculations.
dput(head(yy, 20))
structure(list(lat = c(52023, 42139, 43762, 52023, 54644, 52023,
52023, 51278, -32879, 52023, 51434, 52023, 42139, 43762, 52023,
52023, 52023, -32879, 52023, 52023), lon = c(4692, 24794, -79185,
4692, 9760, 4692, 4692, 12588, -68877, 4692, 6115, 4692, 24794,
-79185, 4692, 4692, 4692, -68877, 4692, 4692), timestamp = structure(c(1538352021,
1538352035, 1538352044, 1538352050, 1538352061, 1538352080, 1538352110,
1538352110, 1538352132, 1538352140, 1538352147, 1538352170, 1538352183,
1538352192, 1538352200, 1538352230, 1538352260, 1538352283, 1538352290,
1538352320), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
P1 = c("1.2", "10.80", "3.00", "1.7", "12.3", "2.0", "1.0",
"4.75", "1.00", "1.0", "19.3", "1.8", "11.60", "4.00", "1.0",
"0.8", "1.0", "2.00", "1.1", "1.3")), .Names = c("lat", "lon",
"timestamp", "P1"), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
The result should be a filtered data frame:
lat lon timestamp P1
9,5 50,5 1.1.2019 123
8,8 49,3 1.1.2019 23
...
r subset readr
Can you post sample data? Please edit the question with the output of dput(yy). Or, if it is too big, with the output of dput(head(yy, 20)).
– Rui Barradas, Jan 2 at 20:29
asked Jan 2 at 20:09 by Clemensiver
edited Jan 2 at 20:48
1 Answer
Here's a tidyverse approach that uses the pmap_df function to run distHaversine on each pair of lat/lon coordinates and return a data frame with the results. You can then filter the output for points that are within some distance from each other.
library(geosphere)
library(tidyverse)
# Fake data
set.seed(2)
dat = data.frame(lon=runif(5,-180,180), lat = runif(5,-90,90))
dist = pmap_df(data.frame(t(combn(1:nrow(dat), 2))),
               ~data.frame(dat[.x, ] %>% set_names(c("lon1","lat1")),
                           dat[.y, ] %>% set_names(c("lon2", "lat2")),
                           dist=distHaversine(dat[.x, ], dat[.y, ])))
dist
lon1 lat1 lon2 lat2 dist
1 -113.44239 79.825493 72.85465 -66.751384 18570291
2 -113.44239 79.825493 26.39748 60.020787 4259930
3 -113.44239 79.825493 -119.50131 -5.756667 9533243
4 -113.44239 79.825493 159.78216 8.997074 8969682
5 72.85465 -66.751384 26.39748 60.020787 14616198
6 72.85465 -66.751384 -119.50131 -5.756667 11905205
7 72.85465 -66.751384 159.78216 8.997074 10803902
8 26.39748 60.020787 -119.50131 -5.756667 13347748
9 26.39748 60.020787 159.78216 8.997074 11326140
10 -119.50131 -5.756667 159.78216 8.997074 9104543
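For the pairwise case, geosphere also provides distm(), which returns the full distance matrix in one call. A minimal sketch, re-creating the same fake data; the cutoff of 1e7 metres is an arbitrary value chosen for illustration, and the names d, cutoff and close_pairs are placeholders:

```r
library(geosphere)

# Same fake data as in the pmap_df example
set.seed(2)
dat <- data.frame(lon = runif(5, -180, 180), lat = runif(5, -90, 90))

# 5 x 5 matrix of Haversine distances in metres; entry [i, j] is the
# distance between rows i and j (columns must be in lon, lat order)
d <- distm(as.matrix(dat), fun = distHaversine)

# Index pairs (row < col) that are closer than the illustrative cutoff
cutoff <- 1e7
close_pairs <- which(d < cutoff & upper.tri(d), arr.ind = TRUE)
close_pairs
```

The upper.tri() mask keeps each pair once and drops the zero diagonal, so the result lists row indices of dat rather than repeating the coordinates.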
If you just want a quick way to get distances between a given lat/lon coordinate (to make things concrete, let's say the coordinates in the second row of the data frame) and all the other coordinates, here's an approach using the base R apply function:
apply(dat[-2, ], 1, function(ll) distHaversine(dat[2,], ll))
1 3 4 5
18570291 14616198 11905205 10803902
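For the filtering problem in the question, a per-row loop may not be needed at all: distHaversine() accepts a two-column matrix of points (longitude first, then latitude) as its second argument and returns one distance per row. The error in the question most likely comes from c(lat, lon), which concatenates both columns into one long vector instead of pairing them row by row. A minimal sketch under those assumptions; the function name keep_near, the max_dist argument and the last sample row are made up for illustration, and the division by 1000 follows the question's note about integer-scaled coordinates:

```r
library(geosphere)

# Reference point as c(lon, lat), taken from the question
ref <- c(8.682127, 50.110922)

# Callback-shaped filter: x is a data frame (or one chunk), pos is unused
keep_near <- function(x, pos = NULL, max_dist = 60000) {
  # Pair the columns row by row; coordinates are stored as degrees * 1000
  pts <- cbind(x$lon / 1000, x$lat / 1000)
  # One Haversine distance (in metres) per row, compared against max_dist
  x[distHaversine(ref, pts) < max_dist, , drop = FALSE]
}

# A few rows from the question's sample plus one made-up row near the reference
sample_rows <- data.frame(
  lat = c(52023, 42139, 51434, 50000),
  lon = c(4692, 24794, 6115, 8500),
  P1  = c("1.2", "10.80", "19.3", "2.0")
)

near <- keep_near(sample_rows)
near
```

Because keep_near takes (x, pos), it could presumably be passed to DataFrameCallback$new() in the chunked read from the question in place of ff.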
answered Jan 2 at 20:59 by eipi10
edited Jan 2 at 21:10