Is there a variable that contains the current row for filtering a subset in R?

I want to filter a large data frame that contains latitude and longitude columns. I want to use distHaversine() from the geosphere package, which computes the distance between two points given by their coordinates. With that, I want to filter out measurements that are far away from a city.
The function expects two vectors, one reference point and one point to test, containing two values each (lat, lon).

Is there a generic variable I can use to take lat and lon from the current row of my data frame, something like distHaversine(c(8.682127, 50.110922), c([i,lat], [i,lon]))?

My workaround is to filter by concrete values of latitude and longitude.
Thanks for your help :)

Using lat and lon directly leads to an error, since the function calculates the distance for one point, not for a whole set. So I need to pass one row at a time to this function.

    Evaluation error: Wrong length for a vector, should be 2.

    library(geosphere)
    library(readr)

    ff <- function(x, pos) subset(x, distHaversine(c(8.682127, 50.110922), c(lat, lon)) < 60000, select = c(lat, lon, timestamp, value))

    yy <- readr::read_csv2_chunked("data.csv", DataFrameCallback$new(ff),
        chunk_size = 100000, col_names = TRUE)

Edit: for some reason, lat and lon are stored as integers rather than doubles. I noticed that and divide them by 1000 for the calculations.

    dput(head(yy, 20))

    structure(list(lat = c(52023, 42139, 43762, 52023, 54644, 52023,
    52023, 51278, -32879, 52023, 51434, 52023, 42139, 43762, 52023,
    52023, 52023, -32879, 52023, 52023), lon = c(4692, 24794, -79185,
    4692, 9760, 4692, 4692, 12588, -68877, 4692, 6115, 4692, 24794,
    -79185, 4692, 4692, 4692, -68877, 4692, 4692), timestamp = structure(c(1538352021,
    1538352035, 1538352044, 1538352050, 1538352061, 1538352080, 1538352110,
    1538352110, 1538352132, 1538352140, 1538352147, 1538352170, 1538352183,
    1538352192, 1538352200, 1538352230, 1538352260, 1538352283, 1538352290,
    1538352320), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
        P1 = c("1.2", "10.80", "3.00", "1.7", "12.3", "2.0", "1.0",
        "4.75", "1.00", "1.0", "19.3", "1.8", "11.60", "4.00", "1.0",
        "0.8", "1.0", "2.00", "1.1", "1.3")), .Names = c("lat", "lon",
    "timestamp", "P1"), row.names = c(NA, -20L), class = c("tbl_df",
    "tbl", "data.frame"))

The result should be a filtered data frame:

    lat     lon     timestamp    P1
    9,5     50,5    1.1.2019     123
    8,8     49,3    1.1.2019     23
    ...

edited Jan 2 at 20:48

asked Jan 2 at 20:09

Clemensiver

112

Can you post sample data? Please edit the question with the output of dput(yy). Or, if it is too big with the output of dput(head(yy, 20)).

– Rui Barradas
Jan 2 at 20:29

r subset readr

1 Answer

Here's a tidyverse approach that uses the pmap_df function to run distHaversine on each pair of lat/lon coordinates and return a data frame with the results. You can then filter the output for points that are within some distance from each other.

    library(geosphere)
    library(tidyverse)

    # Fake data
    set.seed(2)
    dat = data.frame(lon=runif(5,-180,180), lat = runif(5,-90,90))

    dist = pmap_df(data.frame(t(combn(1:nrow(dat), 2))),
                   ~data.frame(dat[.x, ] %>% set_names(c("lon1","lat1")),
                               dat[.y, ] %>% set_names(c("lon2", "lat2")),
                               dist=distHaversine(dat[.x, ], dat[.y, ])))

    dist

             lon1       lat1       lon2       lat2     dist
    1  -113.44239  79.825493   72.85465 -66.751384 18570291
    2  -113.44239  79.825493   26.39748  60.020787  4259930
    3  -113.44239  79.825493 -119.50131  -5.756667  9533243
    4  -113.44239  79.825493  159.78216   8.997074  8969682
    5    72.85465 -66.751384   26.39748  60.020787 14616198
    6    72.85465 -66.751384 -119.50131  -5.756667 11905205
    7    72.85465 -66.751384  159.78216   8.997074 10803902
    8    26.39748  60.020787 -119.50131  -5.756667 13347748
    9    26.39748  60.020787  159.78216   8.997074 11326140
    10 -119.50131  -5.756667  159.78216   8.997074  9104543
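As a rough sketch of the filtering step mentioned above, the distance column can be compared against a threshold with ordinary logical indexing. The `pair_dist` data frame below simply re-enters the distances printed in the output above so the snippet runs on its own; the name `pair_dist`, the `pair` column, and the 9,500 km threshold are illustrative assumptions, not part of the original answer.

```r
# Distances in metres, copied from the printed output above;
# `pair` is the row number of each coordinate pair.
pair_dist <- data.frame(
  pair = 1:10,
  dist = c(18570291, 4259930, 9533243, 8969682, 14616198,
           11905205, 10803902, 13347748, 11326140, 9104543)
)

max_dist <- 9.5e6  # hypothetical threshold in metres

# Keep only the pairs that are closer together than the threshold.
close_pairs <- pair_dist[pair_dist$dist < max_dist, ]
close_pairs
```

With the tidyverse already loaded, `filter(pair_dist, dist < max_dist)` should give the same rows.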

If you just want a quick way to get distances between a given lat/lon coordinate (to make things concrete, let's say the coordinates in the second row of the data frame) and all the other coordinates, here's an approach using the base R apply function:

    apply(dat[-2, ], 1, function(ll) distHaversine(dat[2,], ll))

           1        3        4        5
    18570291 14616198 11905205 10803902
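For the filtering problem in the question, a row-by-row loop may not be needed at all: distHaversine() accepts a two-column matrix of points (longitude first, latitude second) and returns one distance per row. The error in the question most likely arises because c(lat, lon) glues both columns into one long vector instead of a two-column matrix. The following is a minimal sketch under stated assumptions: the sample rows, the helper name `within_radius`, and the choice of test coordinates are hypothetical, and the coordinates are assumed to be stored as degrees times 1000, as described in the question's edit.

```r
library(geosphere)

# Hypothetical sample in the question's layout: integer coordinates
# that hold degrees * 1000.
dat <- data.frame(
  lat = c(50111L, 49873L, 52023L, -32879L),
  lon = c(8682L, 8651L, 4692L, -68877L),
  P1  = c("1.2", "10.80", "3.00", "1.7"),
  stringsAsFactors = FALSE
)

# Reference point as c(lon, lat), the order geosphere expects.
centre <- c(8.682127, 50.110922)

# Hypothetical helper: keep the rows of x within radius_m metres of centre.
within_radius <- function(x, centre, radius_m) {
  pts <- cbind(x$lon, x$lat) / 1000   # n x 2 matrix in degrees, longitude first
  d <- distHaversine(centre, pts)     # one distance in metres per row
  x[d < radius_m, , drop = FALSE]
}

near <- within_radius(dat, centre, 60000)
near
```

The body of `within_radius` could presumably be reused inside the `function(x, pos)` callback passed to DataFrameCallback$new(), so that each chunk is filtered as it is read; that part is untested here.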

edited Jan 2 at 21:10

answered Jan 2 at 20:59

eipi10

60.2k16109165
