Is there a variable that contains the current row for filtering a subset in R?












I want to filter a large data frame that contains latitude and longitude columns. I want to use distHaversine() from the geosphere package, which computes the distance between two points given by their longitude and latitude. With that, I want to filter out measurements that are far away from a city.
The function expects two vectors, one reference point and one specific point, each containing two values (lon, lat).



Is there a generic variable I can use to take lat and lon from the current row of my data frame, something like distHaversine(c(8.682127, 50.110922), c([i,lat], [i,lon]))?



My workaround is to filter by fixed bounds on latitude and longitude.
Thanks for your help :)



Using lat and lon directly leads to the error below, since (as far as I can tell) the function calculates the distance for one point, not for a whole set. So it seems I always need to pass one row at a time to this function.



Evaluation error: Wrong length for a vector, should be 2.



library(geosphere)   
library(readr)


ff <- function(x, pos) {
  subset(x,
         distHaversine(c(8.682127, 50.110922), c(lat, lon)) < 60000,
         select = c(lat, lon, timestamp, value))
}


yy <- readr::read_csv2_chunked("data.csv", DataFrameCallback$new(ff),
                               chunk_size = 100000, col_names = TRUE)


Edit: for some reason, lat and lon are stored as integers rather than doubles. I noticed that and divide them by 1000 for the calculations.



    dput(head(yy, 20))
structure(list(lat = c(52023, 42139, 43762, 52023, 54644, 52023,
52023, 51278, -32879, 52023, 51434, 52023, 42139, 43762, 52023,
52023, 52023, -32879, 52023, 52023), lon = c(4692, 24794, -79185,
4692, 9760, 4692, 4692, 12588, -68877, 4692, 6115, 4692, 24794,
-79185, 4692, 4692, 4692, -68877, 4692, 4692), timestamp = structure(c(1538352021,
1538352035, 1538352044, 1538352050, 1538352061, 1538352080, 1538352110,
1538352110, 1538352132, 1538352140, 1538352147, 1538352170, 1538352183,
1538352192, 1538352200, 1538352230, 1538352260, 1538352283, 1538352290,
1538352320), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
P1 = c("1.2", "10.80", "3.00", "1.7", "12.3", "2.0", "1.0",
"4.75", "1.00", "1.0", "19.3", "1.8", "11.60", "4.00", "1.0",
"0.8", "1.0", "2.00", "1.1", "1.3")), .Names = c("lat", "lon",
"timestamp", "P1"), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))


The result should be a filtered data frame:



lat     lon     timestamp    P1
9,5     50,5    1.1.2019     123
8,8     49,3    1.1.2019     23
...









  • Can you post sample data? Please edit the question with the output of dput(yy) or, if that is too big, with the output of dput(head(yy, 20)).

    – Rui Barradas
    Jan 2 at 20:29
















r subset readr






asked Jan 2 at 20:09, edited Jan 2 at 20:48
Clemensiver













1 Answer
Here's a tidyverse approach that uses the pmap_df function to run distHaversine on each pair of lat/lon coordinates and return a data frame with the results. You can then filter the output for points that are within some distance from each other.



library(geosphere)
library(tidyverse)

# Fake data
set.seed(2)
dat = data.frame(lon=runif(5,-180,180), lat = runif(5,-90,90))

dist = pmap_df(data.frame(t(combn(1:nrow(dat), 2))),
               ~data.frame(dat[.x, ] %>% set_names(c("lon1","lat1")),
                           dat[.y, ] %>% set_names(c("lon2", "lat2")),
                           dist=distHaversine(dat[.x, ], dat[.y, ])))

dist



         lon1       lat1       lon2       lat2     dist
1  -113.44239  79.825493   72.85465 -66.751384 18570291
2  -113.44239  79.825493   26.39748  60.020787  4259930
3  -113.44239  79.825493 -119.50131  -5.756667  9533243
4  -113.44239  79.825493  159.78216   8.997074  8969682
5    72.85465 -66.751384   26.39748  60.020787 14616198
6    72.85465 -66.751384 -119.50131  -5.756667 11905205
7    72.85465 -66.751384  159.78216   8.997074 10803902
8    26.39748  60.020787 -119.50131  -5.756667 13347748
9    26.39748  60.020787  159.78216   8.997074 11326140
10 -119.50131  -5.756667  159.78216   8.997074  9104543
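
Filtering the pairwise result for nearby points could look like the sketch below. It rebuilds the pairwise table in base R rather than with pmap_df, assuming that distHaversine accepts two-column (lon, lat) matrices and returns one distance per row pair. The names idx, p1, p2, pair_dist and close_pairs, and the 5,000 km cutoff, are illustrative choices and not part of the approach above.

```r
library(geosphere)

# Same fake data as above
set.seed(2)
dat <- data.frame(lon = runif(5, -180, 180), lat = runif(5, -90, 90))

# All unordered row pairs (i, j) with i < j, one pair per row
idx <- t(combn(nrow(dat), 2))

# Two-column (lon, lat) matrices holding the first and second point of each pair
p1 <- as.matrix(dat[idx[, 1], c("lon", "lat")])
p2 <- as.matrix(dat[idx[, 2], c("lon", "lat")])

# One great-circle distance (in metres) per pair
pair_dist <- data.frame(i = idx[, 1], j = idx[, 2],
                        dist = distHaversine(p1, p2))

# Keep only pairs closer than an arbitrary 5,000 km cutoff
close_pairs <- subset(pair_dist, dist < 5e6)
close_pairs
```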



If you just want a quick way to get distances between a given lat/lon coordinate (to make things concrete, let's say the coordinates in the second row of the data frame) and all the other coordinates, here's an approach using the base R apply function:



apply(dat[-2, ], 1, function(ll) distHaversine(dat[2,], ll))



       1        3        4        5 
18570291 14616198 11905205 10803902
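
For the filtering in the question itself, a per-row variable is probably not needed: as far as I know, distHaversine also accepts a two-column matrix of (lon, lat) points and returns one distance per row, so cbind(lon, lat) can replace c(lat, lon). The sketch below applies that to a single chunk. The names filter_near, max_m and sample_df are hypothetical, the sample rows are made up, and the division by 1000 assumes the integer scaling described in the question's edit.

```r
library(geosphere)

# Reference point in (lon, lat) order, taken from the question
ref <- c(8.682127, 50.110922)

# Hypothetical chunk callback: keep rows within `max_m` metres of `ref`.
# Assumes lat/lon are integers scaled by 1000, as the question's edit describes.
filter_near <- function(x, pos = NULL, max_m = 60000) {
  pts <- cbind(x$lon / 1000, x$lat / 1000)  # two-column matrix, longitude first
  d <- distHaversine(ref, pts)              # one distance per row of x
  x[d < max_m, c("lat", "lon", "timestamp", "P1")]
}

# Made-up sample in the same layout as the dput() output in the question
sample_df <- data.frame(
  lat = c(50111, 52023, 50000),
  lon = c(8682, 4692, 8900),
  timestamp = as.POSIXct(c(1538352021, 1538352035, 1538352044),
                         origin = "1970-01-01", tz = "UTC"),
  P1 = c("1.2", "10.80", "3.00")
)

near <- filter_near(sample_df)
near

# With readr loaded, the same function could serve as the chunk callback:
# yy <- readr::read_csv2_chunked("data.csv", readr::DataFrameCallback$new(filter_near),
#                                chunk_size = 100000, col_names = TRUE)
```

Because the function takes (x, pos), it has the shape that DataFrameCallback$new() expects, so the chunked read from the question should only need the callback swapped.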






answered Jan 2 at 20:59, edited Jan 2 at 21:10
eipi10































