r - Create a vector function to clean address data for Houston Crime Data


There are tutorials on mapping Houston crime data, but no easy examples of how to clean the raw data provided by HPD. https://github.com/hadley/ggplot2/wiki/crime-in-downtown-houston,-texas-:-combining-ggplot2-and-google-maps

    d <- structure(list(
      blockrange = c("5400-5499", "3700-3799", "2200-2299", "1000-1099",
                     "1200-1299", "unk", "1900-1999", "500-599", "1200-1299"),
      streetname = c("bell", "bell", "bell", "bell", "bell", "bell",
                     "bell", "bell", "bell"),
      date = c("4/28/2015", "4/11/2015", "4/26/2015", "4/9/2015", "4/9/2015",
               "4/21/2015", "4/26/2015", "4/26/2015", "4/17/2015")),
      row.names = c(60L, 75L, 88L, 4972L, 4990L, 5096L, 5098L, 5099L, 5155L),
      class = "data.frame",
      .names = c("blockrange", "streetname", "date"))

This returns the lat and lon:

    x <- ggeocode("1950 bell st, houston, tx")
    #[1]  29.74800 -95.35926

However, I need a function to geocode the entire database and add columns for lon and lat.

An example of a selection of the finished data:

    structure(list(
      address = c("9650 marlive ln", "4750 telephone rd", "5050 wickview ln",
                  "1050 ashland st", "8350 canyon", "9350 rowan ln",
                  "2550 southmore blvd", "6350 rupley cir", "5050 georgi ln",
                  "10750 briar forest dr"),
      lon = c(-95.4373883, -95.2988769, -95.455864, -95.4033373, -95.3779081,
              -95.5483009, -95.3733977, -95.3156032, -95.4665841, -95.565934),
      lat = c(29.6779015, 29.6917121, 29.5992174, 29.7902425, 29.6706341,
              29.7022336, 29.7198936, 29.6902746, 29.8297359, 29.747596)),
      row.names = 82729:82738,
      class = "data.frame",
      .names = c("address", "lon", "lat"))

Here are the functions for geocoding:

    library(RCurl)
    library(RJSONIO)
    library(dplyr)
    library(gdata)

    # Build the request URL for one address
    construct.geocode.url <- function(address, return.call = "json", sensor = "false") {
      root <- "http://maps.google.com/maps/api/geocode/"
      u <- paste(root, return.call, "?address=", address, "&sensor=", sensor, sep = "")
      return(URLencode(u))
    }

    # Geocode a single address; returns c(lat, lng), or c(NA, NA) on failure
    ggeocode <- function(address, verbose = FALSE) {
      if (verbose) cat(address, "\n")
      u <- construct.geocode.url(address)
      doc <- getURL(u)
      x <- fromJSON(doc, simplify = FALSE)
      if (x$status == "OK") {
        lat <- x$results[[1]]$geometry$location$lat
        lng <- x$results[[1]]$geometry$location$lng
        return(c(lat, lng))
      } else {
        return(c(NA, NA))
      }
    }

How can I write a function, using dplyr or another method, that adds three more columns to the output: [address, lon, lat]?

i.e. something like this pseudocode:

    data.frame <- mutate(d, address = paste(convertblockrange(blockrange), streetname, "houston, tx"), lon = geocode(address)[0], lat = geocode(address)[1])

This is the blocking point of the question:

    # Function to convert "2200-2299" to the integer 2250, i.e. find the middle of the block.
    library(stringr)

    convertblockrange <- function(blockrange){
      m <- unlist(str_split(d$blockrange, "-"))
      m2 <- mean(c(as.numeric(m[1]), as.numeric(m[2]))) + .5
      m2
    }

You can calculate the mean of the block range by splitting the range and averaging:

e.g.

    x <- '5400-5499'
    mean(as.numeric(strsplit(x, '-')[[1]]))
    # 5449.5
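The one-liner above handles a single string and returns NA (with a warning) for a value such as "unk". As a rough base R sketch of the same split-and-average idea applied to a whole column: the name `block_midpoint` is made up for illustration, and rounding down to a whole house number is my assumption about which address you want.

```r
# Hypothetical helper (not from any package): midpoint of "low-high"
# block ranges for a whole character vector.
block_midpoint <- function(blockrange) {
  # Split each element on "-"; values such as "unk" yield a single piece.
  parts <- strsplit(as.character(blockrange), "-", fixed = TRUE)
  vapply(parts, function(p) {
    if (length(p) != 2) return(NA_real_)    # malformed range -> NA
    nums <- suppressWarnings(as.numeric(p))  # non-numeric pieces -> NA
    floor(mean(nums))                        # assumption: round down
  }, numeric(1))
}

mid <- block_midpoint(c("5400-5499", "unk", "500-599"))
```

Rows that come back as NA can then be filtered out or handled separately.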

To scale up, you can use separate from the tidyr package. It does cool things like automagically putting the min/max of the blockrange into new columns and converting the types from string to numeric (convert=TRUE). Filter out the "unk" addresses first; you'll have to handle them separately.

    library(dplyr)
    library(tidyr)

    d %>%
      filter(blockrange != "unk") %>%
      # split blockrange into blockmin & blockmax
      separate(blockrange, c("blockmin", "blockmax"), sep = "-",
               convert = TRUE, remove = FALSE) %>%
      # calculate the average (rounded down) and the address
      mutate(block = floor((blockmin + blockmax) / 2),
             address = paste(block, streetname))

    #   blockrange blockmin blockmax streetname      date block   address
    # 1  5400-5499     5400     5499       bell 4/28/2015  5449 5449 bell
    # 2  3700-3799     3700     3799       bell 4/11/2015  3749 3749 bell
    # 3  2200-2299     2200     2299       bell 4/26/2015  2249 2249 bell
    # 4  1000-1099     1000     1099       bell  4/9/2015  1049 1049 bell
    # 5  1200-1299     1200     1299       bell  4/9/2015  1249 1249 bell
    # 6  1900-1999     1900     1999       bell 4/26/2015  1949 1949 bell
    # 7    500-599      500      599       bell 4/26/2015   549  549 bell
    # 8  1200-1299     1200     1299       bell 4/17/2015  1249 1249 bell

Then you can %>% group_by(address) to get the unique addresses and geocode those (though I'd think about how to restrict the maximum number of requests etc. here).
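To illustrate the "geocode each unique address once" idea without dplyr, here is a minimal base R sketch. `geocode_unique` and `fake_geocode` are names I made up; `fake_geocode` is an offline stand-in that returns invented coordinates in the same c(lat, lng) shape as ggeocode, so swap in the real function when you run it against the service.

```r
# Hypothetical offline stand-in for ggeocode: returns c(lat, lng).
# The numbers are invented, not real coordinates.
fake_geocode <- function(address) c(29.7, -95.3) + nchar(address) / 1000

# Geocode each distinct address once, then map results back to every row.
geocode_unique <- function(df, geocoder) {
  addr <- unique(df$address)                       # one request per address
  coords <- t(vapply(addr, geocoder, numeric(2), USE.NAMES = FALSE))
  idx <- match(df$address, addr)                   # row -> unique address
  df$lat <- coords[idx, 1]
  df$lon <- coords[idx, 2]
  df
}

df <- data.frame(address = c("1249 bell", "549 bell", "1249 bell"),
                 stringsAsFactors = FALSE)
geocoded <- geocode_unique(df, fake_geocode)
```

With repeated addresses this sends two requests rather than three, which matters once you have thousands of rows on the same blocks.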

With regards to adding the output lat and lon columns at once, I don't think dplyr can do that yet (see this feature request).

If you want to use dplyr syntax here, your best bet is to change ggeocode to be vectorised, e.g.

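    # Takes a one-column data frame of addresses, returns a data frame of lat/lng
    ggeocode2 <- function (addresses) {
        x <- data.frame(t(sapply(addresses[[1]], ggeocode)), row.names=NULL)
        names(x) <- c('lat', 'lng')
        x
    }

    # d2 is the data frame that holds the address column
    d2 %>%
       select(address) %>%
       ggeocode2 %>%
       bind_cols(d2, .)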

But really, I think you should skip the dplyr sugar for this particular step and do a manual loop and cbind the result, which gives you greater control over request limiting.
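A rough sketch of what such a manual loop might look like, in base R. The function name `geocode_loop` and the `pause` and `max_requests` arguments are hypothetical knobs I added for illustration; choose values that fit whatever limits the geocoding service imposes. The `stub` geocoder is an offline stand-in with made-up coordinates so that the example runs without network access.

```r
# geocoder: any function returning c(lat, lng), e.g. ggeocode from the question.
# pause / max_requests: hypothetical throttling parameters (assumptions).
geocode_loop <- function(addresses, geocoder, pause = 0.2, max_requests = 1000) {
  n <- length(addresses)
  coords <- matrix(NA_real_, nrow = n, ncol = 2,
                   dimnames = list(NULL, c("lat", "lon")))
  for (i in seq_len(min(n, max_requests))) {
    # tryCatch keeps one failed request from aborting the whole run
    coords[i, ] <- tryCatch(geocoder(addresses[i]),
                            error = function(e) c(NA_real_, NA_real_))
    Sys.sleep(pause)  # crude rate limiting between requests
  }
  coords  # rows beyond max_requests stay NA
}

addresses <- c("5449 bell", "3749 bell", "2249 bell")
stub <- function(address) c(29.75, -95.36)  # offline stand-in, not real data
result <- cbind(data.frame(address = addresses, stringsAsFactors = FALSE),
                geocode_loop(addresses, stub, pause = 0))
```

Rows left as NA (failed or over the cap) can be retried in a later run.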

