There are tutorials on mapping Houston crime data, but no easy examples of how to clean the raw data provided by HPD. https://github.com/hadley/ggplot2/wiki/crime-in-downtown-houston,-texas-:-combining-ggplot2-and-google-maps

Here is a sample of the raw data:
    d <- structure(list(
        blockrange = c("5400-5499", "3700-3799", "2200-2299", "1000-1099",
                       "1200-1299", "unk", "1900-1999", "500-599", "1200-1299"),
        streetname = c("bell", "bell", "bell", "bell", "bell", "bell", "bell",
                       "bell", "bell"),
        date = c("4/28/2015", "4/11/2015", "4/26/2015", "4/9/2015", "4/9/2015",
                 "4/21/2015", "4/26/2015", "4/26/2015", "4/17/2015")),
        row.names = c(60L, 75L, 88L, 4972L, 4990L, 5096L, 5098L, 5099L, 5155L),
        class = "data.frame",
        .Names = c("blockrange", "streetname", "date"))

This returns the latitude and longitude for a single address:
    x <- ggeocode("1950 bell st, houston, tx")
    #[1]  29.74800 -95.35926

However, I need a function that geocodes the entire database and adds the columns lon and lat.
Here is an example of a selection of the finished data:
    structure(list(
        address = c("9650 marlive ln", "4750 telephone rd", "5050 wickview ln",
                    "1050 ashland st", "8350 canyon", "9350 rowan ln",
                    "2550 southmore blvd", "6350 rupley cir", "5050 georgi ln",
                    "10750 briar forest dr"),
        lon = c(-95.4373883, -95.2988769, -95.455864, -95.4033373, -95.3779081,
                -95.5483009, -95.3733977, -95.3156032, -95.4665841, -95.565934),
        lat = c(29.6779015, 29.6917121, 29.5992174, 29.7902425, 29.6706341,
                29.7022336, 29.7198936, 29.6902746, 29.8297359, 29.747596)),
        row.names = 82729:82738,
        class = "data.frame",
        .Names = c("address", "lon", "lat"))

Here are the functions for geocoding:
    library(RCurl)
    library(RJSONIO)
    library(dplyr)
    library(gdata)

    # Build the request URL for one address
    construct.geocode.url <- function(address, return.call = "json", sensor = "false") {
      root <- "http://maps.google.com/maps/api/geocode/"
      u <- paste(root, return.call, "?address=", address, "&sensor=", sensor, sep = "")
      return(URLencode(u))
    }

    # Geocode one address; returns c(lat, lng), or c(NA, NA) on failure
    ggeocode <- function(address, verbose = FALSE) {
      if (verbose) cat(address, "\n")
      u <- construct.geocode.url(address)
      doc <- getURL(u)
      x <- fromJSON(doc, simplify = FALSE)
      if (x$status == "OK") {
        lat <- x$results[[1]]$geometry$location$lat
        lng <- x$results[[1]]$geometry$location$lng
        return(c(lat, lng))
      } else {
        return(c(NA, NA))
      }
    }

How can I write a function, using dplyr or another method, that adds three more columns to the output: [address, lon, lat]?
I.e., something like this:

    data.frame <- mutate(d,
        address = convertblockrange(blockrange) + streetname, "houston, tx"),
        lon = geocode(address)[0],
        lat = geocode(address)[1])

This is the blocking point of the question:
    # Function to convert "2200-2299" to the integer 2250, i.e. find the middle of the block.
    library(stringr)
    convertblockrange <- function(blockrange){
      m <- unlist(str_split(d$blockrange, "-"))
      m2 <- mean(c(as.numeric(m[1]),as.numeric(m[2]))) + .5
      m2
    }
You can calculate the mean of the block range by splitting the range and averaging, e.g.

    x <- '5400-5499'
    mean(as.numeric(strsplit(x, '-')[[1]]))
    # 5449.5

To scale up, you can use `separate` from the tidyr package. The cool things about it are that it automagically puts the min/max of the blockrange into new columns and converts their types from string to numeric (`convert = TRUE`). Filter out the "unk" addresses first; you'll have to handle them separately.

    library(dplyr)
    library(tidyr)
    d %>%
        filter(blockrange != "unk") %>%
        # split blockrange into blockmin & blockmax
        separate(blockrange, c("blockmin", "blockmax"), sep = "-",
                 convert = TRUE, remove = FALSE) %>%
        # calculate the average (rounded down) and build the address
        mutate(block = floor((blockmin + blockmax) / 2),
               address = paste(block, streetname))
    #   blockrange blockmin blockmax streetname      date block   address
    # 1  5400-5499     5400     5499       bell 4/28/2015  5449 5449 bell
    # 2  3700-3799     3700     3799       bell 4/11/2015  3749 3749 bell
    # 3  2200-2299     2200     2299       bell 4/26/2015  2249 2249 bell
    # 4  1000-1099     1000     1099       bell  4/9/2015  1049 1049 bell
    # 5  1200-1299     1200     1299       bell  4/9/2015  1249 1249 bell
    # 6  1900-1999     1900     1999       bell 4/26/2015  1949 1949 bell
    # 7    500-599      500      599       bell 4/26/2015   549  549 bell
    # 8  1200-1299     1200     1299       bell 4/17/2015  1249 1249 bell

Then you can `%>% group_by(address)` to get the unique addresses and geocode those (though I'd think about how to restrict the maximum number of requests etc. here).
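As a rough sketch of the "geocode only the unique addresses" idea, something like the following could work in base R. The names `geocode_unique` and `fake_geocoder` are hypothetical and not part of the question; the stand-in geocoder returns made-up coordinates so the example runs offline. In practice you would presumably pass `ggeocode` instead.

```r
# Geocode each distinct address once, then map the results back to every row.
# `geocode_fn` is assumed to take one address and return c(lat, lng).
geocode_unique <- function(addresses, geocode_fn) {
  uniq <- unique(addresses)
  # one call per distinct address; vapply enforces a length-2 numeric result
  coords <- t(vapply(uniq, geocode_fn, numeric(2), USE.NAMES = FALSE))
  idx <- match(addresses, uniq)
  data.frame(lat = coords[idx, 1], lon = coords[idx, 2])
}

# Hypothetical stand-in for ggeocode: fake coordinates, counts its calls
counter <- new.env()
counter$n <- 0
fake_geocoder <- function(address) {
  counter$n <- counter$n + 1
  c(29 + nchar(address) / 100, -95 - nchar(address) / 100)
}

addresses <- c("1249 bell", "549 bell", "1249 bell")
coords <- geocode_unique(addresses, fake_geocoder)
```

Here the repeated "1249 bell" costs only one request, which matters when the geocoding service limits the number of calls.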
With regards to adding the output as lat and lon columns at once, I don't think dplyr can do that yet (see this feature request).
If you want to use dplyr syntax here, your best bet is to change ggeocode to be vectorised, e.g.
    # Takes a one-column data frame of addresses; returns a data frame of lat/lng
    ggeocode2 <- function (addresses) {
        x <- data.frame(t(sapply(addresses[[1]], ggeocode)), row.names = NULL)
        names(x) <- c('lat', 'lng')
        x
    }

    d2 %>% select(address) %>% ggeocode2 %>% bind_cols(d2, .)

But really, I think you should skip the dplyr sugar for this particular step, do a manual loop, and cbind the result, which gives you greater control over request limiting.
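A minimal sketch of such a manual loop, under stated assumptions: `geocode_loop`, the `pause` argument and its 0.2 second default are all hypothetical choices, not anything the geocoding service prescribes, and `fake_geocoder` stands in for `ggeocode` so the example runs without network access.

```r
# Loop over addresses one at a time, pausing between requests.
# `geocode_fn` is assumed to return c(lat, lng) for one address;
# `pause` is the delay in seconds between calls (an arbitrary default).
geocode_loop <- function(addresses, geocode_fn, pause = 0.2) {
  out <- matrix(NA_real_, nrow = length(addresses), ncol = 2,
                dimnames = list(NULL, c("lat", "lon")))
  for (i in seq_along(addresses)) {
    # a failed request leaves NA in that row instead of aborting the loop
    out[i, ] <- tryCatch(geocode_fn(addresses[i]),
                         error = function(e) c(NA_real_, NA_real_))
    if (i < length(addresses)) Sys.sleep(pause)
  }
  out
}

# Hypothetical stand-in geocoder: fixed coordinates, fails on "bad"
fake_geocoder <- function(address) {
  if (address == "bad") stop("request failed")
  c(29.7, -95.3)
}

d2 <- data.frame(address = c("1249 bell", "bad"), stringsAsFactors = FALSE)
d2 <- cbind(d2, geocode_loop(d2$address, fake_geocoder, pause = 0))
```

Because the loop owns the request schedule, it is straightforward to lengthen the pause, stop after a daily quota, or retry failed rows later.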