Scrape website data with the new R package rvest (+ a postscript on interacting with web pages with RSelenium)

Copying tables or lists from a website by hand is not only painful and dull, it's also error-prone and not easily reproducible. Thankfully there are packages in Python and R to automate the process. In a previous post we described using Python's Beautiful Soup to extract information from web pages. In this post we take advantage of a new R package called rvest to extract addresses from an online list. We then use ggmap to geocode those addresses and create a Leaflet map with the leaflet package. In the interest of coding local, our example uses data on wineries and breweries here in the Finger Lakes region of New York.

0) Load the libraries

In this example we will take advantage of several nice packages, most of which are available on R's main website (CRAN). The one exception is the leaflet package that you'll need to install from GitHub. Instructions are here. Note that this is the leaflet package, not the leafletR package which we highlighted previously.

If you want a little background on dplyr you can read this post and we have some details on ggmap here.

#devtools::install_github("rstudio/leaflet")
library(dplyr)
library(rvest)
library(ggmap)
library(leaflet)
library(RColorBrewer)

1) Parse the entire website

The Visit Ithaca website has a nice list of wineries and breweries from which we can extract addresses.

With rvest, the first step is simply to parse the entire web page, and this can be done easily with the html function.

# URL for the Visit Ithaca website, wineries page
url<-html("http://www.visitithaca.com/attractions/wineries.html")
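Note that in more recent versions of rvest the html function has been deprecated in favor of read_html; if you see a deprecation warning, the equivalent call is:

# Newer versions of rvest use read_html() instead of html()
url<-read_html("http://www.visitithaca.com/attractions/wineries.html")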

2) Find and extract the pieces of the website you need

Probably the single biggest challenge when extracting data from a website is determining which pieces of the HTML code you want to extract. A web page tends to be a convoluted set of nested objects (together, they are known as the Document Object Model, or DOM for short) and you need to identify which part of the DOM you need.

In order to do this, you will need to examine the web page's guts using your browser's developer tools. To open the developer tools in Chrome or Firefox, press F12 (Cmd + Opt + I on a Mac); in Safari, use Cmd + Opt + I. From this point forward I'll be using Chrome. Note that the author of the package, Hadley Wickham, recommends using selectorgadget.com, a Chrome extension, to help identify the web page elements you need. And he recommends this page for learning more about selectors.

Note that to follow along, you may want to browse to the wineries page that the example uses.

When you press F12 in Chrome you'll see something like what's below. Pay particular attention to the element selector, which is circled in red, and make sure that you're looking at the Elements tab.

[Screenshot: Chrome developer tools with the element selector circled in red]

Just by looking at the page you can guess that the winery names are a different element in the DOM than the addresses (they have a different location, different font, etc. on the page). Since the names and addresses are slightly separated, we will extract the set of names separately from the set of addresses, starting with the names.

Extract the names

To pick out the names, scroll down to the list of wineries and use the element selector in the developer tools to click on one of the winery names on the main page. When you inspect this element you'll see that the winery name is embedded in a hyperlink tag (<a href...) and it has a class of indSearchListingTitle. NOTE: the selector has changed since we originally published this post. Since the names are the only elements on the page with this class, we can use a simple selector based on class alone to extract the names.

[Screenshot: inspecting a winery name, showing the <a> tag with class indSearchListingTitle]

The function html_nodes pulls the matching nodes out of the DOM and the function html_text then extracts just the text within each node. Note the use of the pipe %>%, which essentially passes the result of html_nodes on to html_text. For more information on pipes you can read more here.

# Pull out the names of the wineries and breweries
# Find the name of the selector (which has
# changed since the tutorial was first written)
selector_name<-".indSearchListingTitle"

fnames<-html_nodes(x = url, css = selector_name) %>%
  html_text()

head(fnames)
## [1] "Varick Winery & Vineyard"                       
## [2] "Ithaca Beer Company"                            
## [3] "Bellwether Hard Cider / Bellwether Wine Cellars"
## [4] "Lakeshore Winery"                               
## [5] "Knapp Winery & Vineyard Restaurant"             
## [6] "Goose Watch Winery"

Extract the addresses

The address is a little trickier. When this post was first written, the element selector showed that all the right-hand material sat inside a container (a <div>) with the class results_summary_item_details. Using that selector alone, though, would have returned everything, including the description, phone number and so on, which is more than we want. Looking more closely, the winery-specific material was laid out in a three-column table (the first column held the image, the second was blank space and the third held the address and other info), so we added a second part to the selector to grab the third column and, finally, limited it to the first piece between the first set of <strong> tags. Altogether:

  • In words: select the material in the sections with a class of results_summary_item_details, but limit it to the third column and, more specifically, to the material between the first set of strong tags in that column

  • In CSS selector code: .results_summary_item_details td:nth-child(3) strong:first-child

The site has since been redesigned, so the code below uses an updated selector that applies the same logic to the current layout: drill into each listing's meta container and pull out the wrapper element that holds the address.


# Pull out the addresses of the wineries and breweries
selector_address<-".indSearchListingMetaContainer div:nth-child(2) div:nth-child(1) .indMetaInfoWrapper"

faddress<-html_nodes(url, selector_address) %>%
    html_text()

head(faddress)
## [1] "on the Cayuga Lake Wine Trail5102 Rt. 89, Romulus, NY 14541"                         
## [2] "122 Ithaca Beer Drive Ithaca, NY 14850"                                              
## [3] "on the Cayuga Lake Wine Trail9070 Rt. 89, Trumansburg, NY 14886"                     
## [4] "5132 Rt. 89, Romulus, NY 14541"                                                      
## [5] "on the Cayuga Lake Wine Trail2770 County Rd. 128, (Ernsberger Rd.) Romulus, NY 14541"
## [6] "on the Cayuga Lake Wine Trail5480 Rt. 89, Romulus, NY 14541"

Minor cleanup

The HTML on this site is a little inconsistent: sometimes the street address comes first and sometimes it's preceded by something like "on the Cayuga Lake Wine Trail", so let's do a little final cleanup.

# note: the comma/space after the trail name is optional since the formatting varies
to_remove<-paste(c("\n", "^\\s+|\\s+$", "on the Cayuga Lake Wine Trail,? ?",
"Cayuga Lake Wine Trail,? ?", "on the Cayuga Wine Trail,? ?", 
"on the Finger Lakes Beer Trail,? ?"), collapse="|")

faddress<-gsub(to_remove, "", faddress)

head(faddress)
## [1] "5102 Rt. 89, Romulus, NY 14541"                         
## [2] "122 Ithaca Beer Drive Ithaca, NY 14850"                 
## [3] "9070 Rt. 89, Trumansburg, NY 14886"                     
## [4] "5132 Rt. 89, Romulus, NY 14541"                         
## [5] "2770 County Rd. 128, (Ernsberger Rd.) Romulus, NY 14541"
## [6] "5480 Rt. 89, Romulus, NY 14541"

Now we have the names and addresses and we're ready for geocoding.

3) Use ggmap to geocode the addresses

Geocode

The package ggmap has a nice geocode function that we'll use to extract coordinates. For more detail, check out this post.

# Using output="latlona" includes the address that was geocoded
geocodes<-geocode(faddress, output="latlona")

head(geocodes)
##         lon      lat                                      address
## 1 -76.77388 42.78128     5102 new york 89, romulus, ny 14541, usa
## 2 -76.53506 42.41655 122 ithaca beer drive, ithaca, ny 14850, usa
## 3 -76.67176 42.58552  9070 new york 89, interlaken, ny 14847, usa
## 4 -76.77593 42.77588     5132 new york 89, romulus, ny 14541, usa
## 5 -76.79253 42.76668      ernsberger road, romulus, ny 14541, usa
## 6 -76.77504 42.75862     5480 new york 89, romulus, ny 14541, usa
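The geocoder can occasionally fail to resolve an address, in which case it returns NA for the coordinates, so it's worth a quick check before mapping:

# Identify any addresses that failed to geocode
which(is.na(geocodes$lon) | is.na(geocodes$lat))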

[Note 5/26/2015] A commenter (see below) asks whether this approach is allowed under Google's Terms of Service, since the results are not being displayed on a Google Map. I agree that this is questionable, so you may want to use an alternative. One good alternative is Yahoo PlaceFinder, which does not appear to put a similar restriction on geocode results. Luckily, there is a mini R package for this created by Jeff Allen. You would install the package from GitHub, and you will need to get a key and secret code from Yahoo (the help on the package page will guide you); code for doing the geocoding is below.

I did experiment with more open-source geocoders, like the function in this Stack Overflow discussion that uses MapQuest, but I found that the results were not good. My test case was the Varick Winery address (5102 NY-89, Romulus, NY 14541), which Google can find but many other geocoders cannot; Yahoo does find it. For geocoder comparisons, see this spreadsheet, which is referred to in this GIS StackExchange post.

# code using Yahoo instead of Google
devtools::install_github("trestletech/rydn")
library(rydn)

addresses<-lapply(faddress, function(x){
  tmp<-find_place(x, commercial=FALSE, 
             key="GET A KEY",
             secret="GET SECRET CODE")
  # this is the lazy way to do this: in a few cases more than one result is
  # returned and I have not looked into the details, so I'm keeping the first
  tmp[1, c("quality", "latitude", "longitude", "radius")]
})
geocodes<-do.call("rbind", addresses) %>%
  rename(lat=latitude, lon=longitude) %>%
  mutate(address=faddress)

Cleanup

The results come back as lower case, so we will convert them to proper case. We also want to categorize each entry as a winery vs. brewery/cidery/other and do some additional minor cleaning. We're taking advantage of the dplyr package to help with the cleaning (mutate, filter and select are all verbs in this package).

# FUNCTION from help for chartr 
capwords<-function(s, strict = FALSE) {
  cap<-function(s) paste(toupper(substring(s, 1, 1)),
        {s<-substring(s, 2); if(strict) tolower(s) else s},
        sep = "", collapse = " " )
    sapply(strsplit(s, split = " "),
        cap, USE.NAMES = !is.null(names(s)))
}
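# For example, capwords("5102 new york 89, romulus, ny 14541, usa")
# returns "5102 New York 89, Romulus, Ny 14541, Usa"; the
# gsub("Ny", "NY", ...) step below then fixes the state abbreviation.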
# ---------------------------------------


full<-mutate(geocodes, name=fnames) %>%
  mutate(category=ifelse(grepl("Winery", name), 1, 2)) %>%
  mutate(addressUse=gsub("Ny", "NY", capwords(gsub(", usa", "", address)))) %>%
  mutate(street=sapply(strsplit(addressUse, ","), "[[", 1)) %>%
  mutate(city=sapply(strsplit(addressUse, ","), "[[", 2)) %>%
  filter(!grepl('Interlaken|Ithaca|Aurora|Seneca Falls', street)) %>%
  select(name, street, city, category, lat, lon)

head(full)
##                                              name           street
## 1                        Varick Winery & Vineyard 5102 New York 89
## 2 Bellwether Hard Cider / Bellwether Wine Cellars 9070 New York 89
## 3                                Lakeshore Winery 5132 New York 89
## 4              Knapp Winery & Vineyard Restaurant  Ernsberger Road
## 5                              Goose Watch Winery 5480 New York 89
## 6                     Cayuga Ridge Estates Winery 6800 New York 89
##          city category      lat       lon
## 1     Romulus        1 42.78128 -76.77388
## 2  Interlaken        2 42.58552 -76.67176
## 3     Romulus        1 42.77588 -76.77593
## 4     Romulus        1 42.76668 -76.79253
## 5     Romulus        1 42.75862 -76.77504
## 6        Ovid        1 42.69610 -76.74304

4) Map the data!

We're making use of RStudio's leaflet package to create the map.


# Assign colors for our 3 categories
cols<-colorFactor(c("#3F9A82", "#A1CD4D", "#2D4565"), domain = full$category)


# Create the popup information with inline css styling
popInfo<-paste("<h4 style='border-bottom: thin dotted #43464C;
    padding-bottom:4px; margin-bottom:4px;
    font-family: Tahoma, Geneva, sans-serif;
    color:#43464C;'>", full$name, "</h4>
    <span style='color:#9197A6;'>", full$street, "<br>",
    paste(full$city, ", NY", sep=""), "</span>", sep="")


# Create the final map color-coded by type!
leaflet(data=full, height="650px", width="100%") %>%
    addCircles(lat = ~ lat, lng = ~ lon, color = ~cols(category), weight=2, opacity=1,
        fillOpacity=0.6, radius=500, popup = popInfo) %>%
    addTiles("http://{s}.basemaps.cartocdn.com/light_all/{z}/{x}/{y}.png") %>%
    setView(-76.63, 42.685, zoom=10) %>% addLegend(
  position = 'bottomright',
  colors = cols(1:2),
  labels = c("Winery", "Brewery, Cidery, Other"), opacity = 1
)

And there we have it!
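If you want to share the map outside of RStudio, one option is to assign the leaflet pipeline above to an object (called m here, just for illustration) and save it as a standalone HTML file with the htmlwidgets package:

# Optional: save the map to a self-contained HTML file
# (assumes the leaflet pipeline above was assigned to an object named m)
library(htmlwidgets)
saveWidget(m, "finger_lakes_map.html", selfcontained = TRUE)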

* Last updated September 21, 2016 to address changes in the wine trail website we use in the example. Thanks to Eric Hopkins for alerting us to the issue and providing the updated code.

Postscript — sites with forms and buttons

Not all the sites you work with will have a clean list or table available on a single page. There are a variety of options for dealing with pages that have buttons, pull-down menus and related elements. One particularly useful package in this setting (and one that we're using on a project right now) is RSelenium. We plan to add a blog post about RSelenium, but for the time being here is a sneak peek at preliminary code, with no significant details except to say that RSelenium can be used to mimic actual browser use.

The code below opens a browser, navigates to a web page, clicks the "search" button, scrapes a table of data and then clicks through to the next page.

# Sneak preview of code for interacting with a web page with RSelenium
# a proper blog post with explanation will follow.

library(RSelenium)
library(XML) # needed below for htmlParse() and readHTMLTable()
# make sure you have the server
checkForServer()

# use default server 
startServer()
remDr<-remoteDriver$new()


# send request to server
url<-"https://programs.iowadnr.gov/animalfeedingoperations/FacilitySearch.aspx?Page=0"
remDr$open(silent = TRUE) #opens a browser
remDr$navigate(url)


# identify search button and click
searchID<-'//*[@id="ctl00_foPageContent_SearchButton"]'
webElem<-remDr$findElement(value = searchID)
webElem$clickElement()

# identify the table
tableID<-'//*[@id="ctl00_foPageContent_Panel1"]/div[2]/table'
webElem<-remDr$findElement(value = tableID)

doc<-htmlParse(remDr$getPageSource()[[1]])

tabledat<-readHTMLTable(doc)[[17]]
# strip stray non-ASCII characters (encoding artifacts such as non-breaking spaces)
tabledat[,]<-lapply(tabledat[,],
    function(x) gsub("[^\x20-\x7E]", "", as.character(x)))
# drop the last row and the first column
tabledat<-tabledat[-nrow(tabledat),-1]
# go to next page
nextID<-'//*[@id="ctl00_foPageContent_FacilitySearchRepeater_ctl11_PagePlus1"]'
webElem<-remDr$findElement(value = nextID)
webElem$clickElement()
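When you're finished, it's good practice to close the browser session:

# Close the browser when done
remDr$close()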

17 responses

  1. Great tutorial once again! Thanks Zev!

    I used your tutorial for one test case. I got an error when trying to extract the table from the website. R returned the error: “html_tag(x) == “table” is not TRUE”. Do you know of any workaround when the table doesn’t have table tags? Thanks!

    My code:

    library(rvest)
    # Webpage URL
    theurl <- "http://www.usbr.gov/pn-bin/yak/webarccsv.pl?station=AMRW&year=2015&month=5&day=1&year=2015&month=5&day=13&pcode=AF&pcode=QD&pcode=QU&quot;
    # Get data
    url <- html(theurl)
    data <- html_nodes(url, "p+ pre")
    # Extract table data only (DATE, AF, QD, & QU columns)
    table_data <- html_table(data)

    • Even though the data you’re trying to get looks like a table, it’s not an HTML table (the HTML would need table, tr, td tags, etc.). Change html_table to html_text and then parse the text into a table using something like unlist(strsplit(table_data, "\r\n")) followed by another strsplit. There’s probably a better way to parse it into a table.

  2. What a great tutorial! I am new to GIS and the more functionality I see with R, the more excited I get.
    I do have a question regarding the use of ggmap and specifically the geocode function. From what I have read geocode uses the Google API which has certain restrictions on what you can use it for. I am getting the impression that you can only use the data with a corresponding Google map and that there are limitations in storing of the data.
    From the Google terms of service
    (b) No Pre-Fetching, Caching, or Storage of Content.
    (h) No Use of Content without a Google Map
    It seems that this would disallow the use of this service in an academic or health services research setting (where there might not be a publishable map on the internet and storage of the geocodes is necessary for reproducibility). Is there a way around this or would public health researchers have to use a different service?

    • Thanks for pointing this out and good point. I updated the post above and included code for geocoding with Yahoo!

  3. This is a great package and I am able to pull the data as advertised, but I’m trying to extract an attribute from the div.

    i.e.

    any clues how to get just the id ? thx

  4. Thanks a lot for the tutorial, it’s really useful and understandable even for newbies 🙂
    Two quick questions:

    – How can I delay this command a bit?
    tableID<-'//*[@id="ctl00_foPageContent_Panel1"]/div[2]/table'
    webElem<-remDr$findElement(value = tableID)

    If the server is not quick enough to load the page after clicking search, the command will fail because the table is not there yet. I am using Sys.sleep(10), which is, however, arbitrary.

    – What do the numbers in the double square brackets mean? I guess 17 stands for table 17; what about the 1?
    doc<-htmlParse(remDr$getPageSource()[[1]])
    tabledat<-readHTMLTable(doc)[[17]]

    Thanks!

    • I don’t know of a way other than Sys.sleep() to get around this; it’s what we have used. In the code you mention, the [[1]] pulls the page source out of the list returned by getPageSource(). And yes, the [[17]] refers to the table number.

  5. Hi, thank you very much for this well written aid. I have been using rvest for a project but now understand more about it.
    Can you use rvest and RSelenium in the same code? What would that look like?
    I.e. I have code that is successfully using rvest to scrape TripAdvisor reviews for a worldwide study on ecosystem use. However, I am only able to scrape a partial review, as it stops where the ‘more’ button is inserted to truncate the review. I have been told I have to interact with JavaScript (via RSelenium) to pull up the entire review. Is this a simple addition to my current rvest-based code? Do you know anyone who could advise me on the specifics, as I am moderately new to this?

    • You can definitely use both but they do different things in different ways. Your best bet would be to carve out a very specific, reproducible example of what you want to do and post it to Stack Overflow; you can then send me the link (zev@3.85.142.75).

  6. First of all, great intro.
    One thing I noticed is that the “table structure” for the addresses is no longer uniform among wineries. Dills Run Winery has no associated image, so the selector is reaching for the address in a non-existent spot.

    Do you have an easy remedy for persnickety layouts like this?

  7. I have used the above steps to read a field in an HTML page but it is not working.
    Page URL: http://www.msn.com/en-in/entertainment/tv/highly-educated-tv-stars/ss-AAiMCyB?li=AAgfYGb

    When I open the page URL, press Ctrl+U and search for “pagetotalCount”, it is present inside a tag. I have used html_node to read the source but am not able to get the result.

    library(rvest)
    library(RCurl)
    library(XML)
    library(magrittr)
    library(rjson)

    doc<- read_html("http://www.msn.com/en-in/entertainment/tv/highly-educated-tv-stars/ss-AAiMCyB?li=AAgfYGb")
    selector_name<- ".slide-metadata"
    yyy<- html_nodes(doc, selector_name) %>% html_text()

    I am getting an empty output.

    Could you please have a look into it.
