10 Web Scraping in R

https://learn.datacamp.com/courses/web-scraping-in-r

Main functions and concepts covered in this BP chapter:

  1. read_html()
  2. html_elements()
  3. CSS & Selecting
    • classes and ids
    • combinators
  4. XPATH
  5. tibble()
  6. scraping practices
    • httr
    • user agents

Packages used in this chapter:

## Load all packages used in this chapter
library(tidyverse) #includes dplyr, ggplot2, and other common packages
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(rvest)
## 
## Attaching package: 'rvest'
## 
## The following object is masked from 'package:readr':
## 
##     guess_encoding

10.1 Introduction to HTML and Web Scraping

The structure of an HTML document is like a tree: it is made up of elements, or nodes, that branch off of one another. You have the root <html> element, then its children, then the children of its children, and so on.

This website has information on most of the types of elements you will find in an HTML document.

Use read_html() to read in an HTML document. We can either write a simple one ourselves or read in a real website. For several examples, we use https://www.scrapethissite.com/pages/simple/.
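
Here is a minimal sketch of the "write one ourselves" option: a tiny handwritten HTML string, with xml2::xml_structure() (xml2 is installed alongside rvest) printing the tree of nodes described above.

# A tiny HTML document: html is the root, body its child, p a child of body, and so on
tiny_html <- read_html("<html><body><p>Hello <b>world</b></p></body></html>")
# Print the nesting of nodes
xml2::xml_structure(tiny_html)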

countries <- read_html("https://www.scrapethissite.com/pages/simple/")

In this example, we pull all of the country names.

countriesnames <- countries %>%
  html_elements("h3.country-name") %>%
  html_text() 
head(countriesnames)
## [1] "\n                            \n                            Andorra\n                        "             
## [2] "\n                            \n                            United Arab Emirates\n                        "
## [3] "\n                            \n                            Afghanistan\n                        "         
## [4] "\n                            \n                            Antigua and Barbuda\n                        " 
## [5] "\n                            \n                            Anguilla\n                        "            
## [6] "\n                            \n                            Albania\n                        "
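
The stray newlines and spaces come along with html_text(). A small sketch, assuming rvest 1.0 or newer: html_text2() collapses white space the way a browser would, and trimws() (used later in this chapter) is another option.

countries %>%
  html_elements("h3.country-name") %>%
  html_text2() %>%  # like html_text(), but normalizes white space
  head()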

We can use html_element() and html_children() to pull all of the children from specific nodes.

list_raw_html <- "\n<html>\n  <body>\n    <ol>\n      <li>Learn HTML</li>\n      <li>Learn CSS</li>\n      <li>Learn R</li>\n      <li>Scrape everything!*</li>\n    </ol>\n    <small>*Do it responsibly!</small>\n  </body>\n</html>"
# Read in the corresponding HTML string
list_html <- read_html(list_raw_html)
# Extract the ol node
ol_node <- list_html %>% 
    html_element('ol')
# Extract and print all the children from ol_node
ol_node %>% 
    html_children()
## {xml_nodeset (4)}
## [1] <li>Learn HTML</li>
## [2] <li>Learn CSS</li>
## [3] <li>Learn R</li>
## [4] <li>Scrape everything!*</li>

In the following example, we extract hyperlinks and put them into a clean table.

hyperlink_raw_html <-
  "\n<html>\n  <body>\n    <h3>Helpful links</h3>\n    <ul>\n      <li><a href=\"https://wikipedia.org\">Wikipedia</a></li>\n      <li><a href=\"https://dictionary.com\">Dictionary</a></li>\n      <li><a href=\"https://duckduckgo.com\">Search Engine</a></li>\n    </ul>\n    <small>\n      Compiled with help from <a href=\"https://google.com\">Google</a>.\n    </small>\n  </body>\n</html>"
# Extract all the a nodes from the bulleted list
links <- hyperlink_raw_html %>% 
  read_html() %>%
  html_elements('li a') # 'ul a' is also correct!

# Extract the needed values for the data frame
domain_value <- links %>% html_attr('href')
name_value <- links %>% html_text()

# Construct a data frame
link_df <- tibble(
  domain = domain_value,
  name = name_value
)

link_df
## # A tibble: 3 × 2
##   domain                 name         
##   <chr>                  <chr>        
## 1 https://wikipedia.org  Wikipedia    
## 2 https://dictionary.com Dictionary   
## 3 https://duckduckgo.com Search Engine
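
A small follow-up sketch: once the attributes and text live in a tibble, they can be combined however you like, for example into Markdown-style links (mutate() comes from dplyr, already attached via the tidyverse; the column name is made up).

link_df %>%
  mutate(markdown_link = paste0("[", name, "](", domain, ")"))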

This is how you write an HTML table and turn it into a data frame using html_table().

myfriends_html <- '
  <html>
    <table>
      <tr>
        <th> Name </th>  <!-- headings -->
        <th> Age </th>
      </tr>
      <tr>
        <td> Riley </td> <!-- row 1 -->
        <td> 19 </td>
      </tr>
      <tr>
        <td> Lainie </td> <!-- row 2 -->
        <td> 21 </td>
      </tr>
    </table>
  </html>
'

myfriendshtml <- read_html(myfriends_html)
myfriends <- myfriendshtml %>% 
  html_table() # if the table doesn't already use its first row as a header, you can pass header = TRUE to this function (see the sketch after this output)
myfriends
## [[1]]
## # A tibble: 2 × 2
##   Name     Age
##   <chr>  <int>
## 1 Riley     19
## 2 Lainie    21
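
The comment in the code above mentions the header argument: when a table has no <th> row, you can force html_table() to promote the first row to column names. A small sketch with made-up HTML:

no_header_html <- read_html("<table> <tr><td>Name</td><td>Age</td></tr> <tr><td>Riley</td><td>19</td></tr> </table>")
# Without header = TRUE, the first row would be treated as data
no_header_html %>% html_table(header = TRUE)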

10.3 Advanced Selection with XPATH

XPATH stands for XML Path Language. With this language you formulate a so-called path through the HTML tree, which is a slightly different approach from the one with CSS selectors.

The basic syntax uses axes, which are / and //: a single / expresses a child relationship, while // expresses a general descendant relationship. It also uses steps, which are element types like span or a, and predicates, which are written within square brackets and declare conditions that must hold true for the step that precedes them.
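
A quick sketch of the difference between the two axes, using the countries page read in earlier:

# //span matches span elements anywhere below the document root ...
countries %>% html_elements(xpath = '//span') %>% length()
# ... while /span only matches direct children of the document root, so it finds
# nothing (the only child of the root is the html element)
countries %>% html_elements(xpath = '/span') %>% length()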

10.3.1 Basic Predicates

# The following code pulls all of the span elements, which look to be the capital city, population, and area of the countries
cappoparea <- countries %>%
  html_elements(xpath = '//span') %>%
  html_text()
head(cappoparea)
## [1] "Andorra la Vella" "84000"            "468.0"            "Abu Dhabi"       
## [5] "4975593"          "82880.0"
# The following code pulls all of the span elements that have class = country-capital
capital <- countries %>%
  html_elements(xpath = '//span[@class = "country-capital"]') %>%
  html_text()
head(capital)
## [1] "Andorra la Vella" "Abu Dhabi"        "Kabul"            "St. John's"      
## [5] "The Valley"       "Tirana"
# The following code selects all h3 elements that are children of class = col-md-4 country

countryname <- countries %>%
  html_elements(xpath = '//*[@class = "col-md-4 country"]/h3') %>%
  html_text() %>%
  trimws()

head(countryname)
## [1] "Andorra"              "United Arab Emirates" "Afghanistan"         
## [4] "Antigua and Barbuda"  "Anguilla"             "Albania"
# Selects all span elements with class country-area that are children of elements with class country-info
area <- countries %>%
  html_elements(xpath = '//*[@class = "country-info"]/span[@class = "country-area"]') %>%
  html_text() %>%
  trimws() %>%
  as.numeric()

head(area)
## [1]    468  82880 647500    443    102  28748

With XPATH, something that’s not possible with CSS can be done: selecting elements based on the properties of their descendants.

There are no good examples of this using the countries data, but the basic syntax is found below.

# Select all divs with p descendants having the "third" class
#weather_html %>% 
  #html_elements(xpath = '//div[p[@class = "third"]]')
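
Since the countries page has no markup that fits, here is a self-contained sketch with made-up HTML showing the same idea:

descendant_demo <- read_html('
  <div> <p class="third">I have the right child, keep my div</p> </div>
  <div> <p class="first">I do not, skip my div</p> </div>')
# Only the first div is selected, because only it has a p child with class "third"
descendant_demo %>%
  html_elements(xpath = '//div[p[@class = "third"]]')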

10.3.2 Advanced Predicates

# Select the value of the second span found in each div with class = country-info

pop <- countries %>%
  html_elements(xpath = '//div[@class = "country-info"]/span[position() = 2]') %>%
  html_text() %>%
  trimws() %>%
  as.numeric()

head(pop)
## [1]    84000  4975593 29121286    86754    13254  2986952
# Select the value of every span except the second in each div with class = country-info, i.e. the capital and the area

caparea <- countries %>%
  html_elements(xpath = '//div[@class = "country-info"]/span[position() != 2]') %>%
  html_text() %>%
  trimws() 
head(caparea)
## [1] "Andorra la Vella" "468.0"            "Abu Dhabi"        "82880.0"         
## [5] "Kabul"            "647500.0"
# Select the value of each span found after the 1st position in each div with class = country-info

poparea <- countries %>%
  html_elements(xpath = '//div[@class = "country-info"]/span[position() >= 2]') %>%
  html_text() %>%
  trimws() 

head(poparea)
## [1] "84000"    "468.0"    "4975593"  "82880.0"  "29121286" "647500.0"

You can also use XPATH's count() function to select elements that contain a certain number of other elements. Again, there isn't really an applicable example in the countries data, but this would be the syntax.

# Select only divs with one header and at least two paragraphs
# forecast_html %>%
#   html_elements(xpath = '//div[count(h2) = 1 and count(p) > 1]')
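
Here too, a self-contained sketch with made-up HTML makes the syntax concrete:

count_demo <- read_html('
  <div> <h2>Monday</h2> <p>Sunny.</p> <p>Warm.</p> </div>
  <div> <h2>Tuesday</h2> <p>Rainy.</p> </div>')
# Only the first div has exactly one h2 and more than one p
count_demo %>%
  html_elements(xpath = '//div[count(h2) = 1 and count(p) > 1]')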

10.3.3 text()

Sometimes, you only want to select text that's a direct descendant of a parent element. XPATH allows us to do that.
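
A small sketch of that idea with made-up HTML: selecting p/text() returns only the text nodes that sit directly inside the p, not the text nested in its b child.

mixed_text_html <- read_html("<p>Outer text <b>inner text</b> more outer text</p>")
mixed_text_html %>%
  html_elements(xpath = '//p/text()') %>%
  html_text()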

A quick side note about using XPATH to extract a table:

# Extract the data frame from the table using a known function from rvest
friends <- myfriendshtml %>% 
  html_element(xpath = "//table") %>% 
  html_table()
# Print the contents of the friends data frame
friends
## # A tibble: 2 × 2
##   Name     Age
##   <chr>  <int>
## 1 Riley     19
## 2 Lainie    21
# Extract only the text from the h3 elements with the country-name class (here via a CSS selector)
name <- countries %>%
  html_elements(".country-name") %>%
  html_text() %>%
  trimws() %>%
  as.character()
head(name)
## [1] "Andorra"              "United Arab Emirates" "Afghanistan"         
## [4] "Antigua and Barbuda"  "Anguilla"             "Albania"

Putting everything into a tibble:

countrydata <- tibble(name, pop, capital, area)

The text() function also allows you to select elements (and their parents) based on their text.

programming_html <-
 ' <h3>The rules of programming</h3>
<ol>
  <li>Have <em>fun</em>.</li>
  <li><strong>Do not</strong> repeat yourself.</li>
  <li>Think <em>twice</em> when naming variables.</li>
</ol> '

programming_html <- read_html(programming_html)

# Select all li elements
programming_html %>%
    html_elements(xpath = '//li') %>%
    # Select all em elements within li elements that have "twice" as text
    html_elements(xpath = 'em[text() = "twice"]') %>%
    # Wander up the tree to select the parent of the em 
    html_elements(xpath = '..')
## {xml_nodeset (1)}
## [1] <li>Think <em>twice</em> when naming variables.</li>

10.4 Scraping Best Practices

HTTP stands for Hypertext Transfer Protocol and is a relatively simple set of rules that dictate how modern web browsers, or clients, communicate with a web server. A web document or website that contains multiple assets like text, images, and videos fetches all of these resources via so-called GET requests from one or more servers.

A request is often composed only of a so-called method (in this case GET), a protocol version, and several so-called headers. The most important header is probably the host, the address of the resource that is to be fetched. In turn, the response from the web server tells the client whether the request was successful, which is denoted by the status code and status message. The response headers also tell the client, your browser, how to deal with that response; a helpful one is Content-Type, which tells the browser which format of content was returned. In this case, it's plain HTML text that can now be rendered in the browser. Here are some typical status codes:

  • 200 means everything went well.
  • Codes in the 300 range are so-called redirects, telling you to fetch the resource at a different address.
  • 404 indicates that the resource was not found on the web server.
  • Codes in the 500 range usually mean there was an error on the server, for instance a program that crashed because of the request.

The most common request methods, or at least those that will become relevant when you scrape a page, are GET and POST. GET is used whenever a resource, be it an HTML page or a mere image, is to be fetched without submitting any user data. POST, on the other hand, is used when you need to submit data to a web server, most often because the user filled out a form. A POST request has a payload, which follows the headers: for example, a couple of key-value pairs, such as form fields that were filled with value1 and value2, respectively. Of course, POST requests also result in a response from the server.
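
The chapter's examples only need GET, but a minimal sketch of a POST request with httr looks like this (https://httpbin.org/post is a test endpoint that simply echoes back what it receives; the field names are made up):

library(httr)
# Send two key-value pairs as the request payload
response <- POST("https://httpbin.org/post",
                 body = list(field1 = "value1", field2 = "value2"))
status_code(response)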

With the httr library, you can easily create and send HTTP requests from your R session. With the aptly named GET() function, for example, you can send a GET request. When doing so in the console, the response from the server is printed as soon as it is received. If the request was successful, you will see a 200 status code, and after the response headers the actual content of the website is listed. In this case, it's HTML text.

If you want, you can extract the actual content from the response and parse it directly into an HTML document that rvest can work with. It's really straightforward: just use the content() function on the response returned by GET(). The result is the already familiar HTML document.

10.4.1 httr

Here is an alternative way to read in an HTML document using httr.

library(httr)
countries <- GET("https://www.scrapethissite.com/pages/simple/")
countries1 <- content(countries)

Check the status of the response object:

status_code(countries)
## [1] 200

200 is good!
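
httr also has small helpers if you prefer a readable check over the raw code:

# TRUE when the status code signals a client or server error (400 and above)
http_error(countries)
# A short human-readable summary of the status
http_status(countries)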

10.4.2 User Agent

Sometimes it is good to add a user agent to tell the server who you are.

# Pass a custom user agent to a GET query to the mentioned URL
response <- GET('https://httpbin.org/user-agent', user_agent("A request from a DataCamp course on scraping"))
# Print the response content
content(response)
## No encoding supplied: defaulting to UTF-8.
## {html_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\r\n<center><h1>502 Bad Gateway</h1></center>\r\n</body>

Use set_config() to make a custom user agent globally available across all future requests.

# Globally set the user agent to "A request from a DataCamp course on scraping"
set_config(add_headers(`User-Agent` = "A request from a DataCamp course on scraping"))
# Pass a custom user agent to a GET query to the mentioned URL
response <- GET('https://httpbin.org/user-agent')
# Print the response content
content(response)
## $`user-agent`
## [1] "A request from a DataCamp course on scraping"
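
If you want to drop that global setting again later in the session, httr provides reset_config():

# Remove all global configuration previously set with set_config()
reset_config()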

10.4.3 Throttling

When you make multiple requests one after another, it can put a lot of load on a smaller website. To be polite, you can throttle your requests, i.e., slow them down.

fruit_pages <- c("https://en.wikipedia.org/wiki/Pear",
                 "https://en.wikipedia.org/wiki/Apple",
                 "https://en.wikipedia.org/wiki/Orange_(fruit)")
library(purrr)
# Define a throttled read_html() function with a delay of 1s
read_html_delayed <- slowly(read_html, 
                            rate = rate_delay(1))
# Construct a loop that goes over all page urls
for(page_url in fruit_pages){
  # Read in the html of each URL with the function defined above
  html <- read_html_delayed(page_url)
}

Now you can extract what you want from each page's HTML using html_elements().
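
Note that, as written, the loop overwrites html on every pass, so only the last page is still around afterwards. A small sketch of a variant that keeps every page, reusing the same throttled reader with purrr::map():

# Read all pages, still one second apart, and keep the results in a list
pages <- map(fruit_pages, read_html_delayed)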