10 Web Scraping in R
https://learn.datacamp.com/courses/web-scraping-in-r
Main functions and concepts covered in this BP chapter:
read_html()
html_elements()
- CSS & Selecting
- classes and ids
- combinators
- XPATH
tibble()
- scraping practices
- httr
- user agents
Packages used in this chapter:
## Load all packages used in this chapter
library(tidyverse) #includes dplyr, ggplot2, and other common packages
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(rvest)
##
## Attaching package: 'rvest'
##
## The following object is masked from 'package:readr':
##
## guess_encoding
10.1 Introduction to HTML and Web Scraping
The structure of an HTML document is like a tree: multiple elements, or nodes, branch off of one another. You have your base <html>
element, then its children, the children of its children, and so on.
This website has information on most of the types of elements you will find in an HTML document.
Use read_html()
to read in an HTML document. We can either create a simple one ourselves or read one from a website. For some examples, we use a website called https://www.scrapethissite.com/pages/simple/.
countries <- read_html("https://www.scrapethissite.com/pages/simple/")
In this example, we pull all of the country names.
countriesnames <- countries %>%
  html_elements("h3.country-name") %>%
  html_text()
head(countriesnames)
## [1] "\n \n Andorra\n "
## [2] "\n \n United Arab Emirates\n "
## [3] "\n \n Afghanistan\n "
## [4] "\n \n Antigua and Barbuda\n "
## [5] "\n \n Anguilla\n "
## [6] "\n \n Albania\n "
We can use html_element()
and html_children()
to pull all of the children from specific nodes.
list_raw_html <-
  "\n<html>\n <body>\n <ol>\n <li>Learn HTML</li>\n <li>Learn CSS</li>\n <li>Learn R</li>\n <li>Scrape everything!*</li>\n </ol>\n <small>*Do it responsibly!</small>\n </body>\n</html>"
# Read in the corresponding HTML string
list_html <- read_html(list_raw_html)

# Extract the ol node
ol_node <- list_html %>%
  html_element('ol')

# Extract and print all the children from ol_node
ol_node %>%
  html_children()
## {xml_nodeset (4)}
## [1] <li>Learn HTML</li>
## [2] <li>Learn CSS</li>
## [3] <li>Learn R</li>
## [4] <li>Scrape everything!*</li>
In the following example, we can put hyperlinks into a clean table.
hyperlink_raw_html <-
  "\n<html>\n <body>\n <h3>Helpful links</h3>\n <ul>\n <li><a href=\"https://wikipedia.org\">Wikipedia</a></li>\n <li><a href=\"https://dictionary.com\">Dictionary</a></li>\n <li><a href=\"https://duckduckgo.com\">Search Engine</a></li>\n </ul>\n <small>\n Compiled with help from <a href=\"https://google.com\">Google</a>.\n </small>\n </body>\n</html>"

# Extract all the a nodes from the bulleted list
links <- hyperlink_raw_html %>%
  read_html() %>%
  html_elements('li a') # 'ul a' is also correct!

# Extract the needed values for the data frame
domain_value = links %>% html_attr('href')
name_value = links %>% html_text()

# Construct a data frame
link_df <- tibble(
  domain = domain_value,
  name = name_value
)
link_df
## # A tibble: 3 × 2
## domain name
## <chr> <chr>
## 1 https://wikipedia.org Wikipedia
## 2 https://dictionary.com Dictionary
## 3 https://duckduckgo.com Search Engine
This is how you construct an HTML table and turn it into a tibble using html_table()
.
myfriends_html <- '
<html>
<table>
  <tr>
    <th> Name </th> <!-- headings -->
    <th> Age </th>
  </tr>
  <tr>
    <td> Riley </td> <!-- row 1 -->
    <td> 19 </td>
  </tr>
  <tr>
    <td> Lainie </td> <!-- row 2 -->
    <td> 21 </td>
  </tr>
</table>
</html>
'

myfriendshtml <- read_html(myfriends_html)
myfriends <- myfriendshtml %>%
  html_table() # if the table doesn't already have the first row as a header, you can use header = TRUE inside this function
myfriends
## [[1]]
## # A tibble: 2 × 2
## Name Age
## <chr> <int>
## 1 Riley 19
## 2 Lainie 21
10.3 Advanced Selection with XPATH
XPATH stands for XML Path Language. With this language, a so-called path through an HTML tree can be formulated, which is a slightly different approach from the one using CSS selectors.
The basic syntax uses axes, which are /
and //
. A single /
denotes a child relationship, while //
denotes a general descendant relationship. It also uses steps, which are HTML element types like span or a, and predicates, which are specified within square brackets and declare conditions that must hold true for the type that precedes them.
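For a quick, hedged illustration of the two axes (the toy HTML string below is not from the course), compare a child step with a descendant step:
# A made-up HTML string used only to contrast the two axes
axes_html <- read_html('<html><body><div><p>direct child</p><span><p>nested deeper</p></span></div></body></html>')

# Child step: p elements that are direct children of a div (matches only the first p)
axes_html %>% html_elements(xpath = '//div/p')

# Descendant step: p elements anywhere below a div (matches both p elements)
axes_html %>% html_elements(xpath = '//div//p')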
10.3.1 Basic Predicates
# The following code pulls all of the span elements, which look to be the capital city, population, and area of the countries
cappoparea <- countries %>%
  html_elements(xpath = '//span') %>%
  html_text()

head(cappoparea)
## [1] "Andorra la Vella" "84000" "468.0" "Abu Dhabi"
## [5] "4975593" "82880.0"
# The following code pulls all of the span elements, which have class = country-capital
capital <- countries %>%
  html_elements(xpath = '//span[@class = "country-capital"]') %>%
  html_text()

head(capital)
## [1] "Andorra la Vella" "Abu Dhabi" "Kabul" "St. John's"
## [5] "The Valley" "Tirana"
# The following code selects all h3 elements that are children of class = col-md-4 country
countryname <- countries %>%
  html_elements(xpath = '//*[@class = "col-md-4 country"]/h3') %>%
  html_text() %>%
  trimws()
head(countryname)
## [1] "Andorra" "United Arab Emirates" "Afghanistan"
## [4] "Antigua and Barbuda" "Anguilla" "Albania"
# Selects all span elements with class country-area that are children of elements with class country-info
area <- countries %>%
  html_elements(xpath = '//*[@class = "country-info"]/span[@class = "country-area"]') %>%
  html_text() %>%
  trimws() %>%
  as.numeric()
head(area)
## [1] 468 82880 647500 443 102 28748
With XPATH, something that’s not possible with CSS can be done: selecting elements based on the properties of their descendants.
There are no good examples of this using the countries data, but the basic syntax is found below.
# Select all divs with p descendants having the "third" class
#weather_html %>%
#html_elements(xpath = '//div[p[@class = "third"]]')
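Since weather_html isn't available in these notes, here is a minimal runnable sketch of the same idea against a made-up HTML string:
# A made-up HTML string (not from the course) to illustrate descendant-based predicates
predicate_demo <- read_html('<html><body>
  <div><p class="first">skip me</p></div>
  <div><p class="third">keep my parent div</p></div>
</body></html>')

# Select only the div whose p child has the "third" class
predicate_demo %>%
  html_elements(xpath = '//div[p[@class = "third"]]')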
10.3.2 Advanced Predicates
# Select the value of each second span found in each div with class = country-info
pop <- countries %>%
  html_elements(xpath = '//div[@class = "country-info"]/span[position() = 2]') %>%
  html_text() %>%
  trimws() %>%
  as.numeric()
head(pop)
## [1] 84000 4975593 29121286 86754 13254 2986952
# Select the value of each span except the second found in each div with class = country-info, which is the capital and the area
caparea <- countries %>%
  html_elements(xpath = '//div[@class = "country-info"]/span[position() != 2]') %>%
  html_text() %>%
  trimws()
head(caparea)
## [1] "Andorra la Vella" "468.0" "Abu Dhabi" "82880.0"
## [5] "Kabul" "647500.0"
# Select the value of each span found after the 1st position in each div with class = country-info
poparea <- countries %>%
  html_elements(xpath = '//div[@class = "country-info"]/span[position() >= 2]') %>%
  html_text() %>%
  trimws()
head(poparea)
## [1] "84000" "468.0" "4975593" "82880.0" "29121286" "647500.0"
You can also use XPATH's count()
function to find elements that contain a certain number of other elements.
Again, there isn't really an applicable example from the countries dataset, but this would be the syntax.
# Select only divs with one header and at least two paragraphs
# forecast_html %>%
# html_elements(xpath = '//div[count(h2) = 1 and count(p) > 1]')
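Likewise, forecast_html isn't available here, so the following is a hedged, runnable sketch of count() with a made-up HTML string:
# A made-up HTML string (not from the course) to illustrate count() in a predicate
count_demo <- read_html('<html><body>
  <div><h2>Day 1</h2><p>Sunny.</p></div>
  <div><h2>Day 2</h2><p>Rainy.</p><p>Bring an umbrella.</p></div>
</body></html>')

# Select only divs with one header and at least two paragraphs (matches the second div)
count_demo %>%
  html_elements(xpath = '//div[count(h2) = 1 and count(p) > 1]')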
10.3.3 Text()
Sometimes, you only want to select text that's a direct descendant of a parent element. XPATH allows us to do that.
Quick sidenote about using xpath to print a table:
# Extract the data frame from the table using a known function from rvest
friends <- myfriendshtml %>%
  html_element(xpath = "//table") %>%
  html_table()
# Print the contents of the friends data frame
friends
## # A tibble: 2 × 2
## Name Age
## <chr> <int>
## 1 Riley 19
## 2 Lainie 21
# Extracting only the text from the h3 element inside the div element with the specified classes
name <- countries %>%
  html_nodes(".country-name") %>%
  html_text() %>%
  trimws() %>%
  as.character()
head(name)
## [1] "Andorra" "United Arab Emirates" "Afghanistan"
## [4] "Antigua and Barbuda" "Anguilla" "Albania"
Putting everything into a tibble:
countrydata <- tibble(name, pop, capital, area)
The text() function also allows you to select elements (and their parents) based on their text.
programming_html <-
' <h3>The rules of programming</h3>
<ol>
  <li>Have <em>fun</em>.</li>
  <li><strong>Do not</strong> repeat yourself.</li>
  <li>Think <em>twice</em> when naming variables.</li>
</ol> '

programming_html <- read_html(programming_html)
# Select all li elements
programming_html %>%
  html_elements(xpath = '//li') %>%
  # Select all em elements within li elements that have "twice" as text
  html_elements(xpath = 'em[text() = "twice"]') %>%
  # Wander up the tree to select the parent of the em
  html_elements(xpath = '..')
## {xml_nodeset (1)}
## [1] <li>Think <em>twice</em> when naming variables.</li>
10.4 Scraping Best Practices
HTTP stands for Hypertext Transfer Protocol and is a relatively simple set of rules that dictate how modern web browsers, or clients, communicate with a web server. A web document or website that contains multiple assets like text, images, and videos fetches all of these resources via so-called GET requests from one or more servers.
A request is often composed only of a so-called method (in this case, GET), a protocol version, and several so-called headers. The most important header is probably the host, the address of the resource that is to be fetched. In turn, the response from the web server tells the client whether the request was successful, which is denoted by the status code and status message. The headers also tell the client, your browser, how to deal with that response. Helpful information includes, for example, the Content-Type, which tells the browser which format of content was returned. In this case, it's simple HTML text that can now be rendered in the browser. Here are some typical status codes: 200 stands for "everything went well," while 404 indicates that the resource was not found on the web server. Codes in the 300 range are so-called redirects, telling you to fetch the resource at a different address. Lastly, codes in the 500 range usually result when there was an error on the server, for instance, when a program crashed because of the request.
The most common request methods, or at least those that will become relevant when you scrape a page, are GET and POST. GET is always used when a resource, be it an HTML page or a mere image, is to be fetched without submitting any user data. POST, on the other hand, is used when you need to submit some data to a web server. This most often is the result of a form that was filled out by the user. The POST request has a payload, which follows the headers. In this case, a couple of key-value-pairs with data are submitted. These could be form fields that were filled with value1 and value2, respectively. Of course, POST requests also result in a response from the server.
With the httr library, you can easily create and send HTTP requests from your R session. With the aptly named GET() function, for example, you can send a GET request. When doing so in the console, the response from the server is printed as soon as it is received. Here, the request was successful, as you can see from the 200 status code. After the response headers, the actual content of the web site is listed. In this case, it’s HTML text.
If you want, you can extract the actual content from the response and parse it directly into an HTML document that rvest can work with. It's really straightforward: just use the content() function on the response from the GET function. As you can see, the already familiar HTML document results.
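The chapter only scrapes with GET, but here is a hedged sketch of a POST request with a small form-style payload as described above; https://httpbin.org/post is an echo service, and the field names are made up for illustration:
library(httr)

# Submit two made-up form fields as the request payload
response <- POST("https://httpbin.org/post",
                 body = list(field1 = "value1", field2 = "value2"),
                 encode = "form")

status_code(response)   # 200 if the request went through
content(response)$form  # httpbin echoes the submitted key-value pairs back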
10.4.1 httr
Here is an alternative way to read in an HTML document using httr.
library(httr)
countries <- GET("https://www.scrapethissite.com/pages/simple/")
countries1 <- content(countries)
Check the status of the response object:
status_code(countries)
## [1] 200
200 is good!
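As a side note (not from the course), httr also has helpers for reacting to the status code, which can be handy in longer scraping scripts:
# TRUE for 4xx/5xx responses, FALSE otherwise
http_error(countries)

# Throws an R error when the request failed, so a script stops early instead of scraping an error page
stop_for_status(countries)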
10.4.2 User Agent
Sometimes it is good to add a user agent to tell the server who you are.
# Pass a custom user agent to a GET query to the mentioned URL
response <- GET('https://httpbin.org/user-agent', user_agent("A request from a DataCamp course on scraping"))

# Print the response content
content(response)
## No encoding supplied: defaulting to UTF-8.
## {html_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\r\n<center><h1>502 Bad Gateway</h1></center>\r\n</body>
Use set_config() to make a custom user agent globally available across all future requests.
# Globally set the user agent to "A request from a DataCamp course on scraping"
set_config(add_headers(`User-Agent` = "A request from a DataCamp course on scraping"))
# Send a GET query to the mentioned URL; the globally set user agent is used
response <- GET('https://httpbin.org/user-agent')

# Print the response content
content(response)
## $`user-agent`
## [1] "A request from a DataCamp course on scraping"
10.4.3 Throttling
When you are making multiple requests one after another, it can put a lot of strain on a smaller website. To slow down your requests, you can throttle them.
fruit_pages <- c("https://en.wikipedia.org/wiki/Pear",
                 "https://en.wikipedia.org/wiki/Apple",
                 "https://en.wikipedia.org/wiki/Orange_(fruit)")
library(purrr)
# Define a throttled read_html() function with a delay of 1s
read_html_delayed <- slowly(read_html,
                            rate = rate_delay(1))

# Construct a loop that goes over all page urls
for(page_url in fruit_pages){
  # Read in the html of each URL with the function defined above
  html <- read_html_delayed(page_url)
}
Now you can extract what you want using html_elements().
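Note that the loop above overwrites html on every iteration, so only the last page is kept. A hedged alternative sketch with purrr keeps one parsed document per URL; the title selector is just an assumption for illustration:
# Keep one parsed document per page by mapping the throttled reader over the URLs
fruit_htmls <- map(fruit_pages, read_html_delayed)

# For example, pull each article's <title> text (selector assumed for illustration)
fruit_titles <- map_chr(fruit_htmls, ~ .x %>% html_element("title") %>% html_text())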