Static webscraping
This course:
What is webscraping?
It is the action of collecting data from webpages.
Anything that is visible on a webpage can be webscraped.
Why is it useful?
To get original data that isn’t available anywhere else.
But webscraping can be time-consuming and isn’t always possible.
Order of preference:
The objective of some websites is to provide data to people: they want us to be able to access this data easily.
They usually provide an API (Application Programming Interface) so that we can interact with their services via code.
This is better than webscraping.
Example: World Development Indicators API
In R, we can use the package WDI to access this API:
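For example, a minimal sketch (the countries and the indicator code are illustrative choices, not from the original slides):

library(WDI)

# Download GDP per capita (constant USD) for Spain and France
gdp <- WDI(
  country   = c("ES", "FR"),
  indicator = "NY.GDP.PCAP.KD",
  start     = 2010,
  end       = 2020
)
head(gdp)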
Many data services have an API: World Bank, Eurostat, Google Maps, ArcGIS, US Census, etc.
Sometimes, those services require an API key (which you might need to pay for).
If you want data from a source, try searching “R package API <name of the service>” first.
For this morning, the objectives are to:
The web works with 3 languages:
HTML tells us the content of the page and how it is structured using tags.
Example: <p>Hello my name is <b>Etienne</b></p>
<p> is a paragraph tag and <b> is a bold tag.
Popular tags:
h1-h4: headers
p: paragraph
a: link
div: generic container
A tag can have attributes, e.g. class or style:
<p class = "my_class">Hello my name is <b>Etienne</b></p>
This is used either for custom styling with CSS or for interactions with JavaScript.
How can we see the HTML of a page?
Either right-click on the page and select "View page source", or open the browser's developer tools (right-click → "Inspect").
CSS is a language that deals only with the style of the page.
It contains no content at all: e.g. it's only there to say "bold text should be in red and underlined".
Example:
b {
color: red;
border-bottom: 1px dashed;
}
It is still useful to know CSS selectors because we'll use them to select specific elements in the page.
p : all "p" tags
p span : all "span" contained in "p" tags
p, a : all "p" and "a" tags
#id1 : all elements with id equal to id1
.class1 : all elements of class "class1"
p.class1 : all "p" elements of class "class1"
p.class1 span : all "span" in "p" tags of class "class1"
p > span : all "span" that are direct children of p
h1 + p : all "p" that follow *directly* an "h1" (direct sibling)
h1 ~ p : all "p" that follow an "h1" (siblings placed after)
[id] : all elements with an existing "id" attribute
[class^=my] : all elements whose class starts with "my"
p[class*=low] : all "p" elements whose class contains the string low
<h2>Who am I?</h2>
<div class="intro">
<p>I'm <span class="age">34</span> and
measure <span class="unit">1.70m</span>.</p>
</div>
<div class="info">
<div>
<p id="like">What I like:</p>
<ul>
<li>Barcelona</li>
<li>winning the Ballon d'Or every odd year</li>
</ul>
<p class="extra">I forgot to say that I like scoring over 100
goals per season.</p>
</div>
</div>
CSS selector for the last paragraph?
p.extra
CSS selector for the inner div (the one without a class)?
div.info > div
JavaScript (JS) is a language that takes care of all the interactions with the page: clicking buttons, filling forms, dynamically updating content, etc.
It makes pages more difficult to scrape; we'll handle that in part 2 tomorrow.
Today, we only look at HTML and CSS and focus on static webpages.
Static webpage: all the content is already in the HTML file sent by the server (no JavaScript needed to display it).
Dynamic webpage: the website uses JavaScript to fetch data from their server and dynamically update the page.
Example: Premier League stats.
Get results of municipal elections in Spain for 2019 from El Pais.
Turns out that since I did this webscraping, an R package has become available: infoelectoral.
Let's pretend it doesn't exist.
My process for webscraping is always the same:
Importantly, separate webscraping from data processing (we’ll see why later).
Let’s start with the results in Bacares in 2019.
In R, the main package for webscraping is rvest.
We can read HTML with read_html() and then select specific HTML elements with html_element() or html_elements() (this is where our CSS knowledge is useful).
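For instance, a minimal sketch for a single results page (the URL is the one used later in these slides; the "table" selector is an assumption about the page's structure):

library(rvest)

url <- "https://resultados.elpais.com/elecciones/2019/municipales/01/04/19.html"
page <- read_html(url)

# Select all tables on the page with a CSS selector...
tables <- html_elements(page, "table")
# ...and parse the first one into a data frame.
results <- html_table(tables[[1]])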
Part 1 in R
Three rules to follow for polite webscraping: seek permission, take it slowly, and never ask twice.
There's an R package for that™: polite.
Instead of fetching pages directly with read_html(), we first bow() to the website and then scrape() it (see the sketch below).
By default we use a 5-second delay, or whatever delay the website recommends.
If we want to lower this, we need to specify our own "User-Agent", which is sort of our ID card on the web.
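A minimal sketch of the polite workflow (the User-Agent string below is a hypothetical example):

library(polite)

url <- "https://resultados.elpais.com/elecciones/2019/municipales/01/04/19.html"

# Instead of rvest::read_html(url), introduce ourselves and
# read the website's robots.txt first...
session <- bow(url)
# ...then fetch the page, respecting the delay.
page <- scrape(session)

# With our own User-Agent, we can ask for a shorter delay:
session <- bow(
  url,
  user_agent = "Jane Doe (jane.doe@example.com)",  # hypothetical ID card
  delay = 2
)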
Now we have the table for one municipality: how can we generalize that (politely)?
The URL is:
https://resultados.elpais.com/elecciones/2019/municipales/01/04/19.html
Three components: the region code (01), the province code (04), and the municipality code (19).
To generalize the code, we need to loop through all of those combinations.
New objective: gather all those combinations from the website.
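As a hypothetical sketch, once the combinations are gathered (here hard-coded for Bacares only), building the URLs is just string interpolation:

# In practice, `codes` would be scraped from the website's index pages.
codes <- data.frame(
  region       = "01",
  province     = "04",
  municipality = "19"
)
urls <- sprintf(
  "https://resultados.elpais.com/elecciones/2019/municipales/%s/%s/%s.html",
  codes$region, codes$province, codes$municipality
)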
Part 2 in R
Great, let’s run everything!
Not so fast
The default behavior of errors is to stop the script: one error near the end of a long scraping loop and everything collected in that run is lost.
We need to handle errors with tryCatch().
tryCatch:
try to run an expression
catch the potential warnings/errors
Example: try to compute log("a").
What if I want to return NA instead of an error?
tryCatch(
# try to evaluate the expression
{
log("a")
},
# what happens if there's a warning?
warning = function(w) {
print("There was a warning. Here's the message:")
print(w)
},
# what happens if there's an error?
error = function(e) {
print("There was an error. Returning `NA`.")
return(NA)
}
)
[1] "There was an error. Returning `NA`."
[1] NA
Create a fake loop that goes from i = 1:10 and creates a list containing i*2. Let's say that there's an error when i = 3:
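The original loop isn't shown in this export; a minimal sketch that matches the described behavior (the error at i = 3 is simulated with stop()):

x <- list()
for (i in 1:10) {
  # simulate an error when i = 3: this stops the whole loop
  if (i == 3) stop("Oops!")
  x[[i]] <- i * 2
}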
We don’t have values for i >= 3
because there was an error that stopped the loop.
Now, let’s catch the error to avoid breaking the loop:
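A sketch that reproduces the output below (again, the error at i = 3 is simulated):

x <- list()
for (i in 1:10) {
  tryCatch(
    {
      # simulate an error when i = 3
      if (i == 3) stop("Oops!")
      x[[i]] <- i * 2
    },
    error = function(e) {
      print(paste0("Error for i = ", i, ". `x[[", i, "]]` will be NULL."))
    }
  )
  print(paste0("i = ", i, ". So far so good."))
}
x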
[1] "i = 1. So far so good."
[1] "i = 2. So far so good."
[1] "Error for i = 3. `x[[3]]` will be NULL."
[1] "i = 3. So far so good."
[1] "i = 4. So far so good."
[1] "i = 5. So far so good."
[1] "i = 6. So far so good."
[1] "i = 7. So far so good."
[1] "i = 8. So far so good."
[1] "i = 9. So far so good."
[1] "i = 10. So far so good."
[[1]]
[1] 2
[[2]]
[1] 4
[[3]]
NULL
[[4]]
[1] 8
[[5]]
[1] 10
[[6]]
[1] 12
[[7]]
[1] 14
[[8]]
[1] 16
[[9]]
[1] 18
[[10]]
[1] 20
tryCatch in our loop
Part 3 in R
Note that until now, we have only gathered the data, not processed it.
Maybe we even did too much work!
We could have simply downloaded each HTML page locally: remember that a webpage is just a text file!
Advantages of downloading locally: you only query the website once, and you can then re-run the data processing as many times as needed without scraping again.
Cons: it takes disk space, and the local copies can get out of date if the website changes.
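A minimal sketch of this approach, assuming we store the pages in a local "pages/" folder (the folder and file names are arbitrary):

dir.create("pages", showWarnings = FALSE)

# Save the raw HTML to disk once...
download.file(
  "https://resultados.elpais.com/elecciones/2019/municipales/01/04/19.html",
  destfile = "pages/01_04_19.html",
  quiet = TRUE
)

# ...and parse the local copy as many times as we want.
page <- rvest::read_html("pages/01_04_19.html")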
Pay attention to a website’s Terms of Use/Service.
Some websites explicitly say that you are not allowed to programmatically access their resources.
I’m not a lawyer.
For me, scraping politely (slowly, identifying yourself) on websites where data is publicly available is fine.
Scraping data behind a paywall or needing an account is not, because the data is not publicly available.
You can get banned temporarily or permanently from some websites.
Sometimes, scraping static websites is needed (but check for available APIs first!).
The package rvest is the best tool to do that in R. If you prefer Python, use BeautifulSoup.
Code on GitHub: https://github.com/etiennebacher/innsbruck_teaching_may_2025
Acknowledgements: some of those slides were adapted from Laurent Bergé’s slides.
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.5.0 (2025-04-11)
os Ubuntu 24.04.2 LTS
system x86_64, linux-gnu
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Paris
date 2025-06-02
pandoc 3.1.3 @ /usr/bin/ (via rmarkdown)
quarto 1.6.40 @ /home/etienne/.local/lib/python3.12/site-packages/quarto_cli/bin/quarto
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
cli 3.6.4 2025-02-13 [1] RSPM
digest 0.6.37 2024-08-19 [1] RSPM
evaluate 1.0.3 2025-01-10 [1] RSPM
fastmap 1.2.0 2024-05-15 [1] RSPM
htmltools 0.5.8.1 2024-04-04 [1] RSPM
jsonlite 2.0.0 2025-03-27 [1] RSPM
knitr 1.50 2025-03-16 [1] RSPM
rlang 1.1.6 2025-04-11 [1] CRAN (R 4.5.0)
rmarkdown 2.29 2024-11-04 [1] RSPM
sessioninfo 1.2.3 2025-02-05 [1] RSPM
xfun 0.52 2025-04-02 [1] RSPM
yaml 2.3.10 2024-07-26 [1] RSPM
[1] /home/etienne/R/x86_64-pc-linux-gnu-library/4.5
[2] /opt/R/4.5.0/lib/R/library
──────────────────────────────────────────────────────────────────────────────