---
title: "Web scraping"
subtitle: "SICSS, 2022"
author: Christopher Barrie
format:
  revealjs:
    chalkboard: true
editor: visual
---
## Scraping webpages
```{r, eval = T, echo = F}
library(dplyr)
library(readr)
library(stringr)
library(rvest)
```
```{r, eval = F, echo = T}
#| code-line-numbers: "|6|8"
library(dplyr)
library(readr)
library(stringr)
library(rvest)

url <- "https://wayback.archive-it.org/2358/20120130161341/http://www.tahrirdocuments.org/2011/03/voice-of-the-revolution-3-page-2/"

html <- read_html(url)
```
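## When a page fails to load

Requests to archived pages can fail, and by default `read_html()` stops with an error. The sketch below wraps it in `tryCatch()` so a failure returns `NULL` instead. The helper name `safe_read_html()` is hypothetical (it is not part of rvest), and the inputs are stand-ins so the example runs without a network connection.

```{r, eval = F, echo = T}
library(rvest)

# Hypothetical helper, not part of rvest: return NULL instead of stopping
safe_read_html <- function(x) {
  tryCatch(read_html(x), error = function(e) NULL)
}

# A literal HTML string parses without touching the network
ok <- safe_read_html("<p>Hello</p>")

# A path that does not exist raises an error, so the helper returns NULL
missing <- safe_read_html("no-such-page.html")
```

Checking `is.null()` on the result lets later code skip pages that could not be read.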
## Getting page text
```{r, eval = F, echo = T}
#| code-line-numbers: "|2|3|4"
# identify relevant text
html %>%
  html_elements("p") %>%
  html_text(trim = TRUE)
```
## Getting page elements
```{r, eval = F, echo = T}
#| code-line-numbers: "|2|3|4"
# identify elements by CSS class
html %>%
  html_elements(".calendar") %>%
  html_text(trim = TRUE)
```
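## Tag vs. class selectors

The two selectors above can be compared offline. This is a minimal sketch using `minimal_html()` from rvest; the page fragment and its contents are made up for illustration and are not taken from the archived site.

```{r, eval = F, echo = T}
library(rvest)

# Hypothetical page fragment, invented for illustration
page <- minimal_html('
  <p class="calendar"> 12 March 2011 </p>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
')

# Tag selector: every <p> element, whatever its class
all_paragraphs <- page %>%
  html_elements("p") %>%
  html_text(trim = TRUE)

# Class selector: only elements whose class is "calendar"
dates <- page %>%
  html_elements(".calendar") %>%
  html_text(trim = TRUE)
```

Here `all_paragraphs` holds three strings and `dates` holds only the first, with surrounding whitespace removed by `trim = TRUE`.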
## Looping through urls
```{r, eval = F, echo = T}
#| code-line-numbers: "|2|3|4|5|6|7|8|9"
pamlinks_all <- character(0)  # assumes urlpages_all is a character vector of page URLs
for (i in seq_along(urlpages_all)) {
  url <- urlpages_all[i]
  html <- read_html(url)
  links <- html_elements(html, ".post , h2") %>%
    html_children() %>%
    html_attr("href") %>%
    na.omit() %>%
    `attributes<-`(NULL)  # strip the attributes na.omit() adds
  pamlinks_all <- c(pamlinks_all, links)
}
```
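## What the loop collects

To see what the loop body extracts, the same selector chain can be run on small in-memory pages. This is a sketch under stated assumptions: the two pages and their `example.org` links are hypothetical, and `is.na()` filtering stands in for the `na.omit()` step.

```{r, eval = F, echo = T}
library(rvest)

# Hypothetical index pages, invented for illustration
pages <- list(
  minimal_html('<h2><a href="https://example.org/doc-1/">Doc 1</a></h2>
                <div class="post"><p>No link here.</p></div>'),
  minimal_html('<h2><a href="https://example.org/doc-2/">Doc 2</a></h2>')
)

links_all <- character(0)
for (i in seq_along(pages)) {
  links <- pages[[i]] %>%
    html_elements(".post , h2") %>%  # elements with class "post", plus all <h2>
    html_children() %>%              # their direct children
    html_attr("href")                # NA for children without an href
  links_all <- c(links_all, links[!is.na(links)])
}
```

Children without an `href` attribute, such as the `<p>` inside `.post`, yield `NA` and are dropped, so only the two link targets remain in `links_all`.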
## It's not always that easy...
![](images/gesslerscraping.png){fig-align="center"}