R: Web Scraping

Collecting and preprocessing data is the first step in any data analysis project or machine learning pipeline. The web plays a crucial role here: authoritative statistical data are often published as tables on regularly updated websites, and data found on social networks can provide valuable ground truth for training machine learning algorithms. However, gathering data from websites is rarely straightforward and requires an understanding of the architecture of the web.

In this course, you'll learn how to use R to collect and parse data found on various kinds of websites. Along the way, you'll get to know typical website architectures and how to approach them efficiently for scraping. The first part of the course will be held remotely and will introduce various concepts and R functions; the second part will be held on site, where you'll work through hands-on scraping challenges.

General information

Duration: 12 hours
Content:
  • Overview of approaches for collecting data from a remote source
  • Introduction to different R packages for scraping (httr and rvest)
  • How to parse tabular data on websites into R data frames
  • Scraping best practices
  • Where to go from here and approaches for more complicated websites
Prerequisites: Either some basic knowledge of R (ideally with the Tidyverse) or completion of the following courses:
  • ARE - R: Basic Introduction
  • ARF - R: tidyverse for Data Science
Target group: Students and employees of the University of Zurich. This course is particularly suitable for students at the MSc/PhD level as well as other academic personnel such as postdocs.
Course material: Handouts will be distributed during the course.

Dates

Code | Instructors | Dates | Available seats | Place
There are currently no open courses.

Please note before booking

Before booking your course, please note our General Conditions of Participation (pdf, 92 KB) and especially our Fair Play: Registration and Deregistration (pdf, 299 KB). Thank you very much!

Contact

E-mail: training@zi.uzh.ch
Contact details

Course programme for the spring semester 2023 (FS23):

The programme for the spring semester 2023 (pdf in German, 475 KB) will be online from January. Registration opens on 01.02.2023 (during the night, from approx. 01:00).

CMS, OLAT and Science IT courses also allow prior registration.