# Chapter 5: Scraping Web Data 1 - BeautifulSoup & HTML


1. Come in. Sit down. Open Teams.
2. Make sure your notebook from last class is saved.
3. Open up the Jupyter Lab server.
4. Open up the Jupyter Lab terminal.
5. Activate Conda: `module load anaconda3/2022.05`
6. Activate the shared virtual environment: `source activate /courses/PHYS7332.202510/shared/phys7332-env/`
7. Run `python3 git_fixer2.py --ignore FILEPATH1_TO_IGNORE, FOLDER_TO_IGNORE/`
8. GitHub:
    - git status (figure out what files have changed)
    - git add ... (add the file that you changed, aka the `_MODIFIED` one(s))
    - git commit -m "your changes"
    - git push origin main
________

## Goals of today's class
1. Learn how websites work (at a very high level).
2. Figure out basic principles of scraping; learn some useful tricks for scraping.

## How do websites work?

### What is the "backend" of a website?
The backend of a website has a bunch of moving parts. Content probably lives in a database (or, more likely these days, several databases). There are servers (big computers) with functions that are responsible for getting content from the databases and sending it, in a machine-readable format, to the frontend. This functionality is called an API, or *Application Programming Interface*. It is (generally speaking) how computers request and get data. 
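
To make "machine-readable format" concrete, here is a minimal sketch of what handling an API response might look like. Many APIs return JSON, but the payload and its field names below are invented purely for illustration; they do not come from any real service.

```python
import json

# Hypothetical response body from an API; the structure and field names are made up.
api_response_text = '{"courses": [{"code": "EXAMPLE 101", "title": "A Made-Up Course"}]}'

# json.loads turns the text into ordinary Python dicts and lists.
payload = json.loads(api_response_text)

for course in payload["courses"]:
    print(course["code"], "-", course["title"])
```

The frontend does something similar: it receives structured data like this and decides how to lay it out for a human reader.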

### What is the "frontend" of a website?
The frontend of a website is where you, the human user, see the content. The frontend takes the big chunk of content sent from the server via the API and puts it into a nice, pretty format. This is what you see as "the website." Websites typically display content using a combination of HTML (HyperText Markup Language) templates and JavaScript, which handles the interactive features.

#### HTML
HTML is a markup language that people use to make webpages. It consists of *elements* that can be nested within each other. Elements are indicated with *tags*; typically there are open tags (`<p>`) and close tags (`</p>`) that surround the contents of an element. Browsers read HTML and use the tags to figure out how to display content; you don't see the raw HTML when you use a browser (though you can do so using the "inspect element" feature). Here is an example of a (very bare-bones) HTML file:
```
<!DOCTYPE html>
<html>
<head><title>This is a title!</title></head>
<body>

<h1>This is a header!</h1>

<h2>This is a heading!</h2>

<p>And this is a paragraph</p>

</body>
</html>
```
The `<head>` holds metadata such as the page title; everything visible on the page, including headers like `<h1>`, belongs in the `<body>`. Let's see what this looks like in our notebook!

In [2]:
from IPython.display import display, HTML

my_html_string = """
<!DOCTYPE html>
<html>
<head><title>This is a title!</title></head>
<body>

<h1>This is a header!</h1>

<h2>This is a heading!</h2>

<p>And this is a paragraph</p>

</body>
</html>
"""
display(HTML(my_html_string))

Here's an example with a link and an image:

In [3]:
now_with_link = """
<!DOCTYPE html>
<html>
<head><title>This has a link!</title></head>
<body>
<h1>This has a link!</h1>
<p>
<a href="https://northeastern.edu">This is a link</a>
</p>
<img src="images/whale.jpg" alt="this is a whale" width="200" height="200">
</body>
</html>
"""
display(HTML(now_with_link))

Obviously most websites are fancier than that, but at their core, when you visit them, HTML is being generated -- and you can look at it with your computer instead of via your browser.
The act of looking at web pages via your computer (i.e. programmatically) instead of via a conventional browser is called *scraping*, and it's not super hard to do!
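
As a small taste of what "looking at HTML programmatically" means, here is a sketch that uses Python's built-in `html.parser` module to print the nesting of tags in a tiny HTML string. The `TagOutline` class and the sample string are our own illustrative names, and this is only meant to show that HTML is a tree of nested elements; it is not a substitute for a full parsing library.

```python
from html.parser import HTMLParser


class TagOutline(HTMLParser):
    """Record each opening tag together with how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


sample_html = (
    "<html><body>"
    "<h2>This is a heading!</h2>"
    "<p>And this is a paragraph</p>"
    "</body></html>"
)

parser = TagOutline()
parser.feed(sample_html)

# Indent each tag by its nesting depth to show the tree structure.
for depth, tag in parser.outline:
    print("  " * depth + tag)
```

This simple depth counter assumes every tag is explicitly closed, which is true of the sample string but not of real-world pages (for example, `<img>` has no closing tag).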

## Ways to access a website
### Visiting the website via a browser 
Pros: 
* Does not require that much specialized knowledge.
* Is how you're generally encouraged to use websites.
* Easy to understand what you're looking at.

Cons:
* Does not scale well (if you're trying to look at 5000 webpages, this is not a good approach)

### Using a website's API
Pros:
* Much faster
* Scales better
* Output is easily machine-readable

Cons:
* The API exists because the website's owner allows it to exist (see: Twitter/X). 
* Might cost money
* Might have rate limits
* Output is not easy to read if you are a human

### Scraping a website
Pros:
* Also scales pretty well
* Does not require the goodwill of a website's owner
* A US appeals court has held that scraping publicly accessible data is [legal](https://techcrunch.com/2022/04/18/web-scraping-legal-court/) (the ruling concerned the Computer Fraud and Abuse Act)

Cons:
* You run the risk of getting your IP banned
* Often have to build a custom scraper for each website
* Not doable for all websites (e.g. Facebook)
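
One common way to lower the risk of an IP ban is to pause between requests instead of hitting a site as fast as possible. The sketch below shows the idea; `fetch_politely`, its `delay_seconds` parameter, and the stand-in `fake_fetch` function are hypothetical names chosen for this example, and the one-second default is an arbitrary assumption rather than a rule any particular site publishes.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive requests."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # wait before every request after the first
        results[url] = fetch(url)
    return results


# Stand-in fetcher so the sketch runs without a network connection.
# In practice you would pass a function that makes a real HTTP request.
def fake_fetch(url):
    return f"<html>{url}</html>"


pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fake_fetch,
    delay_seconds=0.1,
)
print(list(pages))
```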

## How do we scrape a website?

### First, we practice good robot citizenship via the `robots.txt` file!
https://en.wikipedia.org/wiki/Robots_exclusion_standard

http://www.robotstxt.org/robotstxt.html

- It is a standard used by websites to communicate with web crawlers and other web robots
- The standard specifies how to inform the web robot about which areas of the website should not be processed or scanned
- Robots are often used by search engines to categorize websites
- Not all robots cooperate with the standard; email harvesters, spambots, malware, and robots that scan for security vulnerabilities may even start with the portions of the website where they have been told to stay out

In practice,
- when a site owner wishes to give instructions to web robots, they place a text file called robots.txt in the root of the website hierarchy (e.g. https://www.example.com/robots.txt)
- this text file contains the instructions in a specific format
- robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the website
- if this file doesn't exist, web robots assume that the site owner wishes to provide no specific instructions, and crawl the entire site.
- a robots.txt file covers one origin. For websites with multiple subdomains, each subdomain must have its own robots.txt file.

### Let's check out the `robots.txt` for Northeastern's course catalog using the `requests` package.
The `requests` package lets us make requests to websites or APIs. It gives us back a response object whose `.text` attribute holds the content that was sent back; for a webpage, that is HTML we can read through as if it were an .html file.

In [2]:
import requests
res = requests.get('https://catalog.northeastern.edu/robots.txt')
print(res.text)

Sitemap: https://catalog.northeastern.edu/sitemap.xml
User-agent: *
Disallow: /archive/
Disallow: /admin/
Disallow: /pagewiz/
Disallow: /courseleaf/
Disallow: /wiztest/
Disallow: /navbar/
Disallow: /gallery/
Disallow: /clmail/
Disallow: /dbleaf/
Disallow: /depts/
Disallow: /responseform/
Disallow: /mig/
Disallow: /tmp/
Disallow: /ribbit/
Disallow: /azindex/
Disallow: /catalogcontents/
Disallow: /shared/
Disallow: /cim/
Disallow: /courseadmin/
Disallow: /programadmin/
Disallow: /miscadmin/
Disallow: /js/
Disallow: /images/
Disallow: /css/
Disallow: /styles/
Disallow: /search/
Disallow: /xsearch/
Disallow: /migration/
Disallow: /fonts/
Disallow: /pdf/
Disallow: /wen/
Disallow: /graduate/engineering/multidisciplinary/user-experience-design-graduate-certificate/
Disallow: /graduate/health-sciences/nursing/dnp-concentration-nurse-anesthesia/
Disallow: /graduate/engineering/multidisciplinary/full-stack-software-engineering-graduate-certificate/
Disallow: /graduate/engineering/multidisciplina

Our `User-agent` is categorized under `*`, so we're not allowed to look at a bunch of different pages, as listed above. That's okay, though, because we're allowed to look at `/course-descriptions`. 
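
Rather than reading the rules by eye, we can also check them programmatically. Here is a minimal sketch using Python's built-in `urllib.robotparser`. To keep it runnable offline, it parses a hand-copied subset of the rules printed above instead of downloading the file, so it only knows about the `Disallow` lines we pasted in.

```python
from urllib.robotparser import RobotFileParser

# A few of the rules from the robots.txt output above (not the complete file).
robots_lines = [
    "User-agent: *",
    "Disallow: /archive/",
    "Disallow: /admin/",
    "Disallow: /search/",
]

rp = RobotFileParser()
rp.parse(robots_lines)

base = "https://catalog.northeastern.edu"

# can_fetch(user_agent, url) returns True if the rules allow that agent to visit the URL.
course_pages_allowed = rp.can_fetch("*", base + "/course-descriptions/")
archive_allowed = rp.can_fetch("*", base + "/archive/")

print("course descriptions allowed:", course_pages_allowed)
print("archive allowed:", archive_allowed)
```

To check against the live file instead, `RobotFileParser` also offers `set_url(...)` followed by `read()`, which downloads and parses the site's current `robots.txt`.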

## Actually Scraping Data
Now we're going to walk through the process of crawling Northeastern's course catalog a bit. We'll start by looking at some real HTML and discuss parsing it. 

In [3]:
# Scraping the main catalog page
catalog_res = requests.get('https://catalog.northeastern.edu/course-descriptions/')
catalog_html = catalog_res.text

# Displaying the raw HTML
catalog_html

'\n\n<!doctype html>\n<html class="no-js" xml:lang="en" lang="en" dir="ltr">\n\n<head>\n<meta http-equiv="X-UA-Compatible" content="IE=Edge" />\n<title>Course Descriptions &lt; Northeastern University Academic Catalog</title>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n<meta property="og:site_name" content="Northeastern University Academic Catalog" />\n<link rel="search" type="application/opensearchdescription+xml"\n\t\t\thref="/search/opensearch.xml" title="Catalog" />\n<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0" />\n<link href="/favicon.ico" rel="shortcut icon" />\n<link rel="stylesheet" type="text/css" href="/css/reset.css" />\n<link rel="stylesheet" type="text/css" href="//fonts.googleapis.com/css?family=Roboto:400,400i,500,500i,700,700i">\n<link rel="stylesheet" type="text/css" href="//fonts.googleapis.com/css?family=Lato:300,300i,400,400i,700,700i,900">\n<link rel="stylesheet" type="text/css" href="/fonts/fo

Wow! That's really hard to read if you're a human! We're going to use the `BeautifulSoup` Python package to parse the HTML that we just got. Parsing HTML on your own is not something I recommend; there are already tools that do it correctly, and writing the [regular expressions](https://www.regular-expressions.info/) to navigate the tree structure of HTML is simply not worth your time.

In [123]:
from bs4 import BeautifulSoup
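# Naming the parser explicitly avoids bs4's "no parser was explicitly specified" warning
soup = BeautifulSoup(catalog_html, 'html.parser')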
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" dir="ltr" lang="en" xml:lang="en">
 <head>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <title>
   Course Descriptions &lt; Northeastern University Academic Catalog
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="Northeastern University Academic Catalog" property="og:site_name"/>
  <link href="/search/opensearch.xml" rel="search" title="Catalog" type="application/opensearchdescription+xml"/>
  <meta content="width=device-width, initial-scale=1.0, minimum-scale=1.0" name="viewport"/>
  <link href="/favicon.ico" rel="shortcut icon"/>
  <link href="/css/reset.css" rel="stylesheet" type="text/css"/>
  <link href="//fonts.googleapis.com/css?family=Roboto:400,400i,500,500i,700,700i" rel="stylesheet" type="text/css"/>
  <link href="//fonts.googleapis.com/css?family=Lato:300,300i,400,400i,700,700i,900" rel="stylesheet" type="text/css"/>
  <link href="/fonts/font-awesome/font-awesome.min.css" re

### Reading HTML with BeautifulSoup
As you can see, `BeautifulSoup` puts the HTML in a much neater format. We can also use it to programmatically navigate the HTML document we're seeing. Let's open up the [course catalog](https://catalog.northeastern.edu/course-descriptions/) in our browser and use the "inspect element"/"inspect" tool in our browser to see exactly which parts of the HTML are responsible for important chunks of the content we're seeing.

Right-click on the title ("Course Descriptions") and use the menu that appears to open up the inspect tool. Here's what this looks like in Firefox:
![Screenshot of Firefox inspect element](images/inspect_course_description_lesson_5.png)

We know that the element containing the large words "Course Descriptions" is an `h1` element (a top-level heading) nested inside a div whose ID is `site-title`. `div` tags denote separate divisions or sections within an HTML document. Let's use BeautifulSoup to find this particular element and see what's nested inside it:

In [133]:
site_title_div = soup.find('div', id="site-title")
print('this is the entire div in question:')
print(site_title_div)
print('\n and this is the H1 element:')
print(site_title_div.h1)
print('\n and here is the text we actually see in our browser:')
print(site_title_div.h1.string)

this is the entire div in question:
<div id="site-title">
<div class="wrap">
<h1>
                Course Descriptions
            </h1>
</div>
</div>

 and this is the H1 element:
<h1>
                Course Descriptions
            </h1>

 and here is the text we actually see in our browser:

                Course Descriptions
            


We can use BeautifulSoup to navigate the HTML document! Let's try looking at a bunch of links and play around with navigating them. Specifically, we're going to look at the different departments with courses listed in the course catalog. Let's start by inspecting the first linked department, Accounting:
![inspecting the element containing the link to Accounting](images/accounting_inspect_course_5.png)

The department links appear to be nested inside a div whose ID is `atozindex`. Inside this div, if we scroll through the inspector console, we'll see a bunch of `ul` elements (`ul` makes an unordered (bulleted) list) alternating with `h2` elements, one of each for each letter of the alphabet.
![The ul and h2 elements for the alphabetical department listings](images/alphabetical_div_inspect_course_5.png)

Let's pull out the div we're interested in and look at all the `ul` elements nested inside it. Inside each `ul` are the links to all the departments starting with a particular letter. 

In [148]:
soup_alpha_div = soup.find('div', id="atozindex")
print('This is the first ul element:')
first_ul = soup_alpha_div.find('ul')
print(first_ul)
print('\n And this is the first list (li) element within the list:')
first_bullet = first_ul.find('li')
print(first_bullet)
print('\n Finally, we can get the link and its href (the partial URL that points to the accounting courses):')
print(first_bullet.a.get('href'))

This is the first ul element:
<ul>
<li><a href="/course-descriptions/acct/">Accounting (ACCT)</a></li>
<li><a href="/course-descriptions/acc/">Accounting - CPS (ACC)</a></li>
<li><a href="/course-descriptions/avm/">Advanced Manufacturing Systems - CPS (AVM)</a></li>
<li><a href="/course-descriptions/afam/">African American Studies (AFAM)</a></li>
<li><a href="/course-descriptions/afcs/">Africana Studies (AFCS)</a></li>
<li><a href="/course-descriptions/afrs/">African Studies (AFRS)</a></li>
<li><a href="/course-descriptions/amsl/">American Sign Language (AMSL)</a></li>
<li><a href="/course-descriptions/aly/">Analytics - CPS (ALY)</a></li>
<li><a href="/course-descriptions/anth/">Anthropology (ANTH)</a></li>
<li><a href="/course-descriptions/ant/">Anthropology - CPS (ANT)</a></li>
<li><a href="/course-descriptions/apl/">Applied Logistics - CPS (APL)</a></li>
<li><a href="/course-descriptions/arab/">Arabic (ARAB)</a></li>
<li><a href="/course-descriptions/arch/">Architecture (ARCH)</a></

If we want to get all items matching a description, we use the `find_all` function. Let's look at all the bulleted lists of departments and get the links that point to their courses:

In [149]:
department_hrefs = []
for ul in soup_alpha_div.find_all('ul'):
    for li in ul.find_all('li'):
        department_hrefs.append(li.a.get('href'))

We can use these hrefs (which are partial URLs) to navigate programmatically to a specific department's courses. Let's pick a random department href and navigate to its course descriptions:

In [152]:
import random
my_href = random.choice(department_hrefs)

my_full_url = 'https://catalog.northeastern.edu' + my_href
print(my_full_url)
dept_html = requests.get(my_full_url).text

https://catalog.northeastern.edu/course-descriptions/phl/
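
String concatenation works here because every href starts with `/`. As a more general alternative, the standard library's `urllib.parse.urljoin` resolves an href against the page it came from, whether the href is root-relative, page-relative, or already a full URL. A short sketch (the `acct/` page-relative href is a made-up example of a form this catalog page does not actually use):

```python
from urllib.parse import urljoin

# The page we scraped the hrefs from.
page_url = "https://catalog.northeastern.edu/course-descriptions/"

# Root-relative href, like the ones in department_hrefs.
root_relative = urljoin(page_url, "/course-descriptions/phl/")

# Page-relative href (hypothetical; resolved against the page's own path).
page_relative = urljoin(page_url, "acct/")

# An href that is already a full URL is returned unchanged.
absolute = urljoin(page_url, "https://northeastern.edu")

print(root_relative)
print(page_relative)
print(absolute)
```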


## Scraping Course Descriptions
Using the URL for your randomly selected department and the skills you've just learned, see if you can gather up some course titles and descriptions. What would be a good way to store this data? Does your answer change if you just need to plug the data into a function versus needing to send the data to a colleague?

In [153]:
# Your Turn!
def get_course_titles_and_descriptions(dept_html):
    """
    Given a string of raw html (dept_html) that contains a department's course titles and descriptions,
    return a useful data structure that contains only the titles and descriptions. 
    """
    pass

### Bonus Fun:
Can you find the links to the prerequisites for a course in your data structure and retrieve *the prerequisites' descriptions* if they are in the same department?

## Word Frequency & Co-Occurrence
We have functionality to obtain course descriptions, so let's put it to use. For a particular department's course listings, let's look at what words they use most often. To do this, we'll make use of two neat tricks: the `split` method for strings and `collections.Counter`.

To split up a string by any delimiter, we can use `split`. By default, `my_string.split()` splits a string on runs of whitespace (spaces, tabs, newlines), returning a list of the chunks in between. If you want to use a different delimiter, like a comma, you can use `my_string.split(",")`. Delimiters can also be more than one character; a common one is a comma followed by a space. Additionally, if you have extra whitespace on either side of a string after you split it up, you can use `my_chunk.strip()` to strip leading & trailing whitespace.

We'll also want to make sure that uppercase versions of a word are indexed the same as lowercase versions of the same word, so we'll use `my_string.lower()` to make sure all words are in lowercase. 

And recall that `collections.Counter` can count how many times each unique item shows up in an iterable (like a list).

In [10]:
my_string = "whale dolphin fish shark"
print(my_string.split())
my_other_string = "whale,dolphin,fish,shark"
print(my_other_string.split(','))
my_next_string = "whale, dolphin, fish, shark"
print(my_next_string.split(', '))

my_string_with_spaces = '  whale '
print('my string is ' + my_string_with_spaces)
print('my string is ' + my_string_with_spaces.strip())

uppercase_string = "WHALE"
print(uppercase_string.lower())

['whale', 'dolphin', 'fish', 'shark']
['whale', 'dolphin', 'fish', 'shark']
['whale', 'dolphin', 'fish', 'shark']
my string is   whale 
my string is whale
whale
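
The cell above covers `split`, `strip`, and `lower`. As a refresher on the other tool, here is a short sketch of `collections.Counter` on a toy string (the string and variable names are just for illustration):

```python
from collections import Counter

animal_string = "Whale dolphin whale fish WHALE dolphin"

# Lowercase each word so "Whale", "whale", and "WHALE" are counted together.
words = [word.lower() for word in animal_string.split()]

animal_counts = Counter(words)
print(animal_counts)

# most_common(n) returns the n most frequent items as (item, count) pairs.
top_two = animal_counts.most_common(2)
print(top_two)
```

A `Counter` behaves like a dict, so `animal_counts["whale"]` gives the count for a single word, and a word that never appeared has a count of 0 rather than raising an error.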


### Word Frequency
Now we'll write a function that takes the output of the last function you wrote, `get_course_titles_and_descriptions`, and returns a dict of word frequencies. We'll also remove *stopwords*, which are words that are so commonly used they don't tell us much about the text. These are words like "an", "and", "the", "than", etc. (We sourced our stopwords from [this repo](https://github.com/stopwords-iso/stopwords-en)). Can you make a histogram of the top 10 most frequently used words in your list of course descriptions?

In [12]:
# Your Turn!
STOPWORDS = set()
with open('data/stopwords-en.txt', 'r') as f:
    for line in f.readlines():
        STOPWORDS.add(line.strip())
        
def get_word_frequencies_from_all_course_descriptions(course_descriptions):
    """
    Given a list of course descriptions, return a dict of word usage frequencies. 
    
    Input: course_descriptions, a list of strings
    Output: word_freqs, a dict mapping words (lowercased) to integer frequency counts.
    """
    pass

In [14]:
# Your Turn Again!
%matplotlib inline
import matplotlib.pyplot as plt

# plt.bar(top_ten_words, top_ten_word_counts)

### Word Co-Occurrence
Another interesting thing we can do is create a data structure for word co-occurrence -- which words show up in the same course description a lot of the time? Given that same list of course descriptions, let's create a data structure that keeps track of which words appear frequently in the same description. I recommend using a matrix or a nested dictionary; both have upsides and drawbacks. Which pair of words occurs together the most frequently?

In [13]:
# Your Turn!
def get_word_co_occurrences_from_all_course_descriptions(course_descriptions):
    """
    Given a list of course descriptions, return a data structure with word co-occurrence counts.
    
    Input: course_descriptions, a list of strings
    Output: a data structure of your choice indicating word co-occurrences.
    """
    pass

## Resources & Acknowledgements

The description of the `robots.txt` file we use in this lesson comes from [Dr. Matteo Chinazzi](https://www.matteochinazzi.com/) and [Dr. Qian Zhang's](https://www.zhangqianrach.org/) 2018 rendition of this course.

[A more in-depth view on how websites work](https://www.freecodecamp.org/news/how-the-web-works-a-primer-for-newcomers-to-web-development-or-anyone-really-b4584e63585c/)

[Selenium](https://www.selenium.dev/) is a package that lets you programmatically simulate using a browser to scrape and interact with websites; it's handy for interacting with the JavaScript elements of websites, for example.

[nltk](https://www.nltk.org/), or the Natural Language Toolkit, is a Python package for cleaning and parsing text (natural language) data. If you're interested in diving into natural language processing more, NLTK is a great first step.

[More on co-occurrence matrices](https://www.baeldung.com/cs/co-occurrence-matrices)