Class 0: Introduction to the Course, Github, Computing Setup#

Goal of today’s class:

  1. Introduce the syllabus

  2. Ensure we’re all on the Explorer Cluster

  3. Set up classroom virtual environment

Welcome to PHYS 7332 - Network Science Data II#


How The Course Works#

This course offers an introduction to network analysis and is designed to provide students with an overview of the core data scientific skills required to analyze complex networks. Through hands-on lectures, labs, and projects, students will learn actionable skills about network analysis techniques using Python (in particular, the networkx library). The course network data collection, data input/output, network statistics, dynamics, and visualization. Students also learn about random graph models and algorithms for computing network properties like path lengths, clustering, degree distributions, and community structure. In addition, students will develop web scraping skills and will be introduced to the vast landscape of software tools for analyzing complex networks. The course ends with a large-scale final project that demonstrates the proficiency of the students in network analysis.

This syllabus may be updated and can be found here: https://brennanklein.com/phys7332-fall25.

This course has been built from the foundation of the years of work/development by Matteo Chinazzi and Qian Zhang, for earlier iterations of Network Science Data.

Course Learning Outcomes#

  • Proficiency in Python and networkx for network analysis.

  • Strong foundation of complex network algorithms and their applications.

  • Skills in statistical description of networks.

  • Experience in collecting and analyzing online data.

  • Broad knowledge of various network libraries and tools.

Course Materials#

There are no required materials for this course, but we will periodically draw from:

Additionally, we recommend engagement with other useful network science and/or Python materials:

Class Expectations#

This is a “new” class (it had previously been taught as a part of the Network Science PhD program at Northeastern, but for the past 5+ years we’ve not taught it). We’ve brought it back after years of realizing that it contains valuable skills for students studying network science, and we’re hoping to gain important insights around what skills are essential to teach, what can be assumed prior knowledge, and what is too challenging for students. As such, we’ll be asking for frequent feedback on the assignments, classes, etc.

Classroom Interactions#

This is largely a “coding” course. That means we’ll all have our computers out and follow along the same notebook as the instructor is teaching. There will be frequent “your turn” sections of the code, where we’ll expect you to break off into solo / small teams to solve an in-class challenge. When the class comes back together, we’ll ask for volunteers to share screen and show off what they’ve done.

Coursework, Class Structure, Grading#

This is a twice-weekly hands-on class that emphasizes building experience with coding. This does not necessarily mean every second of every class will be live-coding, but it will inevitably come up in how the class is taught. We are often on the lookout for improving the pedagogical approach to this material, and we would welcome feedback on class structure. The course will be co-taught, featuring lectures from the core instructors as well as outside experts. Grading in this course will be as follows:

  • Class Attendance & Participation: 10%

  • Problem Sets: 45%

  • Mid-Semester Project Presentation: 15%

  • Final Project — Presentation & Report: 30%

Final Project#

The final project for this course is a chance for students to synthesize their knowledge of network analysis into pedagogical materials around a topic of their choosing. Modeled after chapters in the Jupyter book for this course, students will be required to make a new “chapter” for our class’s textbook; this requires creating a thoroughly documented, informative Python notebook that explains an advanced topic that was not deeply explored in the course. For these projects, students are required to conduct their own research into the background of the technique, the original paper(s) introducing the topic, and how/if it is currently used in today’s network analysis literature. Students will demonstrate that they have mastered this technique by using informative data for illustrating the usefulness of the topic they’ve chosen. Every chapter should contain informative data visualizations that build on one another, section-by-section. The purpose of this assignment is to demonstrate the coding skills gained in this course, doing so by learning a new network analysis technique and sharing it with members of the class. Over time, these lessons may find their way into the curriculum for future iterations of this class. Halfway through the semester, there will be project update presentations where students receive class and instructor feedback on their project topics. Throughout, we will be available to brainstorm students’ ideas for project topics.

Ideas for Final Project Chapters (non-exhaustive):#

  1. Graph Embedding (or other ML technique)

  2. Network Reconstruction from Dynamics

  3. Link Prediction

  4. Graph Distances and Network Comparison

  5. Motifs in Networks

  6. Network Sparsification

  7. Spectral Properties of Networks

  8. Mechanistic vs Statistical Network Models

  9. Robustness / Resilience of Network Structure

  10. Network Game Theory (Prisoner’s Dilemma, Schelling Model, etc.)

  11. Homophily in Networks

  12. Network Geometry and Random Hyperbolic Graphs

  13. Information Theory in/of Complex Networks

  14. Discrete Models of Network Dynamics (Voter model, Ising model, SIS, etc.)

  15. Continuous Models of Network Dynamics (Kuramoto model, Lotka-Volterra model, etc.)

  16. Percolation in Networks

  17. Signed Networks

  18. Coarse Graining Networks

  19. Mesoscale Structure in Networks (e.g. core-periphery)

  20. Graph Isomorphism and Approximate Isomorphism

  21. Inference in Networks: Beyond Community Detection

  22. Activity-Driven Network Models

  23. Forecasting with Networks

  24. Higher-Order Networks

  25. Introduction to Graph Neural Networks

  26. Hopfield Networks and Boltzmann Machines

  27. Graph Curvature or Topology

  28. Reservoir Computing

  29. Adaptive Networks

  30. Multiplex/Multilayer Networks

  31. Simple vs. Complex Contagion

  32. Graph Summarization Techniques

  33. Network Anomalies

  34. Modeling Cascading Failures

  35. Topological Data Analysis in Networks

  36. Self-organized Criticality in Networks

  37. Network Rewiring Dynamics

  38. Fitting Distributions to Network Data

  39. Hierarchical Networks

  40. Ranking in Networks

  41. Deeper Dive: Random Walks on Networks

  42. Deeper Dive: Directed Networks

  43. Deeper Dive: Network Communities

  44. Deeper Dive: Network Null Models

  45. Deeper Dive: Network Paths and their Statistics

  46. Deeper Dive: Network Growth Models

  47. Deeper Dive: Network Sampling

  48. Deeper Dive: Spatially-Embedded and Urban Networks

  49. Deeper Dive: Hypothesis Testing in Social Networks

  50. Deeper Dive: Working with Massive Data

  51. Deeper Dive: Bipartite Networks

  52. Many more possible ideas! Send us whatever you come up with


What You’ll Learn#

Students should leave this class with an ever-growing codebase of resources for analyzing and deriving insights from complex networks, using Python. These skills range from being able to (from scratch) code algorithms on graphs, including path length calculations, network sampling, dynamical processes, and network null models; as well as interfacing with standard data science questions around storing, querying, and analyzing large complex datasets.

Schedule#

DATE

CLASS

Mon, Sep 1, 25

Labor Day

Wed, Sep 3, 25

Class 0: Introduction to the Course, GitHub, Computing Setup

Fri, Sep 5, 25

Class 1: Python Refresher (Data Structures, NumPy)

Mon, Sep 8, 25

Wed, Sep 10, 25

Class 2: Introduction to NetworkX 1 — Loading Data, Basic Statistics

Fri, Sep 12, 25

Class 3: Introduction to NetworkX 2 — Graph Algorithms

Mon, Sep 15, 25

Announce Assignment 1

Wed, Sep 17, 25

Class 4: Distributions of Network Properties & Centralities

Fri, Sep 19, 25

Class 5: Scraping Web Data 1 — BeautifulSoup, HTML, Pandas

Mon, Sep 22, 25

Wed, Sep 24, 25

Class 6: Data Science & SQL

Fri, Sep 26, 25

Class 7: Scraping Web Data 2 — Creating a Network from SQL Data

Mon, Sep 29, 25

Assignment 1 due September 29

Wed, Oct 1, 25

Class 8: Clustering & Community Detection 1 — Traditional

Fri, Oct 3, 25

Class 9: Clustering & Community Detection 2 — Contemporary

Mon, Oct 6, 25

Announce Assignment 2

Wed, Oct 8, 25

Class 10: Clustering & Community Detection 3 — Advanced

Fri, Oct 10, 25

Class 11: Project Update Presentations

Mon, Oct 13, 25

Indigenous Peoples’ Day

Wed, Oct 15, 25

Class 12: Visualization 1 — Python

Fri, Oct 17, 25

Class 13: Visualization 2 — Guest Lecture (Pedro Cruz, Northeastern)

Mon, Oct 20, 25

Assignment 2 due October 20

Wed, Oct 22, 25

Class 14: Introduction to Machine Learning 1 — General

Fri, Oct 24, 25

Class 15: Introduction to Machine Learning 2 — Networks

Mon, Oct 27, 25

Announce Assignment 3

Wed, Oct 29, 25

Class 16: Dynamics on Networks 1 — Diffusion and Random Walks

Fri, Oct 31, 25

Class 17: Dynamics on Networks 2 — Compartmental Models

Mon, Nov 3, 25

Wed, Nov 5, 25

Class 18: Dynamics on Networks 3 — Agent‑Based Models

Fri, Nov 7, 25

Class 19: Network Sampling

Mon, Nov 10, 25

Assignment 3 due November 10

Wed, Nov 12, 25

Class 20: Network Filtering / Thresholding

Fri, Nov 14, 25

Class 21: Dynamics of Networks — Temporal Networks

Mon, Nov 17, 25

Wed, Nov 19, 25

Class 22: Network Comparison & Graph Distances

Fri, Nov 21, 25

Class 23: Network Reconstruction from Dynamics

Mon, Nov 24, 25

Thanksgiving Break (No Class)

Wed, Nov 26, 25

Fri, Nov 28, 25

Mon, Dec 1, 25

Wed, Dec 3, 25

Class 24: Big Data — Scalability & Cluster Computing

Fri, Dec 5, 25

Class 25: Spatial Data, OSMnx, GeoPandas

Mon, Dec 8, 25

Thu, Dec 11, 25

Class 26: Final Presentations

Fri, Dec 12, 25


Explorer Cluster#

From Research Computing at Northeastern: “The Northeastern University HPC Cluster is a high-powered, multi-node, parallel computing system designed to meet the computational and data-intensive needs of various academic and industry-oriented research projects. The cluster incorporates cutting-edge computing technologies and robust storage solutions, providing an efficient and scalable environment for large-scale simulations, complex calculations, artificial intelligence, machine learning, big data analytics, bioinformatics, and more.

Whether you are a seasoned user or just beginning your journey into high-performance computing, our portal offers comprehensive resources, including in-depth guides, tutorials, best practices, and troubleshooting tips. Furthermore, the platform provides a streamlined interface to monitor, submit, and manage your computational jobs on the HPC cluster.

We invite you to explore the resources, tools, and services available through the portal. Join us as we endeavor to harness the power of high-performance computing, enabling breakthroughs in research and fostering innovation at Northeastern University.” (https://rc-docs.northeastern.edu/en/latest/index.html)

Documentation:#

Trainings:#

Open OnDemand (OOD)#

“Open OnDemand empowers students, researchers, and industry professionals with remote web access to supercomputers.” (https://openondemand.org/)

Open OnDemand is a software platform that provides an easy-to-use web interface for accessing and managing high-performance computing (HPC) resources. It simplifies the process of submitting jobs, managing files, and running interactive applications on HPC systems without the need for complex command-line interactions. For the class this fall, we’ll be using Open OnDemand to give students hands-on experience with HPC resources, letting us run simulations, analyze large datasets, and explore computational tools in a user-friendly environment. This is designed to let students focus on learning and applying computational techniques without getting bogged down by technical complexities.


Interacting with OOD#

Open OnDemand Documentation: https://osc.github.io/ood-documentation/latest/


Standard Apps:

Specialized Apps:


Courses:


Our Very Own Partition#


Jupyter Notebook Server#

This app will launch JupyterLab Notebook on a node on the courses partition, with requested CPUs, GPUs and memory for up to 24 hours. A custom conda environment is optional.

Virtual Environment#

What is conda and why do we use it?#

Python has the ability to do a lot of things on its own, but sometimes we need to install packages that extend Python’s functionality. For example, scikit-learn has a lot of machine learning algorithms, and geopy does handy geographic math (like calculating geodesic distances). Managing the dependencies these packages have on other packages is really annoying and really easy to mess up (in fact, it is NP-complete,, which means it’s a very hard type of computational problem), so we use tools called package managers. pip is one package manager, and conda is a fancier one. conda, short for anaconda, can handle packages that are outside the Python ecosystem, like C compilers, whereas pip cannot.

Class Virtual Environment#

Our class has a shared virtual environment that you can access once you use module to load conda. module is a tool that’s frequently used on computing clusters to load software packages on demand. Here’s some information on loading conda on Explorer, and here’s some information on module. First, you need to request a compute node on Explorer. We’ll go to Open On Demand (https://ood.explorer.northeastern.edu/) and open up an interactive shell: opening an interactive shell

Once you’re in the interactive shell, request a compute node:

srun --partition=short --nodes=1 --cpus-per-task=1 --pty /bin/bash

And then you’ll load conda:

module load anaconda3/2024.06

From here, we can load the class virtual environment:

source activate /courses/PHYS7332.202610/shared/phys7332-env

You should see (/courses/PHYS7332.202610/shared/phys7332-env) in front of your command prompt in the Explorer terminal.


GitHub#

What is GitHub?#

GitHub is a website that implements git.

git is a piece of open-source software that does distributed version control. This is really important for big teams making changes simultaneously to a large codebase. Git lets people document and track their code, and its various versions, so they can collaborate more efficiently.

On Github, you’ll keep your code in what’s called a repository. That’s a chunk of code that all works together. You can download a repository to your own computer, make changes to it, and push them to the repository on GitHub. Then your changes will be reflected in the version of the repository that’s on GitHub, and anyone can download the version with your changes in it.

Our class has a repository that’s available at https://github.com/network-science-data-and-models/phys7332_fa25/. You will download a fork of the repository to make your very own. Occasionally we’ll update the repository and you’ll need to incorporate the updates; we’ll explain how to do that in a second.

a screenshot of the github repo with fork command

Forking and Maintaining The Class Repository#

First, let’s fork the repository. Go to the repository GitHub link. You’ll see a little icon on the top left that says “Fork”. Click on it: a screenshot of the github repo with fork command

Once you’ve clicked the fork button, you’ll get this screen:

a screenshot of fork creation Make sure you’re listed as the owner and create the fork. Then you’ll clone your fork in the Explorer terminal using the URL obtained via the green Code buttion: a screenshot of the clone button

We’re going to first navigate to your folder in our course partition:

cd /courses/PHYS7332.202610/students/USERNAME where USERNAME is your Northeastern username. Then we’ll clone the repo:

git clone [URL YOU COPIED]

Now let’s set up a way to get updates from the original repo. Let’s cd into the repo: cd REPO_FILEPATH

git remote add upstream https://github.com/network-science-data-and-models/phys7332_fa25.git

git fetch upstream

git checkout main

git merge upstream/main

This sets up the original version of the GitHub repo as an upstream source. Now we can, at any point, fetch the upstream source and merge it into our main branch, which is the main branch of your forked repository. You’ll be working in this forked repository in your /courses/PHYS7332.202610/students/USERNAME directory throughout the course. If you want to sync your forked version with the original version (as we will be adding new notebooks over the course of the class), you can run the following commands:

git fetch upstream

git checkout main

git merge upstream/main

(Here’s a StackOverflow thread explaining this in more detail.)


Final step: Start a Jupyter Lab instance!#

Recall from earlier, the “Courses” tab:

Click on that, then fill in the following information (after you do this once, it’ll auto-populate for future classes):


We’re in!

Now, navigate to this jupyter notebook, and get a feel for the notebooks, the folders, the files, etc. that you see.


Fixing Github Problems & Creating a One-Stop Shop For Git Updates#

We’re going to go through a series of steps to hopefully make updating the Github repo less of a pain in the butt in the long run!

Setting up SSH Keys with GitHub#

First we’re going to set up SSH keys on Explorer and put them on GitHub. This lets us authenticate our identity to GitHub and makes it possible for us to push our changed code to our forked repository. Neat! First, let’s follow the tutorial here to create SSH keys, because they explain it way better than I can. Then you’ll add your public SSH key (the file that ends in .pub) to GitHub using these instructions. Label them something memorable, like explorer key, so you’ll know which machine(s) it’s attached to.

Syncing the Fork#

Now we’re going to sync our forked repository with the main student repository that I own. We can do this on GitHub, which is neat! This will incorporate all the fancy changes I made (okay, it’s basically one script) and let us test out the git update script. Go to your forked repository on GitHub; you’ll see a button that says Sync fork. Click on that; your screen should look something like this.

Click the green button (Update branch) to sync your fork with the main repository. Don’t discard any commits you may have made.

Going to Explorer and Trying This Weird Script#

Next we’ll go to Explorer and pull our new changes using git pull origin main. You should see a new file come up that is labeled git_fixer.py. That is going to (hopefully) be our new best friend for the near future.

Now, make some modifications to class_00_github_computing_setup.ipynb. You can just add comments at the end or get fancier; do what your heart desires.

Next, we’ll run python3 git_fixer.py. If this script works well enough, it should create a version of any files that you’ve modified from the original that have MODIFIED_{a bunch of numbers}.ipynb at the end of the filename. It will also update any files that have been modified on my end and add any new files that I’ve created in my repository.

Finally, run git status. You may need to add and commit the changes to your fork so you have your modified notebooks (ideally with your notes and added code) on GitHub in your forked repository. Then push them (git push origin main) so they’re reflected in your forked repository.

Going Forward#

At the beginning of each class, run git_fixer.py to make sure you have all the new code for the upcoming classes. Then you will run git status and add/commit/push any changes you wish to see reflected in your forked repository.


Next time…#

Python refresher! class_01_python_refresher.ipynb


References and further resources:#

  1. Class Webpages

  2. Explorer Cluster “OOD”: https://ood.explorer.neu.edu/

  3. Python for R users: https://rebeccabarter.com/blog/2023-09-11-from_r_to_python