
Python Web Scraping Introduction

Written by Matthew Yeager
10-minute read (1200 words)
Published: Sat Jul 06 2019
Updated: Mon Jul 08 2019
Web Scraping for Beginners
Finding the data and links we want to automate

Update: The Women's World Cup has completed for 2019. If you are looking for the Women's World Cup 2019 dataset you can find it here. Keep reading to find out how to use Python to scrape web data.


I am going to be using a Python Notebook (how to install?). This comes with all the libraries we will need already installed, at versions that work together. If you prefer to manage your environment and package versions yourself, you will still be able to follow along.

This tutorial focuses on the basics of a homegrown web scraping solution. Your project might just be collecting and storing data for your own personal analysis. This won't be a large application that runs on a predefined schedule and integrates its results into a relational database. It is a practical walkthrough that produces a CSV after only a few steps.

As the Women's World Cup 2019 wraps up this weekend, we'll look at the players' statistics. The FIFA website presents information about players in a very standard layout. Each player page looks the same, except for the data itself. We need to think about how we will find all the pages. We can start at the teams page. Next, just like when we browse the website ourselves, we visit each team's page, where we can find a list of players. Once we visit each player's page, we store their statistics for our personal data analysis later on.

FIFA Women's World Cup 2019 France (www.fifa.com)

Women's World Cup 2019 website

Using Chrome DevTools to find the data
Step one, manually walk through the process you want

There are a few ways to open Chrome DevTools. The easiest is to right-click on the page and select "Inspect". This opens DevTools, which sometimes starts off docked within the page. You can pop it out into its own window using the settings menu in the top right.

Chrome DevTools Dock Setting

Next, we want to find the underlying HTML, so use the Element Selector to click on a team. There are many different ways a webpage can handle a mouse click and send a user to the next page. We need to make sure we understand the format of this page.

Chrome DevTools Element Selector

Chrome DevTools Element Selector - Teams

Now we can see the format of the links (bottom of the image) within the page.
<a href="/womensworldcup/teams/team/1882883/"
Move on to the team page to see how players' pages are linked.

Chrome DevTools Element Selector - Players

<a href="/womensworldcup/players/player/298807/"

Finding the format of the links to follow should be straightforward, as we just saw. However, once we reach a page with the data we want, it can be more difficult to find a common pattern for extracting the information.

To find how the data first got to the page, we should take a look at the network activity. We won't be going deep into all the web traffic or technical protocols. We are just hoping to find out if the data is available in any other format.

My first step is to switch to the Network tab within DevTools (we have been working in the Elements tab) and visit (or refresh) a player's page. After a few seconds of loading, turn off recording network activity via the red record icon in the top left. Sort by Type, and look to see whether the data arrived directly within the webpage or whether a separate request was made to fetch it (one that we can also use!). I look at the names of the files of type "Document" and "xhr". If something looks interesting, clicking on it reveals a details panel where you can find the data your computer received.

Chrome DevTools Network Panel

Chrome DevTools Network Requests

Perfect! When we click on _player-profile-data we can see that the response looks like the top part of the player's page. This is exactly the data we are after. It will be easier to parse this subset than to work with the entire page, which could contain ads, videos, and tons of third-party JavaScript. It looks like we have everything we need to start fetching and storing data.


"/womensworldcup/teams/team/1882883/"
"/womensworldcup/players/player/298807/"
"/womensworldcup/players/player/298807/_libraries/_player-profile-data"
"/womensworldcup/players/player/298807/_libraries/_player-statistics"
Automating web scraping with Python
Fetch Webpages - Store HTML - Parse Values - Analyze Data!

The code will be easier to understand now that we have manually walked through what we want to accomplish. Let's look at the different Python packages we will use:

  • requests - Makes web requests to fetch web pages
  • time - Lets us pause (sleep) between requests so we don't fetch pages faster than we would when browsing the web normally
  • re - Parses the data using a pattern matching language known as Regular Expressions. This topic can get complex, but we are keeping it simple for this first script.

So let's import these libraries and start by pulling the first webpage that has all the team links.

import time
import re
import requests

# Webpage addresses
TEAMS_ENDPOINT = r'https://www.fifa.com/womensworldcup/teams/'
TEAM_ENDPOINT = r'https://www.fifa.com/womensworldcup/teams/team/{}/'

response = requests.get(TEAMS_ENDPOINT)
response.text[:600]

What did you find in the response we got from the teams page? It definitely looked like HTML from a webpage to me! It is tough to see specifics because there is so much content coming back. The next step is to extract just the team ids from all this data and then navigate to each team page. To pull out the team ids, we are going to use a Regular Expression. This won't be an exhaustive resource on Regular Expression patterns - it isn't even an introduction. Regular Expressions are a powerful matching library that lets us use special tokens to generalize our searches. For instance, if we know team ids will always be numbers, we can ask the pattern to capture those. Remember, if you want to update the patterns, reference the documentation and test your patterns with an online tool.
The pattern

https://www.fifa.com/womensworldcup/teams/team/(\d+)/
uses the special expression \d+ to match any numbers it might find [\d means digit, + means 1 or more].

Now in a new cell, let's try out this pattern matching to see what we get back. If we use a new cell in the Python Notebook, it will allow us to iterate on our work without needing to wait for the network requests to come back with the HTML we already have. We are hoping to capture just the team ids.

RX_TEAM = r'https://www.fifa.com/womensworldcup/teams/team/(\d+)/'
teams = list(set(re.findall(RX_TEAM, response.text)))

Explore the contents of the teams list to make sure it is valid. You could try out some of the team ids in the team page URL.
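For example, a quick sanity check in a new cell might look like this (the exact ids you see will depend on the live page):

# How many team ids did we capture, and do they look like numbers?
print(len(teams))
print(teams[:3])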

response_teams = []
for team in [teams[0]]:
    time.sleep(0.65) # Wait 0.65 seconds between requests
    response_teams.append(requests.get(TEAM_ENDPOINT.format(team)))

We only ran the request for the first team id. We want to flesh out the process and make sure our data is correct before spending time waiting for every web page to load. Taking an iterative approach allows us to test each step of the process and understand what the data will look like. Now, parse each team page to find the player ids. This Regular Expression will look similar to the team one we used.

RX_PLAYER = r'/womensworldcup/players/player/(\d+)/'

ALL_PLAYERS = {}
for response in response_teams:
    players = {i: {} for i in set(re.findall(RX_PLAYER, response.text))}
    ALL_PLAYERS.update(players)

Look at ALL_PLAYERS to check that we have valid player ids. Here we are setting up a dictionary that will let us store information about each player. Dictionaries are simply a way to store data that can be looked up by a key. In a list, each item is stored one after another; if we wanted to find a specific player, we would need to look at each item until we found a match. A dictionary lets us get the player immediately once we know their key value (the player id). When looping over a dictionary you are given the key and can use it to access the value.
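As a toy illustration of the difference (these names and the second id are made up, not from the dataset):

# With a list, we scan item by item until we find a match
players_list = [('298807', 'Player A'), ('500001', 'Player B')]
match = next(name for pid, name in players_list if pid == '298807')

# With a dictionary, we jump straight to the value using its key (the player id)
players_dict = {'298807': 'Player A', '500001': 'Player B'}
match = players_dict['298807']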

Hopefully these steps are making sense and you can see that we will be fetching the player data next. To keep fetching and parsing separated, we will save each of the responses as attributes on the player. Then we will be able to iterate over each player and parse their values. Later, if we realize there is a new value to parse or we want to update our parsing logic, we won't need to fetch the raw data again.

PLAYER_ENDPOINT = r'https://www.fifa.com/womensworldcup/players/player/{}/_libraries/_player-statistics'
PLAYER_BIO_ENDPOINT = r'https://www.fifa.com/womensworldcup/players/player/{}/_libraries/_player-profile-data'

for i, player_key in enumerate(ALL_PLAYERS):
    if i % 10 == 0: print(f'{i}/{len(ALL_PLAYERS)}')  # Simple progress indicator

    # Fetch and store the raw statistics fragment
    time.sleep(0.65)
    response = requests.get(PLAYER_ENDPOINT.format(player_key))
    ALL_PLAYERS[str(player_key)]['player_stats_raw'] = response.text

    # Fetch and store the raw profile (bio) fragment
    time.sleep(0.65)
    response = requests.get(PLAYER_BIO_ENDPOINT.format(player_key))
    ALL_PLAYERS[str(player_key)]['player_bio_raw'] = response.text

Always break up fetching and parsing data! This way you can continue to evolve your parsing without continuing to request the same static data. Repeated requests for the same resource may cause undue stress on the website and they may have automated means to block your future requests.
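If you want to be extra careful, one option is to persist the raw HTML to disk so the fetch step never needs repeating, even across notebook restarts. A minimal sketch (the raw_players.json filename is just an example):

import json

# Save the raw responses so parsing can be re-run without re-fetching
with open('raw_players.json', 'w') as f:
    json.dump(ALL_PLAYERS, f)

# Later, load them back instead of hitting the website again
with open('raw_players.json') as f:
    ALL_PLAYERS = json.load(f)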

for player_key in ALL_PLAYERS:
    print(player_key)
    print('stats', ALL_PLAYERS[player_key]['player_stats_raw'][:300])
    print('bio', ALL_PLAYERS[player_key]['player_bio_raw'][:300])

Take a look at the resulting HTML and try to find common patterns in how the labels and data are presented. Oftentimes the styling of the site requires a standard layout for grids and forms. Here we've already distilled the patterns, which will capture the keys (the names of the fields) and the values.

When we get the values, they will need to be cleaned further. Maybe we have different units, some values showing distance and others just numbers. How will we handle birthdays, empty values, and numerical data?

Review the individual values and look for common text to remove. All heights have a "cm" suffix? We should strip that so the data reads as a number. Birthdays are a little more complicated, but we have a solution for that. This is more along the lines of a one-time script; it doesn't have a robust classification and reporting system... but it works for this :) Remove "cm", remove duplicate spaces, remove "span" tags, take out parentheses... an iterative process of reviewing the results and taking action.

This next function is complex and handles all the values that have been seen while fetching and parsing the data. It is not robust: new data or formats could cause the entire script to stop working. This is a trade-off we are making to produce a full web scraping application in a single script. If you want to dive deeper, you can print out the keys and values and try to rebuild this function yourself. Otherwise, take this function as-is and know that your data application will require a similar function that parses and normalizes the data when it comes in a few different formats.

import datetime
def player_stats_from_raw(stats_raw):
    player_stats = {}
    for stat in stats_raw:
        value, key = stat
        value = value.replace('%', '')
        value = value.replace('cm', '')
        # Remove duplicate spaces using a Regular Expression
        value = re.sub(r'\s\s+', ' ', value.replace('\n', '')).strip()
        # Strip leftover span tags and parentheses
        value = re.sub(r'</?span[^>]*>|[()]', '', value)

        if '.' in value and len(value) < 9:
            value = float(value)
        elif value == '':
            value = 0
        elif value.isdigit():
            value = int(value)
        elif any(char.isdigit() for char in value):
            value = datetime.datetime.strptime(value, '%d %B %Y')
        else:
            value = value.lower()

        key = key.replace(' ', '_').lower()
        key = key.replace('<', '')
        key = re.sub(r'\s\s+', ' ', key.replace('\n', '')).strip()
        key = re.sub(r'[\(\)\/\-\s]', '_', key)
        player_stats[key] = value

    return player_stats

Well, we just saw the function that will further parse the keys and values, but now we have to extract those from the giant HTML responses we saved earlier. Each player has many statistics associated with them. These next patterns are quite a bit more advanced, so be sure to check out the Python 3 Regular Expression documentation. Here is a high-level overview of the patterns we are using now. (.*?) is a pattern that will match anything! It keeps matching until it runs into the next character in the matching expression. Most of the time we are using this pattern to match everything before the next closing div tag. (name|country|role) allows the match to succeed on any of the values you see between the OR (|) tokens, so it will match "name" or "country" and so on.
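To see the non-greedy behavior on a toy snippet (made-up HTML, not taken from the FIFA page), using the re module we already imported:

toy = '<div class="fi-p__name">Example Name</div><div class="fi-p__country">Example Country</div>'
re.findall(r'<div class="fi-p__(name|country|role)">(.*?)</div>', toy)
# [('name', 'Example Name'), ('country', 'Example Country')]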

RX_PLAYER_STAT = r'<div class="fi-p__profile-number__number">(.*?)</div>(.*?)</div>'
RX_PLAYER_BIO_1 = r'<div class="fi-p__(jerseyNum |name|country|role)">(.*?)</div>'
RX_PLAYER_BIO_2 = r'<div class="fi-p__profile.*?">(.*?)(?:<div class="fi-p__profile|span).*?>(.*?)</(?:div|span)>'

for player_key in ALL_PLAYERS:
    stats_raw = ALL_PLAYERS[player_key]['player_stats_raw']
    player_stats_raw = re.findall(RX_PLAYER_STAT, stats_raw, re.DOTALL)
    ALL_PLAYERS[str(player_key)].update(player_stats_from_raw(player_stats_raw))

    bio_raw = ALL_PLAYERS[player_key]['player_bio_raw']
    player_stats_raw = re.findall(RX_PLAYER_BIO_1, bio_raw, re.DOTALL)
    # Bio patterns capture (key, value); swap to the (value, key) order player_stats_from_raw expects
    player_stats_raw = [(b, a) for (a, b) in player_stats_raw]
    ALL_PLAYERS[str(player_key)].update(player_stats_from_raw(player_stats_raw))

    player_stats_raw = re.findall(RX_PLAYER_BIO_2, bio_raw, re.DOTALL)
    player_stats_raw = [(b, a) for (a, b) in player_stats_raw]
    ALL_PLAYERS[str(player_key)].update(player_stats_from_raw(player_stats_raw))

We have fetched and parsed; now check the resulting data saved within ALL_PLAYERS. Our last step is to store the player id and save the data as a CSV!

headers = []
for player_key in ALL_PLAYERS:
    ALL_PLAYERS[player_key]['id'] = player_key
    headers += ALL_PLAYERS[player_key].keys()

headers = sorted(set(headers))  # De-duplicate and give the columns a stable order
import csv

with open('world_cup.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=headers)
    writer.writeheader()
    writer.writerows(ALL_PLAYERS.values())




Women's World Cup dataset

You'll find world_cup.csv in the same location as your notebook. On review, it doesn't look too bad! We have numeric data where we expect it, most columns are filled, and names are populated. Now you should spot check the data against the web pages to ensure each statistic landed under the correct label.
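One way to spot check a few rows from the notebook (the name and height columns here are examples; your exact column names depend on the keys that were parsed, so check the CSV header first):

import csv

with open('world_cup.csv') as csvfile:
    for i, row in enumerate(csv.DictReader(csvfile)):
        # Compare a couple of fields against the live player pages
        print(row.get('id'), row.get('name'), row.get('height'))
        if i == 4:
            break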



Women's World Cup Player Statistics

We made it through an entire Python web scraping application. We used the browser's built-in DevTools to find the underlying links and data locations. We used basic Python libraries to build a process that let us iterate on each step separately. Fetching webpages - storing HTML - parsing values... and now, analyzing the data!

Quick Python Data Visualization

I couldn't resist taking a look at how the data stacks up on a variety of attributes. These plots use the basics offered within the Python Notebook environment. We will explore how to build these plots, and much more, throughout our blog series.




Women's World Cup Top 4 Teams




Women's World Cup Statistic Histograms




Women's World Cup Players By Role



552 rows, 64 columns
Women's World Cup Player Dataset

import pandas as pd
df = pd.read_csv('http://bit.ly/ByPyWC19')
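If you want to try a quick plot of your own on the df we just loaded, something along these lines works in the notebook (the role column name is an assumption based on the keys we parsed, so inspect df.columns first):

# Bar chart of players by role, plus histograms of every numeric column
df['role'].value_counts().plot(kind='bar')  # 'role' is assumed; check df.columns for your data
df.hist(figsize=(12, 10))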



Questions, Comments, Concerns?

Thanks for reading! If you've made it this far then you are probably interested in the material that we will be producing. We have an idea of what we believe will be most valuable to our readers, but hearing from you directly would be even better.

Send us an email at questions@beyondpython.com or reach out to us on Twitter @BeyondPython

If you have a topic that you are struggling with, a file that you can't seem to work with, or even a dataset that just seems impossible to wrangle, then please let us know. We want to provide you with useful and practical information so you can start using Python today.




Disclosures & Privacy
All Rights Reserved
© 2019 Beyond Python