Scraping a table from Wikipedia using Python
We are going to look at how to extract data from a table on a Wikipedia page. First, let's take a look at the website we want to scrape; we will use this page.
Step 1 – Import Modules
We will need the following modules to complete our web scraping: requests to download the page, BeautifulSoup (from bs4) to parse the HTML, and pandas to hold the extracted table as a DataFrame.
import requests
from bs4 import BeautifulSoup
import pandas as pd
Step 2 – Save the URL into a variable:
Simply copy and paste the URL into a variable; here we name the variable “URL”. This makes it easier to refer to the website in our code.
URL='https://en.wikipedia.org/wiki/List_of_largest_law_firms_by_profits_per_partner'
Step 3 – Create a function to extract the data from Wikipedia into a pandas DataFrame:
def get_table(URL, n=0):  # n selects which table to extract (0, the default, is the first table on the page)
    soup = BeautifulSoup(requests.get(URL).content, 'html.parser')
    table = soup.find_all("table")[n]
    # build one row per <tr>, with one entry per header (<th>) or data (<td>) cell
    return pd.DataFrame([[cell.text for cell in row.find_all(["th", "td"])]
                         for row in table.find_all("tr")])
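If you call get_table on its own, you can see why a formatting step is needed: the table's header row comes through as the first row of data, and the cell text typically keeps trailing newline characters. A minimal sketch to preview the raw result:
raw = get_table(URL)
print(raw.head())  # the header row shows up as row 0, and cells still contain trailing "\n"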
Step 4 – Create a function to format the resulting DataFrame
When the data is first extracted, the output is not formatted correctly, so we will write a function to clean up the table automatically:
def Format(df):
    # use the first row of the table as column names, stripping trailing whitespace/newlines
    df.columns = [x.rstrip() for x in df.iloc[0]]
    # drop the header row from the data and renumber the remaining rows from 0
    df.drop(0, inplace=True)
    df.reset_index(drop=True, inplace=True)
    return df
The table can now be extracted by using both of these functions together:
Format(get_table(URL))
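It is often useful to keep the result in a variable so you can inspect it or save it. For example (the CSV file name here is just a placeholder):
df = Format(get_table(URL))
print(df.columns)  # check that the header row was applied as column names
df.to_csv('law_firms.csv', index=False)  # hypothetical output file, named only for illustration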
This code will work on any table on any Wikipedia page; for example, let's look at this page. If we look at the Data section of this page, we see a table of economic data for the US. Let's extract the data from this table:
URL='https://en.wikipedia.org/wiki/Economy_of_the_United_States?msclkid=3d62c767ae4111ecb1c6599eb44d5376#Data'
Format(get_table(URL, 3))  # we pass 3 because the table we want is at index 3 (indexing starts at 0)
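If you are not sure which index the table you want has, one quick way is to count the tables on the page and print the first row of each; a minimal sketch (slicing to the first five cells is just to keep the output readable):
soup = BeautifulSoup(requests.get(URL).content, 'html.parser')
tables = soup.find_all("table")
print(len(tables))  # how many tables the page contains
for i, table in enumerate(tables):
    first_row = table.find("tr")
    if first_row is not None:
        # print each table's index next to its first few cell headings so you can spot the one you want
        print(i, [cell.text.strip() for cell in first_row.find_all(["th", "td"])][:5])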
Go try it yourself!