Web scraping is the process of using bots to extract content and data from a website.
Unlike screen scraping, which only copies pixels displayed onscreen, web scraping extracts underlying HTML code and, with it, data stored in a database. The scraper can then replicate entire website content elsewhere.
Web scraping is used in a variety of digital businesses that rely on data harvesting. Legitimate use cases include:
- Search engine bots crawling a site, analyzing its content and then ranking it.
- Price comparison sites deploying bots to auto-fetch prices and product descriptions from allied seller websites.
- Market research companies using scrapers to pull data from forums and social media (e.g., for sentiment analysis).
1. Making a Database From Scratch With Beautiful Soup
There are a number of different packages available for web scraping, and one of the most popular is Beautiful Soup. Beautiful Soup parses web content into a Python object and makes the DOM queryable element by element. Used in conjunction with a requests package, it makes web scraping very easy!
In the bash terminal or Anaconda Prompt, run:
conda install beautifulsoup4
# Standard imports
import pandas as pd
# For web scraping
import requests
import urllib.request
from bs4 import BeautifulSoup
# For performing regex operations
import re
# Data visualization
import seaborn as sns
import matplotlib.pyplot as plt
For this tutorial, we'll be scraping the function names and usage descriptions of the random module from the Python documentation at docs.python.org.
# Save the URL of the webpage we want to scrape to a variable
url = 'https://docs.python.org/3/library/random.html#module-random'
When web scraping, the first step is to pull the content of the page down into a Python (string) variable. For simpler web-scraping tasks you can do this with the requests package, which is what we'll use here (the standard library's urllib can do the same job). For more complex tasks (involving, e.g., webpages with lots of JavaScript or other elements that are rendered by the web browser) you may need something more advanced, like Selenium.
# Send a get request and assign the response to a variable
response = requests.get(url)
Let's take a look at what we have!
response
<Response [200]>
response.content
b'\n\n\n\n \n \n <title>random \xe2\x80\x94 Generate pseudo-random numbers — Python 3.9.2 documentation</title>\n \n \n \n <script id="documentation_options" data-url_root="../" src="../_static/documentation_options.js"></script>\n <script src="../_static/jquery.js"></script>\n <script src="../_static/underscore.js"></script>\n <script src="../_static/doctools.js"></script>\n <script src="../_static/language_data.js"></script>\n \n <script src="../_static/sidebar.js"></script>\n \n <link rel="search" type="application/opensearchdescription+xml"\n title="Search within Python 3.9.2 documentation"\n href="../_static/opensearch.xml"/>\n \n \n \n \n \n \n \n \n \n \n \n\n \n <style>\n @media only screen {\n table.full-width-table {\n width: 100%;\n }\n }\n </style>\n\n \n \n <script
That's a lot to look at, and it's pretty unreadable! This is where Beautiful Soup comes in: it parses the page content into a form that we can more easily query.
# Turn the raw content into a Beautiful Soup object and assign it to a variable
# (specifying the parser explicitly avoids a warning and keeps results consistent)
soup = BeautifulSoup(response.content, 'html.parser')
type(soup)
bs4.BeautifulSoup
Now let's take a look at this.
# Check soup variable
soup
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8"/>
<title>random — Generate pseudo-random numbers — Python 3.9.2 documentation</title>
<link href="../_static/pydoctheme.css" rel="stylesheet" type="text/css"/>
<link href="../_static/pygments.css" rel="stylesheet" type="text/css"/>
<script data-url_root="../" id="documentation_options" src="../_static/documentation_options.js"></script>
<script src="../_static/jquery.js"></script>
# Another way to load the HTML, using 'urllib.request.urlopen()'
#url = urllib.request.urlopen("https://docs.python.org/3/library/random.html#module-random")
#soup = BeautifulSoup(url)
#soup
Still very long, but a little easier to take in.
The real advantage of Beautiful Soup, however, is that it parses our webpage according to its structure and allows us to search for and extract elements within it. This is because it transforms the webpage from a string into a special Beautiful Soup object.
To extract HTML elements from our webpage, we can call the .find() method on our Beautiful Soup object. This method finds the first element that matches the criterion we pass in. The criterion may be an element id, a class, a tag name, or even a function. (For a full list of search options, see this page.)
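To make those search criteria concrete, here is a minimal sketch on a hypothetical HTML snippet (the tags and ids below are invented for illustration, not taken from the page we're scraping):

```python
from bs4 import BeautifulSoup

# A toy snippet (hypothetical HTML, for illustration only)
html = '''
<div id="intro" class="section"><p>Hello</p></div>
<div class="section"><p>World</p></div>
'''
soup = BeautifulSoup(html, 'html.parser')

first_div = soup.find('div')           # search by tag name
intro = soup.find(id='intro')          # search by id
section = soup.find(class_='section')  # search by class ('class' is a Python keyword, hence the underscore)
has_id = soup.find(lambda tag: tag.has_attr('id'))  # search with a function

print(intro.p.text)  # Hello
```

All four searches land on the same first div here, since it matches every criterion.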
But how do we know what element to search for? This is where your browser's Inspect or Inspect Element feature comes in handy. Simply right-click an object of interest on the web page and click Inspect (Chrome) or Inspect Element (Firefox). This shows you the corresponding place in the HTML code where the element appears. From there you should be able to find an id or class name that will let you locate the element with Beautiful Soup.
In this case, we want to target the dt element, as shown in the picture below. So it looks like we're looking for a dt element with id='random.___'. We can easily retrieve these with Beautiful Soup's .findAll method.
# Find all function names - the element we want, in this case, is 'dt'
names = soup.body.findAll('dt')
print(names)
[<dt id="random.seed">
<code class="sig-prename descclassname">random.</code><code class="sig-name descname">seed</code><span class="sig-paren">(</span><em class="sig-param">a=None</em>, <em class="sig-param">version=2</em><span class="sig-paren">)</span><a class="headerlink" href="#random.seed" title="Permalink to this definition">¶</a></dt>, <dt id="random.getstate">
<code class="sig-prename descclassname">random.</code><code class="sig-name descname">getstate</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#random.getstate" title="Permalink to this definition">¶</a></dt>, <dt id="random.setstate">
<code class="sig-prename descclassname">random.</code><code class="sig-name descname">setstate</code><span class="sig-paren">(</span><em class="sig-param">state</em><span class="sig-paren">)</span><a class="headerlink" href="#random.setstate" title="Permalink to this definition">¶</a></dt>, <dt id="random.randbytes">
There's still some work to do! This is where regex kicks in.
# Find all the information we're looking for with regex
# In this case, it's every string that starts with id="random.
function_names = re.findall(r'id="random\.\w+', str(names)) # '\w+' matches the rest of the function name
# Let's print the results
print(function_names)
['id="random.seed', 'id="random.getstate', 'id="random.setstate', 'id="random.randbytes', 'id="random.randrange', 'id="random.randint', 'id="random.getrandbits', 'id="random.choice', 'id="random.choices', 'id="random.shuffle', 'id="random.sample', 'id="random.random', 'id="random.uniform', 'id="random.triangular', 'id="random.betavariate', 'id="random.expovariate', 'id="random.gammavariate', 'id="random.gauss', 'id="random.lognormvariate', 'id="random.normalvariate', 'id="random.vonmisesvariate', 'id="random.paretovariate', 'id="random.weibullvariate', 'id="random.Random', 'id="random.SystemRandom']
We are almost there! We just need to remove the first four characters (the leading id=") from each string.
# Use a list comprehension to strip the first four characters from each value:
function_names = [item[4:] for item in function_names]
# Let's print the results
print(function_names)
['random.seed', 'random.getstate', 'random.setstate', 'random.randbytes', 'random.randrange', 'random.randint', 'random.getrandbits', 'random.choice', 'random.choices', 'random.shuffle', 'random.sample', 'random.random', 'random.uniform', 'random.triangular', 'random.betavariate', 'random.expovariate', 'random.gammavariate', 'random.gauss', 'random.lognormvariate', 'random.normalvariate', 'random.vonmisesvariate', 'random.paretovariate', 'random.weibullvariate', 'random.Random', 'random.SystemRandom']
Perfect! Now we need to do the same with the function descriptions. This time we target the description details with the dd tag:
# Find all the function description
description = soup.body.findAll('dd')
#print(description)
Wow, it looks very complicated! There are lots of tags here (e.g., <em> tags), and these unnecessary elements would take a long time to remove manually. Luckily, Beautiful Soup is not only beautiful, it's also smart. Let's look at the .text attribute:
# Create a list
function_usage = []
# Loop over the descriptions, keeping only the text
for item in description:
    item = item.text                # Extract the text of the element
    item = item.replace('\n', ' ')  # Get rid of the newline characters ('\n')
    function_usage.append(item)
#print(function_usage) # Don't get overwhelmed! These are just the descriptions for the function names above
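The loop above can also be written as a single list comprehension. A sketch on a toy snippet (hypothetical HTML standing in for the real description elements):

```python
from bs4 import BeautifulSoup

# Toy snippet (hypothetical HTML) with newlines inside the dd elements
html = '<dd>First line\nsecond line</dd><dd>Another\nentry</dd>'
toy = BeautifulSoup(html, 'html.parser')

# Same result as the loop above: extract the text and flatten the newlines
usage = [dd.text.replace('\n', ' ') for dd in toy.find_all('dd')]
print(usage)  # ['First line second line', 'Another entry']
```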
# Let's check the length of the function_names and function_usage
print(f' Length of function_names: {len(function_names)}')
print(f' Length of function_usage: {len(function_usage)}')
Length of function_names: 25
Length of function_usage: 25
# Create a dataframe since the length of both variables are equal!
data = pd.DataFrame( { 'function name': function_names,
'function usage' : function_usage } )
data
 | function name | function usage
---|---|---
... | ... | ...
21 | random.paretovariate | Pareto distribution. alpha is the shape param...
22 | random.weibullvariate | Weibull distribution. alpha is the scale para...
23 | random.Random | Class that implements the default pseudo-rando...
# Let's save the dataframe to a CSV file
data.to_csv('random_function.csv')
BONUS: if you want to target a specific attribute, for example id="bookkeeping-functions", you can use the following code:
# Target specific attributes
#example = soup.body.findAll('div', attrs={'id': 'bookkeeping-functions'})
#print(example) # you can get very specific results with BeautifulSoup
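Since the snippet above is commented out, here is a self-contained sketch of the same attrs search on a toy page (the HTML and id below are invented for illustration):

```python
from bs4 import BeautifulSoup

# Toy page (hypothetical) with a div we can target by id
html = '''
<div id="bookkeeping-functions"><p>seed, getstate, setstate</p></div>
<div id="other"><p>irrelevant</p></div>
'''
toy = BeautifulSoup(html, 'html.parser')

# attrs takes a dict of attribute -> value pairs to match against
matches = toy.findAll('div', attrs={'id': 'bookkeeping-functions'})
print(len(matches))       # 1
print(matches[0].p.text)  # seed, getstate, setstate
```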
2. Scraping HTML Tables With Pandas
Pandas is very useful! We can easily scrape tabular data using the pandas read_html() function for a data science project.
We will be web scraping NBA player stats data and performing a quick data exploration, using the website basketball-reference.com.
First, let's check out the specific URL we're going to scrape: the NBA player stats for the 2019-2020 season.
# Method 1: only 1 year
# URL of the player stats in 2020
url = 'https://www.basketball-reference.com/leagues/NBA_2020_per_game.html'
url
'https://www.basketball-reference.com/leagues/NBA_2020_per_game.html'
# Method 2: multiple years
years = ['2016', '2017', '2018', '2019', '2020']
url_template = 'https://www.basketball-reference.com/leagues/NBA_{}_per_game.html' # avoid shadowing the built-in str
for year in years:
    url = url_template.format(year)
    print(url)
https://www.basketball-reference.com/leagues/NBA_2016_per_game.html
https://www.basketball-reference.com/leagues/NBA_2017_per_game.html
https://www.basketball-reference.com/leagues/NBA_2018_per_game.html
https://www.basketball-reference.com/leagues/NBA_2019_per_game.html
https://www.basketball-reference.com/leagues/NBA_2020_per_game.html
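The same templating can collect the results into a {year: url} mapping in one pass, which is handy if you later want to look URLs up by year:

```python
# Build a {year: url} dict with a dict comprehension
years = ['2016', '2017', '2018', '2019', '2020']
url_template = 'https://www.basketball-reference.com/leagues/NBA_{}_per_game.html'

urls = {year: url_template.format(year) for year in years}
print(urls['2020'])  # https://www.basketball-reference.com/leagues/NBA_2020_per_game.html
```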
# Let's check the URL of the player stats in 2020
url = 'https://www.basketball-reference.com/leagues/NBA_2020_per_game.html'
# Using pd.read_html()
df = pd.read_html(url, header = 0)
print(df)
[ Rk Player Pos Age Tm G GS MP FG FGA ... \
0 1 Steven Adams C 26 OKC 63 63 26.7 4.5 7.6 ...
1 2 Bam Adebayo PF 22 MIA 72 72 33.6 6.1 11.0 ...
2 3 LaMarcus Aldridge C 34 SAS 53 53 33.1 7.4 15.0 ...
3 4 Kyle Alexander C 23 MIA 2 0 6.5 0.5 1.0 ...
4 5 Nickeil Alexander-Walker SG 21 NOP 47 1 12.6 2.1 5.7 ...
.. ... ... .. .. ... .. .. ... ... ... ...
672 525 Trae Young PG 21 ATL 60 60 35.3 9.1 20.8 ...
673 526 Cody Zeller C 27 CHO 58 39 23.1 4.3 8.3 ...
674 527 Tyler Zeller C 30 SAS 2 0 2.0 0.5 2.0 ...
675 528 Ante Žižić C 23 CLE 22 0 10.0 1.9 3.3 ...
676 529 Ivica Zubac C 22 LAC 72 70 18.4 3.3 5.3 ...
FT% ORB DRB TRB AST STL BLK TOV PF PTS
0 .582 3.3 6.0 9.3 2.3 0.8 1.1 1.5 1.9 10.9
1 .691 2.4 7.8 10.2 5.1 1.1 1.3 2.8 2.5 15.9
2 .827 1.9 5.5 7.4 2.4 0.7 1.6 1.4 2.4 18.9
3 NaN 1.0 0.5 1.5 0.0 0.0 0.0 0.5 0.5 1.0
4 .676 0.2 1.6 1.8 1.9 0.4 0.2 1.1 1.2 5.7
.. ... ... ... ... ... ... ... ... ... ...
672 .860 0.5 3.7 4.3 9.3 1.1 0.1 4.8 1.7 29.6
673 .682 2.8 4.3 7.1 1.5 0.7 0.4 1.3 2.4 11.1
674 NaN 1.5 0.5 2.0 0.0 0.0 0.0 0.0 0.0 1.0
675 .737 0.8 2.2 3.0 0.3 0.3 0.2 0.5 1.2 4.4
676 .747 2.7 4.8 7.5 1.1 0.2 0.9 0.8 2.3 8.3
[677 rows x 30 columns]]
It looks a little bit messy. What we actually have here is a list of DataFrames. We can tidy this up using pandas alone (without any additional libraries!)
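To see the list-of-DataFrames behavior without a network call, read_html can also parse a literal HTML string (wrapped in StringIO). A minimal sketch with invented data, assuming a parser backend such as lxml is installed:

```python
import pandas as pd
from io import StringIO

# A minimal HTML table (hypothetical data), parsed without any network call
html = '''
<table>
  <tr><th>Player</th><th>PTS</th></tr>
  <tr><td>Steven Adams</td><td>10.9</td></tr>
  <tr><td>Bam Adebayo</td><td>15.9</td></tr>
</table>
'''
tables = pd.read_html(StringIO(html))
print(len(tables))   # 1 -- read_html always returns a list of DataFrames
df_toy = tables[0]
print(df_toy.shape)  # (2, 2)
```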
# Check number of DataFrames in this list
print(f'number of tables in df: {len(df)}')
print('================')
# Since there is only 1, pull out the 0th element:
df[0].head(20)
number of tables in df: 1
================
Rk | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Steven Adams | C | 26 | OKC | 63 | 63 | 26.7 | 4.5 | 7.6 | ... | .582 | 3.3 | 6.0 | 9.3 | 2.3 | 0.8 | 1.1 | 1.5 | 1.9 | 10.9 |
1 | 2 | Bam Adebayo | PF | 22 | MIA | 72 | 72 | 33.6 | 6.1 | 11.0 | ... | .691 | 2.4 | 7.8 | 10.2 | 5.1 | 1.1 | 1.3 | 2.8 | 2.5 | 15.9 |
2 | 3 | LaMarcus Aldridge | C | 34 | SAS | 53 | 53 | 33.1 | 7.4 | 15.0 | ... | .827 | 1.9 | 5.5 | 7.4 | 2.4 | 0.7 | 1.6 | 1.4 | 2.4 | 18.9 |
3 | 4 | Kyle Alexander | C | 23 | MIA | 2 | 0 | 6.5 | 0.5 | 1.0 | ... | NaN | 1.0 | 0.5 | 1.5 | 0.0 | 0.0 | 0.0 | 0.5 | 0.5 | 1.0 |
4 | 5 | Nickeil Alexander-Walker | SG | 21 | NOP | 47 | 1 | 12.6 | 2.1 | 5.7 | ... | .676 | 0.2 | 1.6 | 1.8 | 1.9 | 0.4 | 0.2 | 1.1 | 1.2 | 5.7 |
5 | 6 | Grayson Allen | SG | 24 | MEM | 38 | 0 | 18.9 | 3.1 | 6.6 | ... | .867 | 0.2 | 2.0 | 2.2 | 1.4 | 0.3 | 0.1 | 0.9 | 1.4 | 8.7 |
6 | 7 | Jarrett Allen | C | 21 | BRK | 70 | 64 | 26.5 | 4.3 | 6.6 | ... | .633 | 3.1 | 6.5 | 9.6 | 1.6 | 0.6 | 1.3 | 1.1 | 2.3 | 11.1 |
7 | 8 | Kadeem Allen | PG | 27 | NYK | 10 | 0 | 11.7 | 1.9 | 4.4 | ... | .636 | 0.2 | 0.7 | 0.9 | 2.1 | 0.5 | 0.2 | 0.8 | 0.7 | 5.0 |
8 | 9 | Al-Farouq Aminu | PF | 29 | ORL | 18 | 2 | 21.1 | 1.4 | 4.8 | ... | .655 | 1.3 | 3.5 | 4.8 | 1.2 | 1.0 | 0.4 | 0.9 | 1.5 | 4.3 |
9 | 10 | Justin Anderson | SG | 26 | BRK | 10 | 1 | 10.7 | 1.0 | 3.8 | ... | .500 | 0.1 | 2.0 | 2.1 | 0.8 | 0.0 | 0.6 | 0.4 | 1.3 | 2.8 |
10 | 11 | Kyle Anderson | SF | 26 | MEM | 67 | 28 | 19.9 | 2.3 | 4.9 | ... | .667 | 0.9 | 3.4 | 4.3 | 2.4 | 0.8 | 0.6 | 1.0 | 1.7 | 5.8 |
11 | 12 | Ryan Anderson | C | 31 | HOU | 2 | 0 | 7.0 | 1.0 | 3.5 | ... | NaN | 0.0 | 3.5 | 3.5 | 1.0 | 0.5 | 0.0 | 0.5 | 0.5 | 2.5 |
12 | 13 | Giannis Antetokounmpo | PF | 25 | MIL | 63 | 63 | 30.4 | 10.9 | 19.7 | ... | .633 | 2.2 | 11.4 | 13.6 | 5.6 | 1.0 | 1.0 | 3.7 | 3.1 | 29.5 |
13 | 14 | Kostas Antetokounmpo | PF | 22 | LAL | 5 | 0 | 4.0 | 0.6 | 0.6 | ... | .500 | 0.4 | 0.2 | 0.6 | 0.4 | 0.0 | 0.0 | 0.2 | 0.4 | 1.4 |
14 | 15 | Thanasis Antetokounmpo | SF | 27 | MIL | 20 | 2 | 6.5 | 1.2 | 2.4 | ... | .412 | 0.6 | 0.6 | 1.2 | 0.8 | 0.4 | 0.1 | 0.6 | 0.9 | 2.8 |
15 | 16 | Carmelo Anthony | PF | 35 | POR | 58 | 58 | 32.8 | 5.8 | 13.5 | ... | .845 | 1.2 | 5.1 | 6.3 | 1.5 | 0.8 | 0.5 | 1.7 | 2.9 | 15.4 |
16 | 17 | OG Anunoby | SF | 22 | TOR | 69 | 68 | 29.9 | 4.1 | 8.2 | ... | .706 | 1.2 | 4.1 | 5.3 | 1.6 | 1.4 | 0.7 | 1.1 | 2.4 | 10.6 |
17 | 18 | Ryan Arcidiacono | PG | 25 | CHI | 58 | 4 | 16.0 | 1.6 | 3.8 | ... | .711 | 0.3 | 1.6 | 1.9 | 1.7 | 0.5 | 0.1 | 0.6 | 1.7 | 4.5 |
18 | 19 | Trevor Ariza | SF | 34 | TOT | 53 | 21 | 28.2 | 2.7 | 6.1 | ... | .838 | 0.6 | 4.0 | 4.6 | 1.7 | 1.3 | 0.3 | 1.1 | 2.1 | 8.0 |
19 | 19 | Trevor Ariza | SF | 34 | SAC | 32 | 0 | 24.7 | 2.0 | 5.2 | ... | .778 | 0.7 | 3.9 | 4.6 | 1.6 | 1.1 | 0.2 | 0.9 | 2.0 | 6.0 |
20 | 19 | Trevor Ariza | SF | 34 | POR | 21 | 21 | 33.4 | 3.7 | 7.6 | ... | .872 | 0.6 | 4.1 | 4.8 | 2.0 | 1.6 | 0.4 | 1.3 | 2.3 | 11.0 |
21 | 20 | D.J. Augustin | PG | 32 | ORL | 57 | 13 | 24.9 | 3.2 | 8.1 | ... | .890 | 0.4 | 1.8 | 2.1 | 4.6 | 0.6 | 0.0 | 1.5 | 1.3 | 10.5 |
22 | Rk | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
23 | 21 | Deandre Ayton | C | 21 | PHO | 38 | 32 | 32.5 | 8.2 | 14.9 | ... | .753 | 3.9 | 7.6 | 11.5 | 1.9 | 0.7 | 1.5 | 2.1 | 3.1 | 18.2 |
24 | 22 | Dwayne Bacon | SG | 24 | CHO | 39 | 11 | 17.6 | 2.2 | 6.3 | ... | .660 | 0.4 | 2.2 | 2.6 | 1.3 | 0.6 | 0.1 | 0.9 | 1.3 | 5.7 |
25 rows × 30 columns
Wow! You'll notice that there are some missing values (NaN) and multiple occurrences of some player names, because those players played for different teams in the same year.
We can also see on the website that the header row repeats every 20 players. We'll have to remove the repeated headers and keep only the first one:
# Assign the table to a variable df_2020
df_2020 = df[0]
# Let's find the table header rows, which appear multiple times in the dataframe
df_2020[df_2020.Age == 'Age'].head() # Selects the repeated header rows in this entire dataframe!
Rk | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
22 | Rk | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
53 | Rk | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
76 | Rk | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
101 | Rk | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
130 | Rk | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
5 rows × 30 columns
# Check how many redundant headers we have:
print(f' total numbers of redundant headers: {len(df_2020[df_2020.Age == "Age"])} ')
# Drop the redundant headers in the dataframe:
df_2020_new = df_2020.drop(df_2020[df_2020.Age == 'Age'].index)
# Compare before and after dropping redundant headers with numbers of rows:
print(f' total rows of df_2020: {df_2020.shape[0]} ')
print(f' total rows of df_2020_new: {df_2020_new.shape[0]} ')
print('===========================================')
df_2020_new.head(20)
total numbers of redundant headers: 26
total rows of df_2020: 677
total rows of df_2020_new: 651
===========================================
Rk | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Steven Adams | C | 26 | OKC | 63 | 63 | 26.7 | 4.5 | 7.6 | ... | .582 | 3.3 | 6.0 | 9.3 | 2.3 | 0.8 | 1.1 | 1.5 | 1.9 | 10.9 |
1 | 2 | Bam Adebayo | PF | 22 | MIA | 72 | 72 | 33.6 | 6.1 | 11.0 | ... | .691 | 2.4 | 7.8 | 10.2 | 5.1 | 1.1 | 1.3 | 2.8 | 2.5 | 15.9 |
2 | 3 | LaMarcus Aldridge | C | 34 | SAS | 53 | 53 | 33.1 | 7.4 | 15.0 | ... | .827 | 1.9 | 5.5 | 7.4 | 2.4 | 0.7 | 1.6 | 1.4 | 2.4 | 18.9 |
3 | 4 | Kyle Alexander | C | 23 | MIA | 2 | 0 | 6.5 | 0.5 | 1.0 | ... | NaN | 1.0 | 0.5 | 1.5 | 0.0 | 0.0 | 0.0 | 0.5 | 0.5 | 1.0 |
4 | 5 | Nickeil Alexander-Walker | SG | 21 | NOP | 47 | 1 | 12.6 | 2.1 | 5.7 | ... | .676 | 0.2 | 1.6 | 1.8 | 1.9 | 0.4 | 0.2 | 1.1 | 1.2 | 5.7 |
5 | 6 | Grayson Allen | SG | 24 | MEM | 38 | 0 | 18.9 | 3.1 | 6.6 | ... | .867 | 0.2 | 2.0 | 2.2 | 1.4 | 0.3 | 0.1 | 0.9 | 1.4 | 8.7 |
6 | 7 | Jarrett Allen | C | 21 | BRK | 70 | 64 | 26.5 | 4.3 | 6.6 | ... | .633 | 3.1 | 6.5 | 9.6 | 1.6 | 0.6 | 1.3 | 1.1 | 2.3 | 11.1 |
7 | 8 | Kadeem Allen | PG | 27 | NYK | 10 | 0 | 11.7 | 1.9 | 4.4 | ... | .636 | 0.2 | 0.7 | 0.9 | 2.1 | 0.5 | 0.2 | 0.8 | 0.7 | 5.0 |
8 | 9 | Al-Farouq Aminu | PF | 29 | ORL | 18 | 2 | 21.1 | 1.4 | 4.8 | ... | .655 | 1.3 | 3.5 | 4.8 | 1.2 | 1.0 | 0.4 | 0.9 | 1.5 | 4.3 |
9 | 10 | Justin Anderson | SG | 26 | BRK | 10 | 1 | 10.7 | 1.0 | 3.8 | ... | .500 | 0.1 | 2.0 | 2.1 | 0.8 | 0.0 | 0.6 | 0.4 | 1.3 | 2.8 |
10 | 11 | Kyle Anderson | SF | 26 | MEM | 67 | 28 | 19.9 | 2.3 | 4.9 | ... | .667 | 0.9 | 3.4 | 4.3 | 2.4 | 0.8 | 0.6 | 1.0 | 1.7 | 5.8 |
11 | 12 | Ryan Anderson | C | 31 | HOU | 2 | 0 | 7.0 | 1.0 | 3.5 | ... | NaN | 0.0 | 3.5 | 3.5 | 1.0 | 0.5 | 0.0 | 0.5 | 0.5 | 2.5 |
12 | 13 | Giannis Antetokounmpo | PF | 25 | MIL | 63 | 63 | 30.4 | 10.9 | 19.7 | ... | .633 | 2.2 | 11.4 | 13.6 | 5.6 | 1.0 | 1.0 | 3.7 | 3.1 | 29.5 |
13 | 14 | Kostas Antetokounmpo | PF | 22 | LAL | 5 | 0 | 4.0 | 0.6 | 0.6 | ... | .500 | 0.4 | 0.2 | 0.6 | 0.4 | 0.0 | 0.0 | 0.2 | 0.4 | 1.4 |
14 | 15 | Thanasis Antetokounmpo | SF | 27 | MIL | 20 | 2 | 6.5 | 1.2 | 2.4 | ... | .412 | 0.6 | 0.6 | 1.2 | 0.8 | 0.4 | 0.1 | 0.6 | 0.9 | 2.8 |
15 | 16 | Carmelo Anthony | PF | 35 | POR | 58 | 58 | 32.8 | 5.8 | 13.5 | ... | .845 | 1.2 | 5.1 | 6.3 | 1.5 | 0.8 | 0.5 | 1.7 | 2.9 | 15.4 |
16 | 17 | OG Anunoby | SF | 22 | TOR | 69 | 68 | 29.9 | 4.1 | 8.2 | ... | .706 | 1.2 | 4.1 | 5.3 | 1.6 | 1.4 | 0.7 | 1.1 | 2.4 | 10.6 |
17 | 18 | Ryan Arcidiacono | PG | 25 | CHI | 58 | 4 | 16.0 | 1.6 | 3.8 | ... | .711 | 0.3 | 1.6 | 1.9 | 1.7 | 0.5 | 0.1 | 0.6 | 1.7 | 4.5 |
18 | 19 | Trevor Ariza | SF | 34 | TOT | 53 | 21 | 28.2 | 2.7 | 6.1 | ... | .838 | 0.6 | 4.0 | 4.6 | 1.7 | 1.3 | 0.3 | 1.1 | 2.1 | 8.0 |
19 | 19 | Trevor Ariza | SF | 34 | SAC | 32 | 0 | 24.7 | 2.0 | 5.2 | ... | .778 | 0.7 | 3.9 | 4.6 | 1.6 | 1.1 | 0.2 | 0.9 | 2.0 | 6.0 |
20 rows × 30 columns
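The drop-by-boolean-index pattern above can be demonstrated on a toy frame (hypothetical data, not the real NBA table):

```python
import pandas as pd

# Toy frame (hypothetical data) with a repeated header row mixed into the rows
toy = pd.DataFrame({'Player': ['Steven Adams', 'Player', 'Bam Adebayo'],
                    'Age': ['26', 'Age', '22']})

# Rows where the Age column literally reads 'Age' are repeated headers -- drop them
cleaned = toy.drop(toy[toy.Age == 'Age'].index)
print(list(cleaned.Player))  # ['Steven Adams', 'Bam Adebayo']
```

The boolean mask selects the header rows, .index collects their row labels, and .drop removes them; the same three steps clean the full 677-row table.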
# Making a simple histogram
plt.figure(figsize=(10,8))
sns.distplot(df_2020_new.PTS.astype(float), # Convert points to numeric; read_html leaves the mixed column as strings
             kde=False, # Should be False because we want to retain the raw frequencies ("kde=True" would plot a probability density)
             hist_kws=dict(edgecolor='black', linewidth=2))
plt.title('HISTOGRAM OF PLAYER POINTS PER GAME IN THE 2020 NBA SEASON')
plt.ylabel('NUMBER OF PLAYERS')
plt.xlabel('POINTS PER GAME')
plt.show()
From the histogram, we can see:
- About 57 players averaged between 0 and 1 point per game.
- Fewer than 10 players averaged 30 or more points per game.