
Web Scraping with Beautiful Soup and Pandas

Web scraping is the process of using bots to extract content and data from a website.

Unlike screen scraping, which only copies pixels displayed onscreen, web scraping extracts underlying HTML code and, with it, data stored in a database. The scraper can then replicate entire website content elsewhere.

Web scraping is used in a variety of digital businesses that rely on data harvesting. Legitimate use cases include:

  • Search engine bots crawling a site, analyzing its content and then ranking it.
  • Price comparison sites deploying bots to auto-fetch prices and product descriptions from sellers' websites.
  • Market research companies using scrapers to pull data from forums and social media (e.g., for sentiment analysis).

Table of Contents

1. Making a Database From Scratch With Beautiful Soup

2. Web Scraping Using Pandas

I. Making a Database From Scratch With Beautiful Soup

There are a number of different packages available for web scraping, and one of the most popular is Beautiful Soup. Beautiful Soup parses web content into a Python object and makes the DOM queryable element by element. Used in conjunction with the requests package, it makes web scraping very easy!


Installation of Beautiful Soup (if you haven't done so already)

In the bash terminal or Anaconda Prompt, run:

conda install beautifulsoup4
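If you use pip instead of conda, the equivalent command is:

pip install beautifulsoup4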

# Standard imports
import pandas as pd

# For web scraping
import requests
import urllib.request
from bs4 import BeautifulSoup

# For performing regex operations
import re

# Data visualization
import seaborn as sns
import matplotlib.pyplot as plt

For this tutorial, we'll be scraping function names and usage descriptions from the random module page of the Python documentation at docs.python.org.

Scrape The Data

# Save the URL of the webpage we want to scrape to a variable
url = 'https://docs.python.org/3/library/random.html#module-random'

When web scraping, the first step is to pull the content of the page down into a Python (string) variable. For simpler web scraping tasks you can do this with the requests package, which is what we'll use here (the standard library's urllib.request works too, as shown below). For more complex tasks, such as webpages whose content is rendered by JavaScript in the browser, you may need something more advanced, like Selenium.

# Send a get request and assign the response to a variable
response = requests.get(url)

Let's take a look at what we have!

response
<Response [200]>
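A 200 status code means the request succeeded. In a script, you can guard against failed requests with:

# Raise an exception if the request came back with a 4xx/5xx status
response.raise_for_status()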
response.content

b'\n\n\n\n \n \n <title>random \xe2\x80\x94 Generate pseudo-random numbers — Python 3.9.2 documentation</title>\n \n \n \n <script id="documentation_options" data-url_root="../" src="../_static/documentation_options.js"></script>\n <script src="../_static/jquery.js"></script>\n <script src="../_static/underscore.js"></script>\n <script src="../_static/doctools.js"></script>\n <script src="../_static/language_data.js"></script>\n \n <script src="../_static/sidebar.js"></script>\n \n <link rel="search" type="application/opensearchdescription+xml"\n title="Search within Python 3.9.2 documentation"\n href="../_static/opensearch.xml"/>\n \n \n \n \n \n \n \n \n \n \n \n\n \n <style>\n @media only screen {\n table.full-width-table {\n width: 100%;\n }\n }\n </style>\n\n \n \n <script ...

That's a lot to look at, and it's pretty unreadable! This is where Beautiful Soup comes in: it parses the page content into a structured form that we can easily work with.

# Turn the undecoded content into a Beautiful Soup object and assign it to a variable
soup = BeautifulSoup(response.content, 'html.parser')   # naming the parser explicitly avoids a bs4 warning
type(soup)
bs4.BeautifulSoup

Now let's take a look at this.

# Check soup variable

soup
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8"/>
<title>random — Generate pseudo-random numbers — Python 3.9.2 documentation</title>
<link href="../_static/pydoctheme.css" rel="stylesheet" type="text/css"/>
<link href="../_static/pygments.css" rel="stylesheet" type="text/css"/>
<script data-url_root="../" id="documentation_options" src="../_static/documentation_options.js"></script>
<script src="../_static/jquery.js"></script>
# Another way to load the HTML, using urllib.request.urlopen():

#url = urllib.request.urlopen("https://docs.python.org/3/library/random.html#module-random")
#soup = BeautifulSoup(url)
#soup

Still very long, but a little easier to take in.

The real advantage of Beautiful Soup, however, is that it parses our webpage according to its structure and allows us to search for and extract elements within it. This is because it transforms the webpage from a string into a special Beautiful Soup object.

To extract HTML elements from our webpage, we can call the .find() method on our Beautiful Soup object. This method finds the first element that matches the criterion we pass in. The criterion may be an element id, class, tag name, or even a function. (For a full list of search options, see the Beautiful Soup documentation.)
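For instance, a couple of minimal illustrations using the soup object from above:

# Find the first dt element on the page
first_dt = soup.find('dt')

# Find the element whose id attribute is 'random.seed'
seed_def = soup.find(id='random.seed')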

But how do we know what element to search for? This is where your browser's Inspect or Inspect Element feature comes in handy. Simply right click on an object of interest on the web page and click Inspect on Chrome or Inspect Element on Firefox. This will then show you the corresponding place in the HTML code where the element appears. From there you should be able to find an id or class name that will allow you to locate the element with Beautiful Soup.

Inspecting the page shows that, in this case, we want to target the dt element.

So it looks like we're looking for dt elements with id="random.___". We can easily retrieve these with Beautiful Soup's .findAll() method.

# Find all function names - we specify the name of the element in this case is 'dt'

names = soup.body.findAll('dt')

print(names)
[<dt id="random.seed">
<code class="sig-prename descclassname">random.</code><code class="sig-name descname">seed</code><span class="sig-paren">(</span><em class="sig-param">a=None</em>, <em class="sig-param">version=2</em><span class="sig-paren">)</span><a class="headerlink" href="#random.seed" title="Permalink to this definition">¶</a></dt>, <dt id="random.getstate">
<code class="sig-prename descclassname">random.</code><code class="sig-name descname">getstate</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#random.getstate" title="Permalink to this definition">¶</a></dt>, <dt id="random.setstate">
<code class="sig-prename descclassname">random.</code><code class="sig-name descname">setstate</code><span class="sig-paren">(</span><em class="sig-param">state</em><span class="sig-paren">)</span><a class="headerlink" href="#random.setstate" title="Permalink to this definition">¶</a></dt>, <dt id="random.randbytes">

There's still some work to do! This is where regex kicks in.

# Find all the information we're looking for with regex
# In this case, that's every string that starts with id="random.

function_names = re.findall(r'id="random\.\w+', str(names))   # \w+ matches the function name that follows

# Let's print the results
print(function_names)
['id="random.seed', 'id="random.getstate', 'id="random.setstate', 'id="random.randbytes', 'id="random.randrange', 'id="random.randint', 'id="random.getrandbits', 'id="random.choice', 'id="random.choices', 'id="random.shuffle', 'id="random.sample', 'id="random.random', 'id="random.uniform', 'id="random.triangular', 'id="random.betavariate', 'id="random.expovariate', 'id="random.gammavariate', 'id="random.gauss', 'id="random.lognormvariate', 'id="random.normalvariate', 'id="random.vonmisesvariate', 'id="random.paretovariate', 'id="random.weibullvariate', 'id="random.Random', 'id="random.SystemRandom']

We are almost there! We just need to remove the first few characters from each string.

# Use a list comprehension to clean up our values:

function_names = [item[4:] for item in function_names]   # drop the leading 'id="' (4 characters)

# Let's print the results
print(function_names)
['random.seed', 'random.getstate', 'random.setstate', 'random.randbytes', 'random.randrange', 'random.randint', 'random.getrandbits', 'random.choice', 'random.choices', 'random.shuffle', 'random.sample', 'random.random', 'random.uniform', 'random.triangular', 'random.betavariate', 'random.expovariate', 'random.gammavariate', 'random.gauss', 'random.lognormvariate', 'random.normalvariate', 'random.vonmisesvariate', 'random.paretovariate', 'random.weibullvariate', 'random.Random', 'random.SystemRandom']
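As an aside: since each dt element carries its id as a tag attribute, you could also skip the regex entirely. A minimal equivalent, using the names list from above:

# Read the id attribute directly from each tag
function_names = [tag['id'] for tag in names if tag.has_attr('id')]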

Perfect! Now we need to do the same for the function descriptions, which live in dd (description details) tags.

# Find all the function descriptions

description = soup.body.findAll('dd')

#print(description)

Wow, it looks very complicated! There are lots of nested tags here (em tags and more), and getting rid of these unnecessary elements manually would take a long time.

Luckily, BeautifulSoup is not only beautiful, it's also smart. Let's look at the .text attribute:

# Create a list to hold the extracted text

function_usage = []

# Loop over the description elements

for item in description:
    item = item.text                  # Extract just the text, stripping out all the tags
    item = item.replace('\n', ' ')    # Replace the newline characters with spaces
    function_usage.append(item)

#print(function_usage)  # Don't get overwhelmed! These are just the descriptions for the function names above
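The same loop can also be written as a one-line list comprehension (equivalent to the loop above):

function_usage = [item.text.replace('\n', ' ') for item in description]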
# Let's check the lengths of function_names and function_usage

print(f' Length of function_names: {len(function_names)}')
print(f' Length of function_usage: {len(function_usage)}')
 Length of function_names: 25
 Length of function_usage: 25

Make A Database

# Create a dataframe, since the two lists have equal lengths!

data = pd.DataFrame({'function name': function_names,
                     'function usage': function_usage})

data
(output truncated; the last few rows of the dataframe:)

     function name           function usage
21   random.paretovariate    Pareto distribution.  alpha is the shape param...
22   random.weibullvariate   Weibull distribution.  alpha is the scale para...
23   random.Random           Class that implements the default pseudo-rando...
# Let's make a CSV file from the dataframe

data.to_csv('random_function.csv')
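Note that to_csv writes the dataframe's index as an extra first column by default; pass index=False to leave it out:

data.to_csv('random_function.csv', index=False)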

BONUS: if you want to target a specific attribute, for example id="bookkeeping-functions", you can use the following code:

# Target specific attributes

#example = soup.body.findAll('div', attrs={'id': 'bookkeeping-functions'})
#print(example)    # you can get a very specific result with BeautifulSoup

II. Web Scraping Using Pandas

Pandas is very useful! We can easily scrape tabular data for a data science project using nothing more than the pandas read_html() function.

We will scrape NBA player stats from basketball-reference.com and perform a quick data exploration.

Get The URL

First, let's look at the specific URL we are going to scrape: the NBA player stats for the 2019-2020 season.

# Method 1: only 1 year

# URL of the player stats in 2020

url = 'https://www.basketball-reference.com/leagues/NBA_2020_per_game.html'
url
'https://www.basketball-reference.com/leagues/NBA_2020_per_game.html'
# Method 2: multiple years

years = ['2016', '2017', '2018', '2019', '2020']
url_template = 'https://www.basketball-reference.com/leagues/NBA_{}_per_game.html'   # don't name this 'str': that would shadow the built-in

for year in years:
    url = url_template.format(year)
    print(url)
https://www.basketball-reference.com/leagues/NBA_2016_per_game.html
https://www.basketball-reference.com/leagues/NBA_2017_per_game.html
https://www.basketball-reference.com/leagues/NBA_2018_per_game.html
https://www.basketball-reference.com/leagues/NBA_2019_per_game.html
https://www.basketball-reference.com/leagues/NBA_2020_per_game.html
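If you wanted one table covering all five seasons, a minimal sketch (using the years list and url_template from above) could look like this:

# Fetch each season's table and stack them into a single dataframe
frames = []
for year in years:
    year_df = pd.read_html(url_template.format(year), header=0)[0]   # the stats table is the first table on the page
    year_df['Year'] = year                                           # record which season each row came from
    frames.append(year_df)

all_years = pd.concat(frames, ignore_index=True)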

Read The HTML Webpage Into Pandas

# Let's check the URL of the player stats in 2020

url = 'https://www.basketball-reference.com/leagues/NBA_2020_per_game.html'

# Using pd.read_html()

df = pd.read_html(url, header = 0)

print(df)
[      Rk                    Player Pos Age   Tm   G  GS    MP   FG   FGA  ...  \
0      1              Steven Adams   C  26  OKC  63  63  26.7  4.5   7.6  ...   
1      2               Bam Adebayo  PF  22  MIA  72  72  33.6  6.1  11.0  ...   
2      3         LaMarcus Aldridge   C  34  SAS  53  53  33.1  7.4  15.0  ...   
3      4            Kyle Alexander   C  23  MIA   2   0   6.5  0.5   1.0  ...   
4      5  Nickeil Alexander-Walker  SG  21  NOP  47   1  12.6  2.1   5.7  ...   
..   ...                       ...  ..  ..  ...  ..  ..   ...  ...   ...  ...   
672  525                Trae Young  PG  21  ATL  60  60  35.3  9.1  20.8  ...   
673  526               Cody Zeller   C  27  CHO  58  39  23.1  4.3   8.3  ...   
674  527              Tyler Zeller   C  30  SAS   2   0   2.0  0.5   2.0  ...   
675  528                Ante Žižić   C  23  CLE  22   0  10.0  1.9   3.3  ...   
676  529               Ivica Zubac   C  22  LAC  72  70  18.4  3.3   5.3  ...   

      FT%  ORB  DRB   TRB  AST  STL  BLK  TOV   PF   PTS  
0    .582  3.3  6.0   9.3  2.3  0.8  1.1  1.5  1.9  10.9  
1    .691  2.4  7.8  10.2  5.1  1.1  1.3  2.8  2.5  15.9  
2    .827  1.9  5.5   7.4  2.4  0.7  1.6  1.4  2.4  18.9  
3     NaN  1.0  0.5   1.5  0.0  0.0  0.0  0.5  0.5   1.0  
4    .676  0.2  1.6   1.8  1.9  0.4  0.2  1.1  1.2   5.7  
..    ...  ...  ...   ...  ...  ...  ...  ...  ...   ...  
672  .860  0.5  3.7   4.3  9.3  1.1  0.1  4.8  1.7  29.6  
673  .682  2.8  4.3   7.1  1.5  0.7  0.4  1.3  2.4  11.1  
674   NaN  1.5  0.5   2.0  0.0  0.0  0.0  0.0  0.0   1.0  
675  .737  0.8  2.2   3.0  0.3  0.3  0.2  0.5  1.2   4.4  
676  .747  2.7  4.8   7.5  1.1  0.2  0.9  0.8  2.3   8.3  

[677 rows x 30 columns]]

It looks a little bit messy. What we actually have here is a list of DataFrames. We can tidy this up using pandas alone (no additional libraries needed!).

# Check number of DataFrames in this list

print(f'number of tables in df: {len(df)}') 

print('================')

# Since there is only 1, pull out the 0th element:
df[0].head(20)
number of tables in df: 1
================
Rk Player Pos Age Tm G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
0 1 Steven Adams C 26 OKC 63 63 26.7 4.5 7.6 ... .582 3.3 6.0 9.3 2.3 0.8 1.1 1.5 1.9 10.9
1 2 Bam Adebayo PF 22 MIA 72 72 33.6 6.1 11.0 ... .691 2.4 7.8 10.2 5.1 1.1 1.3 2.8 2.5 15.9
2 3 LaMarcus Aldridge C 34 SAS 53 53 33.1 7.4 15.0 ... .827 1.9 5.5 7.4 2.4 0.7 1.6 1.4 2.4 18.9
3 4 Kyle Alexander C 23 MIA 2 0 6.5 0.5 1.0 ... NaN 1.0 0.5 1.5 0.0 0.0 0.0 0.5 0.5 1.0
4 5 Nickeil Alexander-Walker SG 21 NOP 47 1 12.6 2.1 5.7 ... .676 0.2 1.6 1.8 1.9 0.4 0.2 1.1 1.2 5.7
5 6 Grayson Allen SG 24 MEM 38 0 18.9 3.1 6.6 ... .867 0.2 2.0 2.2 1.4 0.3 0.1 0.9 1.4 8.7
6 7 Jarrett Allen C 21 BRK 70 64 26.5 4.3 6.6 ... .633 3.1 6.5 9.6 1.6 0.6 1.3 1.1 2.3 11.1
7 8 Kadeem Allen PG 27 NYK 10 0 11.7 1.9 4.4 ... .636 0.2 0.7 0.9 2.1 0.5 0.2 0.8 0.7 5.0
8 9 Al-Farouq Aminu PF 29 ORL 18 2 21.1 1.4 4.8 ... .655 1.3 3.5 4.8 1.2 1.0 0.4 0.9 1.5 4.3
9 10 Justin Anderson SG 26 BRK 10 1 10.7 1.0 3.8 ... .500 0.1 2.0 2.1 0.8 0.0 0.6 0.4 1.3 2.8
10 11 Kyle Anderson SF 26 MEM 67 28 19.9 2.3 4.9 ... .667 0.9 3.4 4.3 2.4 0.8 0.6 1.0 1.7 5.8
11 12 Ryan Anderson C 31 HOU 2 0 7.0 1.0 3.5 ... NaN 0.0 3.5 3.5 1.0 0.5 0.0 0.5 0.5 2.5
12 13 Giannis Antetokounmpo PF 25 MIL 63 63 30.4 10.9 19.7 ... .633 2.2 11.4 13.6 5.6 1.0 1.0 3.7 3.1 29.5
13 14 Kostas Antetokounmpo PF 22 LAL 5 0 4.0 0.6 0.6 ... .500 0.4 0.2 0.6 0.4 0.0 0.0 0.2 0.4 1.4
14 15 Thanasis Antetokounmpo SF 27 MIL 20 2 6.5 1.2 2.4 ... .412 0.6 0.6 1.2 0.8 0.4 0.1 0.6 0.9 2.8
15 16 Carmelo Anthony PF 35 POR 58 58 32.8 5.8 13.5 ... .845 1.2 5.1 6.3 1.5 0.8 0.5 1.7 2.9 15.4
16 17 OG Anunoby SF 22 TOR 69 68 29.9 4.1 8.2 ... .706 1.2 4.1 5.3 1.6 1.4 0.7 1.1 2.4 10.6
17 18 Ryan Arcidiacono PG 25 CHI 58 4 16.0 1.6 3.8 ... .711 0.3 1.6 1.9 1.7 0.5 0.1 0.6 1.7 4.5
18 19 Trevor Ariza SF 34 TOT 53 21 28.2 2.7 6.1 ... .838 0.6 4.0 4.6 1.7 1.3 0.3 1.1 2.1 8.0
19 19 Trevor Ariza SF 34 SAC 32 0 24.7 2.0 5.2 ... .778 0.7 3.9 4.6 1.6 1.1 0.2 0.9 2.0 6.0
20 19 Trevor Ariza SF 34 POR 21 21 33.4 3.7 7.6 ... .872 0.6 4.1 4.8 2.0 1.6 0.4 1.3 2.3 11.0
21 20 D.J. Augustin PG 32 ORL 57 13 24.9 3.2 8.1 ... .890 0.4 1.8 2.1 4.6 0.6 0.0 1.5 1.3 10.5
22 Rk Player Pos Age Tm G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
23 21 Deandre Ayton C 21 PHO 38 32 32.5 8.2 14.9 ... .753 3.9 7.6 11.5 1.9 0.7 1.5 2.1 3.1 18.2
24 22 Dwayne Bacon SG 24 CHO 39 11 17.6 2.2 6.3 ... .660 0.4 2.2 2.6 1.3 0.6 0.1 0.9 1.3 5.7

25 rows × 30 columns

Wow! You'll notice that there are some missing values (NaN) and multiple occurrences of some player names, because those players were part of different teams in the same season.

Data Cleaning

We can see on the website that the header row repeats after every 20 players. We'll have to remove the repeated headers and keep only the first one:

# Assign the table to a variable df_2020

df_2020 = df[0]

# Let's check the table header, which appears multiple times as rows in the data

df_2020[df_2020.Age == 'Age'].head()   # all the repeated table headers in this dataframe are selected!
Rk Player Pos Age Tm G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
22 Rk Player Pos Age Tm G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
53 Rk Player Pos Age Tm G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
76 Rk Player Pos Age Tm G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
101 Rk Player Pos Age Tm G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
130 Rk Player Pos Age Tm G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS

5 rows × 30 columns

# Count how many redundant header rows we have:

print(f' total number of redundant headers: {len(df_2020[df_2020.Age == "Age"])} ')

# Drop the redundant headers from the dataframe:
df_2020_new = df_2020.drop(df_2020[df_2020.Age == 'Age'].index)

# Compare the number of rows before and after dropping the redundant headers:

print(f' total rows of df_2020:     {df_2020.shape[0]} ')
print(f' total rows of df_2020_new: {df_2020_new.shape[0]} ')
print('===========================================')

df_2020_new.head(20)
 total number of redundant headers: 26 
 total rows of df_2020:     677 
 total rows of df_2020_new: 651 
===========================================
Rk Player Pos Age Tm G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
0 1 Steven Adams C 26 OKC 63 63 26.7 4.5 7.6 ... .582 3.3 6.0 9.3 2.3 0.8 1.1 1.5 1.9 10.9
1 2 Bam Adebayo PF 22 MIA 72 72 33.6 6.1 11.0 ... .691 2.4 7.8 10.2 5.1 1.1 1.3 2.8 2.5 15.9
2 3 LaMarcus Aldridge C 34 SAS 53 53 33.1 7.4 15.0 ... .827 1.9 5.5 7.4 2.4 0.7 1.6 1.4 2.4 18.9
3 4 Kyle Alexander C 23 MIA 2 0 6.5 0.5 1.0 ... NaN 1.0 0.5 1.5 0.0 0.0 0.0 0.5 0.5 1.0
4 5 Nickeil Alexander-Walker SG 21 NOP 47 1 12.6 2.1 5.7 ... .676 0.2 1.6 1.8 1.9 0.4 0.2 1.1 1.2 5.7
5 6 Grayson Allen SG 24 MEM 38 0 18.9 3.1 6.6 ... .867 0.2 2.0 2.2 1.4 0.3 0.1 0.9 1.4 8.7
6 7 Jarrett Allen C 21 BRK 70 64 26.5 4.3 6.6 ... .633 3.1 6.5 9.6 1.6 0.6 1.3 1.1 2.3 11.1
7 8 Kadeem Allen PG 27 NYK 10 0 11.7 1.9 4.4 ... .636 0.2 0.7 0.9 2.1 0.5 0.2 0.8 0.7 5.0
8 9 Al-Farouq Aminu PF 29 ORL 18 2 21.1 1.4 4.8 ... .655 1.3 3.5 4.8 1.2 1.0 0.4 0.9 1.5 4.3
9 10 Justin Anderson SG 26 BRK 10 1 10.7 1.0 3.8 ... .500 0.1 2.0 2.1 0.8 0.0 0.6 0.4 1.3 2.8
10 11 Kyle Anderson SF 26 MEM 67 28 19.9 2.3 4.9 ... .667 0.9 3.4 4.3 2.4 0.8 0.6 1.0 1.7 5.8
11 12 Ryan Anderson C 31 HOU 2 0 7.0 1.0 3.5 ... NaN 0.0 3.5 3.5 1.0 0.5 0.0 0.5 0.5 2.5
12 13 Giannis Antetokounmpo PF 25 MIL 63 63 30.4 10.9 19.7 ... .633 2.2 11.4 13.6 5.6 1.0 1.0 3.7 3.1 29.5
13 14 Kostas Antetokounmpo PF 22 LAL 5 0 4.0 0.6 0.6 ... .500 0.4 0.2 0.6 0.4 0.0 0.0 0.2 0.4 1.4
14 15 Thanasis Antetokounmpo SF 27 MIL 20 2 6.5 1.2 2.4 ... .412 0.6 0.6 1.2 0.8 0.4 0.1 0.6 0.9 2.8
15 16 Carmelo Anthony PF 35 POR 58 58 32.8 5.8 13.5 ... .845 1.2 5.1 6.3 1.5 0.8 0.5 1.7 2.9 15.4
16 17 OG Anunoby SF 22 TOR 69 68 29.9 4.1 8.2 ... .706 1.2 4.1 5.3 1.6 1.4 0.7 1.1 2.4 10.6
17 18 Ryan Arcidiacono PG 25 CHI 58 4 16.0 1.6 3.8 ... .711 0.3 1.6 1.9 1.7 0.5 0.1 0.6 1.7 4.5
18 19 Trevor Ariza SF 34 TOT 53 21 28.2 2.7 6.1 ... .838 0.6 4.0 4.6 1.7 1.3 0.3 1.1 2.1 8.0
19 19 Trevor Ariza SF 34 SAC 32 0 24.7 2.0 5.2 ... .778 0.7 3.9 4.6 1.6 1.1 0.2 0.9 2.0 6.0

20 rows × 30 columns
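One caveat before plotting: because the repeated header rows were mixed in with the data, read_html stored every column as strings (object dtype). Here is a minimal sketch for converting the stat columns to numbers; errors='coerce' turns any leftover non-numeric strings into NaN:

# Convert all columns except the text ones to numeric dtypes
text_cols = ['Player', 'Pos', 'Tm']
for col in df_2020_new.columns:
    if col not in text_cols:
        df_2020_new[col] = pd.to_numeric(df_2020_new[col], errors='coerce')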

Quick Exploratory Data Analysis

# Making a simple histogram

plt.figure(figsize=(10,8))

sns.distplot(df_2020_new.PTS.astype(float),   # PTS comes back from read_html as strings, so convert it first
            kde=False,            # Keep raw counts on the y-axis (kde=True would show a density estimate instead)
            hist_kws=dict(edgecolor='black', linewidth=2))

plt.title('HISTOGRAM OF PLAYER POINTS PER GAME IN THE 2020 NBA SEASON')
plt.ylabel('NUMBER OF PLAYERS')
plt.xlabel('POINTS PER GAME')
plt.show()
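Note: distplot is deprecated in recent versions of seaborn; a rough modern equivalent of the call above is:

sns.histplot(df_2020_new.PTS.astype(float), edgecolor='black', linewidth=2)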

[Histogram of player points per game in the 2020 NBA season]

From the histogram, we can see:

  • About 57 players averaged between 0 and 1 points per game.
  • Fewer than 10 players averaged 30 points per game.
