Web scraping is the process of using bots to extract content and data from a website.
Unlike screen scraping, which only copies pixels displayed onscreen, web scraping extracts underlying HTML code and, with it, data stored in a database. The scraper can then replicate entire website content elsewhere.
Web scraping is used in a variety of digital businesses that rely on data harvesting. Legitimate use cases include:
- Search engine bots crawling a site, analyzing its content and then ranking it.
- Price comparison sites deploying bots to auto-fetch prices and product descriptions from allied seller websites.
- Market research companies using scrapers to pull data from forums and social media (e.g., for sentiment analysis).
1. Making a Database From Scratch With Beautiful Soup
There are a number of different packages available for web scraping, and one of the most popular is Beautiful Soup. Beautiful Soup parses web content into a Python object and makes the DOM queryable element by element. Used in conjunction with a requests package, it makes web scraping very easy!
In the bash terminal or Anaconda Prompt, run:
conda install beautifulsoup4
# Standard imports
import pandas as pd
# For web scraping
import requests
import urllib.request
from bs4 import BeautifulSoup
# For performing regex operations
import re
# Data visualization
import seaborn as sns
import matplotlib.pyplot as plt
For this tutorial, we'll be scraping the function names and usage descriptions of the random module from the Python documentation at docs.python.org.
# Save the URL of the webpage we want to scrape to a variable
url = 'https://docs.python.org/3/library/random.html#module-random'
When web scraping, the first step is to pull the content of the page down into a Python (string) variable. For simpler web-scraping tasks you can do this with the requests package, which is what we'll use here (the standard library's urllib can do the same job). For more complex tasks (involving, e.g., webpages with lots of JavaScript or other elements that are rendered by the web browser) you may need something more advanced, like Selenium.
# Send a get request and assign the response to a variable
response = requests.get(url)
Let's take a look at what we have!
response
<Response [200]>
response.content
b'\n\n\n\n \n \n <title>random \xe2\x80\x94 Generate pseudo-random numbers — Python 3.9.2 documentation</title>\n \n \n \n <script id="documentation_options" data-url_root="../" src="../_static/documentation_options.js"></script>\n <script src="../_static/jquery.js"></script>\n <script src="../_static/underscore.js"></script>\n <script src="../_static/doctools.js"></script>\n <script src="../_static/language_data.js"></script>\n \n <script src="../_static/sidebar.js"></script>\n \n <link rel="search" type="application/opensearchdescription+xml"\n title="Search within Python 3.9.2 documentation"\n href="../_static/opensearch.xml"/>\n \n \n \n \n \n \n \n \n \n \n \n\n \n <style>\n @media only screen {\n table.full-width-table {\n width: 100%;\n }\n }\n </style>\n\n \n \n <script
That's a lot to look at, and it's pretty unreadable! This is where Beautiful Soup comes in: it parses the page content into a form that we can more easily query.
# Turn the raw content into a Beautiful Soup object and assign it to a variable
# (specifying the parser explicitly avoids a warning and keeps results consistent)
soup = BeautifulSoup(response.content, 'html.parser')
type(soup)
bs4.BeautifulSoup
Now let's take a look at this.
# Check soup variable
soup
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8"/>
<title>random — Generate pseudo-random numbers — Python 3.9.2 documentation</title>
<link href="../_static/pydoctheme.css" rel="stylesheet" type="text/css"/>
<link href="../_static/pygments.css" rel="stylesheet" type="text/css"/>
<script data-url_root="../" id="documentation_options" src="../_static/documentation_options.js"></script>
<script src="../_static/jquery.js"></script>
# Another way to load the HTML, using 'urllib.request.urlopen()'
#url = urllib.request.urlopen("https://docs.python.org/3/library/random.html#module-random")
#soup = BeautifulSoup(url)
#soup
Still very long, but a little easier to take in.
The real advantage of Beautiful Soup, however, is that it parses our webpage according to its structure and allows us to search for and extract elements within it. This is because it transforms the webpage from a string into a special Beautiful Soup object.
To extract HTML elements from our webpage, we can call the .find() method on our Beautiful Soup object. This method finds the first element that matches the criterion we pass in. The criterion may be an element id, a class, a tag name, or even a function. (For a full list of search options, see this page.)
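To make those search criteria concrete, here is a minimal sketch on a hypothetical HTML snippet (the tags and ids below are invented for illustration, not taken from the page we're scraping):

```python
from bs4 import BeautifulSoup

# A toy snippet (hypothetical HTML, for illustration only)
html = '''
<div id="intro" class="section"><p>Hello</p></div>
<div class="section"><p>World</p></div>
'''
soup = BeautifulSoup(html, 'html.parser')

first_div = soup.find('div')           # search by tag name
intro = soup.find(id='intro')          # search by id
section = soup.find(class_='section')  # search by class ('class' is a Python keyword, hence the underscore)
has_id = soup.find(lambda tag: tag.has_attr('id'))  # search with a function

print(intro.p.text)  # Hello
```

All four searches land on the same first div here, since it matches every criterion.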
But how do we know what element to search for? This is where your browser's Inspect or Inspect Element feature comes in handy. Simply right-click an object of interest on the web page and click Inspect (Chrome) or Inspect Element (Firefox). This shows you the corresponding place in the HTML code where the element appears. From there you should be able to find an id or class name that will let you locate the element with Beautiful Soup.
In this case, we want to target the dt element, as shown in the picture below. So it looks like we're looking for a dt element with id='random.___'. We can easily retrieve these with Beautiful Soup's .findAll method.
# Find all function names - the element we want, in this case, is 'dt'
names = soup.body.findAll('dt')
print(names)
[<dt id="random.seed">
<code class="sig-prename descclassname">random.</code><code class="sig-name descname">seed</code><span class="sig-paren">(</span><em class="sig-param">a=None</em>, <em class="sig-param">version=2</em><span class="sig-paren">)</span><a class="headerlink" href="#random.seed" title="Permalink to this definition">¶</a></dt>, <dt id="random.getstate">
<code class="sig-prename descclassname">random.</code><code class="sig-name descname">getstate</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#random.getstate" title="Permalink to this definition">¶</a></dt>, <dt id="random.setstate">
<code class="sig-prename descclassname">random.</code><code class="sig-name descname">setstate</code><span class="sig-paren">(</span><em class="sig-param">state</em><span class="sig-paren">)</span><a class="headerlink" href="#random.setstate" title="Permalink to this definition">¶</a></dt>, <dt id="random.randbytes">
There's still some work to do! This is where regex kicks in.
# Find all the information we're looking for with regex
# In this case, it's every string that starts with id="random.
function_names = re.findall(r'id="random\.\w+', str(names)) # '\w+' matches the rest of the function name
# Let's print the results
print(function_names)
['id="random.seed', 'id="random.getstate', 'id="random.setstate', 'id="random.randbytes', 'id="random.randrange', 'id="random.randint', 'id="random.getrandbits', 'id="random.choice', 'id="random.choices', 'id="random.shuffle', 'id="random.sample', 'id="random.random', 'id="random.uniform', 'id="random.triangular', 'id="random.betavariate', 'id="random.expovariate', 'id="random.gammavariate', 'id="random.gauss', 'id="random.lognormvariate', 'id="random.normalvariate', 'id="random.vonmisesvariate', 'id="random.paretovariate', 'id="random.weibullvariate', 'id="random.Random', 'id="random.SystemRandom']
We are almost there! We just need to remove the first four characters (the leading id=") from each string.
# Use a list comprehension to strip the first four characters from each value:
function_names = [item[4:] for item in function_names]
# Let's print the results
print(function_names)
['random.seed', 'random.getstate', 'random.setstate', 'random.randbytes', 'random.randrange', 'random.randint', 'random.getrandbits', 'random.choice', 'random.choices', 'random.shuffle', 'random.sample', 'random.random', 'random.uniform', 'random.triangular', 'random.betavariate', 'random.expovariate', 'random.gammavariate', 'random.gauss', 'random.lognormvariate', 'random.normalvariate', 'random.vonmisesvariate', 'random.paretovariate', 'random.weibullvariate', 'random.Random', 'random.SystemRandom']
Perfect! Now we need to do the same with the function descriptions. This time we target the description details with the dd tag:
# Find all the function description
description = soup.body.findAll('dd')
#print(description)
Wow, it looks very complicated! There are lots of tags here (e.g., <em> tags), and these unnecessary elements would take a long time to remove manually. Luckily, Beautiful Soup is not only beautiful, it's also smart. Let's look at the .text attribute:
# Create a list
function_usage = []
# Loop over the descriptions, keeping only the text
for item in description:
    item = item.text                # Extract the text of the element
    item = item.replace('\n', ' ')  # Get rid of the newline characters ('\n')
    function_usage.append(item)
#print(function_usage) # Don't get overwhelmed! These are just the descriptions for the function names above
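The loop above can also be written as a single list comprehension. A sketch on a toy snippet (hypothetical HTML standing in for the real description elements):

```python
from bs4 import BeautifulSoup

# Toy snippet (hypothetical HTML) with newlines inside the dd elements
html = '<dd>First line\nsecond line</dd><dd>Another\nentry</dd>'
toy = BeautifulSoup(html, 'html.parser')

# Same result as the loop above: extract the text and flatten the newlines
usage = [dd.text.replace('\n', ' ') for dd in toy.find_all('dd')]
print(usage)  # ['First line second line', 'Another entry']
```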
# Let's check the length of the function_names and function_usage
print(f' Length of function_names: {len(function_names)}')
print(f' Length of function_usage: {len(function_usage)}')
Length of function_names: 25
Length of function_usage: 25
# Create a dataframe since the length of both variables are equal!
data = pd.DataFrame( { 'function name': function_names,
'function usage' : function_usage } )
data
 | function name | function usage
---|---|---
... | ... | ...
21 | random.paretovariate | Pareto distribution. alpha is the shape param...
22 | random.weibullvariate | Weibull distribution. alpha is the scale para...
23 | random.Random | Class that implements the default pseudo-rando...
# Let's save the dataframe to a CSV file
data.to_csv('random_function.csv')
BONUS: if you want to target a specific attribute, for example id="bookkeeping-functions", you can use the following code:
# Target specific attributes
#example = soup.body.findAll('div', attrs={'id': 'bookkeeping-functions'})
#print(example) # you can get very specific results with BeautifulSoup
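Since the snippet above is commented out, here is a self-contained sketch of the same attrs search on a toy page (the HTML and id below are invented for illustration):

```python
from bs4 import BeautifulSoup

# Toy page (hypothetical) with a div we can target by id
html = '''
<div id="bookkeeping-functions"><p>seed, getstate, setstate</p></div>
<div id="other"><p>irrelevant</p></div>
'''
toy = BeautifulSoup(html, 'html.parser')

# attrs takes a dict of attribute -> value pairs to match against
matches = toy.findAll('div', attrs={'id': 'bookkeeping-functions'})
print(len(matches))       # 1
print(matches[0].p.text)  # seed, getstate, setstate
```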
2. Scraping HTML Tables With Pandas
Pandas is very useful! We can easily scrape tabular data using the pandas read_html() function for a data science project.
We will be web scraping NBA player stats data and performing a quick data exploration, using the website basketball-reference.com.
First, let's check out the specific URL we're going to scrape: the NBA player stats for the 2019-2020 season.
# Method 1: only 1 year
# URL of the player stats in 2020
url = 'https://www.basketball-reference.com/leagues/NBA_2020_per_game.html'
url
'https://www.basketball-reference.com/leagues/NBA_2020_per_game.html'
# Method 2: multiple years
years = ['2016', '2017', '2018', '2019', '2020']
url_template = 'https://www.basketball-reference.com/leagues/NBA_{}_per_game.html' # avoid shadowing the built-in str
for year in years:
    url = url_template.format(year)
    print(url)
https://www.basketball-reference.com/leagues/NBA_2016_per_game.html
https://www.basketball-reference.com/leagues/NBA_2017_per_game.html
https://www.basketball-reference.com/leagues/NBA_2018_per_game.html
https://www.basketball-reference.com/leagues/NBA_2019_per_game.html
https://www.basketball-reference.com/leagues/NBA_2020_per_game.html
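The same templating can collect the results into a {year: url} mapping in one pass, which is handy if you later want to look URLs up by year:

```python
# Build a {year: url} dict with a dict comprehension
years = ['2016', '2017', '2018', '2019', '2020']
url_template = 'https://www.basketball-reference.com/leagues/NBA_{}_per_game.html'

urls = {year: url_template.format(year) for year in years}
print(urls['2020'])  # https://www.basketball-reference.com/leagues/NBA_2020_per_game.html
```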
# Let's check the URL of the player stats in 2020
url = 'https://www.basketball-reference.com/leagues/NBA_2020_per_game.html'
# Using pd.read_html()
df = pd.read_html(url, header = 0)
print(df)
[ Rk Player Pos Age Tm G GS MP FG FGA ... \
0 1 Steven Adams C 26 OKC 63 63 26.7 4.5 7.6 ...
1 2 Bam Adebayo PF 22 MIA 72 72 33.6 6.1 11.0 ...
2 3 LaMarcus Aldridge C 34 SAS 53 53 33.1 7.4 15.0 ...
3 4 Kyle Alexander C 23 MIA 2 0 6.5 0.5 1.0 ...
4 5 Nickeil Alexander-Walker SG 21 NOP 47 1 12.6 2.1 5.7 ...
.. ... ... .. .. ... .. .. ... ... ... ...
672 525 Trae Young PG 21 ATL 60 60 35.3 9.1 20.8 ...
673 526 Cody Zeller C 27 CHO 58 39 23.1 4.3 8.3 ...
674 527 Tyler Zeller C 30 SAS 2 0 2.0 0.5 2.0 ...
675 528 Ante Žižić C 23 CLE 22 0 10.0 1.9 3.3 ...
676 529 Ivica Zubac C 22 LAC 72 70 18.4 3.3 5.3 ...
FT% ORB DRB TRB AST STL BLK TOV PF PTS
0 .582 3.3 6.0 9.3 2.3 0.8 1.1 1.5 1.9 10.9
1 .691 2.4 7.8 10.2 5.1 1.1 1.3 2.8 2.5 15.9
2 .827 1.9 5.5 7.4 2.4 0.7 1.6 1.4 2.4 18.9
3 NaN 1.0 0.5 1.5 0.0 0.0 0.0 0.5 0.5 1.0
4 .676 0.2 1.6 1.8 1.9 0.4 0.2 1.1 1.2 5.7
.. ... ... ... ... ... ... ... ... ... ...
672 .860 0.5 3.7 4.3 9.3 1.1 0.1 4.8 1.7 29.6
673 .682 2.8 4.3 7.1 1.5 0.7 0.4 1.3 2.4 11.1
674 NaN 1.5 0.5 2.0 0.0 0.0 0.0 0.0 0.0 1.0
675 .737 0.8 2.2 3.0 0.3 0.3 0.2 0.5 1.2 4.4
676 .747 2.7 4.8 7.5 1.1 0.2 0.9 0.8 2.3 8.3
[677 rows x 30 columns]]
It looks a little bit messy. What we actually have here is a list of DataFrames. We can tidy this up using pandas alone (without any additional libraries!)
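To see the list-of-DataFrames behavior without a network call, read_html can also parse a literal HTML string (wrapped in StringIO). A minimal sketch with invented data, assuming a parser backend such as lxml is installed:

```python
import pandas as pd
from io import StringIO

# A minimal HTML table (hypothetical data), parsed without any network call
html = '''
<table>
  <tr><th>Player</th><th>PTS</th></tr>
  <tr><td>Steven Adams</td><td>10.9</td></tr>
  <tr><td>Bam Adebayo</td><td>15.9</td></tr>
</table>
'''
tables = pd.read_html(StringIO(html))
print(len(tables))   # 1 -- read_html always returns a list of DataFrames
df_toy = tables[0]
print(df_toy.shape)  # (2, 2)
```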
# Check number of DataFrames in this list
print(f'number of tables in df: {len(df)}')
print('================')
# Since there is only 1, pull out the 0th element:
df[0].head(20)
number of tables in df: 1
================
Rk | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Steven Adams | C | 26 | OKC | 63 | 63 | 26.7 | 4.5 | 7.6 | ... | .582 | 3.3 | 6.0 | 9.3 | 2.3 | 0.8 | 1.1 | 1.5 | 1.9 | 10.9 |
1 | 2 | Bam Adebayo | PF | 22 | MIA | 72 | 72 | 33.6 | 6.1 | 11.0 | ... | .691 | 2.4 | 7.8 | 10.2 | 5.1 | 1.1 | 1.3 | 2.8 | 2.5 | 15.9 |
2 | 3 | LaMarcus Aldridge | C | 34 | SAS | 53 | 53 | 33.1 | 7.4 | 15.0 | ... | .827 | 1.9 | 5.5 | 7.4 | 2.4 | 0.7 | 1.6 | 1.4 | 2.4 | 18.9 |
3 | 4 | Kyle Alexander | C | 23 | MIA | 2 | 0 | 6.5 | 0.5 | 1.0 | ... | NaN | 1.0 | 0.5 | 1.5 | 0.0 | 0.0 | 0.0 | 0.5 | 0.5 | 1.0 |
4 | 5 | Nickeil Alexander-Walker | SG | 21 | NOP | 47 | 1 | 12.6 | 2.1 | 5.7 | ... | .676 | 0.2 | 1.6 | 1.8 | 1.9 | 0.4 | 0.2 | 1.1 | 1.2 | 5.7 |
5 | 6 | Grayson Allen | SG | 24 | MEM | 38 | 0 | 18.9 | 3.1 | 6.6 | ... | .867 | 0.2 | 2.0 | 2.2 | 1.4 | 0.3 | 0.1 | 0.9 | 1.4 | 8.7 |
6 | 7 | Jarrett Allen | C | 21 | BRK | 70 | 64 | 26.5 | 4.3 | 6.6 | ... | .633 | 3.1 | 6.5 | 9.6 | 1.6 | 0.6 | 1.3 | 1.1 | 2.3 | 11.1 |
7 | 8 | Kadeem Allen | PG | 27 | NYK | 10 | 0 | 11.7 | 1.9 | 4.4 | ... | .636 | 0.2 | 0.7 | 0.9 | 2.1 | 0.5 | 0.2 | 0.8 | 0.7 | 5.0 |
8 | 9 | Al-Farouq Aminu | PF | 29 | ORL | 18 | 2 | 21.1 | 1.4 | 4.8 | ... | .655 | 1.3 | 3.5 | 4.8 | 1.2 | 1.0 | 0.4 | 0.9 | 1.5 | 4.3 |
9 | 10 | Justin Anderson | SG | 26 | BRK | 10 | 1 | 10.7 | 1.0 | 3.8 | ... | .500 | 0.1 | 2.0 | 2.1 | 0.8 | 0.0 | 0.6 | 0.4 | 1.3 | 2.8 |
10 | 11 | Kyle Anderson | SF | 26 | MEM | 67 | 28 | 19.9 | 2.3 | 4.9 | ... | .667 | 0.9 | 3.4 | 4.3 | 2.4 | 0.8 | 0.6 | 1.0 | 1.7 | 5.8 |
11 | 12 | Ryan Anderson | C | 31 | HOU | 2 | 0 | 7.0 | 1.0 | 3.5 | ... | NaN | 0.0 | 3.5 | 3.5 | 1.0 | 0.5 | 0.0 | 0.5 | 0.5 | 2.5 |
12 | 13 | Giannis Antetokounmpo | PF | 25 | MIL | 63 | 63 | 30.4 | 10.9 | 19.7 | ... | .633 | 2.2 | 11.4 | 13.6 | 5.6 | 1.0 | 1.0 | 3.7 | 3.1 | 29.5 |
13 | 14 | Kostas Antetokounmpo | PF | 22 | LAL | 5 | 0 | 4.0 | 0.6 | 0.6 | ... | .500 | 0.4 | 0.2 | 0.6 | 0.4 | 0.0 | 0.0 | 0.2 | 0.4 | 1.4 |
14 | 15 | Thanasis Antetokounmpo | SF | 27 | MIL | 20 | 2 | 6.5 | 1.2 | 2.4 | ... | .412 | 0.6 | 0.6 | 1.2 | 0.8 | 0.4 | 0.1 | 0.6 | 0.9 | 2.8 |
15 | 16 | Carmelo Anthony | PF | 35 | POR | 58 | 58 | 32.8 | 5.8 | 13.5 | ... | .845 | 1.2 | 5.1 | 6.3 | 1.5 | 0.8 | 0.5 | 1.7 | 2.9 | 15.4 |
16 | 17 | OG Anunoby | SF | 22 | TOR | 69 | 68 | 29.9 | 4.1 | 8.2 | ... | .706 | 1.2 | 4.1 | 5.3 | 1.6 | 1.4 | 0.7 | 1.1 | 2.4 | 10.6 |
17 | 18 | Ryan Arcidiacono | PG | 25 | CHI | 58 | 4 | 16.0 | 1.6 | 3.8 | ... | .711 | 0.3 | 1.6 | 1.9 | 1.7 | 0.5 | 0.1 | 0.6 | 1.7 | 4.5 |
18 | 19 | Trevor Ariza | SF | 34 | TOT | 53 | 21 | 28.2 | 2.7 | 6.1 | ... | .838 | 0.6 | 4.0 | 4.6 | 1.7 | 1.3 | 0.3 | 1.1 | 2.1 | 8.0 |
19 | 19 | Trevor Ariza | SF | 34 | SAC | 32 | 0 | 24.7 | 2.0 | 5.2 | ... | .778 | 0.7 | 3.9 | 4.6 | 1.6 | 1.1 | 0.2 | 0.9 | 2.0 | 6.0 |
20 | 19 | Trevor Ariza | SF | 34 | POR | 21 | 21 | 33.4 | 3.7 | 7.6 | ... | .872 | 0.6 | 4.1 | 4.8 | 2.0 | 1.6 | 0.4 | 1.3 | 2.3 | 11.0 |
21 | 20 | D.J. Augustin | PG | 32 | ORL | 57 | 13 | 24.9 | 3.2 | 8.1 | ... | .890 | 0.4 | 1.8 | 2.1 | 4.6 | 0.6 | 0.0 | 1.5 | 1.3 | 10.5 |
22 | Rk | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
23 | 21 | Deandre Ayton | C | 21 | PHO | 38 | 32 | 32.5 | 8.2 | 14.9 | ... | .753 | 3.9 | 7.6 | 11.5 | 1.9 | 0.7 | 1.5 | 2.1 | 3.1 | 18.2 |
24 | 22 | Dwayne Bacon | SG | 24 | CHO | 39 | 11 | 17.6 | 2.2 | 6.3 | ... | .660 | 0.4 | 2.2 | 2.6 | 1.3 | 0.6 | 0.1 | 0.9 | 1.3 | 5.7 |
25 rows × 30 columns
Wow! You'll notice that there are some missing values (NaN) and multiple occurrences of some player names, because those players played for different teams in the same year.
We can also see on the website that the header row repeats every 20 players. We'll have to remove the repeated headers and keep only the first one:
# Assign the table to a variable df_2020
df_2020 = df[0]
# Let's find the table header rows, which appear multiple times in the dataframe
df_2020[df_2020.Age == 'Age'].head() # Selects the repeated header rows in this entire dataframe!
Rk | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
22 | Rk | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
53 | Rk | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
76 | Rk | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
101 | Rk | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
130 | Rk | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
5 rows × 30 columns
# Check how many redundant headers we have:
print(f' total numbers of redundant headers: {len(df_2020[df_2020.Age == "Age"])} ')
# Drop the redundant headers in the dataframe:
df_2020_new = df_2020.drop(df_2020[df_2020.Age == 'Age'].index)
# Compare before and after dropping redundant headers with numbers of rows:
print(f' total rows of df_2020: {df_2020.shape[0]} ')
print(f' total rows of df_2020_new: {df_2020_new.shape[0]} ')
print('===========================================')
df_2020_new.head(20)
total numbers of redundant headers: 26
total rows of df_2020: 677
total rows of df_2020_new: 651
===========================================
Rk | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Steven Adams | C | 26 | OKC | 63 | 63 | 26.7 | 4.5 | 7.6 | ... | .582 | 3.3 | 6.0 | 9.3 | 2.3 | 0.8 | 1.1 | 1.5 | 1.9 | 10.9 |
1 | 2 | Bam Adebayo | PF | 22 | MIA | 72 | 72 | 33.6 | 6.1 | 11.0 | ... | .691 | 2.4 | 7.8 | 10.2 | 5.1 | 1.1 | 1.3 | 2.8 | 2.5 | 15.9 |
2 | 3 | LaMarcus Aldridge | C | 34 | SAS | 53 | 53 | 33.1 | 7.4 | 15.0 | ... | .827 | 1.9 | 5.5 | 7.4 | 2.4 | 0.7 | 1.6 | 1.4 | 2.4 | 18.9 |
3 | 4 | Kyle Alexander | C | 23 | MIA | 2 | 0 | 6.5 | 0.5 | 1.0 | ... | NaN | 1.0 | 0.5 | 1.5 | 0.0 | 0.0 | 0.0 | 0.5 | 0.5 | 1.0 |
4 | 5 | Nickeil Alexander-Walker | SG | 21 | NOP | 47 | 1 | 12.6 | 2.1 | 5.7 | ... | .676 | 0.2 | 1.6 | 1.8 | 1.9 | 0.4 | 0.2 | 1.1 | 1.2 | 5.7 |
5 | 6 | Grayson Allen | SG | 24 | MEM | 38 | 0 | 18.9 | 3.1 | 6.6 | ... | .867 | 0.2 | 2.0 | 2.2 | 1.4 | 0.3 | 0.1 | 0.9 | 1.4 | 8.7 |
6 | 7 | Jarrett Allen | C | 21 | BRK | 70 | 64 | 26.5 | 4.3 | 6.6 | ... | .633 | 3.1 | 6.5 | 9.6 | 1.6 | 0.6 | 1.3 | 1.1 | 2.3 | 11.1 |
7 | 8 | Kadeem Allen | PG | 27 | NYK | 10 | 0 | 11.7 | 1.9 | 4.4 | ... | .636 | 0.2 | 0.7 | 0.9 | 2.1 | 0.5 | 0.2 | 0.8 | 0.7 | 5.0 |
8 | 9 | Al-Farouq Aminu | PF | 29 | ORL | 18 | 2 | 21.1 | 1.4 | 4.8 | ... | .655 | 1.3 | 3.5 | 4.8 | 1.2 | 1.0 | 0.4 | 0.9 | 1.5 | 4.3 |
9 | 10 | Justin Anderson | SG | 26 | BRK | 10 | 1 | 10.7 | 1.0 | 3.8 | ... | .500 | 0.1 | 2.0 | 2.1 | 0.8 | 0.0 | 0.6 | 0.4 | 1.3 | 2.8 |
10 | 11 | Kyle Anderson | SF | 26 | MEM | 67 | 28 | 19.9 | 2.3 | 4.9 | ... | .667 | 0.9 | 3.4 | 4.3 | 2.4 | 0.8 | 0.6 | 1.0 | 1.7 | 5.8 |
11 | 12 | Ryan Anderson | C | 31 | HOU | 2 | 0 | 7.0 | 1.0 | 3.5 | ... | NaN | 0.0 | 3.5 | 3.5 | 1.0 | 0.5 | 0.0 | 0.5 | 0.5 | 2.5 |
12 | 13 | Giannis Antetokounmpo | PF | 25 | MIL | 63 | 63 | 30.4 | 10.9 | 19.7 | ... | .633 | 2.2 | 11.4 | 13.6 | 5.6 | 1.0 | 1.0 | 3.7 | 3.1 | 29.5 |
13 | 14 | Kostas Antetokounmpo | PF | 22 | LAL | 5 | 0 | 4.0 | 0.6 | 0.6 | ... | .500 | 0.4 | 0.2 | 0.6 | 0.4 | 0.0 | 0.0 | 0.2 | 0.4 | 1.4 |
14 | 15 | Thanasis Antetokounmpo | SF | 27 | MIL | 20 | 2 | 6.5 | 1.2 | 2.4 | ... | .412 | 0.6 | 0.6 | 1.2 | 0.8 | 0.4 | 0.1 | 0.6 | 0.9 | 2.8 |
15 | 16 | Carmelo Anthony | PF | 35 | POR | 58 | 58 | 32.8 | 5.8 | 13.5 | ... | .845 | 1.2 | 5.1 | 6.3 | 1.5 | 0.8 | 0.5 | 1.7 | 2.9 | 15.4 |
16 | 17 | OG Anunoby | SF | 22 | TOR | 69 | 68 | 29.9 | 4.1 | 8.2 | ... | .706 | 1.2 | 4.1 | 5.3 | 1.6 | 1.4 | 0.7 | 1.1 | 2.4 | 10.6 |
17 | 18 | Ryan Arcidiacono | PG | 25 | CHI | 58 | 4 | 16.0 | 1.6 | 3.8 | ... | .711 | 0.3 | 1.6 | 1.9 | 1.7 | 0.5 | 0.1 | 0.6 | 1.7 | 4.5 |
18 | 19 | Trevor Ariza | SF | 34 | TOT | 53 | 21 | 28.2 | 2.7 | 6.1 | ... | .838 | 0.6 | 4.0 | 4.6 | 1.7 | 1.3 | 0.3 | 1.1 | 2.1 | 8.0 |
19 | 19 | Trevor Ariza | SF | 34 | SAC | 32 | 0 | 24.7 | 2.0 | 5.2 | ... | .778 | 0.7 | 3.9 | 4.6 | 1.6 | 1.1 | 0.2 | 0.9 | 2.0 | 6.0 |
20 rows × 30 columns
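The drop-by-boolean-index pattern above can be demonstrated on a toy frame (hypothetical data, not the real NBA table):

```python
import pandas as pd

# Toy frame (hypothetical data) with a repeated header row mixed into the rows
toy = pd.DataFrame({'Player': ['Steven Adams', 'Player', 'Bam Adebayo'],
                    'Age': ['26', 'Age', '22']})

# Rows where the Age column literally reads 'Age' are repeated headers -- drop them
cleaned = toy.drop(toy[toy.Age == 'Age'].index)
print(list(cleaned.Player))  # ['Steven Adams', 'Bam Adebayo']
```

The boolean mask selects the header rows, .index collects their row labels, and .drop removes them; the same three steps clean the full 677-row table.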
# Making a simple histogram
plt.figure(figsize=(10,8))
sns.distplot(df_2020_new.PTS.astype(float), # Convert points to numeric; read_html leaves the mixed column as strings
             kde=False, # Should be False because we want to retain the raw frequencies ("kde=True" would plot a probability density)
             hist_kws=dict(edgecolor='black', linewidth=2))
plt.title('HISTOGRAM OF PLAYER POINTS PER GAME IN THE 2020 NBA SEASON')
plt.ylabel('NUMBER OF PLAYERS')
plt.xlabel('POINTS PER GAME')
plt.show()
From the histogram, we can see:
- About 57 players averaged between 0 and 1 point per game.
- Fewer than 10 players averaged 30 or more points per game.