In the other repository, I introduced how to use Beautiful Soup for a variety of quick and simple web-scraping tasks. Beautiful Soup is great, but it has its limitations: sometimes the content we want to scrape is hidden behind buttons and links that we cannot reach directly through a URL. There are also more sophisticated browsing tasks, like filling in forms and text boxes, that we might need to automate. For these purposes it is good to reach for another package: Selenium.
Selenium is actually much more than a Python package; it's a whole framework for automating web browsers for the purposes of testing web applications, and it's been ported to a variety of programming languages in addition to Python. The main reason you should be aware of this is because, if you ever need to Google something about Selenium, you should include Selenium and Python in your search query; otherwise you will probably get a lot of results in Java (and who wants that). Also, if you ever have any questions about Selenium, their unofficial documentation is always a good place to start.
Open bash and run:
pip install selenium
# Standard imports
import pandas as pd
# For web scraping
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.common.by import By
import time
In order to use Selenium, you must download a driver to interface with your chosen browser. Currently, Selenium supports Chrome, Firefox, Safari, and Edge. You can find a link to the driver for the browser of your choice here. Be sure to download the driver that matches the version of your chosen browser!
For the purposes of this kickoff, we'll be using Selenium with Chrome. To do this, we must first download ChromeDriver. For example, if your current Chrome browser version is 89, you have to download ChromeDriver version 89.
The ChromeDriver file, once unzipped, is a single executable file called chromedriver. You may keep this file anywhere on your computer, but it is best to place it in an easy-to-reference location that you know. For example, if you save chromedriver in your Downloads folder, the directory path will be something like /Users/Download/chromedriver. We can then assign that directory path to a variable:
# Save the path to the chromedriver executable file in a variable
chrome_path = '/Users/quanganhpham/Downloads/chromedriver'  # directory path to your chromedriver
Today we will be scraping information (name / brand / review contents) for the product Solaray, Vitamin D3 + K2, Soy-Free, 125 mcg (5000 IU), 60 VegCaps from this LINK.
# Initialize two Chrome drivers: one for the product page, one for the review pages
driver = webdriver.Chrome(chrome_path)
driver1 = webdriver.Chrome(chrome_path)
# Assign the product URL to a variable
product_url = "https://ca.iherb.com/pr/Solaray-Vitamin-D3-K2-Soy-Free-125-mcg-5000-IU-60-VegCaps/70098"
# Open the URL in the first driver
driver.get(product_url)
First, we can get the product brand on the main product page by using this syntax: find_element_by_xpath('.//*[@id="brand"]/a/span/bdi').get_attribute('textContent'). This finds the element whose XPath starts from id="brand" and follows the tags a -> span -> bdi; then we use the function get_attribute() with textContent to parse the product brand, which is Solaray.
(We could also parse the product name from the main page; however, we won't, because we want to try another function, find_element_by_css_selector, later.)
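The XPath call above can be wrapped in a small helper so it is easy to reuse later in the scraping loop. This is just a sketch: `get_brand` is a hypothetical helper name, and it assumes the Selenium 3 `find_element_by_xpath` API used throughout this post.

```python
def get_brand(driver):
    """Return the brand text from a loaded product page.

    Hypothetical helper; uses the same XPath and the old
    Selenium 3 `find_element_by_xpath` API as the rest of this post.
    """
    element = driver.find_element_by_xpath('.//*[@id="brand"]/a/span/bdi')
    return element.get_attribute('textContent')
```

Because the helper takes the driver as a parameter, it works with either of the two drivers we opened above.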
We want to scrape the review contents of the product. However, all the review contents live behind the View All Reviews link rather than on the main product page URL. Therefore, we have to get that link by using Selenium functions to parse the attribute which contains the URL to all the reviews, as in the picture below:
As for the product name, we can find it on the View All Reviews page. This is where find_element_by_css_selector comes in handy, using the following syntax: find_element_by_css_selector('[class="nav-product-link-text"] span').text. This finds the element with class="nav-product-link-text", descends to its span tag, and extracts the text from that tag with .text, as in the picture below:
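The CSS-selector call can be sketched as a helper in the same way. Again, `get_product_name` is a hypothetical name; the selector is the one used throughout this post.

```python
def get_product_name(driver):
    """Return the product name shown on the View All Reviews page.

    Hypothetical helper; uses the Selenium 3
    `find_element_by_css_selector` API as in the rest of this post.
    """
    element = driver.find_element_by_css_selector('[class="nav-product-link-text"] span')
    return element.text
```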
# Set a waiting time for the Driver
wait = WebDriverWait(driver, 4)
# Locate `View All Reviews` link
link = wait.until(expected_conditions.presence_of_element_located((By.CSS_SELECTOR,"span.all-reviews-link > a")))
# Get `View All Reviews` link
x = link.get_attribute("href")
# Check the link
x
'https://ca.iherb.com/r/Solaray-Vitamin-D3-K2-Soy-Free-60-VegCaps/70098'
Now we have to create two nested for loops:
- In the outer loop, we build the URL of each review page we want to scrape.
- In the inner loop, we scrape the data we need from each review.
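The URL-building step of the outer loop is plain string work, so we can sketch it on its own (hypothetical helper name; it reproduces the `?&p=` page-number pattern used in the loop below):

```python
def review_page_urls(base_url, max_pages):
    """Build the paginated review URLs, one per page number starting at 1."""
    return [base_url + "?&p=" + str(page_num)
            for page_num in range(1, max_pages + 1)]
```

Called with the View All Reviews link and `max_pages=3`, this produces exactly the three URLs printed by the loop below.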
# Create lists for the dataframe:
item_name = []
item_brand = []
review_contents = []
# Scrape a maximum of 3 pages in the review section
max_page_num = 3
for page_num in range(1, max_page_num + 1):
    review_url = x + "?&p=" + str(page_num)
    print(review_url)
    # Point the second driver at the review page
    driver1.get(review_url)
    # Get all the review elements on the page
    review_containers = driver1.find_elements_by_class_name('review-row')
    for container in review_containers:
        # Add the review contents
        review_contents.append(container.find_element_by_class_name('review-text').text)
        # Add the product name (from the review page)
        item_name.append(driver1.find_element_by_css_selector('[class="nav-product-link-text"] span').text)
        # Add the product brand (from the main product page, still open in `driver`)
        item_brand.append(driver.find_element_by_xpath('.//*[@id="brand"]/a/span/bdi').get_attribute('textContent'))
    # Sleep between pages
    time.sleep(4)
https://ca.iherb.com/r/Solaray-Vitamin-D3-K2-Soy-Free-60-VegCaps/70098?&p=1
https://ca.iherb.com/r/Solaray-Vitamin-D3-K2-Soy-Free-60-VegCaps/70098?&p=2
https://ca.iherb.com/r/Solaray-Vitamin-D3-K2-Soy-Free-60-VegCaps/70098?&p=3
# Create a dataframe
df_product = pd.DataFrame({'item_brand': item_brand,
                           'item_name': item_name,
                           'review_contents': review_contents})
# Check the dataframe shape
df_product.shape
(20, 3)
# Check the dataframe
df_product.head(15)
|    | item_brand | item_name | review_contents |
|----|------------|-----------|-----------------|
| 0  | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | Everyone around said that in Russia everyone, ... |
| 1  | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | So far I can not appreciate the dignity of thi... |
| 2  | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | I am surprised by the reviews of people who de... |
| 3  | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | very cool product, I |
| 4  | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | Very cool product, I recommend it to everyone |
| 5  | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | cool very cool product recommend it |
| 6  | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | I recommend |
| 7  | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | Because of the large cans, they noticed that t... |
| 8  | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | Very, very cool product, I recommend |
| 9  | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | After a course of these vitamins, as my nutrit... |
| 10 | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | I love that this supplement contains vitamin K... |
| 11 | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | The excellent formula of this drug will provid... |
| 12 | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | Simply the best vitamin D3 complex! The dosage... |
| 13 | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | After reading reviews about the lack of vitami... |
| 14 | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | They drank the whole family. Raises vitamin D ... |
# Let's make a CSV file from the dataframe
df_product.to_csv('product_review.csv', index=False, header=True)
Lastly, you can use Selenium to close the browsers (or you can simply close them yourself).
driver.quit()
driver1.quit()