In the other repository, I introduced how to use Beautiful Soup for a variety of quick and simple web-scraping tasks. Beautiful Soup is great, but it has its limitations: sometimes the content we want to scrape is hidden behind buttons and links that we cannot reach directly through a URL. There are also more sophisticated browsing tasks, like filling in forms and text boxes, that we might need to automate. For these purposes it is good to reach for another package: Selenium.
Selenium is actually much more than a Python package; it's a whole framework for automating web browsers for the purposes of testing web applications, and it's been ported to a variety of programming languages in addition to Python. The main reason you should be aware of this is because, if you ever need to Google something about Selenium, you should include Selenium and Python in your search query; otherwise you will probably get a lot of results in Java (and who wants that). Also, if you ever have any questions about Selenium, their unofficial documentation is always a good place to start.
Open bash and run:
pip install selenium
# Standard imports
import pandas as pd
# For web scraping
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.common.by import By
import time
In order to use Selenium, you must download a driver to interface with your chosen browser. Currently, Selenium supports Chrome, Firefox, Safari, and Edge. You can find a link to the driver for the browser of your choice here. Be sure to download the driver that matches the version of your chosen browser!
For the purposes of this kickoff, we'll be using Selenium with Chrome. To do this, we must first download ChromeDriver. For example, if your current Chrome browser version is 89, you have to download ChromeDriver version 89.
The ChromeDriver file, once unzipped, is a single executable file called chromedriver. You may keep this file anywhere on your computer, but it is best to place it in an easy-to-reference location that you know. For example, if you save chromedriver in your Downloads folder, the directory path will be something like /Users/Download/chromedriver. We can then assign that directory path to a variable:
# Save the path to the chromedriver executable file in a variable
chrome_path = '/Users/quanganhpham/Downloads/chromedriver'  # directory path to your chromedriver
Today we will be scraping information (name / brand / review contents) for the product Solaray, Vitamin D3 + K2, Soy-Free, 125 mcg (5000 IU), 60 VegCaps from this LINK.
# Initialize two Chrome drivers: one for the product page, one for the review pages
driver = webdriver.Chrome(chrome_path)
driver1 = webdriver.Chrome(chrome_path)
# Assign the product URL to a variable
product_url = "https://ca.iherb.com/pr/Solaray-Vitamin-D3-K2-Soy-Free-125-mcg-5000-IU-60-VegCaps/70098"
# Open the URL in the first driver
driver.get(product_url)
First, we can get the product brand on the main product page by using this syntax: find_element_by_xpath('.//*[@id="brand"]/a/span/bdi').get_attribute('textContent'). This finds the element whose XPath starts from id="brand" and follows the tags a -> span -> bdi; then we use the function get_attribute() with textContent to parse the product brand, which is Solaray.
(We could also parse the product name from the main page; however, we won't, because we want to try another function, find_element_by_css_selector, later.)
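The XPath call above can be wrapped in a small helper so it is easy to reuse later in the scraping loop. This is just a sketch: `get_brand` is a hypothetical helper name, and it assumes the Selenium 3 `find_element_by_xpath` API used throughout this post.

```python
def get_brand(driver):
    """Return the brand text from a loaded product page.

    Hypothetical helper; uses the same XPath and the old
    Selenium 3 `find_element_by_xpath` API as the rest of this post.
    """
    element = driver.find_element_by_xpath('.//*[@id="brand"]/a/span/bdi')
    return element.get_attribute('textContent')
```

Because the helper takes the driver as a parameter, it works with either of the two drivers we opened above.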
We want to scrape the review contents of the product. However, all the review contents live behind the View All Reviews link rather than on the main product page URL. Therefore, we have to get that link by using Selenium functions to parse the attribute which contains the URL to all the reviews, as in the picture below:
As for the product name, we can find it on the View All Reviews page. This is where find_element_by_css_selector comes in handy, using the following syntax: find_element_by_css_selector('[class="nav-product-link-text"] span').text. This finds the element with class="nav-product-link-text", descends to its span tag, and extracts the text from that tag with .text, as in the picture below:
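The CSS-selector call can be sketched as a helper in the same way. Again, `get_product_name` is a hypothetical name; the selector is the one used throughout this post.

```python
def get_product_name(driver):
    """Return the product name shown on the View All Reviews page.

    Hypothetical helper; uses the Selenium 3
    `find_element_by_css_selector` API as in the rest of this post.
    """
    element = driver.find_element_by_css_selector('[class="nav-product-link-text"] span')
    return element.text
```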
# Set a waiting time for the Driver
wait = WebDriverWait(driver, 4)
# Locate `View All Reviews` link
link = wait.until(expected_conditions.presence_of_element_located((By.CSS_SELECTOR,"span.all-reviews-link > a")))
# Get `View All Reviews` link
x = link.get_attribute("href")
# Check the link
x
'https://ca.iherb.com/r/Solaray-Vitamin-D3-K2-Soy-Free-60-VegCaps/70098'
Now we have to create two nested for loops:
- In the outer loop, we build the URL of each review page we want to scrape.
- In the inner loop, we scrape the data we need from each review.
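The URL-building step of the outer loop is plain string work, so we can sketch it on its own (hypothetical helper name; it reproduces the `?&p=` page-number pattern used in the loop below):

```python
def review_page_urls(base_url, max_pages):
    """Build the paginated review URLs, one per page number starting at 1."""
    return [base_url + "?&p=" + str(page_num)
            for page_num in range(1, max_pages + 1)]
```

Called with the View All Reviews link and `max_pages=3`, this produces exactly the three URLs printed by the loop below.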
# Create lists for the dataframe:
item_name = []
item_brand = []
review_contents = []
# Scrape a maximum of 3 pages in the review section
max_page_num = 3
for page_num in range(1, max_page_num + 1):
    review_url = x + "?&p=" + str(page_num)
    print(review_url)
    # Point the second driver at the review page
    driver1.get(review_url)
    # Get all the review elements on the page
    review_containers = driver1.find_elements_by_class_name('review-row')
    for container in review_containers:
        # Add the review contents
        review_contents.append(container.find_element_by_class_name('review-text').text)
        # Add the product name (from the review page)
        item_name.append(driver1.find_element_by_css_selector('[class="nav-product-link-text"] span').text)
        # Add the product brand (from the main product page, still open in `driver`)
        item_brand.append(driver.find_element_by_xpath('.//*[@id="brand"]/a/span/bdi').get_attribute('textContent'))
    # Sleep between pages
    time.sleep(4)
https://ca.iherb.com/r/Solaray-Vitamin-D3-K2-Soy-Free-60-VegCaps/70098?&p=1
https://ca.iherb.com/r/Solaray-Vitamin-D3-K2-Soy-Free-60-VegCaps/70098?&p=2
https://ca.iherb.com/r/Solaray-Vitamin-D3-K2-Soy-Free-60-VegCaps/70098?&p=3
# Create a dataframe
df_product = pd.DataFrame({'item_brand': item_brand,
                           'item_name': item_name,
                           'review_contents': review_contents})
# Check the dataframe shape
df_product.shape
(20, 3)
# Check the dataframe
df_product.head(15)
|    | item_brand | item_name | review_contents |
|----|------------|-----------|-----------------|
| 0  | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | Everyone around said that in Russia everyone, ... |
| 1  | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | So far I can not appreciate the dignity of thi... |
| 2  | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | I am surprised by the reviews of people who de... |
| 3  | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | very cool product, I |
| 4  | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | Very cool product, I recommend it to everyone |
| 5  | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | cool very cool product recommend it |
| 6  | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | I recommend |
| 7  | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | Because of the large cans, they noticed that t... |
| 8  | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | Very, very cool product, I recommend |
| 9  | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | After a course of these vitamins, as my nutrit... |
| 10 | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | I love that this supplement contains vitamin K... |
| 11 | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | The excellent formula of this drug will provid... |
| 12 | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | Simply the best vitamin D3 complex! The dosage... |
| 13 | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | After reading reviews about the lack of vitami... |
| 14 | Solaray | Vitamin D3 + K2, Soy-Free, 60 VegCaps | They drank the whole family. Raises vitamin D ... |
# Let's make a CSV file from the dataframe
df_product.to_csv('product_review.csv', index=False, header=True)
Lastly, you can use Selenium to close the browsers (or you can simply close them yourself).
driver.quit()
driver1.quit()