In this project, we’re going to programmatically extract customer reviews from Walmart’s webpage using Python. The script automatically navigates to the next page of reviews by simulating user interaction. We’ll also see how to get past the webpage’s bot challenge.
Table of Contents
- Selenium
- Pandas
The page presents its bot challenge in two ways:
- The bot-challenge page loads on top of the main page and prevents users from interacting with it until they press and hold the mouse button for 4-5 seconds. Once the action is complete, the script loads the main page and allows users to interact with it.
- The bot-challenge script blocks the main URL and doesn’t load the main page at all.
Scenario 1: We’re going to use JavaScript to locate and remove the DIV element that contains the bot challenge. This exposes the main page, but the page still won’t allow any interaction. To fix that, we also need to remove the class attribute from the BODY element. Since the bot challenge shows up on every page, it makes sense to wrap this operation in a function.
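A minimal sketch of such a function is shown here. It is not the code from scraper.py: the selector `div#bot-challenge` and the function name `remove_bot_challenge` are hypothetical, and the real selector has to be found by inspecting the challenge overlay in the browser. The function only assumes a Selenium-style driver that exposes `execute_script`.

```python
# Hypothetical selector: the real id/class of the challenge DIV is not given
# in this write-up and must be looked up in the browser's dev tools.
CHALLENGE_SELECTOR = "div#bot-challenge"

# JavaScript run inside the page. arguments[0] is the selector passed in
# from Python through execute_script.
REMOVE_CHALLENGE_JS = """
const overlay = document.querySelector(arguments[0]);
if (overlay) { overlay.remove(); }
document.body.removeAttribute('class');
return overlay !== null;
"""


def remove_bot_challenge(driver, selector=CHALLENGE_SELECTOR):
    """Remove the challenge DIV and the class attribute on <body>.

    Returns True if an element matching `selector` was found and removed.
    """
    return bool(driver.execute_script(REMOVE_CHALLENGE_JS, selector))
```

With a real browser session this would be called once after every page load, for example `remove_bot_challenge(driver)` right after `driver.get(url)` or after clicking through to the next page.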
Scenario 2: To tackle the second scenario, we’re simply going to refresh the webpage until the URL is unblocked.
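The refresh loop could look roughly like the sketch below. It assumes that a blocked load can be recognised by a marker string in the current URL; the marker `"blocked"`, the retry limit, the delay, and the function name `load_until_unblocked` are all assumptions for illustration rather than values taken from scraper.py. A retry limit is added so the loop cannot run forever.

```python
import time


def load_until_unblocked(driver, url, blocked_marker="blocked",
                         max_attempts=10, delay=2.0):
    """Open `url` and refresh until the current URL no longer looks blocked.

    `blocked_marker` is an assumed substring of the blocked URL. Returns True
    once the page is unblocked, or False if still blocked after
    `max_attempts` refreshes.
    """
    driver.get(url)
    for _ in range(max_attempts):
        if blocked_marker not in driver.current_url:
            return True
        time.sleep(delay)   # short pause before trying again
        driver.refresh()
    # Check the result of the final refresh as well.
    return blocked_marker not in driver.current_url
```

If the function returns False, the script can stop or wait longer instead of scraping a blocked page.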
Link: scraper.py
Disclaimer: This script and the information provided in this project are for educational purposes only.