
[Event Source] Pedal in Tandem #40

Open

wants to merge 39 commits into main

Conversation

surreal30 (Contributor) commented on Aug 28, 2024

Source: https://www.pedalintandem.com/experiences/

A few notes:

  • Uses BeautifulSoup to parse the HTML.
  • Uses the pagination on the experiences listing to fetch all the events.
  • Attempts to fetch event attributes such as difficulty and distance as well.
  • Fetches accurate dates and cost structure.

Currently does not:

  1. Fetch event categories.
  2. Consider costs for optional add-ons.
  3. Exclude events happening far from Bangalore.

date_opts = options_selector.find('div', class_='product-variations-variety').find('select').find_all('option')
for date_opt in date_opts:
    booking_begin = str(datetime.strptime(date_opt['data-booking-begin-at'], "%Y-%m-%d %H:%M:%S %Z"))
    event_date = str(datetime.strptime(date_opt.get_text().strip(), "%d-%b-%Y"))
Contributor Author:

This method seems a little convoluted: I convert the string to a datetime, then back to a string, and then to a datetime again before storing it in JSON.

Should I drop _date() and use them here directly?

Contributor:

Extract to datetime or date objects, and convert to strings only at the last step.
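A minimal sketch of that approach, using hypothetical raw strings in the two formats the scraper already parses (`data-booking-begin-at` and the option text):

```python
from datetime import datetime

# Hypothetical raw values in the formats the scraper already handles.
raw_booking_begin = "2024-09-01 06:00:00 UTC"
raw_event_date = "15-Sep-2024"

# Parse once, up front, into datetime objects...
booking_begin = datetime.strptime(raw_booking_begin, "%Y-%m-%d %H:%M:%S %Z")
event_date = datetime.strptime(raw_event_date, "%d-%b-%Y")

# ...work on the objects directly (comparisons, sorting, validation)...
assert booking_begin < event_date

# ...and serialize to strings only when building the JSON output.
event_json = {
    "bookingBeginDate": booking_begin.isoformat(),
    "eventDate": event_date.isoformat(),
}
```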

Contributor Author:

Got it.



def fetch_events_links(session):
    res = session.get(f"{BASE_URL}/experiences", impersonate="chrome")
Contributor:

Please confirm if the code breaks without chrome impersonation. If not, we should just use requests.

Contributor Author:

Ok. Will remove impersonate if the code doesn't break.

Comment on lines 19 to 21
load_more = soup.find('div', class_='products-loadmore')
while load_more != None:
    url = load_more.find('a').get('href')
Contributor:

Switch to a single CSS selector here, using select with a selector like div.products-loadmore a[href^="/experiences/"] to select all anchor tags whose href starts with /experiences/. This matches category pages as well, so we can drop those afterwards.

You can also write a regex for the attribute selector, which bs4 supports.
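A sketch of both approaches on a made-up HTML fragment (the class names follow the snippets above; the markup itself is illustrative):

```python
import re
from bs4 import BeautifulSoup

# Illustrative markup modeled on the listing page structure.
html = """
<div class="products-loadmore"><a href="/experiences/page/2">Load more</a></div>
<div class="single-experience"><a href="/experiences/night-ride">Night Ride</a></div>
<div class="single-experience"><a href="/experiences/category/rides">Rides</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# One CSS selector: every anchor whose href starts with /experiences/.
links = [a["href"] for a in soup.select('a[href^="/experiences/"]')]

# Equivalent regex attribute match, which bs4 also supports.
links_re = [a["href"] for a in soup.find_all("a", href=re.compile(r"^/experiences/"))]

# Category pages match too, so drop them afterwards.
event_links = [h for h in links if "/category/" not in h]
```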

surreal30 (Contributor Author), Aug 28, 2024:

I don't understand this completely. Can you please rephrase it?
Got it now.

    url = load_more.find('a').get('href')

    new_page = session.get(f"{BASE_URL}{url}")
    soup = BeautifulSoup(new_page.text, 'html.parser')
Contributor:

Do not reuse soup here; it causes confusion in the next few lookups about whether the content comes from the first page or the others.
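One way to keep the lookups unambiguous is to give each page's parse tree its own name, for example (names and fragments are illustrative):

```python
from bs4 import BeautifulSoup

# Illustrative fragments standing in for the first and a paginated page.
first_page_html = '<div class="single-experience"><a href="/experiences/a">A</a></div>'
next_page_html = '<div class="single-experience"><a href="/experiences/b">B</a></div>'

# Separate names make later lookups explicit about which page they read.
first_soup = BeautifulSoup(first_page_html, "html.parser")
next_soup = BeautifulSoup(next_page_html, "html.parser")

links = [a["href"] for a in first_soup.find_all("a")]
links += [a["href"] for a in next_soup.find_all("a")]
```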

Contributor Author:

Ok. Will change it.

    load_more = soup.find('div', class_='products-loadmore')

event_links = []
for event_div in event_divs:
Contributor:

Unsure about the need for event_divs. Just iterate over anchor tags instead.

Contributor Author:

Will check this and update it.


    return events

def make_event(event):
Contributor:

Add some comments as well please.

Suggested change:
- def make_event(event):
+ def make_event(soup):
Contributor Author:

Will change and add comments.

soup = BeautifulSoup(res.text, 'html.parser')
event_divs = soup.find_all('div', class_='single-experience')

# Fetch events from other pages.
Contributor:

Took me a while to figure out how/why this works. It would be nice to document it with a comment:

"PIT supports their infinite scroll without JavaScript via pagination. We paginate through all these pages and collect the event URLs."

Contributor Author:

Ok. Will add more comments.

Yes. JavaScript was turned off, which is why infinite scroll was not working. I will change this logic.

Contributor:

I think we can assume that important events (happening in Bangalore, featured, or upcoming) will always show up on the main page, and drop the entire pagination code, since it just adds complexity. The code then becomes:

  1. Fetch home page
  2. find all experience links
  3. Iterate/Transform
  4. Generate
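Under that assumption, the whole fetch step collapses to something like the following sketch (parsing only; `fetch_event_links` and the sample markup are illustrative, not the PR's actual code):

```python
from bs4 import BeautifulSoup

def fetch_event_links(html: str) -> list[str]:
    # Steps 1-2: parse the home page and collect all experience links,
    # skipping category pages. No pagination loop needed.
    soup = BeautifulSoup(html, "html.parser")
    hrefs = [a["href"] for a in soup.select('a[href^="/experiences/"]')]
    return [h for h in hrefs if "/category/" not in h]

# Steps 3-4 would then iterate over these links, fetch each event page,
# transform it with make_event, and generate the output JSON.
```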


captn3m0 (Contributor) left a comment:

Most of the comments are minor code quibbles. Go through them for code clarity, but in a small codebase like this they matter for one reason only: readability. When the code breaks (and it inevitably will), we want the next person who edits it to be able to debug very quickly where it went wrong. For that, we want to emphasize readability and operating on objects instead of strings.

The other problem definitely needs correction: we need the data to conform to the https://schema.org/Event schema.

In particular:

  1. metrics is not a valid field. It will have to be converted to a list inside the description.
  2. options will need to be converted into offers. The price cannot embed the currency symbol, and must instead use [priceCurrency](https://schema.org/priceCurrency).
  3. A url is mandatory.
  4. Location needs to match Place/PostalAddress if possible. Hardcode the "PIT" office address; the others could perhaps be plain-text addresses.
  5. dates.eventDate should instead be converted to startDate and endDate, which should be accurate datetimes. I noticed eventDate is currently set to midnight, which shouldn't be the case.
  6. bookingBeginDate should instead become the offer's availabilityStarts.
  7. Convert the long --------- separators to newlines for better readability.
  8. Optional, but nice to have: find a way to drop the imprecise offers. For the rides, L1 and L2 are real offers, but Rental Bike/Transport my Bike are add-ons. If we keep all four, it appears as if the lowest price for the experience is 300 INR (Transport my Bike), while the truth is 2250+300. There are lots of different ways to do this; hopefully you can pick something that has a good chance of working across events.
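Putting those points together, the target shape per event would look roughly like this (all values are illustrative, not scraped):

```python
# One event restructured to match https://schema.org/Event: offers instead
# of options, startDate/endDate instead of eventDate, availabilityStarts
# instead of bookingBeginDate. Every value below is made up for illustration.
event = {
    "@context": "https://schema.org",
    "@type": "Event",
    "name": "Sample Ride",
    "url": "https://www.pedalintandem.com/experiences/sample-ride",  # mandatory
    "description": "Route details\nWhat to bring",  # newlines, not ---------
    "startDate": "2024-09-15T05:30:00+05:30",  # accurate datetimes,
    "endDate": "2024-09-15T09:00:00+05:30",    # not midnight placeholders
    "location": {
        "@type": "Place",
        "name": "Pedal in Tandem",  # hardcoded PIT office address
        "address": {"@type": "PostalAddress", "addressLocality": "Bengaluru"},
    },
    "offers": [
        {
            "@type": "Offer",
            "name": "L1",
            "price": 2250,            # numeric, no currency symbol
            "priceCurrency": "INR",
            "availabilityStarts": "2024-09-01T00:00:00+05:30",
        },
    ],
}
```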

@captn3m0 added the "Event Source" label (Issues to do with a single Event Source) on Aug 28, 2024.
@captn3m0 changed the title from "Feat: Create python script to scrap pedalintandem" to "[Event Source] Pedal in Tandem" on Aug 28, 2024.
captn3m0 (Contributor):
  1. I've updated the description.
  2. Check if you can find a way to include only events happening in and around Bangalore. We don't want blr.today to list events happening in AP.
  3. See if event categories can be picked up as keywords.
  4. Add a keyword NOTINBLR to any events outside of BLR city boundaries. There is some nuance here that I need to document, but weekend getaways, multi-day trips, etc. are not events I want to highlight, so tagging them as NOTINBLR will help me filter those out.
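Point 4 could be sketched as a small tagging helper; the location check here is a naive substring match on a text location and only a placeholder for whatever rule ends up working across events:

```python
BLR_NAMES = ("bengaluru", "bangalore")

def tag_not_in_blr(event: dict) -> dict:
    # Append the NOTINBLR keyword when the event's location text does not
    # mention Bengaluru, so downstream filtering can drop these events.
    keywords = list(event.get("keywords", []))
    location = str(event.get("location", "")).lower()
    if not any(name in location for name in BLR_NAMES):
        keywords.append("NOTINBLR")
    event["keywords"] = keywords
    return event
```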

captn3m0 (Contributor) left a comment:

Break into more functions, add some comments, and use the parse_date code from trove.

captn3m0 (Contributor):
Please add @context: https://schema.org and validate your schema properly.

[screenshot: schema validator output]

surreal30 (Contributor Author):
This is error free now. But I am not sure how to add a permanent address to this.
[screenshot: validator output, 2024-09-11]

Labels: Event Source (Issues to do with a single Event Source)
Project status: WIP
2 participants