
[Event Source] Pedal in Tandem #40

Open

wants to merge 39 commits into main

Conversation

surreal30 (Contributor) commented on Aug 28, 2024

Source: https://www.pedalintandem.com/experiences/

A few notes:

  • Uses BeautifulSoup to parse the HTML.
  • Uses the pagination on the experiences listing to fetch all the events.
  • Attempts to fetch event attributes such as difficulty and distance as well.
  • Fetches accurate dates and cost structure.

Currently does not:

  1. Fetch event categories.
  2. Consider costs for optional add-ons.
  3. Exclude events happening far from Bangalore.

date_opts = options_selector.find('div', class_='product-variations-variety').find('select').find_all('option')
for date_opt in date_opts:
    booking_begin = str(datetime.strptime(date_opt['data-booking-begin-at'], "%Y-%m-%d %H:%M:%S %Z"))
    event_date = str(datetime.strptime(date_opt.get_text().strip(), "%d-%b-%Y"))
Contributor Author:

This method seems a little convoluted: I convert the string to a datetime, then back to a string, and then to a datetime again before storing it in JSON.

Should I drop _date() and use them here directly?

Contributor:

Extract to datetime or date objects, and convert to strings only at the last step.
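A minimal sketch of that approach, using hypothetical raw strings in the two formats the scraper already parses (`data-booking-begin-at` and the option text):

```python
from datetime import datetime

# Hypothetical raw values in the formats the scraper already handles.
raw_booking_begin = "2024-09-01 06:00:00 UTC"
raw_event_date = "15-Sep-2024"

# Parse once, up front, into datetime objects...
booking_begin = datetime.strptime(raw_booking_begin, "%Y-%m-%d %H:%M:%S %Z")
event_date = datetime.strptime(raw_event_date, "%d-%b-%Y")

# ...work on the objects directly (comparisons, sorting, validation)...
assert booking_begin < event_date

# ...and serialize to strings only when building the JSON output.
event_json = {
    "bookingBeginDate": booking_begin.isoformat(),
    "eventDate": event_date.isoformat(),
}
```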

Contributor Author:

Got it.



def fetch_events_links(session):
    res = session.get(f"{BASE_URL}/experiences", impersonate="chrome")
Contributor:

Please confirm if the code breaks without chrome impersonation. If not, we should just use requests.

Contributor Author:

Ok. Will remove impersonate if the code doesn't break.

Comment on lines 19 to 21
load_more = soup.find('div', class_='products-loadmore')
while load_more != None:
    url = load_more.find('a').get('href')
Contributor:

Switch to a single CSS selector here, using select with a selector like div.products-loadmore a[href^="/experiences/"] to select all anchor tags whose href starts with /experiences/. This matches category pages as well, so we can drop those afterwards.

You can also write a regex for the attribute selector, which bs4 supports.
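A sketch of both approaches on a made-up HTML fragment (the class names follow the snippets above; the markup itself is illustrative):

```python
import re
from bs4 import BeautifulSoup

# Illustrative markup modeled on the listing page structure.
html = """
<div class="products-loadmore"><a href="/experiences/page/2">Load more</a></div>
<div class="single-experience"><a href="/experiences/night-ride">Night Ride</a></div>
<div class="single-experience"><a href="/experiences/category/rides">Rides</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# One CSS selector: every anchor whose href starts with /experiences/.
links = [a["href"] for a in soup.select('a[href^="/experiences/"]')]

# Equivalent regex attribute match, which bs4 also supports.
links_re = [a["href"] for a in soup.find_all("a", href=re.compile(r"^/experiences/"))]

# Category pages match too, so drop them afterwards.
event_links = [h for h in links if "/category/" not in h]
```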

surreal30 (Contributor Author), Aug 28, 2024:

I don't understand this completely. Can you please rephrase it?
Got it now.

    url = load_more.find('a').get('href')

    new_page = session.get(f"{BASE_URL}{url}")
    soup = BeautifulSoup(new_page.text, 'html.parser')
Contributor:

Do not reuse soup here; it causes confusion in the next few lookups about whether the content comes from the first page or the others.
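One way to keep the lookups unambiguous is to give each page's parse tree its own name, for example (names and fragments are illustrative):

```python
from bs4 import BeautifulSoup

# Illustrative fragments standing in for the first and a paginated page.
first_page_html = '<div class="single-experience"><a href="/experiences/a">A</a></div>'
next_page_html = '<div class="single-experience"><a href="/experiences/b">B</a></div>'

# Separate names make later lookups explicit about which page they read.
first_soup = BeautifulSoup(first_page_html, "html.parser")
next_soup = BeautifulSoup(next_page_html, "html.parser")

links = [a["href"] for a in first_soup.find_all("a")]
links += [a["href"] for a in next_soup.find_all("a")]
```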

Contributor Author:

Ok. Will change it.

    load_more = soup.find('div', class_='products-loadmore')

event_links = []
for event_div in event_divs:
Contributor:

Unsure about the need for event_divs. Just iterate over anchor tags instead.

Contributor Author:

Will check this and update it.


    return events

def make_event(event):
Contributor:

Add some comments as well please.

Suggested change:
- def make_event(event):
+ def make_event(soup):
Contributor Author:

Will change and add comments.

soup = BeautifulSoup(res.text, 'html.parser')
event_divs = soup.find_all('div', class_='single-experience')

# Fetch events from other pages.
Contributor:

Took me a while to figure out how/why this works. It would be nice to document it with a comment:

"PIT supports their infinite scroll without JavaScript via pagination. We paginate through all these pages and collect the event URLs."

Contributor Author:

Ok. Will add more comments.

Yes. JavaScript was turned off, which is why infinite scroll was not working. I will change this logic.

Contributor:

I think we can assume that important events (happening in Bangalore, featured, or upcoming) will always show up on the main page, and drop the entire pagination code, since it just adds complexity. The code then becomes:

  1. Fetch home page
  2. find all experience links
  3. Iterate/Transform
  4. Generate
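Under that assumption, the whole fetch step collapses to something like the following sketch (parsing only; `fetch_event_links` and the sample markup are illustrative, not the PR's actual code):

```python
from bs4 import BeautifulSoup

def fetch_event_links(html: str) -> list[str]:
    # Steps 1-2: parse the home page and collect all experience links,
    # skipping category pages. No pagination loop needed.
    soup = BeautifulSoup(html, "html.parser")
    hrefs = [a["href"] for a in soup.select('a[href^="/experiences/"]')]
    return [h for h in hrefs if "/category/" not in h]

# Steps 3-4 would then iterate over these links, fetch each event page,
# transform it with make_event, and generate the output JSON.
```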


captn3m0 (Contributor) left a comment:

Most of the comments are minor code quibbles. Go through them for code clarity, but in a small codebase like this they matter for one reason only: readability. When the code breaks (and it inevitably will), we want the next person who edits it to be able to debug very quickly where it went wrong. For that, we want to emphasize readability and operating on objects instead of strings.

The other problem definitely needs correction: we need the data to conform to the https://schema.org/Event schema.

In particular:

  1. metrics is not a valid field. It will have to be converted to a list inside the description.
  2. options will need to be converted into offers. The price cannot embed the currency symbol, and must instead use [priceCurrency](https://schema.org/priceCurrency).
  3. A url is mandatory.
  4. Location needs to match Place/PostalAddress if possible. Hardcode the "PIT" office address; the others could perhaps be plain-text addresses.
  5. dates.eventDate should instead be converted to startDate and endDate, which should be accurate datetimes. I noticed eventDate is currently set to midnight, which shouldn't be the case.
  6. bookingBeginDate should instead become the offer's availabilityStarts.
  7. Convert the long --------- separators to newlines for better readability.
  8. Optional, but nice to have: find a way to drop the imprecise offers. For the rides, L1 and L2 are real offers, but Rental Bike/Transport my Bike are add-ons. If we keep all four, it appears as if the lowest price for the experience is 300 INR (Transport my Bike), while the truth is 2250+300. There are lots of different ways to do this; hopefully you can pick something that has a good chance of working across events.
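Putting those points together, the target shape per event would look roughly like this (all values are illustrative, not scraped):

```python
# One event restructured to match https://schema.org/Event: offers instead
# of options, startDate/endDate instead of eventDate, availabilityStarts
# instead of bookingBeginDate. Every value below is made up for illustration.
event = {
    "@context": "https://schema.org",
    "@type": "Event",
    "name": "Sample Ride",
    "url": "https://www.pedalintandem.com/experiences/sample-ride",  # mandatory
    "description": "Route details\nWhat to bring",  # newlines, not ---------
    "startDate": "2024-09-15T05:30:00+05:30",  # accurate datetimes,
    "endDate": "2024-09-15T09:00:00+05:30",    # not midnight placeholders
    "location": {
        "@type": "Place",
        "name": "Pedal in Tandem",  # hardcoded PIT office address
        "address": {"@type": "PostalAddress", "addressLocality": "Bengaluru"},
    },
    "offers": [
        {
            "@type": "Offer",
            "name": "L1",
            "price": 2250,            # numeric, no currency symbol
            "priceCurrency": "INR",
            "availabilityStarts": "2024-09-01T00:00:00+05:30",
        },
    ],
}
```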

@captn3m0 added the "Event Source" label (Issues to do with a single Event Source) on Aug 28, 2024.
@captn3m0 changed the title from "Feat: Create python script to scrap pedalintandem" to "[Event Source] Pedal in Tandem" on Aug 28, 2024.
captn3m0 (Contributor):
  1. I've updated the description.
  2. Check if you can find a way to include only events happening in and around Bangalore. We don't want blr.today to list events happening in AP.
  3. See if event categories can be picked up as keywords.
  4. Add a keyword NOTINBLR to any events outside of BLR city boundaries. There is some nuance here that I need to document, but weekend getaways, multi-day trips, etc. are not events I want to highlight, so tagging them as NOTINBLR will help me filter those out.
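Point 4 could be sketched as a small tagging helper; the location check here is a naive substring match on a text location and only a placeholder for whatever rule ends up working across events:

```python
BLR_NAMES = ("bengaluru", "bangalore")

def tag_not_in_blr(event: dict) -> dict:
    # Append the NOTINBLR keyword when the event's location text does not
    # mention Bengaluru, so downstream filtering can drop these events.
    keywords = list(event.get("keywords", []))
    location = str(event.get("location", "")).lower()
    if not any(name in location for name in BLR_NAMES):
        keywords.append("NOTINBLR")
    event["keywords"] = keywords
    return event
```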

captn3m0 (Contributor) left a comment:

Break into more functions, add some comments, and use the parse_date code from trove.

captn3m0 (Contributor):
Please add @context: https://schema.org and validate your schema properly.

[screenshot: schema validator output]

surreal30 (Contributor Author):
This is error free now. But I am not sure how to add a permanent address to this.
[screenshot: validator output, 2024-09-11]

Labels: Event Source (Issues to do with a single Event Source)
Project status: WIP
2 participants