[Event Source] Pedal in Tandem #40
base: main
Conversation
src/pedalintandem.py
Outdated
date_opts = options_selector.find('div', class_ = 'product-variations-variety').find('select').find_all('option')
for date_opt in date_opts:
    booking_begin = str(datetime.strptime(date_opt['data-booking-begin-at'], "%Y-%m-%d %H:%M:%S %Z"))
    event_date = str(datetime.strptime(date_opt.get_text().strip(), "%d-%b-%Y"))
This method seems a little convoluted: I convert the string to a datetime, then back to a string, and then to a datetime again before storing it in the JSON. Should I drop _date() and use the datetime objects here directly?
Extract to datetime or date objects, and convert to strings only at the last step.
Got it.
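A minimal sketch of the suggested approach (the raw strings below are made up to mirror the scraped attributes): parse once into datetime/date objects, keep the objects around, and serialize only when building the JSON.

```python
from datetime import datetime

# Hypothetical raw values standing in for the scraped option attributes.
raw_begin = "2024-03-01 10:00:00 UTC"
raw_date = "09-Mar-2024"

# Parse once into real objects and keep them as objects...
booking_begin = datetime.strptime(raw_begin, "%Y-%m-%d %H:%M:%S %Z")
event_date = datetime.strptime(raw_date, "%d-%b-%Y").date()

# ...then convert to strings only at the last step, when writing the JSON.
record = {
    "bookingBeginDate": booking_begin.isoformat(),
    "eventDate": event_date.isoformat(),
}
```

This avoids the string → datetime → string → datetime round trips, and isoformat() gives an unambiguous serialization.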
src/pedalintandem.py
Outdated
def fetch_events_links(session):
    res = session.get(f"{BASE_URL}/experiences", impersonate="chrome")
Please confirm whether the code breaks without chrome impersonation. If not, we should just use requests.
Ok. Will remove impersonate if the code doesn't break.
src/pedalintandem.py
Outdated
load_more = soup.find('div', class_ = 'products-loadmore')
while load_more != None:
    url = load_more.find('a').get('href')
Switch to a single CSS selector here using select, with a selector like div.products-loadmore a[href^="/experiences/"], to select all anchor tags whose href starts with /experiences/. This gives category pages as well, so we can drop them after that. You can also write a regex for the attribute selector, which bs4 supports.
I don't understand this completely. Can you please rephrase it?
Got it now.
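A sketch of both techniques against a minimal stand-in page (the markup below is assumed, not the real site's):

```python
import re
from bs4 import BeautifulSoup

# Minimal stand-in for the listing page markup (assumed, not the real site).
html = """
<a href="/experiences/night-ride">Night Ride</a>
<a href="/experiences/category/rides">Rides category</a>
<a href="/about">About</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute-prefix selector: every anchor whose href starts with /experiences/.
links = [a["href"] for a in soup.select('a[href^="/experiences/"]')]

# The prefix also matches category pages, so drop them in a second pass.
event_links = [href for href in links if "/category/" not in href]

# The same selection using a regex on the attribute, which bs4 also supports.
regex_links = [a["href"] for a in soup.find_all("a", href=re.compile(r"^/experiences/"))]
```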
src/pedalintandem.py
Outdated
url = load_more.find('a').get('href')

new_page = session.get(f"{BASE_URL}{url}")
soup = BeautifulSoup(new_page.text, 'html.parser')
Do not reuse soup here; it causes confusion in the next few lookups about whether the content is from the first page or the later ones.
Ok. Will change it
src/pedalintandem.py
Outdated
load_more = soup.find('div', class_ = 'products-loadmore')

event_links = []
for event_div in event_divs:
Unsure about the need for event_divs; just iterate over anchor tags instead.
Will check this and update it.
src/pedalintandem.py
Outdated
    return events

def make_event(event):
Add some comments as well please.
- def make_event(event):
+ def make_event(soup):
Will change and add comments.
src/pedalintandem.py
Outdated
soup = BeautifulSoup(res.text, 'html.parser')
event_divs = soup.find_all('div', class_ = 'single-experience')

# Fetch events from other pages.
Took me a while to figure out how/why this works. Would be nice to document it with a comment: PIT supports infinite scroll without JavaScript via pagination; we paginate through all these pages and collect the event URLs.
Ok. Will add more comments.
Yes, JavaScript was turned off; that is why the infinite scroll was not working. I will change this logic.
I think we can make the assumption that important events (happening in Bangalore, featured, or upcoming) will always show up on the main page, and drop the entire pagination code since that just adds complexity. So the code becomes:
- Fetch home page
- find all experience links
- Iterate/Transform
- Generate
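A minimal sketch of that simplified flow, assuming the same anchor-prefix selection works on the main listing page (the function name and markup are illustrative):

```python
from bs4 import BeautifulSoup

BASE_URL = "https://www.pedalintandem.com"

def fetch_event_links(html):
    """Collect unique event URLs from the main page only - no pagination."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.select('a[href^="/experiences/"]'):
        href = a["href"]
        if href not in links:  # keep first-seen order, drop duplicates
            links.append(href)
    return [f"{BASE_URL}{href}" for href in links]
```

The caller would fetch the home page once (e.g. res = requests.get(f"{BASE_URL}/experiences")), pass res.text in, and then iterate/transform/generate.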
Most of the comments are minor code quibbles - go through them for code clarity, but in a small codebase like this they only matter for one reason: readability. When the code breaks (and it inevitably will), we want the next person who edits this code to be able to very quickly debug where it went wrong. For that we want to emphasize readability, and operating on objects instead of strings.
The other problem definitely needs correction: We need the data to be as per the https://schema.org/Event schema.
In particular:

- `metrics` is not a valid field. It will have to be converted to a list inside the description.
- `options` will need to be converted into `offers`. The price cannot use `₹`, and must instead use [priceCurrency](https://schema.org/priceCurrency).
- A `url` is mandatory.
- Location needs to match Place/PostalAddress if possible. Hardcode the "PIT" office address, and the others could be text addresses perhaps.
- `dates.eventDate` should instead convert to `startDate` and `endDate`, which should be accurate datetimes. I noticed eventDate currently is set to midnight, which shouldn't be the case.
- `bookingBeginDate` should instead be the offer's `availabilityStarts`.
- Convert the long `---------` to newlines for better readability.
- Optional, but nice to have: find a way to drop the imprecise offers. For the rides, L1 and L2 are real offers, but Rental Bike/Transport my Bike are add-ons. If we keep all four, it appears as if the lowest price for the experience is 300 INR (Transport my Bike) while the truth is 2250+300. Lots of different ways, but hopefully you can pick something that has a good chance of working across events.
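Putting those points together, the target shape might look like this sketch; only the field names come from https://schema.org/Event, and every value below is made up for illustration:

```python
# Illustrative https://schema.org/Event record; all values are assumptions.
event = {
    "@context": "https://schema.org",
    "@type": "Event",
    "name": "Night Ride",
    "url": "https://www.pedalintandem.com/experiences/night-ride",  # mandatory
    "startDate": "2024-03-09T06:00:00+05:30",  # accurate datetime, not midnight
    "endDate": "2024-03-09T10:00:00+05:30",
    "description": "Guided night ride.\nDistance: 40 km\nElevation: 300 m",
    "location": {
        "@type": "Place",
        "name": "Pedal in Tandem",  # hardcoded "PIT" office address goes here
        "address": {"@type": "PostalAddress", "addressLocality": "Bengaluru"},
    },
    "offers": [
        {
            "@type": "Offer",
            "name": "L1",
            "price": "2250",         # numeric string, no currency symbol
            "priceCurrency": "INR",  # currency expressed here, not with a sign
            "availabilityStarts": "2024-03-01T10:00:00+05:30",
        },
    ],
}
```

The add-on offers (Rental Bike/Transport my Bike) would simply be excluded from the `offers` list so the lowest advertised price stays meaningful.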
break into more functions, add some comments, and use the parse_date code from trove.
Source: https://www.pedalintandem.com/experiences/
A few notes:
Currently does not: