Important dates scraper #52

Open
arkon opened this issue Apr 17, 2016 · 9 comments

Comments

@arkon
Contributor

arkon commented Apr 17, 2016

We should scrape the important dates info off of places like the Faculty of Arts & Science or UTM websites.

EDIT: This is a better list

@anderson202
Contributor

I'd love to give the UTM scraper a try. Is there anything I should know or read about before I start?

@qasim
Member

qasim commented May 11, 2016

@anderson202 yes please! Give it a go and if you have any questions, we can answer them.

I have a very basic wiki here with information: https://github.com/cobalt-uoft/uoft-scrapers/wiki but it really isn't a lot. Have a look around at the other scrapers to see what's up.

For this one, UTMDates as the scraper name sounds appropriate.

We can also discuss the schema format we want to go with. Any ideas?

@anderson202
Contributor

@qasim I'm definitely a newbie to this so I'm not too sure what the format should look like.

The basic info we need would be the date and the details of what happens on that day. Maybe we can list which academic session the date falls in as well.

A quick question: how should the scraper work? Should it scrape everything it can for upcoming dates, or only a specific session or a specific date?

@kashav
Member

kashav commented May 11, 2016

+1 on including the session, I'm thinking something like:

{
  "date":String,
  "session":String,
  "events":[String]
}

It looks like the UTM mobile site has links to two years' worth of data. I think the scraper can take a year parameter and then scrape <year>5 and <year>9 for the two sessions available.

For example (year = 2016):

Edit:

Looks like they actually have data since the 2010-11 school year - http://m.utm.utoronto.ca/importantDates.php?mode=full&session=20105
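
A minimal sketch of the year-to-session idea above, assuming the session code is simply the year followed by 5 or 9 and that the URL pattern matches the link in this comment. The helper name `session_urls` is hypothetical and not part of the actual scraper.

```python
# Base URL taken from the link in this thread; everything else is illustrative.
BASE_URL = "http://m.utm.utoronto.ca/importantDates.php"


def session_urls(year):
    """Return the URLs for the two sessions (<year>5 and <year>9) of a year."""
    urls = []
    for suffix in ("5", "9"):
        # Assumption: a session code is the year with the suffix appended.
        session = f"{year}{suffix}"
        urls.append(f"{BASE_URL}?mode=full&session={session}")
    return urls


for url in session_urls(2016):
    print(url)
```

Looping over a range of years (for instance from 2010 onward) would then cover every session the mobile site appears to expose.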

@anderson202
Contributor

anderson202 commented May 11, 2016

Wow I didn't even think of using the mobile site. It's so much cleaner.

I'll start working on it and see if I can contribute to this. Thanks.

Edit:
@kshvmdn if I follow your format, wouldn't that produce a separate file for each day? Would it be better to alter it in some way and return one file per session instead?

For example, would this work?
{ "session": String, "dates": [{"date": String, "events": String}, ...] }

@kashav
Member

kashav commented May 11, 2016

I'll take the UTSGDates scraper!

@kashav
Member

kashav commented May 12, 2016

@anderson202 That's actually what we want! Take a look at the athletics and shuttle scrapers; they work the same way.

I got started on the UTSG scraper and I found it might be better to use the following format instead:

{
  "date": String,
  "session": String,
  "events": [{
    "end_date": String, // some go on for more than a single day (e.g. winter break)
    "campus": String,
    "description": String
  }]
}

This will allow us to merge events across campuses for each date, like we do with the athletics scraper (take a look at this). The API ends up being a lot cleaner this way.
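
As a rough sketch of how merging across campuses could work with this format, assuming each campus scraper yields a list of per-date documents shaped like the schema above. The function name `merge_by_date` and the sample records are made up for illustration; this is not the code the athletics scraper uses.

```python
from collections import OrderedDict


def merge_by_date(*campus_docs):
    """Combine per-campus lists of date documents into one document per date."""
    merged = OrderedDict()
    for docs in campus_docs:
        for doc in docs:
            # Create the date entry on first sight, then append events to it.
            entry = merged.setdefault(doc["date"], {
                "date": doc["date"],
                "session": doc["session"],
                "events": [],
            })
            entry["events"].extend(doc["events"])
    return list(merged.values())


# Placeholder records, not real scraped data.
utm = [{
    "date": "2016-05-09", "session": "20165",
    "events": [{"end_date": "2016-05-09", "campus": "UTM",
                "description": "placeholder event A"}],
}]
utsg = [{
    "date": "2016-05-09", "session": "20165",
    "events": [{"end_date": "2016-05-09", "campus": "UTSG",
                "description": "placeholder event B"}],
}]

merged = merge_by_date(utm, utsg)
```

Because each event carries its own campus field, a single date document can hold events from both campuses without losing where each one came from.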

@anderson202
Contributor

anderson202 commented May 12, 2016

I think I have the UTM scraper done, but I'm not sure how the JSON files should be named. The ones I have currently are simply named after the date (or period) of the event as shown on the mobile site. Should I change that to a specific format before making a pull request?

@kashav
Member

kashav commented May 12, 2016

We use the ISO 8601 format for dates. It isn't too hard to convert regular dates to this format; we do it in a lot of our scrapers using Python's datetime module.

The files can take this date as the name.
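
A small sketch of that conversion, assuming the site shows dates like "May 12, 2016". That input format is a guess, so the `fmt` argument would need adjusting to whatever the page actually uses. The helper names are hypothetical.

```python
from datetime import datetime


def to_iso_date(text, fmt="%B %d, %Y"):
    """Parse a human-readable date and return it as an ISO 8601 date string."""
    # fmt is an assumed input format; change it to match the scraped text.
    return datetime.strptime(text, fmt).date().isoformat()


def json_filename(text, fmt="%B %d, %Y"):
    """Build a JSON file name from the ISO 8601 form of a date."""
    return to_iso_date(text, fmt) + ".json"


print(to_iso_date("May 12, 2016"))
print(json_filename("May 12, 2016"))
```

Events that span a period would need the start and end parsed separately, with the start date as the file name and the end going into `end_date`.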


4 participants