Add audio/video metadata #72

Closed
rperlin-ela opened this issue Sep 2, 2020 · 38 comments · Fixed by #116
Labels
  • comm: ☝️ good first issue (Community: good for newcomers)
  • effort: 2 🥈 (med) (Average amount of work)
  • focus: ⚙️ functionality (Requires script changes, TypeScript or JS)
  • priority: 3 (high) (Most critical. Should be completed first.)
  • type: ✨ enhancement (New feature, improvement, or functionality)

Comments

@rperlin-ela
Collaborator

Description

We'd like to import metadata automatically from YouTube and/or Archive.org and show it under the language name above the audio/video embed in the dialog. Probably the most important thing would be what both YouTube and Archive.org call the Description, but maybe a few other fields like rights and attribution. We also have this info in spreadsheet form if that’s easier.

Resolution

@abettermap thinks Youtube may have an API we can use

@abettermap abettermap added effort: 2 🥈 (med)🥈 Average amount of work priority: 0 (wishlist) Some day! type: ✨ enhancement✨ New feature, improvement, or functionality comm: ☝️ good first issue☝️ Community: good for newcomers labels Sep 17, 2020
@abettermap abettermap added focus: ⚙️ functionality⚙️ Requires script changes (TypeScript or JS) priority: 3 (high) Most critical. Should be completed first. and removed priority: 0 (wishlist) Some day! labels Oct 13, 2020
@abettermap
Contributor

@rperlin-ela

YouTube

We will definitely need an API key for this. Here are the steps, which are very similar to my instructions for setting up the Sheets API token, except that you can reuse the same project and even the same API key; you just need to enable the YouTube API:

  1. https://console.developers.google.com/apis/dashboard

  2. Select your project if needed:

    image

  3. If needed, click the project in the popup.

  4. Click this:

    image

  5. Type youtube

  6. Click the result with v3 in it:

    image

  7. Click ENABLE

Should be all set. Go back to your Dashboard and confirm it's there then let me know it's ready:

image

Archive.org

Question: why don't any of the records use this for audio yet? It looks like ELA has quite a few, just wondering why they're not in the dataset.

Anyhoo, no API key required. Here's an example of the meta for your "Kabardian comparative" for example:

https://archive.org/metadata/ela_kabardian_comparative/metadata

Which returns:

image

Or go straight for the kill with one level deeper: https://archive.org/metadata/ela_kabardian_comparative/metadata/description

That's the info we need, right? Easy on my end, BUT how will those files be embedded as audio? I see a bunch of formats listed, but also archive.org's docs suggest an iframe, which is normally used for videos.

The audio embed I have set up in the code is not going to do anything with an API URL like that, and vice versa. So, will I need to do a check to see if the Audio url includes archive.org then process it accordingly to make it work with one of these?

  1. The webplayer/video thing to use as an iframe embed (I don't know enough about these formats)

  2. Sift through the full API results for that file to find the WAVE?

    image

I see how to get the full list as well, but I assume we'd only need one of them?

Here's a full list for another of your audio items as a comparison: https://archive.org/download/mid-2003-06-13c

I'm not sure we'll get away without me doing some kind of check/processing on my end, so I could see this being the flow of code:

  1. You provide me with a consistent URL in Audio column to the wav file or whatever
  2. I check Audio to see if it contains archive.org
  3. If so, parse it out so I can get just the item ID (I don't like this already)
  4. Use the ID to hit their API so that I can get the Description
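Steps 2 and 3 above could be sketched roughly like this. The function name and the set of accepted path segments are assumptions for illustration, not anything from the project code:

```typescript
// Hypothetical helper: pull the item identifier out of an archive.org URL.
// Assumes the ID is the path segment right after /download/, /details/,
// /embed/ or /metadata/. Returns null for anything that is not archive.org.
function archiveItemId(url: string): string | null {
  const match =
    /^https?:\/\/(?:www\.)?archive\.org\/(?:download|details|embed|metadata)\/([^\/?#]+)/.exec(
      url.trim()
    );
  return match?.[1] ?? null;
}
```

A `null` result doubles as the "does Audio contain archive.org" check from step 2.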

OR, we treat it as a video/webplayer:

  1. You populate the Video field with the embed URL, e.g. https://archive.org/embed/mid-2003-06-13c
  2. I follow steps 3-4 above (3 is easier this way)

The obvious drawback of the "video" approach is that you couldn't have audio AND video for the same record.

OR:

  1. you give me the metadata API URL and I sift through the results until I find the WAVE format (assuming it's always there).
  2. I grab the Description while I'm there.
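A minimal sketch of that last option, assuming the metadata response has a `files` array whose entries carry `name` and `format` fields, and that a file is served from `https://archive.org/download/<item>/<name>`. The interface and function names are made up for illustration:

```typescript
// Assumed shape of the parts of the metadata response used here.
interface ArchiveFile {
  name: string;
  format?: string;
}
interface ArchiveItem {
  files?: ArchiveFile[];
}

// Return a download URL for the first WAVE file, or null if there is none.
function findWaveUrl(itemId: string, item: ArchiveItem): string | null {
  const wave = (item.files ?? []).find((file) => file.format === "WAVE");
  if (!wave) return null;
  // File names may contain spaces or subfolders, so encode each path part.
  const path = wave.name
    .split("/")
    .map((part) => encodeURIComponent(part))
    .join("/");
  return `https://archive.org/download/${itemId}/${path}`;
}
```

The `null` case matters because "assuming it's always there" is exactly the part that cannot be guaranteed.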

I'm going to be hitting the API no matter what so whatever is cleanest for that. Assuming there is enough consistency of results then I'm sure I could find whatever I need there and deliver it to the user in whatever format you want. Personally I would go with the "video" embed since it has a lot more options and could be useful for those languages that have like 100+ wave files:

image

If you've got instances of Video AND Audio though, then it would be a harder sell for the embed approach. 🤔

For Future Jason

here is the archive's metadata API: https://blog.archive.org/2013/07/04/metadata-api/
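Also for Future Jason, a rough sketch of building the metadata URLs shown earlier and reading the title and description out of the full response. It assumes the full response has a top-level `metadata` object and that a field may come back as either a string or an array of strings. All names here are hypothetical, and the HTTP call is injected so nothing depends on a particular fetch implementation:

```typescript
// Caller supplies the HTTP layer, e.g. (url) => fetch(url).then((r) => r.json()).
type GetJson = (url: string) => Promise<unknown>;

// Build https://archive.org/metadata/<item>[/<sub>/<path>].
function archiveMetadataUrl(itemId: string, ...path: string[]): string {
  const parts = [itemId, ...path].map((part) => encodeURIComponent(part));
  return `https://archive.org/metadata/${parts.join("/")}`;
}

// Pick title and description out of the full metadata response.
function readArchiveMeta(json: unknown): { title: string; description: string } | null {
  const meta = (json as { metadata?: Record<string, unknown> } | null)?.metadata;
  if (!meta) return null;
  const asText = (value: unknown): string =>
    Array.isArray(value) ? value.join("\n") : typeof value === "string" ? value : "";
  return { title: asText(meta["title"]), description: asText(meta["description"]) };
}

async function fetchArchiveMeta(itemId: string, getJson: GetJson) {
  return readArchiveMeta(await getJson(archiveMetadataUrl(itemId)));
}
```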

@abettermap
Contributor

@rperlin-ela

Hope that made sense, kind of just learning it as I go, so let me know what you think.

@rperlin-ela
Collaborator Author

rperlin-ela commented Oct 13, 2020 via email

@abettermap
Contributor

Great, thanks for these instructions. I’ll get on them once we’re ready to get started.

Feel free to start any time, this might be one of the first things I work on since it's fresh in my brain.

Maybe this will be obvious once we’re working on it, but can we choose what metadata gets pulled by the API? Want to clarify this with Dan, whose request this was, but I think Description is the key thing, maybe one or two others.

If I'm understanding correctly, yes. Archive.org doesn't have more than what I showed in the screenshot I think, but as for YouTube there may be other metadata. You can skim through the options here: https://developers.google.com/youtube/v3/docs/videos#properties
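To make "choosing what gets pulled" concrete, here is a hedged sketch of a request URL for the YouTube Data API v3 `videos` endpoint that asks only for the snippet and trims it to selected fields with the `fields` parameter. The function name and the default field list are assumptions:

```typescript
// Hypothetical URL builder for a single-video metadata lookup.
// `wanted` lists the snippet properties to keep in the response.
function youTubeVideoUrl(
  videoId: string,
  apiKey: string,
  wanted: string[] = ["title", "description"]
): string {
  const query: [string, string][] = [
    ["part", "snippet"],
    ["id", videoId],
    ["fields", `items(snippet(${wanted.join(",")}))`],
    ["key", apiKey],
  ];
  const queryString = query
    .map(([name, value]) => `${name}=${encodeURIComponent(value)}`)
    .join("&");
  return `https://www.googleapis.com/youtube/v3/videos?${queryString}`;
}
```

Adding another field later would just mean adding its name to `wanted`, provided it lives under `snippet`.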

Highlights from today's call

  • Embeds are fine. Aka the Archive.org web player, aka <iframe>, aka Jason can remove the audio player component.
  • Once the code is ready, Ross will point the Audio URL(s) to the embed URL format rather than the .wav. Example: https://archive.org/embed/ela_kabardian_comparative
  • Jason will assume all audio is from archive.org

Transcript files

Unless these are part of something in the videos or audio files already in existence, I don't think this is an easy option. If you had an external file somewhere, like Dropbox, we'd have to either:

  1. Add a new column like Transcript in the spreadsheet
  2. Throw it at the end of Description
  3. Use a CMS like Sanity

@abettermap
Contributor

@rperlin-ela should I assume that all the archive.org instances will have a playlist? Or maybe a better question is, is this format ok when they don't?

image

I don't think it's harming anything when there's only a single item, and it keeps the UI consistent with the instances which do have playlists.

@abettermap
Contributor

Re: additional metadata fields, I think you're also going to want title for both YouTube and Archive instances. Up until now I think I've been using the Endo, so title would be a nice upgrade.

@abettermap
Contributor

...and I think that's about all that's useful in the YouTube API. I know this just looks like code but you get the idea:

image

@rperlin-ela
Collaborator Author

Yes, I think what's in there is now just Language, but Title followed by Description would be great.

I reckon some archive.org instances will have playlists but not all (as with Youtube). Format from the screenshot seems fine to me and good to keep consistent.

I believe the Youtube API is now enabled — let me know if not or if you need anything else from my end.

In case it helps I uploaded a new tileset just now with what I think should be the correct embed URL from Archive for Neo-Mandaic. (Sorry if that was jumping the gun—I see for the time being it's giving an error.)

@abettermap
Contributor

Yes, I think what's in there is now just Language, but Title followed by Description would be great.

Awesome. Looks like YouTube API also has some kind of caption/transcript endpoint but I'm not going down that road because:

  1. It doesn't look to be part of the regular endpoint that provides the Title/Descrip
  2. ...and may involve a second (or third) API call.
  3. ...which counts even more against your API account quota (I assume you didn't put any billing info in, which should be fine but I wouldn't want to thread the needle with 3-4 requests for each time a video is opened)
    1. ...and based on a skim of ELA's uploads, many do not have transcripts anyway.
  4. ...and since archive.org does not have the same API functionality, we'd really be mixing up the UI consistency even more
  5. ...only to basically be recreating YouTube, which already supports transcripts within the full view: image
  6. ...and my understanding for the metadata request in this SOW was just that, metadata. Not a full-on YouTube replacement, which captions/transcript seems to be leaning towards in a small way.

I believe the Youtube API is now enabled — let me know if not or if you need anything else from my end.

Thanks for doing that although I'm not sure it's set up properly as I'm getting an error:

Requests to this API youtube method youtube.api.v3.V3DataVideoService.List are blocked.

Did you create a new key or just enable the YouTube API for the existing key? If you enabled for existing key then you should see both the Sheets and YouTube APIs in the restrictions list:

image

I'm 99% sure I have the correct API URL because I get what I expect when I put my own key in with the YouTube API enabled.

In case it helps I uploaded a new tileset just now with what I think should be the correct embed URL from Archive for Neo-Mandaic. (Sorry if that was jumping the gun—I see for the time being it's giving an error.)

Nice, I looked at your URL in Final Output and it looks correct and is working for me locally (I made some progress switching over to the embed-only approach), including with the playlist parameter appended to it (even though there's only one file for this guy):

image

Tomorrow I'd like to start working on parsing out your URLs so I can get the video IDs to use in the API.

@rperlin-ela
Collaborator Author

I think I'm seeing captions on all the videos that have them — check Wakhi for example. So all seems good. But from other experiences with Youtube I wonder if people who haven't enabled some kind of setting will see it.

I followed what you had in the instructions, so I think all I had done was enable the key. But I see in the dashboard it's failing 100% of the 13 times it was tried. Sorry if this was off base, but something I saw made it seem like I needed to "create credentials", specifying a little further info about the use and then getting the opportunity to restrict key, but that seemed to result in a new API key, which I'm sending you by email. Kind of fumbling around in the dark, but hopefully I can unravel if necessary.

@abettermap
Contributor

I think I'm seeing captions on all the videos that have them — check Wakhi for example. So all seems good.

Yeah all the captions will be there, we might be talking about two different things. You mentioned something about transcripts and in YouTube the transcript is a textual list of placemarks within the video's captions/subtitles I think? Like:

image

or

image

But from other experiences with Youtube I wonder if people who haven't enabled some kind of setting will see it.

Yeah if you click the gear icon it has a cc option. This seems to coincide with but remain independent from cc subtitles etc.

image

Anyhoo probably not relevant I guess.

As for the new API key, sorry if my instructions steered you off course, I may have missed something as it's hard to know what's on your screen (I have multiple projects in Google so mine might look different). It's probably not a bad idea to have two API keys anyway (one youtube, one sheets), and it looks like the youtube key you just made is working so we should be set but I'll let you know if not!

@rperlin-ela
Collaborator Author

rperlin-ela commented Oct 15, 2020 via email

@abettermap
Contributor

Definitely not going to do that but yes I think with an extra week of time something like that could be achieved. It's just a bunch of data in the API, someone would have to create a very very complex component to sync with the video.

Shouldn't have brought it up, it's so far beyond everything else.

@rperlin-ela
Collaborator Author

rperlin-ela commented Oct 15, 2020 via email

@abettermap
Contributor

it's just a bunch of XML, it's not like they give you a component or HTML to work with. someone would have to hand-roll all that.

image

@abettermap
Contributor

if it's of use, the cc can be forced on:

image

@rperlin-ela
Collaborator Author

rperlin-ela commented Oct 15, 2020 via email

@abettermap
Contributor

yeah should be, kinda falls along same lines as how i'm playlistifying the Archive videos. Just make sure to continue to leave all your video URLs as bare as possible. I think these should be the only three scenarios?

https://www.youtube.com/embed/VIDEO_ID
https://www.youtube.com/embed/videoseries?list=PLAYLIST_ID
https://archive.org/embed/ela_kabardian_comparative

I can append parameters to the URL as needed.

The approach is kind of fragile since it relies on parsing on my end and consistency on yours, not to mention Google's APIs are kind of volatile. But in lieu of a CMS or additional column in the data, it's probably the best we can do for now.
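For the record, the parsing for those three scenarios might look something like this. It is only a sketch: the type and function names are invented here, and the character classes for IDs are an assumption about what YouTube and Archive identifiers can contain:

```typescript
type EmbedRef =
  | { source: "youtube"; kind: "video" | "playlist"; id: string }
  | { source: "archive"; kind: "item"; id: string };

// Accept only the three bare embed formats; anything else returns null.
function parseEmbedUrl(url: string): EmbedRef | null {
  const clean = url.trim();

  const playlistId = /^https:\/\/www\.youtube\.com\/embed\/videoseries\?list=([\w-]+)$/.exec(clean)?.[1];
  if (playlistId) return { source: "youtube", kind: "playlist", id: playlistId };

  const videoId = /^https:\/\/www\.youtube\.com\/embed\/([\w-]+)$/.exec(clean)?.[1];
  // "videoseries" without a list parameter is not a real video ID.
  if (videoId && videoId !== "videoseries") return { source: "youtube", kind: "video", id: videoId };

  const itemId = /^https:\/\/archive\.org\/embed\/([\w.-]+)$/.exec(clean)?.[1];
  if (itemId) return { source: "archive", kind: "item", id: itemId };

  return null;
}
```

Being strict here is deliberate: a URL with extra parameters or a different host fails loudly as `null` instead of producing a wrong ID.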

As for embed parameters, let me know if there are any others of interest:

None are difficult to add, it's not part of the API, just some extra cruft to append to the embed URLs.

@abettermap
Contributor

Not sure what the odds of this are but the video I was testing (Wakhi) appears to be the only one with an incorrect URL:

image

If it's a playlist then it needs videoseries like the others:

image

@abettermap
Contributor

I got the single-video API connection wired up, wasn't terribly hard and it definitely looks better than just the language for title.
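Roughly what the single-video wiring boils down to, as a sketch rather than the actual project code. It assumes the `videos` list response carries an `items` array whose entries have a `snippet` with `title` and `description`. The interface and function names are hypothetical:

```typescript
// Only the parts of the response this sketch reads.
interface YouTubeListResponse {
  items?: { snippet?: { title?: string; description?: string } }[];
}

// Return title/description for the first item, or null when nothing matched.
function readSnippet(
  response: YouTubeListResponse
): { title: string; description: string } | null {
  const snippet = response.items?.[0]?.snippet;
  if (!snippet) return null;
  return { title: snippet.title ?? "", description: snippet.description ?? "" };
}
```

A `null` here means the API found nothing for that ID, so the caller can fall back to showing just the language name.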

For descrip, you thinking above the video or below? Here it is on laptop:

image

And mobile (whatever that is!):

image

I don't know how long the average descrip is and that might dictate placement a bit, so let me know what you think.

@rperlin-ela
Collaborator Author

rperlin-ela commented Oct 15, 2020 via email

@abettermap
Contributor

I assume all the others you see are ok then?

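Ones I've looked at, yes, but that's something you'll want to QC on your end for this iteration and ongoing. If you're taking notes for the Big Comprehensive Manual, include the formats I mentioned above:

https://www.youtube.com/embed/VIDEO_ID
https://www.youtube.com/embed/videoseries?list=PLAYLIST_ID
https://archive.org/embed/ela_kabardian_comparative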

If any other variations are used accidentally, it won't work. If I'm missing any other scenarios, let me know. If Google changes their API, let Google let me know. :)

Re: QC in general- I think that's something that could be automated on your end. If you had a separate tab/sheet with formulas pointing to the main source tab, you could use formulas that automatically check for the usual Bad Data suspects:

  • Unacceptable values, e.g. "South Americaa" or "SOUTH AMERICA" in World Region. Bad example but you get the idea.
  • Trailing whitespace
  • Empty cells which should not be empty
  • Font Image Alt that does not start with https://www.dropbox.com/s/ and end with ?raw=1
  • Video/audio values that do not begin with "https://archive.org/embed/" or "https://www.youtube.com/embed/"
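The same checks could also be run as a script rather than as Sheets formulas. The sketch below is one way to express them, with heavy assumptions: the column names are taken from the bullets above, the list of allowed regions is a made-up subset, and the required-column check is limited to World Region:

```typescript
type Row = Record<string, string>;

// Hypothetical subset; the real list of regions lives in the spreadsheet.
const WORLD_REGIONS = ["South America", "Western Asia"];
const MEDIA_PREFIXES = ["https://archive.org/embed/", "https://www.youtube.com/embed/"];

// Return a list of human-readable problems for one spreadsheet row.
function checkRow(row: Row): string[] {
  const problems: string[] = [];
  const get = (column: string): string => row[column] ?? "";

  for (const column of Object.keys(row)) {
    const value = get(column);
    if (value !== value.trim()) problems.push(`${column}: leading or trailing whitespace`);
  }

  const region = get("World Region").trim();
  if (region === "") problems.push("World Region: empty");
  else if (WORLD_REGIONS.indexOf(region) === -1) problems.push(`World Region: unexpected value "${region}"`);

  for (const column of ["Video", "Audio"]) {
    const value = get(column).trim();
    if (value !== "" && !MEDIA_PREFIXES.some((prefix) => value.startsWith(prefix))) {
      problems.push(`${column}: must start with an archive.org or YouTube embed URL`);
    }
  }

  const alt = get("Font Image Alt").trim();
  if (alt !== "" && !(alt.startsWith("https://www.dropbox.com/s/") && alt.endsWith("?raw=1"))) {
    problems.push("Font Image Alt: must be a Dropbox link ending in ?raw=1");
  }
  return problems;
}
```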

Just brainstorming but those are several very-real things we've encountered on more than one occasion and it adds time to troubleshooting, communication, and data maintenance, so might be worth pursuing. Could just start with one column at a time to get some practice and see how it goes, and I think it would pay for itself in the long run.

@abettermap
Contributor

For description in the video modal, are you thinking of having it above the video or below? See my screenshots a few comments up.

@rperlin-ela
Collaborator Author

rperlin-ela commented Oct 16, 2020

The way you have it looks good to me: title on top and then description below. Length of Acehnese description is probably representative but it would be good to allow for the possibility of longer ones.

Thanks for the QC tips, you're obviously right— I just need someone a little more expert with Google Sheets, will ask.

@rperlin-ela
Collaborator Author

rperlin-ela commented Oct 16, 2020

(Sorry forgot that playlists and Archive were still to-do!)

@abettermap
Contributor

No worries. Have you had a chance to check on those extra parameters yet?

YouTube: https://developers.google.com/youtube/player_parameters#cc_lang_pref
Archive: https://archive.org/help/audio.php

@rperlin-ela
Collaborator Author

rperlin-ela commented Oct 16, 2020 via email

@abettermap
Contributor

Sounds good.

Questions

Youtube playlists meta title

The title differs from the video titles, so should there be some indication that it's a playlist? Aside from the playlist icon in the top-right, it's not immediately obvious that the title refers to a playlist:

yt-playlist

Could be something subtle like this:

yt-playlist-w-suffix

Maybe the (playlist) part would be less necessary when there is a description though (although I don't see many playlists with a description)? If the description mentions anything like "this collection of..." then the "(playlist)" suffix probably isn't needed.

ID/format issue?

I'm wondering if maybe this is a byproduct of standardizing the URLs, or a true typo, but this one isn't working (Neo-Aramaic). Your syntax is fine, just like I requested it:

https://www.youtube.com/embed/videoseries?list=PLcXFPx-z7B0pfN0ZhGj1bVpE9foHBR1nC

But that link doesn't work. If I found the correct playlist, then the URL would be this:

https://www.youtube.com/embed/videoseries?list=PL2BF759B18CCE5DD2

@abettermap
Contributor

I'd like to add some basic error catching for non-existent videos and playlists in case we come across another one, but I'm not sure which scenario to use:

  1. If the API doesn't find the match (in other words we won't have a title or description) then should I assume the video won't play either?
  2. ...or should I trust that your URLs are 100% always perfect (except for Neo-Aramaic, ha) and try to load the embed anyway?

Option 1 could just show "video not found" or something and not even show the player, while Option 2 (which I'm currently using) shows this:

image

followed by this if i try to play the video:

image

If my URL parsing (to extract the playlist/video ID) is accurate for one video and one playlist, then it should be accurate for all of them, so I'm leaning towards assuming that if the API doesn't find anything for the given video/playlist ID, then your URL must be incorrect.

On my end (the code side, not what the user sees) I set up a Sentry message to let us know if the YouTube API returned nothing for that URL. This doesn't tell me whether it's my fault or yours, but at least it will silently tell us which URL failed without anything crashing on the user.

@abettermap
Contributor

If you want to see what it looks like, here is the deploy: https://deploy-preview-116--languagemapping.netlify.app/details?id=645

archive.org stuff not ready but YT vids and playlists are working.

@rperlin-ela
Collaborator Author

All looking good. Error catching sounds good, go with your judgment on it.

Maybe the (playlist) part would be less necessary when there is a description though (although I don't see many playlists with a description)? If the description mentions anything like "this collection of..." then the "(playlist)" suffix probably isn't needed.

Yep, good call, this gives a whole new visibility to playlists, so I will work on filling in all the description for playlists, don't think anything additional is needed.

I'm wondering if maybe this is a byproduct of standardizing the URLs, or a true typo, but this one isn't working (Neo-Aramaic).

Fixed now — link was right, but playlist was temporarily/inadvertently private.

@abettermap
Contributor

Error handling

it was naive of me to ask "can we rely on 100% accurate data". the answer to
that question in the absence of validation and QC/QA, is always NO when
humans are responsible for entering the data. so, with that in mind i took
Sentry up a couple notches with the err handling now catching 3 scenarios:

  1. Incorrect URL format, which needs to be one of the three formats we
    discussed: YouTube playlist, embed, or Internet Archive embed. Could be a
    typo on your end, for example.
  2. Format was fine but no video/embed/etc. was found, e.g. your Neo-Aramaic
  3. General fetch failure catch-all for the scary unknown scenarios.

Great?

Yes it is. Big softball thrown in your direction to deal with video errors
falling in one of those 3 scenarios. I will use the Sentry info I added today
("tags" relevant to the scenarios) to create an alert so you'll get notified if
someone opens a bogus video URL. You will definitely want to subscribe to those alerts.

Hopefully there's a way in Sentry to pick and choose which alerts you
receive (instead of the full mess of error events), but it's worth it either way.
If you don't want them to go to your shared ELA acct then we can add another
acct as a separate user, but this will save me the trouble of notifying you any
time there is a video issue. It's already an automated process so might as well
take advantage of it.
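For reference, the three scenarios from the list above can be told apart with something like the sketch below. The tag names, types, and function name are invented for illustration and are not the tags actually sent to Sentry. The result of the API call is passed in as a plain value so the logic stays independent of any HTTP or Sentry code:

```typescript
type MediaErrorTag = "bad-url-format" | "not-found" | "fetch-failure";

// What happened when the metadata API was called for this URL.
type LookupOutcome =
  | { status: "ok"; itemCount: number }
  | { status: "threw"; error: unknown };

// The three accepted formats: YouTube playlist, YouTube video, Archive embed.
const KNOWN_FORMATS: RegExp[] = [
  /^https:\/\/www\.youtube\.com\/embed\/videoseries\?list=[\w-]+$/,
  /^https:\/\/www\.youtube\.com\/embed\/(?!videoseries)[\w-]+$/,
  /^https:\/\/archive\.org\/embed\/[\w.-]+$/,
];

// Return the scenario tag, or null when there is nothing to report.
// For a badly formatted URL the lookup outcome is ignored.
function classifyMediaProblem(url: string, outcome: LookupOutcome): MediaErrorTag | null {
  if (!KNOWN_FORMATS.some((format) => format.test(url.trim()))) return "bad-url-format";
  if (outcome.status === "threw") return "fetch-failure";
  if (outcome.itemCount === 0) return "not-found";
  return null;
}
```

The returned tag is the kind of value that could be attached to an error event so alerts can filter on it.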

Closed captioning

...cannot be forced on for videos or playlists which don't already have it, including those where the cc is auto-generated.

Questions

HTML description support?

i turned it on since i saw a <span> in one of your Internet Archive examples but might be a good idea if i disable it since there's no guarantee it will match our UI:

image

Next steps

Sentry alerts

Will get these wired up in the UI tomorrow or soon, otherwise kind of a waste of time if I have to forward the emails to you about data stuff. I also have it broken down into production and deploy environments, so you can see if they occurred during our internal testing/PR stuff or live site.

I haven't tested it with much besides local dev so far, and there are a lot of moving parts to all the error stuff, so fingers crossed. It's working great on my local env though, so no reason it shouldn't be the same in The Outside World.

Also can't really test much without a known error (i was leaning on your broken one to test), meaning i can't test the Sentry stuff including alerts either. might have to do a dummy one just so the setup doesn't get left in the dust until a real error happens.

More things I'm missing when

...it's not 12:54am? Most likely.

@rperlin-ela
Collaborator Author

rperlin-ela commented Oct 20, 2020 via email

@abettermap abettermap pinned this issue Oct 20, 2020
@abettermap
Contributor

HTML from Archive not important, better to stay consistent.

Ok I will remove it, just make sure to update your Archive descrips otherwise you'll see the HTML as HTML.

Sorry if I wasn’t clear — I don’t think we need the word “Playlist” for the playlists.

No worries, I'll remove it.

I seem to be getting an error now with the deploy when I go to play any (non-playlist) video.

Is it like this?

image

If so, it's happening regardless of where it's played, e.g. paste the URL into browser bar vs in the project, but if I remove cc_load_policy it seems to work fine:

I'm wondering if it has something to do with not setting the language via URL? I'm assuming it defaults to en, but that video for example only has English auto-generated, with Quechua legit:

image

How important is it to force the cc on? Since it's super inconsistent anyway should we just leave it up to the user?

@abettermap
Copy link
Contributor

The language parameter is cc_lang_pref but might not work anymore (see highlighted text):

image

@abettermap
Contributor

Messing with the iframe parameters has nothing to do with the YouTube API or the metadata we are adding, so in order to stay focused on the SOW I'm going to say let's drop the cc efforts for now. If it's important to you then feel free to create a wishlist issue.

@rperlin-ela
Collaborator Author

rperlin-ela commented Oct 20, 2020 via email

@abettermap
Contributor

Sooo guess what. The problem was actually from the playlist parameter I'm throwing onto the URL for Archive player. I was just being lazy leaving it in for YouTube because it "worked" but I must have only been testing on playlists. 🤦

cc_load_policy doesn't break anything once I remove the playlist parameter from the YouTube URLs.
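A sketch of the guard that would have avoided this, appending the playlist parameter only to Archive embeds and leaving every other URL untouched. The function name is hypothetical, and `playlist=1` as the exact parameter and value is an assumption based on the Archive embed help page linked earlier:

```typescript
// Append the Archive player's playlist parameter, but only to Archive embeds.
function withArchivePlaylist(embedUrl: string): string {
  // Leave YouTube (and anything else) alone.
  if (!/^https:\/\/archive\.org\/embed\//.test(embedUrl)) return embedUrl;
  // Do not add the parameter twice.
  if (/[?&]playlist=/.test(embedUrl)) return embedUrl;
  const separator = embedUrl.indexOf("?") === -1 ? "?" : "&";
  return `${embedUrl}${separator}playlist=1`;
}
```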

@abettermap abettermap mentioned this issue Oct 20, 2020
6 tasks
@abettermap abettermap unpinned this issue Oct 22, 2020
2 participants