
Podcast Detection

Ethan Marshall edited this page Jan 4, 2022 · 1 revision

Podbit detects which podcast each downloaded episode belongs to using configurable regular expressions. These may differ slightly from the regular expressions you are used to; see the resources below if you can't get them to work. In general, though, only a little knowledge of plain regex is required.

How does detection work?

On your system, podbit automatically creates a blank text file called "db" in your $XDG_DATA directory (on most systems this is at .local/share/podbit). Podbit reads this file line by line and splits each line at the first space. The text before the first space is the regex pattern to match. All text after it (including any further spaces) is the name of the podcast. The format is generally as follows:

{pattern} {name0} {name1} {etc...}
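
As a rough illustration of the format above, the following sketch splits a db line at its first space. It is not podbit's actual code: the helper name parseLine and the podcast name "Some Site Podcast" are made up for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// parseLine is a hypothetical helper (not taken from podbit's source).
// It splits a db line at the first space: the text before the space is the
// pattern, and everything after it, spaces included, is the podcast name.
func parseLine(line string) (pattern, name string, ok bool) {
	parts := strings.SplitN(line, " ", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", false // pattern or name is missing
	}
	return parts[0], parts[1], true
}

func main() {
	// "Some Site Podcast" is an invented name for illustration.
	pattern, name, ok := parseLine(`https://podcast.somesite.com/.*\.mp3 Some Site Podcast`)
	fmt.Println(pattern)
	fmt.Println(name)
	fmt.Println(ok)
}
```

Because only the first space acts as a separator, the pattern itself cannot contain a literal space, while the name can contain as many as you like.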

For each loaded episode from the download cache, the URL which it was downloaded from will be compared to the regex on each line of db. If the URL is a full match (not a partial match!), the episode is considered part of the podcast named. If not, the next line is considered until there are none left. If none match, the episode will be considered its own podcast and kept in URL form.

For example, suppose one episode of a podcast comes from the URL "https://podcast.somesite.com/episode0.mp3" and another episode of the same podcast comes from "https://podcast.somesite.com/episode1.mp3". We can deduce that the first part of the URL stays the same and only the media file name changes. So, we can construct a regex which matches the domain and accepts any media file name, like so: https://podcast.somesite.com/.*\.mp3.
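
The lookup described above could be sketched in Go roughly as follows. This is an assumption about how full matching might be done (by anchoring the pattern with ^ and $), not a copy of podbit's implementation; the entry type, the helper names and the podcast name are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// entry pairs a compiled pattern with a podcast name (hypothetical type).
type entry struct {
	re   *regexp.Regexp
	name string
}

// newEntry wraps the pattern in ^(?:...)$ so that only a full match counts.
func newEntry(pattern, name string) (entry, error) {
	re, err := regexp.Compile(`^(?:` + pattern + `)$`)
	if err != nil {
		return entry{}, err
	}
	return entry{re: re, name: name}, nil
}

// podcastFor returns the name of the first entry that matches url in full.
// If no entry matches, the URL itself stands in as the podcast name.
func podcastFor(entries []entry, url string) string {
	for _, e := range entries {
		if e.re.MatchString(url) {
			return e.name
		}
	}
	return url
}

func main() {
	e, err := newEntry(`https://podcast.somesite.com/.*\.mp3`, "Some Site Podcast")
	if err != nil {
		panic(err)
	}
	entries := []entry{e}
	for _, u := range []string{
		"https://podcast.somesite.com/episode0.mp3",
		"https://podcast.somesite.com/episode1.mp3",
		"https://podcast.somesite.com/episode1.mp3?ref=feed", // trailing text, so not a full match
		"https://other.example.org/show.mp3",                 // different site, no match
	} {
		fmt.Println(podcastFor(entries, u))
	}
}
```

The first two URLs resolve to the podcast name, while the last two fall through and are returned unchanged, mirroring the "kept in URL form" behaviour.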

Why do this?

This is useful because barely any podcast providers use the "artist", "performer" or "album" tags to record which podcast an episode belongs to. However, most podcasts are served from systematic URLs, such as auto-generated LibSyn or Soundcloud URLs. This makes them ideal for the pattern-based power of regular expressions.

It also spares podbit from having to parse RSS feeds. Doing so would defeat the entire purpose of loading podcasts from the newsboat queue file in the first place: podbit would load the file and then redo all the work newsboat has already done, just to show a podcast name. Even then, it would be difficult to determine which podcasts come from where, as podbit would have to traverse the entire RSS cache (which may not even be enabled). It would also mean that you could not enqueue podcasts from other sources, script the player or do UNIX-like things in general. The regex approach avoids all that complexity for the simple cost of a regex match, which is in the Go standard library anyway!

Common pitfalls

Go regex matching is slightly different from the matching you may be used to in the shell. The full syntax documentation is available at https://github.com/google/re2/wiki/Syntax or by running go doc regexp/syntax. Here are some key things to watch out for:

  1. The Kleene star works differently: unlike a shell wildcard, the star strictly applies to the previous token. So the star in https://*.adsf.com/ only matches zero or more forward slashes; you probably meant https://.*.adsf.com/. If you want to match anything, be sure to put the period character before the star (.*).
  2. Special characters always have special meaning: unlike in POSIX basic regular expressions (BREs), special characters do not have to be escaped to have special meaning. So, to match one or more of the letter "e", you write e+ not e\+. This means that, if your pattern contains a literal question mark (for example in a URL query string), you will need to escape it as \?.
  3. db is refreshed on program exit: the db file is not re-read when refreshing the queue or at any other point during program execution. However, on program exit, the current contents are written to disk. So, you should only edit the db file while podbit is not running.
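
The first two pitfalls can be checked directly with Go's regexp package. The sketch below uses a hypothetical fullMatch helper that anchors the pattern to emulate full matching; the URLs are invented for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// fullMatch reports whether pattern matches the whole of s.
// It is a demo helper, so it simply panics on an invalid pattern.
func fullMatch(pattern, s string) bool {
	return regexp.MustCompile(`^(?:` + pattern + `)$`).MatchString(s)
}

func main() {
	url := "https://feeds.adsf.com/"

	// Pitfall 1: "/*" means "zero or more slashes", not "anything".
	fmt.Println(fullMatch(`https://*.adsf.com/`, url))  // false
	fmt.Println(fullMatch(`https://.*.adsf.com/`, url)) // true

	// Pitfall 2: "+" is special without a backslash; "\+" is a literal plus.
	fmt.Println(fullMatch(`e+`, "eee"))  // true
	fmt.Println(fullMatch(`e\+`, "eee")) // false
	fmt.Println(fullMatch(`e\+`, "e+"))  // true

	// An unescaped "?" makes the preceding "t" optional instead of
	// matching the literal question mark in the URL.
	query := "https://adsf.com/get?id=7"
	fmt.Println(fullMatch(`https://adsf.com/get?id=.*`, query))  // false
	fmt.Println(fullMatch(`https://adsf.com/get\?id=.*`, query)) // true
}
```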

More help

If you are new to regular expressions, run man 7 regex to get an overview, or see this guide.

Podbit reports syntax errors on failure to parse the regular expression or if there are not enough columns in the db file. Each line must contain at least the regular expression and the podcast name. The following, for instance, is not valid:

.*

or

askdfjnasdf
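
To see why both lines are rejected, here is a minimal sketch of the two checks described above (a missing name column and an unparsable pattern). The checkLine helper and its error messages are hypothetical and do not reproduce podbit's real output.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

// checkLine is a hypothetical validator for a single db line.
// It returns an error if the name column is missing or the pattern
// does not compile as a Go regular expression.
func checkLine(line string) error {
	parts := strings.SplitN(line, " ", 2)
	if len(parts) < 2 || parts[0] == "" || parts[1] == "" {
		return errors.New("expected a pattern followed by a podcast name")
	}
	if _, err := regexp.Compile(parts[0]); err != nil {
		return fmt.Errorf("bad pattern %q: %w", parts[0], err)
	}
	return nil
}

func main() {
	for _, line := range []string{
		".*",            // no name column
		"askdfjnasdf",   // no name column
		"a( Broken",     // pattern fails to compile
		".* Everything", // valid: pattern plus name
	} {
		fmt.Printf("%q: %v\n", line, checkLine(line))
	}
}
```

Both invalid examples fail the first check because they contain no space, so there is no podcast name to go with the pattern.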