This app is currently being used by GBH on the New Jersey Nightly News collection. So far it has only been evaluated in terms of precision: on a small set of videos (fewer than 100), the tool had 100% precision when detecting bars in the first minute of a video. For GBH's use case it has made the most sense to detect only the first instance of bars in a video, because they plan to use the metadata to start playback after the first bars segment ends. This restricts the test set for each video to the first 5 minutes, or the end of the first bars segment, whichever comes first.
In order to conduct a more thorough evaluation, we would need to construct an evaluation dataset of programs from different news stations across many different years. We would annotate the videos, labeling any segments that are bars, and then calculate the tool's performance using IoU. We may be able to combine this data collection/annotation with the annotation for chaptering. The tool is not trained; it simply compares a known bars frame to frames in the video. Consequently, we don't need to worry about a training/test split.
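As a minimal sketch of the IoU metric mentioned above, assuming annotated and predicted bars segments are represented as `(start, end)` pairs in seconds (the function name and representation are illustrative, not part of the tool):

```python
def interval_iou(pred, gold):
    """Temporal intersection-over-union between two (start, end) segments.

    Returns a value in [0, 1]: 1.0 for identical segments,
    0.0 for non-overlapping ones.
    """
    (ps, pe), (gs, ge) = pred, gold
    intersection = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - intersection
    return intersection / union if union > 0 else 0.0

# Example: predicted bars from 0s to 10s, annotated bars from 5s to 15s
# overlap = 5s, union = 15s, so IoU = 1/3
score = interval_iou((0.0, 10.0), (5.0, 15.0))
```

Averaging this score over all annotated bars segments (and penalizing unmatched predictions or annotations with an IoU of 0) would give an overall performance figure for the evaluation dataset.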