Files containing the results generated by the source code in the code
folder.
datatable.csv
: Information provided in 0_prepare_datatable
Depiction of biometrics. The signal detection model (SDM) (top-left) shows the sensitivity related to a single threshold tg. Translated to error rates (bottom-left), there is a direct trade-off between FNR and FPR. In practice, this trade-off is entirely application dependent; specifically, so is the choice of desired FPR (%) (bottom-right).
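The trade-off above can be made concrete with a small sketch (not taken from the code folder): given hypothetical arrays of genuine and imposter similarity scores, the FPR and FNR at a single threshold are simply the fraction of imposter pairs accepted and genuine pairs rejected, respectively.

```python
# Minimal sketch (assumed inputs): FPR/FNR trade-off at a single threshold.
# `genuine_scores` and `imposter_scores` are hypothetical arrays of pair
# similarity scores for genuine and imposter pairs, respectively.
import numpy as np

def error_rates(genuine_scores, imposter_scores, threshold):
    """Return (FPR, FNR) at the given decision threshold."""
    genuine = np.asarray(genuine_scores)
    imposter = np.asarray(imposter_scores)
    fpr = np.mean(imposter >= threshold)  # imposter pairs accepted
    fnr = np.mean(genuine < threshold)    # genuine pairs rejected
    return fpr, fnr
```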
SDM across subgroups. Imposter scores have medians around 0.3 but vary in their upper percentiles; genuine pairs vary in mean and spread (e.g., AF has more area of overlap). A threshold that varies across subgroups yields a constant FPR for all subgroups.
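One plausible way to derive such subgroup-specific thresholds (a sketch under assumed inputs, not the repo's implementation) is to set each subgroup's threshold from the tail of its own imposter-score distribution so that every subgroup hits the same target FPR.

```python
# Minimal sketch: per-subgroup thresholds chosen so each subgroup attains the
# same target FPR. `imposter_scores_by_subgroup` is a hypothetical dict mapping
# a subgroup tag (e.g., 'AF', 'WM') to its imposter similarity scores.
import numpy as np

def threshold_at_fpr(imposter_scores, target_fpr):
    """Threshold at the (1 - target_fpr) quantile of the imposter scores."""
    return np.quantile(np.asarray(imposter_scores), 1.0 - target_fpr)

def per_subgroup_thresholds(imposter_scores_by_subgroup, target_fpr=1e-4):
    return {group: threshold_at_fpr(scores, target_fpr)
            for group, scores in imposter_scores_by_subgroup.items()}
```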
Confusion matrix. Error (rank 1, %) for all BFW faces versus all others. Errors concentrate intra-subgroup, consistent with the SDM above [1]. Although subgroups are challenging to define, this shows that the ones chosen are meaningful for FR.
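One plausible way to tabulate such a matrix (a sketch with assumed inputs, not the repo's code): take the rank-1 nearest neighbor of every face and, whenever it is the wrong identity, count the error against the (probe subgroup, match subgroup) cell.

```python
# Minimal sketch: rank-1 error matrix by subgroup. `embeddings` is an assumed
# (N, d) array of L2-normalized face features, `labels` the identity of each
# face, and `subgroups` the subgroup tag (e.g., 'WM', 'AF') of each face.
import numpy as np

def rank1_error_matrix(embeddings, labels, subgroups):
    groups = sorted(set(subgroups))
    idx = {g: k for k, g in enumerate(groups)}
    errors = np.zeros((len(groups), len(groups)))
    sims = embeddings @ embeddings.T   # cosine similarity for normalized rows
    np.fill_diagonal(sims, -np.inf)    # never match a face to itself
    nearest = sims.argmax(axis=1)      # rank-1 neighbor of each probe
    for i, j in enumerate(nearest):
        if labels[i] != labels[j]:     # rank-1 mistake
            errors[idx[subgroups[i]], idx[subgroups[j]]] += 1
    # Express errors as a percentage of the probes in each row's subgroup.
    row_counts = np.array([sum(s == g for s in subgroups) for g in groups])
    return 100.0 * errors / row_counts[:, None], groups
```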
DET curves. Top-left: per gender. Top-right: per ethnicity. Bottom: per subgroup. Lower is better. On the gender-based DET curve, two dashed lines connect points at a constant threshold, showing roughly a factor-of-3 difference in FPR at the same threshold. Similar differences in FPR exist in the other plots at a constant threshold θconst.
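A DET-style curve can be traced by sweeping the decision threshold and recording FNR against FPR. The sketch below uses assumed per-subgroup score arrays; the plotting step is only suggested in the comments.

```python
# Minimal sketch: points of a DET-style curve (FNR vs. FPR) obtained by sweeping
# the decision threshold over assumed genuine/imposter similarity scores.
import numpy as np

def det_points(genuine_scores, imposter_scores, n_thresholds=200):
    genuine = np.asarray(genuine_scores)
    imposter = np.asarray(imposter_scores)
    thresholds = np.linspace(imposter.min(), genuine.max(), n_thresholds)
    fpr = np.array([np.mean(imposter >= t) for t in thresholds])
    fnr = np.array([np.mean(genuine < t) for t in thresholds])
    return fpr, fnr

# Assumed usage, one curve per subgroup (requires matplotlib.pyplot as plt):
#   for name, (gen, imp) in scores_by_subgroup.items():
#       fpr, fnr = det_points(gen, imp)
#       plt.plot(fpr, 100 * fnr, label=name)
#   plt.xscale('log'); plt.xlabel('FPR'); plt.ylabel('FNR (%)'); plt.legend()
```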
TAR at intervals of FAR. Listed are the TAR scores for a global threshold (top) and the proposed category-based threshold (bottom). Higher is better.
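TAR at a fixed FAR can be estimated by first setting the threshold from the imposter-score tail and then measuring the fraction of genuine pairs accepted at that threshold (a sketch over assumed score arrays):

```python
# Minimal sketch: TAR at a fixed FAR over assumed genuine/imposter scores.
import numpy as np

def tar_at_far(genuine_scores, imposter_scores, far=1e-4):
    # Threshold at which roughly `far` of the imposter pairs are accepted.
    threshold = np.quantile(np.asarray(imposter_scores), 1.0 - far)
    # Fraction of genuine pairs accepted at that threshold.
    return np.mean(np.asarray(genuine_scores) >= threshold)
```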
Percent difference from intended FPR. Top: the global threshold tg yields FPRs that span as much as 2× (i.e., 200%) the intended value (e.g., WM at FPR=1e-4). Furthermore, F subgroups tend to perform worse than intended in all cases, while M subgroups tend to overshoot the intended performance, with the exception of IM at FPR=1e-4. Bottom: subgroup-specific thresholds reduce this difference to near zero; where small differences remain, the percent difference is comparable across subgroups (i.e., fair, e.g., at FPR=1e-4).
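The metric in this figure can be computed directly: measure each subgroup's actual FPR at the shared threshold and report its relative deviation from the intended FPR (a sketch over assumed inputs):

```python
# Minimal sketch: percent difference between the FPR each subgroup attains at a
# shared global threshold and the intended FPR. Inputs are assumed.
import numpy as np

def percent_diff_from_intended(imposter_scores_by_subgroup, global_threshold,
                               intended_fpr):
    diffs = {}
    for group, scores in imposter_scores_by_subgroup.items():
        actual_fpr = np.mean(np.asarray(scores) >= global_threshold)
        diffs[group] = 100.0 * (actual_fpr - intended_fpr) / intended_fpr
    return diffs
```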
DET curves for different CNNs. FNR (%) (vertical) vs FPR (horizontal, log-scale) for VGG2 [6] models with different backbones (VGG16, Resnet50, SEnet50). Lower is better. For each plot, WM is the top-performer, while AF is the worst. The ordering of the curves is roughly the same for each backbone.
We evaluated humans on face pairs, focusing on two racial groups: Chinese and Caucasian. To keep the experiment focused, we honed in on two groups, white Americans (W) and Chinese from China (C). The purpose was to minimize variability by analyzing only these subsets of the broader groups of whites and Asians.
Samples were collected by recruiting subjects from multiple sources (e.g., social media, email lists, and family/friends); a total of 120 participants were sampled at random from all submissions that were (1) complete and (2) from a W or C participant. Specifically, there were 60 W and 60 C, each split evenly between Male (M) and Female (F). A total of 50 face pairs of non-famous
“look-alikes” were collected from the internet, with 20 W and 20 C pairs (male and female split evenly). The other 10 pairs are of other groups (e.g., Hispanic/Latino, Japanese, African). The survey was created, distributed, and recorded via PaperForm.
Human assessment (quantitative). Subgroups are listed per row (i.e., human) and column (i.e., image). Note that most participants do best intra-subgroup (blue) and second best (red) within the same ethnicity but opposite gender. WF performs the best; WF pairs are most correctly matched.
Human assessment (qualitative). Checkmarks denote a match; × denotes a non-match. Accuracy scores are shown as bar plots. Humans are most successful at recognizing their own subgroup, with a few exceptions (e.g., bottom).