Results generated in this paper

Files containing the results generated by the source code in the code folder.

datatable.csv: Information provided in 0_prepare_datatable

Background

0. Summary of metrics

Depiction of biometric metrics. The signal detection model (SDM) (top-left) shows the sensitivity related to a single threshold t_g. Translated to error rates (bottom-left), there is a direct trade-off between FNR and FPR. In practice, the operating point is entirely application dependent: it is set by the choice of desired FPR (%) (bottom-right).
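As a minimal sketch of how the two error rates move against each other at a single threshold, the snippet below computes both from toy scores. The function name `error_rates`, the label encoding (1 for genuine, 0 for imposter), and the scores are assumptions made for illustration; they are not taken from the code folder.

```python
def error_rates(scores, labels, threshold):
    """Return (FPR, FNR) for similarity scores at one decision threshold.

    labels: 1 for genuine pairs, 0 for imposter pairs (assumed encoding).
    A pair is accepted as a match when its score is >= threshold.
    """
    genuine = [s for s, y in zip(scores, labels) if y == 1]
    imposter = [s for s, y in zip(scores, labels) if y == 0]
    fpr = sum(s >= threshold for s in imposter) / len(imposter)
    fnr = sum(s < threshold for s in genuine) / len(genuine)
    return fpr, fnr


# Toy scores, made up for illustration only.
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.5]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

low = error_rates(scores, labels, threshold=0.35)   # lenient threshold
high = error_rates(scores, labels, threshold=0.65)  # strict threshold
print(low, high)
```

Raising the threshold lowers the FPR and raises the FNR, which is the trade-off the bottom-left panel depicts.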

Experiments

1. Signal detection models (SDM)

SDM across subgroups. Scores of imposters have medians around 0.3 but with variations in upper percentiles; genuine pairs vary in mean and spread (e.g., AF has more area of overlap). A threshold varying across different subgroups yields a constant FPR.
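The idea of a per-subgroup threshold that holds FPR constant can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: the helper names, the strict `>` acceptance rule, the group labels "A" and "B", and the scores are all hypothetical.

```python
def threshold_for_fpr(imposter_scores, target_fpr):
    """Pick a threshold so that at most target_fpr of imposters score above it."""
    ranked = sorted(imposter_scores, reverse=True)
    # Number of false positives we may accept; the epsilon guards round-off.
    allowed = min(int(target_fpr * len(ranked) + 1e-9), len(ranked) - 1)
    return ranked[allowed]


def fpr_at(imposter_scores, threshold):
    """Fraction of imposter pairs accepted (score strictly above threshold)."""
    return sum(s > threshold for s in imposter_scores) / len(imposter_scores)


# Toy imposter scores for two hypothetical subgroups.
imposters = {
    "A": [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80],
    "B": [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40],
}
target = 0.25

# One threshold per subgroup: each subgroup lands on the target FPR.
per_group = {g: threshold_for_fpr(s, target) for g, s in imposters.items()}
per_group_fpr = {g: fpr_at(s, per_group[g]) for g, s in imposters.items()}

# One global threshold from the pooled scores: FPR differs by subgroup.
global_t = threshold_for_fpr(imposters["A"] + imposters["B"], target)
global_fpr = {g: fpr_at(s, global_t) for g, s in imposters.items()}
print(per_group_fpr, global_fpr)
```

In this toy data the global threshold meets the target only on average, while the subgroup whose imposter scores run higher absorbs all of the false positives.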

2. NN (i.e., rank 1) error analysis

Confusion matrix. Error (Rank 1, %) for all BFW faces versus all others. Errors concentrate within subgroups, consistent with the SDM (above [1]). Although subgroups are challenging to define, this shows the ones chosen are meaningful for FR.
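A rank-1 error tally of this kind could be sketched as below. The snippet assumes cosine similarity between face features and reports raw counts rather than percentages; the function names, feature vectors, identities, and subgroup labels are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank1_errors(features, identities, subgroups):
    """Count rank-1 errors keyed by (probe subgroup, nearest-neighbor subgroup)."""
    errors = Counter()
    for i, probe in enumerate(features):
        # Nearest neighbor among all other faces.
        others = (j for j in range(len(features)) if j != i)
        best = max(others, key=lambda j: cosine(probe, features[j]))
        if identities[best] != identities[i]:
            errors[(subgroups[i], subgroups[best])] += 1
    return errors


# Toy 2-D features, made up for illustration only.
features = [(1.0, 0.0), (0.9, 0.1), (0.8, 0.3), (0.0, 1.0), (0.1, 0.9)]
identities = ["a1", "a1", "a2", "b1", "b1"]
subgroups = ["A", "A", "A", "B", "B"]

errors = rank1_errors(features, identities, subgroups)
print(errors)
```

Dividing each count by the number of probes in the row's subgroup would turn the tally into the error percentages shown in the matrix.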

3. Detection error trade-off (DET) curves

DET curves. Top-left: per gender. Top-right: per ethnic group. Bottom: per subgroup. Lower is better. On the gender-based DET curve, two dashed lines connect points at a constant threshold, showing roughly a factor-of-3 difference in FPR at the same threshold. Similar differences in FPR exist in the other plots at a constant threshold θ_const.
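The points of a DET curve come from sweeping the decision threshold and recording both error rates at each step. The following is a minimal sketch of that sweep; `det_points` and the toy scores are assumptions made for illustration, and plotting (for example on a log-scaled FPR axis) is left out.

```python
def det_points(genuine, imposter):
    """Sweep each observed score as a threshold; return (threshold, FPR, FNR)."""
    points = []
    for t in sorted(set(genuine) | set(imposter)):
        fpr = sum(s >= t for s in imposter) / len(imposter)  # imposters accepted
        fnr = sum(s < t for s in genuine) / len(genuine)     # genuine rejected
        points.append((t, fpr, fnr))
    return points


# Toy scores, made up for illustration only.
genuine = [0.6, 0.7, 0.8, 0.9]
imposter = [0.2, 0.3, 0.4, 0.7]

points = det_points(genuine, imposter)
for t, fpr, fnr in points:
    print(f"threshold={t:.1f}  FPR={fpr:.2f}  FNR={fnr:.2f}")
```

Running the sweep separately per gender, ethnic group, or subgroup and reading off the FPR at one shared threshold gives the kind of comparison the dashed lines mark.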

4. TAR at FAR

TAR at intervals of FAR. For each FAR, listed are the TAR scores for a global threshold (top) and the proposed category-based threshold (bottom). Higher is better.
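One way to read off TAR at a given FAR is to fix the threshold from the imposter scores and then measure how many genuine pairs clear it. This sketch assumes a strict `>` acceptance rule; `tar_at_far` and the toy scores are hypothetical and not drawn from datatable.csv.

```python
def tar_at_far(genuine, imposter, far):
    """Return (TAR, threshold) where the threshold admits at most `far` of imposters."""
    ranked = sorted(imposter, reverse=True)
    # False accepts we may tolerate; the epsilon guards floating-point round-off.
    allowed = min(int(far * len(ranked) + 1e-9), len(ranked) - 1)
    threshold = ranked[allowed]
    tar = sum(s > threshold for s in genuine) / len(genuine)
    return tar, threshold


# Toy scores, made up for illustration only.
genuine = [0.45, 0.65, 0.75, 0.95]
imposter = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

results = {far: tar_at_far(genuine, imposter, far) for far in (0.125, 0.25, 0.5)}
for far, (tar, threshold) in results.items():
    print(f"FAR={far}: TAR={tar:.2f} at threshold {threshold}")
```

Calling the function once on pooled scores and once per category would give the two rows being compared, global versus category-based.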

5. Intended FPR

Percent difference from intended FPR. Top: t_g yields FPRs that span as large as 2× (i.e., 200%) the intended value (i.e., WM for 1e-4). Furthermore, F subgroups tend to perform worse than intended in all cases, while M subgroups tend to overshoot the intended performance, with the exception of IM for FPR=1e-4. Bottom: Subgroup-specific thresholds reduce this difference to near zero. Where small differences remain, the percent difference is fair across subgroups (i.e., FPR=1e-4).

6. Different models

DET curves for different CNNs. FNR (%) (vertical) vs. FPR (horizontal, log scale) for VGG2 [6] models with different backbones (VGG16, ResNet50, SENet50). Lower is better. In each plot, WM is the top performer, while AF is the worst. The ordering of the curves is roughly the same for each backbone.

7. Human evaluation

We evaluated humans on face pairs, focusing on two racial groups: Chinese and Caucasians. To narrow the experiment, we honed in on two groups, white Americans (W) and Chinese from China (C). The purpose was to minimize variability by analyzing only these subsets of the broader groups of whites and Asians. Samples were collected by recruiting subjects from multiple sources (e.g., social media, email lists, and family/friends). A total of 120 participants were sampled at random from all the submissions that were (1) complete and (2) from a W or C participant. Specifically, there were 60 W and 60 C, each split evenly between male (M) and female (F). A total of 50 face pairs of non-famous "look-alikes" were collected from the internet, with 20 W and 20 C pairs (male and female split evenly). The other 10 pairs are of other groups (e.g., Hispanic/Latino, Japanese, African). The survey was created, distributed, and recorded via PaperForm.

Human assessment (quantitative). Subgroups listed per row (i.e., human) and column (i.e., image). Note that most do the best intra-subgroup (blue), and second-best (red) intra-subgroup but inter-gender. WF performs the best; WF pairs are most correctly matched.

Human assessment (qualitative). Checkmarks denote a match; × denotes a non-match. Accuracy scores are shown as bar plots. Humans are most successful at recognizing their own subgroup, with a few exceptions (e.g., bottom).
