December 2014

Introduction

I used to write a lot of Python but…

Despite the title, there are no more references to Python in this talk.

…of late

  • I have been working almost exclusively in R
  • Trying to improve my reproducible research practice
    • writing everything in rmarkdown (or sweave)
    • write documentation with embedded analysis
      vs
    • write code with embedded documentation
  • working mainly on NGS
    • ChIP-seq
    • RNA-seq
    • Assembly

…but

This talk is about something else, hence the title, but…

  • it is written in rmarkdown
  • it is reproducible research
    • it points at the data (in an SQLite database)
    • it contains the R code chunks to perform the analysis

Facial Recognition Project

Facial Recognition

  • CAHID - Centre for Anatomy and Human Identification
  • Assist - PhD Student-
  • Project - Identification of children from images
    • identification of missing children from images after a time lapse.

…and so it begins…

Supposedly a set of

  • 4727 images categorised by age and sex
  • a mix of subject image sets over a range of ages 0-15
    • 166 female subjects
    • 165 male subjects
  • 2680 single images of unknown individuals

Face Similarity Scores

  • Using NeoFace by NEC
  • CAST (Centre for Applied Science and Technology) at the Home Office
  • Detect a face in each of two images
  • Provide a similarity score for the faces

Image Quality Scores

Image quality metrics from PreFace by Aware Software

  • 48 quality metrics covering
    • background quality and clutter
    • facial lighting and colour saturation
    • facial geometry and pose angle
    • features and expression: glasses and smile likelihood

The Data

the data looks like

Query.Name               Query.ID  Target.Name              Target.ID  Score
A0000_0000_00_C_G_Q.jpg  1         A0000_0000_00_C_G_Q.jpg  1          1.000000
A0000_0000_00_C_G_Q.jpg  1         A0005_0000_00_C_G_Q.jpg  6          0.996440
A0000_0000_00_C_G_Q.jpg  1         A0199_0000_02_C_G_Q.jpg  200        0.567633
A0000_0000_00_C_G_Q.jpg  1         A0118_0000_02_C_G_Q.jpg  119        0.560839
A0000_0000_00_C_G_Q.jpg  1         A0138_0000_02_C_G_Q.jpg  139        0.560240
A0000_0000_00_C_G_Q.jpg  1         A0665_0000_05_C_G_Q.jpg  666        0.551213

A number of images failed quality control (no face could be detected), which left

  • 4634 query images Q
  • 4636 target images T
    • 166 female subjects
    • 1516 images
    • 165 male subjects
    • 738 images
    • 2380 non-subject images
  • but \(Q \subsetneq T\)

so

extract the subset with common target and query filenames and convert it to a matrix

to give

\[ \left( \begin{array}{ccccccc} s_{1,1} & s_{1,2} & s_{1,3} & \cdots & s_{1,i} & \cdots & s_{1,n} \\ s_{2,1} & s_{2,2} & s_{2,3} & \cdots & s_{2,i} & \cdots & s_{2,n} \\ s_{3,1} & s_{3,2} & s_{3,3} & \cdots & s_{3,i} & \cdots & s_{3,n} \\ \vdots & \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ s_{i,1} & s_{i,2} & s_{i,3} & \cdots & s_{i,i} & \cdots & s_{i,n} \\ \vdots & \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ s_{n,1} & s_{n,2} & s_{n,3} & \cdots & s_{n,i} & \cdots & s_{n,n} \end{array} \right) \]

where \(s_{i,j}\) is the similarity score of image \(i\) as query with image \(j\) as target
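
As a rough illustration of the step above, here is a minimal Python sketch that builds the square matrix from long-format rows. The analysis in the talk was done in R against an SQLite database; the function name `score_matrix`, the tuple layout and the toy filenames are assumptions made for this example only.

```python
# Hypothetical long-format rows: (query_name, target_name, score).
rows = [
    ("a.jpg", "a.jpg", 1.0),
    ("a.jpg", "b.jpg", 0.6),
    ("b.jpg", "a.jpg", 0.5),
    ("b.jpg", "b.jpg", 1.0),
    ("b.jpg", "c.jpg", 0.4),  # c.jpg only ever appears as a target
]


def score_matrix(rows):
    """Return (names, matrix) restricted to files seen as both query and target."""
    queries = {q for q, _, _ in rows}
    targets = {t for _, t, _ in rows}
    names = sorted(queries & targets)  # common filenames only
    index = {name: k for k, name in enumerate(names)}
    n = len(names)
    # None marks a pair for which no score was recorded.
    matrix = [[None] * n for _ in range(n)]
    for q, t, s in rows:
        if q in index and t in index:
            matrix[index[q]][index[t]] = s  # row = query, column = target
    return names, matrix


names, S = score_matrix(rows)
```

Rows are indexed by query image and columns by target image, matching the definition of \(s_{i,j}\).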

but…

  • \(\sum\limits_{i=1}^n s_{i,i} \neq n\): not every image scores 1 against itself
  • \(\sum\limits_{i=1}^n \sum\limits_{j=1}^n |s_{i,j} - s_{j,i}| \neq 0\): the matrix is not symmetric
  • the score appears to depend on which image is the query and which is the target
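
Both checks can be sketched in a few lines of Python. This is an illustrative sketch, not the code used for the talk: the helper names, the tolerance `tol` and the toy matrix are assumptions. Pairs are compared one at a time because signed differences \(s_{i,j} - s_{j,i}\) cancel when summed over the whole matrix.

```python
def self_score_failures(S, tol=1e-9):
    """Indices i whose self-comparison score s[i][i] is not 1."""
    return [i for i in range(len(S)) if abs(S[i][i] - 1.0) > tol]


def asymmetric_pairs(S, tol=1e-9):
    """Pairs (i, j) with i < j where s[i][j] and s[j][i] disagree."""
    n = len(S)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if abs(S[i][j] - S[j][i]) > tol
    ]


# Toy matrix (made-up values): image 1 does not match itself perfectly,
# and the (0, 1) score depends on which image is the query.
S = [
    [1.0, 0.6, 0.3],
    [0.5, 0.9, 0.2],
    [0.3, 0.2, 1.0],
]
bad_diagonal = self_score_failures(S)
bad_pairs = asymmetric_pairs(S)
```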

at which point

  • double and triple check I am not doing anything stupid
  • check the image filename pairs are correct
    • they were
  • contact the Home Office to double check their data
  • contact the software authors to see if it was stochastic
    • then WAIT

in the meantime

  • we looked at an older set of data
  • this was symmetric but had far fewer of the subject images
  • so could we clean up the bigger set?
  • and work with it until CAST and the software authors get back to us

the clean data

Select images where

  • \(s_{i,i}=1\)

doing this results in data where

  • \(s_{i,j}=s_{j,i} \; \forall i,j\)

i.e. the matrix is symmetric
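
A minimal Python sketch of this filter, under the assumption that the scores are held as a list of lists; the helper names and toy values are hypothetical. Dropping imperfect self-matches does not guarantee symmetry in general, so the sketch checks it afterwards, as was done for the real data.

```python
def keep_perfect_self_matches(names, S, tol=1e-9):
    """Keep only images whose self-comparison score is 1."""
    keep = [i for i in range(len(S)) if abs(S[i][i] - 1.0) <= tol]
    kept_names = [names[i] for i in keep]
    sub = [[S[i][j] for j in keep] for i in keep]  # matching rows and columns
    return kept_names, sub


def is_symmetric(S, tol=1e-9):
    """True if s[i][j] equals s[j][i] for every pair."""
    n = len(S)
    return all(
        abs(S[i][j] - S[j][i]) <= tol
        for i in range(n)
        for j in range(i + 1, n)
    )


# Toy data (made-up values): image "b" fails the self-match test.
names = ["a", "b", "c"]
S = [
    [1.0, 0.6, 0.3],
    [0.5, 0.9, 0.2],
    [0.3, 0.2, 1.0],
]
clean_names, clean_S = keep_perfect_self_matches(names, S)
```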

this gives

A set of 3155 Images

  • 154 female subjects
  • 155 male subjects
  • 1677 non-subject images

What does it look like?

Cluster it

If first we convert the 0-1 similarity score

to

a 0-1 distance score with

\(d_{i,j}=1-s_{i,j}\)

then cluster the results with hclust
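
To make the two steps concrete, here is a small pure-Python sketch. The talk used R's `hclust`, whose default method is complete linkage; the naive loop below is a stand-in for illustration only, and the function names and toy scores are assumptions.

```python
def to_distance(S):
    """Convert 0-1 similarity scores to 0-1 distances, d = 1 - s."""
    return [[1.0 - s for s in row] for row in S]


def complete_linkage(D):
    """Naive agglomerative clustering; returns merges as (members_a, members_b, height)."""
    clusters = [(i,) for i in range(len(D))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance.
                h = max(D[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or h < best[0]:
                    best = (h, a, b)
        h, a, b = best
        merges.append((clusters[a], clusters[b], h))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


# Toy symmetric similarity matrix (made-up values).
S = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
D = to_distance(S)
merges = complete_linkage(D)
```

Each merge records the two groups joined and the height at which they join, which is the information a dendrogram draws. The loop is far too slow for thousands of images; it only shows the idea.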

[Figure: the resulting dendrogram]

[Figure: the resulting subtree]

What makes a good set of target images?

  • age difference
  • pose angle
  • quality

Some images have no age

Removing images with no age gives

A set of 1726 Images

  • 90 female subjects
  • 130 male subjects
  • 912 non-subject images

Similarity vs Age difference

Confounding Issues

A linear model \(similarity \thicksim age * \delta age * sex\) has an adjusted \(R^{2}\) of 0.32 for subject self-comparison scores
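
In R formula notation `*` expands to all main effects plus all interactions, so the model above can be written out as follows. This is a sketch that assumes \(\delta age\) is the age difference between the two images and that \(sex\) is coded as a 0/1 indicator; the coefficient names \(\beta_k\) are introduced here for illustration.

\[ \begin{array}{rcl} similarity & = & \beta_0 + \beta_1\, age + \beta_2\, \delta age + \beta_3\, sex \\ & & {} + \beta_4\, (age \cdot \delta age) + \beta_5\, (age \cdot sex) + \beta_6\, (\delta age \cdot sex) \\ & & {} + \beta_7\, (age \cdot \delta age \cdot sex) + \varepsilon \end{array} \]

The adjusted \(R^{2}\) penalises the ordinary \(R^{2}\) for the number of predictors \(p\) given \(n\) observations:

\[ R^{2}_{adj} = 1 - (1 - R^{2}) \frac{n - 1}{n - p - 1} \]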

Does it get them right?

Missing child:

  • set of known subject images
  • final cut-off age before they went missing
  • image of individual of unknown age

   0        1        2        3        4        5        6        7        8
0  14:28    0:0      0:0      0:0      0:0      0:0      0:0      0:0      0:0
1  0:3      0:3      0:0      0:0      0:0      0:0      0:0      0:0      0:0
3  0:1      0:1      0:1      0:1      0:0      0:0      0:0      0:0      0:0
4  2:2      2:2      1:1      1:1      1:1      0:0      0:0      0:0      0:0
5  64:88    7:23     7:23     8:23     8:23     9:23     0:0      0:0      0:0
6  131:146  88:106   11:30    11:30    11:29    14:29    13:29    0:0      0:0
7  1:1      1:1      1:1      1:1      1:1      0:0      0:0      0:0      0:0
8  1:1      1:1      1:1      1:1      1:1      1:1      0:0      0:0      0:0
9  0:0      0:1      0:1      0:1      0:1      0:1      0:1      0:1      0:1

Problems

The dataset is too sparse and small, partly due to an error of unknown origin outside the control of the student.

To know which age differences are problematic at which ages, we need

  • a very large data set
  • with enough individuals covering the full age range
  • and a larger background set

Thank you for listening

  • this talk was experimental in its production
  • this talk has deliberately not really been about face recognition
    • gave no detail on NeoFace and how it works
    • did not show a single facial image