And now for something completely different

December 2014

Introduction

I used to write a lot of Python but…

Despite the title, there are no more references to Python in this talk.

…of late

I have been working almost exclusively in R
Trying to improve my reproducible research practice
- writing everything in rmarkdown (or sweave)
- write documentation with embedded analysis, vs
- write code with embedded documentation
working mainly on NGS
- ChIP-seq
- RNA-seq
- Assembly

…but

This talk is about something else, hence the title, but…

it is written in rmarkdown
it is reproducible research
- it points at the data (in an SQLite database)
- it contains the R code chunks to perform the analysis

Facial Recognition Project

Facial Recognition

CAHID - Centre for Anatomy and Human Identification
Assist - PhD Student-
Project - Identification of children from images
- identification of missing children from images after a time lapse.

… and so it begins…

Supposedly a set of

4727 images categorised by age and sex
a mix of subject image sets over a range of ages, 0-15
- 166 female subjects
- 165 male subjects
2680 single images of unknown individuals

Face Similarity Scores

Using NeoFace by NEC
CAST (Centre for Applied Science and Technology) at the Home Office
Detect a face in each of two images
Provide a similarity score for the faces

Image Quality Scores

Image quality metrics from PreFace by Aware Software

48 quality metrics covering
- background quality and clutter
- facial lighting and colour saturation
- facial geometry and pose angle
- features and expression: glasses and smile likelihood

The Data

the data looks like

Query.Name	Query.ID	Target.Name	Target.ID	Score
A0000_0000_00_C_G_Q.jpg	1	A0000_0000_00_C_G_Q.jpg	1	1.000000
A0000_0000_00_C_G_Q.jpg	1	A0005_0000_00_C_G_Q.jpg	6	0.996440
A0000_0000_00_C_G_Q.jpg	1	A0199_0000_02_C_G_Q.jpg	200	0.567633
A0000_0000_00_C_G_Q.jpg	1	A0118_0000_02_C_G_Q.jpg	119	0.560839
A0000_0000_00_C_G_Q.jpg	1	A0138_0000_02_C_G_Q.jpg	139	0.560240
A0000_0000_00_C_G_Q.jpg	1	A0665_0000_05_C_G_Q.jpg	666	0.551213

A number of images failed quality control because no face could be detected, which left

4634 query images Q
4636 target images T
- 166 female subjects, 1516 images
- 165 male subjects, 738 images
- 2380 non-subject images
but \(Q \subsetneq T\)

so

extract the subset with common target and query filenames and convert it to a matrix

to give

\[ \left( \begin{array}{ccccccc} s_{1,1} & s_{1,2} & s_{1,3} & \cdots & s_{1,i} & \cdots & s_{1,n} \\ s_{2,1} & s_{2,2} & s_{2,3} & \cdots & s_{2,i} & \cdots & s_{2,n} \\ s_{3,1} & s_{3,2} & s_{3,3} & \cdots & s_{3,i} & \cdots & s_{3,n} \\ \vdots & \vdots & \vdots & \ddots & \vdots & & \vdots \\ s_{i,1} & s_{i,2} & s_{i,3} & \cdots & s_{i,i} & \cdots & s_{i,n} \\ \vdots & \vdots & \vdots & & \vdots & \ddots & \vdots \\ s_{n,1} & s_{n,2} & s_{n,3} & \cdots & s_{n,i} & \cdots & s_{n,n} \end{array} \right) \]

where \(s_{i,j}\) is the similarity score of image \(i\) as query with image \(j\) as target
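The conversion from the long table of pairs to a square matrix can be sketched roughly as follows. This is a minimal illustration in base R, not the code used for the talk; the file names and scores are made up, and only the column names follow the table shown earlier.

```r
# Toy long-format scores using the same column names as the real table.
# The file names and scores are invented for illustration.
scores <- data.frame(
  Query.Name  = c("a.jpg", "a.jpg", "b.jpg", "b.jpg", "c.jpg"),
  Target.Name = c("a.jpg", "b.jpg", "a.jpg", "b.jpg", "a.jpg"),
  Score       = c(1.00, 0.57, 0.55, 1.00, 0.40),
  stringsAsFactors = FALSE
)

# Keep only images that appear both as a query and as a target.
common <- sort(intersect(scores$Query.Name, scores$Target.Name))
keep   <- scores$Query.Name %in% common & scores$Target.Name %in% common
sub    <- scores[keep, ]

# Fill an n x n matrix: rows are queries, columns are targets.
S <- matrix(NA_real_, nrow = length(common), ncol = length(common),
            dimnames = list(query = common, target = common))
S[cbind(sub$Query.Name, sub$Target.Name)] <- sub$Score
```

Indexing with a two-column character matrix fills each cell by its row and column name, so the order of rows in the long table does not matter. Pairs that were never scored stay `NA`.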

but…

\(\sum\limits_{i=1}^n s_{i,i} \neq n\) not every image scores 1 against itself
\(\sum\limits_{i=1}^n \sum\limits_{j=1}^n \left| s_{i,j} - s_{j,i} \right| \neq 0\) the matrix is not symmetric
the score appears to depend on the order
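Both checks are easy to run on a score matrix. The sketch below uses a made-up 3 x 3 matrix and an assumed tolerance `tol`; it sums absolute differences, since signed differences summed over every pair cancel out whatever the matrix looks like.

```r
# Toy score matrix, invented for illustration: one imperfect self-score
# (image 3) and one order-dependent pair (images 1 and 2).
S <- matrix(c(1.00, 0.57, 0.40,
              0.55, 1.00, 0.30,
              0.40, 0.30, 0.98),
            nrow = 3, byrow = TRUE)

tol <- 1e-6  # assumed tolerance, about the precision of the printed scores

# Images whose self-score differs from 1.
bad_self <- which(abs(diag(S) - 1) > tol)

# Total absolute asymmetry; zero only when the matrix is symmetric.
asymmetry <- sum(abs(S - t(S)))

# Pairs (i < j) where the score depends on which image is the query.
asym_pairs <- which(abs(S - t(S)) > tol & upper.tri(S), arr.ind = TRUE)
```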

at which point

double and triple check I am not doing anything stupid
the image filename pairs are correct
- they were
contact the Home Office to double check their data
contact the software authors to see if it was stochastic
- then WAIT

in the meantime

we looked at an older set of data
this was symmetric but had far fewer of the subject images
so could we clean up the bigger set
and work with it until CAST and the software authors get back?

the clean data

Select images where

\(s_{i,i}=1\)

doing this results in data where

\(s_{i,j}=s_{j,i} \forall j\)

i.e. the matrix is symmetric
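A rough sketch of this filter in base R, on a made-up matrix with an assumed tolerance `tol`. The filter itself does not guarantee symmetry; in the talk that was an observation about the real data, and the toy matrix here is built so that the dropped image is also the asymmetric one.

```r
# Toy score matrix, invented for illustration. Image "c" has an imperfect
# self-score and is also the source of the asymmetry.
ids <- c("a", "b", "c")
S <- matrix(c(1.00, 0.57, 0.40,
              0.57, 1.00, 0.30,
              0.45, 0.30, 0.98),
            nrow = 3, byrow = TRUE, dimnames = list(ids, ids))

tol <- 1e-6  # assumed tolerance

# Keep only images that score 1 against themselves.
keep    <- abs(diag(S) - 1) <= tol
S_clean <- S[keep, keep, drop = FALSE]

# Check whether the cleaned matrix is symmetric.
is_sym <- max(abs(S_clean - t(S_clean))) <= tol
```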

this gives

A set of 3155 Images

154 female subjects
155 male subjects
1677 non-subject images

What does it look like?

Cluster it

If we first convert the 0-1 similarity score

to a 0-1 distance score with

\(d_{i,j}=1-s_{i,j}\)

then cluster the results with hclust
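In base R this step might look like the sketch below. The matrix is made up, and the linkage method and the number of groups are assumptions, as the talk does not state them.

```r
# Toy symmetric similarity matrix, invented for illustration:
# img1/img2 are similar to each other, as are img3/img4.
ids <- paste0("img", 1:4)
S <- matrix(c(1.00, 0.90, 0.20, 0.10,
              0.90, 1.00, 0.15, 0.20,
              0.20, 0.15, 1.00, 0.80,
              0.10, 0.20, 0.80, 1.00),
            nrow = 4, byrow = TRUE, dimnames = list(ids, ids))

# Convert similarity to distance, d = 1 - s, and wrap it as a "dist" object.
D <- as.dist(1 - S)

# Hierarchical clustering; "average" linkage is an assumed choice.
hc <- hclust(D, method = "average")

# Cut the tree into two groups (k = 2 is arbitrary for this toy example).
groups <- cutree(hc, k = 2)

# plot(hc) draws the dendrogram.
```

`as.dist` reads only the lower triangle, so it assumes the matrix has already been made symmetric.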

Figures: the resulting dendrogram and a resulting subtree

What makes a good set of target images?

age difference
pose angle
quality

Some images have no age

Removing images with no age gives

A set of 1726 Images

90 female subjects
130 male subjects
912 non-subject images

Similarity vs Age difference

Confounding Issues

The linear model \(similarity \sim age * \delta age * sex\) has an adjusted \(R^{2}\) of 0.32 for subject self-comparison scores

Does it get them right?
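A model of this form can be fitted with `lm`. The sketch below runs on synthetic data, so it does not reproduce the 0.32 reported here; the column name `dage` (for the age difference) and all the simulated values are assumptions made for illustration.

```r
set.seed(1)
n <- 200

# Synthetic stand-in for the subject self-comparison scores.
d <- data.frame(
  age  = runif(n, 0, 15),   # age in the earlier image
  dage = runif(n, 0, 10),   # age difference between the two images
  sex  = factor(sample(c("F", "M"), n, replace = TRUE))
)
d$similarity <- 0.9 - 0.03 * d$dage + 0.005 * d$age + rnorm(n, sd = 0.1)

# age * dage * sex expands to all main effects and all interactions.
fit <- lm(similarity ~ age * dage * sex, data = d)

adj_r2 <- summary(fit)$adj.r.squared
```

With two numeric predictors and one two-level factor, the full interaction model estimates eight coefficients, which is a lot to ask of a sparse data set.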

Missing child:

set of known subject images
final cut off age before they went missing
image of individual of unknown age

	0	1	2	3	4	5	6	7	8
0	14:28	0:0	0:0	0:0	0:0	0:0	0:0	0:0	0:0
1	0:3	0:3	0:0	0:0	0:0	0:0	0:0	0:0	0:0
3	0:1	0:1	0:1	0:1	0:0	0:0	0:0	0:0	0:0
4	2:2	2:2	1:1	1:1	1:1	0:0	0:0	0:0	0:0
5	64:88	7:23	7:23	8:23	8:23	9:23	0:0	0:0	0:0
6	131:146	88:106	11:30	11:30	11:29	14:29	13:29	0:0	0:0
7	1:1	1:1	1:1	1:1	1:1	0:0	0:0	0:0	0:0
8	1:1	1:1	1:1	1:1	1:1	1:1	0:0	0:0	0:0
9	0:0	0:1	0:1	0:1	0:1	0:1	0:1	0:1	0:1

Problems

The dataset is too sparse and too small, partly due to errors of unknown origin outside the control of the student.

To know which age differences are problematic at which ages, the project

needs a very large data set
with enough individuals covering full age range
and a larger background set

Thank you for listening

this talk was experimental in its production
this talk has deliberately not really been about face recognition
- gave no detail on NeoFace and how it works
- did not show a single facial image