Topological data analysis (TDA) is a relatively new area of research that spans many disciplines including topology (in particular, homology), statistics, machine learning and computation geometry.
The basic idea of TDA is to describe the ``shape of the data'' by finding clusters, holes, tunnels, etc. Cluster analysis is special case of TDA. I'm not an expert on TDA but I do find it fascinating. I'll try to give a flavor of what this subject is about.
1. Cycles and Holes
Let
us start with clusters. Consider data
where each
is a point in
.
One way to define clusters is to define a graph
over the data. We fix a tuning parameter
and we put an edge between two data points
and
if
.
We can then define the clusters to be the
connected components of this graph.
This type of clustering is identical to single linkage clustering.
There are problems with single linkage clustering; for example, single linkage cluster are inconsistent estimates of high density regions; see Hartigan (1981). So this should raise some alarm bells. But let's ignore that for now.
To extract more topological information--- in particular, to get the homology groups--- we need to do some more work. In addition to the edges of the graph, we will include triangles and higher order simplices. Look at the following example:
The top left shows some data. We see, intuitively, that there is one ``hole'' in the data. Also, shown is the Äech complex, described below. Basically this consists of 0-simplices (the points), the 1-simplices (the edges) and a 2-simplex (a triangle). The top right and bottom left show two ``cycles'' (or loops) through the data. Both cycles show the same hole in the data. To capture the idea that both cycles capture the same hole we need to regard these cycles as equivalent. This means that we can deform one cycle into the other. More formally, the difference between the two cycles is a ``filled-in cycle,'' that is, it is the boundary of a higher-order object. Here is another example of two equivalent cycles based on a Figure 4.9 from Zomorodian 2005: (Afra Zomorodian, 2005, Topology for computing.
2. Making It Formal: The Äech Complex
Think
of each data point as a 0-dimensional simplex. Let
denote these simplices. Think of each edge in the
graph as a 1-dimensional simplex. Let
denote these simplices (i.e. the edges). (We only
include an edge if
.)
Now we form a set
of 2-dimensional simplices (triangles) as follows.
If three points
are such that
then we include the simplex defined by these three
points, denoted by
,
in
.
Here,
is a ball of radius
centered at
.
Now
keep going and form the 3-dimensional simplices
,
the 4-dimensional simplices
,
and so on. (Note that a
simplex is defined by
points.)
The
collection
of simplices
has a special structure. In particular, if
then any face of
is also in
.
A class of simplices with this property is called
a simplicial
complex. This particular simplicial complex is
called a Äech complex.
The
Äech complex contains topological information. To
see this, we need a few more definitions.
Specifically, we introduce chains
which are sets of
-dimensional
simples, cycles
which are chains without boundary (cycles), and
boundary cycles
which are filled in cycles, that is, they are
cycles but they are also boundaries of higher
order chains. This leads to the following key
picture of homology:
A
chain of order
is any set of
-simplices.
Formally, these are written as a sum. For
example,
is
a chain consisting of three, one-dimensional
simplices (edges). Intuitively, you can think of
the sum as taking the set difference. Thus .
The chains
are a group under addition.
The
boundary of a simplex
is obtained by removing each element one at a time
and adding the result. For example, the boundary
of the triangle
is
Thus,
the triangle has an empty 2-boundary. More
generally, the boundary of a -simplex
is
where
means that
has been removed. The boundary of a chain
is defined to be
.
A
chain with an empty boundary is a cycle.
The set of cycles is denoted by .
Formally, cycles are the kernel of boundary map:
.
Some
elements of
are ``filled in cycles.'' Formally, these
boundary cycles are boundaries of chains
in
.
That is, the boundary cycles
are the image of
:
.
It may be verified that
This yields the key picture above.
Now
two cycles
and
are considered equivalent if they differ by an
element of
.
That is,
if
for some
.
The set equivalence classes
is the homology group. Formally,
.
(Think of the two cycles in the first example
defining the same hole.) The order of the group is
called the Betti number and is denoted by
.
How do we compute the groups and the Betti numbers? All of the above can be expressed as matrices. Basically, the matrices capture which simplices are boundaries of which other simplices. This is the appeal of homology: it can be expressed as linear algebra. The Matlab package ``Plex'' will take a set of points and compute all the stuff we have discussed.
In
case your getting confused, let's step back and
summarize the main point:
is the number of connected components.
is the number of holes.
is the number of tunnels. etc.
3. Persistence
Notice
that there is a tuning parameter .
The conclusions depend crucially on
.
How do we choose it? Usually, we don't. Instead,
one displays the topological information as a
function of
.
This can be done in a barcode plot as follows:
(this is Figure 4 from Ghrist (2008)):
Source: Figure 4, from Ghrist (2008), Barcodes: The persistent topology of data, Bulletin-American Mathematical Society, 45, page 61.
What
we see here is that as
varies, some topological features appear and then
disappear. The features that persist for a long
stretch are considered to be real; the small ones
are called topological noise. This type of
analysis is called persistent
homology.
4. What Are We Actually Estimating?
Suppose
the data
were obtained as follows. First we sample data
from a distribution
which is supported on some manifold
.
We don't observe
.
Instead, we observe
where
represent random noise.
The
manifold
has homology groups defined in a way similarly to
how we defined the groups for the data. The hope
is that the homology from the data estimates the
homology of
.
The implicit belief seems to be that persistent
features (long bars in the barcode plot) are real
features of the underlying manifold.
What's
known more precisely is that, under certain
conditions, and for some choices of ,
the Äech complex has high probability of having
the right homology. See, for example, Niyogi,
Smale and Weinberger (2011). There are other ways
to try to estimate the homology of
;
see for example, Caillerie (et al 2012).
In fact, the data need to be cleaned first by removing low density points. This means we need to first do some sort of density estimation. (This relates to my earlier comment about the lack of consistency of single linkage clustering.) So there are in fact many tuning parameters and practical ways to choose appropriate tuning parameters still seem elusive. This is an example of what I like to call the curse of tuning parameters.
5. What Do We Use It For?
TDA is fascinating. But is it useful? One can find many examples of TDA in genomics, neuroscience and image analysis, for example,
But is TDA crucial here or is it a hammer looking for a nail? After all, the most successful data analysis tools are simple and homology is pretty complicated.
To me, the jury is still out. I find TDA interesting but whether it will prove to be an enduring set of data analysis methods remains to be seen. In the meantime, I hope people will keep thinking about it.
6. Refefences
Edelsbrunner, H. and Harer, J. (2010). Computational Topology: An Introduction.
Zomorodian, Afra. (2005). Topology for Computing.
Balakrishnan, S., Rinaldo, A., Sheehy,D., Singh,A. and Wasserman, L. (2011). Minimax Rates for Homology Inference.
Ghrist (2008), Barcodes: The persistent topology of data, Bulletin-American Mathematical Society, 45, page 61.
Y. Dabaghian, F. MÃmoli, L. Frank, G. Carlsson. (2012). A topological paradigm for hippocampal spatial map formation using persistent homology. PLoS Computational Biology. link.
Niyogi, P. and Smale, S. and Weinberger, S. (2011). A topological view of unsupervised learning from noisy data. SIAM Journal on Computing, 40, p 646.
Caillerie, Chazal, Dedecker, and Michel. (2012). Deconvolution for the Wasserstein metric and geometric inference. Electronic Journal of Statistics, 5, 1394--1423. link.
--- Larry Wasserman