Topological data analysis (TDA) is a relatively new area of research that spans many disciplines including topology (in particular, homology), statistics, machine learning, and computational geometry.

The basic idea of TDA is to describe the ``shape of the data'' by finding clusters, holes, tunnels, etc. Cluster analysis is a special case of TDA. I'm not an expert on TDA but I do find it fascinating. I'll try to give a flavor of what this subject is about.

1. Cycles and Holes

Let us start with clusters. Consider data {Y_1,\ldots,Y_n} where each {Y_i} is a point in {\mathbb{R}^d}. One way to define clusters is to define a graph over the data. We fix a tuning parameter {\epsilon} and we put an edge between two data points {Y_i} and {Y_j} if {||Y_i - Y_j|| \leq \epsilon}. We can then define the clusters to be the connected components of this graph.

This type of clustering is identical to single linkage clustering.
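The connected-components definition above is easy to implement directly. Here is a minimal sketch using a union-find structure (the function name `epsilon_clusters` is my own, for illustration):

```python
import numpy as np

def epsilon_clusters(Y, eps):
    """Connected components of the graph joining points within distance eps.

    Y is an (n, d) array of data points; returns a list of cluster labels.
    These components coincide with single linkage clusters cut at height eps.
    """
    n = len(Y)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Put an edge between Y_i and Y_j when ||Y_i - Y_j|| <= eps,
    # merging their components as we go.
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(Y[i] - Y[j]) <= eps:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(n)]
    relabel = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]

# Two well separated blobs give two clusters.
Y = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(epsilon_clusters(Y, eps=0.5))  # [0, 0, 1, 1]
```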

There are problems with single linkage clustering; for example, single linkage clusters are inconsistent estimates of high density regions; see Hartigan (1981). So this should raise some alarm bells. But let's ignore that for now.

To extract more topological information--- in particular, to get the homology groups--- we need to do some more work. In addition to the edges of the graph, we will include triangles and higher order simplices. Look at the following example:

The top left shows some data. We see, intuitively, that there is one ``hole'' in the data. Also shown is the Čech complex, described below. Basically this consists of 0-simplices (the points), 1-simplices (the edges) and a 2-simplex (a triangle). The top right and bottom left show two ``cycles'' (or loops) through the data. Both cycles show the same hole in the data. To capture the idea that both cycles capture the same hole we need to regard these cycles as equivalent. This means that we can deform one cycle into the other. More formally, the difference between the two cycles is a ``filled-in cycle,'' that is, it is the boundary of a higher-order object. Here is another example of two equivalent cycles, based on Figure 4.9 of Zomorodian (2005), Topology for Computing.

2. Making It Formal: The Čech Complex

Think of each data point as a 0-dimensional simplex. Let {S_0=\{ [Y_1], \ldots, [Y_n]\}} denote these simplices. Think of each edge in the graph as a 1-dimensional simplex. Let {S_1} denote these simplices (i.e. the edges). (We only include an edge if {||Y_i - Y_j|| \leq \epsilon}.) Now we form a set {S_2} of 2-dimensional simplices (triangles) as follows. If three points {Y_i,Y_j,Y_k} are such that {B(Y_i,\epsilon)\cap B(Y_j,\epsilon)\cap B(Y_k,\epsilon)\neq \emptyset} then we include the simplex defined by these three points, denoted by {[Y_i,Y_j,Y_k]}, in {S_2}. Here, {B(Y_i,\epsilon)} is a ball of radius {\epsilon} centered at {Y_i}.

Now keep going and form the 3-dimensional simplices {S_3}, the 4-dimensional simplices {S_4}, and so on. (Note that a {k} simplex is defined by {k+1} points.)

The collection {C_\epsilon} of simplices {\{S_0,S_1, \ldots, S_{n-1}\}} has a special structure. In particular, if {\sigma\in C_\epsilon} then any face of {\sigma} is also in {C_\epsilon}. A collection of simplices with this property is called a simplicial complex. This particular simplicial complex is called the Čech complex.
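The construction up to 2-simplices can be sketched in code. One common convention includes a simplex whenever the balls of radius {\epsilon} around its vertices share a common point; under that convention an edge appears when the distance is at most {2\epsilon} (the edge rule in the text corresponds to balls of radius {\epsilon/2}), and a triangle appears when the smallest enclosing ball of its three vertices has radius at most {\epsilon}. A sketch for points in the plane, with these conventions as my assumption:

```python
import numpy as np
from itertools import combinations

def miniball_radius(p, q, r):
    """Radius of the smallest ball containing three points in the plane."""
    a, b, c = sorted([np.linalg.norm(q - r),
                      np.linalg.norm(p - r),
                      np.linalg.norm(p - q)])
    # Obtuse (or degenerate) triangle: the longest side is a diameter.
    if c**2 >= a**2 + b**2:
        return c / 2.0
    # Acute triangle: circumradius = abc / (4 * area).
    area = 0.5 * abs((q[0] - p[0]) * (r[1] - p[1])
                     - (q[1] - p[1]) * (r[0] - p[0]))
    return a * b * c / (4.0 * area)

def cech_complex(Y, eps):
    """Cech complex at scale eps, up to 2-simplices.

    A simplex is included when the balls B(Y_i, eps) around its vertices
    have a common point: for an edge, ||Y_i - Y_j|| <= 2*eps; for a
    triangle, the smallest enclosing ball has radius <= eps.
    """
    n = len(Y)
    S0 = [(i,) for i in range(n)]
    S1 = [(i, j) for i, j in combinations(range(n), 2)
          if np.linalg.norm(Y[i] - Y[j]) <= 2 * eps]
    edges = set(S1)
    S2 = [(i, j, k) for i, j, k in combinations(range(n), 3)
          if {(i, j), (i, k), (j, k)} <= edges
          and miniball_radius(Y[i], Y[j], Y[k]) <= eps]
    return S0, S1, S2
```

Checking the triple ball intersection (rather than just pairwise edges) is exactly what distinguishes the Čech complex from the simpler Vietoris--Rips complex.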

The Čech complex contains topological information. To see this, we need a few more definitions. Specifically, we introduce chains {C_p} which are sets of {p}-dimensional simplices, cycles {Z_p} which are chains without boundary, and boundary cycles {B_p} which are filled-in cycles, that is, they are cycles but they are also boundaries of higher order chains. This leads to the following key picture of homology:

A chain of order {k} is any set of {k}-simplices. Formally, these are written as a sum. For example,

\displaystyle  [Y_1,Y_2] + [Y_2,Y_3] + [Y_3,Y_4]

is a chain consisting of three one-dimensional simplices (edges). Addition is taken mod 2, so a simplex that appears twice cancels: for example, {([Y_1,Y_2] + [Y_2,Y_3]) + ([Y_2,Y_3] + [Y_3,Y_4]) = [Y_1,Y_2] + [Y_3,Y_4]}. The chains {C_p} form a group under this addition.
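Over mod-2 coefficients a chain is just a set of simplices, and chain addition is symmetric difference. A minimal sketch:

```python
def add_chains(c1, c2):
    """Mod-2 addition of chains: simplices appearing in both cancel."""
    return c1 ^ c2  # symmetric difference of sets of simplices

# Each edge [Y_i, Y_j] is represented by the index tuple (i, j).
c1 = {(1, 2), (2, 3)}
c2 = {(2, 3), (3, 4)}
print(add_chains(c1, c2))  # the shared edge (2, 3) cancels
```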

The boundary of a simplex {\sigma} is obtained by removing each vertex one at a time and summing the results. For example, the boundary of the triangle {[Y_1,Y_2,Y_3]} is

\displaystyle  \partial_2 [Y_1,Y_2,Y_3] = [Y_2,Y_3] + [Y_1,Y_3] + [Y_1,Y_2].

Applying the boundary operator again gives

\displaystyle  \partial_1 \partial_2 [Y_1,Y_2,Y_3] = ([Y_3]+[Y_2]) + ([Y_3]+[Y_1]) + ([Y_2]+[Y_1]) = \emptyset

since each vertex appears twice and cancels mod 2. Thus the boundary of the triangle's boundary is empty. More generally, the boundary of a {k}-simplex {\sigma = [Y_1,\ldots, Y_{k+1}]} is

\displaystyle  \partial_k [Y_1,\ldots, Y_{k+1}]= \sum_j [Y_1,\ldots, \hat{Y}_j,\ldots, Y_{k+1}]

where {\hat Y_j} means that {Y_j} has been removed. The boundary of a chain {C = \sum_j \sigma_j} is defined to be {\partial_k C = \sum_j \partial_k \sigma_j}.
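The boundary operator is easy to code with the set-of-simplices representation of mod-2 chains. A sketch (function names are mine, for illustration):

```python
def boundary_simplex(sigma):
    """Boundary of a simplex: the faces obtained by dropping one vertex."""
    if len(sigma) == 1:
        return set()  # a point has empty boundary
    return {sigma[:j] + sigma[j + 1:] for j in range(len(sigma))}

def boundary_chain(chain):
    """Boundary of a mod-2 chain: symmetric difference of face sets."""
    out = set()
    for sigma in chain:
        out ^= boundary_simplex(sigma)
    return out

tri = (1, 2, 3)
print(boundary_chain({tri}))                  # the three edges of the triangle
print(boundary_chain(boundary_chain({tri})))  # set(): a boundary has no boundary
```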

A chain with an empty boundary is a cycle. The set of cycles is denoted by {Z_p}. Formally, cycles are the kernel of the boundary map: {Z_p = {\sf ker}\,\partial_p}.

Some elements of {Z_p} are ``filled-in cycles.'' Formally, these boundary cycles are boundaries of chains in {C_{p+1}}. That is, the boundary cycles {B_p \subset Z_p} are the image of {\partial_{p+1}}: {B_p = {\sf im}\,\partial_{p+1}}. It may be verified that

\displaystyle  \partial_p \partial_{p+1} C_{p+1} = \emptyset,

which is why every boundary is indeed a cycle.

This yields the key picture above.

Now two cycles {z} and {z'} are considered equivalent if they differ by an element of {B_p}. That is, {z \sim z'} if {z = z' + b} for some {b \in B_p}. The set of equivalence classes {H_p = Z_p/B_p} is the homology group. (Think of the two cycles in the first example defining the same hole.) The rank of the group is called the Betti number and is denoted by {\beta_p}.

How do we compute the groups and the Betti numbers? All of the above can be expressed as matrices. Basically, the matrices capture which simplices are boundaries of which other simplices. This is the appeal of homology: it can be expressed as linear algebra. The Matlab package ``Plex'' will take a set of points and compute all the stuff we have discussed.
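To make the linear-algebra point concrete, here is a from-scratch sketch (not Plex itself, just an illustration): the boundary maps become 0/1 matrices, and over mod-2 coefficients {\beta_p = \dim C_p - {\rm rank}\,\partial_p - {\rm rank}\,\partial_{p+1}}.

```python
import numpy as np

def gf2_rank(M):
    """Rank of a 0/1 matrix over GF(2) by Gaussian elimination."""
    M = np.array(M, dtype=np.uint8) % 2
    rank, rows, cols = 0, M.shape[0], M.shape[1]
    for c in range(cols):
        pivot = next((r for r in range(rank, rows) if M[r, c]), None)
        if pivot is None:
            continue
        M[[rank, pivot]] = M[[pivot, rank]]
        for r in range(rows):
            if r != rank and M[r, c]:
                M[r] ^= M[rank]  # addition mod 2
        rank += 1
    return rank

def boundary_matrix(faces, simplices):
    """Rows indexed by (p-1)-simplices, columns by p-simplices."""
    idx = {f: i for i, f in enumerate(faces)}
    M = np.zeros((len(faces), len(simplices)), dtype=np.uint8)
    for j, s in enumerate(simplices):
        for k in range(len(s)):
            M[idx[s[:k] + s[k + 1:]], j] = 1  # drop one vertex to get a face
    return M

# A hollow triangle: three vertices, three edges, no 2-simplex.
S0 = [(0,), (1,), (2,)]
S1 = [(0, 1), (0, 2), (1, 2)]
d1 = boundary_matrix(S0, S1)
beta0 = len(S0) - gf2_rank(d1)        # rank of d_0 is 0
beta1 = len(S1) - gf2_rank(d1) - 0    # no 2-simplices, so rank of d_2 is 0
print(beta0, beta1)  # 1 1: one connected component, one hole
```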

In case you're getting confused, let's step back and summarize the main point: {\beta_0} is the number of connected components, {\beta_1} is the number of holes (loops), {\beta_2} is the number of voids, and so on.

3. Persistence

Notice that there is a tuning parameter {\epsilon}. The conclusions depend crucially on {\epsilon}. How do we choose it? Usually, we don't. Instead, we display the topological information as a function of {\epsilon}. This can be done with a barcode plot (this is Figure 4 from Ghrist (2008)):

Source: Figure 4 from Ghrist (2008), Barcodes: The persistent topology of data, Bulletin of the American Mathematical Society, 45, page 61.

What we see here is that as {\epsilon} varies, some topological features appear and then disappear. The features that persist for a long stretch are considered to be real; the small ones are called topological noise. This type of analysis is called persistent homology.
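The degree-0 part of a barcode (the lifetimes of connected components) can be computed with nothing more than union-find; higher-degree bars need the matrix machinery above. A minimal sketch, with my own function name:

```python
import numpy as np
from itertools import combinations

def beta0_barcode(Y):
    """Birth/death intervals for connected components as eps grows.

    Every point is born at eps = 0; when two components merge at scale d,
    one of them dies at d. The last surviving component never dies.
    """
    n = len(Y)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process candidate edges in order of increasing length.
    edges = sorted((np.linalg.norm(Y[i] - Y[j]), i, j)
                   for i, j in combinations(range(n), 2))
    bars = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            bars.append((0.0, d))       # a component dies at scale d
    bars.append((0.0, float("inf")))    # one component persists forever
    return bars

# Two tight pairs far apart: two long bars' worth of structure.
Y = np.array([[0.0], [0.2], [3.0], [3.1]])
for birth, death in beta0_barcode(Y):
    print(f"[{birth}, {death})")
```

The short bars (deaths at 0.1 and 0.2) are the ``topological noise'' here; the long bar dying at 2.8 and the infinite bar reflect the two real clusters.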

4. What Are We Actually Estimating?

Suppose the data {Y_1,\ldots, Y_n} were obtained as follows. First we sample data {X_1,\ldots, X_n} from a distribution {G} which is supported on some manifold {M}. We don't observe {X_i}. Instead, we observe {Y_i = X_i + \delta_i} where {\delta_1,\ldots,\delta_n} represent random noise.
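This sampling model is easy to simulate. A sketch, taking {M} to be the unit circle (a hypothetical choice of manifold, just for illustration) with Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# X_i sampled from G supported on the manifold M (here, the unit circle).
theta = rng.uniform(0, 2 * np.pi, n)
X = np.column_stack([np.cos(theta), np.sin(theta)])

# Observed data: Y_i = X_i + delta_i, with Gaussian noise delta_i.
Y = X + rng.normal(scale=0.05, size=(n, 2))
```

Running the Čech/barcode machinery on such a {Y} and asking whether the circle's hole ({\beta_1 = 1}) shows up as a long bar is exactly the estimation question at issue.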

The manifold {M} has homology groups defined in a way similar to how we defined the groups for the data. The hope is that the homology of the data estimates the homology of {M}. The implicit belief seems to be that persistent features (long bars in the barcode plot) are real features of the underlying manifold.

What's known more precisely is that, under certain conditions, and for some choices of {\epsilon}, the Čech complex has high probability of having the right homology. See, for example, Niyogi, Smale and Weinberger (2011). There are other ways to try to estimate the homology of {M}; see, for example, Caillerie et al. (2012).

In fact, the data need to be cleaned first by removing low density points. This means we need to first do some sort of density estimation. (This relates to my earlier comment about the lack of consistency of single linkage clustering.) So there are in fact many tuning parameters and practical ways to choose appropriate tuning parameters still seem elusive. This is an example of what I like to call the curse of tuning parameters.

5. What Do We Use It For?

TDA is fascinating. But is it useful? One can find many examples of TDA in genomics, neuroscience and image analysis; see, for example, Dabaghian et al. (2012) in the references.

But is TDA crucial here or is it a hammer looking for a nail? After all, the most successful data analysis tools are simple and homology is pretty complicated.

To me, the jury is still out. I find TDA interesting but whether it will prove to be an enduring set of data analysis methods remains to be seen. In the meantime, I hope people will keep thinking about it.

6. References

Edelsbrunner, H. and Harer, J. (2010). Computational Topology: An Introduction.

Zomorodian, Afra. (2005). Topology for Computing.

The Stanford CompTop group

The Geometrica team at INRIA.

Balakrishnan, S., Rinaldo, A., Sheehy,D., Singh,A. and Wasserman, L. (2011). Minimax Rates for Homology Inference.

Ghrist, R. (2008). Barcodes: The persistent topology of data. Bulletin of the American Mathematical Society, 45, page 61.

Dabaghian, Y., Mémoli, F., Frank, L. and Carlsson, G. (2012). A topological paradigm for hippocampal spatial map formation using persistent homology. PLoS Computational Biology.

Niyogi, P. and Smale, S. and Weinberger, S. (2011). A topological view of unsupervised learning from noisy data. SIAM Journal on Computing, 40, p 646.

Caillerie, Chazal, Dedecker and Michel. (2012). Deconvolution for the Wasserstein metric and geometric inference. Electronic Journal of Statistics, 5, 1394--1423.

--- Larry Wasserman