A handy guide to using the ToMATo algorithm

Introduction

This code is an implementation of the ToMATo algorithm presented in [1], a clustering method based on the idea of topological persistence. In short, the algorithm needs a density estimate (so to each point $x$ we associate a value $\hat{f}(x)$) and a neighborhood graph. It starts with a mode-seeking phase (naive hill-climbing) to build the initial clusters (each with its own mode), following the connected points in the neighborhood graph. It then merges these initial clusters based on their prominence. This merging is hierarchical in nature, i.e. each new cluster is always obtained by merging two existing ones.

The merging phase depends on one parameter: the minimum prominence a cluster needs in order not to be merged into an adjacent cluster with a higher associated mode; it therefore largely determines the number of clusters obtained. In practice, the right value for this parameter depends on the input graph and on the density estimate, and it can be hard to choose properly. This is why this implementation also offers the option of choosing the "desired" final number of clusters $n$: the algorithm then keeps lowering the value of the parameter until all the initial clusters have been merged (if possible) into the $n$ clusters with the highest prominence.

Along with the clustering itself, the algorithm also produces the persistence diagram of the merge tree of the initial clusters. This is a very convenient graphical tool to help decide the "natural" number of clusters in the input data. We briefly explain its interpretation in the section "Output information".
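Before going into the details, here is a minimal sketch of a typical session (assuming the gudhi clustering module is installed; the two-blob point cloud below is synthetic and only serves as an illustration):

import numpy as np
from gudhi.clustering.tomato import Tomato

# Two Gaussian blobs as a toy point cloud
rng = np.random.default_rng(0)
pts = np.concatenate([rng.normal(0.0, 0.2, (100, 2)),
                      rng.normal(2.0, 0.2, (100, 2))])

t = Tomato()          # defaults: k-NN graph and logDTM density estimate
t.fit(pts)
t.plot_diagram()      # inspect the prominences first...
t.n_clusters_ = 2     # ...then ask for the number of clusters we believe in
print(t.labels_)      # cluster label of each input point, in input order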

Input data format

As mentioned, the algorithm needs a neighborhood graph of the data and a value associated with each entry (an estimate of $f$ at that point). Given that, in many situations, the input data is a point cloud (i.e. a set of $n$ observations, each with $p$ numerical features), the code provides a few density estimators and graph constructions for convenience, but advanced users may provide their own graph and density estimates instead of point coordinates.

Since the algorithm essentially computes basins of attraction, it is also perfectly legitimate to use it on functions that do not represent densities at all.

For an input point cloud, the density estimation and graph construction methods that have been implemented are:

  • For density estimation, the ubiquitous Kernel Density Estimation (KDE for short) can be used (via the scikit-learn library), as well as the Distance-to-a-Measure method (DTM), which is a bit more experimental and was developed more recently to cope more efficiently with the potential presence of outliers; more information about it can be found in the tutorial [2] and the paper [3]. The logarithmic versions of both estimators are also implemented.
  • Regarding the construction of the graph, there is the option of building the $k$-NN graph (where, for each vertex, an edge is created between it and its $k$ nearest neighbors) and the $r$-radius graph (where an edge is created whenever two vertices lie at distance less than $r$). Both parameters can (and should) be chosen carefully. The following figure shows both constructions over a point cloud in the unit square (first image); the second image shows the $k$-NN graph (with $k$=4), and the third the $r$-radius graph (with $r$=0.3). A small scikit-learn sketch of both constructions follows the figure.

graphs.png
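For illustration, both constructions can be reproduced directly with scikit-learn (just a sketch; the point cloud and the values of k and r are arbitrary, and in practice Tomato builds these graphs internally):

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
pts = rng.uniform(0, 1, (200, 2))     # point cloud in the unit square

nn = NearestNeighbors().fit(pts)

# k-NN graph: row i contains the indices of the k nearest neighbors of point i (itself included)
k = 4
_, knn_graph = nn.kneighbors(pts, n_neighbors=k)

# r-radius graph: entry i is the (variable-length) array of neighbors within distance r of point i
r = 0.3
radius_graph = nn.radius_neighbors(pts, radius=r, return_distance=False)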

Output information

At the end, the algorithm outputs two main pieces of information:

In all cases, it produces the (0-dimensional) persistence diagram of the merging process of the initial clusters. In short, this is a graphical representation of the lifespan of the different clusters as the prominence threshold decreases.

At the beginning, we have a point for each initial cluster, each with an associated peak (the vertex with the highest estimate of $f$, a "mode" of $f$). Then we start merging these clusters into neighboring clusters with higher associated peaks. To do so, we go through the vertices $i$ in decreasing order of $\hat{f}$ and check which "neighboring" peaks $p_{j}$ lower than $p_{i}$ satisfy $\hat{f}(p_{j}) < \hat{f}(i) + \tau$, where $\tau$ is our prominence threshold. When this happens, the whole cluster associated with the peak $p_{j}$ is merged into the cluster to which $i$ belongs, forming a new, bigger cluster, still with peak $p_{i}$. The higher $\tau$ must be before this happens, the more prominent $p_{j}$ and its associated cluster are.

In the persistence diagram, all this information is encoded in the following way: there is a point $(x,y)$ for each initial cluster. The $x$ coordinate is the density value at its associated peak $p$. The $y$ coordinate is the value $\hat{f}(p)-\tau$ at which we can first find a "neighboring point" of that cluster, belonging to a different cluster, with equal or greater $\hat{f}$; equivalently, it is the density at the vertex where the cluster first meets a cluster with a higher peak. The length of the vertical segment connecting $(x,y)$ to the diagonal, i.e. $x-y$, is therefore the prominence of the peak. Consequently, to get an idea of the "real" number of clusters, it is natural to count the points of the persistence diagram that lie far from the diagonal. The points associated with the peak of a cluster that never dies (i.e. one that never gets merged, so it forms a connected component at the end) are colored in green.

pd.png
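To make this reading concrete, the prominences can be extracted directly from the attributes filled by the fit method (just a sketch; `t` denotes an already-fitted Tomato instance, whose attributes are described in the next section):

import numpy as np

# Prominence of each finite point of the diagram = vertical distance to the diagonal
prominences = t.diagram_[:, 0] - t.diagram_[:, 1]
print(np.sort(prominences)[::-1])     # look for a clear gap in this decreasing list

# Clusters that never die (the green points) correspond to connected components
print(len(t.max_weight_per_cc_), "connected component(s)")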

In view of the persistence diagram obtained, it is then natural to ask for a specific number of clusters, or to specify a certain persistence threshold. Once this has been stipulated, the algorithm also outputs a numerical "label" for each entry of the input data (in the same order in which they were introduced, whatever the format): the cluster it has been assigned to. This labelling is stored in the attribute "labels_" as an ordered vector, so it can easily be used to plot the data in different colors or styles depending on the assigned cluster.

THE TOMATO CLASS

The code now

This is the current version of the code in the Gudhi Library:

In [63]:
# This file is part of the Gudhi Library - https://gudhi.inria.fr/ - which is released under MIT.
# See file LICENSE or go to https://gudhi.inria.fr/licensing/ for full license details.
# Author(s):       Marc Glisse
#
# Copyright (C) 2020 Inria
#
# Modification(s):
#   - YYYY/MM Author: Description of the modification

import numpy
from ..point_cloud.knn import KNearestNeighbors
from ..point_cloud.dtm import DTMDensity
from ._tomato import *

# The fit/predict interface is not so well suited...


class Tomato:
    """
    This clustering algorithm needs a neighborhood graph on the points, and an estimation of the density at each point.
    A few possible graph constructions and density estimators are provided for convenience, but it is perfectly natural
    to provide your own.

    :Requires: `SciPy <installation.html#scipy>`_, `Scikit-learn <installation.html#scikit-learn>`_ or others
        (see :class:`~gudhi.point_cloud.knn.KNearestNeighbors`) in function of the options.

    Attributes
    ----------
    n_clusters_: int
        The number of clusters. Writing to it automatically adjusts `labels_`.
    merge_threshold_: float
        minimum prominence of a cluster so it doesn't get merged. Writing to it automatically adjusts `labels_`.
    n_leaves_: int
        number of leaves (unstable clusters) in the hierarchical tree
    leaf_labels_: ndarray of shape (n_samples,)
        cluster labels for each point, at the very bottom of the hierarchy
    labels_: ndarray of shape (n_samples,)
        cluster labels for each point, after merging
    diagram_: ndarray of shape (`n_leaves_`, 2)
        persistence diagram (only the finite points)
    max_weight_per_cc_: ndarray of shape (n_connected_components,)
        maximum of the density function on each connected component. This corresponds to the abscissa of infinite
        points in the diagram
    children_: ndarray of shape (`n_leaves_`-n_connected_components, 2)
        The children of each non-leaf node. Values less than `n_leaves_` correspond to leaves of the tree.
        A node i greater than or equal to `n_leaves_` is a non-leaf node and has children children_[i - `n_leaves_`].
        Alternatively at the i-th iteration, children[i][0] and children[i][1] are merged to form node `n_leaves_` + i
    weights_: ndarray of shape (n_samples,)
        weights of the points, as computed by the density estimator or provided by the user
    params_: dict
        Parameters like metric, etc
    """

    def __init__(
        self,
        graph_type="knn",
        density_type="logDTM",
        n_clusters=None,
        merge_threshold=None,
        #       eliminate_threshold=None,
        #           eliminate_threshold (float): minimum max weight of a cluster so it doesn't get eliminated
        **params
    ):
        """
        Args:
            graph_type (str): 'manual', 'knn' or 'radius'. Default is 'knn'.
            density_type (str): 'manual', 'DTM', 'logDTM', 'KDE' or 'logKDE'. When you have many points,
                'KDE' and 'logKDE' tend to be slower. Default is 'logDTM'.
            metric (str|Callable): metric used when calculating the distance between instances in a feature array.
                Defaults to Minkowski of parameter p.
            kde_params (dict): if density_type is 'KDE' or 'logKDE', additional parameters passed directly to
                sklearn.neighbors.KernelDensity.
            k (int): number of neighbors for a knn graph (including the vertex itself). Defaults to 10.
            k_DTM (int): number of neighbors for the DTM density estimation (including the vertex itself).
                Defaults to k.
            r (float): size of a neighborhood if graph_type is 'radius'. Also used as default bandwidth in kde_params.
            eps (float): (1+eps) approximation factor when computing distances (ignored in many cases).
            n_clusters (int): number of clusters requested. Defaults to None, i.e. no merging occurs and we get
                the maximal number of clusters.
            merge_threshold (float): minimum prominence of a cluster so it doesn't get merged.
            symmetrize_graph (bool): whether we should add edges to make the neighborhood graph symmetric.
                This can be useful with k-NN for small k. Defaults to false.
            p (float): norm L^p on input points. Defaults to 2.
            q (float): order used to compute the distance to measure. Defaults to dim.
                Beware that when the dimension is large, this can easily cause overflows.
            dim (float): final exponent in DTM density estimation, representing the dimension. Defaults to the
                dimension, or 2 when the dimension cannot be read from the input (metric is "precomputed").
            n_jobs (int): Number of jobs to schedule for parallel processing on the CPU.
                If -1 is given all processors are used. Default: 1.
            params: extra parameters are passed to :class:`~gudhi.point_cloud.knn.KNearestNeighbors` and
                :class:`~gudhi.point_cloud.dtm.DTMDensity`.
        """
        # Should metric='precomputed' mean input_type='distance_matrix'?
        # Should we be able to pass metric='minkowski' (what None does currently)?
        self.graph_type_ = graph_type
        self.density_type_ = density_type
        self.params_ = params
        self.__n_clusters = n_clusters
        self.__merge_threshold = merge_threshold
        # self.eliminate_threshold_ = eliminate_threshold
        if n_clusters and merge_threshold:
            raise ValueError("Cannot specify both a merge threshold and a number of clusters")

    def fit(self, X, y=None, weights=None):
        """
        Args:
            X ((n,d)-array of float|(n,n)-array of float|Sequence[Iterable[int]]): coordinates of the points,
                or distance matrix (full, not just a triangle) if metric is "precomputed", or list of neighbors
                for each point (points are represented by their index, starting from 0) if graph_type is "manual".
            weights (ndarray of shape (n_samples)): if density_type is 'manual', a density estimate at each point
            y: Not used, present here for API consistency with scikit-learn by convention.
        """
        # TODO: First detect if this is a new call with the same data (only threshold changed?)
        # TODO: less code duplication (subroutines?), less spaghetti, but don't compute neighbors twice if not needed. Clear error message for missing or contradictory parameters.
        if weights is not None:
            density_type = "manual"
        else:
            density_type = self.density_type_
            if density_type == "manual":
                raise ValueError("If density_type is 'manual', you must provide weights to fit()")

        if self.graph_type_ == "manual":
            self.neighbors_ = X
            # FIXME: uniformize "message 'option'" vs 'message "option"'
            assert density_type == "manual", 'If graph_type is "manual", density_type must be as well'
        else:
            metric = self.params_.get("metric", "minkowski")
            if metric != "precomputed":
                self.points_ = X

        # Slight complication to avoid computing knn twice.
        need_knn = 0
        need_knn_ngb = False
        need_knn_dist = False
        if self.graph_type_ == "knn":
            k_graph = self.params_.get("k", 10)
            # If X has fewer than k points...
            if k_graph > len(X):
                k_graph = len(X)
            need_knn = k_graph
            need_knn_ngb = True
        if self.density_type_ in ["DTM", "logDTM"]:
            k = self.params_.get("k", 10)
            k_DTM = self.params_.get("k_DTM", k)
            # If X has fewer than k points...
            if k_DTM > len(X):
                k_DTM = len(X)
            need_knn = max(need_knn, k_DTM)
            need_knn_dist = True
            # if we ask for more neighbors for the graph than the DTM, getting the distances is a slight waste,
            # but it looks negligible
        if need_knn > 0:
            knn_args = dict(self.params_)
            knn_args["k"] = need_knn
            knn = KNearestNeighbors(return_index=need_knn_ngb, return_distance=need_knn_dist, **knn_args).fit_transform(
                X
            )
            if need_knn_ngb:
                if need_knn_dist:
                    self.neighbors_ = knn[0][:, 0:k_graph]
                    knn_dist = knn[1]
                else:
                    self.neighbors_ = knn
            elif need_knn_dist:
                knn_dist = knn
        if self.density_type_ in ["DTM", "logDTM"]:
            dim = self.params_.get("dim")
            if dim is None:
                dim = len(X[0]) if metric != "precomputed" else 2
            q = self.params_.get("q", dim)
            weights = DTMDensity(k=k_DTM, metric="neighbors", dim=dim, q=q).fit_transform(knn_dist)
            if self.density_type_ == "logDTM":
                weights = numpy.log(weights)

        if self.graph_type_ == "radius":
            if metric in ["minkowski", "euclidean", "manhattan", "chebyshev"]:
                from scipy.spatial import cKDTree

                tree = cKDTree(X)
                # TODO: handle "l1" and "l2" aliases?
                p = self.params_.get("p")
                if metric == "euclidean":
                    assert p is None or p == 2, "p=" + str(p) + " is not consistent with metric='euclidean'"
                    p = 2
                elif metric == "manhattan":
                    assert p is None or p == 1, "p=" + str(p) + " is not consistent with metric='manhattan'"
                    p = 1
                elif metric == "chebyshev":
                    assert p is None or p == numpy.inf, "p=" + str(p) + " is not consistent with metric='chebyshev'"
                    p = numpy.inf
                elif p is None:
                    p = 2  # the default
                eps = self.params_.get("eps", 0)
                self.neighbors_ = tree.query_ball_tree(tree, r=self.params_["r"], p=p, eps=eps)

            # TODO: sklearn's NearestNeighbors.radius_neighbors can handle more metrics efficiently via its BallTree
            # (don't bother with the _graph variant, it just calls radius_neighbors).
            elif metric != "precomputed":
                from sklearn.metrics import pairwise_distances

                X = pairwise_distances(X, metric=metric, n_jobs=self.params_.get("n_jobs"))
                metric = "precomputed"

            if metric == "precomputed":
                # TODO: parallelize? May not be worth it.
                X = numpy.asarray(X)
                r = self.params_["r"]
                self.neighbors_ = [numpy.flatnonzero(l <= r) for l in X]

        if self.density_type_ in {"KDE", "logKDE"}:
            # Slow...
            assert (
                self.graph_type_ != "manual" and metric != "precomputed"
            ), "Scikit-learn's KernelDensity requires point coordinates"
            kde_params = dict(self.params_.get("kde_params", dict()))
            kde_params.setdefault("metric", metric)
            r = self.params_.get("r")
            if r is not None:
                kde_params.setdefault("bandwidth", r)
            # Should we default rtol to eps?
            from sklearn.neighbors import KernelDensity

            weights = KernelDensity(**kde_params).fit(self.points_).score_samples(self.points_)
            if self.density_type_ == "KDE":
                weights = numpy.exp(weights)

        # TODO: do it at the C++ level and/or in parallel if this is too slow?
        if self.params_.get("symmetrize_graph"):
            self.neighbors_ = [set(line) for line in self.neighbors_]
            for i, line in enumerate(self.neighbors_):
                line.discard(i)
                for j in line:
                    self.neighbors_[j].add(i)

        self.weights_ = weights
        # This is where the main computation happens
        self.leaf_labels_, self.children_, self.diagram_, self.max_weight_per_cc_ = hierarchy(self.neighbors_, weights)
        self.n_leaves_ = len(self.max_weight_per_cc_) + len(self.children_)
        assert self.leaf_labels_.max() + 1 == len(self.max_weight_per_cc_) + len(self.children_)
        # TODO: deduplicate this code with the setters below
        if self.__merge_threshold:
            assert not self.__n_clusters
            self.__n_clusters = numpy.count_nonzero(
                self.diagram_[:, 0] - self.diagram_[:, 1] > self.__merge_threshold
            ) + len(self.max_weight_per_cc_)
        if self.__n_clusters:
            # TODO: set corresponding merge_threshold?
            renaming = merge(self.children_, self.n_leaves_, self.__n_clusters)
            self.labels_ = renaming[self.leaf_labels_]
            # In case the user asked for something impossible.
            # TODO: check for impossible situations before calling merge.
            self.__n_clusters = self.labels_.max() + 1
        else:
            self.labels_ = self.leaf_labels_
            self.__n_clusters = self.n_leaves_
        return self

    def fit_predict(self, X, y=None, weights=None):
        """
        Equivalent to fit(), and returns the `labels_`.
        """
        return self.fit(X, y, weights).labels_

    # TODO: add argument k or threshold? Have a version where you can click and it shows the line and the corresponding k?
    def plot_diagram(self):
        """
        """
        import matplotlib.pyplot as plt

        l = self.max_weight_per_cc_.min()
        r = self.max_weight_per_cc_.max()
        if self.diagram_.size > 0:
            plt.plot(self.diagram_[:, 0], self.diagram_[:, 1], "ro")
            l = min(l, self.diagram_[:, 1].min())
            r = max(r, self.diagram_[:, 0].max())
        if l == r:
            if l > 0:
                l, r = 0.9 * l, 1.1 * r
            elif l < 0:
                l, r = 1.1 * l, 0.9 * r
            else:
                l, r = -1.0, 1.0
        plt.plot([l, r], [l, r])
        plt.plot(
            self.max_weight_per_cc_, numpy.full(self.max_weight_per_cc_.shape, 1.1 * l - 0.1 * r), "ro", color="green"
        )
        plt.show()

    # Use set_params instead?
    @property
    def n_clusters_(self):
        return self.__n_clusters

    @n_clusters_.setter
    def n_clusters_(self, n_clusters):
        if n_clusters == self.__n_clusters:
            return
        self.__n_clusters = n_clusters
        self.__merge_threshold = None
        if hasattr(self, "leaf_labels_"):
            renaming = merge(self.children_, self.n_leaves_, self.__n_clusters)
            self.labels_ = renaming[self.leaf_labels_]
            # In case the user asked for something impossible
            self.__n_clusters = self.labels_.max() + 1

    @property
    def merge_threshold_(self):
        return self.__merge_threshold

    @merge_threshold_.setter
    def merge_threshold_(self, merge_threshold):
        if merge_threshold == self.__merge_threshold:
            return
        if hasattr(self, "leaf_labels_"):
            self.n_clusters_ = numpy.count_nonzero(self.diagram_[:, 0] - self.diagram_[:, 1] > merge_threshold) + len(
                self.max_weight_per_cc_
            )
        else:
            self.__n_clusters = None
        self.__merge_threshold = merge_threshold

Description

Parameters

By "parameters" we mean the information we (must) provide to construct a specific instance of the class. They are given as arguments in the constructor function "__init__":

  • graph_type (str): 'manual', 'knn' (default) or 'radius'.
  • density_type (str): 'manual', 'DTM', 'logDTM' (default), 'KDE' or 'logKDE'. With many points, 'KDE' and 'logKDE' tend to be slower.
  • n_clusters (int): number of clusters requested. Defaults to None, i.e. no merging occurs and we get the maximal number of clusters.
  • merge_threshold (float): minimum prominence of a cluster so it doesn't get merged.

(Naturally, n_clusters and merge_threshold cannot both be provided at the same time, as can be deduced from the explanation of the algorithm.)

  • metric (str|Callable): metric used to compute the pairwise distances between points (when we don't provide them directly). If None, the Minkowski metric of parameter p is used.
  • kde_params (dict): if density_type is 'KDE' or 'logKDE', additional parameters passed directly to sklearn.neighbors.KernelDensity.
  • k (int): number of neighbors for a k-NN graph (including the vertex itself). Defaults to 10.
  • k_DTM (int): number of neighbors for the DTM density estimation (including the vertex itself). Defaults to k.
  • r (float): size of a neighborhood if graph_type is 'radius'. Also used as default bandwidth in kde_params.
  • eps (float): approximation factor when computing nearest neighbors (ignored in many cases).

  • gpu (bool): enable use of CUDA (through pykeops) to compute k nearest neighbors. This is useful when the dimension becomes large (10+) but the number of points remains low (less than a million).

  • symmetrize_graph (bool): whether we should add edges to make the neighborhood graph symmetric. This can be useful with k-NN for small k. Defaults to false.

  • p (float): norm L^p on input points (numpy.inf is supported without gpu). Defaults to 2.
  • dim (float): final exponent in DTM density estimation, representing the dimension. Defaults to the dimension, or 2 when the dimension cannot be read from the input (metric is "precomputed").
  • q (float): order used to compute the distance to measure. Defaults to dim. Beware that when the dimension is large, this can easily cause overflows.
  • n_jobs (int): Number of jobs to schedule for parallel processing on the CPU. If -1 is given all processors are used. Default: 1.
  • params: extra parameters are passed to the classes gudhi.point_cloud.knn.KNearestNeighbors and gudhi.point_cloud.dtm.DTMDensity.

Attributes

By "attributes" we mean the properties, or variables, created within a class: they store its information, allow it to run some of its methods and functionalities, etc... We recall also that, as a common practice, the attributes of a class (those defined with self.) usually have some "_" in its name to make them more distinguishable within the code.

Naturally, the values of most attributes depend on the instance itself, and some of them may or may not be present depending on it. Many of the previous parameters have a corresponding attribute, for example n_clusters_ and merge_threshold_ (which, when modified, can alter the values of other attributes, as their .setter properties show), or are stored inside the "params_" dictionary (metric, k, r, ...).

Other important attributes which are created specifically to run the desired methods and are not given as parameters are:

  • n_leaves_ (int): Number of leaves (unstable clusters) in the hierarchical tree. Basically, the number of "temporary" clusters (or mini-clusters) we have along the way.

  • leaf_labels_ (ndarray of shape (n_samples)): Cluster labels for each point, at the very bottom of the hierarchy.

  • labels_ (ndarray of shape (n_samples)): Cluster labels for each point, after merging. Writing to n_clusters_ and merge_threshold_ automatically adjusts it.

  • diagram_ (ndarray of shape (n_leaves_, 2)): Persistence diagram (only the finite points).

  • weights_: (ndarray of shape (n_samples,)): Weights of the points, as computed by the density estimator or provided by the user.

  • max_weight_per_cc_: (ndarray of shape (n_connected_components,)): Maximum of the density function on each connected component. This corresponds to the abscissa of infinite points in the diagram.

Methods

The Tomato class contains, in essence, two methods:

  • The first one is the .fit method, which does essentially everything: it processes the input data according to its format and the given arguments, performs the merging process, labels the entries and stores the points that will eventually form the persistence diagram. The method .fit_predict is identical, except that it returns the label vector. Both take as input the coordinates of the points / the distance matrix / the neighborhood lists, and possibly a "weights" vector with the estimate of $f$ at each entry.
  • The second one is the .plot_diagram method, which takes no arguments and plots the persistence diagram (to be called after the fit method).
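Since fit() stores the whole merge tree, the number of clusters or the prominence threshold can be changed afterwards without refitting, which is practically free. A minimal sketch (here `pts` stands for any point cloud and the threshold 0.5 is arbitrary):

t = Tomato()
t.fit(pts)                # expensive part: graph, density estimate, merge tree
t.plot_diagram()

t.n_clusters_ = 3         # relabels instantly, reusing the stored hierarchy
labels_three = t.labels_

t.merge_threshold_ = 0.5  # or pick a prominence threshold instead
labels_tau = t.labels_
print(t.n_clusters_)      # number of clusters induced by that threshold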

EXAMPLES AND TESTS

Example 1

We start with a really simple example with a few hundred points, to get used to manipulating the Tomato class.

In [284]:
import matplotlib.pyplot as plt

cmap = plt.cm.Spectral;
fig, ax = plt.subplots();

import random as rd
import numpy as np


# Simple function to draw x uniformly from the union of intervals (0, 0.4) U (0.6, 1)
def x_var(x):
    if x > 0.5:
        return rd.uniform(0.6, 1)
    else:
        return rd.uniform(0, 0.4)

    
p1 = np.zeros((200,2))
for i in range(200):
    p1[i,0] = x_var(rd.uniform(0,1))
    p1[i,1] = rd.uniform(0,1)
    
ax.cla()
ax.scatter(*zip(*p1));

There are "clearly" two main groups of points.

Let's suppose we don't know that, and run the ToMATo algorithm blindly. We use KDE (without specifying extra parameters, thus using the Scikit-learn defaults) and the radius graph with $r$ = 0.1. We want to take a look at the persistence diagram:

In [285]:
import gudhi

from gudhi.clustering.tomato import Tomato

ex1 = Tomato(
        input_type="points",
        metric="euclidean",
        graph_type="radius",
        density_type="KDE",
        #n_clusters=2,
        r=0.1,
    )

ex1.fit(p1)

print("There are " + str(ex1.n_clusters_) + " initial clusters")
ex1.plot_diagram()
There are 7 initial clusters

Even if n_clusters_ gives us 7 initial clusters (when we don't specify the parameter n_clusters in Tomato, no merging occurs), we can see in the diagram that there are two clearly more prominent clusters (the points far below the diagonal), but also four connected components (the green points). Indeed, let's draw the graph built on top of our data:

In [286]:
# Draw the r-radius graph (r = 0.1) by hand from the pairwise distances
from sklearn.neighbors import NearestNeighbors
X = np.array(p1)
nbrs = NearestNeighbors(n_neighbors=200).fit(X)
distances, indices = nbrs.kneighbors(X)
plt.plot(X[:,0], X[:,1], 'o')
for i in indices:
    Y = np.zeros((2,2))
    for j in range(len(i)):
        if distances[i[0]][j] < 0.1:
            Y[0][0]= X[i[0]][0]
            Y[1][0]= X[i[0]][1]
            Y[0][1]= X[i[j]][0]
            Y[1][1]= X[i[j]][1]
            plt.plot(Y[0], Y[1], 'ro-')

plt.show()

Even if we know that "there are" two main clusters, we cannot force the algorithm to output only two, because there is no way it can merge disconnected components. There is no problem, however, if we ask for a larger number of clusters:

In [287]:
ex1.n_clusters_ = 6
print(ex1.n_clusters_)
print(ex1.labels_)

ex1.n_clusters_ = 2
print(ex1.n_clusters_)
print(ex1.labels_)
6
[1 2 3 1 2 2 1 3 2 2 0 2 0 0 0 1 1 2 1 3 2 0 0 0 3 2 5 0 2 2 0 2 3 2 2 1 2
 1 0 1 0 2 0 1 0 0 0 2 0 0 2 0 3 1 1 1 1 0 0 3 1 0 2 4 2 2 1 0 3 2 1 3 3 2
 2 2 1 0 2 3 2 0 1 1 2 0 0 0 0 1 0 1 5 0 1 1 3 2 2 3 2 2 2 1 0 0 1 0 2 2 3
 0 0 2 3 1 3 2 3 3 0 1 1 2 2 0 0 1 1 3 3 0 2 1 1 0 1 1 1 0 0 0 3 1 0 1 0 3
 0 0 1 0 1 2 1 1 0 2 1 4 2 3 0 2 1 1 2 0 1 2 0 0 1 0 2 2 1 0 2 0 0 2 1 0 1
 2 0 0 2 1 2 1 0 1 0 5 1 0 3 2]
4
[1 0 1 1 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0 3 0 0 0 0 0 1 0 0 1 0
 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 1 1 0 0 2 0 0 1 0 1 0 1 1 1 0
 0 0 1 0 0 1 0 0 1 1 0 0 0 0 0 1 0 1 3 0 1 1 1 0 0 1 0 0 0 1 0 0 1 0 0 0 1
 0 0 0 1 1 1 0 1 1 0 1 1 0 0 0 0 1 1 1 1 0 0 1 1 0 1 1 1 0 0 0 1 1 0 1 0 1
 0 0 1 0 1 0 1 1 0 0 1 2 0 1 0 0 1 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 1
 0 0 0 0 1 0 1 0 1 0 3 1 0 1 0]

Unsurprisingly, if we plot the points with different colors according to their labels, we don't get a very satisfying result:

In [288]:
n = ex1.n_clusters_
labels = ex1.labels_

norm = plt.Normalize(vmin=0, vmax=n-1)

ax.cla()
ax.scatter(*zip(*p1), c=cmap(norm(labels)));
fig
Out[288]:

This is why running the algorithm with different parameter values is a good idea, especially if it produces persistence diagrams with several green dots (i.e. connected components) near the bottom-left (i.e. low, isolated peaks).

Here is the situation when we increase $r$ a bit (the clustering below uses $r$ = 0.13, while the graph is drawn with a 0.15 cutoff):

In [290]:
ex1 = Tomato(
        input_type="points",
        metric="euclidean",
        graph_type="radius",
        density_type="KDE",
        n_clusters=2,
        r=0.13,
    )

n = ex1.n_clusters_
print("We obtain " + str(n) + " clusters.")
labels = ex1.fit_predict(p1)
print(ex1.labels_)

print("\nThe persistance diagram looks better, with just two connected components, and two prominent regions:")
ex1.plot_diagram()

print("\nThe graph over which the algorithm runs is:")

plt.plot(X[:,0], X[:,1], 'o')
for i in indices:
    Y = np.zeros((2,2))
    for j in range(len(i)):
        if distances[i[0]][j] < 0.15:
            Y[0][0]= X[i[0]][0]
            Y[1][0]= X[i[0]][1]
            Y[0][1]= X[i[j]][0]
            Y[1][1]= X[i[j]][1]
            plt.plot(Y[0], Y[1], 'ro-')

plt.show()

print("\nAnd the plot of the points according to their label is:")

norm = plt.Normalize(vmin=0, vmax=n-1)

ax.cla()
ax.scatter(*zip(*p1), c=cmap(norm(labels)));
fig
We obtain 2 clusters.
[1 0 1 1 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 1 0
 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 1 1 0 0 1 0 0 1 0 1 0 1 1 1 0
 0 0 1 0 0 1 0 0 1 1 0 0 0 0 0 1 0 1 1 0 1 1 1 0 0 1 0 0 0 1 0 0 1 0 0 0 1
 0 0 0 1 1 1 0 1 1 0 1 1 0 0 0 0 1 1 1 1 0 0 1 1 0 1 1 1 0 0 0 1 1 0 1 0 1
 0 0 1 0 1 0 1 1 0 0 1 1 0 1 0 0 1 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 1
 0 0 0 0 1 0 1 0 1 0 1 1 0 1 0]

The persistence diagram looks better, with just two connected components, and two prominent regions:
The graph over which the algorithm runs is:
And the plot of the points according to their label is:
Out[290]:

Example 2

We use now a rather typical example to test clustering algorithms: a point cloud sampled from two concentric circles:

In [62]:
from sklearn import manifold, datasets
p2, y = datasets.make_circles(n_samples=1000, factor=.5, noise=.05)

ax.cla()
ax.scatter(*zip(*p2));
fig
Out[62]:

It is well known that many clustering methods perform poorly on non-convex groupings of data, such as the one above. This is not the case for the ToMATo algorithm, which relies only on looking for "nearby" modes. We now use the $k$-NN graph construction, with $k$=7, and KDE again, this time specifying some of its parameters (for more information, check the Scikit-learn documentation):

In [276]:
ex2 = Tomato(
        input_type="points",
        metric="euclidean",
        graph_type="knn",
        density_type="KDE",
        kde_params = {"bandwidth": 1.3, "kernel": "epanechnikov"},
        #n_clusters=2,
        k=7,
        eps=0.05,
    )

ex2.fit_predict(p2)
ex2.plot_diagram()

The diagram is not especially obvious; when this happens, it is generally a good idea to run the algorithm with different parameter values.

We also see that there are several connected components, more specifically 6; a quick way to count them (although it may be slow with many points) is to "force" the attribute n_clusters_ to a small value such as 1 or 2, so that all possible merges happen, and then print it:

In [294]:
ex2.n_clusters_ = 2
n = ex2.n_clusters_
print("There are " + str(n) + " connected components")
There are 6 connected components

Let's plot these components:

In [293]:
labels = ex2.labels_

norm = plt.Normalize(vmin=0, vmax=n-1)

ax.cla()
ax.scatter(*zip(*p2), c=cmap(norm(labels)));
fig
Out[293]:

A bit frustrating; this is a "natural" consequence of the $k$-NN graph being directed. We can "solve" this by symmetrizing the graph, although its effectiveness is uncertain. In this case it also makes sense to reduce $k$, since symmetrization adds more edges:

In [321]:
ex2 = Tomato(
        input_type="points",
        metric="euclidean",
        graph_type="knn",
        density_type="KDE",
        kde_params = {"bandwidth": 1.3, "kernel": "epanechnikov"},
        #n_clusters=2,
        k=5,
        symmetrize_graph = True,
        eps=0.05,
    )

ex2.fit_predict(p2)
ex2.plot_diagram()

ex2.n_clusters_ = 1
n = ex2.n_clusters_
print("There are " + str(n) + " connected components")

labels = ex2.labels_

norm = plt.Normalize(vmin=0, vmax=n-1)

ax.cla()
ax.scatter(*zip(*p2), c=cmap(norm(labels)));
fig
There are 4 connected components
Out[321]:

In general, an intelligent way to proceed is to run the algorithm for several values of $k$ and of the bandwidth $\lambda$, and see for which values we obtain "good" persistence diagrams, with clearly prominent clusters. This is what we do below: for each fixed $k$ and a range of values of $\lambda$, we compute the prominence of each point of the persistence diagram ($x-y$) and plot this information:

In [394]:
fig = plt.figure(figsize=(8, 32))

for n_neigh in range(6, 12):
    x_diagram = []
    y_diagram = []
    bandwidth_values = [0.1, 2, 0.1]   # start, stop, step
    bandwidth = bandwidth_values[0]

    while bandwidth < bandwidth_values[1]:
        ex2 = Tomato(
            input_type="points",
            metric="euclidean",
            graph_type="knn",
            density_type="KDE",
            kde_params={"bandwidth": bandwidth, "kernel": "epanechnikov"},
            k=n_neigh,
            eps=0.05,
        )
        ex2.fit(p2)
        # Prominence of each initial cluster, normalized by the largest one
        prominences = ex2.diagram_[:, 0] - ex2.diagram_[:, 1]
        prominences /= np.max(prominences)
        x_diagram.extend([bandwidth] * len(prominences))
        y_diagram.extend(prominences)
        bandwidth += bandwidth_values[2]

    plt.subplot(6, 1, n_neigh - 5)
    plt.scatter(x_diagram, y_diagram)
    plt.title('Looking for clusters')
    plt.ylabel('K =' + str(n_neigh - 1))
    plt.xlabel('bandwidth')
    plt.axis('tight')

plt.show()

One can see, for example, that when the bandwidth is $\lambda$ = 0.2, two prominent clusters appear consistently for almost all values of $k$. If we run Tomato with these parameters, we obtain the "desired" result:

In [393]:
ex2 = Tomato(
        input_type="points",
        metric="euclidean",
        graph_type="knn",
        density_type="KDE",
        kde_params = {"bandwidth": 0.2, "kernel": "epanechnikov"},
        n_clusters=2,
        k=8,
        eps=0.05,
    )

labels = ex2.fit_predict(p2)
ex2.plot_diagram()

norm = plt.Normalize(vmin=0, vmax=1)

fig, ax = plt.subplots();

ax.cla()
ax.scatter(*zip(*p2), c=cmap(norm(labels)));

Example 3

We now turn to a rather spectacular example in 3D, just to show how effective the algorithm is at separating clusters with different shapes. Using point samples, we generate a cube, a sphere and a "swiss roll", together with some noise:

In [712]:
import mpl_toolkits.mplot3d.axes3d as plt3
from sklearn.datasets import make_swiss_roll

fig3 = plt.figure()
ax = plt3.Axes3D(fig3)
ax.view_init(7, -70)

points_cube = 1000
points_sphere = 800
#points_line = 700
points_sr = 8000
points_noise = 2000

X1 = np.zeros((points_cube, 3))
for i in range(points_cube):
    X1[i,0], X1[i,1], X1[i,2] = rd.uniform(-2,2), rd.uniform(-2,2), rd.uniform(-2,2)
    
X2 = np.zeros((points_sphere, 3))
for i in range(points_sphere):
    X2[i,0], X2[i,1], X2[i,2] = rd.uniform(-1,1), rd.uniform(-1,1), rd.uniform(-1,1)
    X2[i,0], X2[i,1], X2[i,2] = 12 + 3*X2[i,0]/np.sqrt(X2[i,0]**2 + X2[i,1]**2 + X2[i,2]**2), 15 + 3*X2[i,1]/np.sqrt(X2[i,0]**2 + X2[i,1]**2 + X2[i,2]**2), -4 + 3*X2[i,2]/np.sqrt(X2[i,0]**2 + X2[i,1]**2 + X2[i,2]**2)

"""
X3 = np.zeros((points_line, 3))
for i in range(points_line):
    param = rd.uniform(-15, 15)
    X3[i,0], X3[i,1], X3[i,2] = 2 - param*0.7, 4 + param*0.7, 2 - param*0.6
    X3[:,0] += 0.02*np.random.randn(points_line)
    X3[:,1] += 0.02*np.random.randn(points_line)
    X3[:,2] += 0.02*np.random.randn(points_line)
"""
    
X4, _ = make_swiss_roll(n_samples=points_sr, noise=.05)

X5 = np.zeros((points_noise, 3))
for i in range(points_noise):
    X5[i,0], X5[i,1], X5[i,2] = rd.uniform(-10,15), rd.uniform(-5,20), rd.uniform(-10,15)

X = np.concatenate((X1,X2,X4,X5))
X = np.array(X)

ax.scatter(X[:, 0], X[:, 1], X[:, 2], color="red", s=4);
In [713]:
ax.view_init(50, -150)
fig3
Out[713]:

Let's run the algorithm with $k$-NN and the logDTM estimation:

In [736]:
ex3 = Tomato(
        input_type="points",
        metric="euclidean",
        graph_type="knn",
        density_type="logDTM",
        #n_clusters=2,
        #symmetrize_graph= True,
        k=9,
    )

ex3.fit(X)
ex3.plot_diagram()
print(ex3.labels_)
[154  31  89 ... 281 195 320]

We see 2-3 prominent clusters in the persistence diagram. We can "identify" the noise by checking which points have a low density estimate and giving them a new label. We plot the result at the end:

In [741]:
ex3.n_clusters_ = 3
label = ex3.labels_

for i in range(len(X)):
    if ex3.weights_[i] < 1.2:
        label[i] = 3
    
print(label)

fig3 = plt.figure()
ax = plt3.Axes3D(fig3)
ax.view_init(7, -70)

for l in np.unique(label):
    ax.scatter(X[label == l, 0], X[label == l, 1], X[label == l, 2],
               color=plt.cm.jet(float(l) / np.max(label + 1)),
               s=3)
[2 2 2 ... 1 3 3]
In [742]:
ax.view_init(50, -150)
fig3
Out[742]:

Although the swiss roll is not recovered as a single cluster (it gets split into two regions due to the presence of the sphere), the noise is identified quite properly, and we obtain quite a good result given the complexity of the shapes involved, especially the interior of the spiral.

Example 4

In this example we explore the case in which we don't give the coordinates of the points directly, but the distances between them.

To do so, we sample a set of points on the unit sphere, but not uniformly: we first sample them in a cube of side 1 centered at the origin, using a sigmoid function in each variable to concentrate them near the vertices and edges of the cube, and then we normalize them. This naturally creates regions of the sphere with more points, namely the directions pointing towards the vertices and edges of the cube:

In [816]:
def sample_spherical(npoints):
    vec = []
    vec.append(-0.5 + 1/(1 + np.exp(-5*np.random.uniform(-1,1, npoints))))
    vec.append(-0.5 + 1/(1 + np.exp(-5*np.random.uniform(-1,1, npoints))))
    vec.append(-0.5 + 1/(1 + np.exp(-5*np.random.uniform(-1,1, npoints))))
    vec /= np.linalg.norm(vec, axis=0)
    return vec

npoints = 6000
points = sample_spherical(npoints)

fig3 = plt.figure()
ax = plt3.Axes3D(fig3)
ax.view_init(7, -70)

ax.scatter(points[0,:], points[1,:], points[2,:], s=3);

We now compute the pairwise distances between all the points, using the "spherical" (great-circle) distance $d_{S}$: the distance between two points on the surface of the unit sphere with coordinates $a= (a_{1}, a_{2}, a_{3})$ and $b= (b_{1}, b_{2}, b_{3})$ is given by the formula:

\begin{equation} d_{S}(a,b) = \arccos(a_{1}b_{1}+a_{2}b_{2}+a_{3}b_{3}) \end{equation}

As we don't have many points, we can compute all pairwise distances without much problem:

In [817]:
distance_matrix = np.zeros((npoints, npoints))

for i in range(npoints):
    distance_matrix[i,i]= 0
    for j in range(i+1, npoints):
        distance_matrix[i,j] = np.arccos(points[0,i]*points[0,j] + points[1,i]*points[1,j] + points[2,i]*points[2,j])
        distance_matrix[j,i] = distance_matrix[i,j]
    
print(distance_matrix)
[[0.         2.61770751 0.98831557 ... 1.48066665 1.45452474 2.8471735 ]
 [2.61770751 0.         2.13811295 ... 2.12844918 1.52573066 0.59225011]
 [0.98831557 2.13811295 0.         ... 1.76104435 0.62167259 1.86184991]
 ...
 [1.48066665 2.12844918 1.76104435 ... 0.         2.28207006 1.78291622]
 [1.45452474 1.52573066 0.62167259 ... 2.28207006 0.         1.40485347]
 [2.8471735  0.59225011 1.86184991 ... 1.78291622 1.40485347 0.        ]]
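Incidentally, since the points are unit vectors, the same matrix can be computed without the explicit double loop, which is much faster with NumPy (just a sketch; `points` is the 3 x npoints array built above, and the clip only guards against rounding slightly outside [-1, 1]):

# The Gram matrix of the unit vectors contains the cosines of the pairwise angles
cos_angles = np.clip(points.T @ points, -1.0, 1.0)
distance_matrix = np.arccos(cos_angles)
np.fill_diagonal(distance_matrix, 0.0)   # remove the tiny arccos noise on the diagonal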

KDE and logKDE rely on Scikit-learn's KernelDensity, which requires point coordinates, so we cannot use them with a precomputed distance matrix. We use logDTM instead of DTM to make the persistence diagram clearer:

In [828]:
ex4 = Tomato(
        input_type="points",
        metric="precomputed",
        graph_type="knn",
        density_type="logDTM",
        #n_clusters=2,
        k=10,
    )

ex4.fit(distance_matrix)
ex4.plot_diagram()

There are 8 clear clusters, which is quite the expected result (one cluster per vertex of the cube, i.e. per direction where the points concentrate):

In [825]:
ex4.n_clusters_ = 8
label = ex4.labels_

fig3 = plt.figure()
ax = plt3.Axes3D(fig3)
ax.view_init(25, -160)

for l in np.unique(label):
    ax.scatter(points[0, label == l], points[1, label == l], points[2, label == l],
               color=plt.cm.jet(float(l) / np.max(label + 1)),
               s=3)

Example 5

We do another easy example, just to get used to the other input formats of our algorithm. In this one we provide both the weights of the points and the neighborhood graph ourselves; the graph is simply a rectangular mesh of the square $[-10,10] \times [-10,10]$. For the weights, we use the function:

\begin{equation} f(x,y) = 2 + \sin\Big(\cfrac{x + y}{2}\Big) + \cos\Big(\cfrac{x - y}{2}\Big), \end{equation}

plotted below. In this setting, our algorithm is simply looking for the basins of attraction of the function.

In [933]:
def f(x, y):
    return 2+ np.sin(0.5*(x+y)) + np.cos(0.5*(x-y))

x = np.linspace(-10, 10, 30)
y = np.linspace(-10, 10, 30)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

fig = plt.figure()
ax = plt.axes(projection='3d')
ax.contour3D(X, Y, Z, 50, cmap='binary');
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z');
In [923]:
ax.view_init(70, -50)
fig
Out[923]:

And now the points, together with the neighborhood graph:

In [924]:
size_mesh = 30
points = np.zeros((2, size_mesh**2))
arange = np.linspace(-10., 10., size_mesh)

#Coordinates of the points
for i in range(size_mesh):
    for j in range(size_mesh):
        points[0][i*size_mesh + j] = arange[i]
        points[1][i*size_mesh + j] = arange[j]

#Neighboring graph
neigh_graph = []
for i in range(size_mesh):
    for j in range(size_mesh):
        neigh = []
        if i > 0:
            neigh.append((i-1)*size_mesh + j)
        if i < size_mesh -1:
            neigh.append((i+1)*size_mesh + j)
        if j > 0:
            neigh.append(i*size_mesh + j-1)
        if j < size_mesh -1:
            neigh.append(i*size_mesh + j+1)
        neigh_graph.append(neigh)
In [925]:
#Drawing the graph
plt.plot(points[0,:], points[1,:], 'o', markersize=2);

for i in range(len(neigh_graph)):
    Y = np.zeros((2,2))
    for j in neigh_graph[i]:
        Y[0][0]= points[0][i]
        Y[1][0]= points[1][i]
        Y[0][1]= points[0][j]
        Y[1][1]= points[1][j]
        plt.plot(Y[0], Y[1], 'ro-', linewidth=2)

plt.show()

We now assign to each point its weight according to $f$, and run the ToMATo algorithm to compute the basins of attraction:

In [926]:
#We associate the weights
weights = np.zeros(size_mesh**2)
for i in range(size_mesh**2):
    weights[i] = f(points[0][i], points[1][i])
    
#We run Tomato
ex5 = Tomato(
        graph_type = "manual",
        density_type = "manual"
    )

ex5.fit(neigh_graph, weights= weights)
ex5.plot_diagram()
print(ex5.diagram_)
[[3.75212014 1.99409954]
 [3.75212014 1.99409954]
 [3.98535762 1.98244992]
 [3.98535762 1.98244992]
 [3.9737506  1.9824371 ]
 [3.54402111 1.97677169]
 [3.9882662  1.94776161]]
In [931]:
ex5.n_clusters_ = 7
labels = ex5.fit_predict(neigh_graph, weights= weights)

norm = plt.Normalize(vmin=0, vmax=6)

fig, ax = plt.subplots();

ax.cla()
ax.scatter(points[0,:], points[1,:], c=cmap(norm(labels)));
In [968]:
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.contour3D(X, Y, Z, 50, cmap='binary', linewidths=0.5);
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z');

points3d = np.zeros((3, size_mesh**2))
points3d[0, :] = points[0, :]
points3d[1, :] = points[1, :]
points3d[2, :] = weights          # height of each point = its value of f

for l in np.unique(labels):
    ax.scatter(points3d[0, labels == l], points3d[1, labels == l], points3d[2, labels == l], s=10)
    
ax.view_init(80, -50)
fig = plt.gcf()
fig.set_size_inches(12,7)
plt.show()

Example 6

In this last example, closer to the kind of dataset we could find in real life, we work with the famous "Digits dataset", containing 1797 observations with 64 features each: every entry represents a (highly compressed) hand-written digit on an 8x8 grid, where each cell varies from 0 to 16, representing its opacity. Naturally, the dataset also contains the correct label of each instance: the number from 0 to 9 written in the grid.

In [987]:
#Load the digits dataset
digits = datasets.load_digits()

#Display the 25th digit
plt.figure(1, figsize=(3, 3))
plt.imshow(digits.images[25], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()

The difficulty of running data-science algorithms in high dimension is well known, and clustering is no exception; in fact, it is particularly sensitive to sparse numerical data. Thus, even with dimensionality-reduction techniques, we should not expect a brilliant performance from our algorithm in this setting. In any case, it is interesting to see what kind of results we get. The results of other clustering methods on this dataset can be found in [4].

In [1056]:
digits, real_label = datasets.load_digits(return_X_y=True)

print(digits)
print(real_label)
[[ 0.  0.  5. ...  0.  0.  0.]
 [ 0.  0.  0. ... 10.  0.  0.]
 [ 0.  0.  0. ... 16.  9.  0.]
 ...
 [ 0.  0.  1. ...  6.  0.  0.]
 [ 0.  0.  2. ... 12.  0.  0.]
 [ 0.  0. 10. ... 12.  1.  0.]]
[0 1 2 ... 8 9 8]

We can embed the dataset in the plane by using PCA dimensionality reduction. We observe that, with that reduction level, the different clusters of numbers are somewhat distinguishable, but there is also considerable overlapping:

In [1009]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
digits_red = pca.fit_transform(digits)

def plot_clustering(X_red, labels, title=None):
    x_min, x_max = np.min(X_red, axis=0), np.max(X_red, axis=0)
    X_red = (X_red - x_min) / (x_max - x_min)

    plt.figure(figsize=(6, 4))
    for i in range(X_red.shape[0]):
        plt.text(X_red[i, 0], X_red[i, 1], str(real_label[i]),  # show the true digit
                 color=plt.cm.nipy_spectral(labels[i] / 10.),
                 fontdict={'weight': 'bold', 'size': 8})

    plt.xticks([])
    plt.yticks([])
    if title is not None:
        plt.title(title, size=15)
    plt.axis('off')
    fig = plt.gcf()
    fig.set_size_inches(12,7)
    
plot_clustering(digits_red, real_label, title = "2d embedding of the digits dataset, colors= real labels")

It's useless to try to run the algorithm without any dimensionality reduction first: accurate density estimation is almost always unsuccessful with highly sparse data. We can, however, use our algorithm after removing some dimensions. With our dataset, after some experimentation, keeping 11 dimensions makes the DTM density estimation look quite good:

In [1110]:
pca = PCA(n_components=11)
digits_red = pca.fit_transform(digits)

ex6 = Tomato(
        input_type="points",
        metric="euclidean",
        graph_type="knn",
        density_type="logDTM",
        n_clusters=10,
        k=9,
    )

ex6.fit(digits_red)
ex6.plot_diagram()

It looks like the algorithm "naturally" found 9-10 clusters; let's plot these 10 groups in 2D:

In [1100]:
labels = ex6.fit_predict(digits_red)

pca = PCA(n_components=2)
digits_red = pca.fit_transform(digits_red)

plot_clustering(digits_red, labels, title = "2d embedding of the digits dataset, colors= clusters")

The result actually looks surprisingly good.

A way to measure the matching quality consists in computing the vector $10 \cdot \text{real\_label} + \text{clustering\_label}$, which takes values in $\{0, \dots, 99\}$, and then counting how many times each value appears. In a perfect classification, only 10 values would appear, namely $10\, i + \text{label}_{i}$ with $i \in \{0, \dots, 9\}$; in a decent clustering, we should at least see some clearly more frequent values, which is indeed what happens in our case!

In [1111]:
vect_count = 10*real_label + labels
print(vect_count)

count = np.zeros((2,100))
for i in range(100):
    count[0,i]= i
for i in vect_count:
    count[1,i] += 1

fig, ax = plt.subplots();
ax.cla()
ax.scatter(count[0,:], count[1,:], s=16);
[ 0 12 27 ... 82 97 87]

From this graph it looks like most of the digits have been properly grouped; only the digit 8 looks seriously mismatched. We can also get an idea of how the digits have been grouped or labeled using a table and checking its columns and rows: each of them should contain only one "big" value:

In [1102]:
import pandas as pd

table = []
for i in range(10):
    row = []
    for j in range(i*10, i*10 + 10):   # counts for true digit i and clustering labels 0..9
        row.append(count[1, j])
    table.append(row)

pd.DataFrame(table, columns=['Label 0', 'Label 1', 'Label 2', 'Label 3', 'Label 4', 'Label 5', 'Label 6', 'Label 7', 'Label 8', 'Label 9'])
Out[1102]:
Label 0 Label 1 Label 2 Label 3 Label 4 Label 5 Label 6 Label 7 Label 8 Label 9
0 177.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 1.0 154.0 0.0 0.0 1.0 0.0 0.0 26.0 0.0
2 1.0 0.0 2.0 0.0 164.0 0.0 1.0 9.0 0.0 0.0
3 0.0 0.0 7.0 0.0 0.0 3.0 3.0 170.0 0.0 0.0
4 0.0 0.0 2.0 177.0 0.0 0.0 2.0 0.0 0.0 0.0
5 0.0 1.0 0.0 1.0 0.0 178.0 0.0 2.0 0.0 0.0
6 1.0 179.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0 0.0 168.0 0.0 0.0 11.0
8 0.0 2.0 82.0 0.0 2.0 4.0 2.0 82.0 0.0 0.0
9 0.0 0.0 20.0 0.0 0.0 5.0 2.0 144.0 0.0 9.0

The clustering has been quite successful. Still, the 3's and 9's have been almost completely clustered together (which is not that surprising, given the low resolution of the dataset), and the 8's are almost evenly divided between labels 2 (with the 1's) and 7 (with the 3's and 9's), again not very surprising.

References

[1]. Frédéric Chazal et al. Persistence-Based Clustering in Riemannian Manifolds. Journal of the ACM 60 (June 2011). doi:10.1145/1998196.1998212. URL: https://hal.inria.fr/inria-00389390/document

[2]. Tutorial for DTM: https://github.com/GUDHI/TDA-tutorial/blob/master/Tuto-GUDHI-DTM-filtrations.ipynb

[3]. Gérard Biau, Frédéric Chazal, David Cohen-Steiner, Luc Devroye, Carlos Rodriguez. A Weighted k-Nearest Neighbor Density Estimate for Geometric Inference. 2011. HAL: inria-00560623v1. URL: http://luc.devroye.org/BiauChazalCohenDevroyeRodriguez-kNN-2EJS-2011.pdf

[4]. Various Agglomerative Clustering on a 2D embedding of digits (scikit-learn example): https://scikit-learn.org/stable/auto_examples/cluster/plot_digits_linkage.html#sphx-glr-auto-examples-cluster-plot-digits-linkage-py