KNN-Graph

3 minute read

Published:

Let’s say we have a dataset of 10 molecules (A through J), and we want to construct a 3-NN graph (k=3) where k indicates the number of neighbours.

After preparing our data, transforming it into fingerprints and then calculating its MinHash signature and stored in the LSH Forest (see LSH Forest for more information). We use the LSH Forest to find the nearest neighbours for each molecule.

  1. First we query the LSH Forest: Let’s say we use Molecule A’s MinHash signature (e.g. 10010101) and we use it to query all the LSH tables in the forest. Then we find that the neighbours (i.e. items stored in the same or near leaves) are the following set of candidates {B, C, D, E, F}
  2. Compute the Actual distances between them For each candidate, we compute the distance to molecule A Let’s say we get the distances A-B: 0.2 A-C: 0.3 A-D: 0.4 A-E: 0.5 A-F: 0.6 Typically in the context of fingerprints (like those often used for molecules), Jaccard similarity is used for calculating the distances between points. In fact, the (Jaccard) distance is $Jaccard\ distance = 1- Jaccard\ similarity$

To calculate the Jaccard similarity: Let A and B be to binary vectors, theN: Jaccard Similarity = (A ∩ B) / (A ∪ B) Where: (A ∩ B) is the number of positions where both A and B have 1s (A ∪ B) is the total number of positions where either A or B (or both) have 1s

  1. Select k Nearest Neighbours: For that we sort these by distance and select the top k (in this case, 3): Molecule A’s 3 nearest neighbors are B, C, and D.

Now our objective is t create the Edges in the Graph. For molecule A, we create B, C and D. We repeat this process for all molecules. Ending up with something like this:

image

The result is a graph where:

  • Each node represents a molecule.
  • Each node has outgoing edges to its k nearest neighbors.
  • The graph is directed (edges have a direction) and not necessarily symmetric.
  • Edge weights could be stored (representing distances) if needed for later steps.

The graph is a crucial intermediate step in tmap. It captures the local similarity relationships between molecules, which will be used in the next step to construct the Minimum Spanning Tree (MST).

Why a directed Graph?

One thing you may have noticed is that the graph is directed rather than undirected. The reason being that the k-NN relationship is not necessarily symmetric. So, this raises the question What is the difference between A –> B and B –> A? How can it be that if A considers B a nearest neighbour, B might not necessarily consider A as one of its nearest neighbours?

image

In this diagram:

  • A considers B, C, and D as its 3 nearest neighbors.
  • B considers E, F, and A as its 3 nearest neighbors.

Even though the distance between A and B is the same in both directions (0.2), B has other neighbors (E and F) that are closer to it than A is.

This asymmetry occurs because:

  1. We’re looking at a fixed number (k) of nearest neighbors for each molecule.
  2. The distribution of distances can be different around different points in the high-dimensional space.

In our example:

  • From A’s perspective, B is its closest neighbor at a distance of 0.2.
  • From B’s perspective, while A is close (0.2), it has even closer neighbors (E at 0.1 and F at 0.15).

Comments