Choosing k (Number of Clusters)

Selecting the number of clusters means balancing cluster granularity against computational cost and the waste of empty clusters. The scripts/select_k.py script automates this by sweeping over k values on a subsample of PQ codes.

Running the sweep

python scripts/select_k.py \
    --pq-codes data/pq_codes.npy \
    --encoder models/encoder.joblib \
    --n-subsample 10000000 \
    --k-values 10000 25000 50000 100000 200000 \
    --iterations 10 \
    --output results/k_selection.csv \
    --plot results/k_selection.png

The script supports checkpointing — results are saved to the CSV after each k value. If interrupted, rerun the same command and it resumes from where it left off.
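The resume behaviour described above can be sketched as follows. This is a minimal illustration, not the code in scripts/select_k.py: the `k` column name, the `evaluate` callback, and both function names are assumptions made for the example.

```python
import csv
import os


def completed_k_values(csv_path):
    """Return the set of k values already recorded in the results CSV."""
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, newline="") as f:
        return {int(row["k"]) for row in csv.DictReader(f)}


def run_sweep(k_values, csv_path, evaluate):
    """Evaluate each pending k and append its row to the CSV right away.

    `evaluate` is a hypothetical callback that takes k and returns a dict
    of metrics; it stands in for the actual clustering and scoring step.
    """
    done = completed_k_values(csv_path)
    for k in k_values:
        if k in done:
            continue  # finished in an earlier run, so skip it
        row = {"k": k, **evaluate(k)}
        write_header = (
            not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
        )
        # Appending after every k means an interruption loses at most
        # the k value that was in progress.
        with open(csv_path, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(row))
            if write_header:
                writer.writeheader()
            writer.writerow(row)
```

Because each row is written as soon as its k value finishes, rerunning with the same arguments only repeats the k value that was interrupted.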

Results on 100M Enamine REAL molecules

Benchmark configuration: AMD Ryzen 7, 64 GB RAM, 10 PQk-means iterations per k.

k	Avg Distance	Empty Clusters	Median Cluster Size	Fit Time
10,000	3.65	6.8%	8,945	1.3 h
25,000	2.74	13.3%	3,673	3.1 h
50,000	2.17	19.6%	1,876	6.2 h
100,000	1.69	26.6%	956	12.6 h
200,000	1.30	34.7%	492	26.4 h

How to interpret these metrics

Avg Distance: Mean PQ-space distance from each point to its assigned centroid. Lower is better, but diminishing returns set in quickly.
Empty Clusters: Fraction of clusters with zero members. High values mean you're over-partitioning — the data doesn't have that many natural groupings.
Median Cluster Size: Typical number of molecules per cluster. Determines how many molecules you see in each leaf TMAP.
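As a rough illustration of how these three metrics relate to the raw clustering output, the sketch below computes them with NumPy from per-point labels and distances. The function name and inputs are hypothetical, and it assumes the median is taken over non-empty clusters only; the actual script may define it differently.

```python
import numpy as np


def cluster_metrics(labels, distances, k):
    """Summarize one clustering run.

    labels    : (n,) int array, cluster index assigned to each point
    distances : (n,) float array, distance from each point to its centroid
    k         : number of clusters requested
    """
    # Count members per cluster; minlength keeps empty clusters in the tally.
    sizes = np.bincount(labels, minlength=k)
    non_empty = sizes[sizes > 0]
    return {
        "avg_distance": float(np.mean(distances)),
        "empty_fraction": float(np.mean(sizes == 0)),
        # Assumption: the median is taken over non-empty clusters only.
        "median_cluster_size": float(np.median(non_empty)),
    }
```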

Guidelines

k = 50,000 is a good default: under 20% empty clusters, a median cluster size of about 1,900, and the improvement in average distance starts to plateau beyond this point.
k = 100,000 if you need tighter clusters and can tolerate ~27% empty clusters.
At k = 200,000 and beyond, over a third of clusters are empty (34.7% at 200K), so additional clusters yield diminishing returns.
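One way to make the plateau concrete is to look at how much average distance drops per added cluster at each step of the sweep. The sketch below uses the values from the benchmark table above; the function name and the choice of "per 10,000 added clusters" as the unit are illustrative assumptions.

```python
# (k, average distance) pairs copied from the benchmark table above
results = [
    (10_000, 3.65),
    (25_000, 2.74),
    (50_000, 2.17),
    (100_000, 1.69),
    (200_000, 1.30),
]


def marginal_gains(results, per=10_000):
    """Drop in average distance per `per` added clusters, for each step in k."""
    gains = []
    for (k0, d0), (k1, d1) in zip(results, results[1:]):
        gains.append((k1, (d0 - d1) / (k1 - k0) * per))
    return gains


for k, gain in marginal_gains(results):
    print(f"up to k={k:>7,}: {gain:.3f} lower avg distance per 10,000 added clusters")
```

The gain per 10,000 added clusters falls from roughly 0.61 (10K to 25K) to 0.23, 0.10, and 0.04 for the later steps, while fit time keeps growing in proportion to k.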

Scaling estimates

Fit time scales approximately linearly with both n (number of molecules) and k. Extrapolating from the 100M-molecule benchmark above:

Scenario	Estimated Fit Time
1B molecules, k=50K	~2.6 days
1B molecules, k=100K	~5.2 days
2B molecules, k=100K	~10.5 days
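The estimates in this table follow from simple proportional scaling of the measured fit times. The sketch below reproduces that arithmetic; the function and constant names are hypothetical, and the linear-scaling assumption is only as good as the benchmark it extrapolates from.

```python
# Measured fit times (hours) at n = 100M, copied from the benchmark table
BASELINE_N = 100_000_000
BASELINE_HOURS = {50_000: 6.2, 100_000: 12.6}


def estimate_fit_days(n, k, ref_k=None):
    """Extrapolate fit time in days, assuming linear scaling in n and k."""
    if ref_k is None:
        # Use the measured k closest to the target as the reference point
        ref_k = min(BASELINE_HOURS, key=lambda m: abs(m - k))
    hours = BASELINE_HOURS[ref_k] * (n / BASELINE_N) * (k / ref_k)
    return hours / 24


for n, k in [(1_000_000_000, 50_000),
             (1_000_000_000, 100_000),
             (2_000_000_000, 100_000)]:
    print(f"n={n:,}, k={k:,}: ~{estimate_fit_days(n, k):.1f} days")
```

For example, 1B molecules at k=50K is 6.2 h × 10 = 62 h, or about 2.6 days.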

Memory requirements

PQ codes are (n, 42) uint8 arrays. At 1B molecules this is ~42 GB. Ensure your system has sufficient RAM, or use the --n-subsample flag to fit on a representative subset and then assign in chunks.
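The memory figure is just n × 42 bytes, and chunked assignment can be sketched with a memory-mapped array. In the example below, `assign` is a hypothetical stand-in for whatever function maps a block of PQ codes to cluster labels (for instance, a fitted model's predict method); the helper names and the chunk size are assumptions too.

```python
import numpy as np


def pq_codes_gigabytes(n, n_subquantizers=42):
    """Size of an (n, n_subquantizers) uint8 array in GB (10**9 bytes)."""
    return n * n_subquantizers / 1e9


def iter_chunks(codes, chunk_size):
    """Yield (start, block) pairs that together cover every row of `codes`."""
    for start in range(0, len(codes), chunk_size):
        yield start, codes[start:start + chunk_size]


def assign_in_chunks(codes, assign, chunk_size=1_000_000):
    """Label every row by calling `assign` on one block at a time."""
    # int32 is enough for the k values discussed here (up to 200,000).
    labels = np.empty(len(codes), dtype=np.int32)
    for start, block in iter_chunks(codes, chunk_size):
        labels[start:start + len(block)] = assign(block)
    return labels
```

Loading the codes with `np.load("data/pq_codes.npy", mmap_mode="r")` keeps the array on disk, so only the block currently being assigned needs to be resident in memory. Note that the label array itself still takes 4 bytes per molecule.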