API Reference¶
Core Classes¶
DataStreamer¶
chelombus.streamer.data_streamer.DataStreamer
¶
Class for streaming large datasets in manageable chunks.
Source code in chelombus/streamer/data_streamer.py
parse_input(input_path, chunksize=None, verbose=0, smiles_col=0)
¶
Reads input data from a file or a directory of files and yields the data in chunks.
This method processes each file provided by the input path, which can be either a single file
or a directory containing multiple files. Lines in plain-text files are split and the column at
smiles_col is used as the SMILES string. SDF/SD files are handled in a streaming fashion via
RDKit and converted to SMILES on the fly. For each file, the extracted SMILES strings are placed
in a buffer until it reaches chunksize, at which point the buffer is yielded and cleared. If
chunksize is None or there are remaining lines after processing a file, those are yielded as a
final chunk.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input_path` | `str` | The path to a file or directory containing input files. | required |
| `chunksize` | `Optional[int]` | The number of lines to include in each yielded chunk. If None, the entire file content is yielded as a single chunk. | `None` |
| `verbose` | `int` | Level of verbosity. Default is 0. | `0` |
| `smiles_col` | `int` | Column index for text inputs in which SMILES appear among several columns. Ignored for SDF inputs, where SMILES strings are generated from molecule blocks. | `0` |
Yields:

| Type | Description |
|---|---|
| `List[str]` | A list of SMILES strings extracted from the input file(s); each yielded list holds at most `chunksize` entries. |
Source code in chelombus/streamer/data_streamer.py
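The buffering behaviour described above can be sketched in plain Python. This is a simplified stand-in, not the actual `DataStreamer` code: real inputs are read from files, and SDF handling via RDKit is omitted.

```python
def chunk_lines(lines, chunksize=None, smiles_col=0):
    """Sketch of the buffering logic: split each line, keep the SMILES
    column, and yield a chunk whenever the buffer fills up."""
    buffer = []
    for line in lines:
        buffer.append(line.split()[smiles_col])
        if chunksize is not None and len(buffer) >= chunksize:
            yield buffer
            buffer = []
    if buffer:  # remaining lines become the final chunk
        yield buffer

chunks = list(chunk_lines(["CCO mol1", "c1ccccc1 mol2", "CC mol3"], chunksize=2))
# chunks == [['CCO', 'c1ccccc1'], ['CC']]
```

With `chunksize=None` the generator yields everything as a single final chunk, matching the documented behaviour.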
FingerprintCalculator¶
chelombus.utils.fingerprints.FingerprintCalculator
¶
A class to compute molecular fingerprints from a list of SMILES strings. Currently, only the 'morgan' and 'mqn' fingerprint types are supported.
Source code in chelombus/utils/fingerprints.py
FingerprintFromSmiles(smiles, fp, nprocesses=os.cpu_count(), **params)
¶
Generate fingerprints for a list of SMILES strings in parallel.
The method selects the appropriate fingerprint function based on the 'fp' parameter, binds additional keyword parameters using functools.partial, and then applies the function across the SMILES list using multiprocessing.Pool.map.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `smiles` | `list` | A list of SMILES strings. | required |
| `fp` | `str` | The fingerprint type to compute (e.g., 'morgan'). | required |
| `nprocesses` | `int` | Number of processes for parallel fingerprint calculation. Defaults to the number of CPU cores. | `cpu_count()` |
| `**params` | | Additional keyword parameters for the fingerprint function. For 'morgan', required keys are `fpSize` (int), the number of bits in the fingerprint, and `radius` (int), the radius of the Morgan fingerprint. | `{}` |
Returns:

| Type | Description |
|---|---|
| `npt.NDArray` | A NumPy array of fingerprints with shape (number of SMILES, fpSize). |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If an unsupported fingerprint type is requested. |
Source code in chelombus/utils/fingerprints.py
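The dispatch pattern the method uses — bind the fingerprint parameters with `functools.partial`, then map the bound function over the SMILES list — can be sketched with a toy fingerprint function. `toy_fp` is hypothetical; the real class computes RDKit fingerprints and maps with `multiprocessing.Pool.map` rather than a plain loop.

```python
from functools import partial

def toy_fp(smiles, fpSize, radius):
    # Hypothetical stand-in for a real fingerprint function:
    # bit i is set when the SMILES string is longer than i + radius.
    return [1 if len(smiles) > i + radius else 0 for i in range(fpSize)]

# Bind the keyword parameters once, then apply across the list.
fp_fn = partial(toy_fp, fpSize=4, radius=1)
fps = [fp_fn(s) for s in ["CCO", "c1ccccc1"]]
# "CCO" (length 3) sets bits 0 and 1 -> [1, 1, 0, 0]
```

Binding parameters up front is what lets the real implementation hand a single-argument function to `Pool.map`.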
PQEncoder¶
chelombus.encoder.encoder.PQEncoder
¶
Bases: PQEncoderBase
Class to encode high-dimensional vectors into PQ-codes.
Source code in chelombus/encoder/encoder.py
pq_trained = []
instance-attribute
¶
The codebook is defined as the Cartesian product of all centroids. Storing the codebook C explicitly is not efficient: the full codebook would contain (k')^m centroids. Instead we store the k' centroids of each of the m subquantizers (k' · m in total) and can later simulate the full codebook by combining the centroids from each subquantizer.
__init__(k=256, m=8, iterations=20)
¶
Initializes the encoder with trained sub-block centroids.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `k` | `int` | Number of centroids per subquantizer. Default is 256; all subquantizers are assumed to share the same number of centroids. | `256` |
| `m` | `int` | Number of distinct subvectors of the input X vector. | `8` |
| `iterations` | `int` | Number of iterations for the k-means. | `20` |
High values of k increase the computational cost of the quantizer as well as the memory usage of storing the centroids (k' × D floating-point values). Using k=256 and m=8 is often a reasonable choice.
Reference: DOI: 10.1109/TPAMI.2010.57
Source code in chelombus/encoder/encoder.py
fit(X_train, verbose=1, **kwargs)
¶
KMeans fitting of every subvector matrix from the X_train matrix. Populates the codebook by storing the cluster centers of every subvector.
X_train is the input matrix: for vectors of dimension D, X_train is a matrix of size (N, D), where N is the number of rows (vectors) and D the number of columns (the dimension of every vector, i.e. the fingerprint in the case of molecular data).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `X_train` | `array` | Input matrix to train the encoder. | required |
| `verbose` | `int` | Level of verbosity. Default is 1. | `1` |
| `**kwargs` | | Optional keyword arguments passed to the underlying KMeans. | `{}` |
Source code in chelombus/encoder/encoder.py
transform(X, verbose=1, **kwargs)
¶
Transforms the input matrix X into its PQ-codes.
For each sample in X, the input vector is split into m equal-sized subvectors.
Each subvector is assigned to the nearest cluster centroid
and the index of the closest centroid is stored.
The result is a compact representation of X, where each sample is encoded as a sequence of centroid indices.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `X` | `ndarray` | Input data matrix of shape (n_samples, n_features), where n_features must be divisible by the number of subvectors. | required |
| `verbose` | `int` | Level of verbosity. Default is 1. | `1` |
| `**kwargs` | | Optional keyword arguments passed to the underlying KMeans. | `{}` |
Returns:

| Type | Description |
|---|---|
| `np.ndarray` | PQ codes of shape (n_samples, m), where each element is the index of the nearest centroid for the corresponding subvector. |
Source code in chelombus/encoder/encoder.py
fit_transform(X, verbose=1, **kwargs)
¶
Fit and transform the input matrix X into its PQ-codes.
The encoder is trained on the matrix; then, for each sample in X, the input vector is split into m equal-sized subvectors and each subvector is replaced by the index of its closest centroid. Returns a compact representation of X, where each sample is encoded as a sequence of centroid indices (i.e. PQ-codes).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `X` | `array` | Input data matrix of shape (n_samples, n_features). | required |
| `verbose` | `int` | Level of verbosity. Defaults to 1. | `1` |
| `**kwargs` | | Optional keyword arguments passed to the underlying KMeans. | `{}` |
Returns:

| Type | Description |
|---|---|
| `np.ndarray` | PQ codes of shape (n_samples, m), where each element is the index of the nearest centroid for the corresponding subvector. |
Source code in chelombus/encoder/encoder.py
inverse_transform(pq_codes, binary=False, round=True)
¶
Inverse transform: from PQ-codes back to the original vectors. This process is lossy, so we do not expect to recover the exact same data. If binary=True, the vectors are returned in binary form, which is useful when the original vectors were binary: a reconstructed vector such as [0.32134, 0.8232, 0.0132, ..., 0.1432, 1.19234] is transformed to [0, 1, 0, ..., 0, 1]. If round=True, the vectors are rounded to integer values (useful when we expect a count-based fingerprint).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `pq_codes` | `np.array` | Input PQ codes to be transformed back into approximate original vectors. | required |
| `binary` | `bool` | Whether to return the vectors rounded to 0s and 1s. Default is False. | `False` |
| `round` | `bool` | Round the inverse-transformed vectors to integer values. Default is True. | `True` |
Source code in chelombus/encoder/encoder.py
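The encode/decode cycle implemented by `transform` and `inverse_transform` can be illustrated with a tiny pure-Python sketch. The codebooks below are hand-picked for illustration (m=2 subvectors of length 2, k=2 centroids each); the real encoder learns its codebooks with k-means and operates on NumPy arrays.

```python
# Toy codebooks: one list of centroids per subquantizer (made-up values).
codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],   # subquantizer 0
    [(0.0, 1.0), (1.0, 0.0)],   # subquantizer 1
]

def encode(vec):
    codes = []
    for j, cents in enumerate(codebooks):
        sub = vec[2 * j: 2 * j + 2]
        # Squared Euclidean distance to every centroid of this subquantizer.
        dists = [sum((a - b) ** 2 for a, b in zip(sub, c)) for c in cents]
        codes.append(dists.index(min(dists)))  # index of the nearest centroid
    return codes

def decode(codes):
    # Lossy inverse: replace each code by its centroid's coordinates.
    out = []
    for j, c in enumerate(codes):
        out.extend(codebooks[j][c])
    return out

codes = encode([0.9, 1.1, 0.1, 0.8])   # -> [1, 0]
approx = decode(codes)                  # -> [1.0, 1.0, 0.0, 1.0]
```

The round trip returns centroid coordinates rather than the original values, which is exactly why `inverse_transform` is described as lossy.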
PQKMeans¶
chelombus.clustering.PyQKmeans.PQKMeans
¶
This class provides a scikit-learn-like interface to the PQk-means algorithm, which operates directly on PQ codes using symmetric distance.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `encoder` | `PQEncoder` | A trained PQEncoder instance. | required |
| `k` | `int` | Number of clusters. | required |
| `iteration` | `int` | Number of k-means iterations (default: 20). | `20` |
| `verbose` | `bool` | Whether to print progress information (default: False). | `False` |
Example

```python
encoder = PQEncoder(k=256, m=6)
encoder.fit(training_data)
pq_codes = encoder.transform(data)
clusterer = PQKMeans(encoder, k=100000)
labels = clusterer.fit_predict(pq_codes)
```
Source code in chelombus/clustering/PyQKmeans.py
cluster_centers_
property
writable
¶
Get cluster centers (PQ codes of shape (k, m)).
fit(X_train)
¶
Fit the PQk-means model to training PQ codes.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `X_train` | `ndarray` | PQ codes of shape (n_samples, n_subvectors). | required |
Returns:

| Type | Description |
|---|---|
| `PQKMeans` | self |
Source code in chelombus/clustering/PyQKmeans.py
predict(X)
¶
Predict cluster labels for PQ codes.
Uses Numba JIT-compiled parallel assignment with precomputed symmetric distance lookup tables.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `X` | `ndarray` | PQ codes of shape (n_samples, n_subvectors), dtype uint8. | required |
Returns:

| Type | Description |
|---|---|
| `ndarray` | Cluster labels of shape (n_samples,). |
Source code in chelombus/clustering/PyQKmeans.py
fit_predict(X)
¶
Fit the model and predict cluster labels in one step.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `X` | `ndarray` | PQ codes of shape (n_samples, n_subvectors). | required |
Returns:

| Type | Description |
|---|---|
| `ndarray` | Cluster labels of shape (n_samples,). |
Source code in chelombus/clustering/PyQKmeans.py
Cluster I/O¶
chelombus.utils.cluster_io
¶
Cluster I/O utilities using DuckDB for efficient querying.
This module provides functions for querying and exporting clustered molecular data stored in chunked parquet files. It uses DuckDB for efficient SQL-based queries across multiple files without loading everything into memory.
Architecture
During pipeline.transform(), data is saved as chunked parquet files:

    results/chunk_00001.parquet  (smiles, cluster_id)
    results/chunk_00002.parquet  (smiles, cluster_id)
    ...
These functions query across all chunks using DuckDB's glob pattern support, enabling efficient filtering and export without loading all data into RAM.
Example

```python
from chelombus.utils.cluster_io import query_cluster, export_all_clusters

# Get molecules from cluster 42
df = query_cluster('results/', cluster_id=42)

# Export all clusters (for HPC/SLURM processing)
stats = export_all_clusters('results/', 'clusters/')
```
query_cluster(results_dir, cluster_id, cluster_column='cluster_id', columns=None)
¶
Query all molecules from a specific cluster.
Scans all chunked parquet files in results_dir and returns molecules belonging to the specified cluster as a pandas DataFrame.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `results_dir` | `Union[str, Path]` | Path to directory containing chunked parquet files (output from pipeline.transform()). | required |
| `cluster_id` | `int` | The cluster ID to query. | required |
| `cluster_column` | `str` | Name of the column containing cluster IDs. | `'cluster_id'` |
| `columns` | `Optional[List[str]]` | Optional list of columns to return. Default returns all columns. Common columns: ['smiles', 'cluster_id']. | `None` |
Returns:

| Type | Description |
|---|---|
| `pd.DataFrame` | DataFrame containing all molecules from the specified cluster. |
Raises:

| Type | Description |
|---|---|
| `ImportError` | If duckdb is not installed. |
| `FileNotFoundError` | If results_dir doesn't exist. |
| `ValueError` | If no parquet files are found or cluster_id doesn't exist. |
Example

```python
df = query_cluster('results/', cluster_id=42)
print(f"Cluster 42 has {len(df)} molecules")
print(df['smiles'].head())
```
Source code in chelombus/utils/cluster_io.py
export_cluster(results_dir, cluster_id, output_path, cluster_column='cluster_id', columns=None)
¶
Export a single cluster to a file.
Queries all molecules from a specific cluster and writes them to a file. Supports parquet and CSV output formats.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `results_dir` | `Union[str, Path]` | Path to directory containing chunked parquet files. | required |
| `cluster_id` | `int` | The cluster ID to export. | required |
| `output_path` | `Union[str, Path]` | Output file path. Must end with .parquet or .csv. | required |
| `cluster_column` | `str` | Name of the column containing cluster IDs. | `'cluster_id'` |
| `columns` | `Optional[List[str]]` | Optional list of columns to export. Default exports all columns. | `None` |
Returns:

| Type | Description |
|---|---|
| `int` | Number of molecules exported. |
Raises:

| Type | Description |
|---|---|
| `ImportError` | If duckdb is not installed. |
| `ValueError` | If output format is not supported. |
Example

```python
count = export_cluster('results/', 42, 'cluster_42.parquet')
print(f"Exported {count} molecules")
```
Source code in chelombus/utils/cluster_io.py
export_all_clusters(results_dir, output_dir, format='parquet')
¶
Export all clusters to individual files using partitioned writes.
This function exports all clusters to separate files in a single streaming pass through the data, using DuckDB's PARTITION_BY functionality for memory efficiency.
Output structure (Hive-style partitioning):

    output_dir/
    ├── cluster_id=0/data_0.parquet
    ├── cluster_id=1/data_0.parquet
    ├── cluster_id=2/data_0.parquet
    ...
    └── cluster_id=99999/data_0.parquet

This structure is compatible with:

- DuckDB queries: SELECT * FROM 'clusters/*/*.parquet' WHERE cluster_id = 42
- Spark/pandas: reads the partition column automatically
- SLURM arrays: process cluster_id=$SLURM_ARRAY_TASK_ID
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `results_dir` | `Union[str, Path]` | Path to directory containing chunked parquet files. | required |
| `output_dir` | `Union[str, Path]` | Output directory for partitioned cluster files. | required |
| `format` | `str` | Output format, either 'parquet' or 'csv'. Default 'parquet'. | `'parquet'` |
| `batch_size` | | Number of clusters to process per batch for progress reporting. Does not affect memory usage (DuckDB handles streaming internally). | required |
Returns:

| Type | Description |
|---|---|
| `dict[int, int]` | Dictionary mapping cluster_id to molecule count. |
Raises:

| Type | Description |
|---|---|
| `ImportError` | If duckdb is not installed. |
| `ValueError` | If format is not supported. |
Example

```python
stats = export_all_clusters('results/', 'clusters/')
print(f"Exported {len(stats)} clusters")
print(f"Largest cluster: {max(stats.values())} molecules")
```
Note
For HPC/SLURM usage, clusters can be processed in parallel, one partition per array task.
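A minimal sketch of that pattern, assuming the Hive-style layout shown earlier (the per-task processing itself is left out, and `partition_path` is a hypothetical helper, not part of chelombus):

```python
import os

def partition_path(output_dir, task_id):
    # Hive-style partition directory written by export_all_clusters.
    return f"{output_dir}/cluster_id={task_id}"

# In a SLURM array job, each task reads its own partition; SLURM sets
# SLURM_ARRAY_TASK_ID in the environment ("0" here as a fallback).
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
path = partition_path("clusters", task_id)
```

Each array task then operates only on its own partition, so no coordination between tasks is needed.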
Source code in chelombus/utils/cluster_io.py
get_cluster_stats(results_dir, column='cluster_id')
¶
Get statistics for all clusters.
Returns a DataFrame with cluster IDs and molecule counts, sorted by cluster_id.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `results_dir` | `Union[str, Path]` | Path to directory containing chunked parquet files. | required |
| `column` | `str` | Name of the column containing cluster IDs. Defaults to 'cluster_id'. | `'cluster_id'` |
Returns:

| Type | Description |
|---|---|
| `pd.DataFrame` | DataFrame with columns ['cluster_id', 'molecule_count']. |
Example

```python
stats = get_cluster_stats('results/')
print(f"Number of clusters: {len(stats)}")
print(f"Average cluster size: {stats['molecule_count'].mean():.1f}")
print(f"Largest cluster: {stats['molecule_count'].max()}")
```
Source code in chelombus/utils/cluster_io.py
get_cluster_ids(results_dir, cluster_id_column_name='cluster_id')
¶
Get list of all cluster IDs in the results.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `results_dir` | `Union[str, Path]` | Path to directory containing chunked parquet files. | required |
Returns:

| Type | Description |
|---|---|
| `List[int]` | Sorted list of cluster IDs. |
Example

```python
cluster_ids = get_cluster_ids('results/')
print(f"Clusters range from {min(cluster_ids)} to {max(cluster_ids)}")
```
Source code in chelombus/utils/cluster_io.py
get_total_molecules(results_dir)
¶
Get total number of molecules across all clusters.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `results_dir` | `Union[str, Path]` | Path to directory containing chunked parquet files. | required |
Returns:

| Type | Description |
|---|---|
| `int` | Total molecule count. |
Source code in chelombus/utils/cluster_io.py
query_clusters_batch(results_dir, cluster_ids, columns=None, cluster_column='cluster_id')
¶
Query molecules from multiple clusters at once.
More efficient than calling query_cluster() multiple times when you need data from several clusters.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `results_dir` | `Union[str, Path]` | Path to directory containing chunked parquet files. | required |
| `cluster_ids` | `List[int]` | List of cluster IDs to query. | required |
| `columns` | `Optional[List[str]]` | Optional list of columns to return. | `None` |
Returns:

| Type | Description |
|---|---|
| `pd.DataFrame` | DataFrame containing molecules from all specified clusters. |
Example

```python
df = query_clusters_batch('results/', [1, 2, 3, 4, 5])
cluster_counts = df.groupby('cluster_id').size()
```
Source code in chelombus/utils/cluster_io.py
sample_from_cluster(results_dir, cluster_id, n=100, random_state=None)
¶
Get a random sample of molecules from a cluster.
Useful for previewing cluster contents without loading all molecules.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `results_dir` | `Union[str, Path]` | Path to directory containing chunked parquet files. | required |
| `cluster_id` | `int` | The cluster ID to sample from. | required |
| `n` | `int` | Number of molecules to sample. | `100` |
| `random_state` | `Optional[int]` | Random seed for reproducibility. | `None` |
Returns:

| Type | Description |
|---|---|
| `pd.DataFrame` | DataFrame with sampled molecules. |
Example

```python
sample = sample_from_cluster('results/', cluster_id=42, n=10)
print(sample['smiles'].tolist())
```
Source code in chelombus/utils/cluster_io.py
Helper Functions¶
chelombus.utils.helper_functions
¶
format_time(seconds)
¶
Simple helper to transform seconds into a readable format.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `seconds` | `int` | Seconds to format. | required |
Returns:

| Name | Type | Description |
|---|---|---|
| `str` | | Formatted string: (int) h, (int) min, (float) seconds. |
Source code in chelombus/utils/helper_functions.py
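The described behaviour can be sketched with `divmod`. This is a hypothetical re-implementation matching the documented return shape; the exact output string of the real helper may differ.

```python
def format_time_sketch(seconds):
    # Split a duration into whole hours, whole minutes, and leftover seconds.
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h)} h, {int(m)} min, {s:.2f} s"

print(format_time_sketch(3725.5))  # 1 h, 2 min, 5.50 s
```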
save_chunk(fp_chunk, output_dir, chunk_index, file_format='npy', name='fingerprints_chunk', **kwargs)
¶
Save a chunk of fingerprint data to a file in the specified format.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `fp_chunk` | `ndarray` | The fingerprint array chunk (each row corresponds to a fingerprint). | required |
| `output_dir` | `str` | Directory path where the fingerprint chunk will be saved. | required |
| `chunk_index` | `int` | The index number for the current chunk, used to generate a unique file name. | required |
| `file_format` | `str` | Format to save the data: 'npy' for a NumPy binary file or 'parquet' for an Apache Parquet file. Default is 'npy'. | `'npy'` |
| `name` | `str` | Name prefix for each file to be generated. | `'fingerprints_chunk'` |
| `**kwargs` | | Additional keyword arguments to pass to the underlying saving function. | `{}` |
Returns:

| Name | Type | Description |
|---|---|---|
| `str` | `str` | The full path of the saved file. |
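The unique-filename scheme can be sketched as follows. Both the helper name `chunk_filename` and the exact `<name>_<index>.<format>` pattern are assumptions for illustration; the real `save_chunk` also writes the array data and may pad or format the index differently.

```python
import os

def chunk_filename(output_dir, chunk_index, name="fingerprints_chunk",
                   file_format="npy"):
    # Hypothetical naming convention: <name>_<index>.<format>
    if file_format not in ("npy", "parquet"):
        raise ValueError(f"Unsupported format: {file_format}")
    return os.path.join(output_dir, f"{name}_{chunk_index}.{file_format}")

path = chunk_filename("out", 3)  # e.g. 'out/fingerprints_chunk_3.npy'
```

Validating the format up front mirrors the documented restriction to 'npy' and 'parquet' outputs.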