THEMAP

Task Hardness Estimation for Molecular Activity Prediction

A Python library for calculating distances between chemical datasets to enable intelligent dataset selection for molecular activity prediction tasks.

Overview

THEMAP computes distances between chemical datasets and uses them to guide dataset selection for molecular activity prediction. Typical uses:

Transfer Learning: Identify the most relevant source datasets for your target prediction task
Domain Adaptation: Measure dataset similarity to guide model adaptation strategies
Task Hardness Assessment: Quantify how difficult a prediction task will be based on dataset characteristics
Dataset Curation: Select optimal training datasets from large chemical databases like ChEMBL

Installation

Quick Start (Recommended)

The easiest way to install THEMAP with all features:

git clone https://github.com/HFooladi/THEMAP.git
cd THEMAP
source env.sh

This automatically:

Installs uv (fast Python package manager) if needed
Creates a virtual environment in .venv
Installs all dependencies
Activates the environment

After installation, try an example:

python examples/basic/molecule_datasets_demo.py

To reactivate the environment later:

source .venv/bin/activate

Manual Installation

For more control, install with pip. The editable (-e) variants must be run from the root of a cloned repository:

pip install themap                # Basic installation from PyPI
pip install -e ".[all]"           # Full installation (editable)
pip install -e ".[protein]"       # Protein analysis only
pip install -e ".[otdd]"          # Optimal transport only
pip install -e ".[dev,test]"      # Development + testing

Conda Alternative

For GPU support with specific CUDA versions:

conda env create -f environment.yml
conda activate themap
pip install -e . --no-deps

Prerequisites

Python 3.10 or higher
For GPU features: CUDA-compatible GPU and drivers

Quick Start

Basic Dataset Analysis

import os
from dpu_utils.utils.richpath import RichPath
from themap.data.molecule_dataset import MoleculeDataset

# Load datasets
source_dataset_path = RichPath.create(os.path.join("datasets", "train", "CHEMBL1023359.jsonl.gz"))
source_dataset = MoleculeDataset.load_from_file(source_dataset_path)

# Basic dataset analysis (works with minimal installation)
print(f"Dataset size: {len(source_dataset)}")
print(f"Positive ratio: {source_dataset.get_ratio}")
print(f"Dataset statistics: {source_dataset.get_statistics()}")

# Validate dataset integrity
try:
    source_dataset.validate_dataset_integrity()
    print("✅ Dataset is valid")
except ValueError as e:
    print(f"❌ Dataset validation failed: {e}")
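
To make the "positive ratio" figure concrete, here is a minimal standard-library sketch that computes it directly from a gzipped JSONL file. It does not use THEMAP's own loader. The helper name positive_ratio and the label field "Property" are assumptions for illustration; adjust the key to match your files.

```python
import gzip
import json
from pathlib import Path


def positive_ratio(path: Path, label_key: str = "Property") -> float:
    """Return the fraction of records whose binary label equals 1.

    Assumes one JSON object per line, with the label stored under
    `label_key` (an assumed field name, not confirmed by this README).
    """
    n_total = 0
    n_positive = 0
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            n_total += 1
            # Labels may be stored as "1.0", 1, or 1.0; normalise via float
            if float(record[label_key]) == 1.0:
                n_positive += 1
    if n_total == 0:
        raise ValueError(f"No records found in {path}")
    return n_positive / n_total
```

A strongly imbalanced ratio is worth knowing before interpreting any downstream distance or hardness score.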

Molecular Embeddings

# Only works with pip install -e ".[ml]" or higher
import os
from dpu_utils.utils.richpath import RichPath
from themap.data.molecule_dataset import MoleculeDataset

dataset_path = RichPath.create(os.path.join("datasets", "train", "CHEMBL1023359.jsonl.gz"))

# Load dataset
dataset = MoleculeDataset.load_from_file(dataset_path)

# Calculate molecular embeddings (requires ML dependencies)
try:
    features = dataset.get_features("ecfp")
    print(f"Features shape: {features.shape}")
except ImportError:
    print("❌ ML dependencies not installed. Use: pip install -e '.[ml]'")
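
The try/except pattern above reports a missing extra only once feature computation is attempted. As a sketch, the same check can be done up front with the standard library. The helper missing_modules is hypothetical, and "rdkit" is only an assumed example of a module that an extra might provide.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# "rdkit" is an assumed example, not a confirmed THEMAP requirement
missing = missing_modules(["rdkit"])
if missing:
    print(f"Missing optional modules: {', '.join(missing)}")
```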

Distance Calculation

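# Only works with pip install -e ".[all]"
import os
from dpu_utils.utils.richpath import RichPath
from themap.data.molecule_dataset import MoleculeDataset
from themap.data.tasks import Tasks, Task
from themap.distance import MoleculeDatasetDistance, ProteinDatasetDistance, TaskDistance

# Load the source and target datasets and wrap each one in a Task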
source_dataset_path = RichPath.create(os.path.join("datasets", "train", "CHEMBL1023359.jsonl.gz"))
source_dataset = MoleculeDataset.load_from_file(source_dataset_path)
target_dataset_path = RichPath.create(os.path.join("datasets", "test", "CHEMBL2219358.jsonl.gz"))
target_dataset = MoleculeDataset.load_from_file(target_dataset_path)
source_task = Task(task_id="CHEMBL1023359", molecule_dataset=source_dataset)
target_task = Task(task_id="CHEMBL2219358", molecule_dataset=target_dataset)

# Step 1: Create Tasks collection with train/test split
tasks = Tasks(train_tasks=[source_task], test_tasks=[target_task])

# Step 2: Compute molecule distance with method-specific configuration
try:
    # Select the distance method used for the molecule data
    mol_dist = MoleculeDatasetDistance(
        tasks=tasks,
        molecule_method="otdd",     # OTDD for molecules
    )
    mol_dist._compute_features()
    distance = mol_dist.get_distance()
    print(distance)

except ImportError:
    print("❌ Distance calculation dependencies not installed. Use: pip install -e '.[all]'")
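
This README does not show the shape of the value returned by get_distance(). Assuming it is a nested mapping of the form {target_id: {source_id: distance}}, the sketch below ranks source tasks for one target. The helper rank_sources and the toy numbers are illustrative only, not THEMAP output.

```python
from typing import Dict, List, Tuple

# Assumed layout: {target_id: {source_id: distance}}
DistanceMap = Dict[str, Dict[str, float]]


def rank_sources(distances: DistanceMap, target_id: str) -> List[Tuple[str, float]]:
    """Return (source_id, distance) pairs for one target, closest first."""
    per_source = distances[target_id]
    return sorted(per_source.items(), key=lambda item: item[1])


# Toy values for illustration
toy = {"CHEMBL2219358": {"CHEMBL1023359": 7.1, "CHEMBL1243967": 3.4}}
ranking = rank_sources(toy, "CHEMBL2219358")
print(ranking[0])
```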

Usage Examples

Transfer Learning Dataset Selection

# Find the most similar training datasets for your target task.
# calculate_all_distances is a placeholder for your own helper that returns
# {dataset_id: distance}; it is not defined in this README.
candidate_datasets = ["CHEMBL1023359", "CHEMBL2219358", "CHEMBL1243967"]
target_dataset = "my_target_assay"

distances = calculate_all_distances(candidate_datasets, target_dataset)
best_source = min(distances, key=distances.get)  # Closest dataset for transfer learning
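
The selection step can be run end to end with a simple stand-in distance. The sketch below uses the Euclidean distance between dataset centroids (mean feature vectors), which is a deliberately crude substitute for OTDD and not the method THEMAP implements. The helper names and toy feature values are hypothetical.

```python
import math
from typing import Dict, List, Sequence

FeatureMatrix = Sequence[Sequence[float]]


def centroid(features: FeatureMatrix) -> List[float]:
    """Mean feature vector of a dataset (one row per molecule)."""
    n = len(features)
    if n == 0:
        raise ValueError("dataset has no molecules")
    return [sum(col) / n for col in zip(*features)]


def centroid_distances(
    candidates: Dict[str, FeatureMatrix], target: FeatureMatrix
) -> Dict[str, float]:
    """Euclidean distance from each candidate centroid to the target centroid."""
    target_center = centroid(target)
    return {
        name: math.dist(centroid(features), target_center)
        for name, features in candidates.items()
    }


# Toy two-dimensional features, for illustration only
candidates = {
    "CHEMBL1023359": [[0.0, 0.0], [2.0, 0.0]],
    "CHEMBL1243967": [[5.0, 5.0], [7.0, 5.0]],
}
target = [[1.0, 1.0], [1.0, 1.0]]

distances = centroid_distances(candidates, target)
best_source = min(distances, key=distances.get)
print(best_source)
```

Centroid distance ignores labels and the spread of each dataset, so it is only useful as a quick baseline.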

Domain Adaptation Assessment

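# Assess how much domain shift exists between datasets.
# calculate_dataset_distance, source_domain, target_domain and threshold are
# placeholders to be supplied by you; they are not defined in this README.
domain_gap = calculate_dataset_distance(source_domain, target_domain)
if domain_gap < threshold:
    print("Direct transfer likely to work well")
else:
    print("Domain adaptation strategies recommended")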
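
The snippet above leaves the threshold open. One possible heuristic, shown here as an assumption rather than a THEMAP recommendation, is to take the median of a set of reference distances (for example, distances between pairs of existing source datasets) as the cut-off. Both helper names are hypothetical.

```python
from statistics import median
from typing import Iterable


def pick_threshold(reference_distances: Iterable[float]) -> float:
    """Median of the reference distances, used as a heuristic cut-off."""
    values = list(reference_distances)
    if not values:
        raise ValueError("need at least one reference distance")
    return median(values)


def recommend(domain_gap: float, threshold: float) -> str:
    """Map a distance to the same two messages used in the snippet above."""
    if domain_gap < threshold:
        return "Direct transfer likely to work well"
    return "Domain adaptation strategies recommended"


# Toy reference distances, for illustration only
threshold = pick_threshold([2.0, 3.0, 10.0])
print(recommend(2.5, threshold))
```

Any such cut-off should be validated against actual transfer performance on held-out tasks.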

Task Hardness Prediction

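# Predict task difficulty based on dataset characteristics.
# estimate_task_hardness, dataset and reference_datasets are placeholders;
# they are not defined in this README.
hardness_score = estimate_task_hardness(dataset, reference_datasets)
print(f"Predicted task difficulty: {hardness_score}")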
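
A simple way to turn distances into a hardness score is to average the distances to the k closest reference datasets, on the assumption that a task far from everything seen before is harder. This is a sketch of that idea, not necessarily the estimator THEMAP uses. The helper knn_hardness, the default k=3, and the toy values are all hypothetical.

```python
from typing import Dict


def knn_hardness(distances_to_references: Dict[str, float], k: int = 3) -> float:
    """Mean distance to the k closest reference datasets (higher = harder)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not distances_to_references:
        raise ValueError("need at least one reference distance")
    nearest = sorted(distances_to_references.values())[:k]
    return sum(nearest) / len(nearest)


# Toy distances from one target task to four reference datasets
toy_distances = {"ref_a": 1.0, "ref_b": 2.0, "ref_c": 6.0, "ref_d": 9.0}
hardness_score = knn_hardness(toy_distances, k=3)
print(f"Predicted task difficulty: {hardness_score}")
```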

Reproducing FS-Mol Experiments

Pre-computed molecular embeddings and distance matrices for the FS-Mol dataset are available on Zenodo.

Setup

Download data from Zenodo
Extract to datasets/fsmol_hardness/
Run the provided Jupyter notebooks in the notebooks/ directory

Documentation

Full documentation is available at themap.readthedocs.io or can be built locally:

mkdocs serve  # Serve locally at http://127.0.0.1:8000

Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Setup

git clone https://github.com/HFooladi/THEMAP.git
cd THEMAP
pip install -e ".[dev,test]"

Running Tests

pytest
pytest --cov=themap  # with coverage

Code Quality

ruff check && ruff format  # linting and formatting
mypy themap/               # type checking

Citation

If you use THEMAP in your research, please cite our paper:

@article{fooladi2024quantifying,
  title={Quantifying the hardness of bioactivity prediction tasks for transfer learning},
  author={Fooladi, Hosein and Hirte, Steffen and Kirchmair, Johannes},
  journal={Journal of Chemical Information and Modeling},
  volume={64},
  number={10},
  pages={4031-4046},
  year={2024},
  publisher={ACS Publications}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Support

Ready to optimize your chemical dataset selection for machine learning? Start with THEMAP today! 🚀
