Active Stacking-Deep Learning with Strategic Sampling for Small and Imbalanced Chemical Toxicity Prediction

Darlene Nabila Zetta†, Watshara Shoombuatong‡, and Tarapong Srisongkram*

†Graduate School in the Program of Pharmaceutical Sciences, Faculty of Pharmaceutical Sciences, Khon Kaen University, Khon Kaen, 40002, Thailand. (darlenenabilazetta.d@kkumail.com)

‡Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand. (watshara.sho@mahidol.ac.th)

*Division of Pharmaceutical Chemistry, Faculty of Pharmaceutical Sciences, Khon Kaen University, Khon Kaen, 40002, Thailand. (tarasri@kku.ac.th)

Full paper submitted to ACS Omega.

Overview

This repository implements Active Stacking-Deep Learning with Strategic Sampling for Small and Imbalanced Chemical Toxicity Prediction. The pipeline includes:

Data preprocessing
Feature extraction
Model training
Performance evaluation

Requirements

Create a virtual environment with Python 3.11.

Install dependencies from requirements.txt:

pip install -r requirements.txt

Data Preparation

The data/ directory is organized as follows:

data/
├── train/ # Training data
├── test/ # Testing data
├── subsets/ # Selected initial subsets
└── pool/ # Remaining unlabeled compounds
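Before preprocessing, it can help to confirm that the folders listed above exist. The following is a minimal sketch using only the standard library; the function name `missing_data_dirs` and the default root `data` are illustrative assumptions, not part of the repository.

```python
from pathlib import Path

# Subdirectory names taken from the layout listed above.
EXPECTED_SUBDIRS = ("train", "test", "subsets", "pool")


def missing_data_dirs(root="data"):
    """Return the expected subdirectories that are absent under `root`."""
    root_path = Path(root)
    return [name for name in EXPECTED_SUBDIRS if not (root_path / name).is_dir()]


if __name__ == "__main__":
    missing = missing_data_dirs()
    if missing:
        print("Missing data folders:", ", ".join(missing))
    else:
        print("Data layout looks complete.")
```

An empty list means all four folders are present.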

To preprocess the data, run:

python preprocess.py

Feature Extraction

The twelve molecular fingerprints are extracted with the following script:

python calculate_fp.py

This script is supported by the fingerprints_xml/ folder, which contains the necessary fingerprint definitions.

Training and Evaluating the Model

The training and evaluation process using the processed data includes the following steps:

Divide the subset data for sampling

Run:

python sampling_divide.py

This generates multiple k-ratio subset samplings and saves them in a single folder.
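The README does not define the k-ratio. The sketch below assumes it means keeping every minority-class compound and drawing k times as many majority-class compounds, once per ratio. The function name, the default ratios, the seed, and the convention that label 1 is the minority class are all hypothetical, and the script above may work differently.

```python
import random


def k_ratio_subsets(labels, ratios=(1, 2, 3), seed=0):
    """Map each ratio k to sorted row indices: all minority rows plus k * n_minority majority rows.

    Assumption: label 1 marks the minority class and label 0 the majority class.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    subsets = {}
    for k in ratios:
        # Never request more majority rows than are available.
        n_major = min(len(majority), k * len(minority))
        subsets[k] = sorted(minority + rng.sample(majority, n_major))
    return subsets
```

With 3 minority and 12 majority compounds, ratios 1, 2, and 3 yield subsets of 6, 9, and 12 compounds, each containing all 3 minority compounds.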

Train models on subset samplings + OOF predictions

Run:

python train_meta_sampling.py
python train_meta_sampling_oof.py

These scripts train models on each subset sampling and generate the out-of-fold (OOF) predictions.
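An out-of-fold prediction for a compound comes from a model that never saw that compound during training. The sketch below illustrates only this bookkeeping: the interleaved fold assignment and the toy base-rate "model" are assumptions chosen to keep the example dependency-free, and the repository's scripts presumably fit real learners on the fingerprints instead.

```python
def out_of_fold_predictions(labels, n_folds=5):
    """Return one held-out prediction per sample using a toy base-rate 'model'.

    Requires n_folds >= 2 and more samples than folds so every training split is non-empty.
    """
    n = len(labels)
    oof = [None] * n
    for fold in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == fold]
        train = [i for i in range(n) if i % n_folds != fold]
        # Placeholder model: predict the positive rate seen in the training folds.
        positive_rate = sum(labels[i] for i in train) / len(train)
        for i in held_out:
            oof[i] = positive_rate
    return oof
```

Every sample receives exactly one prediction, and that prediction is computed without the sample's own label.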

Train the stacking ensemble and evaluate

Run one of the following, depending on the desired model:

python train_average_probability_att.py
python train_average_probability_bilstm.py
python train_average_probability_cnn.py

Each script trains a stacking ensemble on the subsets (attention-, BiLSTM-, or CNN-based, matching the script suffix). The predictions are averaged and the evaluation results are saved as a CSV file.
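Averaging the predictions can be pictured as taking, for each compound, the mean of the positive-class probabilities produced by the models trained on the different subsets. This is a minimal sketch: the function names and the 0.5 decision threshold are assumptions, not values taken from the repository.

```python
def average_probabilities(per_model_probs):
    """Average positive-class probabilities across models, compound by compound.

    `per_model_probs` holds one list per model, all of the same length.
    """
    n_models = len(per_model_probs)
    return [sum(column) / n_models for column in zip(*per_model_probs)]


def to_labels(probs, threshold=0.5):
    """Assumption: a compound is labelled toxic (1) when its mean probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probs]
```

For example, two models predicting `[0.75, 0.25]` and `[0.25, 0.25]` average to `[0.5, 0.25]`, which maps to labels `[1, 0]` at the assumed threshold.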

Split the pool data set

Run:

python pool_split.py

This script separates the remaining pool data into a new folder for the next iteration.

Predict pool data set

Run:

python pool_pred_sampling.py
python pool_pred_average.py

These scripts predict the remaining pool data.

Apply active learning selection strategies to the pool data

You can run one of the following, depending on the desired strategy:

python entropy_cal.py
python margin_cal.py
python uncertainty_cal.py

These scripts select new compounds from the pool based on entropy, margin, or uncertainty, and generate updated subset and pool files.
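The three strategies can be sketched for binary classification as scores computed from the predicted probability p of the positive class. Reading "uncertainty" as least-confidence sampling is an assumption, as are the function names; the repository's scripts may define the scores differently.

```python
import math


def entropy_score(p):
    """Binary Shannon entropy in nats; highest at p = 0.5."""
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)


def margin_score(p):
    """Gap between the two class probabilities; a small gap means an uncertain prediction."""
    return abs(p - (1.0 - p))


def least_confidence_score(p):
    """One minus the probability of the predicted class; highest at p = 0.5."""
    return 1.0 - max(p, 1.0 - p)


def select_most_uncertain(probs, n_select, strategy="entropy"):
    """Return indices of the n_select pool compounds that the strategy ranks as most uncertain."""
    if strategy == "entropy":
        key = lambda i: -entropy_score(probs[i])  # larger entropy first
    elif strategy == "margin":
        key = lambda i: margin_score(probs[i])  # smaller margin first
    elif strategy == "uncertainty":
        key = lambda i: -least_confidence_score(probs[i])  # larger score first
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return sorted(range(len(probs)), key=key)[:n_select]
```

In the binary case all three scores rank compounds by how close p is to 0.5, so they select the same compounds; they can diverge once there are more than two classes.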

Repeat the steps above for each active learning iteration until the desired number of compounds or performance is achieved.
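One iteration of the steps above can be collected into an ordered command list. This is only a sketch: the helper `iteration_commands` is hypothetical, and it uses `sampling_divide.py` for the division step because that is the file name the repository ships. The helper builds the commands without running anything.

```python
import sys

MODELS = ("att", "bilstm", "cnn")
STRATEGIES = ("entropy", "margin", "uncertainty")


def iteration_commands(model="cnn", strategy="entropy", python=sys.executable):
    """List the commands for one active learning iteration, in the order described above."""
    if model not in MODELS:
        raise ValueError(f"model must be one of {MODELS}")
    if strategy not in STRATEGIES:
        raise ValueError(f"strategy must be one of {STRATEGIES}")
    scripts = [
        "sampling_divide.py",                     # divide the subset data for sampling
        "train_meta_sampling.py",                 # train models on each subset sampling
        "train_meta_sampling_oof.py",             # out-of-fold predictions
        f"train_average_probability_{model}.py",  # stacking ensemble and evaluation
        "pool_split.py",                          # split the pool data set
        "pool_pred_sampling.py",                  # predict the pool data
        "pool_pred_average.py",
        f"{strategy}_cal.py",                     # active learning selection
    ]
    return [[python, script] for script in scripts]
```

Each returned command could then be passed to `subprocess.run(cmd, check=True)` from the repository root, repeating the list until the stopping condition is met.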

Reproducing Results

To reproduce the results reported in the paper:

Follow the requirements and data preprocessing steps.
Run the training and evaluation scripts in sequence as described above.
The outputs and evaluation results will be saved in the specified folders.

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Repository: taraponglab/meta-activelearning
