Active Stacking-Deep Learning with Strategic Sampling for Small and Imbalanced Chemical Toxicity Prediction
†Graduate School in the Program of Pharmaceutical Sciences, Faculty of Pharmaceutical Sciences, Khon Kaen University, Khon Kaen, 40002, Thailand. (darlenenabilazetta.d@kkumail.com)
‡Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand. (watshara.sho@mahidol.ac.th)
*Division of Pharmaceutical Chemistry, Faculty of Pharmaceutical Sciences, Khon Kaen University, Khon Kaen, 40002, Thailand. (tarasri@kku.ac.th)
Full paper submitted to ACS Omega.
- Overview
- Requirements
- Data Preparation
- Feature Extraction
- Training and Evaluating the Model
- Reproducing Results
- MIT License
This repository implements Active Stacking-Deep Learning with Strategic Sampling for Small and Imbalanced Chemical Toxicity Prediction. The pipeline includes:
- Data preprocessing
- Feature extraction
- Model training
- Performance evaluation
- Create a virtual environment with Python 3.11.
- Install dependencies from requirements.txt:
pip install -r requirements.txt
The data/ directory is organized as follows:
data/
├── train/ # Training data
├── test/ # Testing data
├── subsets/ # Selected initial subsets
└── pool/ # Remaining unlabeled compounds
To preprocess the data, run:
python preprocess.py
The twelve molecular fingerprints are extracted with the following script:
python calculate_fp.py
This script is supported by the fingerprints_xml/ folder, which contains the necessary fingerprint definitions.
The training and evaluation process using the processed data includes the following steps:
- Divide the subset data for sampling
Run:
python divide_sampling.py
This will generate multiple k-ratio subset samplings saved in one folder.
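As an illustration of what a k-ratio subset sampling can look like, the sketch below builds each subset from all minority-class compounds plus k times as many randomly drawn majority-class compounds. The sampling rule is an assumption made for illustration and is not taken from divide_sampling.py; the function name make_k_ratio_subsets and all of its parameters are hypothetical.

```python
import random


def make_k_ratio_subsets(minority, majority, k, n_subsets, seed=0):
    """Build subsets holding all minority items plus k * len(minority) majority items.

    Hypothetical helper: the sampling rule is an assumption, not the logic
    of divide_sampling.py.
    """
    rng = random.Random(seed)
    # Never ask for more majority items than exist.
    n_major = min(len(majority), k * len(minority))
    subsets = []
    for _ in range(n_subsets):
        # Draw majority items without replacement within a single subset.
        drawn = rng.sample(majority, n_major)
        subsets.append(list(minority) + drawn)
    return subsets


if __name__ == "__main__":
    toxic = ["tox_%d" % i for i in range(3)]        # minority class (toy IDs)
    non_toxic = ["non_%d" % i for i in range(20)]   # majority class (toy IDs)
    for subset in make_k_ratio_subsets(toxic, non_toxic, k=2, n_subsets=3):
        print(len(subset), subset)
```

Keeping every minority compound in each subset while resampling the majority class is one common way to expose each base model to a more balanced class ratio.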
- Train models on subset samplings and generate out-of-fold (OOF) predictions
Run:
python train_meta_sampling.py
python train_meta_sampling_oof.py
These scripts train models on each subset sampling and produce the out-of-fold (OOF) predictions.
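For readers unfamiliar with out-of-fold predictions, the following minimal sketch shows the general idea: every sample is scored by a model that was fitted without it. It is not the code of train_meta_sampling_oof.py. The interleaved fold assignment, the helper names, and the toy "mean" model are all assumptions chosen to keep the example self-contained.

```python
def out_of_fold_predictions(X, y, fit, predict, n_folds=5):
    """Return one prediction per sample, each made by a model that never saw it."""
    n = len(X)
    oof = [None] * n
    for fold in range(n_folds):
        # Sample i is held out in fold (i % n_folds); this is a simple,
        # non-stratified assignment used only for illustration.
        held_out = [i for i in range(n) if i % n_folds == fold]
        kept = [i for i in range(n) if i % n_folds != fold]
        model = fit([X[i] for i in kept], [y[i] for i in kept])
        for i in held_out:
            oof[i] = predict(model, X[i])
    return oof


def fit_mean(X_train, y_train):
    # Toy "model": the positive-class rate of the training fold.
    return sum(y_train) / len(y_train)


def predict_mean(model, x):
    # The toy model ignores the features and returns the stored rate.
    return model


if __name__ == "__main__":
    features = list(range(6))
    labels = [1, 0, 0, 0, 0, 0]
    print(out_of_fold_predictions(features, labels, fit_mean, predict_mean, n_folds=3))
```

Because each OOF prediction comes from a model that did not train on that compound, the predictions can feed a second-stage (stacking) model without leaking the training labels.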
- Train the stacking ensemble and evaluate
You can run one of the following, depending on the desired model:
python train_average_probability_att.py
python train_average_probability_bilstm.py
python train_average_probability_cnn.py
Each script trains the corresponding stacking ensemble (attention-, BiLSTM-, or CNN-based) on the subsets. The predictions are averaged across subsets, and the evaluation results are saved as a CSV file.
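The averaging step can be pictured with the short sketch below, which takes the positive-class probabilities produced by several models and averages them per compound. The function names, the input layout (one list of probabilities per model), and the 0.5 decision threshold are assumptions for illustration, not details confirmed by the training scripts.

```python
def average_probabilities(per_model_probs):
    """Average positive-class probabilities across models, compound by compound."""
    n_models = len(per_model_probs)
    n_compounds = len(per_model_probs[0])
    if any(len(p) != n_compounds for p in per_model_probs):
        raise ValueError("every model must score the same compounds")
    return [
        sum(p[i] for p in per_model_probs) / n_models
        for i in range(n_compounds)
    ]


def to_labels(probs, threshold=0.5):
    # The 0.5 threshold is an assumed default, not a value from the paper.
    return [1 if p >= threshold else 0 for p in probs]


if __name__ == "__main__":
    # Toy probabilities from three hypothetical subset models for four compounds.
    probs = [
        [0.9, 0.2, 0.6, 0.4],
        [0.8, 0.1, 0.4, 0.5],
        [0.7, 0.3, 0.7, 0.3],
    ]
    averaged = average_probabilities(probs)
    print(averaged)
    print(to_labels(averaged))
```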
- Split the pool data set
Run:
python pool_split.py
This script separates the remaining pool data into a new folder for the next iteration.
- Predict the pool data set
Run:
python pool_pred_sampling.py
python pool_pred_average.py
These scripts predict the remaining pool data.
- Apply active learning selection strategies to the pool data
You can run one of the following, depending on the desired strategy:
python entropy_cal.py
python margin_cal.py
python uncertainty_cal.py
These scripts select new compounds from the pool based on entropy, margin, or uncertainty, and generate updated subset and pool files.
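The three selection criteria can be sketched as scoring functions over the predicted class probabilities of each pool compound. This is a minimal illustration of the standard definitions, not the code of the scripts above. In particular, reading "uncertainty" as least-confidence sampling (one minus the highest class probability) is an assumption, and the function names are hypothetical.

```python
import math


def entropy_score(probs):
    """Shannon entropy of the predicted distribution; higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def margin_score(probs):
    """Gap between the two most likely classes; smaller means more uncertain."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def least_confidence_score(probs):
    """One minus the top probability; higher means more uncertain (assumed definition)."""
    return 1.0 - max(probs)


def select_most_informative(pool_probs, n_select, strategy="entropy"):
    """Return indices of the n_select most uncertain pool compounds."""
    if strategy == "entropy":
        key = lambda i: -entropy_score(pool_probs[i])
    elif strategy == "margin":
        key = lambda i: margin_score(pool_probs[i])
    elif strategy == "uncertainty":
        key = lambda i: -least_confidence_score(pool_probs[i])
    else:
        raise ValueError("unknown strategy: %s" % strategy)
    ranked = sorted(range(len(pool_probs)), key=key)
    return ranked[:n_select]


if __name__ == "__main__":
    pool = [[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]]  # toy [non-toxic, toxic] probabilities
    for name in ("entropy", "margin", "uncertainty"):
        print(name, select_most_informative(pool, 2, strategy=name))
```

For a two-class problem the three criteria rank compounds identically; they can diverge only when there are three or more classes.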
Repeat the steps above for each active learning iteration until the desired number of compounds or performance is achieved.
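One way to keep the order of the steps straight across iterations is a small driver that lists the scripts for a chosen model and strategy. The sketch below is a convenience wrapper, not part of the repository. It assumes the scripts are run from the repository root and take no command-line arguments, and the names iteration_scripts and run_iteration are hypothetical.

```python
import subprocess
import sys

MODEL_SCRIPTS = {
    "att": "train_average_probability_att.py",
    "bilstm": "train_average_probability_bilstm.py",
    "cnn": "train_average_probability_cnn.py",
}

STRATEGY_SCRIPTS = {
    "entropy": "entropy_cal.py",
    "margin": "margin_cal.py",
    "uncertainty": "uncertainty_cal.py",
}


def iteration_scripts(model="cnn", strategy="entropy"):
    """Return the scripts for one active learning iteration, in README order."""
    return [
        "divide_sampling.py",
        "train_meta_sampling.py",
        "train_meta_sampling_oof.py",
        MODEL_SCRIPTS[model],
        "pool_split.py",
        "pool_pred_sampling.py",
        "pool_pred_average.py",
        STRATEGY_SCRIPTS[strategy],
    ]


def run_iteration(model="cnn", strategy="entropy", dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for script in iteration_scripts(model, strategy):
        command = [sys.executable, script]
        print(" ".join(command))
        if not dry_run:
            # Assumption: each script runs without arguments from the repo root.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_iteration(model="cnn", strategy="entropy", dry_run=True)
```

With dry_run=True the driver only prints the commands, which makes it easy to check the sequence before launching a full iteration.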
To reproduce the results reported in the paper:
- Follow the requirements and data preprocessing steps.
- Run the training and evaluation scripts in sequence as described above.
- The outputs and evaluation results will be saved in the specified folders.
Copyright (c) [2025] [Dr.Tarapong Srisongram]
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
