Explaining and avoiding failures modes in goal-directed generation

This code reproduces the results found in the paper "Explaining and avoiding failures modes in goal-directed generation". The paper builds on the work of Renz et al. ¹, which is available at: https://www.sciencedirect.com/science/article/pii/S1740674920300159

This code is a fork of the repository supporting ¹, which is available at: https://github.com/ml-jku/mgenerators-failure-modes. The main difference from the original codebase is the addition of new notebooks supporting our experiments. One other modification has been made: the call to the GPL-licensed library implementing the Levenshtein distance in addcarbon.py has been replaced by a function included in addcarbon.py.
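
For readers unfamiliar with the Levenshtein distance mentioned above, the following is a minimal sketch of the classic dynamic-programming formulation in pure Python. It is an illustration only and is not claimed to be the function that ships in addcarbon.py; the name `levenshtein` is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b.

    Illustrative sketch only: not necessarily the implementation in addcarbon.py.
    """
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (char_a != char_b)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]
```

Keeping only two rows of the distance table keeps memory use proportional to the shorter string, which is sufficient for comparing SMILES strings.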

We thank the authors of ¹ for both their very insightful work and their well-written, reproducible codebase.

¹ https://doi.org/10.1016/j.ddtec.2020.09.003

Summary of the results

Renz et al. highlighted (among many other interesting results on generative models for molecules) that in goal-directed generation, generated molecules can have high optimization scores while at the same time having low scores according to control models.

To explain those results, we looked at the agreement between the optimization model and the control models on the initial data distribution.

We then assessed whether this initial difference could explain the previous results.

The main conclusion of our work is that the underlying issue lies in the initial disagreement between optimization and control models, and not with the goal-directed generation algorithm.
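
As a rough illustration of what "agreement on the initial data distribution" can mean, the sketch below compares two lists of scores assigned to the same molecules. The function names and the choice of Pearson correlation and top-k averaging are assumptions made for this example; the notebooks in this repository may measure agreement differently.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def mean_control_of_top(optimization_scores, control_scores, top_k):
    """Mean control score of the top_k molecules ranked by optimization score."""
    ranked = sorted(
        zip(optimization_scores, control_scores),
        key=lambda pair: pair[0],
        reverse=True,
    )
    top = ranked[:top_k]
    return sum(control for _, control in top) / len(top)
```

If the mean control score of the highest-ranked dataset molecules is already well below their optimization scores, a gap between the two scores on generated molecules does not by itself point to a failure of the generator.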

Code

The instructions for installation are the same as described in https://github.com/ml-jku/mgenerators-failure-modes.

Install dependencies

pip install -r requirements.txt
conda install rdkit -c rdkit
wget https://raw.githubusercontent.com/jrwnter/cddd/master/download_default_model.sh -O- -q | bash

Download Guacamol data splits

The compounds are used for distribution learning and for starting populations for the graph-based genetic algorithm.

mkdir data
wget -O data/guacamol_v1_all.smiles https://ndownloader.figshare.com/files/13612745
wget -O data/guacamol_v1_test.smiles https://ndownloader.figshare.com/files/13612757
wget -O data/guacamol_v1_valid.smiles https://ndownloader.figshare.com/files/13612766
wget -O data/guacamol_v1_train.smiles https://ndownloader.figshare.com/files/13612760
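
After downloading, it can be useful to confirm that all four files are present and non-empty. The helper below is a hypothetical convenience sketch (the name `count_smiles` is not part of this repository); it only relies on the file names used in the commands above.

```python
from pathlib import Path

# Split names taken from the download commands above.
EXPECTED_SPLITS = ("all", "test", "valid", "train")


def count_smiles(data_dir="data"):
    """Return the number of non-blank lines in each Guacamol split file."""
    counts = {}
    for split in EXPECTED_SPLITS:
        path = Path(data_dir) / f"guacamol_v1_{split}.smiles"
        if not path.is_file():
            raise FileNotFoundError(f"missing split file: {path}")
        with path.open() as handle:
            counts[split] = sum(1 for line in handle if line.strip())
    return counts
```

Calling `count_smiles()` from the repository root raises `FileNotFoundError` for the first missing file, otherwise it returns a dictionary mapping each split name to its line count.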

Bioactivity data

The CSV files downloaded from ChEMBL are located in assays/raw. Running the preprocess.py script transforms the data into binary classification tasks and stores them in assays/processed.
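
To illustrate what turning assay data into a binary classification task involves, here is a minimal sketch. It does not reproduce preprocess.py: the column names `smiles` and `value`, the threshold of 6.0, and the convention that higher values mean "active" are all hypothetical choices made for this example.

```python
import csv
import io


def binarize_assay(csv_text, value_column="value", threshold=6.0):
    """Label each compound 1 (active) or 0 (inactive) from a numeric column.

    Column names, threshold and direction are hypothetical; check
    preprocess.py for the rules actually applied in this repository.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    labelled = []
    for row in reader:
        raw = (row.get(value_column) or "").strip()
        if not raw:
            continue  # skip compounds without a measured value
        label = int(float(raw) >= threshold)
        labelled.append((row["smiles"], label))
    return labelled
```

For example, `binarize_assay("smiles,value\nCCO,7.2\nCCN,4.0\n")` labels the first compound as active and the second as inactive under these assumed settings.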

Experiments

An alternative to running the experiments (which can take time) is to unzip the "results.zip" archive. Results from the paper can then be reproduced from there by running the different notebooks.
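
If no `unzip` command is available, the archive can also be extracted with Python's standard library. This is a small sketch; the function name `extract_results` and the default target directory are assumptions, and the layout inside the archive is not described here.

```python
import zipfile


def extract_results(archive="results.zip", target_dir="."):
    """Extract the archive into target_dir and return the sorted member names."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)
        return sorted(zf.namelist())
```

Running `extract_results()` from the repository root unpacks the archive in place, after which the notebooks can read the stored results.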

To reproduce the original results presented in "On failure modes in molecule generation and optimization":

python run_goal_directed.py --log_base results/original_start_chembl --nruns 10 --random_start

To run the same analysis while using the dataset as a starting point:

python run_goal_directed.py --log_base results/original_start_dataset --nruns 10

To run the experiments on the ALDH1 dataset and the JAK2 dataset with modified parameters for the predictive model:

python run_goal_directed.py --log_base results/new_datasets_start_from_chembl --nruns 10 --random_start --chids_set alternative --n_estimators 200 --min_samples_leaf 3
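
The three commands above can also be launched from a single script. The sketch below only reuses the flags and log directories shown in those commands; the helper names `build_command` and `run_all` are hypothetical and not part of the repository.

```python
import subprocess
import sys

# Extra flags per experiment, copied from the commands above.
EXPERIMENTS = {
    "original_start_chembl": ["--random_start"],
    "original_start_dataset": [],
    "new_datasets_start_from_chembl": [
        "--random_start",
        "--chids_set", "alternative",
        "--n_estimators", "200",
        "--min_samples_leaf", "3",
    ],
}


def build_command(name, nruns=10):
    """Build the argument list for one experiment."""
    return [
        sys.executable, "run_goal_directed.py",
        "--log_base", f"results/{name}",
        "--nruns", str(nruns),
        *EXPERIMENTS[name],
    ]


def run_all(dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for name in EXPERIMENTS:
        command = build_command(name)
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)
```

Calling `run_all()` prints the commands without executing them; `run_all(dry_run=False)` runs the experiments sequentially from the repository root and stops at the first failure.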

The results can then be analyzed with the following notebooks:

dataset_analysis.ipynb: Analyzes the relationship between optimization and control scores on the dataset distribution.
run_analysis.ipynb: Analyzes the experiments on the new datasets (ALDH1 and JAK2 with modified parameters).
tolerance_intervals.ipynb: Computes tolerance intervals for expected control scores and plots them alongside the actual control scores obtained during the experiments.
nn_analysis.ipynb: Analyzes whether high-scoring molecules in the dataset are already biased towards higher similarity with molecules from Split 1.
display_molecules.ipynb: Shows outliers from the DRD2 dataset and molecules generated during the different experiments.
