First pass at modular DataPipeline #1
base: main
PiperOrigin-RevId: 388172326 Change-Id: I11d9e498c226cd752947feb51b7d1eb343b4d7ab
PiperOrigin-RevId: 388195957 Change-Id: Id22b96c964a82e3450a5f457d8facd1e85b1e86a
PiperOrigin-RevId: 390355132 Change-Id: I920a80db674541af41ee83ef3d5bd5255c782ee9
PiperOrigin-RevId: 390553964 Change-Id: Ia5bc6e12ab3f1a7827b7d914c93f6990a1139780
PiperOrigin-RevId: 390566020 Change-Id: I3fafbe8246d0a5ad018f0398b39bf7dacee00468
## Usage

### Clone Ethan's Fork

```bash
git clone git@github.com:eho-tacc/alphafold.git af2-eho-fork
cd af2-eho-fork
```

### Running Entire DataPipeline

```bash
singularity exec /scratch/projects/tacc/bio/alphafold/images/alphafold_2.0.0.sif python3 run_alphafold.py \
  --flagfile=/scratch/projects/tacc/bio/alphafold/test-container/flags/reduced_dbs.ff \
  --fasta_paths=/scratch/projects/tacc/bio/alphafold/test-container/input/sample.fasta \
  --output_dir=$SCRATCH/af2_reduced \
  --model_names=model_1
```

### Running One Step of the DataPipeline

On S2 idev:

```bash
# Install the AF2 console script
singularity exec /scratch/projects/tacc/bio/alphafold/images/alphafold_2.0.0.sif bash -c 'python3 -m pip install -q --user . && $HOME/.local/bin/af2 jackhmmer --help'

# Run only the jackhmmer_uniref90 step
singularity exec /scratch/projects/tacc/bio/alphafold/images/alphafold_2.0.0.sif $HOME/.local/bin/af2 jackhmmer \
  --input-fasta-path /scratch/projects/tacc/bio/alphafold/test-container/input/sample.fasta \
  --jackhmmer-binary-path /usr/bin/jackhmmer \
  --database-path /scratch/projects/tacc/bio/alphafold/data/uniref90/uniref90.fasta \
  --output-dir $PWD
```

### Output Caching

Very little data (on the order of megabytes) is transferred between steps of the DataPipeline. This means that we can cheaply cache the outputs of expensive steps. Each step of the DataPipeline caches its results after it runs, and this fork logs this behavior verbosely.

To change the location of the cache, set the environment variable `AF2_CACHE_DIR`:

```bash
AF2_CACHE_DIR="/tmp"
```

To disable reading from the cache, forcing every step to rerun, set the environment variable `AF2_SKIP_PCKL_CACHE`:

```bash
export AF2_SKIP_PCKL_CACHE=1
```
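As a rough illustration of how this kind of caching can work, here is a minimal sketch of a decorator that pickles a function's result under a hash of its arguments and honors the two environment variables named above. It is not the fork's actual code: the name `cache_to_pickle`, the file naming scheme, the default cache directory, and the rule that any non-empty `AF2_SKIP_PCKL_CACHE` value disables reads are all assumptions made for the example, and `slow_step` is a hypothetical stand-in for an expensive pipeline step.

```python
import functools
import hashlib
import os
import pickle
import tempfile
from pathlib import Path


def cache_to_pickle(func):
    """Cache func's return value in a pickle file keyed on its hashed arguments."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Read the environment at call time. The default directory is an assumption.
        cache_dir = Path(os.environ.get("AF2_CACHE_DIR", tempfile.gettempdir()))
        # Assumption: any non-empty value disables reading from the cache.
        skip_read = bool(os.environ.get("AF2_SKIP_PCKL_CACHE"))

        # Hash the function name plus a stable representation of the arguments.
        key_source = repr((func.__name__, args, sorted(kwargs.items())))
        digest = hashlib.sha256(key_source.encode("utf-8")).hexdigest()
        cache_file = cache_dir / f"{func.__name__}_{digest}.pkl"

        if cache_file.is_file() and not skip_read:
            with cache_file.open("rb") as handle:
                return pickle.load(handle)

        # Cache miss (or reads disabled): run the step and store its result.
        result = func(*args, **kwargs)
        cache_dir.mkdir(parents=True, exist_ok=True)
        with cache_file.open("wb") as handle:
            pickle.dump(result, handle)
        return result

    return wrapper


calls = []  # records how often the hypothetical step really runs


@cache_to_pickle
def slow_step(sequence, database="uniref90"):
    """Hypothetical stand-in for an expensive pipeline step."""
    calls.append(sequence)
    return {"sequence": sequence, "database": database, "length": len(sequence)}
```

In this sketch, skipping the cache only disables reads: the step still writes a fresh pickle, so a later run without `AF2_SKIP_PCKL_CACHE` picks up the newest result. Hashing `repr` of the arguments only works for arguments whose `repr` is stable, such as paths and strings.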
## Testing

### Coverage on Systems

### Expected Behavior
- […] `alphafold.data.pipeline.DataPipeline` class, contained in a new class `alphafold.data.pipeline.ModularDataPipeline`.
- […] `DataPipeline` does), `ModularDataPipeline` makes calls to functions in the new module `alphafold.data.tools.cli`, which wrap construction and execution of these "runner" instances.
- […] `jackhmmer`, `hhsearch`, `hhblits`. All of these steps will attempt to cache results to a pickle file based on their hashed input (kw)args, using the new `alphafold.data.tools.cache_utils.cache_to_pckl` decorator. This allows the user to avoid repetitive calls to expensive operations such as `hhblits`.
- `alphafold.data.tools.cli` also provides a Click CLI subcommand for each pipeline step. These subcommands are installed under the `af2` `console_script` in the `setup.py`. For instance, to run only the `jackhmmer_uniref90` step of the pipeline, issue the following after installing the AlphaFold2 package:

  ```bash
  af2 jackhmmer \
    --input-fasta-path /scratch/projects/tacc/bio/alphafold/test-container/input/sample.fasta \
    --jackhmmer-binary-path /usr/bin/jackhmmer \
    --database-path /scratch/projects/tacc/bio/alphafold/data/uniref90/uniref90.fasta \
    --output-dir $PWD
  ```
- […] `ModularDataPipeline`.
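The idea of a pipeline that calls step functions, each of which wraps construction and execution of a "runner", could look roughly like the sketch below. Every name in it (`EchoRunner`, `run_search_step`, `SketchPipeline`) is hypothetical and chosen only for illustration; the parameter names loosely mirror the `af2 jackhmmer` flags, and the stand-in runner never calls a real binary. It is not the fork's implementation.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class EchoRunner:
    """Hypothetical stand-in for a tool runner; it does not call any binary."""

    binary_path: str
    database_path: str

    def query(self, input_fasta_path):
        # A real runner would launch the external tool here.
        return {
            "binary": self.binary_path,
            "database": self.database_path,
            "query": input_fasta_path,
        }


def run_search_step(input_fasta_path, binary_path, database_path, output_dir,
                    runner_cls=EchoRunner):
    """Construct a runner, execute it, and record the requested output directory."""
    runner = runner_cls(binary_path=binary_path, database_path=database_path)
    result = runner.query(input_fasta_path)
    result["output_dir"] = str(Path(output_dir))
    return result


class SketchPipeline:
    """Calls step functions instead of holding runner instances itself."""

    def __init__(self, steps):
        # Mapping of step name -> keyword arguments for run_search_step.
        self.steps = steps

    def process(self, input_fasta_path, output_dir):
        return {
            name: run_search_step(
                input_fasta_path=input_fasta_path,
                output_dir=output_dir,
                **step_kwargs,
            )
            for name, step_kwargs in self.steps.items()
        }
```

Because each step is a plain function of its arguments, the same function can back both the pipeline class and a command-line subcommand, and it is a natural place to attach a caching decorator.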