Questions about using custom MSAs #289

Closed
ntnn19 opened this issue Feb 6, 2025 · 3 comments
Labels
question Further information is requested

Comments

ntnn19 commented Feb 6, 2025

Hi,

Thank you very much for making your inference code publicly available.

My questions are as follows:

  1. The documentation calls setting both unpairedMsa and pairedMsa to a custom non-empty A3M string an "expert option". Why is it considered an expert option?

  2. The documentation recommends manual pairing or using the output of "appropriate software". Could you provide examples of such software or recommended best practices for manual pairing?

  3. The documentation recommends providing the paired MSA data using only the unpairedMsa field after manual pairing. This is counter-intuitive. Since the data is paired, why is it provided in the unpairedMsa field? Could the naming be clarified or the rationale explained more clearly?

  4. For multimer prediction, is it possible to provide a custom, non-paired MSA for each chain and let your pipeline do the pairing? If yes, which JSON key should hold these custom, non-paired MSAs?

I would appreciate your help on this.

ntnn19 changed the title from "Question about using custom MSAs" to "Questions about using custom MSAs" on Feb 6, 2025
Augustin-Zidek added the "question" (Further information is requested) label on Feb 6, 2025
@Augustin-Zidek
Collaborator

Hi!

Here are answers to your questions:

  1. The non-expert option is to let AlphaFold run the genetic search and pair the MSA as part of its data pipeline. Once you supply a custom MSA, you need to make sure many things are right: the format, the alignment, ordering by quality, deduplication, and so on. Moreover, for multimers, you have to make sure the pairing was done correctly. In other words, many things can be subtly wrong in the MSA and lead to worse prediction quality, and we want to highlight that to our users by calling it an expert option.
  2. See e.g. https://zhanggroup.org/cpxDeepMSA/, https://seq2fun.dcmb.med.umich.edu/DeepMSA2/, https://github.com/sokrypton/ColabFold, ... For manual pairing, put the sequences you want paired in the same row of the MSA (unpairedMsa) of each chain in the complex, and pad with gap-only sequences. Typically you want to pair sequences that come from the same organism or have some other biological relationship.
  3. Yes, the naming is unfortunately bad (sorry!). See MSA Pairing #257 for a detailed discussion.
  4. Yes, there are two options. The first is to provide the unpairedMsa for each chain; the chains will be paired row-wise in the order provided. The second is to provide pairedMsa for each chain and make sure the naming matches the UniProt naming scheme; AlphaFold will then pair using organism IDs. However, pairing should not matter that much in practice. I would simply provide a custom unpaired MSA for each chain and see what the prediction looks like. Only if the predictions are not good (i.e. low predicted accuracy metrics) would I tweak the MSA.
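As an illustration of the row-wise manual pairing described in item 2, here is a minimal Python sketch. It is not part of AlphaFold; the function names, the `>pair_N` headers, and the example sequences are hypothetical. It assumes each hit is already aligned to its chain's query in A3M convention (lowercase letters are insertions and do not count as alignment columns), and it fills in a gap-only row wherever a chain has no partner for a given row, so that row N of every chain's MSA refers to the same pairing.

```python
def gap_row(query: str) -> str:
    """Return a gap-only row with one gap per query residue."""
    return "-" * len(query)


def aligned_length(row: str) -> int:
    """Count alignment columns in an A3M row (lowercase letters are insertions)."""
    return sum(1 for ch in row if not ch.islower())


def build_row_paired_a3ms(queries: dict, rows: list) -> dict:
    """Build one A3M string per chain with hits paired by row index.

    queries: chain id -> query sequence (hypothetical ids such as "A", "B").
    rows:    list of dicts, each mapping chain id -> aligned hit for that row.
             A chain missing from a row gets a gap-only sequence instead.
    """
    a3ms = {}
    for chain_id, query in queries.items():
        lines = [">query", query]  # the query is the first entry of the MSA
        for i, row in enumerate(rows, start=1):
            hit = row.get(chain_id) or gap_row(query)
            if aligned_length(hit) != len(query):
                raise ValueError(
                    f"row {i} of chain {chain_id} is not aligned to the query"
                )
            lines += [f">pair_{i}", hit]
        a3ms[chain_id] = "\n".join(lines) + "\n"
    return a3ms


if __name__ == "__main__":
    queries = {"A": "MKTAYIA", "B": "GSHMLE"}
    rows = [
        {"A": "MKTAYLA", "B": "GSHMIE"},  # e.g. both hits from the same organism
        {"A": "MRTAYIA"},                 # no partner for chain B: padded with gaps
    ]
    for chain_id, a3m in build_row_paired_a3ms(queries, rows).items():
        print(f"chain {chain_id}:\n{a3m}")
```

Each resulting string would then go into the unpairedMsa field of the corresponding chain. How you decide which hits belong in the same row (same organism, known interaction, and so on) is up to you or the pairing software you use.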

The simplest option though is to let AlphaFold build the MSA using its built-in data pipeline.


GXcells commented Feb 7, 2025

So basically we can run the data pipeline on each protein separately and then supply the resulting unpaired MSAs when predicting a multimer complex? Will the pipeline then also generate the paired MSA for the given complex automatically? I am asking because recomputing the MSA every time is too time- and resource-consuming when the same protein is used in several different complexes.

@Augustin-Zidek
Collaborator

Yes indeed, you can run the AlphaFold 3 data pipeline for each of the monomer chains individually, then compose an input JSON that includes the MSAs for each multimer combination you want to predict. See #171, where this was discussed, for more details.
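A minimal Python sketch of that composition step follows. It is an illustration rather than an official tool: the function name is hypothetical, and it assumes each monomer data pipeline output is a JSON object with top-level `name`, `modelSeeds`, `sequences`, `dialect`, and `version` keys, where `sequences` holds exactly one `protein` entry carrying the computed `unpairedMsa`, `pairedMsa`, and `templates`. Check these assumptions against the AlphaFold 3 input documentation before relying on it.

```python
import copy
import json
import string


def compose_multimer_input(name, monomer_data, model_seeds=(1,)):
    """Combine per-monomer data pipeline outputs into one multimer input.

    monomer_data: list of dicts, each loaded from one monomer's output JSON.
    The protein entries are copied unchanged (including their MSA and template
    fields); only the chain id is reassigned so ids are unique in the complex.
    """
    if not monomer_data:
        raise ValueError("need at least one monomer")
    if len(monomer_data) > len(string.ascii_uppercase):
        raise ValueError("this sketch only assigns single-letter chain ids")

    sequences = []
    for chain_id, data in zip(string.ascii_uppercase, monomer_data):
        proteins = [e["protein"] for e in data["sequences"] if "protein" in e]
        if len(proteins) != 1:
            raise ValueError("expected exactly one protein per monomer file")
        protein = copy.deepcopy(proteins[0])
        protein["id"] = chain_id  # A, B, C, ... in the order given
        sequences.append({"protein": protein})

    first = monomer_data[0]
    return {
        "name": name,
        "modelSeeds": list(model_seeds),
        "sequences": sequences,
        # Reuse the format markers from the pipeline output rather than guess.
        "dialect": first["dialect"],
        "version": first["version"],
    }


if __name__ == "__main__":
    # Tiny stand-ins for real pipeline outputs; real files hold full MSAs.
    def fake_monomer(seq):
        return {
            "name": "monomer", "modelSeeds": [1],
            "dialect": "alphafold3", "version": 1,
            "sequences": [{"protein": {
                "id": "A", "sequence": seq,
                "unpairedMsa": f">query\n{seq}\n",
                "pairedMsa": f">query\n{seq}\n",
                "templates": [],
            }}],
        }

    complex_input = compose_multimer_input(
        "my_complex", [fake_monomer("MKTAYIA"), fake_monomer("GSHMLE")]
    )
    print(json.dumps(complex_input, indent=2))
```

With real data you would load each monomer's output with `json.load`, call the function once per complex, and write the result with `json.dump`, so the genetic search for a given protein runs only once however many complexes it appears in.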
