I tend to agree that pure scaling isn't the solution for pLMs. I wrote about this in a more narrative fashion about two years ago when the xTrimo model came out. I do suspect, however, that new techniques will be developed around scale, but not just parameter count.
https://biotechbio.substack.com/p/intuition-on-ai-scale-for-biologists
Hi Jason — just had a look at your post; I like the impedance matching analogy! There is still a possibility that we observe a “double descent” phenomenon with pLMs as we further scale pretraining (data size and parameter count). The ESM team had a promising blog post at the end of 2024 regarding the potential for zero-shot contact prediction. I generally agree with the potential from more compute with bio LMs as 1) we integrate more modalities together, 2) rely on smarter retrieval mechanisms (see my latest post), and 3) find better ways to “prompt” these models — it could very well be that the larger models have learned much richer and more useful representations which we are not yet able to properly access. Exciting times ahead!
I thoroughly enjoyed reading this. It was inspiring, as I have always believed that scaling laws alone are not the ultimate solution for AI4Science. Besides, scientific tools should prioritize accessibility and usability, enabling private deployment and fine-tuning for specific research needs.
I noticed that you evaluated the performance of Evo2 on ProteinGym. You also mentioned that scoring Evo2 on ProteinGym requires mapping protein sequences back to their corresponding DNA sequences—a process you described as tedious and one that will eventually be extended to the full benchmark. May I kindly inquire if there is an estimated timeline for its release? Alternatively, would it be possible to share the 22 assays that have already been processed? We are eager to test the performance of our self-developed GLM, GENERator (https://arxiv.org/abs/2502.07272), on this benchmark.
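In case it helps anyone attempting the same mapping, below is a minimal sketch of the codon-level bookkeeping. It assumes the wild-type CDS for each assay is available and in frame with the protein, and it picks an arbitrary codon for the substituted residue (the codon choice is inherently ambiguous); this is purely illustrative and not necessarily how the blog post's numbers were produced.

# Minimal sketch: map a protein-level substitution (e.g. "A24G") onto its
# wild-type CDS so that a genomic LM can score the mutant DNA sequence.
from Bio.Data.CodonTable import standard_dna_table

# Amino acid -> list of codons, built from the standard genetic code.
AA_TO_CODONS = {}
for codon, aa in standard_dna_table.forward_table.items():
    AA_TO_CODONS.setdefault(aa, []).append(codon)

def apply_protein_mutation_to_cds(cds, mutation):
    """Return the mutant CDS for a substitution like 'A24G' (1-indexed)."""
    wt_aa, pos, mut_aa = mutation[0], int(mutation[1:-1]), mutation[-1]
    start = (pos - 1) * 3                      # first base of the affected codon
    wt_codon = cds[start:start + 3].upper()
    assert standard_dna_table.forward_table[wt_codon] == wt_aa, "CDS/protein out of frame?"
    mut_codon = AA_TO_CODONS[mut_aa][0]        # arbitrary codon choice for the new residue
    return cds[:start] + mut_codon + cds[start + 3:]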
In case it is of interest: GENERator is a generative GLM trained on the eukaryotic domain. Despite being 100 times cheaper and faster than Evo2, GENERator-3B has demonstrated competitive performance in ClinVar variant effect prediction (0.95 for GENERator vs. 0.96 for Evo2). If you'd like to explore further, we also provide a one-click VEP script for easy testing: https://github.com/GenerTeam/GENERator/blob/main/src/tasks/downstream/variant_effect_prediction.py.
On a related note, I wanted to share a minor comment regarding the suboptimal performance of Evo2. Based on my experience, part of this may be attributable to the limited length of CDS regions. For instance, augmenting the DNA sequence to 12k bp by including upstream non-coding regions could potentially improve performance. That said, I’m not entirely sure if this approach would be considered fair, as such additional information cannot be leveraged by any PLMs.
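To make the two ideas above concrete (log-likelihood-ratio scoring of a variant, and padding the input with upstream non-coding context), here is a rough sketch of how one could set this up with an autoregressive genomic LM through the Hugging Face API. The model ID, file path, coordinate convention (0-based) and the 9 kb of upstream context are placeholders, not the settings behind the numbers quoted above.

import torch
from pyfaidx import Fasta
from transformers import AutoModelForCausalLM, AutoTokenizer

genome = Fasta("genome.fa")                                            # placeholder reference FASTA
tok = AutoTokenizer.from_pretrained("some-org/some-genomic-lm")        # placeholder model ID
model = AutoModelForCausalLM.from_pretrained("some-org/some-genomic-lm").eval()

def window(chrom, cds_start, cds_end, upstream=9_000):
    # Reference window covering the CDS plus `upstream` bp of non-coding context
    # (roughly 12 kb in total for a ~3 kb CDS).
    start = max(0, cds_start - upstream)
    return str(genome[chrom][start:cds_end]), start

@torch.no_grad()
def log_likelihood(seq):
    ids = tok(seq, return_tensors="pt").input_ids
    out = model(ids, labels=ids)            # loss = mean negative log-likelihood per token
    return -out.loss.item() * ids.shape[1]

def llr(chrom, cds_start, cds_end, var_pos, alt_base):
    ref_seq, win_start = window(chrom, cds_start, cds_end)
    i = var_pos - win_start                 # variant position within the window (0-based)
    alt_seq = ref_seq[:i] + alt_base + ref_seq[i + 1:]
    return log_likelihood(alt_seq) - log_likelihood(ref_seq)   # > 0: the alt allele is preferred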
Thank you for the kind words, and congrats on GENERator! We are aiming to release the data this coming week as part of a new benchmark — stay tuned ;) I agree with your last suggestion re: including upstream non-coding regions — this could potentially help increase performance for these models!
Ilya Sutskever says that what we scale matters (expressive architectures) and how we scale matters (objectives).
This post suggests that scaling transformer architectures with BERT or autoregressive objectives isn't leading to predictable scaling laws for protein fitness prediction... yet.
But it's hard to bet against scaling when you see what's happened in the rest of applied AI.
So maybe we have not yet cracked the recipe for scaling? ;)
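For concreteness, the zero-shot fitness scoring being scaled in those plots is typically just a log-probability comparison at the mutated position. Here is a minimal masked-marginal sketch using a BERT-style pLM (ESM-2 via Hugging Face, purely as an example; single substitutions only, and not necessarily the exact scoring recipe used in the post):

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

name = "facebook/esm2_t33_650M_UR50D"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name).eval()

@torch.no_grad()
def masked_marginal(seq, mutation):
    # Score a substitution like "A24G" (1-indexed) as log p(mutant) - log p(wild type)
    # at the masked position; higher means the mutation looks more plausible to the model.
    wt, pos, mut = mutation[0], int(mutation[1:-1]), mutation[-1]
    assert seq[pos - 1] == wt, "wild-type residue mismatch"
    ids = tok(seq, return_tensors="pt").input_ids
    ids[0, pos] = tok.mask_token_id          # token 0 is <cls>, so residue i sits at index i
    log_probs = model(input_ids=ids).logits[0, pos].log_softmax(-1)
    return (log_probs[tok.convert_tokens_to_ids(mut)]
            - log_probs[tok.convert_tokens_to_ids(wt)]).item()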
Definitely! I think it's good, every now and then, to pause and reflect on why ideas that work so well in one domain do not seem to work as well (out of the box) in another. That was my intent with this blog: I wanted to share some observations (which I found puzzling / disappointing myself) and spark conversations around them. There are still many things to try though -- this is only the beginning!
This is really interesting! Scaling law plots are usually expressed in FLOPs, a metric that reflects both model size and training data size. You obviously touch on training data; to what extent would massively scaling training data (by including protein variation within species, wider sampling of alternate transcripts, naturally occurring variation both common and rare, etc.) influence scaling? I think data from RNA-seq, and to some extent long-read DNA sequencing, are somewhat underutilised while potentially very hard to integrate due to high sequence similarity. I imagine there are significant potential gains hidden in optimizing training (lower learning rate, masked training, etc.) for high-similarity sequences that do add information but are hard to train on.
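(For reference, the compute on the x-axis of those plots is usually approximated with the standard rule of thumb C ≈ 6 · N · D, i.e., FLOPs ≈ 6 × parameters × training tokens; the numbers below are purely illustrative.)

def train_flops(n_params, n_tokens):
    # Rule-of-thumb training compute: ~6 FLOPs per parameter per token.
    return 6 * n_params * n_tokens

# e.g. a 650M-parameter pLM trained on ~1e11 amino-acid tokens (illustrative numbers only):
print(f"{train_flops(650e6, 1e11):.2e} FLOPs")   # ~3.90e+20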
Thank you! The data aspect is key and I don't think we have exhausted our options there. Besides the sheer number of tokens we train on, what matters is:
1. The quality of the data -- one way to get more data beyond the basic datasets that most pLMs have been trained on so far (e.g., UniRef) is to tap into large metagenomic databases. But these mostly contain computational predictions based on open reading frames from environmental samples, so they are typically of lower quality. Still, it's probably a good place to start before going to more extreme approaches (e.g., synthetic sequence generation). I like your suggestions about RNA-seq / long-read DNA sequencing.
2. The sequence similarity/clustering -- this is very tricky to get right, as you sometimes want to keep the full extent of sequence diversity in the training data (i.e., no similarity filtering/clustering). For instance, that's likely the issue with ESM on viral sequences, because ESM models are trained on UniRef50/UniRef90 (depending on the model version), and the corresponding clustering (e.g., going from UniRef100 to UniRef90) has removed critical sequence diversity (see the paper from Gurev et al. referenced above for more details, and the toy sketch at the end of this reply).
3. How relevant the additional training sequences are to the families of interest in downstream tasks. Do they have domains in common? Is there any biochemical signal that is shared between these families? ProteinGym is the largest benchmark out there, but it focuses on 200+ protein families. Training a bigger model on more sequences from unrelated protein families will likely not help much.
Regarding hyperparameters, I'm sure there are further gains possible there, but I would expect them to be second-order compared to the data considerations above.
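To make point 2 concrete, here is a toy sketch of the diversity trade-off mentioned above. Real pipelines use MMseqs2 / CD-HIT with proper alignments; this toy version just uses exact positional identity on equal-length, made-up sequences.

def identity(a, b):
    # Fraction of matching positions; assumes equal-length, pre-aligned sequences.
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(seqs, min_id):
    # Greedy clustering: a sequence joins an existing representative if it is at least
    # `min_id` identical to it; otherwise it starts a new cluster.
    reps = []
    for s in seqs:
        if not any(identity(s, r) >= min_id for r in reps):
            reps.append(s)
    return reps

# Five close variants of the same (made-up) protein, each 80% identical to the first:
variants = ["MKTAYIAKQR", "MKTAYIAKGS", "MKTAGLAKQR", "MQTWYIAKQR", "MKTSYIAEQR"]
print(len(greedy_cluster(variants, 0.9)))   # 5 -> all variants kept at a 90% identity threshold
print(len(greedy_cluster(variants, 0.5)))   # 1 -> collapses to a single representative at 50%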