Discussion about this post

Jason Steiner

I tend to agree that pure scaling isn't the solution for PLMs. I wrote about this in a more narrative fashion about 2 years ago, when the xTrimo model came out. I do suspect, however, that new techniques will be developed around scale, though not just parameter count.

https://biotechbio.substack.com/p/intuition-on-ai-scale-for-biologists

Qiuyi Li

I thoroughly enjoyed reading this. It was inspiring, as I have always believed that scaling laws alone are not the ultimate solution for AI4Science. Moreover, scientific tools should prioritize accessibility and usability, enabling private deployment and fine-tuning for specific research needs.

I noticed that you evaluated the performance of Evo2 on ProteinGym. You also mentioned that scoring Evo2 on ProteinGym requires mapping protein sequences back to their corresponding DNA sequences, a process you described as tedious, and that this mapping will eventually be extended to the full benchmark. Is there an estimated timeline for its release? Alternatively, would it be possible to share the 22 assays that have already been processed? We are eager to test the performance of our self-developed GLM, GENERator (https://arxiv.org/abs/2502.07272), on this benchmark.

In case it is of interest, GENERator is a generative GLM trained on the eukaryotic domain. Despite being 100 times cheaper and faster than Evo2, GENERator-3B has demonstrated competitive performance in ClinVar variant effect prediction (0.95 for GENERator vs. 0.96 for Evo2). If you'd like to explore further, we also provide a one-click VEP script for easy testing: https://github.com/GenerTeam/GENERator/blob/main/src/tasks/downstream/variant_effect_prediction.py

On a related note, I wanted to share a minor comment regarding the suboptimal performance of Evo2. In my experience, part of this may be attributable to the limited length of CDS regions. For instance, extending the DNA sequence to 12k bp by including upstream non-coding regions could potentially improve performance. That said, I’m not entirely sure this approach would be considered fair, since no PLM can leverage such additional information.
