ENSEMBITS: an alphabet of protein conformational ensembles
Ensembits introduces the first tokenizer designed to capture protein dynamics rather than static structures, addressing a fundamental gap in protein language models. By encoding conformational ensembles through a Residual VQ-VAE trained on molecular dynamics data, the work enables models to learn correlated motions and alternative states that traditional structure tokenizers miss. This matters because protein function often depends on flexibility and motion, not just fold. The technique outperforms existing methods on dynamics prediction tasks, potentially unlocking more accurate function prediction and evolutionary analysis across computational biology and drug discovery workflows.62
























