The huge protein database that spawned AlphaFold and biology’s AI revolution

It’s easy to marvel at the technical wizardry behind breakthroughs such as AlphaFold. But a lot of that success is thanks to a database of protein structures dreamed up in the 1960s by Helen Berman, a crystallographer at the University of Southern California in Los Angeles, and like-minded scientists.

The Protein Data Bank (PDB) now holds the structures of more than 200,000 proteins, freely available to anyone. These data help AlphaFold to predict the structures of proteins from their sequence, and help other AIs to imagine new proteins at the push of a button.

Berman tells Nature why she’s pleased with the recognition (chemistry Nobel laureates David Baker at the University of Washington in Seattle and John Jumper at Google DeepMind in London both credited the PDB) and how other scientific fields can pave the way for AI breakthroughs with good data.

How did scientists share protein structures before the PDB?

The PDB came into existence when there were only a handful of structures to begin with. They were shared either by punch cards — every atom had its own punch card — or magnetic tape. The individual investigator would have to mail those things across the ocean if it was going from England to America.

What sparked the creation of the PDB?

I was a student in the 1960s in crystallography, and the structures of proteins were just beginning to appear. I was not a protein crystallographer, but I was struck by how important these structures were going to be.

I worked with a few other younger people who were also interested in structure. A small group of us began corresponding with one another about how we could get there to be a protein data bank. I don’t know that we called it that, but that’s what we wanted: some kind of a place where all these structures could be.

Was making these data open a key principle?

At the beginning of the PDB, the whole goal was just to get the protein-structure coordinates, and make sure we didn’t lose them. In the 1980s, there began a movement to say these structures are key for the public health. They’re key for good science. They have to be put in the PDB, because at the time there was no requirement. It required some encouragement on the part of the funding agencies. And it took a while for the journals to buy into the idea of requiring the data to be in the PDB. Now you cannot publish a structure without having it in the PDB.

Do you think we would have had AlphaFold without the PDB?

Knowing what I think I know about how AlphaFold works, it would have been extremely difficult. Two things were important about the PDB data. One is that the data are checked and validated by expert curators. The other is that the data are completely machine readable.

What’s it been like to observe this revolution in biological AI, with tools like AlphaFold, RoseTTAFold and protein-design software? They’re all trained on the PDB.

For me, it’s thrilling. The idea that I had back then was that we would be able to understand protein sequence–structure relationships better. I am really, really happy about the results that came out of AlphaFold and all the work that David Baker has done in protein design.

Does it speak to the importance of experimental data for powering AI breakthroughs in science?

Yes, 100%. People will say, ‘Oh, well, the PDB data are really special.’ But we actually know why they’re special. It took a long, long time to figure out how to handle the data, how to represent the data, how to collect the data. We as a community, the PDB community, know how to do this.

I think that other communities can, should and must do this. Because otherwise we’re not going to get the big breakthroughs. The methodologies that allow you to do protein prediction and protein design — the same thing could happen in chemistry. It could happen in geology. It could happen in physics.