26 December 2018

Super sequencing

A new technology for mass sequencing of proteins has been proposed

Vera Mukhina, "Elements"

Modern technologies allow fast, high-quality sequencing of DNA and RNA molecules, but until now there have been no comparable methods for determining the amino acid sequence of proteins. Recently, scientists from the United States proposed a method of massively parallel protein sequencing that combines ideas from next-generation DNA sequencing and mass spectrometry. Using the Edman reaction and fluorescence, the researchers removed one amino acid at a time from the ends of protein chains and recorded the resulting changes. The new method was also applied to the study of post-translational modifications of protein molecules.

More than 40 years have passed since Frederick Sanger proposed the first method of DNA sequencing. Sequencing technologies have made great strides since then, although some of the basic ideas are still in use. The new methods are collectively called next-generation sequencing (NGS): they make qualitative and quantitative analysis of RNA and DNA relatively fast, high-throughput and cheap, and compare very favorably with the older methods. Importantly, some of them can analyze many nucleic acid molecules simultaneously.

The most common of these methods works as follows. Nucleic acids are cut at arbitrary places into short single-stranded pieces, which are then extended with specially labeled nucleotides according to the principle of complementarity, so that the addition of each nucleotide is accompanied by a fluorescent flash. These flashes are recorded and translated into a nucleotide sequence. Computer analysis then merges the unique overlapping fragments into longer ones and reconstructs the original sequences. In this way a genome or transcriptome can be sequenced, but proteins cannot: their chemical nature is different and does not allow complementary extension. Unlike the nucleotides that make up DNA and RNA, which form complementary pairs, the amino acids that make up proteins have no such "pairing".

Nevertheless, an indirect picture of the proteins present in a cell can be obtained by sequencing DNA or RNA and then translating that sequence into protein. More accurate protein analysis usually relies on mass spectrometry, which separates molecules by mass, "weighs" them, and predicts their primary structure by comparing the measured molecular weights with reference values calculated from genomic data. Modern workflows first fragment the proteins with proteases, which cut specifically between particular amino acids, so what is actually analyzed is not the whole protein but the set of fragments into which an enzyme breaks it. Because the pattern of cuts is reproducible for each protein, it provides additional information about the protein's structure.
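The "reference values" idea can be illustrated with a minimal Python sketch. It assumes a trypsin-like rule (cut after K or R unless the next residue is P), which is an illustrative choice and not a protease named in this article; the function names `digest` and `peptide_mass` are hypothetical. The residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water molecule is added per free peptide


def digest(protein, cut_after="KR", no_cut_before="P"):
    """Split a sequence after every residue in cut_after,
    unless the following residue is in no_cut_before (assumed rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        is_last = i == len(protein) - 1
        blocked = (not is_last) and protein[i + 1] in no_cut_before
        if aa in cut_after and not blocked:
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal remainder
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


# Toy sequence, not a real protein.
for pep in digest("MKACDRPGK"):
    print(pep, round(peptide_mass(pep), 4))
```

A table of such computed masses for every protein predicted from a genome is what a measured spectrum would be compared against; a modified or previously unknown protein has no matching row, which is the limitation discussed next.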

Although mass spectrometry is a widespread and powerful tool, it has its weaknesses. In particular, difficulties arise with new and post-translationally modified proteins, for which reference values are missing and which therefore require so-called de novo sequencing.

Edward M. Marcotte and his colleagues from the University of Texas at Austin have proposed an alternative to existing solutions by adapting NGS methods for direct protein sequencing. The scientists report that the new technology can process millions of molecules in parallel and identify even individual proteins in mixtures.

By analogy with nucleotide NGS, fluorescent labels are used to distinguish elements of the protein's amino acid sequence, but the fluorescence changes as a result of chain degradation rather than chain synthesis. The Edman reaction "bites off" one amino acid at a time from the N-terminus of a protein; identifying each removed amino acid in turn reconstructs the entire sequence. The method was invented in the 1950s. In its classic version, proteins are fixed on a substrate, and the detached amino acids are washed off and identified in a chromatograph. The cycle can be repeated until the entire protein sequence is known. The drawbacks of this method, in particular its labor and time costs, rule it out for large-scale studies.

The new technology differs from the old one in three important details. First, many different proteins are fixed on the substrate at once. This speeds up the process considerably, because a single cycle yields information about the terminal amino acid of every one of them. Had the scientists kept identifying the sequence in the classical way, with a chromatograph, they would have reached a dead end: different proteins have different terminal amino acids, which would mix when washed away, making it impossible to tell which amino acid came from which protein. The second innovation therefore concerns the way amino acids are identified, and this is where fluorescent labels come in. To avoid mixing during washing, the fluorescence is measured not on the detached amino acids but on the fixed proteins. For this purpose, certain amino acid residues in the proteins are specifically tagged with fluorophores before the reaction begins, so that the microscope chamber shows a bright carpet of dots, each corresponding to one fixed protein (Fig. 1).

Fig. 1. Proteins labeled with fluorescent dyes in the field of view of an optical microscope (see Total internal reflection fluorescence microscope). Each glowing dot is a separate peptide in which certain amino acids are labeled in different colors (here, yellow and pink). By tracking how the glow changes as the proteins are gradually degraded, the researchers learned to determine their primary structure. Photo from news.utexas.edu.

The scientists have calculated that three million peptides can be packed onto six and a half square millimeters of the substrate without loss of quality. For comparison, the total number of proteins and peptides in a single cell of the bacterium Escherichia coli is estimated to be about the same (according to Ron Milo, 2013. What is the total number of protein molecules per cell volume? A call to rethink some published values).

In the work under discussion, the scientists labeled cysteine and sometimes lysine or serine, but in principle dyes can be chosen for other amino acids as well. If a label is used for only one amino acid, all the dots have the same color; if there are several labels, the color depends on the ratio of labels (and hence of labeled amino acids) in the protein. Particularly stable dyes have to be selected: the Edman reaction uses rather harsh reagents and heating, so the labels must not fall apart along the way. The labels themselves are, of course, not proteins.

For the experiment, the researchers used a high-resolution TIRF microscope, which makes it possible to see every glowing dot, that is, every peptide. Before the reaction begins, the intensity and spectrum of each dot's glow are measured; then, as the reaction proceeds, the microscope records all of its color changes. In one cycle, the brightness either decreases, if the terminal amino acid of the chain was labeled and has been washed away, or stays the same, if it carried no label. The stepwise fading of a dot at a given location reveals the positions of the labeled amino acids, and the protein itself is identified from those positions.
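The readout logic described above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' analysis pipeline: it assumes one dye color, idealized noise-free intensities expressed in units of a single dye, and a hypothetical function name `label_positions` with an assumed threshold `min_drop`.

```python
def label_positions(intensities, min_drop=0.5):
    """Infer positions of labeled residues from one dot's intensity trace.

    intensities[0] is the brightness before the reaction; intensities[k]
    is the brightness after Edman cycle k. A drop between cycle k-1 and
    cycle k means the residue at position k (counted from the N-terminus)
    carried a label. min_drop is an assumed threshold in single-dye units.
    """
    positions = []
    for cycle in range(1, len(intensities)):
        drop = intensities[cycle - 1] - intensities[cycle]
        if drop >= min_drop:
            positions.append(cycle)
    return positions


# Idealized trace for a peptide with labeled cysteines at positions 2 and 5,
# followed over eight cycles (values are made up for illustration).
trace = [2.0, 2.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
print(label_positions(trace))
```

Real traces are noisy, and dyes can bleach or a cycle can fail, so a practical analysis would need a statistical model rather than a fixed threshold.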

Fig. 2. Example of the results obtained for a mixture of 1,695 peptide molecules with two labeled cysteines at different positions. The change in the structure of one of the molecules over eight reaction cycles (a) and images of it at each cycle (b). Panel d shows the combined results for all molecules in the mixture. Color indicates the intensity of the glow of the molecules at each of the eight reaction cycles. The results are grouped by similarity, and the peptides whose intensity dropped at the same cycles, and whose cysteine positions therefore coincide, are counted. The most common pattern in the mixture (675 molecules) has cysteines at the second and fifth positions. Figure from the article under discussion in Nature Biotechnology.

The third modification of the classical Edman method is that proteins are cut into short peptides before the reaction begins, so that pieces rather than long molecules are fixed in the chamber. This was done for several reasons, above all because of error accumulation. Owing to occasional failures during the reaction (the reagents did not work, the amino acid was not washed off, and so on), with every cycle more and more molecules fall out of step with the reaction and are consequently misinterpreted. Cutting makes it possible to analyze the whole protein piece by piece in a small number of cycles instead of gradually working through the entire sequence. This limits the accumulation of errors and, since proteins can be several thousand amino acids long, reduces the required number of reaction cycles many times over.
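A rough calculation shows why accumulated errors make long reads impractical. The sketch below assumes that each cycle succeeds independently with the same probability; the value 0.95 is purely illustrative and is not a figure from the article, and `in_phase_fraction` is a hypothetical name.

```python
def in_phase_fraction(p_success, cycles):
    """Fraction of molecules that have had no failed cycle so far,
    assuming independent cycles with equal success probability."""
    return p_success ** cycles


# Compare a short peptide read with an attempt to read a long protein.
for n in (10, 30, 1000):
    print(n, in_phase_fraction(0.95, n))
```

Under this assumption about 60% of molecules are still in step after 10 cycles, while after a thousand cycles essentially none are, which is why short peptides are read instead of whole chains.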

On the other hand, this step complicates identification: cutting destroys the information about the whole protein, and the pieces must somehow be "glued" back together. The researchers borrowed the solution to this problem from mass spectrometry. If the cuts are made not at random (as in conventional NGS) but only between specific amino acids, then all peptides have terminal residues that are known in advance. Choosing specific proteases therefore increases the number of "symbols" that can be read from a protein molecule.

The remaining problems, which stem from the fact that the proteins are cut into pieces and that not all of their amino acids are known, are handled by careful computational analysis of the results. Like mass spectrometry, the method cannot yet do without pre-built databases of the proteins that may be present in the mixture. Each observed fragment is matched against the prepared sequences of these proteins; to put them into the same format as the experimental data, the cut sites and the positions of the fluorescent labels are modeled for each sequence. If a match is found, the gaps between the labeled amino acids are filled in from the database. The scientists have calculated that data from two fluorescent labels and one protease should be enough to analyze mixtures of medium complexity (about a thousand different proteins), and that developing a couple more labels should be enough to correctly identify most proteins of the proteome.
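The database lookup described above can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' software: the protease is assumed to cut after glutamate (E), the labeled residues are assumed to be cysteine and lysine, eight cycles are read, and the protein names and sequences are invented.

```python
from collections import defaultdict


def digest(protein, cut_after="E"):
    """Cut after every residue in cut_after (assumed protease rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in cut_after:
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def pattern(peptide, labeled="CK", cycles=8):
    """Expected readout: (position, residue) for each labeled residue
    that would be removed within the given number of Edman cycles."""
    return tuple((i + 1, aa) for i, aa in enumerate(peptide[:cycles])
                 if aa in labeled)


def build_index(database):
    """Map each expected label pattern to the proteins that can produce it."""
    index = defaultdict(set)
    for name, sequence in database.items():
        for peptide in digest(sequence):
            pat = pattern(peptide)
            if pat:  # unlabeled peptides are invisible to the microscope
                index[pat].add(name)
    return index


# Hypothetical two-protein database.
database = {"protA": "MCAKEGGCLKE", "protB": "ACGKEAAKCE"}
index = build_index(database)

observed = ((3, "C"), (5, "K"))   # pattern read from one dot
print(index.get(observed, set()))
```

In this toy database one pattern points to a single protein, while the pattern `((2, "C"), (4, "K"))` is shared by both, which illustrates why additional labels or several peptides per protein are needed to resolve complex mixtures.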

The researchers also tested the method on post-translational modifications of amino acids, which arise after synthesis and are not reflected in the genome in any way. As an illustrative example, they took a short repeating amino acid sequence located in the tail of the enzyme RNA polymerase II (see CTD of RNA polymerase). This region is very important for the polymerase's function: phosphorylation of different serines in it can trigger splicing or capping (J.-P. Hsin, J. L. Manley, 2012. The RNA polymerase II CTD coordinates transcription and RNA processing). Taking a mixture of differently modified peptides and labeling the phosphoserines, the researchers observed the labels being removed at different cycles of the Edman reaction. By reconstructing the positions of the phosphorylated serines in the peptides from these data, they confirmed that the method is potentially applicable to the analysis of post-translational modifications.

It would be wrong to say that the new method solves all the problems of proteomics. But it has important advantages over mass spectrometry: high sensitivity and parallel sequencing of many proteins, which could make the process far less laborious in the future. Nevertheless, one of the key problems of proteomics, the inability to easily sequence proteins de novo, remains unsolved. Since the method is entirely new, no commercial sequencer has yet been built for it that would let other researchers assess all its pros and cons. The authors plan to refine the method and create such a device in the future.

Source: Swaminathan et al., Highly parallel single-molecule identification of proteins in zeptomole-scale mixtures // Nature Biotechnology. 2018.

Portal "Eternal youth" http://vechnayamolodost.ru

