Every cell in the body contains the same genetic sequence, yet each cell expresses only a subset of those genes. These cell-specific gene expression patterns, which ensure that a brain cell differs from a skin cell, are partly determined by the three-dimensional structure of the genetic material, which controls the accessibility of each gene.
MIT chemists have now come up with a new way to determine these 3D genome structures using generative artificial intelligence. Their technique can predict thousands of structures in just minutes, making it much faster than existing experimental methods for analyzing the structures.
Using this technique, researchers could more easily study how the 3D organization of the genome affects the gene expression patterns and functions of individual cells.
“Our goal was to try to predict the three-dimensional genome structure from the underlying DNA sequence,” says Bin Zhang, an associate professor of chemistry and the senior author of the study. “Now that we can do that, which puts this technique on par with the cutting-edge experimental techniques, it can really open up a lot of interesting opportunities.”
MIT graduate students Greg Schuette and Zhuohan Lao are the lead authors of the paper, which appears today in Science Advances.
From sequence to structure
Inside the cell nucleus, DNA and proteins form a complex called chromatin, which has several levels of organization, allowing cells to pack 2 meters of DNA into a nucleus that is only one-hundredth of a millimeter in diameter. Long strands of DNA wind around proteins called histones, giving rise to a structure somewhat like beads on a string.
Chemical tags known as epigenetic modifications can be attached to DNA at specific locations, and these tags, which vary by cell type, affect the folding of the chromatin and the accessibility of nearby genes. These differences in chromatin conformation help determine which genes are expressed in different cell types, or at different times within a given cell.
Over the past 20 years, scientists have developed experimental techniques for determining chromatin structures. One widely used technique, known as Hi-C, works by linking together neighboring DNA strands in the cell’s nucleus. Researchers can then determine which segments are located near each other by shredding the DNA into many small fragments and sequencing them.
This method can be used on large populations of cells to calculate an average structure for a section of chromatin, or on single cells to determine structures within that specific cell. However, Hi-C and similar techniques are labor-intensive, and it can take about a week to generate data from one cell.
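The difference between a single-cell readout and a population average can be illustrated with a toy calculation. The sketch below is not Hi-C itself: it assumes each “cell” is a random-walk chain of beads (a hypothetical stand-in for a chromatin segment), marks two beads as in contact when they lie within an arbitrary cutoff distance, and then averages the per-cell maps into a contact frequency. The function names, chain length, and cutoff are all illustrative assumptions.

```python
import math
import random


def random_walk(n_beads, rng, step=1.0):
    """Toy chromatin chain: a 3D random walk with fixed step length (not real data)."""
    coords = [(0.0, 0.0, 0.0)]
    for _ in range(n_beads - 1):
        # Pick a uniformly random direction on the unit sphere.
        z = rng.uniform(-1.0, 1.0)
        phi = rng.uniform(0.0, 2.0 * math.pi)
        r = math.sqrt(1.0 - z * z)
        x0, y0, z0 = coords[-1]
        coords.append((x0 + step * r * math.cos(phi),
                       y0 + step * r * math.sin(phi),
                       z0 + step * z))
    return coords


def contact_map(coords, cutoff):
    """Binary single-cell map: 1 where two beads lie within `cutoff` of each other."""
    n = len(coords)
    return [[1 if math.dist(coords[i], coords[j]) <= cutoff else 0
             for j in range(n)] for i in range(n)]


def average_maps(maps):
    """Population-level contact frequency: element-wise mean of single-cell maps."""
    n = len(maps[0])
    return [[sum(m[i][j] for m in maps) / len(maps) for j in range(n)]
            for i in range(n)]


rng = random.Random(0)
cells = [random_walk(20, rng) for _ in range(50)]            # 50 toy "cells"
single_cell_maps = [contact_map(c, cutoff=2.0) for c in cells]
population_map = average_maps(single_cell_maps)

# Each single-cell entry is 0 or 1; the population entry is a fraction of cells.
print("bead 0 / bead 5 contact in cell 0:", single_cell_maps[0][0][5])
print("bead 0 / bead 5 contact frequency:", population_map[0][5])
```

Each single-cell map is all-or-nothing, while the averaged map reports how often a contact occurs across cells, which is why the two kinds of measurement answer different questions.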
To overcome those limitations, Zhang and his students developed a model that takes advantage of recent advances in generative AI to create a fast, accurate way to predict chromatin structures in single cells. The AI model that they designed can quickly analyze DNA sequences and predict the chromatin structures that those sequences might produce in a cell.
“Deep learning is really good at pattern recognition,” Zhang says. “It allows us to analyze very long DNA segments, thousands of base pairs, and figure out what is the important information encoded in those DNA base pairs.”
ChromoGen, the model that the researchers created, has two components. The first component, a deep learning model taught to “read” the genome, analyzes the information encoded in the underlying DNA sequence and chromatin accessibility data, the latter of which is widely available and cell type-specific.
The second component is a generative AI model that predicts physically accurate chromatin conformations, having been trained on more than 11 million chromatin conformations. These data were generated from experiments using Dip-C (a variant of Hi-C) on 16 cells from a line of human B lymphocytes.
When integrated, the first component informs the generative model how the cell type-specific environment influences the formation of different chromatin structures, and this scheme effectively captures sequence-structure relationships. For each sequence, the researchers use their model to generate many possible structures. That is because DNA is a very disordered molecule, so a single DNA sequence can give rise to many different possible conformations.
“A major complicating factor of predicting the structure of the genome is that there isn’t a single solution that we’re aiming for. There’s a distribution of structures, no matter what portion of the genome you’re looking at. Predicting that very complicated, high-dimensional statistical distribution is something that is incredibly challenging to do,” Schuette says.
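The idea that one region corresponds to a distribution of structures, not a single answer, can be made concrete with a toy ensemble. The sketch below assumes, purely for illustration, that each structure is a Gaussian random-walk chain, and it summarizes the ensemble with the radius of gyration (a standard measure of how compact a chain is). Nothing here reproduces ChromoGen or its data; `gaussian_chain` and the chosen sizes are hypothetical.

```python
import math
import random
import statistics


def gaussian_chain(n_beads, rng):
    """Toy structure: a 3D random walk with Gaussian steps (an assumption, not real data)."""
    pos = (0.0, 0.0, 0.0)
    chain = [pos]
    for _ in range(n_beads - 1):
        pos = tuple(p + rng.gauss(0.0, 1.0) for p in pos)
        chain.append(pos)
    return chain


def radius_of_gyration(coords):
    """Root-mean-square distance of the beads from their common center."""
    n = len(coords)
    center = tuple(sum(c[k] for c in coords) / n for k in range(3))
    return math.sqrt(sum(math.dist(c, center) ** 2 for c in coords) / n)


rng = random.Random(42)
ensemble = [gaussian_chain(30, rng) for _ in range(500)]   # 500 toy structures
rg_values = [radius_of_gyration(chain) for chain in ensemble]

# The same "sequence" yields a spread of sizes, so a single number cannot describe it.
print(f"mean Rg  = {statistics.mean(rg_values):.2f}")
print(f"stdev Rg = {statistics.stdev(rg_values):.2f}")
print(f"range    = {min(rg_values):.2f} to {max(rg_values):.2f}")
```

Even in this simplified setting the structures differ substantially from one another, which is why a predictive model has to describe the whole spread instead of one representative fold.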
Quick analysis
Once trained, the model can generate predictions on a much faster time scale than Hi-C and other experimental techniques.
“Whereas you might spend six months running experiments to get a few dozen structures in a given cell type, you can generate a thousand structures in a particular region with our model in 20 minutes on just one GPU,” Schuette says.
After training their model, the researchers used it to generate structure predictions for more than 2,000 DNA sequences, then compared them to the experimentally determined structures for those sequences. They found that the structures generated by the model were the same or very similar to those seen in the experimental data.
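One common way to compare two sets of structures is to reduce each set to a mean pairwise-distance map and correlate the two maps. The sketch below shows that general idea on toy data; it is an assumption about how such a comparison could be done, not the metric used in the study. Both ensembles here are hypothetical Gaussian random walks standing in for “generated” and “experimental” structures.

```python
import math
import random


def toy_ensemble(n_structures, n_beads, rng):
    """Stand-in for real data: each structure is a 3D Gaussian random walk."""
    ensemble = []
    for _structure in range(n_structures):
        pos = (0.0, 0.0, 0.0)
        chain = [pos]
        for _bead in range(n_beads - 1):
            pos = tuple(p + rng.gauss(0.0, 1.0) for p in pos)
            chain.append(pos)
        ensemble.append(chain)
    return ensemble


def mean_distance_map(ensemble):
    """Average bead-to-bead distance over all structures in an ensemble."""
    n = len(ensemble[0])
    return [[sum(math.dist(s[i], s[j]) for s in ensemble) / len(ensemble)
             for j in range(n)] for i in range(n)]


def upper_triangle(matrix):
    """Flatten the entries above the diagonal (the map is symmetric)."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


generated = toy_ensemble(200, 15, random.Random(0))   # hypothetical model output
reference = toy_ensemble(200, 15, random.Random(1))   # hypothetical experimental set
gen_map = upper_triangle(mean_distance_map(generated))
ref_map = upper_triangle(mean_distance_map(reference))
similarity = pearson(gen_map, ref_map)
print(f"Pearson correlation of mean distance maps: {similarity:.3f}")
```

Comparing ensemble-level maps, as opposed to matching individual structures one to one, fits the point that each region is described by many structures.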
“We typically look at hundreds or thousands of conformations for each sequence, and that gives you a reasonable representation of the diversity of the structures that a particular region can have,” Zhang says. “If you repeat your experiment multiple times, in different cells, you will very likely end up with a very different conformation. That’s what our model is trying to predict.”
The researchers also found that the model could make accurate predictions for data from cell types other than the one it was trained on. This suggests that the model could be useful for analyzing how chromatin structures differ between cell types, and how those differences affect their function. The model could also be used to explore the different chromatin states that can exist within a single cell, and how those changes affect gene expression.
Another possible application would be to explore how mutations in a particular DNA sequence change the three-dimensional structure of chromatin.
“There are a lot of interesting questions that I think we can address with this type of model,” Zhang says.
The researchers have made all of their data and the model available to others who wish to use it.
The research was funded by the National Institutes of Health.