Biologists at the University of California-San Diego have developed the first machine learning-based system for determining gene expression, according to an announcement from the university.
Gene expression is the process by which information encoded within genes leads to key products, such as proteins, the university said.
“This paper represents the first method to distinguish genes that can be expressed from those that cannot,” said Steve Briggs, a University of California-San Diego Division of Biological Sciences professor and senior author of the paper. “This is the basis for all of biology. Whether it’s drug discovery or plant breeding or evolution, this touches the basic studies of biology.”
The method, developed by graduate student Ryan Sartor, Briggs and their colleagues, was published Aug. 12 in Proceedings of the National Academy of Sciences.
Biologists have previously classified gene expression through experimental observations and scientific literature references, the researchers said. But the genomics field lacked a formalized process for revealing the “expressible gene set,” which comprises all protein-coding genes with the potential to be expressed.
“In biology, there is no method to do this,” Briggs said. “In the past, we’ve just had empirical approaches to making catalogs. We haven’t had scientific criteria that classifies the genes based on their molecular features.”
The new method leverages machine learning, the use of algorithms and other processes to analyze data, and is based on an example set of nearly 30,000 maize plant genes containing specific, detailed molecular features. An advanced algorithm was trained on the data and “learned” to classify gene expression with 99.4% accuracy, the university said.
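For readers unfamiliar with how a classifier is trained on labeled examples, the general idea can be sketched in a few lines of Python. This is only an illustration, not the researchers' method: the announcement does not name the algorithm or the molecular features used, so the logistic regression model, the two feature names (accessibility and methylation) and the synthetic data below are all assumptions made for the example.

```python
import math
import random


def make_synthetic_genes(n, rng):
    """Generate toy (features, label) pairs. Synthetic data, not real maize genes."""
    genes = []
    for _ in range(n):
        expressible = rng.random() < 0.5
        # Hypothetical features (assumed for illustration): in this toy model,
        # expressible genes have more open chromatin and less DNA methylation.
        accessibility = rng.gauss(0.7 if expressible else 0.3, 0.1)
        methylation = rng.gauss(0.2 if expressible else 0.8, 0.1)
        genes.append(([accessibility, methylation], 1 if expressible else 0))
    return genes


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(weights, bias, features):
    """Predicted probability that a gene is expressible."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train_logistic(data, epochs=50, learning_rate=0.5):
    """Fit a logistic regression with plain stochastic gradient descent."""
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in data:
            error = score(weights, bias, features) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


def predict(weights, bias, features):
    """Return 1 (expressible) or 0 (not expressible)."""
    return 1 if score(weights, bias, features) >= 0.5 else 0


def accuracy(weights, bias, data):
    """Fraction of genes whose predicted label matches the true label."""
    correct = sum(predict(weights, bias, f) == label for f, label in data)
    return correct / len(data)


rng = random.Random(0)  # fixed seed so the run is repeatable
training_genes = make_synthetic_genes(300, rng)
held_out_genes = make_synthetic_genes(100, rng)

weights, bias = train_logistic(training_genes)
test_accuracy = accuracy(weights, bias, held_out_genes)
print(f"Held-out accuracy on toy data: {test_accuracy:.3f}")
```

The accuracy is measured on genes the model did not see during training, which is the usual way such a figure is estimated. The toy data are far easier to separate than real genomic data, so the number it prints says nothing about the 99.4% reported by the university.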
The key to the advancement is bringing together chromatin biology, which contributes to regulating the DNA packaging within cells, with molecular features that are known to determine gene expression. By combining these with machine learning, the new method determines the species-wide set of transcribed genes, or “expressome,” and then creates an atlas of expressible genes, the university explained.
The method may also be useful in understanding evolutionary mechanisms that silence certain genes.
Briggs is now applying the method to sorghum, an important grain for food and fodder, but noted that it can be useful beyond plant species. Ultimately, he said, the new method is like a word decoder.
“The genome sequence is like a book,” Briggs explained. “The words are the genes. Until now, we couldn’t tell which DNA sequences were real words and which merely resembled words. By removing non-words, we now have a much more accurate reading of the book.”
Co-authors of the paper include Jaclyn Noshay and Nathan Springer of the University of Minnesota. The National Science Foundation’s Plant Genome Research Program supported the research.