Fish genome sequenced with big data

Annotated genome of California yellowtail could be first step in its sustainable aquaculture harvest.

April 12, 2018

Scientists assembled and annotated the genome of Seriola dorsalis, also known as California yellowtail, using big data and supercomputers. The closely related Seriola lalandi is shown. Credit: Fishbase.

The U.S. imports more than 80% of its seafood, according to U.N. estimates, but new genetic research could help make farmed fish more palatable and bring America's wild fish species to dinner tables, according to a recent announcement from the Texas Advanced Computing Center (TACC) at the University of Texas-Austin.

Scientists have used big data and supercomputers to decode a fish genome -- a first step for a sustainable aquaculture harvest, the announcement said.

Researchers assembled and annotated the genome of the fish species Seriola dorsalis for the first time. Also known as California yellowtail, it's a fish of high value to the sashimi (raw seafood) industry.

The research team was formed from the Southwest Fisheries Science Center of the U.S. National Marine Fisheries Service, Iowa State University and the Instituto Politécnico Nacional in Mexico. TACC provided big data analytics for the project. The results were published in the January issue of the journal BMC Genomics.

"The major findings in this publication were to characterize the S. dorsalis genome and its annotation, along with getting a better understanding of sex determination of this fish species," said study co-author Andrew Severin, a scientist and facility manager at the Genome Informatics Facility of Iowa State University.

"We can now confidently say that S. dorsalis has a Z-W sex determination system and that we know the chromosome that it's contained on and the region that actually determines the sex of this fish," Severin said, explaining that the name refers to the sex chromosomes and to which sex is heterozygous: the male in an X-Y system (XX female, XY male) or the female in a Z-W system (ZZ male, ZW female).
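The naming convention can be written down in a few lines of Python. This is only an illustrative sketch: the lookup table and the function name sex_from_genotype are hypothetical and are not taken from the study.

```python
# Hypothetical lookup table: which sex each sex-chromosome pair corresponds to.
# X-Y systems: the male is heterozygous (XY). Z-W systems: the female is (ZW).
SEX_BY_GENOTYPE = {
    frozenset("XX"): "female",
    frozenset("XY"): "male",
    frozenset("ZZ"): "male",
    frozenset("ZW"): "female",
}


def sex_from_genotype(genotype: str) -> str:
    """Return 'male' or 'female' for a two-letter genotype such as 'ZW'."""
    pair = genotype.upper()
    if len(pair) != 2:
        raise ValueError(f"expected two chromosomes, got {genotype!r}")
    key = frozenset(pair)  # order-insensitive: 'ZW' and 'WZ' are the same pair
    if key not in SEX_BY_GENOTYPE:
        raise ValueError(f"unrecognized genotype {genotype!r}")
    return SEX_BY_GENOTYPE[key]


if __name__ == "__main__":
    for g in ("ZZ", "ZW", "XX", "XY"):
        print(g, sex_from_genotype(g))
```

In a Z-W species such as S. dorsalis, a heterozygous (ZW) result would indicate a female and a homozygous (ZZ) result a male.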

It's hard to tell the difference between a male and female yellowtail fish because they don't have any obvious phenotypical traits. "Being able to determine sex in fish is really important, because we can develop a marker that can be used to determine sex in young fish that you can't determine phenotypically," Severin explained. "This can be used to improve aquaculture practices."

Sex identification lets fish farmers stock tanks with the right ratio of males to females and get better yield.

Assembling and annotating a genome is like building an enormous, three-dimensional jigsaw puzzle, and the S. dorsalis genome has 685 million pieces — base pairs of DNA — to put together.

"Gene annotations are locations on the genome that encode transcripts that are translated into proteins," Severin explained.

Severin and his team assembled the genome of 685 megabase pairs (Mb) from thousands of smaller fragments that each provided information to form the complete picture. "We had to sequence them for quite a bit of depth in order to construct the full 685 Mb genome," he said.
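As a rough illustration of what "depth" means here, average sequencing depth is commonly estimated as the number of reads times the read length, divided by the genome size. The sketch below uses the 685-megabase genome size and the 100-150 base pair read lengths quoted in this article. The 50x target is a hypothetical value chosen for the example, not a figure from the study, and the function names are invented for illustration.

```python
GENOME_SIZE_BP = 685_000_000  # genome size reported in the article


def average_depth(n_reads: int, read_len_bp: int,
                  genome_size_bp: int = GENOME_SIZE_BP) -> float:
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_len_bp / genome_size_bp


def reads_needed(target_depth: int, read_len_bp: int,
                 genome_size_bp: int = GENOME_SIZE_BP) -> int:
    """Smallest number of reads that reaches the target average depth."""
    total_bases = target_depth * genome_size_bp
    # Integer ceiling division avoids floating-point rounding.
    return (total_bases + read_len_bp - 1) // read_len_bp


if __name__ == "__main__":
    HYPOTHETICAL_TARGET_DEPTH = 50  # assumption for illustration only
    for read_len in (100, 150):
        n = reads_needed(HYPOTHETICAL_TARGET_DEPTH, read_len)
        print(f"{read_len} bp reads: {n:,} reads for {HYPOTHETICAL_TARGET_DEPTH}x")
```

Under these assumptions, even a modest target depth calls for hundreds of millions of short reads, which is why the raw data volume grows so quickly.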

"This amounted to a lot of data," said study co-author Arun Seetharam, an associate scientist at the Iowa State Genome Informatics Facility.

The raw DNA sequence data for the S. dorsalis genome totaled 500 gigabytes and came from tissue samples of a juvenile fish collected at the Hubbs SeaWorld Research Institute in San Diego, Calif.

"In order to put them together, we needed a computer with a lot more (random-access memory) to put it all into the computer's memory and then put it together to construct the 685 Mb genome. We needed really powerful machines," Seetharam said.

The genome assembly work was conducted at the Pittsburgh Supercomputing Center on the Blacklight system, which, at one point, was the world's largest coherent shared-memory computing system. At the center, Blacklight has since been superseded by the data-centric Bridges system, which includes similar large-memory nodes of up to 12 terabytes — a thousand times more than a typical personal computer.

"You have to be able to compare every single piece of sequence data to every other piece to figure out which pieces need to be joined together -- like a giant puzzle," Severin explained.
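The all-against-all comparison Severin describes can be illustrated with a toy example. The sketch below is a deliberate simplification, not the assembler used in the study: the three short reads are invented, and overlap and all_overlaps are hypothetical names. It checks every ordered pair of reads for a suffix of one that matches a prefix of the other.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def all_overlaps(reads, min_len: int = 3):
    """Compare every ordered pair of reads and keep the overlapping pairs."""
    found = {}
    for a, b in permutations(reads, 2):
        k = overlap(a, b, min_len)
        if k > 0:
            found[(a, b)] = k
    return found


if __name__ == "__main__":
    reads = ["ATGGCGT", "GCGTACC", "TACCTTA"]  # invented toy reads
    for (a, b), k in all_overlaps(reads).items():
        print(f"{a} -> {b}: overlap of {k} bases")
```

For n reads this naive approach makes n(n-1) comparisons, so the work grows quadratically with the number of reads. Real assemblers typically rely on indexing strategies to avoid comparing every pair directly, but the underlying problem still demands a great deal of memory.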

He said the project originally set out to complete a large RNA sequencing project, and it turned out that there was sufficient funding to also do a genome assembly.

"That resulted in a long-term collaboration with the Southwest Fisheries Science Center," Severin said. "With the recent advances in high-throughput DNA sequencing, we're now able to generate terabytes of sequencing data. This tends to be short, 100-150 base pair reads that we have to put together like a very large puzzle and figure out where all the pieces go," he added.

Severin and Seetharam's team has completed the basic picture of the genome for S. dorsalis, but they say there's still room for refinement.

"The genome that we assembled is not perfect in the sense that it is still in many pieces. We weren't able to fully piece together entire chromosomes," Seetharam said. "We have many scaffolds representing each of those chromosomes, and we are missing a lot of information that is needed to fill in the gaps."

Advances in sequencing technology that produce longer DNA reads can address these gaps, Seetharam said.

Both Severin and Seetharam are resolute in their conviction that big data can solve problems in sustainable food production.

"I believe the public is going to see more of this type of big data utilization and to see why science is so important for our future," Severin said. "We're going to start comparing genome assemblies with each other and start getting at what a genome is and how it works and (asking) how, for a particular genome, does the presence or absence of genes or its context with regard to its three-dimensional structure, how does that make a species?"
