S. rueppellii larvae were obtained from ‘biopestgroup.com’. The larvae were anaesthetised with CO2 so that they could be sorted from the substrate, then flash-frozen in liquid N2 and stored at -80°C. The whole process was completed within 48 hours of arrival.
Next Generation Sequencing
i) Illumina genomic sequencing, 150 bp paired-end data: 417,662,063 reads with a total length of 125.3 Gb.
ii) PacBio CLR data: 6,748,327 reads with a total length of 83.2 Gb (277x) and a polymerase read length N50 of 63,285 bp.
iii) Illumina RNA sequencing, 150 bp paired-end data: 123,298,454 reads.
iv) Illumina genomic Hi-C sequencing, 150 bp paired-end data: 21.6 Gb.
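The yield and coverage figures in the list above can be cross-checked with simple arithmetic. The sketch below is illustrative only: it assumes the PacBio fold coverage was computed as total bases divided by genome size, and that the Illumina read count refers to read pairs. Neither assumption is stated in the text, and the function names are hypothetical.

```python
# Figures quoted in the sequencing list above.
PACBIO_BASES = 83.2e9          # total PacBio CLR length (bp)
PACBIO_COVERAGE = 277          # stated fold coverage
ILLUMINA_READS = 417_662_063   # stated genomic read count
ILLUMINA_BASES = 125.3e9       # stated genomic total length (bp)
READ_LENGTH = 150              # bp per read


def implied_genome_size(total_bases, coverage):
    """Genome size implied by coverage = total_bases / genome_size (assumption)."""
    return total_bases / coverage


def paired_end_yield(n_pairs, read_length):
    """Total bases if the read count refers to pairs (assumption)."""
    return n_pairs * 2 * read_length


genome_size = implied_genome_size(PACBIO_BASES, PACBIO_COVERAGE)
illumina_yield = paired_end_yield(ILLUMINA_READS, READ_LENGTH)

print(f"Implied genome size: {genome_size / 1e6:.1f} Mbp")
print(f"Illumina yield if the count is pairs: {illumina_yield / 1e9:.1f} Gb")
print(f"Implied Illumina coverage: {ILLUMINA_BASES / genome_size:.0f}x")
```

Under these assumptions, the paired-end yield reproduces the stated 125.3 Gb, which suggests the read count refers to pairs rather than individual reads.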
Several assemblers were trialled (including Canu, DBG2OLC and wtdbg2), but many struggled to produce a good-quality assembly, perhaps because of the high repeat content and heterozygosity of the genome. Flye and Platanus-Allee produced the best assemblies. Flye had the best assembly statistics in terms of scaffold N50 (100,207 bp, with 18 scaffolds >1 million bp) and BUSCO completeness score (99.2%). However, duplication was very high for this assembly (48.3%), even after subsetting the longest reads to 150x coverage (duplication was 63.8% prior to subsetting). The total number of scaffolds was 50,164. Platanus-Allee had a lower scaffold N50 (42,845 bp, with no scaffolds >1 million bp) and a slightly lower BUSCO completeness score (97.6%), but duplication was much lower (3.6%). The total number of scaffolds was 67,142.
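The read-subsetting step mentioned above (keeping the longest reads up to a target coverage) can be sketched as follows. The text does not say which tool was used for this, so the function `select_longest_reads` and the toy values are hypothetical; the sketch shows only the general idea of accumulating reads from longest to shortest until the target number of bases is reached.

```python
def select_longest_reads(read_lengths, genome_size, target_coverage):
    """Return the longest read lengths whose sum first reaches the target.

    Hypothetical helper: target bases = genome_size * target_coverage.
    """
    target_bases = genome_size * target_coverage
    selected = []
    total = 0
    # Walk through reads from longest to shortest.
    for length in sorted(read_lengths, reverse=True):
        if total >= target_bases:
            break
        selected.append(length)
        total += length
    return selected


# Toy example: a 1,000 bp "genome" subset to 4x coverage.
toy_reads = [500, 1200, 300, 2000, 800, 1500]
subset = select_longest_reads(toy_reads, genome_size=1000, target_coverage=4)
print(subset, sum(subset))
```

In practice the same selection would be applied to read records rather than bare lengths, with the genome size taken from an independent estimate.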
To retain the high contiguity of the Flye assembly whilst reducing its high duplication percentage, the Flye and Platanus-Allee assemblies were merged using QuickMerge. Some manual curation was also performed to restore falsely removed contigs. This resulted in an assembly with a slightly lower completeness score of 96.5%, but duplication was reduced to 15.5% whilst preserving most of the long scaffolds produced by Flye. The assembly had a scaffold N50 of 67,653 bp and a total of 59,284 scaffolds, 16 of which were >1 million bp.
A subsequent round of Purge Haplotigs brought the duplication score down to 4.6% whilst maintaining a completeness of 95.6%. Scaffold N50 increased to 126,450 bp and the total number of scaffolds was reduced to 15,009.
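Because N50 depends on the total assembly length, removing many short scaffolds can raise it even though no scaffold gets longer. The sketch below illustrates this with a standard N50 calculation on toy lengths. The values are invented for illustration, and the sketch does not model how Purge Haplotigs chooses which sequences to remove.

```python
def n50(lengths):
    """Length L such that scaffolds of length >= L cover at least half the total."""
    if not lengths:
        raise ValueError("no scaffold lengths given")
    half_total = sum(lengths) / 2
    running = 0
    # Accumulate from the longest scaffold downwards.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Toy scaffold lengths before and after removing short, redundant scaffolds.
before = [100, 50, 40, 30, 30, 30, 20, 20, 20, 20]
after = [100, 50, 40, 30, 30]

print(n50(before), n50(after))
```

Here the N50 rises from 40 to 50 after the short scaffolds are dropped, because the halfway point of the smaller total is reached sooner.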
This draft assembly was then scaffolded with the Hi-C data using the 3D-DNA de novo genome assembly pipeline. This increased the scaffold N50 to 87,361,475 bp, with 5 scaffolds >10 million bp. The total number of scaffolds was reduced to 11,549, with 6 chromosome-level scaffolds, numbered by sequence length. There is currently no karyotypic information for S. rueppellii to confirm the correct number of chromosomes; however, this number matches a cytogenetic analysis of Eristalis tenax, which had 6 chromosomes. The BUSCO completeness score was reduced to 94.6%, but a round of Pilon error polishing brought it back up to 96.4% (subsequent rounds of Pilon worsened the BUSCO score). A final run of Purge Haplotigs reduced duplication from 4% to 3%.