Bioinformatic Workflows


Example Gene Annotation workflow

• Advanced Repeat library pipeline

• Genome-guided transcriptome assemblies using Trinity and StringTie, combined using Evigene

BUSCO: C:96.8%[S:45.3%,D:51.5%],F:0.2%,M:3.0%

• MAKER gene prediction using trained AUGUSTUS and GeneMark, with the Evigene transcriptome as evidence

BUSCO: C:89.8%[S:83.1%,D:6.7%],F:2.7%,M:7.5%

• MAKER models transferred from evidence using Evigene (transcriptome)

BUSCO: C:84.7%[S:76.7%,D:8.0%],F:0.4%,M:14.9%

• Evigene to pick the best models from both the MAKER gene prediction set and the transferred set

BUSCO: C:94.9%[S:87.3%,D:7.6%],F:0.9%,M:4.2% = ~30K gene models

• Additional steps using GffCompare to pull in novel transcripts, plus manual curation to select models and remove excess gene predictions that lack evidence

BUSCO: C:96.2%[S:88.4%,D:7.8%],F:0.7%,M:3.1% = ~13,500 gene models

• PASA to correct models and add UTRs and isoforms

BUSCO: C:96.4%[S:85.7%,D:10.7%],F:0.5%,M:3.1% = ~13,400 gene models
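
The BUSCO strings above can be compared step by step once they are parsed. The following is a minimal Python sketch, not part of the original workflow: the function names, the step labels and the tolerance are illustrative assumptions. It parses each summary string and checks that S + D matches C and that C + F + M sums to roughly 100%.

```python
import re

# Matches BUSCO summary strings such as
# "C:96.8%[S:45.3%,D:51.5%],F:0.2%,M:3.0%"
BUSCO_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%"
)

# Step labels are shorthand chosen for this sketch; the strings are
# copied from the workflow above.
STEPS = [
    ("Evigene transcriptome", "C:96.8%[S:45.3%,D:51.5%],F:0.2%,M:3.0%"),
    ("MAKER prediction", "C:89.8%[S:83.1%,D:6.7%],F:2.7%,M:7.5%"),
    ("MAKER transferred", "C:84.7%[S:76.7%,D:8.0%],F:0.4%,M:14.9%"),
    ("Evigene best models", "C:94.9%[S:87.3%,D:7.6%],F:0.9%,M:4.2%"),
    ("GffCompare + curation", "C:96.2%[S:88.4%,D:7.8%],F:0.7%,M:3.1%"),
    ("PASA", "C:96.4%[S:85.7%,D:10.7%],F:0.5%,M:3.1%"),
]


def parse_busco(summary):
    """Return the C, S, D, F and M percentages as floats."""
    match = BUSCO_RE.search(summary)
    if match is None:
        raise ValueError(f"not a BUSCO summary string: {summary!r}")
    return {key: float(value) for key, value in match.groupdict().items()}


def is_consistent(scores, tol=0.15):
    """Check S + D == C and C + F + M == 100, allowing for rounding.

    The tolerance is an assumption to absorb one-decimal rounding.
    """
    return (
        abs(scores["S"] + scores["D"] - scores["C"]) <= tol
        and abs(scores["C"] + scores["F"] + scores["M"] - 100.0) <= tol
    )


if __name__ == "__main__":
    for label, summary in STEPS:
        scores = parse_busco(summary)
        print(f"{label:<24} complete={scores['C']:5.1f} "
              f"duplicated={scores['D']:5.1f} missing={scores['M']:5.1f} "
              f"consistent={is_consistent(scores)}")
```

Tabulating the duplicated (D) and missing (M) columns side by side makes it easier to see which step trades completeness for redundancy.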


Final steps:

i) Manual model selection followed, to split models that were incorrectly joined and, where possible, curate any obviously incorrect models.

ii) Use a genomic Pfam track to identify loci, then manually curate gene families of interest: P450, UDP, ABC, IRAC.
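
As a rough illustration of step ii), candidate loci can be shortlisted for manual curation by grouping genes on their Pfam domain hits. This is a hedged Python sketch, not the method used above: the input format (pairs of gene ID and Pfam accession) and the function name are hypothetical, and the accession-to-family mapping is an assumption (PF00067 for cytochrome P450, PF00201 for UDP-glucuronosyl/glucosyl transferases, PF00005 for the ABC transporter domain). IRAC is left out because no single accession can be assumed for it here.

```python
from collections import defaultdict

# Assumed mapping from Pfam accession to the family labels used above.
FAMILY_BY_PFAM = {
    "PF00067": "P450",  # cytochrome P450
    "PF00201": "UDP",   # UDP-glucuronosyl/glucosyl transferase
    "PF00005": "ABC",   # ABC transporter domain
}


def loci_by_family(domain_hits):
    """Group gene IDs by family of interest.

    domain_hits is a hypothetical iterable of (gene_id, pfam_accession)
    pairs, e.g. extracted from a Pfam annotation track. Accession version
    suffixes such as ".25" are ignored.
    """
    families = defaultdict(set)
    for gene_id, accession in domain_hits:
        family = FAMILY_BY_PFAM.get(accession.split(".")[0])
        if family is not None:
            families[family].add(gene_id)
    return {family: sorted(genes) for family, genes in families.items()}


if __name__ == "__main__":
    # Made-up gene IDs, for illustration only.
    hits = [("gene1", "PF00067.25"), ("gene2", "PF00005"), ("gene3", "PF07714")]
    print(loci_by_family(hits))
```

The output is only a shortlist; each locus would still need checking by eye, as the step describes.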


Example Genome Assembly workflow for 20x coverage

• Hifiasm assembly (recommend correcting reads with DeepConsensus beforehand)

• Purge_haplotigs to remove redundant haplotigs

• Geneious LASTZ alignments for limited further manual removal of redundancy and merging of overlapping contigs

• Hi-C scaffolding with Juicer and 3D-DNA (switch off breaking with "-r 0", as it is excessive)

• Check the Juicer Hi-C maps and correct manually

• Map Hi-C reads and error-correct using homozygous SNPs/indels

• Take unmapped Hi-C reads and map them back to the original assembly to check whether any sequence has been lost, and reinsert it

• Check LASTZ alignments for any duplication errors and artefacts
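
The assembly and scaffolding commands can be sketched as follows. This is a minimal Python illustration that only builds the command lines; it does not run anything. The file names, output prefix and thread count are hypothetical placeholders. It assumes hifiasm's `-o` and `-t` options and the 3D-DNA `run-asm-pipeline.sh` script, whose `-r` option sets the number of misjoin-correction rounds (so "-r 0" switches the breaking off).

```python
import shlex


def hifiasm_cmd(reads, prefix, threads=16):
    """Build a hifiasm command: output prefix, thread count, HiFi reads."""
    return ["hifiasm", "-o", prefix, "-t", str(threads), reads]


def scaffold_cmd(draft_fasta, merged_nodups, rounds=0):
    """Build a 3D-DNA command; rounds=0 disables misjoin correction."""
    return ["run-asm-pipeline.sh", "-r", str(rounds), draft_fasta, merged_nodups]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(shlex.join(hifiasm_cmd("hifi_reads.fastq.gz", "asm")))
    print(shlex.join(scaffold_cmd("purged.fasta", "merged_nodups.txt")))
```

Keeping the commands as lists means they can be passed to `subprocess.run` unchanged once the real paths are filled in.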


Note: With 5x coverage, the difference would be to produce separate assemblies using Flye and Canu, merge them with quickmerge, and do additional manual curation in Geneious to try to merge contigs. The result will be limited without 20x coverage, however.
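
Which route applies depends on the read coverage, which can be estimated as total read bases divided by genome size. Below is a small Python sketch under stated assumptions: the function names are illustrative, and using 20x as a hard cut-off is an assumption, since the notes only describe the 20x and 5x cases.

```python
def estimated_coverage(total_read_bases, genome_size):
    """Estimate coverage as total sequenced bases / genome size (in bp)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_read_bases / genome_size


def suggested_workflow(coverage, hifiasm_min=20.0):
    """Pick a route from the notes; the 20x cut-off is an assumption."""
    if coverage >= hifiasm_min:
        return "hifiasm"
    return "flye+canu+quickmerge"


if __name__ == "__main__":
    # Hypothetical numbers: 10 Gb of reads for a 500 Mb genome.
    coverage = estimated_coverage(10_000_000_000, 500_000_000)
    print(coverage, suggested_workflow(coverage))
```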