Hi!
[verbose text coming]
I'm working with Avian Influenza Virus datasets. The viral genome consists of 8 short segments, each coding one gene and ranging between 700 and 2300 nt. The repositories hold hundreds of thousands of accessions. Within a given segment, the sequences are all of very similar length.
I have used easy-linclust to reduce a large set of genomic segments to a small reference set to guide assemblies, with great success.
I included the clustered sequence representatives in an evolutionary analysis using BEAST to describe the genomes we have sequenced. Virologists don't usually do this; they use the whole redundant dataset to make big-ass alignments and then collapse whatever they don't want to see in the resulting trees. I'm being asked what the rationale is for using the representative sequences instead of unclustered data to provide phylogenetic context and external references for our alignments/trees.
I'm having trouble explaining the relationship between the clusters and the phylogenetic history of the data.
First question:
As a general query, is my approach valid? Or flawed?
Then, when I explore the clusters made by easy-linclust, I usually find one very large cluster and a few clusters with very few sequences. For example, from a dataset of 6367 records I would see:
/gisaid_July2025/trim_noen$ cut -f 1 lc_trim_gisaid_7.fasta_cluster.tsv | sort | uniq -c | sort -rn
5166 7|MP|EPI_ISL_90154|A/chicken/China/AH-10-01/2010|EPI317184|A_/H9N2
553 7|MP|EPI_ISL_68841|A/chicken/PA/35154/1991|EPI239832|A/H1N1
323 7|MP|EPI_ISL_33887|A/swine/Fujian/204/2007|EPI190598|A/H1N1
88 7|MP|EPI_ISL_151928|A/turkey/Egypt/S6405A/2012|EPI493036|A/H5N1
55 7|MP|EPI_ISL_28122|A/chicken/New_York/87493-2/2005|EPI172308|A/H7N2
33 7|MP|EPI_ISL_5860|A/chicken/New_York/28263/1989|EPI42809|A/H6N3
30 7|MP|EPI_ISL_3011|A/mute_swan/Germany/R1359/07|EPI171731|A/H5N1
14 7|MP|EPI_ISL_198159|A/swine/Indiana/A01260888/2015|EPI653017|A/H1N2
12 7|MP|EPI_ISL_170078|A/chicken/Changzhou/c02/2013|EPI556436|A/H3N2
9 7|MP|EPI_ISL_8791|A/shorebird/DE/68/2004|EPI86570|A/H13N9
9 7|MP|EPI_ISL_197217|A/swine/North_Carolina/A02076926/2015|EPI647521|A/H1N1
8 7|MP|EPI_ISL_30544|A/duck/Laos/A0503/2007|EPI180866|A/H5N1
7 7|MP|EPI_ISL_22987|A/chicken/New_York/10196-5/2005|EPI160738|A/_H7N2
This seems to mirror how the viral strains (subtypes) are represented in the databases, so it might be an evolutionary feature.
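One way to check this would be to tabulate the subtype of every member within each cluster. The following is only a minimal sketch: it assumes the cluster TSV has two tab-separated columns (representative, member) and that the subtype is the last `|`-delimited field of each header, as in the records above. The function name `subtypes_per_cluster` is made up for illustration.

```shell
# Sketch: count member subtypes per cluster representative.
# Assumption: TSV columns are representative<TAB>member, and the subtype
# is the last |-delimited field of the member header.
subtypes_per_cluster() {
    awk -F'\t' '{
        n = split($2, f, "|")      # split the member header on "|"
        print $1 "\t" f[n]         # representative, member subtype
    }' "$1" | sort | uniq -c | sort -k1,1nr
}

# Usage (file name taken from the command above):
#   subtypes_per_cluster lc_trim_gisaid_7.fasta_cluster.tsv
```

Each output line is a count, a representative, and a subtype, so a cluster that mixes several subtypes shows up as several lines sharing the same representative.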
What is throwing me off is that if I extract the records from the first cluster and cluster them again on their own, I don't get a single representative, but a few of them.
Re-clustering the 5166 records from the first cluster with:
mmseqs easy-linclust custer1s7.fasta cluster1s7res tmp
5132 7|MP|EPI_ISL_96739|A/duck/Malaysia/2001|EPI336730|A_/H9N2
22 7|MP|EPI_ISL_12743|A/chicken/Taiwan/2838V/00|EPI119818|A/H6N1
12 7|MP|EPI_ISL_8874|A/laughing_gull/DE/2838/1987|EPI88075|A/_H13N2
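For reference, the extraction step mentioned above (pulling the members of one cluster out of the original FASTA before re-clustering) can be sketched as follows. This is one possible way to do it, not necessarily the exact commands used here. It assumes the IDs in the cluster TSV equal the FASTA header up to the first whitespace; `extract_cluster` and `input.fasta` are hypothetical names.

```shell
# Sketch: print the FASTA records of every member of one cluster.
# Args: representative ID, cluster TSV, FASTA file.
# Assumption: TSV IDs match the FASTA header up to the first whitespace.
extract_cluster() {
    awk -F'\t' -v rep="$1" '
        # First file (TSV): remember members of the requested cluster.
        NR == FNR { if ($1 == rep) keep[$2] = 1; next }
        # Second file (FASTA): on each header, decide whether to print.
        /^>/ {
            id = substr($1, 2)          # drop the leading ">"
            sub(/[ \t].*/, "", id)      # keep the ID up to first whitespace
            printing = (id in keep)
        }
        printing                        # print header and sequence lines
    ' "$2" "$3"
}

# Usage (input.fasta is a placeholder for the original FASTA):
#   extract_cluster '7|MP|EPI_ISL_90154|A/chicken/China/AH-10-01/2010|EPI317184|A_/H9N2' \
#       lc_trim_gisaid_7.fasta_cluster.tsv input.fasta > custer1s7.fasta
```

Multi-line sequence records are kept intact, because the print flag only changes on header lines.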
Second question:
Is this expected behaviour? Why didn't the two small clusters appear the first time I clustered all the data?