question about easy-linclust with very large, redundant DNA datasets

Hi!
[verbose text coming]
I'm working with Avian Influenza Virus datasets. The viral genome consists of 8 short segments, each coding for one gene and ranging between 700 and 2300 nt. The repositories hold hundreds of thousands of accessions. For any given segment, the sequences are all of very similar length.
I have used easy-linclust to reduce a large set of genomic segments to a small reference set for guiding assemblies, with great success.
I included the clustered sequence representatives in an evolutionary analysis using BEAST to describe the genomes we have sequenced. Virologists don't usually do this; they use the whole redundant dataset to make big-ass alignments and then collapse everything they don't want to see in the resulting trees. I'm being asked what the rationale is for using the representative sequences, instead of the unclustered data, to provide phylogenetic context and external references for our alignments/trees.
I'm having trouble explaining the relationship between the clusters and the phylogenetic history of the data.

First question:
As a general query, is my approach valid? Or flawed? 

Then, when I explore the clusters made by easy-linclust, I usually find one very large cluster and a few clusters with very few sequences. For example, from a dataset of 6367 records I see:


/gisaid_July2025/trim_noen$ cut -f 1 lc_trim_gisaid_7.fasta_cluster.tsv | sort | uniq -c | sort -rn
   5166 7|MP|EPI_ISL_90154|A/chicken/China/AH-10-01/2010|EPI317184|A_/_H9N2
    553 7|MP|EPI_ISL_68841|A/chicken/PA/35154/1991|EPI239832|A_/_H1N1
    323 7|MP|EPI_ISL_33887|A/swine/Fujian/204/2007|EPI190598|A_/_H1N1
     88 7|MP|EPI_ISL_151928|A/turkey/Egypt/S6405A/2012|EPI493036|A_/_H5N1
     55 7|MP|EPI_ISL_28122|A/chicken/New_York/87493-2/2005|EPI172308|A_/_H7N2
     33 7|MP|EPI_ISL_5860|A/chicken/New_York/28263/1989|EPI42809|A_/_H6N3
     30 7|MP|EPI_ISL_3011|A/mute_swan/Germany/R1359/07|EPI171731|A_/_H5N1
     14 7|MP|EPI_ISL_198159|A/swine/Indiana/A01260888/2015|EPI653017|A_/_H1N2
     12 7|MP|EPI_ISL_170078|A/chicken/Changzhou/c02/2013|EPI556436|A_/_H3N2
      9 7|MP|EPI_ISL_8791|A/shorebird/DE/68/2004|EPI86570|A_/_H13N9
      9 7|MP|EPI_ISL_197217|A/swine/North_Carolina/A02076926/2015|EPI647521|A_/_H1N1
      8 7|MP|EPI_ISL_30544|A/duck/Laos/A0503/2007|EPI180866|A_/_H5N1
      7 7|MP|EPI_ISL_22987|A/chicken/New_York/10196-5/2005|EPI160738|A_/_H7N2
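
For anyone who wants the same summary with each cluster's share of the dataset, here is a minimal sketch. It assumes the two-column (representative, member) tab-separated layout of the `_cluster.tsv` file shown above; the function name `cluster_sizes` is made up for illustration.

```shell
#!/usr/bin/env bash
# cluster_sizes TSV
# Prints "size<TAB>percent<TAB>representative", largest cluster first.
# Assumption: column 1 is the representative, column 2 is a member,
# one member per line, tab-separated.
cluster_sizes() {
    awk -F'\t' '
        { n[$1]++; total++ }          # count members per representative
        END {
            for (rep in n)
                printf "%d\t%.1f\t%s\n", n[rep], 100 * n[rep] / total, rep
        }
    ' "$1" | sort -t $'\t' -k1,1nr    # numeric sort on size, descending
}
```

Usage would be something like `cluster_sizes lc_trim_gisaid_7.fasta_cluster.tsv | head`.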


This mirrors the abundance of the viral strains (subtypes) in the databases, so it might be an evolutionary feature.
What is throwing me off is that if I extract the records from the first cluster and cluster them again by themselves, I don't get a single representative, but a few of them.
Re-clustering the 5166 records from the first cluster with:

mmseqs easy-linclust custer1s7.fasta cluster1s7res tmp

   5132 7|MP|EPI_ISL_96739|A/duck/Malaysia/2001|EPI336730|A_/_H9N2
     22 7|MP|EPI_ISL_12743|A/chicken/Taiwan/2838V/00|EPI119818|A_/_H6N1
     12 7|MP|EPI_ISL_8874|A/laughing_gull/DE/2838/1987|EPI88075|A_/_H13N2
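
To make the extraction step reproducible, here is a minimal sketch of one way to pull the members of a single cluster out of a FASTA file. It is not necessarily how the file above was produced. It assumes the member IDs in the cluster TSV equal the FASTA header text up to the first whitespace; the function name `extract_cluster` and the file name `input.fasta` are hypothetical.

```shell
#!/usr/bin/env bash
# extract_cluster REP TSV FASTA
# Prints the FASTA records whose ID is a member of the cluster represented by REP.
# Assumption: TSV is representative<TAB>member; a FASTA ID is the header
# without ">" and cut at the first space or tab.
extract_cluster() {
    REP="$1" awk -F'\t' '
        # First file (cluster TSV): remember the members of the wanted cluster.
        NR == FNR { if ($1 == ENVIRON["REP"]) keep[$2] = 1; next }
        # Second file (FASTA): on each header, decide whether to print the record.
        /^>/ {
            id = substr($0, 2)
            sub(/[ \t].*/, "", id)
            printing = (id in keep)
        }
        printing
    ' "$2" "$3"
}
```

Usage would be something like `extract_cluster '7|MP|EPI_ISL_90154|A/chicken/China/AH-10-01/2010|EPI317184|A_/_H9N2' lc_trim_gisaid_7.fasta_cluster.tsv input.fasta > custer1s7.fasta`, with the representative quoted because the IDs contain `|`.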

Second question:
Is this expected behaviour? Why didn't the two small clusters appear the first time I clustered all the data?
 
