Hello, I am having trouble removing redundant sequences from my assemblies. Could you give me some advice?
I am doing transcriptome-based virus discovery. After removing host reads, I assemble each dataset downloaded from the SRA separately. My plan is to pool these assemblies and then identify candidate viruses by alignment. I found two MMseqs2 workflows for removing redundancy, easy-cluster and easy-linclust, and I also came across cd-hit-est for the same purpose. I am not sure whether MMseqs2 is suitable for deduplicating and merging my transcriptome assemblies. If I want a stricter clustering threshold, which parameters should I pay attention to? I hope you can help.
As a first attempt I tried mmseqs easy-linclust, which is much faster than cd-hit-est:
mmseqs easy-linclust virus.candidate.fasta mmseqs.cluster ./mmseqs.tmp --threads 60
This produced mmseqs.cluster_all_seqs.fasta, mmseqs.cluster_cluster.tsv, and mmseqs.cluster_rep_seq.fasta. I understand that mmseqs.cluster_rep_seq.fasta is the deduplicated result, but I also want the cluster membership information so that I can see how each virus sequence is distributed across samples. Which file should I look at, or which parameters should I set?
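For reference, the cluster membership is recorded in mmseqs.cluster_cluster.tsv, which has two tab-separated columns: the representative sequence ID and a member sequence ID, one member per line. Below is a minimal Python sketch of how that file might be summarized per sample. The helper names `sample_of` and `cluster_sample_table` are hypothetical, and the sketch assumes each contig ID starts with the sample accession followed by an underscore (for example `SRR123_k141_5`). That naming scheme is not stated above, so adjust `sample_of` to match the real headers.

```python
import csv
from collections import defaultdict


def sample_of(seq_id: str) -> str:
    # ASSUMPTION: IDs look like "<sample>_<contig>"; change this to fit your headers.
    return seq_id.split("_", 1)[0]


def cluster_sample_table(tsv_path: str) -> dict:
    """Count, for each cluster representative, how many members come from each sample."""
    counts = defaultdict(lambda: defaultdict(int))
    with open(tsv_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            representative, member = row[0], row[1]
            counts[representative][sample_of(member)] += 1
    # Convert nested defaultdicts to plain dicts for easier printing and comparison.
    return {rep: dict(samples) for rep, samples in counts.items()}
```

Calling `cluster_sample_table("mmseqs.cluster_cluster.tsv")` would return a dictionary keyed by representative ID, where each value maps a sample name to the number of its contigs in that cluster. The representative itself appears as a member of its own cluster, so it is counted too.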