Files accepted
PanGraphViewer accepts three data formats for pangenome graph visualization: the rGFA format, the GFA_v1 format and the VCF format.
PanGraphViewer is designed around the reference GFA (rGFA) format because of the flexibility of this data format.
If users have multiple high-quality genome assemblies from different individuals, they may use minigraph (Linux preferred) to generate an rGFA file.
Before running minigraph, the header of each fasta file needs to be modified. For example, if users have a fasta file from Sample1 with a header like:
>chr1
AAAAAGCCGCGCGCGCTTGCGC
Users need to modify the header to:
>Sample1||chr1
AAAAAGCCGCGCGCGCTTGCGC
On Linux, the following commands can be used to achieve this:
sample="" ## the name of the sample. For instance: Sample1
fasta="" ## full path to the fasta file
name=$(echo "$fasta" | rev | cut -d"." -f2- | rev) ## path without the final extension
sed -e "s/^>/>${sample}||/" "$fasta" > "${name}.headerModified.fasta"
In PanGraphViewer, we provide a Python script, renameFastaHeader.py, to help with this conversion. The script can be found in the scripts folder under panGraphViewer --> panGraphViewerApp. Users can also do the conversion in the Desktop version by clicking Tools --> Format Conversion --> Modify FASTA Header.
usage: renameFastaHeader.py [-h] [--version] [-f FASTA] [-n NAME] [-d DELIM] [-o OUTPUT]
rename the header of a given fasta file
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-f FASTA a fasta format file
-n NAME name of the sample
-d DELIM delimiter. Default: '||'
-o OUTPUT the output directory
NOTE: For the sample name, please DO NOT include ||.
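The header change described above can be sketched in a few lines of Python. This is only an illustration of the transformation, not the code of renameFastaHeader.py; the function name `rename_fasta_headers` and its arguments are hypothetical.

```python
import io


def rename_fasta_headers(lines, sample, delim="||"):
    """Yield FASTA lines with every header rewritten to '>sample<delim>original'.

    Hypothetical helper for illustration; not part of PanGraphViewer.
    """
    if delim in sample:
        # Mirrors the NOTE above: the sample name must not contain the delimiter.
        raise ValueError(f"sample name must not contain {delim!r}")
    for line in lines:
        if line.startswith(">"):
            # Only header lines change; sequence lines pass through untouched.
            yield f">{sample}{delim}{line[1:]}"
        else:
            yield line


fasta = io.StringIO(">chr1\nAAAAAGCCGCGCGCGCTTGCGC\n")
renamed = "".join(rename_fasta_headers(fasta, "Sample1"))
print(renamed, end="")
```

For real files, the same generator can be fed an open file handle and its output written line by line, so the whole assembly never has to be held in memory.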
If users do not modify the headers of the fasta files and use minigraph directly to generate the rGFA file, PanGraphViewer can still read the file, but some features, for example showing which sample a node comes from, will not be available in detail. A warning message will also be displayed in both the UI and the opened terminal or PowerShell window.
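To see why the header matters, recall that an rGFA segment (S) line stores the name of its source sequence in the `SN:Z:` tag. A minimal sketch, assuming the sample name is recovered by splitting that tag on the `||` delimiter (how PanGraphViewer does this internally is not shown here, and `segment_origin` is a hypothetical helper):

```python
def segment_origin(s_line, delim="||"):
    """Return (sample, contig) from the SN:Z: tag of an rGFA S line.

    sample is None when the sequence name carries no delimiter,
    i.e. when the FASTA header was not modified before running minigraph.
    """
    fields = s_line.rstrip("\n").split("\t")
    if fields[0] != "S":
        raise ValueError("not a segment (S) line")
    # Optional tags start after the segment name and sequence columns.
    for tag in fields[3:]:
        if tag.startswith("SN:Z:"):
            name = tag[len("SN:Z:"):]
            sample, sep, contig = name.partition(delim)
            return (sample, contig) if sep else (None, name)
    return (None, None)


with_sample = segment_origin("S\ts1\tACGT\tSN:Z:Sample1||chr1\tSO:i:0\tSR:i:0\n")
without_sample = segment_origin("S\ts1\tACGT\tSN:Z:chr1\tSO:i:0\tSR:i:0\n")
print(with_sample, without_sample)
```

With an unmodified header only the contig name is available, which is consistent with the reduced detail described above.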
If users have a GFA_v1 file with paths (P) rather than an rGFA file, they may follow the standard here to convert the GFA_v1 file into an rGFA file.
In case this is difficult, we provide an internal function to help with the conversion. Users can simply select a GFA_v1 file in PanGraphViewer to browse the underlying graph. However, a warning message will be shown if a GFA_v1 file rather than an rGFA file is selected. The program also checks internally whether the input GFA_v1 file contains path (P) lines. If it does not, an error message will be displayed.
NOTE: if the GFA_v1 file is big (> 1Gb), the conversion from GFA_v1 to rGFA inside the application will take a while. In that case, we recommend using gfa2rGFA.py, located in the scripts folder, to perform the conversion.
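Before loading a large GFA_v1 file, users may want to confirm that it contains path (P) lines. The check is simple to sketch; `has_path_lines` is a hypothetical helper, not a PanGraphViewer function, and the tiny graph below is made up for illustration.

```python
def has_path_lines(gfa_lines):
    """Return True if any line is a GFA_v1 path (P) record."""
    return any(line.startswith("P\t") for line in gfa_lines)


# A made-up GFA_v1 graph with two segments, one link and one path.
gfa = [
    "H\tVN:Z:1.0\n",
    "S\t1\tACGT\n",
    "S\t2\tGGA\n",
    "L\t1\t+\t2\t+\t0M\n",
    "P\tSample1\t1+,2+\t*\n",
]

print(has_path_lines(gfa))      # the graph above has a P line
print(has_path_lines(gfa[:4]))  # the same graph without its P line
```

Because `any` stops at the first match, passing an open file handle scans only as far as the first P line.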
A VCF file is also accepted to show the pangenome graph. A reference fasta file is optional if the VCF file is a standard one. The program automatically checks whether the input VCF file meets the requirements. If it does not, a warning or an error message will be shown.
Depending on the purpose, VCF filtration is highly recommended before plotting the underlying graph.
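Which filter is appropriate depends on the data. As one possible example, the sketch below keeps only records whose FILTER column is `PASS` or `.` and whose QUAL is at least a chosen threshold. The function name `filter_vcf` and the default threshold of 30 are assumptions made for illustration, not recommendations from PanGraphViewer.

```python
def filter_vcf(lines, min_qual=30.0):
    """Yield header lines plus records that pass FILTER and a QUAL threshold.

    Hypothetical helper; min_qual=30.0 is an arbitrary illustrative value.
    """
    for line in lines:
        if line.startswith("#"):
            yield line  # keep meta-information and the column header
            continue
        fields = line.rstrip("\n").split("\t")
        qual, filt = fields[5], fields[6]  # VCF columns 6 (QUAL) and 7 (FILTER)
        if filt not in ("PASS", "."):
            continue
        if qual != "." and float(qual) < min_qual:
            continue
        yield line


vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n",
    "chr1\t200\t.\tC\tT\t10\tPASS\t.\n",
    "chr1\t300\t.\tG\tA\t60\tLowQual\t.\n",
]
kept = list(filter_vcf(vcf))
print("".join(kept), end="")
```

Dedicated tools such as bcftools offer far richer filtering; this sketch only shows the idea.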
We provide a method to convert a VCF file to an rGFA file. Users can perform the conversion through the interface provided in the application or use vcf2rGFA.py directly, which is under the panGraphViewer --> panGraphViewerApp --> scripts folder.
NOTE: if there are many variations in the VCF file, we recommend using vcf2rGFA.py directly to convert one chromosome or sequence at a time rather than the whole file at once. This will save a lot of computational resources when plotting graphs.
The usage of vcf2rGFA.py is shown below. Both Windows and Linux/macOS users can directly use this script to convert a VCF file to an rGFA file.
usage: vcf2rGFA.py [-h] [--version] [-f FASTA] [-b BACKBONE] [-v VCF] [-o OUTPUT] [-c [CHR [CHR ...]]] [-n NTHREAD]
Convert a vcf file to an rGFA file
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-f FASTA a fasta format file that from the backbone sample
-b BACKBONE the name of the backbone sample
-v VCF the vcf file
-o OUTPUT the output directory
-c [CHR [CHR ...]] the name of the chromosome(s) [default: all chroms]
-n NTHREAD number of threads [default: 4]
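To convert chromosome by chromosome, as recommended in the NOTE above, the script can be called once per chromosome. The sketch below only builds the command lines from the options listed in the usage message. The helper name `build_vcf2rgfa_commands`, the file names, and the choice of one output subdirectory per chromosome are assumptions for illustration.

```python
import sys


def build_vcf2rgfa_commands(vcf, backbone, outdir, chroms, fasta=None,
                            threads=4, script="vcf2rGFA.py"):
    """Return one vcf2rGFA.py command (as an argument list) per chromosome.

    Hypothetical helper; it uses only the flags shown in the usage message.
    """
    commands = []
    for chrom in chroms:
        cmd = [sys.executable, script,
               "-b", backbone,
               "-v", vcf,
               "-o", f"{outdir}/{chrom}",  # assumption: separate folder per chromosome
               "-c", chrom,
               "-n", str(threads)]
        if fasta is not None:
            cmd += ["-f", fasta]  # the backbone fasta is optional for a standard VCF
        commands.append(cmd)
    return commands


commands = build_vcf2rgfa_commands("variants.vcf", "Sample1", "out",
                                   ["chr1", "chr2"], fasta="Sample1.fasta")
for cmd in commands:
    print(" ".join(cmd))
```

Each argument list can then be passed to `subprocess.run(cmd, check=True)` to perform the conversion.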
If users want to check which nodes fall in a gene model region, a BED, GTF or GFF3 file from the backbone sample can be provided. The BED file should contain at least 6 columns, as shown below.
| Column | Information |
|---|---|
| 1 | Chromosome ID |
| 2 | Gene start position |
| 3 | Gene end position |
| 4 | Gene ID |
| 5 | Score (or others; the program does not use the info in this column) |
| 6 | Orientation |
- Users can load the BED, GTF or GFF3 file to the application to check the coordinate overlaps between nodes and genes.
- By default, genes overlapping with more than 2 nodes will be shown in the dropdown menu.
- A gene list will be saved in the output directory after parsing the annotation file.
- When using this function, a plot containing the selected gene and nodes falling in the gene region will be shown. A subgraph of the gene region will also be shown.
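The overlap check between nodes and genes can be sketched as follows. This is not PanGraphViewer's implementation: the helper names, the node tuples, and the use of 0-based half-open intervals (the usual BED convention) are assumptions made for illustration.

```python
from collections import namedtuple

Gene = namedtuple("Gene", "chrom start end gene_id score strand")


def parse_bed6(lines):
    """Parse BED lines with at least the 6 columns listed in the table above."""
    genes = []
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 6:
            raise ValueError("BED line has fewer than 6 columns: " + line.strip())
        genes.append(Gene(f[0], int(f[1]), int(f[2]), f[3], f[4], f[5]))
    return genes


def genes_with_many_nodes(genes, nodes, min_nodes=2):
    """Map gene ID -> overlapping node IDs for genes hit by more than min_nodes nodes.

    nodes is a list of (node_id, chrom, start, end) on the backbone sample.
    """
    hits = {}
    for g in genes:
        overlapping = [nid for nid, chrom, start, end in nodes
                       if chrom == g.chrom and start < g.end and end > g.start]
        if len(overlapping) > min_nodes:
            hits[g.gene_id] = overlapping
    return hits


bed = ["chr1\t100\t500\tgeneA\t0\t+\n", "chr1\t900\t1000\tgeneB\t0\t-\n"]
nodes = [("s1", "chr1", 0, 150), ("s2", "chr1", 150, 300),
         ("s3", "chr1", 300, 600), ("s4", "chr1", 600, 950)]
hits = genes_with_many_nodes(parse_bed6(bed), nodes)
print(hits)
```

This nested loop compares every gene with every node, which is fine for a sketch; for whole genomes, sorting by coordinate or using an interval tree would scale better.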