Files accepted

Pangenome graph files

PanGraphViewer accepts three data formats for pangenome graph visualization: rGFA, GFA_v1 and VCF.

rGFA

PanGraphViewer is designed based on the reference GFA (rGFA) file given the flexibility of this data format.

If users have multiple high-quality genome assemblies from different individuals, they can use minigraph (Linux preferred) to generate an rGFA file.

Before running minigraph, the header of each fasta file needs to be modified. For example, if users have a fasta file from Sample1 with a header like:

>chr1
AAAAAGCCGCGCGCGCTTGCGC

Users need to modify the header to:

>Sample1||chr1
AAAAAGCCGCGCGCGCTTGCGC

On Linux, the following commands can be used to do this:

sample=""  ## the name of the sample. For instance: Sample1
fasta=""   ## full path to the fasta file
name=$(echo "$fasta" | rev | cut -d"." -f2- | rev)  ## path without the final extension
sed -e "s/^>/>${sample}||/" "$fasta" > "${name}.headerModified.fasta"  ## only the leading ">" of each header is changed
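
For users who cannot run the shell commands above (for example on Windows), the same header change can be sketched in Python. This is a minimal illustration and not the bundled renameFastaHeader.py; the function name `prefix_fasta_headers` and its `delim` parameter are assumptions made for this example.

```python
def prefix_fasta_headers(lines, sample, delim="||"):
    """Yield FASTA lines with `sample + delim` inserted after '>' on header lines."""
    if delim in sample:
        # The sample name must not contain the delimiter itself.
        raise ValueError(f"sample name must not contain {delim!r}")
    for line in lines:
        if line.startswith(">"):
            yield f">{sample}{delim}{line[1:]}"
        else:
            # Sequence lines are passed through unchanged.
            yield line


records = [">chr1\n", "AAAAAGCCGCGCGCGCTTGCGC\n"]
renamed = list(prefix_fasta_headers(records, "Sample1"))
print("".join(renamed), end="")
```

For a real file, an open file handle can be passed as `lines` and the result written with `writelines`.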

In PanGraphViewer, we have provided a Python script, renameFastaHeader.py, to help with this conversion. The script can be found in the scripts folder under panGraphViewer --> panGraphViewerApp. Users can also use the Desktop version to convert by clicking Tools --> Format Conversion --> Modify FASTA Header.

usage: renameFastaHeader.py [-h] [--version] [-f FASTA] [-n NAME] [-d DELIM] [-o OUTPUT]

rename the header of a given fasta file

optional arguments:
  -h, --help  show this help message and exit
  --version   show program's version number and exit
  -f FASTA    a fasta format file
  -n NAME     name of the sample
  -d DELIM    delimiter. Default: '||'
  -o OUTPUT   the output directory

NOTE: For the sample name, please DO NOT include ||.

If users do not modify the fasta headers and use minigraph directly to generate the rGFA file, PanGraphViewer can still read the file, but some features, such as showing which sample a node comes from, will not be displayed in detail. A warning message will also be displayed in both the UI and the open terminal or PowerShell window.
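
To illustrate why the header prefix matters, the sketch below reads the stable sequence name (the `SN:Z:` tag that rGFA attaches to segment lines) and splits it on the `||` delimiter. It is only an illustration of the idea, not PanGraphViewer's own parser; the function name `node_origin` and the example lines are assumptions.

```python
def node_origin(segment_line, delim="||"):
    """Return (sample, contig) taken from the SN tag of an rGFA segment line.

    `sample` is None when the stable name carries no sample prefix.
    """
    fields = segment_line.rstrip("\n").split("\t")
    if fields[0] != "S":
        raise ValueError("not a segment (S) line")
    # Optional tags follow the segment name and sequence as TAG:TYPE:VALUE.
    tags = {}
    for field in fields[3:]:
        name, _type, value = field.split(":", 2)
        tags[name] = value
    stable_name = tags.get("SN")
    if stable_name is None:
        return None, None
    sample, sep, contig = stable_name.partition(delim)
    if not sep:
        # Header was not modified: only the contig name is known.
        return None, stable_name
    return sample, contig


modified = "S\ts1\tACGT\tSN:Z:Sample1||chr1\tSO:i:0\tSR:i:0"
unmodified = "S\ts1\tACGT\tSN:Z:chr1\tSO:i:0\tSR:i:0"
print(node_origin(modified))
print(node_origin(unmodified))
```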

GFA_v1

If users do not have an rGFA file but have a GFA_v1 file with paths (P), they may follow the standard here to convert the GFA_v1 file into an rGFA file.

In case this is difficult, we have provided an internal function to help with the conversion. Users can simply select a GFA_v1 file in PanGraphViewer to browse the underlying graph. A warning message is shown if a GFA_v1 file rather than an rGFA file is selected. The program also checks internally whether the input GFA_v1 file contains path (P) lines; if not, an error message is displayed.
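
Users who want to confirm that a GFA_v1 file contains path lines before loading it can run a check along these lines. This is a rough sketch of such a check, not the program's internal function; the name `has_path_lines` and the small example graphs are made up for illustration.

```python
import io


def has_path_lines(handle):
    """Return True if any line of a GFA_v1 file is a path (P) record."""
    return any(line.startswith("P\t") for line in handle)


gfa_with_path = (
    "H\tVN:Z:1.0\n"
    "S\t1\tACGT\n"
    "S\t2\tGG\n"
    "L\t1\t+\t2\t+\t0M\n"
    "P\tSample1\t1+,2+\t*\n"
)
gfa_without_path = "H\tVN:Z:1.0\nS\t1\tACGT\n"

print(has_path_lines(io.StringIO(gfa_with_path)))
print(has_path_lines(io.StringIO(gfa_without_path)))
```

For a file on disk, pass an open file handle instead of `io.StringIO`; the file is scanned line by line, so large files are not loaded into memory.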

NOTE: if the GFA_v1 file is large (> 1 GB), the in-app conversion from GFA_v1 to rGFA will take a while. For large files, we recommend using gfa2rGFA.py, located in the scripts folder, to perform the conversion.

VCF

A VCF file can also be used to show the pangenome graph. A reference fasta file is optional if the VCF is a standard one. The program automatically checks whether the input VCF file meets the requirements; if not, a warning or an error message is shown.

Depending on the purpose, filtering the VCF is highly recommended before plotting the underlying graph.

We have provided a method to convert a VCF file to an rGFA file. Users can perform the conversion through the interface provided in the application or run vcf2rGFA.py directly from the panGraphViewer --> panGraphViewerApp --> scripts folder.

NOTE: if the VCF file contains many variants, we recommend using vcf2rGFA.py directly to convert by chromosome/sequence rather than converting the whole file at once. This saves a lot of computational resources when plotting graphs.

The usage of vcf2rGFA.py is shown below. Both Windows and Linux/macOS users can directly use this script to convert a VCF file to an rGFA file.

usage: vcf2rGFA.py [-h] [--version] [-f FASTA] [-b BACKBONE] [-v VCF] [-o OUTPUT] [-c [CHR [CHR ...]]] [-n NTHREAD]
    
Convert a vcf file to an rGFA file
    
optional arguments:
    -h, --help          show this help message and exit
    --version           show program's version number and exit
    -f FASTA            a fasta format file that from the backbone sample
    -b BACKBONE         the name of the backbone sample
    -v VCF              the vcf file
    -o OUTPUT           the output directory
    -c [CHR [CHR ...]]  the name of the chromosome(s) [default: all chroms]
    -n NTHREAD          number of threads [default: 4]
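
As a rough sketch of the per-chromosome conversion recommended above, the Python snippet below builds one vcf2rGFA.py command per chromosome using only the options listed in the usage text. The file names, the sample name, and the choice of a separate output directory per chromosome are placeholders and assumptions, not requirements of the script.

```python
import shlex
import sys


def vcf2rgfa_command(script, vcf, backbone, outdir, fasta=None, chroms=(), threads=4):
    """Build the argument list for one vcf2rGFA.py run."""
    cmd = [sys.executable, script, "-v", vcf, "-b", backbone,
           "-o", outdir, "-n", str(threads)]
    if fasta is not None:
        # The backbone fasta is only added when one is supplied.
        cmd += ["-f", fasta]
    if chroms:
        # -c accepts one or more chromosome names.
        cmd += ["-c", *chroms]
    return cmd


# Hypothetical inputs: one command per chromosome.
commands = [
    vcf2rgfa_command("vcf2rGFA.py", "variants.vcf", "Sample1",
                     f"out_{chrom}", chroms=[chrom])
    for chrom in ("chr1", "chr2")
]
for cmd in commands:
    print(shlex.join(cmd))
```

Each list can then be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.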

Annotation files

If users want to check which nodes fall in a gene model region, a BED, GTF or GFF3 file from the backbone sample can be provided. The BED file should contain at least 6 columns, as shown below.

Column	Information
1	Chromosome ID
2	Gene start position
3	Gene end position
4	Gene ID
5	Score (or others; the program does not use the info in this column)
6	Orientation

Users can load the BED, GTF or GFF3 file to the application to check the coordinate overlaps between nodes and genes.
By default, genes overlapping with more than 2 nodes will be shown in the dropdown menu.
A gene list will be saved in the output directory after parsing the annotation file.
When using this function, a plot containing the selected gene and nodes falling in the gene region will be shown. A subgraph of the gene region will also be shown.
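
The coordinate overlap check described above can be sketched as follows. This is a simplified illustration, not the application's implementation: the `Gene` tuple, the function names, the example nodes, and the use of half-open BED-style intervals for node coordinates are all assumptions.

```python
from collections import namedtuple

Gene = namedtuple("Gene", "chrom start end gene_id strand")


def parse_bed_line(line):
    """Parse one BED line with at least the 6 columns listed in the table."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 6:
        raise ValueError("expected at least 6 tab-separated columns")
    # Column 5 (score) is ignored, as the program does not use it.
    return Gene(cols[0], int(cols[1]), int(cols[2]), cols[3], cols[5])


def overlapping_nodes(gene, nodes):
    """Return ids of nodes whose interval overlaps the gene on the same chromosome.

    `nodes` is an iterable of (node_id, chrom, start, end) tuples.
    """
    return [
        node_id
        for node_id, chrom, start, end in nodes
        if chrom == gene.chrom and start < gene.end and gene.start < end
    ]


gene = parse_bed_line("chr1\t100\t500\tgeneA\t0\t+\n")
nodes = [
    ("s1", "chr1", 0, 150),
    ("s2", "chr1", 150, 300),
    ("s3", "chr1", 300, 600),
    ("s4", "chr1", 600, 700),
    ("s5", "chr2", 100, 200),
]
hits = overlapping_nodes(gene, nodes)
# Mirror the default rule: keep genes overlapping more than 2 nodes.
print(gene.gene_id, hits, len(hits) > 2)
```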
