Bader2025

This is the public code and data repository for Bader et al. 2025.

Purposes

Provide a paired BCR sequence database assembled from PBMC samples of healthy, unvaccinated subjects, totaling 524,694 paired BCR sequences, of which 450,968 are from naïve B cells.
Provide the R script used to process affinity measurement data acquired by SPR.

Abstract

This pipeline processes paired heavy and light chain antibody sequences from the Observed Antibody Space (OAS) database and from the DeKosky and Leuko datasets. The sequences are annotated using SADIE's IgBLAST integration and saved as Parquet files with complete AIRR-compliant annotations and metadata for downstream analysis.

The primary objective of this paired BCR database is to determine how frequently specific heavy and light chain genes pair in the human antibody repertoire and, from those frequencies, to calculate putative frequencies of naive B cells with unique immunogenetic signatures. The database also supports other research applications. For example, Bader et al. 2025 used this BCR dataset to identify IGHV1-46/IGKV3-20-paired BCRs and to select representative HCDR3 loops that capture the natural HCDR3 diversity observed in human antibody repertoires for this heavy/light chain gene pairing.
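As a rough illustration of the pairing-frequency analysis described above, the sketch below counts heavy/light V-gene combinations with pandas and pulls out the HCDR3s of one pairing. The column names `v_call_heavy`, `v_call_light` and `cdr3_aa_heavy` are taken from the Output Example section; the `gene_name` helper, the assumption that calls look like `IGHV1-46*01`, and the toy rows are illustrative and not part of the published pipeline.

```python
import pandas as pd


def gene_name(call: str) -> str:
    """Reduce a call such as 'IGHV1-46*01,IGHV1-46*03' to the gene 'IGHV1-46'."""
    return call.split(",")[0].split("*")[0]


def pairing_frequencies(paired: pd.DataFrame) -> pd.Series:
    """Fraction of paired BCRs observed for each heavy/light V-gene combination."""
    genes = pd.DataFrame({
        "v_gene_heavy": paired["v_call_heavy"].map(gene_name),
        "v_gene_light": paired["v_call_light"].map(gene_name),
    })
    return genes.value_counts(normalize=True)


# Toy stand-in for a table loaded from data/parquet-paired/ (values are made up).
toy = pd.DataFrame({
    "v_call_heavy": ["IGHV1-46*01", "IGHV1-46*03", "IGHV3-23*01", "IGHV1-46*01"],
    "v_call_light": ["IGKV3-20*01", "IGKV3-20*01", "IGKV1-39*01", "IGLV2-14*01"],
    "cdr3_aa_heavy": ["ARDGYSSGWY", "ARGGNWFDP", "AKDLGY", "ARVTMVRGV"],
})

freqs = pairing_frequencies(toy)

# HCDR3 loops of the IGHV1-46/IGKV3-20 pairing highlighted in the abstract.
is_pair = (
    toy["v_call_heavy"].map(gene_name).eq("IGHV1-46")
    & toy["v_call_light"].map(gene_name).eq("IGKV3-20")
)
hcdr3s = toy.loc[is_pair, "cdr3_aa_heavy"].tolist()
```

On the real data the same functions would be applied to a concatenation of the Parquet files; whether alleles should be collapsed to genes, as done here, depends on the question being asked.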

The repository also contains the R script used to automatically process the affinity measurement data acquired by Surface Plasmon Resonance (SPR) on a Carterra LSR instrument.

Directory Structure

data/
├── OAS_paired/            # OAS paired sequence CSV files
│   ├── ERR4082227_paired.csv
│   ├── ERR4082235_paired.csv
│   └── ...
├── DeKosky_paired/        # DeKosky dataset CSV files
│   ├── SRR1585248_joined_NoAlleles.csv
│   ├── SRR1585265_joined_NoAlleles.csv
│   └── ...
├── D326651_Leuko_human_naive.csv  # Leuko dataset
├── oas_manifest.csv       # OAS metadata manifest
├── fasta-heavy/           # Generated heavy chain FASTA files
│   ├── ERR4082227.fasta
│   └── ...
├── fasta-light/           # Generated light chain FASTA files
│   ├── ERR4082227.fasta
│   └── ...
├── parquet-paired/        # Final output Parquet files
│   ├── ERR4082227.parquet
│   ├── SRR1585248.parquet
│   ├── SRR1585265.parquet
│   └── ...
├── README.md
└── run-sadie.ipynb          # Main processing notebook

Processing Pipeline Diagram

┌─────────────────────────────────────────────────────────────┐
│                      INPUT DATA SOURCES                     │
├─────────────────────┬───────────────────┬───────────────────┤
│   OAS Database      │  DeKosky Data     │   Leuko Dataset   │
│   (CSV files)       │  (CSV files)      │   (Single CSV)    │
└─────────┬───────────┴─────────┬─────────┴─────────┬─────────┘
          │                     │                   │
          ▼                     ▼                   ▼
┌──────────────────────────────────────────────────────────────┐
│                  1. DATA STANDARDIZATION                     │
│  • Read CSV with appropriate headers                         │
│  • Handle duplicate columns (DeKosky)                        │
│  • Map to standard column names:                             │
│    - sequence_id_heavy/light                                 │
│    - sequence_heavy/light                                    │
└─────────────────────────────┬────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────┐
│                   2. FASTA GENERATION                        │
│  • Create BioPython SeqRecord objects                        │
│  • Write heavy chains → data/fasta-heavy/                    │
│  • Write light chains → data/fasta-light/                    │
└─────────────────────────────┬────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────┐
│                3. AIRR ANNOTATION (SADIE)                    │
│  • Run IgBLAST via SADIE Airr API                            │
│  • Process heavy chain FASTA files                           │
│  • Process light chain FASTA files                           │
│  • Generate AIRR-compliant annotations                       │
└─────────────────────────────┬────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────┐
│                    4. DATA MERGING                           │
│  • Match heavy/light chains by sequence ID                   │
│  • Merge annotations with _heavy/_light suffixes             │
│  • Add metadata from manifest:                               │
│    - Run ID, Species, Author                                 │
│    - B-cell source/type                                      │
│    - Disease status, etc.                                    │
└─────────────────────────────┬────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────┐
│                      5. OUTPUT                               │
│  • Save as Parquet files in data/parquet-paired/             │
│  • Each file contains:                                       │
│    - Original sequences                                      │
│    - Complete AIRR annotations                               │
│    - Experimental metadata                                   │
│    - Maintained heavy-light pairing                          │
└──────────────────────────────────────────────────────────────┘

Data Sources

1. OAS Database

Location: data/OAS_paired/
Format: CSV files with paired heavy/light sequences
Species: Human antibody sequences
Manifest: data/oas_manifest.csv contains metadata for each dataset

2. DeKosky Dataset

Location: data/DeKosky_paired/
Format: CSV files with custom column structure
Cell Type: Naive B-cells from PBMC
Special Processing: Requires column renaming due to duplicate headers

3. Leuko Dataset

Location: data/D326651_Leuko_human_naive.csv
Cell Type: Naive B-cells from PBMC
Author: Jonathan Hurtado

Processing Pipeline

Step 1: Data Loading and Preparation

Read CSV files with appropriate headers (some files have JSON headers requiring special handling)
Standardize column names:
- Map to sequence_id_heavy, sequence_id_light, sequence_heavy, sequence_light
- Handle duplicate column names in DeKosky data
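The loading step can be sketched roughly as follows. This assumes that a JSON header, when present, is a single CSV-quoted metadata line above the column names, and that duplicated headers are the ones pandas renames with a `.1` suffix. The helper names and the DeKosky-like column layout (`id`, `seq`) are hypothetical; the actual notebook may handle these cases differently.

```python
import csv
import io
import json

import pandas as pd

STANDARD_COLUMNS = [
    "sequence_id_heavy", "sequence_heavy",
    "sequence_id_light", "sequence_light",
]


def read_paired_csv(text: str):
    """Return (metadata, dataframe); metadata is {} when there is no JSON line."""
    first_line, _, rest = text.partition("\n")
    try:
        # A JSON header is stored as a quoted CSV cell, so undo the CSV quoting first.
        metadata = json.loads(",".join(next(csv.reader([first_line]))))
        if not isinstance(metadata, dict):
            raise ValueError("first line is not a JSON object")
        body = rest
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        metadata, body = {}, text
    return metadata, pd.read_csv(io.StringIO(body))


def standardize(df: pd.DataFrame, column_map: dict) -> pd.DataFrame:
    """Rename source columns to the standard names and force IDs to strings."""
    out = df.rename(columns=column_map)
    for col in ("sequence_id_heavy", "sequence_id_light"):
        out[col] = out[col].astype(str)
    return out[STANDARD_COLUMNS]


# Hypothetical file with repeated headers: pandas reads them as id, seq, id.1, seq.1.
dup_text = "id,seq,id,seq\n1,CAGGTG,1,GAAATT\n"
_, dup_df = read_paired_csv(dup_text)
dup_std = standardize(dup_df, {
    "id": "sequence_id_heavy", "seq": "sequence_heavy",
    "id.1": "sequence_id_light", "seq.1": "sequence_light",
})
```

Passing the text rather than a path keeps the sketch easy to try; for real files the same logic applies after reading the file contents.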

Step 2: FASTA Generation

Create FASTA files for heavy and light chains separately:
- Heavy chains saved to data/fasta-heavy/
- Light chains saved to data/fasta-light/
Use BioPython to properly format sequences with IDs
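A minimal sketch of the FASTA step, assuming the standardized column names from Step 1. `SeqRecord` and `SeqIO.write` are the BioPython calls listed under Dependencies; the helper names and the toy sequences are illustrative only.

```python
import io

import pandas as pd
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord


def to_records(df: pd.DataFrame, id_col: str, seq_col: str) -> list:
    """Build one SeqRecord per row; an empty description keeps the header to the ID."""
    return [
        SeqRecord(Seq(seq), id=str(seq_id), description="")
        for seq_id, seq in zip(df[id_col], df[seq_col])
    ]


def write_chain_fastas(df: pd.DataFrame, heavy_out, light_out):
    """Write heavy and light chains to separate FASTA targets (paths or handles)."""
    n_heavy = SeqIO.write(
        to_records(df, "sequence_id_heavy", "sequence_heavy"), heavy_out, "fasta"
    )
    n_light = SeqIO.write(
        to_records(df, "sequence_id_light", "sequence_light"), light_out, "fasta"
    )
    return n_heavy, n_light


# Toy input; in-memory handles stand in for data/fasta-heavy/ and data/fasta-light/.
toy = pd.DataFrame({
    "sequence_id_heavy": ["h1", "h2"],
    "sequence_heavy": ["CAGGTGCAG", "GAGGTG"],
    "sequence_id_light": ["l1", "l2"],
    "sequence_light": ["GAAATTGTG", "CAGTCT"],
})
heavy_handle, light_handle = io.StringIO(), io.StringIO()
counts = write_chain_fastas(toy, heavy_handle, light_handle)
```

In the pipeline the two targets would be file paths such as `data/fasta-heavy/ERR4082227.fasta`, which `SeqIO.write` also accepts.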

Step 3: AIRR Annotation with SADIE

Run IgBLAST via SADIE's Airr API on each FASTA file
Generate AIRR-compliant annotations including:
- V(D)J gene assignments
- CDR3 sequences
- Framework regions
- Junction analysis

Step 4: Paired Data Merging

Match heavy and light chains using sequence IDs
Merge annotations with suffixes _heavy and _light
Add metadata from manifest:
- Run ID
- Species
- B-cell source (PBMC)
- B-cell type (Naive B-cells)
- Author information
- Disease status
- Other experimental metadata
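The merge step might look like the following sketch. It assumes both annotation tables carry a shared pairing key named `tmp_id`, as in the Example Processing Flow, and that the manifest metadata for a run is available as a plain dictionary. The helper names and the toy values are illustrative.

```python
import pandas as pd


def merge_pairs(heavy: pd.DataFrame, light: pd.DataFrame, key: str = "tmp_id") -> pd.DataFrame:
    """Join heavy and light annotations row by row on the shared pairing key."""
    return pd.merge(
        heavy, light,
        on=key,
        how="inner",                      # keep only pairs annotated on both chains
        suffixes=("_heavy", "_light"),
        validate="one_to_one",            # fail loudly if a key is duplicated
    )


def add_metadata(paired: pd.DataFrame, manifest_row: dict) -> pd.DataFrame:
    """Attach run-level metadata as constant columns."""
    return paired.assign(**manifest_row)


# Toy annotation tables (values are made up).
heavy = pd.DataFrame({
    "tmp_id": [0, 1],
    "sequence_id": ["h0", "h1"],
    "v_call": ["IGHV1-46*01", "IGHV3-23*01"],
})
light = pd.DataFrame({
    "tmp_id": [1, 0],
    "sequence_id": ["l1", "l0"],
    "v_call": ["IGKV1-39*01", "IGKV3-20*01"],
})

paired = add_metadata(
    merge_pairs(heavy, light),
    {"run": "ERR4082227", "species": "human", "bsource": "PBMC"},
)
```

Because the key column is shared, it keeps its name, while every other overlapping column receives the `_heavy` or `_light` suffix.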

Step 5: Output Generation

Save as Parquet files in data/parquet-paired/
File naming: Uses run ID or dataset identifier
Format: Apache Parquet for efficient storage and querying

Output Structure

Each Parquet file contains:

Sequence data: Original nucleotide sequences for heavy and light chains
AIRR annotations: Complete IgBLAST results for both chains
Metadata: Experimental and sample information
Pairing information: Maintained heavy-light chain relationships

Technical Details

Dependencies

pandas
BioPython (Bio.Seq, Bio.SeqRecord, Bio.SeqIO)
SADIE (for AIRR annotation via IgBLAST)

Performance

Processing time varies with dataset size
Example: individual SRR datasets process in roughly 20-30 seconds each
DeKosky datasets: roughly 4.5 minutes for complete processing

Error Handling

Checks for existing files to avoid overwriting
Handles mixed data types in columns
Manages memory by deleting dataframes after processing
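The overwrite check and memory handling could be organized along these lines. This is a sketch under assumptions: `process_one` is a hypothetical callback that builds the paired table for one run, and output files are assumed to be named `<run_id>.parquet`.

```python
import gc
from pathlib import Path


def process_all(run_ids, out_dir, process_one):
    """Process every run that has no output file yet; return the IDs processed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    for run_id in run_ids:
        target = out_dir / f"{run_id}.parquet"
        if target.exists():
            continue  # an earlier result exists, so leave it untouched
        table = process_one(run_id)  # hypothetical: returns a DataFrame-like table
        table.to_parquet(target)
        del table      # drop the reference before the next run is loaded
        gc.collect()
        processed.append(run_id)
    return processed
```

Re-running the function after an interruption then only picks up the runs that are still missing.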

Usage Example

To run the pipeline:

Ensure all dependencies are installed
Place raw data in appropriate directories
Run the notebook cells sequentially
Output will be generated in data/parquet-paired/

Example Processing Flow

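import pandas as pd
from sadie.airr import Airr

# SADIE's IgBLAST wrapper with human germline references
# (assumed setup; the notebook may configure it differently)
airr_api = Airr("human")

# Process a single OAS file
filename = "ERR4082227"
df = pd.read_csv(f"data/OAS_paired/{filename}_paired.csv")

# Standardize columns: force sequence IDs to strings
df['sequence_id_heavy'] = df['sequence_id_heavy'].astype(str)
df['sequence_id_light'] = df['sequence_id_light'].astype(str)

# Run SADIE on the FASTA files generated in Step 2 (they must already exist)
heavy_df = airr_api.run_fasta(f"data/fasta-heavy/{filename}.fasta")
light_df = airr_api.run_fasta(f"data/fasta-light/{filename}.fasta")

# Merge and save; tmp_id is the pairing key and must be present in both tables
paired_df = pd.merge(heavy_df, light_df, on='tmp_id', suffixes=('_heavy', '_light'))
paired_df.to_parquet(f'data/parquet-paired/{filename}.parquet')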

Output Example

Each Parquet file contains more than 100 columns, including:

Heavy Chain Columns:
- sequence_id_heavy
- sequence_heavy
- v_call_heavy
- d_call_heavy
- j_call_heavy
- cdr3_aa_heavy
- junction_heavy
- ...

Light Chain Columns:
- sequence_id_light
- sequence_light
- v_call_light
- j_call_light
- cdr3_aa_light
- junction_light
- ...

Metadata:
- run
- species
- bsource
- btype
- author
- disease
- file_name

Manifest Generation

A tailored manifest (data/oas_manifest_human_paired.csv) is created containing:

Only human paired sequences
Unique author entries
Sorted by run ID
Ready for downstream analysis
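One possible reading of these manifest rules is sketched below. The column names `run`, `species` and `author` follow the metadata columns listed in the Output Example; the `chain` column, the interpretation of "unique author entries" as dropping repeated run/author rows, and the toy rows are assumptions.

```python
import pandas as pd


def tailor_manifest(manifest: pd.DataFrame) -> pd.DataFrame:
    """Keep human paired entries, drop repeated run/author rows, sort by run ID."""
    keep = (
        manifest["species"].str.lower().eq("human")
        & manifest["chain"].str.lower().eq("paired")
    )
    return (
        manifest[keep]
        .drop_duplicates(subset=["run", "author"])
        .sort_values("run")
        .reset_index(drop=True)
    )


# Toy manifest (values are made up).
manifest = pd.DataFrame({
    "run": ["SRR2", "ERR1", "ERR1", "ERR3", "ERR4"],
    "species": ["human", "human", "human", "mouse", "human"],
    "chain": ["Paired", "Paired", "Paired", "Paired", "Heavy"],
    "author": ["B", "A", "A", "C", "D"],
})
tailored = tailor_manifest(manifest)
```

The result could then be written with `tailored.to_csv("data/oas_manifest_human_paired.csv", index=False)`.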
