JSON output for paragraphs does not include parent headings #100

@Sanakhamassi

When using GROBID to parse PDFs into JSON format, each paragraph is assigned a head_section based on the detected headings. However, if a paragraph sits under both a main heading and a subheading, the JSON output includes only the subheading in head_section. The parent (main) heading is missing.

PDF structure:
Main heading: Methods
Subheading: Study design
Paragraph: "This study was conducted over a period of 6 months."

Current JSON output assigns only the subheading:
head_section: Study design
Text: "This study was conducted over a period of 6 months."

Below are example output files that demonstrate the issue:

GROBID TEI XML:
mjb3wlzxcb2mc-migowebupload-1766042162782.grobid.tei.xml
Generated JSON output:
mjb3wlzxcb2mc-migowebupload-1766042162782.json

In these files, paragraphs under nested headings only reference the lowest-level heading in head_section, while the parent heading is absent.
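
To make the request concrete, here is a minimal sketch of one way the parent headings could be recovered from the TEI. It is not the converter's actual code. It assumes that each `<head>` carries its section number in an `n` attribute (for example `n="2.1"`), so that depth can be inferred from the number of dotted parts; headings without a number would all be treated as top-level. The `head_path` field, the function names, and the sample TEI are hypothetical and hand-written for illustration, not taken from the attached files.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written sample for illustration only; not from the attached files.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div><head n="2">Methods</head></div>
    <div><head n="2.1">Study design</head>
      <p>This study was conducted over a period of 6 months.</p>
    </div>
  </body></text>
</TEI>"""


def heading_depth(n_attr):
    """Infer depth from a section number such as '2' or '2.1.'.

    Assumption: a missing or empty number means a top-level heading.
    """
    parts = [p for p in (n_attr or "").split(".") if p]
    return max(len(parts), 1)


def paragraphs_with_heading_path(tei_xml):
    """Return one dict per paragraph, including all ancestor headings."""
    root = ET.fromstring(tei_xml)
    stack = []  # (depth, title) pairs for the headings currently in scope
    out = []
    for div in root.iterfind(".//tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        if head is not None:
            depth = heading_depth(head.get("n"))
            # Drop headings at the same or a deeper level before pushing.
            while stack and stack[-1][0] >= depth:
                stack.pop()
            stack.append((depth, "".join(head.itertext()).strip()))
        path = [title for _, title in stack]
        for p in div.iterfind("tei:p", TEI_NS):
            out.append({
                "head_section": path[-1] if path else None,  # current behaviour
                "head_path": list(path),  # hypothetical new field
                "text": "".join(p.itertext()).strip(),
            })
    return out
```

For the sample above, calling `paragraphs_with_heading_path(SAMPLE_TEI)` yields a single paragraph whose `head_section` is still "Study design" and whose `head_path` lists "Methods" followed by "Study design". Keeping `head_section` unchanged and adding a separate field would avoid breaking existing consumers of the JSON.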
