149 changes: 81 additions & 68 deletions grobid_client/format/TEI2LossyJSON.py
@@ -66,6 +66,7 @@ def convert_tei_file(self, tei_file: Union[Path, BinaryIO], stream: bool = False
text_structure = []
document['body_text'] = text_structure
figures_and_tables = []

Copilot AI Dec 22, 2025

This line contains only trailing whitespace with no code or comments. Consider removing this empty line or ensuring it contains no whitespace characters, as trailing whitespace can cause issues with some linters and version control systems.
document['figures_and_tables'] = figures_and_tables
references_structure = []
document['references'] = references_structure
@@ -88,6 +89,7 @@ def convert_tei_file(self, tei_file: Union[Path, BinaryIO], stream: bool = False
) for author in child.find_all("author")
]
)

)

doi_node = child.find("idno", type="DOI")
@@ -206,7 +208,7 @@ def convert_tei_file(self, tei_file: Union[Path, BinaryIO], stream: bool = False
] if graphic_coords else []
}
)

Copilot AI Dec 22, 2025

This line contains trailing whitespace. Consider removing the whitespace characters to follow clean code practices and avoid potential issues with linters.
# Extract references from listBibl with comprehensive processing
list_bibl = soup.find("listBibl")
if list_bibl:
@@ -669,91 +671,101 @@ def _iter_passages_from_soup_for_text(self, text_node: Tag, passage_level: str)
for passage in self._process_div_with_nested_content(div, passage_level, head_paragraph):
yield passage


Copilot AI Dec 22, 2025

This line contains trailing whitespace. Consider removing the whitespace characters to follow clean code practices and avoid potential issues with linters.
def _process_div_with_nested_content(self, div: Tag, passage_level: str, head_paragraph: str = None) -> Iterator[Dict[str, Union[str, Dict[str, str]]]]:
"""
Process a div and its nested content, handling various back section types.
Supports nested divs for complex back sections like annex with multiple subsections.
Process a div **in document order**. Paragraphs and formulas are yielded
exactly where they appear in the XML.
Comment on lines +677 to +678

Copilot AI Dec 22, 2025

The docstring has been simplified but is now less informative. The new description "Process a div in document order. Paragraphs and formulas are yielded exactly where they appear in the XML." doesn't fully explain the method's responsibilities, such as handling nested divs, section headers, and the various back section types mentioned in the original docstring. Consider expanding this to provide a more complete description of what the method does.

Suggested change
Process a div **in document order**. Paragraphs and formulas are yielded
exactly where they appear in the XML.
Traverse a single TEI <div> element and its nested content in document order
and yield passage dictionaries (e.g. paragraphs and formulas) exactly where
they appear in the XML.

This helper is responsible for:
- Interpreting section headers from a child <head> element when present.
  If the <div> has a <head> but no top-level <p> children, the cleaned
  <head> text is treated as a paragraph (``current_head_paragraph``).
  Otherwise, the cleaned <head> text is treated as the section title
  (``head_section``) for the passages that follow.
- Synthesizing a section heading from the ``type`` attribute when there is
  no explicit <head>, including common back-section types like
  ``acknowledgement``, ``conflict``, ``contribution``, ``availability``,
  and ``annex``. These are mapped to human-readable headings (for example,
  ``"acknowledgement"`` -> ``"Acknowledgements"``), or, if not in the
  mapping, converted from snake_case to Title Case.
- Respecting any heading context passed via ``head_paragraph`` from a
  parent or surrounding structure, so that nested <div> elements inherit
  the appropriate heading/paragraph context where necessary.
- Walking the children of the <div> (and any relevant nested elements) in
  XML document order so that paragraphs, formulas, and other supported
  passage types are yielded in the same sequence as they appear in the
  source TEI.

The exact structure of the yielded dictionaries matches what
``_iter_passages_from_soup_for_text`` / ``get_formatted_passage`` expect,
and no reordering of content is performed beyond what is implied by the TEI
structure itself.
"""
head = div.find("head")
p_nodes = div.find_all("p")
head_section = None
current_head_paragraph = None

# Check if this div has nested divs first (handle namespace variants)
nested_divs = []
for child in div.children:
if hasattr(child, 'name') and child.name:
# Handle both namespaced and non-namespaced divs
if child.name == "div" or child.name.endswith(":div"):
nested_divs.append(child)

# Count only direct child paragraphs, not those in nested divs
direct_p_nodes = [child for child in div.children if hasattr(child, 'name') and child.name == "p"]

if len(nested_divs) > 0 and len(direct_p_nodes) == 0:
# This is a container div - process each nested div independently
for nested_div in nested_divs:
# Skip references divs
if nested_div.get("type") == "references":
continue
# Pass None as head_paragraph to ensure nested divs use their own headers
for passage in self._process_div_with_nested_content(nested_div, passage_level, None):
yield passage
return # Don't process this div further

# Determine the section header and content type for divs with content
if head:
if len(direct_p_nodes) == 0:
# This div has only a head, no paragraphs (standalone head)
if not div.find_all("p", recursive=False):
current_head_paragraph = self._clean_text(head.get_text())
else:
# This div has both head and paragraphs - head is the section header
head_section = self._clean_text(head.get_text())
else:
# If no head element, try to use the type attribute as head_section
div_type = div.get("type")
if div_type:
# Handle specific div types with appropriate section names
if div_type == "acknowledgement":
head_section = "Acknowledgements"
elif div_type == "conflict":
head_section = "Conflicts of Interest"
elif div_type == "contribution":
head_section = "Author Contributions"
elif div_type == "availability":
# Only set as default if this div has its own content
if len(direct_p_nodes) > 0:
head_section = "Data Availability"
elif div_type == "annex":
head_section = "Annex"
else:
# Generic handling - capitalize and format
head_section = div_type.replace("_", " ").title()

# Process paragraphs in this div
if len(direct_p_nodes) > 0:
for id_p, p in enumerate(direct_p_nodes):
mapping = {
"acknowledgement": "Acknowledgements",
"conflict": "Conflicts of Interest",
"contribution": "Author Contributions",
"availability": "Data Availability",
"annex": "Annex",
}
head_section = mapping.get(div_type) or div_type.replace("_", " ").title()

paragraph_id = None
for child in div.children:
if not hasattr(child, "name") or not child.name:
continue

# nested divs
if child.name in ("div",) or child.name.endswith(":div"):
Copilot AI Dec 22, 2025

The tuple syntax child.name in ("div",) is redundant. Since you're checking a single value, this can be simplified to child.name == "div" for better readability. Alternatively, if you expect to add more tag names in the future, use a set or list without a trailing comma.

Suggested change
if child.name in ("div",) or child.name.endswith(":div"):
if child.name == "div" or child.name.endswith(":div"):
# recurse – the nested div will use its own header
yield from self._process_div_with_nested_content(
child, passage_level, None
)
continue

# paragraphs
if child.name == "p":
paragraph_id = get_random_id(prefix="p_")

if passage_level == "sentence":
for id_s, sentence in enumerate(p.find_all("s")):
struct = get_formatted_passage(current_head_paragraph or head_paragraph, head_section, paragraph_id, sentence)
if self.validate_refs:
for ref in struct['refs']:
assert "Wrong offsets", ref['offset_start'] < ref['offset_end']
assert "Cannot apply offsets", struct['text'][ref['offset_start']:ref['offset_end']] == ref['text']
yield struct
for sentence in child.find_all("s"):
yield get_formatted_passage(
current_head_paragraph or head_paragraph,
head_section,
paragraph_id,
sentence,
)
else:
struct = get_formatted_passage(current_head_paragraph or head_paragraph, head_section, paragraph_id, p)
if self.validate_refs:
for ref in struct['refs']:
assert "Wrong offsets", ref['offset_start'] < ref['offset_end']
assert "Cannot apply offsets", struct['text'][ref['offset_start']:ref['offset_end']] == ref['text']
yield struct

# Update head_paragraph for potential next div
yield get_formatted_passage(
current_head_paragraph or head_paragraph,
head_section,
paragraph_id,
child,
)
Comment on lines +715 to +731

Copilot AI Dec 22, 2025

The refactored code no longer uses the self.validate_refs attribute that was set in __init__. The previous implementation included validation logic to check reference offsets when validate_refs was True. This validation has been removed in the refactoring, making the validate_refs parameter in the constructor non-functional. Consider either restoring the validation logic or removing the unused parameter from the class initializer.
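If the checks are meant to stay, a minimal sketch of how they could be reinstated inside the refactored loop is shown below; it assumes get_formatted_passage still returns a dict with 'text' and 'refs' keys as in the previous implementation, and it also puts the assert arguments in the conventional order (condition first, message second), which the removed lines had reversed.

struct = get_formatted_passage(
    current_head_paragraph or head_paragraph,
    head_section,
    paragraph_id,
    node,  # `sentence` in the sentence branch, `child` (the <p>) otherwise
)
if self.validate_refs:
    for ref in struct["refs"]:
        # Condition first, message second; the removed asserts had these reversed.
        assert ref["offset_start"] < ref["offset_end"], "Wrong offsets"
        assert struct["text"][ref["offset_start"]:ref["offset_end"]] == ref["text"], \
            "Cannot apply offsets"
yield struct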
continue

# formulas
if child.name == "formula":
fid = (
child.get("{http://www.w3.org/XML/1998/namespace}id")
or child.get("id")
or get_random_id("f_")
)
raw = child.get_text(separator=" ", strip=True)
formula_text = self._clean_text(raw)

# coordinates (if GROBID gave them)
coords = []
if child.has_attr("coords"):
for c in child["coords"].split(";"):
if c.strip():
coords.append(box_to_dict(c.split(",")))

passage = {
"id": fid,
"text": formula_text,
"coords": coords,
Copilot AI Dec 22, 2025

Formula passages always have an empty refs list, unlike paragraphs which extract references using get_refs_with_offsets(). If formulas can contain references (e.g., citations or cross-references), consider extracting them similar to how paragraphs are processed. If formulas intentionally don't need reference extraction, consider adding a comment explaining this design decision.

Suggested change
"coords": coords,
"coords": coords,
# Note: unlike paragraphs, formulas currently do not extract references
# (citations or cross-references); `refs` is intentionally left empty.
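If reference extraction is in fact wanted for formulas, a sketch of how it could mirror the paragraph path is below; the call to get_refs_with_offsets and its signature are assumptions for illustration, not code taken from this module.

# Hypothetical sketch: give formulas the same reference treatment as paragraphs.
# The signature get_refs_with_offsets(node, text) and its return shape
# (dicts with 'text', 'offset_start', 'offset_end') are assumed, not verified.
formula_refs = []
if child.find("ref") is not None:
    formula_refs = get_refs_with_offsets(child, formula_text)

passage = {
    "id": fid,
    "text": formula_text,
    "coords": coords,
    "refs": formula_refs,
    "type": "formula",
}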
"refs": [],
"type": "formula",
Copilot AI Dec 22, 2025

Formula passages include a "type" field ("type": "formula", line 756) that paragraph passages do not. For consistency in the output structure, consider adding a "type": "paragraph" or "type": "sentence" field to paragraph passages as well, or document why formulas need this field but paragraphs don't.
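One hedged way to add the matching field without touching get_formatted_passage would be to set it on the returned dict; this is a sketch, not part of the diff.

# Sketch only: tag paragraph/sentence passages with a "type" for symmetry with
# formula passages, assuming get_formatted_passage returns a plain dict.
struct = get_formatted_passage(
    current_head_paragraph or head_paragraph,
    head_section,
    paragraph_id,
    child,  # or `sentence` in the sentence branch
)
struct.setdefault("type", "sentence" if passage_level == "sentence" else "paragraph")
yield struct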
}
if current_head_paragraph or head_paragraph:
passage["head_paragraph"] = current_head_paragraph or head_paragraph
if head_section:
passage["head_section"] = head_section

yield passage
continue
Comment on lines +734 to +764
Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new formula extraction feature lacks test coverage. The existing test file (tests/test_conversions.py) contains tests for TEI to JSON conversion but doesn't include any tests for formula extraction. Consider adding tests that verify formulas are correctly extracted from TEI XML, including their coordinates, text content, and proper placement within the document structure.

Copilot uses AI. Check for mistakes.
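A hedged pytest sketch of what such a test might look like is below; the minimal TEI fixture, the assumption that formula passages end up in document['body_text'], and the no-argument constructor are assumptions rather than code from the repository, and a real fixture would likely need a fuller teiHeader.

# Hypothetical test sketch (not in tests/test_conversions.py): checks that a
# <formula> in the body is extracted with its text and coordinates.
from pathlib import Path

from grobid_client.format.TEI2LossyJSON import TEI2LossyJSONConverter


def test_formula_passage_extraction(tmp_path: Path):
    tei = (
        '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><div>'
        "<head>Methods</head>"
        "<p><s>Some sentence.</s></p>"
        '<formula coords="1,50.0,60.0,100.0,12.0">E = mc^2</formula>'
        "</div></body></text></TEI>"
    )
    tei_file = tmp_path / "sample.tei.xml"
    tei_file.write_text(tei)

    converter = TEI2LossyJSONConverter()  # assumed to take no required arguments
    document = converter.convert_tei_file(tei_file)

    formulas = [p for p in document["body_text"] if p.get("type") == "formula"]
    assert formulas, "expected at least one formula passage"
    assert formulas[0]["text"] == "E = mc^2"
    assert formulas[0]["coords"], "expected coords parsed from the coords attribute"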


Copilot AI Dec 22, 2025

This line contains trailing whitespace. Consider removing the whitespace characters to follow clean code practices and avoid potential issues with linters.
if current_head_paragraph is not None:
head_paragraph = current_head_paragraph
pass
Comment on lines 767 to +768

Copilot AI Dec 22, 2025

This code block serves no purpose. The condition checks if current_head_paragraph is not None but only executes a pass statement. This appears to be leftover code from the refactoring. Either remove this block entirely, or if there was intended logic here, it should be implemented.

def process_directory(self, directory: Union[str, Path], pattern: str = "*.tei.xml", parallel: bool = True, workers: int = None) -> Iterator[Dict]:
"""Process a directory of TEI files and yield converted documents.
@@ -1004,3 +1016,4 @@ def xml_table_to_json(table_element):
def convert_tei_file(tei_file: Union[Path, BinaryIO], stream: bool = False):
converter = TEI2LossyJSONConverter()
return converter.convert_tei_file(tei_file, stream=stream)