-
Notifications
You must be signed in to change notification settings - Fork 81
Add formula in the JSON output #95
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -66,6 +66,7 @@ def convert_tei_file(self, tei_file: Union[Path, BinaryIO], stream: bool = False | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| text_structure = [] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| document['body_text'] = text_structure | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| figures_and_tables = [] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| document['figures_and_tables'] = figures_and_tables | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| references_structure = [] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| document['references'] = references_structure | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
@@ -88,6 +89,7 @@ def convert_tei_file(self, tei_file: Union[Path, BinaryIO], stream: bool = False | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ) for author in child.find_all("author") | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| doi_node = child.find("idno", type="DOI") | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
@@ -206,7 +208,7 @@ def convert_tei_file(self, tei_file: Union[Path, BinaryIO], stream: bool = False | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ] if graphic_coords else [] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| } | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| # Extract references from listBibl with comprehensive processing | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| list_bibl = soup.find("listBibl") | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| if list_bibl: | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
@@ -669,91 +671,101 @@ def _iter_passages_from_soup_for_text(self, text_node: Tag, passage_level: str) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| for passage in self._process_div_with_nested_content(div, passage_level, head_paragraph): | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| yield passage | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring has been simplified but is now less informative. The new description "Process a div in document order. Paragraphs and formulas are yielded exactly where they appear in the XML." doesn't fully explain the method's responsibilities, such as handling nested divs, section headers, and the various back section types mentioned in the original docstring. Consider expanding this to provide a more complete description of what the method does.
| Process a div **in document order**. Paragraphs and formulas are yielded | |
| exactly where they appear in the XML. | |
| Traverse a single TEI <div> element and its nested content in document order | |
| and yield passage dictionaries (e.g. paragraphs and formulas) exactly where | |
| they appear in the XML. | |
| This helper is responsible for: | |
| - Interpreting section headers from a child <head> element when present. | |
| If the <div> has a <head> but no top-level <p> children, the cleaned | |
| <head> text is treated as a paragraph (``current_head_paragraph``). | |
| Otherwise, the cleaned <head> text is treated as the section title | |
| (``head_section``) for the passages that follow. | |
| - Synthesizing a section heading from the ``type`` attribute when there is | |
| no explicit <head>, including common back-section types like | |
| ``acknowledgement``, ``conflict``, ``contribution``, ``availability``, | |
| and ``annex``. These are mapped to human-readable headings (for example, | |
| ``"acknowledgement"`` -> ``"Acknowledgements"``), or, if not in the | |
| mapping, converted from snake_case to Title Case. | |
| - Respecting any heading context passed via ``head_paragraph`` from a | |
| parent or surrounding structure, so that nested <div> elements inherit | |
| the appropriate heading/paragraph context where necessary. | |
| - Walking the children of the <div> (and any relevant nested elements) in | |
| XML document order so that paragraphs, formulas, and other supported | |
| passage types are yielded in the same sequence as they appear in the | |
| source TEI. | |
| The exact structure of the yielded dictionaries matches what | |
| ``_iter_passages_from_soup_for_text`` / ``get_formatted_passage`` expect, | |
| and no reordering of content is performed beyond what is implied by the TEI | |
| structure itself. |
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tuple syntax child.name in ("div",) is redundant. Since you're checking a single value, this can be simplified to child.name == "div" for better readability. Alternatively, if you expect to add more tag names in the future, use a set or list without a trailing comma.
| if child.name in ("div",) or child.name.endswith(":div"): | |
| if child.name == "div" or child.name.endswith(":div"): |
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The refactored code no longer uses the self.validate_refs attribute that was set in __init__. The previous implementation included validation logic to check reference offsets when validate_refs was True. This validation has been removed in the refactoring, making the validate_refs parameter in the constructor non-functional. Consider either restoring the validation logic or removing the unused parameter from the class initializer.
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Formula passages always have an empty refs list, unlike paragraphs which extract references using get_refs_with_offsets(). If formulas can contain references (e.g., citations or cross-references), consider extracting them similar to how paragraphs are processed. If formulas intentionally don't need reference extraction, consider adding a comment explaining this design decision.
| "coords": coords, | |
| "coords": coords, | |
| # Note: unlike paragraphs, formulas currently do not extract references | |
| # (citations or cross-references); `refs` is intentionally left empty. |
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The formula extraction does not include a "type" field in the structure returned by paragraphs. While formulas have "type": "formula" (line 756), paragraph passages don't have a type field. For consistency in the output structure, consider adding a "type": "paragraph" or "type": "sentence" field to paragraph passages as well, or document why formulas need this field but paragraphs don't.
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new formula extraction feature lacks test coverage. The existing test file (tests/test_conversions.py) contains tests for TEI to JSON conversion but doesn't include any tests for formula extraction. Consider adding tests that verify formulas are correctly extracted from TEI XML, including their coordinates, text content, and proper placement within the document structure.
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line contains trailing whitespace. Consider removing the whitespace characters to follow clean code practices and avoid potential issues with linters.
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code block serves no purpose. The condition checks if current_head_paragraph is not None but only executes a pass statement. This appears to be leftover code from the refactoring. Either remove this block entirely, or if there was intended logic here, it should be implemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line contains only trailing whitespace with no code or comments. Consider removing this empty line or ensuring it contains no whitespace characters, as trailing whitespace can cause issues with some linters and version control systems.