Implement Data Ingest Folder Structure and Conventions #9

@shreddd

Objective

Establish a standardized folder structure and file naming convention for new data ingest processes, ensuring compatibility with the latest release schema and efficient storage/validation practices.

Requirements

  • Create a new ingest folder in the repository.
  • Within the ingest folder, create a subfolder for each data provider.
  • All ingests must support the latest release schema.
  • Depending on total data size, files should be split to limit each to ~25 MB.
  • Do not split records between files: each file must contain only complete records so that validation can be performed independently.
  • All data files are to be formatted as JSON lists (enclosed in brackets). Consider https://jsonlines.org/ as an alternative approach if more appropriate for downstream usage.
  • File naming convention: <data provider>_<zero-padded 5-digit sequence number>.json (e.g., emsl_00001.json).
  • Future: explore the JSON Lines format.
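The splitting and naming requirements above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the names `chunk_records` and `write_ingest_files`, the `ingest` root path, and the reading of "~25 MB" as 25 * 1024 * 1024 bytes are all assumptions made for the example. Records are kept whole, so a single record larger than the limit still ends up alone in an oversized file.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator

# Assumed interpretation of "~25 MB"; the issue does not give an exact byte count.
MAX_BYTES = 25 * 1024 * 1024


def chunk_records(records: Iterable[dict], max_bytes: int = MAX_BYTES) -> Iterator[list]:
    """Group whole records into batches whose JSON list encoding fits in max_bytes."""
    batch = []
    size = 2  # the enclosing "[" and "]"
    for record in records:
        encoded = len(json.dumps(record).encode("utf-8"))
        extra = encoded + (2 if batch else 0)  # ", " separator used by json.dump
        if batch and size + extra > max_bytes:
            yield batch
            batch, size = [], 2
            extra = encoded
        batch.append(record)  # a record is never split across batches
        size += extra
    if batch:
        yield batch


def write_ingest_files(records: Iterable[dict], provider: str,
                       ingest_root: str = "ingest",
                       max_bytes: int = MAX_BYTES) -> list:
    """Write batches to <ingest_root>/<provider>/<provider>_<NNNNN>.json."""
    out_dir = Path(ingest_root) / provider
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, batch in enumerate(chunk_records(records, max_bytes), start=1):
        path = out_dir / f"{provider}_{index:05d}.json"
        with path.open("w", encoding="utf-8") as fh:
            json.dump(batch, fh)  # each file is a complete JSON list
        paths.append(path)
    return paths
```

For example, `write_ingest_files(records, "emsl")` would produce `ingest/emsl/emsl_00001.json`, `ingest/emsl/emsl_00002.json`, and so on, each loadable and checkable on its own.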

Acceptance Criteria

  • New ingest folder structure is documented and implemented.
  • Each data provider has its own subfolder.
  • All files conform to the current release schema.
  • No file exceeds ~25 MB; splitting strategy is documented.
  • No records are split between files; all files are independently valid.
  • Naming convention is followed for all new files.
  • JSON format (list or dict) is clearly specified and documented.
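Several of these criteria could be checked mechanically. The sketch below is one hypothetical way to do it: `check_ingest_file` and `NAME_PATTERN` are names invented for the example, the size limit is again assumed to be 25 * 1024 * 1024 bytes, and it assumes each provider subfolder is named after the file prefix. It does not validate records against the release schema, since the issue does not specify how the schema is obtained.

```python
import json
import re
from pathlib import Path

# Assumed interpretation of "~25 MB".
MAX_BYTES = 25 * 1024 * 1024
# <data provider>_<5 digits>.json, e.g. emsl_00001.json
NAME_PATTERN = re.compile(r"^(?P<provider>.+)_(?P<seq>\d{5})\.json$")


def check_ingest_file(path: Path, max_bytes: int = MAX_BYTES) -> list:
    """Return a list of human-readable problems; an empty list means the file passed."""
    problems = []

    match = NAME_PATTERN.match(path.name)
    if match is None:
        problems.append("name does not match <data provider>_<5 digits>.json")
    elif match.group("provider") != path.parent.name:
        # Assumption: the subfolder carries the same name as the file prefix.
        problems.append("provider prefix differs from subfolder name")

    if path.stat().st_size > max_bytes:
        problems.append("file exceeds size limit")

    try:
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
    except ValueError as exc:  # covers JSONDecodeError and UnicodeDecodeError
        problems.append(f"not valid JSON: {exc}")
    else:
        if not isinstance(data, list):
            problems.append("top-level value is not a JSON list")

    return problems
```

Running this over every `*.json` file under the ingest folder and failing on any non-empty result would cover the naming, size, and independent-parse criteria; schema conformance would need a separate step.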

Metadata

Assignees

No one assigned

    Labels

    documentation (Improvements or additions to documentation), enhancement (New feature or request)
