Conversation

@edufuga commented Nov 24, 2025

Silk — Improve plugin documentation (second batch)

https://jira.eccenca.com/browse/CMEM-7013

This PR adds documentation for the following Silk dataset plugins:

  • RDF file dataset (local RDF file + ZIP ingestion, format/graph handling, in-memory limits, optional N-Triples output).
  • In-memory dataset (embedded RDF store for temporary workflow graphs, SPARQL-based read/write, lifecycle + “clear before run” semantics).
  • Alignment dataset (write-only export of link results as Alignment files following the AlignAPI specification, with <Cell>-level structure and optional relation/measure).

RdfFileDataset.md

RDF file reads RDF data from a local file (or ZIP archive) into the project as an in-memory dataset and, for supported formats, can also write RDF back to a file.

The doc starts with the intended usage window and then works through data shape, configuration, behavior, and limitations:

  • Usage window: small/medium files, snapshots for exploration/mapping/linking, and simple export, with the hard constraint flagged immediately: everything is loaded into memory, so very large files belong in an external store.
  • Data shape and IO: single file vs. ZIP input (plus the regex gate for which ZIP entries are considered), dataset output as queryable graph(s), and the graph-selection rule (named graph only where the chosen format supports it; otherwise the default graph, with the graph parameter ignored for graph-less formats).
  • Configuration notes that focus on how to think, not just what to fill in: file/ZIP behavior, format auto-detection (and the “can’t detect → error” path), the write restriction (only N-Triples as output), advanced narrowing via an entity list, and ZIP entry filtering via regex.
  • Behavior as a sequence you can predict (see the sketch below): size check → parse into an in-memory dataset (default plus possibly named graphs) → select graph → serve repeated reads from memory until the underlying file timestamp changes → reload on next access; the write path serializes as N-Triples only.
  • Limitations plus “when to use” guidance and concrete examples (simple Turtle, N-Quads with an explicit graph, a ZIP with multiple RDF files).
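To make the load/select/write path easier to picture, here is a minimal sketch that uses Apache Jena as a stand-in for the in-memory store; the `RdfFileSketch` object and its method names are illustrative only (not the plugin's actual code), and the size check and timestamp-based reload are omitted.

```scala
import java.io.{File, FileOutputStream}
import org.apache.jena.query.{Dataset, DatasetFactory}
import org.apache.jena.rdf.model.Model
import org.apache.jena.riot.{Lang, RDFDataMgr, RDFLanguages}

// Illustrative load/select/write cycle; not the plugin's actual implementation.
object RdfFileSketch {

  // Parse the file into an in-memory dataset, failing if the format cannot be detected.
  def load(file: File): Dataset = {
    val lang = Option(RDFLanguages.filenameToLang(file.getName))
      .getOrElse(sys.error(s"Cannot detect RDF format of ${file.getName}"))
    val dataset = DatasetFactory.create() // in-memory: default graph plus named graphs
    RDFDataMgr.read(dataset, file.getAbsolutePath, lang)
    dataset
  }

  // Graph selection: a named graph only if the parsed data contains it, otherwise the default graph.
  def selectGraph(dataset: Dataset, graphUri: Option[String]): Model =
    graphUri match {
      case Some(uri) if dataset.containsNamedModel(uri) => dataset.getNamedModel(uri)
      case _                                            => dataset.getDefaultModel
    }

  // Write path: output is serialized as N-Triples only.
  def writeNTriples(model: Model, target: File): Unit = {
    val out = new FileOutputStream(target)
    try RDFDataMgr.write(out, model, Lang.NTRIPLES)
    finally out.close()
  }
}
```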

InMemoryDataset.md

In-memory dataset is a small embedded RDF store that keeps all data in memory and exposes it via SPARQL as a temporary working graph inside workflows.

The doc frames it as a deliberately non-persistent scratch graph and explains its role inside workflows:

  • Framing: one in-memory RDF model, all reads and writes mediated through a SPARQL endpoint, and an empty state after application restart.
  • Workflow role: explicitly bidirectional, usable as both source and sink, so upstream components can write entities/links/triples into it and downstream components query it like a normal SPARQL dataset (entity retrieval, path/type discovery, sampling, etc.), with no file backing at all.
  • Writing: explained by sink type but unified in effect; the entity sink converts entities to triples, the link sink writes link triples, the triple sink adds triples directly, and all converge into the same single in-memory graph (see the sketch below).
  • Configuration: the one knob (“Clear graph before workflow execution”, default true) is treated as the semantic switch: either a fresh empty graph per run, or a longer-lived in-memory graph across runs within the same process.
  • Limitations and examples: stated as operational consequences (memory-bound, no persistence, best for small/medium intermediates and prototyping), with examples reinforcing the intended patterns: temporary integration graph, scratch experimentation area, small lookup store.
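A minimal sketch of the write-then-query pattern, assuming a plain Jena model as the single in-memory graph; `InMemoryScratchGraph`, `beforeRun`, and the other names are hypothetical and only mirror the semantics described above (SPARQL-mediated reads and writes, the clear-before-run switch).

```scala
import org.apache.jena.query.QueryExecutionFactory
import org.apache.jena.rdf.model.{Model, ModelFactory}
import org.apache.jena.update.UpdateAction

// Illustrative in-memory scratch graph; not the plugin's actual implementation.
class InMemoryScratchGraph(clearBeforeRun: Boolean = true) {

  // The single in-memory RDF model backing all sinks and all queries.
  private val model: Model = ModelFactory.createDefaultModel()

  // Called at the start of a workflow run: either start fresh or keep the previous triples.
  def beforeRun(): Unit =
    if (clearBeforeRun) model.removeAll()

  // All writes converge on the same graph; shown here as a SPARQL UPDATE for illustration.
  def writeTriple(s: String, p: String, o: String): Unit =
    UpdateAction.parseExecute(s"INSERT DATA { <$s> <$p> <$o> }", model)

  // Reads go through SPARQL as well; returns the triple count as a quick sanity check.
  def size(): Long = {
    val qExec = QueryExecutionFactory.create("SELECT (COUNT(*) AS ?n) { ?s ?p ?o }", model)
    try qExec.execSelect().next().getLiteral("n").getLong
    finally qExec.close()
  }
}
```

With `clearBeforeRun = false`, triples written in one run stay visible to the next run within the same process, which is exactly the trade-off the configuration knob controls.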

AlignmentDataset.md

Alignment is a write-only dataset that exports link results as Alignment files following the AlignAPI format specification (and the SWJ60 description).

The doc keeps scope tight from the start and then covers motivation, mechanics, and references:

  • Scope: the plugin exists to serialize links between entities in a standardized alignment format, not to read entities, run transformations, or do extra processing.
  • Motivation: separation of concerns and interoperability; a focused exporter that produces files consumable by alignment-aware tooling and usable in subsequent workflows.
  • Core mechanics, explained at the link-record level: each link becomes one <Cell> with an explicit source URI, a target URI, an optional relation (e.g., =), and an optional confidence measure (0.0–1.0); the plugin is responsible for emitting a well-formed file (structure, header/footer, UTF-8). A sketch follows below.
  • Example and references: a minimal example anchors how multiple links map to multiple <Cell> entries, and the references section points to the AlignAPI format spec and the SWJ60 paper for full semantics and edge-case details.
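To make the <Cell>-level structure concrete, here is a minimal sketch that writes such a file, assuming the usual AlignAPI RDF/XML layout (an Alignment header, one <Cell> per link, UTF-8 output); the `Link` case class and `writeAlignment` are illustrative names, not the plugin's actual API.

```scala
import java.io.{File, PrintWriter}
import java.nio.charset.StandardCharsets

// One link result: source entity, target entity, optional relation and confidence.
final case class Link(source: String, target: String, relation: String = "=", confidence: Option[Double] = None)

// Illustrative AlignAPI-style export; not the plugin's actual implementation.
object AlignmentExportSketch {

  // Each link becomes one <Cell> with entity1/entity2, a relation, and an optional measure.
  private def cell(link: Link): String = {
    val measure = link.confidence.map { c =>
      s"""        <measure rdf:datatype="http://www.w3.org/2001/XMLSchema#float">$c</measure>"""
    }
    (Seq(
      "    <map>",
      "      <Cell>",
      s"""        <entity1 rdf:resource="${link.source}"/>""",
      s"""        <entity2 rdf:resource="${link.target}"/>""",
      s"        <relation>${link.relation}</relation>"
    ) ++ measure ++ Seq(
      "      </Cell>",
      "    </map>"
    )).mkString("\n")
  }

  // Emit the header, one <Cell> per link, and the footer as a UTF-8 encoded file.
  def writeAlignment(links: Seq[Link], target: File): Unit = {
    val header = Seq(
      """<?xml version="1.0" encoding="utf-8"?>""",
      """<rdf:RDF xmlns="http://knowledgeweb.semanticweb.org/heterogeneity/alignment" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">""",
      "  <Alignment>",
      "    <xml>yes</xml>",
      "    <level>0</level>"
    )
    val footer = Seq("  </Alignment>", "</rdf:RDF>")
    val document = (header ++ links.map(cell) ++ footer).mkString("\n")
    val writer = new PrintWriter(target, StandardCharsets.UTF_8.name())
    try writer.println(document)
    finally writer.close()
  }
}
```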

@edufuga marked this pull request as ready for review December 16, 2025 15:40
@edufuga requested a review from robertisele December 16, 2025 15:40