Conversation

@edufuga commented Nov 24, 2025

Silk — Improve plugin documentation (second batch)

https://jira.eccenca.com/browse/CMEM-7013

This PR adds documentation for the following Silk dataset plugins:

  • RDF file dataset (local RDF file + ZIP ingestion, format/graph handling, in-memory limits, optional N-Triples output).
  • In-memory dataset (embedded RDF store for temporary workflow graphs, SPARQL-based read/write, lifecycle + “clear before run” semantics).
  • Alignment dataset (write-only export of link results as Alignment files following the AlignAPI specification, with <Cell>-level structure and optional relation/measure).

RdfFileDataset.md

RDF file reads RDF data from a local file (or ZIP archive) into the project as an in-memory dataset and, for supported formats, can also write RDF back to a file.

The doc starts with the intended usage window and then works through data shape, configuration, behavior, and limitations:

  • Usage window: small/medium files, snapshots for exploration/mapping/linking, and simple export, with the hard constraint flagged immediately: everything is loaded into memory, so very large files belong in an external store.
  • Data shape and IO: single file vs. ZIP input (plus the regex gate for which ZIP entries are considered), dataset output as queryable graph(s), and the graph-selection rule (named graph only where the chosen format supports it; otherwise the default graph, with the graph parameter ignored for graph-less formats).
  • Configuration notes that focus on how to think, not just what to fill in: file/ZIP behavior, format auto-detection (and the “can’t detect → error” path), the write restriction (only N-Triples as output), advanced narrowing via an entity list, and ZIP entry filtering via regex.
  • Behavior as a sequence you can predict (see the sketch below): size check → parse into an in-memory dataset (default plus possibly named graphs) → select graph → serve repeated reads from memory until the underlying file timestamp changes → reload on next access; the write path serializes as N-Triples only.
  • Limitations plus “when to use” guidance and concrete examples (simple Turtle, N-Quads with an explicit graph, a ZIP with multiple RDF files).
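To make the load/select/write path easier to picture, here is a minimal sketch that uses Apache Jena as a stand-in for the in-memory store; the `RdfFileSketch` object and its method names are illustrative only (not the plugin's actual code), and the size check and timestamp-based reload are omitted.

```scala
import java.io.{File, FileOutputStream}
import org.apache.jena.query.{Dataset, DatasetFactory}
import org.apache.jena.rdf.model.Model
import org.apache.jena.riot.{Lang, RDFDataMgr, RDFLanguages}

// Illustrative load/select/write cycle; not the plugin's actual implementation.
object RdfFileSketch {

  // Parse the file into an in-memory dataset, failing if the format cannot be detected.
  def load(file: File): Dataset = {
    val lang = Option(RDFLanguages.filenameToLang(file.getName))
      .getOrElse(sys.error(s"Cannot detect RDF format of ${file.getName}"))
    val dataset = DatasetFactory.create() // in-memory: default graph plus named graphs
    RDFDataMgr.read(dataset, file.getAbsolutePath, lang)
    dataset
  }

  // Graph selection: a named graph only if the parsed data contains it, otherwise the default graph.
  def selectGraph(dataset: Dataset, graphUri: Option[String]): Model =
    graphUri match {
      case Some(uri) if dataset.containsNamedModel(uri) => dataset.getNamedModel(uri)
      case _                                            => dataset.getDefaultModel
    }

  // Write path: output is serialized as N-Triples only.
  def writeNTriples(model: Model, target: File): Unit = {
    val out = new FileOutputStream(target)
    try RDFDataMgr.write(out, model, Lang.NTRIPLES)
    finally out.close()
  }
}
```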

InMemoryDataset.md

In-memory dataset is a small embedded RDF store that keeps all data in memory and exposes it via SPARQL as a temporary working graph inside workflows.

The doc frames it as a deliberately non-persistent scratch graph and explains its role inside workflows:

  • Framing: one in-memory RDF model, all reads and writes mediated through a SPARQL endpoint, and an empty state after application restart.
  • Workflow role: explicitly bidirectional, usable as both source and sink, so upstream components can write entities/links/triples into it and downstream components query it like a normal SPARQL dataset (entity retrieval, path/type discovery, sampling, etc.), with no file backing at all.
  • Writing: explained by sink type but unified in effect; the entity sink converts entities to triples, the link sink writes link triples, the triple sink adds triples directly, and all converge into the same single in-memory graph (see the sketch below).
  • Configuration: the one knob (“Clear graph before workflow execution”, default true) is treated as the semantic switch: either a fresh empty graph per run, or a longer-lived in-memory graph across runs within the same process.
  • Limitations and examples: stated as operational consequences (memory-bound, no persistence, best for small/medium intermediates and prototyping), with examples reinforcing the intended patterns: temporary integration graph, scratch experimentation area, small lookup store.
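A minimal sketch of the write-then-query pattern, assuming a plain Jena model as the single in-memory graph; `InMemoryScratchGraph`, `beforeRun`, and the other names are hypothetical and only mirror the semantics described above (SPARQL-mediated reads and writes, the clear-before-run switch).

```scala
import org.apache.jena.query.QueryExecutionFactory
import org.apache.jena.rdf.model.{Model, ModelFactory}
import org.apache.jena.update.UpdateAction

// Illustrative in-memory scratch graph; not the plugin's actual implementation.
class InMemoryScratchGraph(clearBeforeRun: Boolean = true) {

  // The single in-memory RDF model backing all sinks and all queries.
  private val model: Model = ModelFactory.createDefaultModel()

  // Called at the start of a workflow run: either start fresh or keep the previous triples.
  def beforeRun(): Unit =
    if (clearBeforeRun) model.removeAll()

  // All writes converge on the same graph; shown here as a SPARQL UPDATE for illustration.
  def writeTriple(s: String, p: String, o: String): Unit =
    UpdateAction.parseExecute(s"INSERT DATA { <$s> <$p> <$o> }", model)

  // Reads go through SPARQL as well; returns the triple count as a quick sanity check.
  def size(): Long = {
    val qExec = QueryExecutionFactory.create("SELECT (COUNT(*) AS ?n) { ?s ?p ?o }", model)
    try qExec.execSelect().next().getLiteral("n").getLong
    finally qExec.close()
  }
}
```

With `clearBeforeRun = false`, triples written in one run stay visible to the next run within the same process, which is exactly the trade-off the configuration knob controls.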

AlignmentDataset.md

Alignment is a write-only dataset that exports link results as Alignment files following the AlignAPI format specification (and the SWJ60 description).

The doc keeps scope tight from the start and then covers motivation, mechanics, and references:

  • Scope: the plugin exists to serialize links between entities in a standardized alignment format, not to read entities, run transformations, or do extra processing.
  • Motivation: separation of concerns and interoperability; a focused exporter that produces files consumable by alignment-aware tooling and usable in subsequent workflows.
  • Core mechanics, explained at the link-record level: each link becomes one <Cell> with an explicit source URI, a target URI, an optional relation (e.g., =), and an optional confidence measure (0.0–1.0); the plugin is responsible for emitting a well-formed file (structure, header/footer, UTF-8). A sketch follows below.
  • Example and references: a minimal example anchors how multiple links map to multiple <Cell> entries, and the references section points to the AlignAPI format spec and the SWJ60 paper for full semantics and edge-case details.
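To make the <Cell>-level structure concrete, here is a minimal sketch that writes such a file, assuming the usual AlignAPI RDF/XML layout (an Alignment header, one <Cell> per link, UTF-8 output); the `Link` case class and `writeAlignment` are illustrative names, not the plugin's actual API.

```scala
import java.io.{File, PrintWriter}
import java.nio.charset.StandardCharsets

// One link result: source entity, target entity, optional relation and confidence.
final case class Link(source: String, target: String, relation: String = "=", confidence: Option[Double] = None)

// Illustrative AlignAPI-style export; not the plugin's actual implementation.
object AlignmentExportSketch {

  // Each link becomes one <Cell> with entity1/entity2, a relation, and an optional measure.
  private def cell(link: Link): String = {
    val measure = link.confidence.map { c =>
      s"""        <measure rdf:datatype="http://www.w3.org/2001/XMLSchema#float">$c</measure>"""
    }
    (Seq(
      "    <map>",
      "      <Cell>",
      s"""        <entity1 rdf:resource="${link.source}"/>""",
      s"""        <entity2 rdf:resource="${link.target}"/>""",
      s"        <relation>${link.relation}</relation>"
    ) ++ measure ++ Seq(
      "      </Cell>",
      "    </map>"
    )).mkString("\n")
  }

  // Emit the header, one <Cell> per link, and the footer as a UTF-8 encoded file.
  def writeAlignment(links: Seq[Link], target: File): Unit = {
    val header = Seq(
      """<?xml version="1.0" encoding="utf-8"?>""",
      """<rdf:RDF xmlns="http://knowledgeweb.semanticweb.org/heterogeneity/alignment" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">""",
      "  <Alignment>",
      "    <xml>yes</xml>",
      "    <level>0</level>"
    )
    val footer = Seq("  </Alignment>", "</rdf:RDF>")
    val document = (header ++ links.map(cell) ++ footer).mkString("\n")
    val writer = new PrintWriter(target, StandardCharsets.UTF_8.name())
    try writer.println(document)
    finally writer.close()
  }
}
```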

@edufuga marked this pull request as ready for review December 16, 2025 15:40
@edufuga requested a review from robertisele December 16, 2025 15:40