docs: what if we also want to separate metadata from their target (source?) of raw data? #17

@tianjianjiang

Background

While replying to @timoschick's questions (via Slack DM), I got stuck on a question that may sound similar to what @cccntu asked in #12: what are the expected interfaces for us to collaborate?

Three Sub-questions

  1. mC4, one of my preferred datasets, comes with URLs and timestamps directly. Do we still want to have our own JSONL files?
  2. Some datasets for https://github.com/bigscience-workshop/evaluation may also work for us, e.g., GEM (those originating from Wikipedia or WikiHow), CRD3, WikiANN, etc. The situation in (1) appears again and may be a bit more complicated.
  3. Combining (1) and (2), suppose we also want to use some parts of https://github.com/bigscience-workshop/promptsource, especially the "applying templates" part. I noticed that https://github.com/bigscience-workshop/evaluation has been duplicating it, and my gut tells me such duplication can be avoided; I am just not sure whether we can get help from data-tooling.
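
To make sub-question 1 concrete, here is a minimal sketch of keeping metadata in its own JSONL file that points back at the raw data instead of copying it. It assumes mC4-style records with `text`, `url`, and `timestamp` fields. The `split_record` helper, the `source`/`index` pointer fields, and the sample record are hypothetical illustrations, not an agreed interface.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Tuple

Record = Dict[str, Any]

# Assumption: these are the metadata fields carried by an mC4-style record.
METADATA_KEYS = ("url", "timestamp")


def split_record(record: Record, source: str, index: int) -> Tuple[str, Record]:
    """Split one record into its raw text and a metadata row pointing back at it."""
    metadata = {key: record[key] for key in METADATA_KEYS if key in record}
    # Hypothetical pointer fields: which dataset and which row the metadata describes.
    metadata["source"] = source
    metadata["index"] = index
    return record["text"], metadata


def to_metadata_jsonl(records: Iterable[Record], source: str) -> Iterator[str]:
    """Yield one JSON line of metadata per record, without duplicating the text."""
    for index, record in enumerate(records):
        _, metadata = split_record(record, source, index)
        yield json.dumps(metadata, ensure_ascii=False, sort_keys=True)


# Made-up sample record, for illustration only.
sample = [
    {
        "text": "Hello world.",
        "url": "https://example.com/a",
        "timestamp": "2020-01-01T00:00:00Z",
    },
]
metadata_lines = list(to_metadata_jsonl(sample, source="mc4/en"))
```

With this layout the raw dataset stays where it is, and the metadata file only needs to agree with it on the `source` and `index` convention. Whether that is preferable to a full copy is exactly the open question above.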

Thoughts

  • Again, as @cccntu mentioned in "(WIP) Add loading script for arxiv dataset" (#12), shall we go for quick solutions or not?
    • To my limited knowledge, fastai has some callback-based transformations;
    • Or we simply want to have our own copies of datasets.
  • Perhaps we will have more than one Python package.
    • The current one;
    • The one that may be shared with promptsource, evaluation, metadata, and probably more?

Side Notes

  • The title of this issue follows https://github.com/angular/angular/blob/master/CONTRIBUTING.md#commit , just my habit;
    • Which sometimes overlaps with GitHub's default labels, and yet in my opinion, labels should serve a different purpose (e.g., search filters);
  • The milestone and the project attached to this issue ticket are also just my habit... My apologies;
  • I took the liberty of assigning some of us; pardon me for the intrusion or any unintended exclusion;
  • Perhaps we will also need some issue ticket templates and a code owners file (and a PR review policy?).

Metadata

Labels

question: Further information is requested
