Closed
Labels
question (Further information is requested)
Milestone
Description
Background
While replying to @timoschick's questions (via Slack DM), I got stuck on a question that may sound similar to what @cccntu asked in #12: what are the expected interfaces for us to collaborate?
Three Sub-questions
- mC4, one of my preferred datasets, comes with URLs and timestamps directly. Do we still want to have our own JSONL files?
- Some datasets for https://github.com/bigscience-workshop/evaluation may also work for us, e.g., GEM (those originating from Wikipedia or WikiHow), CRD3, WikiANN, etc. The situation in (1) appears again and may be a bit more complicated.
- Combining (1) and (2): suppose we also want to use some parts of https://github.com/bigscience-workshop/promptsource, especially the "applying templates" part. I noticed that https://github.com/bigscience-workshop/evaluation has been duplicating it, and my gut tells me such duplication can be avoided; I am just not sure whether we can get help from data-tooling.
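To make sub-question (1) concrete, here is a minimal sketch of what "our own JSONL files" could look like when the source records already carry a URL and a timestamp. The field names (`text`, `url`, `timestamp`), the sample records, the target `{"text", "metadata"}` layout, and the helper names are all assumptions for illustration, not an agreed format.

```python
import json
from typing import Dict, Iterable, Iterator

# Hypothetical stand-in for mC4-style records; the field names are assumed.
SAMPLE_RECORDS = [
    {"text": "Hello world.", "url": "https://example.com/a",
     "timestamp": "2020-01-01T00:00:00Z"},
    {"text": "Bonjour le monde.", "url": "https://example.com/b",
     "timestamp": "2020-02-01T00:00:00Z"},
]


def to_metadata_example(record: Dict[str, str]) -> Dict[str, object]:
    """Map one raw record to a hypothetical {"text", "metadata"} layout."""
    return {
        "text": record["text"],
        "metadata": {"url": record["url"], "timestamp": record["timestamp"]},
    }


def iter_examples(records: Iterable[Dict[str, str]]) -> Iterator[Dict[str, object]]:
    """Convert records lazily, so no second copy of the dataset is required."""
    for record in records:
        yield to_metadata_example(record)


def to_jsonl(records: Iterable[Dict[str, str]]) -> str:
    """Materialize the same examples as JSONL text, one JSON object per line."""
    return "\n".join(
        json.dumps(example, ensure_ascii=False) for example in iter_examples(records)
    )


if __name__ == "__main__":
    print(to_jsonl(SAMPLE_RECORDS))
```

If the mapping is this thin, `iter_examples` alone may be enough and the JSONL copy becomes optional; a separate file mostly pays off when the metadata has to be computed rather than copied.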
Thoughts
- Again, as @cccntu mentioned in "(WIP) Add loading script for arxiv dataset" #12, shall we go for quick solutions or not?
- To my limited knowledge, fastai has some callback-based transformations;
- Or we simply want to have our own copies of datasets.
- Perhaps we will have more than one Python package:
- The current one;
- The one that may be shared with promptsource, evaluation, metadata, and probably more?
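The two options under the first thought (callback-based transformation versus keeping our own copies of datasets) can be contrasted with a small sketch. This is plain Python and does not use fastai's actual API; `TransformPipeline`, the two callbacks, and the example fields are hypothetical names chosen only to illustrate the idea of a shared, on-the-fly transformation interface.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Example = Dict[str, object]
Transform = Callable[[Example], Example]


class TransformPipeline:
    """Hypothetical helper that applies registered callbacks to each example."""

    def __init__(self) -> None:
        self._callbacks: List[Transform] = []

    def register(self, callback: Transform) -> Transform:
        """Add a callback; returning it lets this method work as a decorator."""
        self._callbacks.append(callback)
        return callback

    def apply(self, examples: Iterable[Example]) -> Iterator[Example]:
        """Run every callback, in registration order, on each example lazily."""
        for example in examples:
            for callback in self._callbacks:
                example = callback(example)
            yield example


pipeline = TransformPipeline()


@pipeline.register
def add_url_prefix(example: Example) -> Example:
    # Illustrative only: prepend the URL metadata to the text.
    return {**example, "text": f"url: {example['url']} | {example['text']}"}


@pipeline.register
def drop_url_field(example: Example) -> Example:
    # Remove the raw field once it has been folded into the text.
    return {key: value for key, value in example.items() if key != "url"}


if __name__ == "__main__":
    source = [{"text": "Hello", "url": "https://example.com"}]
    for transformed in pipeline.apply(source):
        print(transformed)
```

With an interface like this, other packages could share the callbacks without sharing copies of the data; the trade-off is that every consumer pays the transformation cost at load time.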
Side Notes
- The title of this issue follows https://github.com/angular/angular/blob/master/CONTRIBUTING.md#commit , just my habit;
- This sometimes overlaps with GitHub's default labels, and yet in my opinion, labels should serve a different purpose (e.g., search filters);
- The milestone and the project attached to this issue ticket are also just my habit... My apologies;
- I took the liberty of assigning some of us; pardon me for the intrusion or any unintended exclusion;
- Perhaps we will also need some issue ticket templates and a code owners file (and a PR review policy?).
changjonathanc