- Task metadata datasets advertised by your task app so users can choose seeds, difficulty tiers, or asset bundles when launching jobs.
- Training datasets uploaded as JSONL files when running `uvx synth-ai train --type sft` or seeding RL jobs with offline rollouts.
1. Advertise datasets via TaskInfo
- Inside `TaskAppConfig`, set `task_info.datasets` to a `TaskDatasetRegistry` (see Crafter at `synth_ai/task/apps/grpo_crafter.py`).
- Each entry defines an `id`, `display_name`, optional `description`, and metadata such as default seed lists or storage URIs; a rough sketch follows this list.
- `/task_info` exposes these datasets to the CLI, so they appear in pickers when someone runs `uvx synth-ai train --type rl` or evaluation scripts.
- Keep the registry lightweight: store IDs and URLs, and fetch large assets (S3, GCS, HTTPS) on demand inside the task app.
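A minimal sketch of what a registry entry might carry is shown below. The `DatasetEntry`/`DatasetRegistry` classes and their fields are stand-ins for illustration, not the real API; check `synth_ai/task/apps/grpo_crafter.py` for how `TaskDatasetRegistry` is actually constructed.

```python
# Illustrative stand-ins only: the real TaskDatasetRegistry lives in synth_ai;
# these dataclasses just show the kind of information each entry carries.
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    id: str
    display_name: str
    description: str = ""
    metadata: dict = field(default_factory=dict)


@dataclass
class DatasetRegistry:
    entries: list[DatasetEntry] = field(default_factory=list)

    def register(self, entry: DatasetEntry) -> None:
        self.entries.append(entry)


registry = DatasetRegistry()
registry.register(
    DatasetEntry(
        id="crafter-easy-v1",
        display_name="Crafter (easy tier)",
        description="Short-horizon seeds for smoke tests",
        metadata={
            "default_seeds": list(range(100)),
            "assets_uri": "s3://example-bucket/crafter/easy/",  # fetched on demand
        },
    )
)
# The registry would then be attached via TaskAppConfig's task_info.datasets.
```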
2. Map seeds to rich metadata
- Implement `provide_task_instances` to translate dataset seed IDs into contextual information (difficulty, scenario tags, curriculum steps); see the sketch after this list.
- Return `TaskInstanceInfo` objects with user-friendly labels; the CLI prints them before launching jobs, and traces preserve them for analytics.
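The mapping itself can be as simple as deriving tags from the seed. The snippet below is a hedged sketch: the actual `provide_task_instances` hook and `TaskInstanceInfo` fields in synth_ai may differ, so the signature and attributes here are assumptions.

```python
# Hedged sketch of seed-to-metadata mapping; the real provide_task_instances
# signature and TaskInstanceInfo fields may differ from these assumptions.
from dataclasses import dataclass


@dataclass
class TaskInstanceInfo:
    seed: int
    label: str
    metadata: dict


def provide_task_instances(dataset_id: str, seeds: list[int]) -> list[TaskInstanceInfo]:
    """Translate raw seeds into labels and curriculum tags the CLI can print."""
    instances = []
    for seed in seeds:
        difficulty = "easy" if seed < 500 else "hard"
        instances.append(
            TaskInstanceInfo(
                seed=seed,
                label=f"{dataset_id} / seed {seed} ({difficulty})",
                metadata={"difficulty": difficulty, "curriculum_step": seed % 10},
            )
        )
    return instances
```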
3. Prepare supervised JSONL files
- SFT jobs expect newline-delimited JSON following the schema defined in `synth_ai/api/train/jsonl_schema.py` (prompt/completion pairs plus optional metadata); a rough example appears below.
- Before upload, the CLI validates each file with `validate_sft_jsonl`; fix any schema errors locally so jobs don’t fail post-submission.
- When you run `uvx synth-ai train --type sft`, the CLI uploads the files to Synth’s backend and stores their IDs in the job payload (`train_file_id`, `validation_file_id`).
- Use `--examples N` to run smoke tests; the CLI copies the first `N` rows to a temporary file via `limit_jsonl_examples` to save time.
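For orientation, here is a rough sketch of writing such a file and sanity-checking it locally. The `prompt`/`completion`/`metadata` keys are an assumed shape; defer to `jsonl_schema.py` and `validate_sft_jsonl` for the authoritative schema.

```python
# Rough sketch of producing SFT JSONL rows and sanity-checking them locally.
# The prompt/completion/metadata keys are an assumed shape; the authoritative
# schema lives in synth_ai/api/train/jsonl_schema.py.
import json
from pathlib import Path

rows = [
    {"prompt": "You wake up in a forest.", "completion": "Chop a tree to collect wood.", "metadata": {"seed": 7}},
    {"prompt": "Night is falling and health is low.", "completion": "Build a shelter and place a torch.", "metadata": {"seed": 12}},
]

path = Path("datasets/crafter_sft_v1.jsonl")
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("w", encoding="utf-8") as fh:
    for row in rows:
        fh.write(json.dumps(row, ensure_ascii=False) + "\n")

# Cheap local check before the CLI's own validation: every line must parse and
# carry non-empty text fields.
for lineno, line in enumerate(path.read_text(encoding="utf-8").splitlines(), start=1):
    obj = json.loads(line)
    assert obj.get("prompt") and obj.get("completion"), f"line {lineno}: missing prompt/completion"
```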
4. Generate datasets from hosted traces
- Enable tracing for your task app (set `TASKAPP_TRACING_ENABLED=1` in Modal or use CLI flags during deploy).
- After RL or SFT runs, download the trace database and convert it with `examples/warming_up_to_rl/export_trace_sft.py` to produce high-quality JSONL data.
- Store exported files in a versioned `datasets/` directory so `_discover_dataset_candidates` can suggest them automatically the next time you call `uvx synth-ai train --type sft`; a sketch of that layout follows.
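One convention that works well is version-suffixed files under `datasets/`. The helper below only illustrates that layout and mimics the idea behind `_discover_dataset_candidates`; it is not the real implementation.

```python
# Illustrative helper for a versioned datasets/ layout; it mimics the idea
# behind _discover_dataset_candidates but is not the real implementation.
from pathlib import Path


def discover_jsonl_candidates(root: str = "datasets") -> list[Path]:
    """Return non-empty exported JSONL files, newest version name first."""
    candidates = sorted(Path(root).glob("**/*.jsonl"), reverse=True)
    return [p for p in candidates if p.stat().st_size > 0]


# e.g. datasets/crafter_sft_v3.jsonl sorts ahead of datasets/crafter_sft_v2.jsonl
for candidate in discover_jsonl_candidates():
    print(f"candidate dataset: {candidate}")
```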
5. Reference datasets in TOML configs
- SFT configs (e.g., `examples/warming_up_to_rl/configs/crafter_fft.toml`) point to dataset paths under `[training.dataset]`. Use relative paths so teammates can reuse the config.
- RL configs reference dataset IDs under `[data]` and `[training.rollouts]`; these must match the IDs you advertised in `TaskDatasetRegistry`.
- The CLI merges overrides (`--dataset`, `--task-url`, `--model`) with TOML values inside `build_sft_payload` and `build_rl_payload`, ensuring jobs stay reproducible; the sketch below shows the precedence rule.
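Conceptually, the precedence is "explicit CLI flag wins, otherwise the TOML value stands." The toy snippet below illustrates that rule; the section and key names are assumptions, and this is not the actual `build_sft_payload` logic.

```python
# Toy illustration of "CLI flag wins over TOML value"; section/key names are
# assumptions and this is not the actual build_sft_payload implementation.
import tomllib  # Python 3.11+


def merge_overrides(config_path: str, overrides: dict[str, str | None]) -> dict:
    with open(config_path, "rb") as fh:
        merged = tomllib.load(fh)
    training = merged.setdefault("training", {})
    for key, value in overrides.items():
        if value is not None:  # only flags the user actually passed replace TOML values
            training[key] = value
    return merged


payload = merge_overrides(
    "examples/warming_up_to_rl/configs/crafter_fft.toml",
    {"dataset": "datasets/crafter_sft_v1.jsonl", "model": None},  # --dataset given, --model omitted
)
```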
6. Keep documentation in sync
- Add README snippets next to your dataset files explaining schema, provenance, and contact points.
- When deprecating a dataset, remove it from `TaskDatasetRegistry` or mark it clearly so users do not launch jobs with stale assets.
- For large artifacts, prefer object storage and reference signed URLs or versioned keys; the task app can download them just-in-time during rollouts (see the sketch below).
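A generic cached-download helper is usually all the just-in-time pattern needs. The URL, filename, and cache path below are placeholders, and nothing here is part of the synth-ai API.

```python
# Generic just-in-time fetch with a local cache; URL, filename, and cache path
# are placeholders rather than anything provided by synth-ai.
import urllib.request
from pathlib import Path

CACHE_DIR = Path("/tmp/task_assets")


def fetch_asset(signed_url: str, filename: str) -> Path:
    """Download a large artifact once and reuse the cached copy on later rollouts."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    target = CACHE_DIR / filename
    if not target.exists():
        urllib.request.urlretrieve(signed_url, target)
    return target


# Example (placeholder URL): resolve a versioned bundle right before a rollout needs it.
# bundle = fetch_asset("https://storage.example.com/crafter/assets-v3.tar.gz?sig=...", "assets-v3.tar.gz")
```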