Datasets power two parts of the Synth workflow:
  1. Task metadata datasets advertised by your task app so users can choose seeds, difficulty tiers, or asset bundles when launching jobs.
  2. Training datasets uploaded as JSONL files when running uvx synth-ai train --type sft or seeding RL jobs with offline rollouts.
Use the patterns below to keep both in sync with the hosted SDK.

1. Advertise datasets via TaskInfo

  • Inside TaskAppConfig, set task_info.datasets to a TaskDatasetRegistry (see Crafter at synth_ai/task/apps/grpo_crafter.py).
  • Each entry defines an id, display_name, optional description, and metadata such as default seed lists or storage URIs.
  • /task_info exposes these datasets to the CLI, so they appear in pickers when someone runs uvx synth-ai train --type rl or evaluation scripts.
  • Keep the registry lightweight: store IDs and URLs, fetch large assets (S3, GCS, HTTPS) on demand inside the task app.
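The bullets above can be sketched as follows. This is a minimal, self-contained approximation of the registry shape: the real `TaskDatasetRegistry` lives in the SDK (see `synth_ai/task/apps/grpo_crafter.py`), so the dataclasses, field names, and `register` method here are illustrative stand-ins, not the SDK's exact API.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for the SDK's TaskDatasetRegistry entry shape;
# the real class may expose a different constructor and fields.
@dataclass
class DatasetEntry:
    id: str
    display_name: str
    description: str = ""
    metadata: dict = field(default_factory=dict)

@dataclass
class TaskDatasetRegistry:
    entries: dict[str, DatasetEntry] = field(default_factory=dict)

    def register(self, entry: DatasetEntry) -> None:
        self.entries[entry.id] = entry

registry = TaskDatasetRegistry()
registry.register(DatasetEntry(
    id="crafter-easy-v1",
    display_name="Crafter (easy tier)",
    description="Seeds curated for short-horizon curricula.",
    # Keep metadata lightweight: seed lists and storage URIs, not the
    # assets themselves -- fetch those on demand inside the task app.
    metadata={"default_seeds": [1, 2, 3], "asset_uri": "s3://bucket/crafter/v1"},
))
```

The key design point is that the registry carries only identifiers and pointers; anything large stays in object storage until a rollout actually needs it.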

2. Map seeds to rich metadata

  • Implement provide_task_instances to translate dataset seed IDs into contextual information (difficulty, scenario tags, curriculum steps).
  • Return TaskInstanceInfo objects with user-friendly labels; the CLI prints them before launching jobs and traces preserve them for analytics.
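A sketch of what `provide_task_instances` might do, assuming a simplified `TaskInstanceInfo` shape. The field names and the seed-to-difficulty mapping below are illustrative assumptions, not the SDK's schema.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the SDK's TaskInstanceInfo; illustrative only.
@dataclass
class TaskInstanceInfo:
    seed: int
    label: str
    metadata: dict

# Example mapping from seeds to curriculum metadata (assumed, not from the SDK).
DIFFICULTY_BY_SEED = {1: "easy", 2: "medium", 3: "hard"}

def provide_task_instances(seeds):
    """Translate raw seed IDs into labeled instances for the CLI picker."""
    return [
        TaskInstanceInfo(
            seed=s,
            label=f"seed {s} ({DIFFICULTY_BY_SEED.get(s, 'unknown')})",
            metadata={"difficulty": DIFFICULTY_BY_SEED.get(s, "unknown")},
        )
        for s in seeds
    ]
```

The labels are what users see before launching a job, and the metadata dict is what traces carry forward for later analytics.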

3. Prepare supervised JSONL files

  • SFT jobs expect newline-delimited JSON following the schema defined in synth_ai/api/train/jsonl_schema.py (prompt/completion pairs plus optional metadata).
  • Before upload, the CLI validates each file with validate_sft_jsonl; fix any schema errors locally so jobs don’t fail after submission.
  • When you run uvx synth-ai train --type sft, the CLI uploads the files to Synth’s backend and stores their IDs in the job payload (train_file_id, validation_file_id).
  • Use --examples N to run smoke tests; the CLI copies the first N rows to a temporary file via limit_jsonl_examples to save time.
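A minimal sketch of the validate-and-limit pattern described above. This mirrors the idea behind `validate_sft_jsonl` and `limit_jsonl_examples`, but the check here is a simplification: the real schema in `synth_ai/api/train/jsonl_schema.py` may accept additional fields and formats.

```python
import json

def validate_sft_jsonl_line(line: str) -> dict:
    """Simplified row check: each line must be a JSON object with
    string prompt/completion fields (optional metadata is allowed)."""
    row = json.loads(line)
    for key in ("prompt", "completion"):
        if not isinstance(row.get(key), str):
            raise ValueError(f"row missing string field {key!r}")
    return row

def limit_examples(lines: list[str], n: int) -> list[str]:
    """Keep only the first n rows, as --examples N does for smoke tests."""
    return lines[:n]

rows = [
    '{"prompt": "2+2=", "completion": "4"}',
    '{"prompt": "capital of France?", "completion": "Paris"}',
]
validated = [validate_sft_jsonl_line(line) for line in limit_examples(rows, 1)]
```

Running checks like this locally before `uvx synth-ai train --type sft` means schema problems surface in seconds rather than after a job is submitted.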

4. Generate datasets from hosted traces

  • Enable tracing for your task app (set TASKAPP_TRACING_ENABLED=1 in Modal or use CLI flags during deploy).
  • After RL or SFT runs, download the trace database and convert it with examples/warming_up_to_rl/export_trace_sft.py to produce high-quality JSONL data.
  • Store exported files in a versioned datasets/ directory so _discover_dataset_candidates can suggest them automatically the next time you call uvx synth-ai train --type sft.
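The export step can be sketched as below. This assumes the trace database is SQLite with a hypothetical `traces` table holding prompt, completion, and reward columns; the actual schema used by `examples/warming_up_to_rl/export_trace_sft.py` may differ, so treat this as the shape of the transformation, not the script itself.

```python
import json
import sqlite3

def export_traces_to_jsonl(db_path: str, out_path: str, min_reward: float = 0.0) -> int:
    """Convert trace rows into SFT JSONL, keeping only rollouts at or
    above a reward threshold. Returns the number of rows written."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT prompt, completion, reward FROM traces WHERE reward >= ?",
        (min_reward,),
    )
    written = 0
    with open(out_path, "w") as f:
        for prompt, completion, reward in rows:
            f.write(json.dumps({
                "prompt": prompt,
                "completion": completion,
                "metadata": {"reward": reward},
            }) + "\n")
            written += 1
    conn.close()
    return written
```

Filtering on reward at export time is what makes trace-derived datasets "high-quality": only successful rollouts become supervised examples.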

5. Reference datasets in TOML configs

  • SFT configs (e.g., examples/warming_up_to_rl/configs/crafter_fft.toml) point to dataset paths under [training.dataset]. Use relative paths so teammates can reuse the config.
  • RL configs reference dataset IDs under [data] and [training.rollouts]; these must match the IDs you advertised in TaskDatasetRegistry.
  • The CLI merges overrides (--dataset, --task-url, --model) with TOML values inside build_sft_payload and build_rl_payload, ensuring jobs stay reproducible.
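The override-merging behavior can be sketched as follows, in the spirit of `build_sft_payload` and `build_rl_payload`. The merge function and the example values are assumptions for illustration; the SDK's actual merge logic may handle nesting and validation differently.

```python
def merge_config(toml_values: dict, overrides: dict) -> dict:
    """CLI flags win over TOML values, but only when explicitly set.
    Unset flags (None) fall back to what the config file declares."""
    merged = dict(toml_values)
    for key, value in overrides.items():
        if value is not None:
            merged[key] = value
    return merged

# Hypothetical values: a TOML config with a relative dataset path, and a
# CLI invocation that overrides only the dataset.
toml_values = {"dataset": "datasets/crafter_sft.jsonl", "model": "Qwen/Qwen2.5-7B"}
overrides = {"dataset": "datasets/crafter_sft_v2.jsonl", "model": None}
payload = merge_config(toml_values, overrides)
```

Because the TOML file remains the source of truth for everything not overridden, the resulting job payload stays reproducible: rerunning with the same config and flags yields the same inputs.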

6. Keep documentation in sync

  • Add README snippets next to your dataset files explaining schema, provenance, and contact points.
  • When deprecating a dataset, remove it from TaskDatasetRegistry or mark it clearly so users do not launch jobs with stale assets.
  • For large artifacts, prefer object storage and reference signed URLs or versioned keys; the task app can download them just-in-time during rollouts.
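A just-in-time fetch for large artifacts might look like the sketch below. The helper name, cache layout, and use of a versioned key as the cache filename are all assumptions; the point is that a versioned key makes the download idempotent, so repeated rollouts reuse the local copy.

```python
import os
import urllib.request

def fetch_artifact(signed_url: str, cache_dir: str, version_key: str) -> str:
    """Download a versioned artifact from a signed URL only if it is
    not already cached locally; return the local path either way."""
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, version_key)
    if not os.path.exists(local_path):
        urllib.request.urlretrieve(signed_url, local_path)
    return local_path
```

Keying the cache on a versioned name (rather than the bare filename) means publishing a new dataset version never silently reuses stale bytes.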
By pairing accurate task metadata with well-structured JSONL files, you make it effortless for Synth’s CLI and backend to surface your datasets, launch jobs, and reproduce results across the team.