CLI tool to create a synthetic dataset with LLMs.
This tool uses octoai.cloud as its LLM-as-a-service provider.
To use dataforge directly with poetry, install the dependencies once:
poetry install
dataforge needs to call octoai.cloud, so you have to create a .env file containing your API token:
OCTOAI_TOKEN=your_token
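Before launching a run, it can help to confirm that the token is actually visible. The following is a minimal standard-library sketch, not part of dataforge: the helper names read_env_file and token_is_configured are hypothetical, and it assumes the .env file uses plain KEY=value lines as shown above.

```python
import os
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=value lines from a .env file into a dict (hypothetical helper)."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def token_is_configured(path=".env"):
    """Return True if OCTOAI_TOKEN is set in the environment or in the .env file."""
    token = os.environ.get("OCTOAI_TOKEN") or read_env_file(path).get("OCTOAI_TOKEN")
    return bool(token)


if __name__ == "__main__":
    print("OCTOAI_TOKEN found" if token_is_configured() else "OCTOAI_TOKEN missing")
```

Run from the project root, this only reports whether a token is present; it does not check that octoai.cloud accepts it.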
First, you can check the number of calls and the estimated cost of a run with the --dry-run/-n option:
poetry run dataforge -n --lang=en --chunk-size=2000 --chunk-overlap=100 data/IPCC_AR6_SYR_FullVolume.txt
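To get a feel for how --chunk-size and --chunk-overlap influence the number of calls, here is a rough sketch. It assumes a plain sliding window over characters and one LLM call per chunk; dataforge may split the text differently (for example on paragraph boundaries), so the dry run remains the authoritative figure. The function name estimate_chunk_count is hypothetical.

```python
import math


def estimate_chunk_count(text_length, chunk_size=2000, chunk_overlap=100):
    """Estimate how many chunks a sliding window produces (assumption: fixed-size character windows)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1
    # Each chunk after the first advances by chunk_size - chunk_overlap characters.
    step = chunk_size - chunk_overlap
    return 1 + math.ceil((text_length - chunk_size) / step)


if __name__ == "__main__":
    # Example with a hypothetical 100,000-character input file.
    print(estimate_chunk_count(100_000))
```

Under these assumptions, a larger overlap shrinks the step between chunks and so increases the number of chunks, and with it the number of calls.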
When you are ready, run the same command without -n to make the actual calls:
poetry run dataforge --lang=en --chunk-size=2000 --chunk-overlap=100 data/IPCC_AR6_SYR_FullVolume.txt