mirror of https://github.com/NomaDamas/k-skill.git
synced 2026-06-24 02:04:11 +00:00
parent 6831b3147e
commit 5b08b4c86e
7 changed files with 168 additions and 10 deletions

@@ -8,7 +8,7 @@ Source tree for **k-skill-qa-bot**, an automated QA daemon for the k-skill repos
 - Every 3 days (launchd LaunchAgent), the daemon:
 1. Refreshes a shallow clone of `NomaDamas/k-skill` `main`.
 2. Discovers every `<skill>/SKILL.md`, classifies each skill (read-only / location / login / destructive / api-key / proxy-dependent / deprecated).
-3. Runs each suitable skill through `codex exec` (read-only sandbox) with a smoke-test prompt synthesized from the skill's `## When to use`.
+3. Runs each suitable skill through `codex exec --dangerously-bypass-approvals-and-sandbox` with a smoke-test prompt synthesized from the skill's `## When to use`, while keeping the separate LLM judge on a read-only/no-approval Codex path.
 4. An LLM judge (`codex exec --output-schema`) grades pass / fail / skip.
 5. Failed skills are filed as dedup'd issues on `NomaDamas/k-skill`. Skipped skills (login required, deprecated, missing API key) never create issues.

@@ -18,6 +18,13 @@ After running `install.sh`, the runtime lives at `~/.local/share/k-skill-qa-bot/

 The k-skill repository itself is **never modified** by the bot — it is read-only SSOT. Test prompts are synthesized from each `SKILL.md`.

+## Trust-boundary notes
+
+- Smoke tests intentionally run unsandboxed and may contact public skill endpoints, plus git, Codex, GitHub, and k-skill-proxy health-check endpoints.
+- A dedicated LaunchAgent is scheduling isolation only; it is not a separate OS user, container, or filesystem sandbox.
+- The bot-managed clone is not write-protected from the unsandboxed smoke agent; treat it as mutable bot state rather than a write-protected filesystem boundary.
+- The judge uses read-only/no-approval Codex settings, but is still a tool-capable Codex agent over untrusted transcripts and skill Markdown. Do not describe it as a no-tools or file-isolated model call unless the implementation changes to enforce that boundary.
+
 ## Design rules

 - **SSOT**: All test prompts and skill metadata come from `SKILL.md` files in the bot's own shallow clone of `NomaDamas/k-skill` `main`. The k-skill repo gets no QA-bot-specific edits.

@@ -1,14 +1,14 @@
 # k-skill-qa-bot

-Automated QA daemon for the **k-skill** skill library. Runs every 3 days via macOS launchd, tests every skill via `codex exec --json --sandbox read-only`, has an LLM judge grade pass/fail/skip, and files dedup'd GitHub issues for skills that have broken.
+Automated QA daemon for the **k-skill** skill library. Runs every 3 days via macOS launchd, tests every suitable skill via `codex exec --json --dangerously-bypass-approvals-and-sandbox`, has a read-only/no-approval LLM judge grade pass/fail/skip, and files dedup'd GitHub issues for skills that have broken.

 ## What it does

 1. **Refreshes** a shallow clone of `NomaDamas/k-skill` `main` every 3 days.
 2. **Discovers** every `<skill>/SKILL.md`.
 3. **Classifies** each skill (read-only / location / login / destructive / api-key / proxy-dependent / deprecated).
-4. **Runs** each suitable skill through `codex exec --json --sandbox read-only` with a smoke-test prompt synthesized from the skill's `## When to use` bullets.
-5. **Judges** the result via a second `codex exec` call using a cheaper model and a strict JSON Schema.
+4. **Runs** each suitable skill through `codex exec --json --dangerously-bypass-approvals-and-sandbox` with a smoke-test prompt synthesized from the skill's `## When to use` bullets. The daemon runs as a dedicated LaunchAgent with non-interactive approvals; avoiding the Codex sandbox prevents false DNS/network failures during skill smoke tests.
+5. **Judges** the result via a second read-only/no-approval `codex exec` call using the configured judge model and a strict JSON Schema.
 6. **Files** dedup'd issues on `NomaDamas/k-skill` for true failures (with `auto-qa` label). Skipped skills (deprecated, login-required, missing API key) never create issues.

 The k-skill repo itself is **never modified** by the bot — it is read-only SSOT. Test prompts are synthesized from each `SKILL.md`.

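The split between the unsandboxed smoke run and the read-only judge described in this README hunk can be sketched as two argv builders. This is a minimal illustration, not the bot's actual code: `smoke_argv` and `judge_argv` are hypothetical helper names, the flags are the ones that appear in this commit's diff and tests, and the flag order in `smoke_argv` is an assumption (the smoke test only checks that each flag is present).

```python
from pathlib import Path
from typing import List

BYPASS_FLAG = "--dangerously-bypass-approvals-and-sandbox"


def smoke_argv(codex: str, clone_dir: Path, model: str,
               provider: str, prompt: str) -> List[str]:
    """Hypothetical helper: argv for the unsandboxed smoke test.

    Flag order is an assumption; the flags themselves come from the diff.
    """
    return [codex, "exec", "--json", BYPASS_FLAG,
            "--skip-git-repo-check", "--ephemeral",
            "-C", str(clone_dir), "-m", model,
            "-c", f'model_provider="{provider}"',
            prompt]


def judge_argv(codex: str, schema_path: Path, model: str,
               provider: str, prompt: str) -> List[str]:
    """Hypothetical helper: argv for the read-only/no-approval judge."""
    return [codex, "exec", "--json", "--ephemeral",
            "-s", "read-only",
            "--skip-git-repo-check", "-m", model,
            "--output-schema", str(schema_path),
            "-c", 'approval_policy="never"',
            "-c", f'model_provider="{provider}"',
            prompt]
```

Keeping the two builders separate makes it hard for the bypass flag to leak into the judge call by accident, which is the property the new bats tests assert.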
@@ -50,7 +50,8 @@ Overridable variables (see `config/defaults.sh`):
 |---|---|---|
 | `CREATE_ISSUES` | `false` | File GH issues for failures |
 | `CODEX_MODEL` | `gpt-5.5` | Model for skill exec |
-| `JUDGE_MODEL` | `gpt-5.4-mini` | Model for LLM judge |
+| `JUDGE_MODEL` | `gpt-5.5` | Model for LLM judge |
+| `CODEX_PROVIDER` | `openai` | Codex model provider for skill exec and judge calls |
 | `TIMEOUT_SECS` | `180` | Per-skill timeout |
 | `JUDGE_TIMEOUT_SECS` | `60` | Per-judge timeout |
 | `MAX_PARALLEL` | `4` | Concurrent skill tests |

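How the overridable variables in this table resolve can be sketched in Python. The bot itself reads them from shell (`config/defaults.sh`), so this is only an analogue under stated assumptions: `resolve_config` is a hypothetical name, only the rows visible in this hunk are included, and treating `CREATE_ISSUES` as true only for the literal string `true` is an assumption.

```python
import os
from typing import Dict, Mapping, Union

# Defaults copied from the table rows shown in this hunk (post-change values).
DEFAULTS: Dict[str, str] = {
    "CREATE_ISSUES": "false",
    "CODEX_MODEL": "gpt-5.5",
    "JUDGE_MODEL": "gpt-5.5",
    "CODEX_PROVIDER": "openai",
    "TIMEOUT_SECS": "180",
    "JUDGE_TIMEOUT_SECS": "60",
    "MAX_PARALLEL": "4",
}


def resolve_config(env: Mapping[str, str] = os.environ) -> Dict[str, Union[str, int, bool]]:
    """Hypothetical helper: environment values win, table defaults fill gaps."""
    merged = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    return {
        # Assumption: only the literal "true" (any case) enables issue filing.
        "create_issues": merged["CREATE_ISSUES"].lower() == "true",
        "codex_model": merged["CODEX_MODEL"],
        "judge_model": merged["JUDGE_MODEL"],
        "codex_provider": merged["CODEX_PROVIDER"],
        "timeout_secs": int(merged["TIMEOUT_SECS"]),
        "judge_timeout_secs": int(merged["JUDGE_TIMEOUT_SECS"]),
        "max_parallel": int(merged["MAX_PARALLEL"]),
    }
```

Passing an explicit mapping instead of relying on `os.environ` keeps the resolution testable, in the same spirit as the `env -i` invocations used by the bats tests in this commit.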
@@ -85,12 +86,15 @@ bash ~/.local/share/k-skill-qa-bot/uninstall.sh --yes --purge --purge-logs

 ## Safety

-- `--sandbox read-only` pins the codex sandbox.
+- Skill smoke tests use `--dangerously-bypass-approvals-and-sandbox` because the Codex sandbox can block legitimate DNS/network lookups for public skill endpoints exercised by smoke tests.
+- A dedicated LaunchAgent is scheduling isolation only; it is not a separate OS user, container, or filesystem sandbox.
+- The bot-managed clone is not write-protected from the unsandboxed smoke agent; treat it as mutable bot state and judge only against inputs whose provenance is understood.
+- The LLM judge stays on the safer `-s read-only` path with `approval_policy="never"`; read-only/no-approval limits writes and approval prompts, but does not make the judge a no-tools or file-isolated model call. Treat transcript and skill Markdown as untrusted input.
 - 10 destructive/login-required skills are force-skipped before any codex call is issued.
 - Deprecated skills (`~~name~~ ⚠️ 지원 중단` in README) are detected and skipped.
 - `update-clone.sh` refuses any `K_SKILL_CLONE` outside `K_QA_HOME/k-skill-clone` unless `ALLOW_EXTERNAL_CLONE_TARGET=1` (prevents the script from git-reset'ing the wrong directory).
 - `CREATE_ISSUES=false` first-run default prevents accidental issue spam.
-- Local state only: `~/.local/share/k-skill-qa-bot/`. No network egress except git fetch, codex API, gh API, k-skill-proxy health check.
+- Local state only: `~/.local/share/k-skill-qa-bot/`. Expected network egress is limited to git fetch, codex API, gh API, k-skill-proxy health checks, and the public skill endpoints exercised by smoke tests.

 ## Troubleshooting

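The `update-clone.sh` guard mentioned in the Safety list (refuse a `K_SKILL_CLONE` outside `K_QA_HOME/k-skill-clone` unless `ALLOW_EXTERNAL_CLONE_TARGET=1`) could look roughly like the following if written in Python. The real guard is a shell script that this diff does not show, so `clone_target_allowed` is a hypothetical name, and reading "outside" as "neither the expected directory nor nested under it" is an assumption.

```python
from pathlib import Path
from typing import Mapping


def clone_target_allowed(k_qa_home: str, k_skill_clone: str,
                         env: Mapping[str, str]) -> bool:
    """Hypothetical sketch of the clone-target guard.

    Paths are resolved first so that `..` segments cannot escape the
    expected directory unnoticed.
    """
    expected = (Path(k_qa_home) / "k-skill-clone").resolve()
    target = Path(k_skill_clone).resolve()
    # Assumption: the expected directory itself, or anything under it, is fine.
    if target == expected or expected in target.parents:
        return True
    # Anything else needs the explicit opt-in named in the README.
    return env.get("ALLOW_EXTERNAL_CLONE_TARGET") == "1"
```

Resolving before comparing matters because the script goes on to run a destructive git reset in the target directory; a purely textual prefix check would accept a path such as `k-skill-clone/../elsewhere`.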
@@ -105,9 +105,10 @@ def _call_judge(prompt: str, schema_path, model: str, timeout: int) -> dict:
     if gtimeout:
         cmd += [gtimeout, str(timeout)]
     cmd += [codex, "exec", "--json", "--ephemeral",
-            "--dangerously-bypass-approvals-and-sandbox",
+            "-s", "read-only",
             "--skip-git-repo-check", "-m", model,
             "--output-schema", str(schema_path),
+            "-c", 'approval_policy="never"',
             "-c", f'model_provider="{provider}"',
             prompt]
     try:

@@ -172,7 +173,7 @@ def main(argv=None) -> int:
     ap.add_argument("--skill-md", type=Path, required=True)
     ap.add_argument("--prompt-template", type=Path, default=_CFG / "judge-prompt.md")
     ap.add_argument("--schema", type=Path, default=_CFG / "judge-schema.json")
-    ap.add_argument("--model", default=os.environ.get("JUDGE_MODEL", "gpt-5.4-mini"))
+    ap.add_argument("--model", default=os.environ.get("JUDGE_MODEL", "gpt-5.5"))
     ap.add_argument("--timeout", type=int, default=int(os.environ.get("JUDGE_TIMEOUT_SECS", "60")))
     ap.add_argument("--timeout-secs", type=int, default=int(os.environ.get("TIMEOUT_SECS", "180")))
     ap.add_argument("--offline", action="store_true",

27  tools/k-skill-qa-bot/test/bats/docs_trust_boundary.bats  Normal file
@@ -0,0 +1,27 @@
+#!/usr/bin/env bats
+
+setup() {
+  QA_BOT_ROOT="$(cd "$BATS_TEST_DIRNAME/../.." && pwd)"
+  README="$QA_BOT_ROOT/README.md"
+  AGENTS="$QA_BOT_ROOT/AGENTS.md"
+}
+
+@test "README accurately documents judge trust boundary" {
+  run grep -F 'it only reads transcripts/prompts and emits JSON' "$README"
+  [ "$status" -ne 0 ]
+
+  grep -Fq 'read-only/no-approval limits writes and approval prompts, but does not make the judge a no-tools or file-isolated model call' "$README"
+  grep -Fq 'Treat transcript and skill Markdown as untrusted input' "$README"
+}
+
+@test "README accurately documents smoke-test egress and LaunchAgent boundary" {
+  grep -Fq 'public skill endpoints exercised by smoke tests' "$README"
+  grep -Fq 'bot-managed clone is not write-protected from the unsandboxed smoke agent' "$README"
+  grep -Fq 'A dedicated LaunchAgent is scheduling isolation only; it is not a separate OS user, container, or filesystem sandbox' "$README"
+}
+
+@test "QA-bot AGENTS guidance preserves split trust boundary" {
+  grep -Fq 'Smoke tests intentionally run unsandboxed and may contact public skill endpoints' "$AGENTS"
+  grep -Fq 'bot-managed clone is not write-protected from the unsandboxed smoke agent' "$AGENTS"
+  grep -Fq 'The judge uses read-only/no-approval Codex settings, but is still a tool-capable Codex agent over untrusted transcripts and skill Markdown' "$AGENTS"
+}

@@ -8,7 +8,7 @@ setup() {
 @test "env.sh sets all default values when nothing else is set" {
   run env -i HOME="$HOME" PATH="$PATH" ENV_SH="$ENV_SH" bash -c '. "$ENV_SH" && echo "$CODEX_MODEL|$MAX_PARALLEL|$GH_REPO|$LAST_RUN_MIN_AGE|$CREATE_ISSUES|$JUDGE_MODEL"'
   [ "$status" -eq 0 ]
-  [ "$output" = "gpt-5.5|4|NomaDamas/k-skill|259200|false|gpt-5.4-mini" ]
+  [ "$output" = "gpt-5.5|4|NomaDamas/k-skill|259200|false|gpt-5.5" ]
 }

 @test "env.sh respects existing environment variables" {

54  tools/k-skill-qa-bot/test/bats/judge_command.bats  Normal file
@@ -0,0 +1,54 @@
+#!/usr/bin/env bats
+
+setup() {
+  QA_BOT_ROOT="$(cd "$BATS_TEST_DIRNAME/../.." && pwd)"
+  TMP="$(mktemp -d)"
+  STUB="$TMP/codex"
+  CAPTURE="$TMP/argv.txt"
+  TRANSCRIPT="$TMP/transcript.jsonl"
+  SKILL_MD="$TMP/SKILL.md"
+  cat > "$STUB" <<'SH'
+#!/usr/bin/env bash
+printf '%s\n' "$@" > "$CODEX_ARGV_CAPTURE"
+printf '%s\n' '{"type":"item.completed","item":{"type":"agent_message","text":"{\"verdict\":\"pass\",\"reason\":\"judge accepted transcript\",\"symptom_class\":\"success\",\"confidence\":0.99,\"evidence_quote\":\"VERDICT: PASS\"}"}}'
+SH
+  chmod +x "$STUB"
+  cat > "$TRANSCRIPT" <<'JSONL'
+{"type":"item.completed","item":{"type":"agent_message","text":"VERDICT: PASS\nEverything worked."}}
+JSONL
+  echo '# Test Skill' > "$SKILL_MD"
+}
+
+teardown() {
+  rm -rf "$TMP"
+}
+
+@test "judge-skill standalone defaults to gpt-5.5" {
+  receipt="{\"name\":\"demo\",\"status\":\"executed\",\"exit_code\":0,\"duration_ms\":100,\"transcript_path\":\"$TRANSCRIPT\",\"test_prompt\":\"run demo\"}"
+
+  run env -i HOME="$HOME" PATH="$PATH" CODEX_BIN="$STUB" CODEX_ARGV_CAPTURE="$CAPTURE" \
+    bash -c 'printf "%s" "$0" | "$1" --skill-md "$2"' "$receipt" "$QA_BOT_ROOT/bin/judge-skill.py" "$SKILL_MD"
+
+  [ "$status" -eq 0 ]
+  echo "$output" | python3 -c 'import json,sys; data=json.load(sys.stdin); assert data["judge_model"] == "gpt-5.5", data'
+  grep -qx -- '-m' "$CAPTURE"
+  grep -qx -- 'gpt-5.5' "$CAPTURE"
+}
+
+@test "judge-skill keeps judge codex execution read-only and pins provider" {
+  receipt="{\"name\":\"demo\",\"status\":\"executed\",\"exit_code\":0,\"duration_ms\":100,\"transcript_path\":\"$TRANSCRIPT\",\"test_prompt\":\"run demo\"}"
+
+  run env -i HOME="$HOME" PATH="$PATH" CODEX_BIN="$STUB" CODEX_ARGV_CAPTURE="$CAPTURE" CODEX_PROVIDER="example-provider" \
+    bash -c 'printf "%s" "$0" | "$1" --skill-md "$2" --timeout 5' "$receipt" "$QA_BOT_ROOT/bin/judge-skill.py" "$SKILL_MD"
+
+  [ "$status" -eq 0 ]
+  grep -qx -- '-s' "$CAPTURE"
+  grep -qx -- 'read-only' "$CAPTURE"
+  grep -qx -- '-c' "$CAPTURE"
+  grep -qx -- 'approval_policy="never"' "$CAPTURE"
+  grep -qx -- 'model_provider="example-provider"' "$CAPTURE"
+  if grep -qx -- '--dangerously-bypass-approvals-and-sandbox' "$CAPTURE"; then
+    echo "unexpected sandbox-bypass flag in judge argv"
+    return 1
+  fi
+}

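The codex stub in `judge_command.bats` emits a single `item.completed` event whose `agent_message` text is itself a JSON verdict object. One way such output could be turned into a verdict is sketched below. This is not taken from `judge-skill.py`, whose parsing code is not part of this diff: `extract_verdict` is a hypothetical name, and taking the last well-formed verdict in the stream is an assumption.

```python
import json
from typing import Optional


def extract_verdict(jsonl_text: str) -> Optional[dict]:
    """Hypothetical sketch: pull a verdict object out of codex JSONL output.

    Returns the last agent message whose text parses as a JSON object
    containing a "verdict" key, or None if there is none.
    """
    verdict = None
    for raw in jsonl_text.splitlines():
        raw = raw.strip()
        if not raw:
            continue
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON noise in the stream
        if not isinstance(event, dict) or event.get("type") != "item.completed":
            continue
        item = event.get("item")
        if not isinstance(item, dict) or item.get("type") != "agent_message":
            continue
        try:
            candidate = json.loads(item.get("text", ""))
        except (json.JSONDecodeError, TypeError):
            continue  # plain-text message, not a structured verdict
        if isinstance(candidate, dict) and "verdict" in candidate:
            verdict = candidate
    return verdict
```

Note that the smoke transcript written by the same test file (`VERDICT: PASS\nEverything worked.`) is plain text, so a parser like this returns `None` for it; only the judge's schema-constrained output carries a structured verdict.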
65  tools/k-skill-qa-bot/test/bats/smoke_command.bats  Normal file
@@ -0,0 +1,65 @@
+#!/usr/bin/env bats
+
+setup() {
+  QA_BOT_ROOT="$(cd "$BATS_TEST_DIRNAME/../.." && pwd)"
+  TMP="$(mktemp -d)"
+  STUB_BIN="$TMP/bin"
+  mkdir -p "$STUB_BIN" "$TMP/clone" "$TMP/run"
+  CAPTURE="$TMP/argv.txt"
+  cat > "$STUB_BIN/codex" <<'SH'
+#!/usr/bin/env bash
+printf '%s\n' "$@" > "$CODEX_ARGV_CAPTURE"
+printf '%s\n' '{"type":"item.completed","item":{"type":"agent_message","text":"smoke ok"}}'
+SH
+  chmod +x "$STUB_BIN/codex"
+  cat > "$STUB_BIN/gtimeout" <<'SH'
+#!/usr/bin/env bash
+if [ "$1" = "--kill-after=15" ]; then
+  shift 2
+fi
+exec "$@"
+SH
+  chmod +x "$STUB_BIN/gtimeout"
+}
+
+teardown() {
+  rm -rf "$TMP"
+}
+
+@test "test-skill keeps smoke codex execution on the documented sandbox-bypass path" {
+  classification='{"name":"demo","skip_reason":null,"default_test_prompt":"run demo smoke"}'
+
+  run env -i HOME="$HOME" PATH="$STUB_BIN:$PATH" CODEX_BIN="codex" CODEX_ARGV_CAPTURE="$CAPTURE" \
+    K_QA_HOME="$TMP/home" K_SKILL_CLONE="$TMP/clone" CODEX_MODEL="smoke-model" CODEX_PROVIDER="smoke-provider" TIMEOUT_SECS="5" \
+    bash -c 'printf "%s" "$0" | "$1" --run-dir "$2"' "$classification" "$QA_BOT_ROOT/bin/test-skill.sh" "$TMP/run"
+
+  [ "$status" -eq 0 ]
+  [ -f "$TMP/run/results/demo.exec.json" ]
+  grep -qx -- 'exec' "$CAPTURE"
+  grep -qx -- '--json' "$CAPTURE"
+  grep -qx -- '--dangerously-bypass-approvals-and-sandbox' "$CAPTURE"
+  grep -qx -- '--skip-git-repo-check' "$CAPTURE"
+  grep -qx -- '--ephemeral' "$CAPTURE"
+  grep -qx -- '-C' "$CAPTURE"
+  grep -qx -- "$TMP/clone" "$CAPTURE"
+  grep -qx -- '-m' "$CAPTURE"
+  grep -qx -- 'smoke-model' "$CAPTURE"
+  grep -qx -- 'model_provider="smoke-provider"' "$CAPTURE"
+  grep -qx -- 'run demo smoke' "$CAPTURE"
+  if grep -qx -- '-s' "$CAPTURE"; then
+    echo "unexpected sandbox flag in smoke argv"
+    return 1
+  fi
+  if grep -qx -- 'read-only' "$CAPTURE"; then
+    echo "unexpected read-only sandbox in smoke argv"
+    return 1
+  fi
+  python3 - "$TMP/run/results/demo.exec.json" <<'PY'
+import json, sys
+with open(sys.argv[1], encoding="utf-8") as f:
+    data = json.load(f)
+assert data["status"] == "executed", data
+assert data["codex_model"] == "smoke-model", data
+assert data["test_prompt"] == "run demo smoke", data
+PY
+}