11 skills that need specific inputs (not just a 'demonstrate' query) now
ship with a hardcoded test_prompt in config/skill-overrides.yml:
  flight-ticket-search       ICN -> NRT, 2026-08-20 one-way
  nts-business-registration  124-81-00998 (Samsung Electronics)
  korean-stock-search        005930 Samsung 5-day quote
  joseon-sillok-search       키워드 훈민정음 (keyword: Hunminjeongeum)
  korean-law-search          산업안전보건법 제5조 (Occupational Safety and Health Act, Article 5)
  library-book-search        코스모스 칼 세이건 (Cosmos, Carl Sagan)
  lotto-results              latest round
  k-schoollunch-menu         서울특별시교육청 초등학교 오늘 식단 (Seoul Metropolitan Office of Education, elementary school, today's menu)
  delivery-tracking          CJ dummy invoice (negative case ok)
  ticket-availability        YES24 / 인터파크 (Interpark) sample
  zipcode-search             서울특별시 강남구 테헤란로 152 (152 Teheran-ro, Gangnam-gu, Seoul)
These prompts were previously synthesized from the first When-to-use bullet
in SKILL.md, which is a one-line teaser without concrete inputs. The agent
would then either ask the user for the missing input (partial-success) or
fall back to a generic demo (often ending in a VERDICT: FAIL response). The
judge mis-classified both outcomes as fail.
qa_utils.synthesize_test_prompt now honors default_inputs.test_prompt as a
verbatim override (only appending the VERDICT line if the override does not
already include it).
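
The override rule above can be sketched roughly as follows. This is only an
illustration, not the actual qa_utils code: the function signature, the
fallback wording, and the exact text of the VERDICT instruction are
assumptions made for the example.

```python
# Hypothetical wording; the real VERDICT instruction text is not shown in this commit.
VERDICT_INSTRUCTION = "End your reply with one line: VERDICT: PASS or VERDICT: FAIL."


def synthesize_test_prompt(skill_name, when_to_use, default_inputs=None):
    """Build the prompt sent to the agent for one skill (illustrative sketch)."""
    override = (default_inputs or {}).get("test_prompt")
    if override:
        # Verbatim override, e.g. loaded from config/skill-overrides.yml.
        prompt = override
    else:
        # Assumed fallback: synthesize from the first When-to-use bullet.
        prompt = f"Demonstrate the {skill_name} skill: {when_to_use}"
    # Append the VERDICT instruction only if the prompt does not already carry one.
    if "VERDICT:" not in prompt:
        prompt = f"{prompt}\n\n{VERDICT_INSTRUCTION}"
    return prompt
```

With this shape, an override that already contains a VERDICT line is passed
through untouched, so skill authors keep full control over the prompt text.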
Two additional fixes for negative-case correctness:
1. judge-prompt.md: explicitly tells the judge that the agent's literal
VERDICT: PASS / VERDICT: FAIL is just a hint, not binding. A skill that
correctly returns 'no such business number' or 'invoice not found' for
a deliberately invalid input is PASS, not fail.
2. judge-skill.py: drop the deterministic gate that flipped pass to fail
when the literal 'VERDICT: PASS' was missing from the transcript. That gate
was producing false fails for negative-case tests where the agent
correctly responded with VERDICT: FAIL because the skill rejected an
invalid input. The judge LLM (gpt-5.5) is now trusted to evaluate the
transcript against the SKILL.md 'Done when' criteria.
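
A minimal sketch of what the gating step might look like once the
VERDICT-literal check is gone. It assumes the exit-code and near-timeout
gates from the original bot are still in place; the function name, the
parameters, and the 0.95 ratio are hypothetical and not taken from
judge-skill.py.

```python
def apply_deterministic_gates(judge_verdict, exit_code, duration_s, timeout_s,
                              near_timeout_ratio=0.95):
    """Return the final verdict after hard gates (illustrative sketch)."""
    # A non-zero exit code still forces a fail (assumed to be unchanged).
    if exit_code != 0:
        return "fail"
    # Runs that nearly hit the timeout are still treated as failures (assumed).
    if duration_s >= timeout_s * near_timeout_ratio:
        return "fail"
    # No check for a literal 'VERDICT: PASS' any more: a negative-case run that
    # ends in VERDICT: FAIL keeps whatever verdict the LLM judge assigned.
    return judge_verdict
```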
Verified live:
- nts-business-registration with valid number -> pass/success (0.99)
- nts-business-registration with fake number -> pass/success (0.99)
- flight-ticket-search ICN->NRT 2026-08-20 -> pass/success (0.99)
PR #257 follow-up. Three changes:
1. JUDGE_MODEL default: gpt-5.4-mini -> gpt-5.5
The cheaper judge was misclassifying every wrong-output verdict because
the offline matcher fell through to the dumb 'VERDICT: FAIL in transcript'
check. Re-running the same 10 historical fail cases with gpt-5.5 and the
real LLM judge reclassified 7 of them as pass (the codex agent had
actually accomplished the skill goal) and the remaining 3 as
network-error / partial-success / skip with accurate reasons.
2. Drop -s read-only, add --dangerously-bypass-approvals-and-sandbox
The read-only codex sandbox was triggering spurious DNS resolution
failures inside the test runs (host blocked at the syscall level even
for legitimate proxy / public-API calls). Live re-test with the bypass
flag and provider pin produced clean transcripts: cheap-gas-nearby,
daangn-realty-search, han-river-water-level, naver-news-search,
naver-shopping-search, seoul-density, seoul-subway-arrival all PASS.
The QA bot is sandboxed externally by launchd anyway.
3. New CODEX_PROVIDER env (default: openai)
Lets users pin the codex model_provider explicitly so the bot does not
accidentally route through a private OpenAI-compatible proxy that may
not have keys registered for all model names.
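
As a rough illustration of how these defaults could fit together, the
sketch below reads the two environment variables and assembles a codex exec
command line. The helper names are hypothetical, and passing the provider
through `-c model_provider=...` is an assumption about how the bot pins it,
not something this commit confirms.

```python
import os


def judge_model(env=None):
    """Judge model name, defaulting to gpt-5.5 when JUDGE_MODEL is unset."""
    env = os.environ if env is None else env
    return env.get("JUDGE_MODEL", "gpt-5.5")


def build_codex_command(prompt, env=None):
    """Assemble the argv for one skill run (illustrative sketch)."""
    env = os.environ if env is None else env
    provider = env.get("CODEX_PROVIDER", "openai")
    return [
        "codex", "exec",
        # Replaces the former '-s read-only' sandbox setting.
        "--dangerously-bypass-approvals-and-sandbox",
        # Assumed mechanism for pinning the provider via a config override.
        "-c", f"model_provider={provider}",
        prompt,
    ]
```

Returning an argv list instead of a shell string keeps prompts that contain
quotes or Korean text from needing any shell escaping.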
External macOS daemon that clones NomaDamas/k-skill main every 3 days, runs
each skill through codex exec, has an LLM judge grade pass/fail/skip via
codex exec --output-schema, and files dedup'd GitHub issues for true failures.
Layout:
- install.sh copies tools/k-skill-qa-bot/ to ~/.local/share/k-skill-qa-bot/
and registers a LaunchAgent at ~/Library/LaunchAgents/.
- update-clone.sh has a hard guard: refuses any K_SKILL_CLONE outside
K_QA_HOME/k-skill-clone unless ALLOW_EXTERNAL_CLONE_TARGET=1.
- Force-skip 10 destructive/login-required skills (ktx-booking, srt-booking,
catchtable-sniper, kakaotalk-mac, hipass-receipt, toss-securities, etc.)
so the bot never triggers reservation abuse.
- Deprecated skills (strike-through + 지원 중단, "support discontinued", in
README) are auto-detected and skipped, never failed.
- First-run safety: CREATE_ISSUES=false by default.
- mkdir-based concurrency lock with atomic stale reclaim.
- Issue dedup: sha1(skill_name + symptom_class)[:12] body marker.
- Deterministic gates override LLM judge to FAIL on exit_code != 0, missing
VERDICT line, or near-timeout duration.
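
The issue-dedup marker from the list above can be sketched as below. The
sha1-over-concatenation and the 12-character truncation follow the bullet;
the HTML-comment wrapper and both function names are hypothetical choices
made for the example.

```python
import hashlib


def issue_marker(skill_name, symptom_class):
    """Return a stable marker to embed in the GitHub issue body."""
    digest = hashlib.sha1((skill_name + symptom_class).encode("utf-8")).hexdigest()[:12]
    # Hypothetical wrapper: an HTML comment stays invisible in rendered Markdown.
    return f"<!-- k-skill-qa:{digest} -->"


def is_duplicate(marker, open_issue_bodies):
    """True if any already-open issue body carries the same marker."""
    return any(marker in body for body in open_issue_bodies)
```

Because the marker depends only on the skill name and symptom class, repeated
runs that hit the same failure map to the same issue instead of filing a new
one every 3 days.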