Commit graph

4 commits

Author SHA1 Message Date
Jeffrey (Dongkyu) Kim
5b08b4c86e
Merge pull request #263 from NomaDamas/feature/#257
Feature/#257
2026-05-18 21:18:56 +09:00
Jeffrey (Dongkyu) Kim
cf8e96acdc fix(qa-bot): per-skill test_prompt overrides and smarter judge
11 skills that need specific inputs (not just a 'demonstrate' query) now
ship with a hardcoded test_prompt in config/skill-overrides.yml:

  flight-ticket-search           ICN -> NRT, 2026-08-20 one-way
  nts-business-registration      124-81-00998 (Samsung Electronics)
  korean-stock-search            005930 Samsung 5-day quote
  joseon-sillok-search           키워드 훈민정음
  korean-law-search              산업안전보건법 제5조
  library-book-search            코스모스 칼 세이건
  lotto-results                  latest round
  k-schoollunch-menu             서울특별시교육청 초등학교 오늘 식단
  delivery-tracking              CJ dummy invoice (negative case ok)
  ticket-availability            YES24 / 인터파크 sample
  zipcode-search                 서울특별시 강남구 테헤란로 152

These were previously synthesized from the SKILL.md first When-to-use bullet,
which is a one-line teaser without concrete inputs. The agent would then
either ask the user for the missing input (partial-success) or fall back
to a generic demo (often producing a VERDICT: FAIL response). Both got
mis-classified as fail by the judge.

qa_utils.synthesize_test_prompt now honors default_inputs.test_prompt as a
verbatim override (only appending the VERDICT line if the override does not
already include it).

Two additional fixes for negative-case correctness:

1. judge-prompt.md: explicitly tells the judge that the agent's literal
   VERDICT: PASS / VERDICT: FAIL is just a hint, not binding. A skill that
   correctly returns 'no such business number' or 'invoice not found' for
   a deliberately invalid input is PASS, not fail.

2. judge-skill.py: drop the deterministic gate that flipped pass to fail
   when 'VERDICT: PASS' literal was missing from the transcript. That gate
   was producing false fails for negative-case tests where the agent
   correctly responded with VERDICT: FAIL because the skill rejected an
   invalid input. The judge LLM (gpt-5.5) is now trusted to evaluate the
   transcript against the SKILL.md 'Done when' criteria.

Verified live:

- nts-business-registration with valid number  -> pass/success (0.99)
- nts-business-registration with fake number   -> pass/success (0.99)
- flight-ticket-search ICN->NRT 2026-08-20     -> pass/success (0.99)
2026-05-18 15:53:26 +09:00
Jeffrey (Dongkyu) Kim
136d2afce1 fix(qa-bot): upgrade judge to gpt-5.5 and run codex with sandbox bypass
PR #257 follow-up. Two changes:

1. JUDGE_MODEL default: gpt-5.4-mini -> gpt-5.5

   The cheaper judge was misclassifying every wrong-output verdict because
   the offline matcher fell through to the dumb 'VERDICT: FAIL in transcript'
   check. Re-running the same 10 historical fail cases with gpt-5.5 +
   real LLM judge correctly reclassified 7 of them as pass (the codex agent
   actually accomplished the skill goal) and the remaining 3 as
   network-error / partial-success / skip with accurate reasons.

2. Drop -s read-only, add --dangerously-bypass-approvals-and-sandbox

   The read-only codex sandbox was triggering spurious DNS resolution
   failures inside the test runs (host blocked at the syscall level even
   for legitimate proxy / public-API calls). Live re-test with the bypass
   flag and provider pin produced clean transcripts: cheap-gas-nearby,
   daangn-realty-search, han-river-water-level, naver-news-search,
   naver-shopping-search, seoul-density, seoul-subway-arrival all PASS.
   The QA bot is sandboxed externally by launchd anyway.

3. New CODEX_PROVIDER env (default: openai)

   Lets users pin the codex model_provider explicitly so the bot does not
   accidentally route through a private OpenAI-compatible proxy that may
   not have keys registered for all model names.
2026-05-18 14:26:32 +09:00
Jeffrey (Dongkyu) Kim
7f73e55011 feat(qa-bot): add k-skill-qa-bot under tools/
External macOS daemon that clones NomaDamas/k-skill main every 3 days, runs
each skill through codex exec, has an LLM judge grade pass/fail/skip via
codex exec --output-schema, and files dedup'd GitHub issues for true failures.

Layout:
- install.sh copies tools/k-skill-qa-bot/ to ~/.local/share/k-skill-qa-bot/
  and registers a LaunchAgent at ~/Library/LaunchAgents/.
- update-clone.sh has a hard guard: refuses any K_SKILL_CLONE outside
  K_QA_HOME/k-skill-clone unless ALLOW_EXTERNAL_CLONE_TARGET=1.
- Force-skip 10 destructive/login-required skills (ktx-booking, srt-booking,
  catchtable-sniper, kakaotalk-mac, hipass-receipt, toss-securities, etc.)
  so the bot never triggers reservation abuse.
- Deprecated skills (strike-through + 지원 중단 in README) auto-detected
  and skipped, never failed.
- First-run safety: CREATE_ISSUES=false by default.
- mkdir-based concurrency lock with atomic stale reclaim.
- Issue dedup: sha1(skill_name + symptom_class)[:12] body marker.
- Deterministic gates override LLM judge to FAIL on exit_code != 0, missing
  VERDICT line, or near-timeout duration.
2026-05-17 18:24:11 +09:00