mirror of
https://github.com/NomaDamas/k-skill.git
synced 2026-06-24 02:04:11 +00:00
11 skills that need specific inputs (not just a 'demonstrate' query) now ship with a hardcoded test_prompt in config/skill-overrides.yml: flight-ticket-search ICN -> NRT, 2026-08-20 one-way nts-business-registration 124-81-00998 (Samsung Electronics) korean-stock-search 005930 Samsung 5-day quote joseon-sillok-search 키워드 훈민정음 korean-law-search 산업안전보건법 제5조 library-book-search 코스모스 칼 세이건 lotto-results latest round k-schoollunch-menu 서울특별시교육청 초등학교 오늘 식단 delivery-tracking CJ dummy invoice (negative case ok) ticket-availability YES24 / 인터파크 sample zipcode-search 서울특별시 강남구 테헤란로 152 These were previously synthesized from the SKILL.md first When-to-use bullet, which is a one-line teaser without concrete inputs. The agent would then either ask the user for the missing input (partial-success) or fall back to a generic demo (often producing a VERDICT: FAIL response). Both got mis-classified as fail by the judge. qa_utils.synthesize_test_prompt now honors default_inputs.test_prompt as a verbatim override (only appending the VERDICT line if the override does not already include it). Two additional fixes for negative-case correctness: 1. judge-prompt.md: explicitly tells the judge that the agent's literal VERDICT: PASS / VERDICT: FAIL is just a hint, not binding. A skill that correctly returns 'no such business number' or 'invoice not found' for a deliberately invalid input is PASS, not fail. 2. judge-skill.py: drop the deterministic gate that flipped pass to fail when 'VERDICT: PASS' literal was missing from the transcript. That gate was producing false fails for negative-case tests where the agent correctly responded with VERDICT: FAIL because the skill rejected an invalid input. The judge LLM (gpt-5.5) is now trusted to evaluate the transcript against the SKILL.md 'Done when' criteria. Verified live: - nts-business-registration with valid number -> pass/success (0.99) - nts-business-registration with fake number -> pass/success (0.99) - flight-ticket-search ICN->NRT 2026-08-20 -> pass/success (0.99) |
||
|---|---|---|
| .. | ||
| env.sh | ||
| lock.sh | ||
| log.sh | ||
| parse_skill_md.py | ||
| qa_utils.py | ||