k-skill

mirror of https://github.com/NomaDamas/k-skill.git synced 2026-06-24 02:04:11 +00:00

Jeffrey (Dongkyu) Kim cf8e96acdc fix(qa-bot): per-skill test_prompt overrides and smarter judge

11 skills that need specific inputs (not just a 'demonstrate' query) now
ship with a hardcoded test_prompt in config/skill-overrides.yml:

  flight-ticket-search        ICN -> NRT, 2026-08-20 one-way
  nts-business-registration   124-81-00998 (Samsung Electronics)
  korean-stock-search         005930 Samsung 5-day quote
  joseon-sillok-search        키워드 훈민정음
  korean-law-search           산업안전보건법 제5조
  library-book-search         코스모스 칼 세이건
  lotto-results               latest round
  k-schoollunch-menu          서울특별시교육청 초등학교 오늘 식단
  delivery-tracking           CJ dummy invoice (negative case ok)
  ticket-availability         YES24 / 인터파크 sample
  zipcode-search              서울특별시 강남구 테헤란로 152

These were previously synthesized from the SKILL.md first When-to-use
bullet, which is a one-line teaser without concrete inputs. The agent
would then either ask the user for the missing input (partial-success)
or fall back to a generic demo (often producing a VERDICT: FAIL
response). Both got mis-classified as fail by the judge.

qa_utils.synthesize_test_prompt now honors default_inputs.test_prompt
as a verbatim override (only appending the VERDICT line if the override
does not already include it).

Two additional fixes for negative-case correctness:

1. judge-prompt.md: explicitly tells the judge that the agent's literal
   VERDICT: PASS / VERDICT: FAIL is just a hint, not binding. A skill
   that correctly returns 'no such business number' or 'invoice not
   found' for a deliberately invalid input is PASS, not fail.

2. judge-skill.py: drop the deterministic gate that flipped pass to fail
   when 'VERDICT: PASS' literal was missing from the transcript. That
   gate was producing false fails for negative-case tests where the
   agent correctly responded with VERDICT: FAIL because the skill
   rejected an invalid input. The judge LLM (gpt-5.5) is now trusted to
   evaluate the transcript against the SKILL.md 'Done when' criteria.

Verified live:
- nts-business-registration with valid number -> pass/success (0.99)
- nts-business-registration with fake number -> pass/success (0.99)
- flight-ticket-search ICN->NRT 2026-08-20 -> pass/success (0.99)

2026-05-18 15:53:26 +09:00
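
The two behavior changes in this commit message can be sketched roughly as follows. This is a minimal illustration based only on the description above, not the actual contents of qa_utils.py or judge-skill.py: the function signatures, the VERDICT_LINE wording, the fallback text, and the names gated_verdict and trusted_verdict are all assumptions made for the example.

```python
from typing import Optional, Sequence

# Hypothetical wording: the commit only says that a "VERDICT line" is appended.
VERDICT_LINE = "End your reply with 'VERDICT: PASS' or 'VERDICT: FAIL'."


def synthesize_test_prompt(
    skill_name: str,
    when_to_use: Sequence[str],
    default_inputs: Optional[dict] = None,
) -> str:
    """Build the QA prompt for one skill (illustrative signature)."""
    override = (default_inputs or {}).get("test_prompt")
    if override:
        prompt = override.strip()
        # Use the override verbatim; add the VERDICT line only when absent.
        if "VERDICT:" not in prompt:
            prompt = f"{prompt}\n\n{VERDICT_LINE}"
        return prompt
    # Fallback: the earlier behavior, built from the first When-to-use bullet.
    teaser = when_to_use[0] if when_to_use else f"Demonstrate the {skill_name} skill."
    return f"{teaser}\n\n{VERDICT_LINE}"


def gated_verdict(judge_verdict: str, transcript: str) -> str:
    """The removed gate as described: a pass without the literal becomes fail."""
    if judge_verdict == "pass" and "VERDICT: PASS" not in transcript:
        return "fail"
    return judge_verdict


def trusted_verdict(judge_verdict: str, transcript: str) -> str:
    """After the change: the judge's verdict stands; the literal is only a hint."""
    return judge_verdict
```

With a negative-case transcript that ends in VERDICT: FAIL because the skill correctly rejected an invalid input, gated_verdict("pass", transcript) yields "fail" while trusted_verdict("pass", transcript) keeps "pass", which is the false-fail pattern the commit describes.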
env.sh	fix(qa-bot): upgrade judge to gpt-5.5 and run codex with sandbox bypass	2026-05-18 14:26:32 +09:00
lock.sh	feat(qa-bot): add k-skill-qa-bot under tools/	2026-05-17 18:24:11 +09:00
log.sh	feat(qa-bot): add k-skill-qa-bot under tools/	2026-05-17 18:24:11 +09:00
parse_skill_md.py	feat(qa-bot): add k-skill-qa-bot under tools/	2026-05-17 18:24:11 +09:00
qa_utils.py	fix(qa-bot): per-skill test_prompt overrides and smarter judge	2026-05-18 15:53:26 +09:00