mirrors/k-skill

mirror of https://github.com/NomaDamas/k-skill.git synced 2026-06-24 02:04:11 +00:00

Author	SHA1	Message	Date
Jeffrey (Dongkyu) Kim	5b08b4c86e	Merge pull request #263 from NomaDamas/feature/#257 Feature/#257	2026-05-18 21:18:56 +09:00
Jeffrey (Dongkyu) Kim	cf8e96acdc	fix(qa-bot): per-skill test_prompt overrides and smarter judge

11 skills that need specific inputs (not just a 'demonstrate' query) now ship with a hardcoded test_prompt in config/skill-overrides.yml:

  flight-ticket-search       ICN -> NRT, 2026-08-20 one-way
  nts-business-registration  124-81-00998 (Samsung Electronics)
  korean-stock-search        005930 Samsung 5-day quote
  joseon-sillok-search       keyword 훈민정음 (Hunminjeongeum)
  korean-law-search          산업안전보건법 제5조 (Occupational Safety and Health Act, Article 5)
  library-book-search        코스모스 칼 세이건 (Cosmos, Carl Sagan)
  lotto-results              latest round
  k-schoollunch-menu         서울특별시교육청 초등학교 오늘 식단 (Seoul Metropolitan Office of Education, elementary school, today's menu)
  delivery-tracking          CJ dummy invoice (negative case ok)
  ticket-availability        YES24 / 인터파크 (Interpark) sample
  zipcode-search             서울특별시 강남구 테헤란로 152 (152 Teheran-ro, Gangnam-gu, Seoul)

These prompts were previously synthesized from the first When-to-use bullet in SKILL.md, which is a one-line teaser without concrete inputs. The agent would then either ask the user for the missing input (partial-success) or fall back to a generic demo (often producing a VERDICT: FAIL response). The judge misclassified both outcomes as fail.

qa_utils.synthesize_test_prompt now honors default_inputs.test_prompt as a verbatim override (only appending the VERDICT line if the override does not already include it).

Two additional fixes for negative-case correctness:

1. judge-prompt.md: explicitly tells the judge that the agent's literal VERDICT: PASS / VERDICT: FAIL is just a hint, not binding. A skill that correctly returns 'no such business number' or 'invoice not found' for a deliberately invalid input is PASS, not fail.
2. judge-skill.py: drop the deterministic gate that flipped pass to fail when the literal 'VERDICT: PASS' was missing from the transcript. That gate was producing false fails for negative-case tests where the agent correctly responded with VERDICT: FAIL because the skill rejected an invalid input. The judge LLM (gpt-5.5) is now trusted to evaluate the transcript against the SKILL.md 'Done when' criteria.

Verified live:
- nts-business-registration with valid number -> pass/success (0.99)
- nts-business-registration with fake number -> pass/success (0.99)
- flight-ticket-search ICN->NRT 2026-08-20 -> pass/success (0.99)	2026-05-18 15:53:26 +09:00
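
The override behavior described in commit cf8e96acdc could look roughly like the sketch below. It is an illustration only, not the code in qa_utils: the function signature, the shape of the parsed config/skill-overrides.yml mapping, the wording of the VERDICT instruction, and the fallback prompt text are all assumptions.

```python
# Hypothetical sketch; the real qa_utils.synthesize_test_prompt may differ.

# Assumed wording of the instruction appended to every test prompt.
VERDICT_INSTRUCTION = "End your reply with 'VERDICT: PASS' or 'VERDICT: FAIL'."


def synthesize_test_prompt(skill_name, when_to_use, overrides):
    """Build the test prompt for one skill.

    overrides is assumed to be the parsed config/skill-overrides.yml, shaped as
    {skill_name: {"default_inputs": {"test_prompt": "..."}}}.
    """
    default_inputs = overrides.get(skill_name, {}).get("default_inputs", {})
    override = default_inputs.get("test_prompt")
    if override:
        # Use the override verbatim; append the VERDICT line only if missing.
        if "VERDICT:" in override:
            return override
        return override.rstrip() + "\n\n" + VERDICT_INSTRUCTION
    # Fallback: synthesize from the first When-to-use bullet of SKILL.md.
    return f"Demonstrate the {skill_name} skill: {when_to_use}\n\n{VERDICT_INSTRUCTION}"
```

Under these assumptions, a skill with an override gets its concrete input plus the VERDICT instruction, while a skill without one still falls back to the teaser-based prompt.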
Jeffrey (Dongkyu) Kim	136d2afce1	fix(qa-bot): upgrade judge to gpt-5.5 and run codex with sandbox bypass

PR #257 follow-up. Three changes:

1. JUDGE_MODEL default: gpt-5.4-mini -> gpt-5.5
   The cheaper judge was misclassifying every wrong-output verdict because the offline matcher fell through to the dumb 'VERDICT: FAIL in transcript' check. Re-running the same 10 historical fail cases with gpt-5.5 + real LLM judge correctly reclassified 7 of them as pass (the codex agent actually accomplished the skill goal) and the remaining 3 as network-error / partial-success / skip with accurate reasons.
2. Drop -s read-only, add --dangerously-bypass-approvals-and-sandbox
   The read-only codex sandbox was triggering spurious DNS resolution failures inside the test runs (host blocked at the syscall level even for legitimate proxy / public-API calls). Live re-test with the bypass flag and provider pin produced clean transcripts: cheap-gas-nearby, daangn-realty-search, han-river-water-level, naver-news-search, naver-shopping-search, seoul-density, seoul-subway-arrival all PASS. The QA bot is sandboxed externally by launchd anyway.
3. New CODEX_PROVIDER env (default: openai)
   Lets users pin the codex model_provider explicitly so the bot does not accidentally route through a private OpenAI-compatible proxy that may not have keys registered for all model names.	2026-05-18 14:26:32 +09:00
Jeffrey (Dongkyu) Kim	7f73e55011	feat(qa-bot): add k-skill-qa-bot under tools/

External macOS daemon that clones NomaDamas/k-skill main every 3 days, runs each skill through codex exec, has an LLM judge grade pass/fail/skip via codex exec --output-schema, and files dedup'd GitHub issues for true failures.

Layout:
- install.sh copies tools/k-skill-qa-bot/ to ~/.local/share/k-skill-qa-bot/ and registers a LaunchAgent at ~/Library/LaunchAgents/.
- update-clone.sh has a hard guard: refuses any K_SKILL_CLONE outside K_QA_HOME/k-skill-clone unless ALLOW_EXTERNAL_CLONE_TARGET=1.
- Force-skip 10 destructive/login-required skills (ktx-booking, srt-booking, catchtable-sniper, kakaotalk-mac, hipass-receipt, toss-securities, etc.) so the bot never triggers reservation abuse.
- Deprecated skills (strike-through + 지원 중단, "support discontinued", in README) are auto-detected and skipped, never failed.
- First-run safety: CREATE_ISSUES=false by default.
- mkdir-based concurrency lock with atomic stale reclaim.
- Issue dedup: sha1(skill_name + symptom_class)[:12] body marker.
- Deterministic gates override the LLM judge to FAIL on exit_code != 0, missing VERDICT line, or near-timeout duration.	2026-05-17 18:24:11 +09:00
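
The issue-dedup marker mentioned in commit 7f73e55011, sha1(skill_name + symptom_class)[:12], might be computed along these lines. This is a sketch under assumptions: the commit only gives the hash formula, so the HTML-comment marker format, the function names, and the way existing issue bodies are passed in are hypothetical.

```python
import hashlib


def dedup_marker(skill_name, symptom_class):
    """Return a short, stable marker to embed in a GitHub issue body."""
    # Plain concatenation and a 12-hex-character prefix, as in the commit message.
    digest = hashlib.sha1((skill_name + symptom_class).encode("utf-8")).hexdigest()[:12]
    # Marker format is an assumption; an HTML comment stays invisible when rendered.
    return f"<!-- k-skill-qa-bot:{digest} -->"


def is_duplicate(marker, existing_issue_bodies):
    """True if any already-filed issue body carries the same marker."""
    return any(marker in body for body in existing_issue_bodies)
```

Because the marker depends only on the skill name and symptom class, repeated runs that hit the same failure produce the same marker, so the bot can skip filing a second issue.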

4 commits