k-skill/korean-slang-writing
Jeffrey (Dongkyu) Kim cc91e55682 korean-slang-writing (#133): harden extractor with numbered-h2 gate + category-nav strip
Implements the three non-blocking observations from PR #161 round-3 review:

1. Numbered-h2 gate (reviewer-flagged fragility):
   Refactored _extract_first_section_between_h2 to extract h2 inner text
   (stripping nested tags) and filter by '^\\s*\\d+(?:\\.\\d+)*\\.\\s+\\S'.
   Sidebar widgets like <h2>관련 문서</h2> or <h2>외부 링크</h2> can no longer
   anchor the extractor - only numbered section headers (1., 1.2., 2.3.4.) do.
   Handles live Namu Wiki structure where the number sits inside an <a> tag
   (<a>1.</a> <span>개요</span>), which the round-3 suggested regex-only gate
   missed. All 29 seed pages continue to produce valid summaries on live
   fetches.

2. Category-nav template strip (reviewer-flagged long-page noise):
   a. CATEGORY_NAV_RE strips the inline '[펼치기 · 접기]' marker plus its
      same-line aftermath (the category list items on the same line).
   b. DETAILS_PELCHIGI_RE strips the entire <details> block whose <summary>
      contains 펼치기. Namu Wiki today wraps category nav in exactly this
      structure, so the strip removes the full noise block (not just the
      marker line).
   꿀잼 summary drops from 3482 chars of category dump to 562 chars
   starting with the real definition '무언가가 매우 재미있다는 의미의 인터넷
   유행어'. Non-category <details> blocks (spoilers, footnotes) are
   preserved.

3. TDD + mutation coverage:
   6 new tests total: 2 numbered-h2 gate tests, 2 inline category-nav tests,
   1 <details>-block strip test, 1 <details>-keep test (negative case).
   All 6 were written first and confirmed RED against the round-2 baseline,
   then made GREEN after the implementation landed. Each fix path was also
   mutation-tested (revert regex, remove .sub line) to confirm the tests
   genuinely catch the target bug class.

Suite grows from 45 to 51 tests. All pass. npm run ci exits 0.
2026-04-22 14:18:42 +09:00
..
data korean-slang-writing (#133): fix broken seed namuwiki URLs + add encoding invariant test 2026-04-22 13:23:11 +09:00
scripts korean-slang-writing (#133): harden extractor with numbered-h2 gate + category-nav strip 2026-04-22 14:18:42 +09:00
SKILL.md korean-slang-writing (#133): fix broken seed namuwiki URLs + add encoding invariant test 2026-04-22 13:23:11 +09:00