Implements the three non-blocking observations from PR #161 round-3 review:
1. Numbered-h2 gate (reviewer-flagged fragility):
Refactored _extract_first_section_between_h2 to extract h2 inner text
(stripping nested tags) and filter by '^\s*\d+(?:\.\d+)*\.\s+\S'.
Sidebar widgets like <h2>관련 문서</h2> ("Related documents") or <h2>외부 링크</h2> ("External links") can no longer
anchor the extractor - only numbered section headers (1., 1.2., 2.3.4.) do.
Handles the live Namu Wiki structure where the number sits inside an <a> tag
(<a>1.</a> <span>개요</span>, i.e. "1. Overview"), which the regex-only gate
suggested in round 3 would have missed. All 29 seed pages continue to
produce valid summaries on live fetches.
2. Category-nav template strip (reviewer-flagged long-page noise):
a. CATEGORY_NAV_RE strips the inline '[펼치기 · 접기]' ("expand · collapse")
marker plus the rest of its line (the category list items that follow it).
b. DETAILS_PELCHIGI_RE strips the entire <details> block whose <summary>
contains 펼치기. Namu Wiki currently wraps category nav in exactly this
structure, so the strip removes the full noise block, not just the
marker line.
The 꿀잼 summary shrinks from a 3482-char category dump to 562 chars
starting with the real definition '무언가가 매우 재미있다는 의미의 인터넷
유행어' ("internet slang meaning that something is very fun").
Non-category <details> blocks (spoilers, footnotes) are preserved.
3. TDD + mutation coverage:
6 new tests total: 2 numbered-h2 gate tests, 2 inline category-nav tests,
1 <details>-block strip test, 1 <details>-keep test (negative case).
All 6 were written first and confirmed RED against the round-2 baseline,
then made GREEN after the implementation landed. Each fix path was also
mutation-tested (revert regex, remove .sub line) to confirm the tests
genuinely catch the target bug class.
Suite grows from 45 to 51 tests. All pass. npm run ci exits 0.
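A minimal sketch of the two fixes above, for illustration only. The names CATEGORY_NAV_RE and DETAILS_PELCHIGI_RE and the numbered-heading pattern come from this message; the other patterns, the helper names numbered_h2_titles and strip_category_nav, and the regex-based tag handling are assumptions and may differ from the actual implementation.

```python
import re

# Assumed helper patterns; only NUMBERED_H2_RE is quoted from the commit message.
H2_RE = re.compile(r"<h2\b[^>]*>(.*?)</h2>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
NUMBERED_H2_RE = re.compile(r"^\s*\d+(?:\.\d+)*\.\s+\S")

# Assumed pattern bodies for the two regexes named in the commit message.
# The tempered dot keeps the 펼치기 search inside a single <summary>.
DETAILS_PELCHIGI_RE = re.compile(
    r"<details\b[^>]*>\s*<summary\b[^>]*>"
    r"(?:(?!</summary>).)*?펼치기(?:(?!</summary>).)*?</summary>"
    r".*?</details>",
    re.IGNORECASE | re.DOTALL,
)
CATEGORY_NAV_RE = re.compile(r"\[펼치기\s*·\s*접기\][^\n]*")


def numbered_h2_titles(html):
    """Return the inner text of every h2 that looks like a numbered section."""
    titles = []
    for match in H2_RE.finditer(html):
        # Strip nested tags first so '<a>1.</a> <span>개요</span>' becomes '1. 개요'.
        inner_text = TAG_RE.sub("", match.group(1))
        if NUMBERED_H2_RE.match(inner_text):
            titles.append(inner_text.strip())
    return titles


def strip_category_nav(html):
    """Remove category-nav noise: whole <details> blocks first, then inline markers."""
    html = DETAILS_PELCHIGI_RE.sub("", html)
    return CATEGORY_NAV_RE.sub("", html)
```

The <details> strip runs before the inline strip in this sketch because the inline pattern would otherwise eat the closing </summary> on the same line and leave a half-removed block behind. Nested <details> elements are not handled here.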
Namu Wiki's current HTML layout uses build-time-obfuscated CSS class
names (e.g. _36R8DWTn, OZVChh+l) and has no <article>/<main>/<section>
tags, so none of the six MAIN_CONTENT_CLASSES anchors match. As a result,
extract_summary() returned empty with a 'Main content region not
detected' warning on every live page.
Replace the single class-based strategy with a three-tier fallback
chain that pins to progressively weaker but more structurally stable
anchors:
1. First h2 section boundary. Namu Wiki articles consistently open
with '<h2>1. 개요[편집]</h2>' ("1. Overview[edit]") and mark subsequent
sections with numbered h2 headings. Extracting the text between the
first and second h2 reliably captures the overview section on every
page sampled (중꺾마, 갓생, 럭키비키, 어쩔티비).
2. MAIN_CONTENT_CLASSES / <article> - kept as a legacy fallback
for older Namu Wiki layouts and for third-party fixtures.
3. og:description meta tag - final safety net before returning
empty, gives the agent at least a ~64-char preview when the
article has unusual structure.
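The fallback chain described above could be sketched roughly as follows. This is an assumption-laden illustration, not the code in slang_lookup.py: the function names, the regex-based parsing, the fixed attribute order in the og:description pattern, and the omission of the MAIN_CONTENT_CLASSES lookup are all simplifications.

```python
import re

# Assumed patterns for illustration only.
H2_RE = re.compile(r"<h2\b[^>]*>.*?</h2>", re.IGNORECASE | re.DOTALL)
ARTICLE_RE = re.compile(r"<article\b[^>]*>(.*?)</article>", re.IGNORECASE | re.DOTALL)
OG_DESC_RE = re.compile(
    r'<meta\s+property="og:description"\s+content="([^"]*)"', re.IGNORECASE
)
TAG_RE = re.compile(r"<[^>]+>")


def _text(fragment):
    """Drop tags and collapse whitespace."""
    return " ".join(TAG_RE.sub(" ", fragment).split())


def first_h2_section(html):
    """Tier 1: text between the first and second h2."""
    headings = list(H2_RE.finditer(html))
    if len(headings) < 2:
        return ""
    return _text(html[headings[0].end():headings[1].start()])


def article_body(html):
    """Tier 2: legacy <article> container."""
    match = ARTICLE_RE.search(html)
    return _text(match.group(1)) if match else ""


def og_description(html):
    """Tier 3: og:description meta tag as a last resort."""
    match = OG_DESC_RE.search(html)
    return match.group(1).strip() if match else ""


FALLBACK_CHAIN = (
    ("first_h2_section", first_h2_section),
    ("article", article_body),
    ("og_description", og_description),
)


def extract_with_fallback(html):
    """Return (tier_name, text) from the first tier that yields non-empty text."""
    for name, strategy in FALLBACK_CHAIN:
        text = strategy(html)
        if text:
            return name, text
    return None, ""
```

Returning the tier name alongside the text is a choice made for this sketch; it makes it easy to see in tests or logs which anchor actually matched.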
Strip '[편집]' ("[edit]") edit-affordance markers and numbered section
prefixes (e.g. '1.2.') from the extracted text so headings don't leak
through as noise.
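One way the cleanup could look, as a hedged sketch: the pattern names and the helper strip_heading_noise are hypothetical, and the prefix pattern deliberately requires whitespace after the final dot so that decimals such as '3.5' in body text are left alone.

```python
import re

# Hypothetical patterns; the real ones may differ.
EDIT_MARKER_RE = re.compile(r"\[편집\]")
SECTION_PREFIX_RE = re.compile(r"^[ \t]*\d+(?:\.\d+)*\.[ \t]+", re.MULTILINE)


def strip_heading_noise(text):
    """Remove '[편집]' markers and leading section numbers like '1.2. '."""
    text = EDIT_MARKER_RE.sub("", text)
    return SECTION_PREFIX_RE.sub("", text).strip()
```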
Live verification (text format):
slang_lookup.py 중꺾마 -> Title + 286-char summary
slang_lookup.py 갓생 -> Title + 96-char summary
slang_lookup.py 럭키비키 -> Title + 59-char summary
slang_lookup.py 어쩔티비 -> Title + 20-char summary
All four were previously empty. Not-found / blocked / upstream-error paths
and exit codes are unchanged.
A reviewer flagged that 4 of the 30 seed namuwiki_url values return HTTP 404
on live Namu Wiki. These URLs are part of the documented response contract
and are surfaced directly to agents, so broken links are a functional bug,
not a cosmetic one.
Root causes per entry:
- 중꺾마: wrong 꺾 codepoint (U+AFFA 꿺 instead of U+AEBE 꺾).
- 아아: typo in aliased title (아이스 아메리칸노 instead of 아메리카노).
- 어쩔티비: missing 받침 (final consonant): 어쩌티비 instead of 어쩔티비.
- 당모치: encoding correct but no live Namu Wiki article exists; dropped.
Also fixes two separately broken 중꺾마 example URLs in SKILL.md
(U+AFBE 꾾 instead of U+AEBE 꺾). These were discovered while auditing
the seed and would have surfaced as 404s to agents following the example
snippets.
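Look-alike Hangul syllables are hard to spot by eye, so a small helper along these lines can pinpoint the differing codepoint. The function names are hypothetical and the comparison only covers the overlapping length of the two strings.

```python
def codepoints(text):
    """Render each character as U+XXXX for side-by-side comparison."""
    return [f"U+{ord(ch):04X}" for ch in text]


def first_difference(expected, actual):
    """Return (index, expected_codepoint, actual_codepoint) or None if equal so far."""
    for index, (want, got) in enumerate(zip(expected, actual)):
        if want != got:
            return index, f"U+{ord(want):04X}", f"U+{ord(got):04X}"
    return None


# The correct term versus the variant with the wrong middle syllable.
print(first_difference("중\uaebe마", "중\uaffa마"))
```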
Adds two regression tests:
- test_each_seed_url_decodes_to_term_or_alias: decodes every seed URL's
path segment and asserts it equals the term or one of its aliases.
Catches Hangul-codepoint typos offline (no network dependency) and
would have caught all 3 encoding bugs in this PR.
- test_no_seed_entry_points_at_known_missing_namuwiki_page: locks the
당모치 drop so nobody re-adds an entry pointing at a page that does
not exist on Namu Wiki.
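The offline decode check described above might look roughly like this. It is a sketch under assumptions: the seed entry shape (term, aliases), the '/w/' path prefix, and the helper names are illustrative; only the namuwiki_url field name comes from this message.

```python
from urllib.parse import unquote, urlsplit

WIKI_PREFIX = "/w/"  # assumed Namu Wiki article path prefix


def decoded_title(url):
    """Percent-decode the article title from a Namu Wiki URL path."""
    path = urlsplit(url).path
    if not path.startswith(WIKI_PREFIX):
        raise ValueError(f"unexpected path: {path!r}")
    return unquote(path[len(WIKI_PREFIX):])


def mismatched_terms(seed):
    """Return terms whose URL decodes to neither the term nor one of its aliases."""
    bad = []
    for entry in seed:
        accepted = {entry["term"], *entry.get("aliases", [])}
        if decoded_title(entry["namuwiki_url"]) not in accepted:
            bad.append(entry["term"])
    return bad
```

In a unittest case this would reduce to asserting that mismatched_terms(seed) is empty, which needs no network access.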
Fixes the existing LookupNetworkTest assertion that was hard-coding the
broken URL. It now derives the expected URL via build_namuwiki_url(),
so the test cannot drift out of sync with the helper again.
Verification:
- PYTHONPATH=.:scripts python3 -m unittest scripts.test_korean_slang_writing -> 40/40 pass
- Live GET with browser headers against all 29 remaining seed URLs -> 29/29 return 200
- npm run ci -> exit 0
- Manual QA: slang_search on 중꺾마, 어쩔티비, 아이스 아메리카노 returns
correct URLs; slang_lookup live-fetches 중꺾마 and extracts the
canonical title '중요한 것은 꺾이지 않는 마음'.