Commit graph

5 commits

Author SHA1 Message Date
Jeffrey (Dongkyu) Kim
cc91e55682 korean-slang-writing (#133): harden extractor with numbered-h2 gate + category-nav strip
Addresses the three non-blocking observations from the PR #161 round-3 review:

1. Numbered-h2 gate (reviewer-flagged fragility):
   Refactored _extract_first_section_between_h2 to extract h2 inner text
   (stripping nested tags) and filter by '^\\s*\\d+(?:\\.\\d+)*\\.\\s+\\S'.
   Sidebar widgets like <h2>관련 문서</h2> ("Related documents") or
   <h2>외부 링크</h2> ("External links") can no longer anchor the extractor;
   only numbered section headers (1., 1.2., 2.3.4.) do. This also handles the
   live Namu Wiki structure, where the number sits inside an <a> tag
   (<a>1.</a> <span>개요</span>), a case the regex-only gate suggested in
   round 3 missed. All 29 seed pages continue to produce valid summaries on
   live fetches.

2. Category-nav template strip (reviewer-flagged long-page noise):
   a. CATEGORY_NAV_RE strips the inline '[펼치기 · 접기]' ("Expand · Collapse")
      marker plus the rest of that line (the category list items that follow
      the marker).
   b. DETAILS_PELCHIGI_RE strips the entire <details> block whose <summary>
      contains 펼치기. Namu Wiki currently wraps category nav in exactly this
      structure, so the strip removes the full noise block (not just the
      marker line).
   The 꿀잼 summary drops from 3482 chars of category dump to 562 chars
   starting with the real definition '무언가가 매우 재미있다는 의미의 인터넷
   유행어' ("an internet buzzword meaning that something is very fun").
   Non-category <details> blocks (spoilers, footnotes) are preserved.

3. TDD + mutation coverage:
   6 new tests total: 2 numbered-h2 gate tests, 2 inline category-nav tests,
   1 <details>-block strip test, 1 <details>-keep test (negative case).
   All 6 were written first and confirmed RED against the round-2 baseline,
   then made GREEN after the implementation landed. Each fix path was also
   mutation-tested (revert regex, remove .sub line) to confirm the tests
   genuinely catch the target bug class.
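
A minimal sketch of the numbered-h2 gate from item 1, assuming regex-based HTML
handling. Only _extract_first_section_between_h2 and the gate pattern come from
this commit; H2_RE, TAG_RE, h2_inner_text and first_numbered_section are
hypothetical names chosen for illustration, not the extractor's actual code.

```python
import re

# Hypothetical helper patterns; only the gate pattern is quoted from the commit.
H2_RE = re.compile(r"<h2\b[^>]*>(.*?)</h2>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
NUMBERED_RE = re.compile(r"^\s*\d+(?:\.\d+)*\.\s+\S")


def h2_inner_text(inner_html: str) -> str:
    """Drop nested tags so '<a>1.</a> <span>개요</span>' becomes '1. 개요'."""
    return TAG_RE.sub("", inner_html)


def first_numbered_section(html: str) -> str:
    """Return the HTML between the first two numbered h2 headings.

    Unnumbered h2 elements (sidebar widgets) are ignored as anchors.
    Returns '' when no numbered heading exists.
    """
    spans = [
        (m.start(), m.end())
        for m in H2_RE.finditer(html)
        if NUMBERED_RE.match(h2_inner_text(m.group(1)))
    ]
    if not spans:
        return ""
    start = spans[0][1]
    end = spans[1][0] if len(spans) > 1 else len(html)
    return html[start:end]
```

Matching against the tag-stripped inner text, rather than the raw markup, is
what lets the gate accept headings whose number is wrapped in an <a> tag.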

Suite grows from 45 to 51 tests. All pass. npm run ci exits 0.
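
The category-nav strip from item 2 could look roughly like the sketch below.
The two pattern names come from this commit, but their bodies are assumptions
(the commit does not show them), and strip_category_nav is a hypothetical
wrapper. The sketch does not handle nested <details> elements.

```python
import re

# Assumed definition: the inline marker plus everything up to the end of its line.
CATEGORY_NAV_RE = re.compile(r"\[펼치기\s*·\s*접기\][^\n]*")

# Assumed definition: a <details> block whose own <summary> contains 펼치기.
# The tempered dot keeps the 펼치기 search from running past </summary>.
DETAILS_PELCHIGI_RE = re.compile(
    r"<details\b[^>]*>\s*<summary\b[^>]*>(?:(?!</summary>).)*?펼치기.*?</details>",
    re.IGNORECASE | re.DOTALL,
)


def strip_category_nav(html: str) -> str:
    """Remove category-nav noise; other <details> blocks are left intact."""
    # Remove whole blocks first so their marker text never reaches the
    # line-level pattern.
    html = DETAILS_PELCHIGI_RE.sub("", html)
    return CATEGORY_NAV_RE.sub("", html)
```

Running the block-level strip before the inline one means a marker inside a
<summary> is removed together with its block, while a bare marker elsewhere
still loses only the rest of its own line.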
2026-04-22 14:18:42 +09:00
Jeffrey (Dongkyu) Kim
4f31dae11f korean-slang-writing (#133): extract summaries via h2 section anchor + og:description fallback
Namu Wiki's current HTML layout uses build-time-obfuscated CSS class
names (e.g. _36R8DWTn, OZVChh+l) and has no <article>/<main>/<section>
tags, so all six MAIN_CONTENT_CLASSES anchors failed to match and
extract_summary() returned empty with a 'Main content region not
detected' warning on every live page.

Replace the single class-based strategy with a three-tier fallback
chain whose anchors are progressively less precise but more
structurally stable:

  1. First h2 section boundary. Namu Wiki articles consistently open
     with '<h2>1. 개요[편집]</h2>' and mark subsequent sections with
     numbered h2 headings. Extracting text between the first and
     second h2 reliably captures the overview section on every page
     sampled (중꺾마, 갓생, 럭키비키, 어쩔티비).
  2. MAIN_CONTENT_CLASSES / <article>: kept as a legacy fallback
     for older Namu Wiki layouts and for third-party fixtures.
  3. og:description meta tag: the final safety net before returning
     empty. It gives the agent at least a ~64-char preview when the
     article has unusual structure.
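
The chain above can be sketched as an ordered list of extractors where the
first non-empty result wins. This is an illustration under assumptions, not
the body of extract_summary(): first_non_empty and og_description are
hypothetical names, and the meta-tag pattern assumes the property attribute
precedes content.

```python
import re
from html import unescape
from typing import Callable, List, Optional

Extractor = Callable[[str], Optional[str]]

# Assumes attribute order property=... content=...; real pages may differ.
OG_DESC_RE = re.compile(
    r"<meta\s+property=([\"'])og:description\1\s+content=([\"'])(.*?)\2",
    re.IGNORECASE | re.DOTALL,
)


def og_description(page: str) -> Optional[str]:
    """Last-resort tier: the og:description preview, entity-decoded."""
    m = OG_DESC_RE.search(page)
    return unescape(m.group(3)) if m else None


def first_non_empty(page: str, extractors: List[Extractor]) -> str:
    """Try each tier in order; return the first non-blank result, else ''."""
    for extract in extractors:
        text = extract(page)
        if text and text.strip():
            return text.strip()
    return ""
```

In this shape the h2-section extractor and the legacy class-based extractor
would sit ahead of og_description in the list, so the weakest tier runs only
when both stronger tiers return nothing.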

Strip '[편집]' edit-affordance markers and numbered section prefixes
(e.g. '1.2.') from the extracted text so headings don't leak through
as noise.
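
A minimal sketch of that cleanup step, assuming it runs on already-extracted
plain text. EDIT_MARKER_RE, SECTION_PREFIX_RE and strip_heading_noise are
hypothetical names, not identifiers from slang_lookup.py.

```python
import re

# '[편집]' is the edit-affordance label that follows each heading.
EDIT_MARKER_RE = re.compile(r"\[편집\]")

# A numbered prefix such as '1.' or '1.2.' at the start of a line. Requiring
# trailing whitespace keeps decimals like '3.14' from being treated as prefixes.
SECTION_PREFIX_RE = re.compile(r"^[ \t]*\d+(?:\.\d+)*\.[ \t]+", re.MULTILINE)


def strip_heading_noise(text: str) -> str:
    """Remove edit markers and numbered section prefixes from extracted text."""
    text = EDIT_MARKER_RE.sub("", text)
    text = SECTION_PREFIX_RE.sub("", text)
    return text.strip()
```

One caveat of this sketch: a body line that happens to start with a number
followed by a period and a space would also lose that prefix.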

Live verification (text format):
  slang_lookup.py 중꺾마   -> Title + 286-char summary
  slang_lookup.py 갓생     -> Title + 96-char summary
  slang_lookup.py 럭키비키 -> Title + 59-char summary
  slang_lookup.py 어쩔티비 -> Title + 20-char summary

All four lookups previously returned empty summaries. Not-found /
blocked / upstream-error paths and exit codes are unchanged.
2026-04-22 13:44:13 +09:00
Jeffrey (Dongkyu) Kim
541967e96c korean-slang-writing (#133): fix broken seed namuwiki URLs + add encoding invariant test
Reviewer flagged 4/30 seed namuwiki_url values returning HTTP 404 on live
Namu Wiki. These URLs are part of the documented response contract and get
surfaced directly to agents, so broken links are a functional bug, not a
cosmetic one.

Root causes per entry:
- 중꺾마: wrong 꺾 codepoint (U+AFFA 꿺 instead of U+AEBE 꺾).
- 아아: typo in aliased title (아이스 아메리칸노 instead of 아메리카노).
- 어쩔티비: missing 받침, i.e. the final consonant (어쩌티비 instead of 어쩔티비).
- 당모치: encoding correct but no live Namu Wiki article exists; dropped.

Also fixes two separately broken 중꺾마 example URLs in SKILL.md
(U+AFBE 꾾 instead of U+AEBE 꺾). These were discovered while auditing
the seed and would have surfaced as 404s to agents following the example
snippets.

Adds two regression tests:
- test_each_seed_url_decodes_to_term_or_alias: decodes every seed URL's
  path segment and asserts it equals the term or one of its aliases.
  Catches Hangul-codepoint typos offline (no network dependency) and
  would have caught all 3 encoding bugs in this PR.
- test_no_seed_entry_points_at_known_missing_namuwiki_page: locks the
  당모치 drop so nobody re-adds an entry pointing at a page that does
  not exist on Namu Wiki.
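
The decode invariant behind the first test can be sketched offline as below.
The seed entry shape (term / aliases / namuwiki_url keys) and the helper names
are assumptions for illustration; the real seed file and test body are not
shown in this commit. The sketch relies on Namu Wiki article URLs having the
form https://namu.wiki/w/<percent-encoded title>.

```python
from typing import Dict, List
from urllib.parse import unquote, urlsplit

# Hypothetical seed entries. GOOD encodes 꺾 as U+AEBE (%EA%BA%BE);
# BROKEN carries the wrong codepoint U+AFFA (%EA%BF%BA) in the same position.
GOOD = {
    "term": "\uc911\uaebe\ub9c8",  # 중꺾마
    "aliases": ["중요한 것은 꺾이지 않는 마음"],
    "namuwiki_url": "https://namu.wiki/w/%EC%A4%91%EA%BA%BE%EB%A7%88",
}
BROKEN = {
    "term": "\uc911\uaebe\ub9c8",
    "aliases": [],
    "namuwiki_url": "https://namu.wiki/w/%EC%A4%91%EA%BF%BA%EB%A7%88",
}


def decoded_title(url: str) -> str:
    """Percent-decode the last path segment of a seed URL."""
    raw_segment = urlsplit(url).path.rsplit("/", 1)[-1]
    return unquote(raw_segment)


def mismatched_terms(seed: List[Dict]) -> List[str]:
    """Return terms whose URL decodes to neither the term nor an alias."""
    return [
        entry["term"]
        for entry in seed
        if decoded_title(entry["namuwiki_url"])
        not in {entry["term"], *entry.get("aliases", [])}
    ]
```

A test built on this would assert that mismatched_terms(seed) is empty.
Because the comparison is by exact codepoint, visually similar Hangul
syllables fail the check without any network access.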

Fixes the existing LookupNetworkTest assertion that hard-coded the
broken URL. It now derives the expected URL via build_namuwiki_url(),
so the test cannot drift out of sync with the helper again.

Verification:
- PYTHONPATH=.:scripts python3 -m unittest scripts.test_korean_slang_writing -> 40/40 pass
- Live GET with browser headers against all 29 remaining seed URLs -> 29/29 return 200
- npm run ci -> exit 0
- Manual QA: slang_search on 중꺾마, 어쩔티비, 아이스 아메리카노 returns
  correct URLs; slang_lookup live-fetches 중꺾마 and extracts the
  canonical title '중요한 것은 꺾이지 않는 마음'.
2026-04-22 13:23:11 +09:00
Jeffrey (Dongkyu) Kim
2785bc3c17 korean-slang-writing (#133): fix module-loader sys.modules registration
2026-04-22 12:48:21 +09:00
Jeffrey (Dongkyu) Kim
3080b535ec WIP korean-slang-writing (#133): add test suite
2026-04-22 12:48:21 +09:00