k-skill/packages/k-skill-rhwp/README.md
Jeffrey (Dongkyu) Kim c563ef535b rhwp-edit (#155): guard replace-all case-insensitive path against UTF-16 length-drift
Round 2 review flagged a latent Unicode safety bug: when replaceAll's
caseSensitive=false branch encounters characters whose toLowerCase()
changes UTF-16 length (e.g. Turkish İ U+0130 → i + U+0307 combining dot
above), offsets taken in the lowercased haystack drift by the expansion
delta for every subsequent match and silently corrupt the document.
Reviewer repro: 'ABCİABCİXYZ' + case-insensitive İ→Z reported
{ok:true,count:2} but rendered 'ABCZABCİZYZ' instead of 'ABCZABCZXYZ'
(the X at index 8 was corrupted while the second İ survived).

Surface a descriptive error rather than silently drift:
- findAllMatchOffsets: in the case-insensitive branch, verify that the
  paragraph text and the query each preserve UTF-16 length under
  toLowerCase; otherwise throw with an actionable message pointing the
  user to --case-sensitive or input normalization.
- This is strictly a safety guard: the 2025→2026 headline workflow,
  ASCII, Hangul, and every existing test are unaffected.

Tests (TDD red → green, net +4 in packages/k-skill-rhwp):
- 'replaceAll refuses case-insensitive matching when source text
  contains case-folding length-changing chars (e.g. Turkish İ U+0130)'
  reproduces the exact reviewer input and asserts rejection + no output
  file
- 'replaceAll refuses case-insensitive matching when the query itself
  contains case-folding length-changing chars' covers the query-side path
- 'replaceAll with --case-sensitive succeeds on inputs containing İ'
  confirms the guard only fires in the case-insensitive path and that
  case-sensitive produces ABCZABCZXYZ with no X corruption
- 'replaceAll case-insensitive still works for normal ASCII/Hangul'
  regression-guards against the fix over-rejecting the common case

Doc disclosure in all 4 surfaces called out by the reviewer:
- rhwp-edit/SKILL.md: new failure-mode bullet naming U+0130 specifically
- docs/features/rhwp-edit.md: Unicode 대소문자 무시 주의 ("Unicode
  case-insensitivity caution") paragraph under scenario 3 (replace-all)
- packages/k-skill-rhwp/README.md: extended Scope section
- packages/k-skill-rhwp/src/cli.js: USAGE 'Scope note' appended
- scripts/skill-docs.test.js: 2 new assertions locking the SKILL.md and
  feature-doc disclosure so they can't be silently removed
- .changeset: note the guard in the pending v0.1.0 release notes

Manual QA (end-to-end via the published CLI):
  $ k-skill-rhwp replace-all … --query İ --replacement Z
  → exit 1 + 'case-insensitive matching is unsafe because case folding
    changes the UTF-16 length …'
  → no output file written
  $ k-skill-rhwp replace-all … --query İ --replacement Z --case-sensitive
  → {ok:true,count:2}, render shows 'ABCZABCZXYZ', search İ ⇒ found:false
  $ replace-all '2025'→'2026' on '2025 2025 2025' ⇒ {ok:true,count:3}
  $ replace-all 'hello'→'hi' (case-insens.) on 'hello WORLD 안녕 HELLO'
    ⇒ {ok:true,count:2}

Verification:
- npm test --workspace k-skill-rhwp: 35 pass / 0 fail (+4 vs Round 2)
- node --test scripts/skill-docs.test.js: 114 pass / 0 fail
- npm run ci: exit 0 (lint + typecheck + all workspace tests +
  pack:dry-run + validate-skills.sh all green)

Refs PR #162 Round 2 review 'Non-blocking residual risk — Unicode
case-insensitive offset drift'.
2026-04-22 15:23:23 +09:00


k-skill-rhwp

A Node.js CLI for editing HWP files. It exposes @rhwp/core (Rust + WebAssembly, MIT, by Edward Kim) as subcommands.

  • Ships the k-skill-rhwp binary for the rhwp-edit skill in NomaDamas/k-skill.
  • Round-trip safe HWP 5.x editing — insert/delete text, replace-all, create tables, set cell text, and render pages to SVG or HTML.
  • Node 18+ only. No Rust toolchain required; the shipped WASM does the work.

For debugging the upstream rhwp Rust CLI (export-svg --debug-overlay, dump, ir-diff, thumbnail, convert), see the rhwp-advanced skill — this package does not wrap those commands.

For .hwp → Markdown / JSON / form-field extraction, see the hwp skill (kordoc-based). This package is editing-only.

Install

npm install k-skill-rhwp
# or run one-off
npx --yes k-skill-rhwp --help

CLI

# Metadata / structure
k-skill-rhwp info <input.hwp>
k-skill-rhwp list-paragraphs <input.hwp> [--section N]
k-skill-rhwp search <input.hwp> --query TEXT [--from-section N] [--from-paragraph N] [--from-char N] [--case-sensitive]

# Body editing
k-skill-rhwp insert-text <input> <output> --section N --paragraph N --offset N --text TEXT
k-skill-rhwp delete-text <input> <output> --section N --paragraph N --offset N --count N
k-skill-rhwp replace-all <input> <output> --query TEXT --replacement TEXT [--case-sensitive]

# Tables
k-skill-rhwp create-table <input> <output> --section N --paragraph N --offset N --rows N --cols N
k-skill-rhwp set-cell-text <input> <output> --section N --parent-paragraph N --control N --cell N --text TEXT [--cell-paragraph N] [--no-replace]

# Rendering / creation
k-skill-rhwp create-blank <output.hwp>
k-skill-rhwp render <input.hwp> [--page N] [--format svg|html]

Every editing subcommand writes a brand-new HWP file (never overwrites the input) and prints a JSON summary including ok, post-edit cursor position, bytesWritten, and the resolved outputPath.
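A script that drives the CLI can check the summary before using the output file. The following is a minimal sketch, assuming the summary is a single JSON object on stdout; parseEditSummary is a hypothetical helper, and the sample string is made up for illustration (only the field names ok, bytesWritten, and outputPath come from the description above).

```javascript
// Hypothetical helper: parse the JSON summary printed by an editing subcommand.
// Assumption: the whole stdout is one JSON object.
function parseEditSummary(stdout) {
  const summary = JSON.parse(stdout);
  if (summary.ok !== true) {
    throw new Error(`edit failed: ${stdout.trim()}`);
  }
  return { outputPath: summary.outputPath, bytesWritten: summary.bytesWritten };
}

// Made-up sample output; real values and any additional fields will differ.
const sample = '{"ok":true,"bytesWritten":12345,"outputPath":"/tmp/out.hwp"}';
console.log(parseEditSummary(sample));
```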

Scope of search and replace-all

Both search and replace-all operate on body paragraphs only. Text inside table cells, headers/footers, or footnotes is not scanned, which mirrors the scope of the upstream @rhwp/core searchText. For cell text, use info or list-paragraphs to locate the table, then set-cell-text to write.

replace-all rejects any --replacement that contains newline or paragraph-break characters (\n, \r, U+2028, U+2029), because they would split a paragraph. Split such text across multiple insert-text calls instead.

replace-all uses non-overlapping replacement semantics: matches are computed against the original text before any replacement runs, so --query a --replacement aa against aaa replaces the three original matches and yields aaaaaa rather than looping forever.
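The non-overlapping semantics and the replacement check can be sketched on a plain string. This is an illustration only, not the package's implementation: findOffsets and replaceAllOnce are hypothetical names, and the real CLI works on HWP paragraphs through @rhwp/core rather than on JavaScript strings.

```javascript
// Hypothetical helpers illustrating the semantics; not the package's actual code.
const FORBIDDEN_IN_REPLACEMENT = /[\n\r\u2028\u2029]/;

// Collect non-overlapping match offsets against the ORIGINAL text.
function findOffsets(text, query) {
  if (query.length === 0) throw new Error("query must not be empty");
  const offsets = [];
  let from = 0;
  while (true) {
    const at = text.indexOf(query, from);
    if (at === -1) break;
    offsets.push(at);
    from = at + query.length; // skip past the match so matches never overlap
  }
  return offsets;
}

function replaceAllOnce(text, query, replacement) {
  if (FORBIDDEN_IN_REPLACEMENT.test(replacement)) {
    throw new Error("replacement must not contain newline or paragraph-break characters");
  }
  const offsets = findOffsets(text, query);
  // Apply from the last match to the first so earlier offsets stay valid.
  let result = text;
  for (let i = offsets.length - 1; i >= 0; i--) {
    const at = offsets[i];
    result = result.slice(0, at) + replacement + result.slice(at + query.length);
  }
  return { text: result, count: offsets.length };
}

console.log(replaceAllOnce("aaa", "a", "aa")); // { text: 'aaaaaa', count: 3 }
```

Because the offsets are fixed before any replacement is applied, text inserted by one replacement is never rescanned, which is why a replacement that contains the query cannot loop.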

Case-insensitive matching (the default) relies on String.prototype.toLowerCase() preserving UTF-16 length, so that offsets taken in the lowercased haystack still apply to the original text. A handful of Unicode characters violate that invariant, notably Turkish İ (U+0130), which lowercases to i followed by combining dot above (U+0307). When either the query or a paragraph contains such a character, replace-all refuses the operation: it exits with code 1 and a message containing "case-insensitive matching is unsafe because case folding changes the UTF-16 length", rather than silently drifting every subsequent offset. Rerun with --case-sensitive, or normalize the input. ASCII, Hangul, and the common HWP use cases (e.g. 2025 → 2026) are not affected.
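The length invariant and the drift it prevents can be reproduced on a plain string. This is a minimal sketch under stated assumptions: all three function names are hypothetical, the error wording is illustrative, and the naive function is written only to show how offsets taken from lowercased text go wrong when applied to the original.

```javascript
// Hypothetical helper names; the package's internal guard may be written differently.

// True when lowercasing keeps the UTF-16 length, so offsets stay aligned.
function preservesLengthUnderLowerCase(value) {
  return value.toLowerCase().length === value.length;
}

// Naive case-insensitive replace: offsets come from the lowercased text
// but are applied to the original text. Shown only to illustrate the drift.
function naiveCaseInsensitiveReplace(text, query, replacement) {
  const haystack = text.toLowerCase();
  const needle = query.toLowerCase();
  if (needle.length === 0) return text;
  const offsets = [];
  let at = haystack.indexOf(needle);
  while (at !== -1) {
    offsets.push(at);
    at = haystack.indexOf(needle, at + needle.length);
  }
  let result = text;
  for (let i = offsets.length - 1; i >= 0; i--) {
    result = result.slice(0, offsets[i]) + replacement + result.slice(offsets[i] + query.length);
  }
  return result;
}

// Guarded variant: refuse instead of drifting (error wording is illustrative).
function guardedCaseInsensitiveReplace(text, query, replacement) {
  if (!preservesLengthUnderLowerCase(text) || !preservesLengthUnderLowerCase(query)) {
    throw new Error("case folding changes the UTF-16 length; use case-sensitive matching");
  }
  return naiveCaseInsensitiveReplace(text, query, replacement);
}

const input = "ABC\u0130ABC\u0130XYZ"; // \u0130 is Turkish İ
console.log("\u0130".toLowerCase().length); // 2 ("i" + U+0307)
console.log(naiveCaseInsensitiveReplace(input, "\u0130", "Z")); // ABCZABCİZYZ (X corrupted, second İ survives)
console.log(guardedCaseInsensitiveReplace("hello WORLD 안녕 HELLO", "hello", "hi")); // hi WORLD 안녕 hi
```

Checking both the text and the query matters: a query containing İ lowercases to two code units even when the text itself is plain ASCII.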

Node API

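const { insertText, createTable, setCellText, getDocumentInfo } = require("k-skill-rhwp");

// CommonJS has no top-level await, so run the calls inside an async function.
async function main() {
  // Insert a title at the very start of the first paragraph of section 0.
  await insertText({
    input: "./draft.hwp",
    output: "./draft-with-title.hwp",
    section: 0,
    paragraph: 0,
    offset: 0,
    text: "2026년 신청서"
  });

  console.log(await getDocumentInfo("./draft-with-title.hwp"));
}

main().catch((err) => {
  console.error(err);
  process.exitCode = 1;
});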

The first call loads @rhwp/core WASM once per process. The WASM requires a globalThis.measureTextWidth(font, text) callback for text layout; this package auto-installs a deterministic approximation shim on first use so it works headless on Node without canvas. Replace the shim before the first call if you need pixel-accurate metrics.
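A replacement shim might look like the sketch below. Only the callback name and its (font, text) parameters come from the description above. Everything else is an assumption for illustration: that font is a CSS-like string such as "12px serif", that the return value is a width in pixels, and that the package leaves an already-installed callback in place. approximateTextWidth is a hypothetical name, and its half-width/full-width heuristic is a placeholder for a real measurement, not pixel-accurate metrics.

```javascript
// Assumptions (not confirmed by this README): `font` is a CSS-like string such as
// "12px serif", and the callback returns a width in pixels.
function approximateTextWidth(font, text) {
  const match = /(\d+(?:\.\d+)?)px/.exec(String(font));
  const fontSize = match ? Number(match[1]) : 10; // fallback size is arbitrary
  let width = 0;
  for (const ch of String(text)) {
    // Treat code points from U+2E80 upward (CJK, Hangul, ...) as full width,
    // everything else as half width.
    width += ch.codePointAt(0) >= 0x2e80 ? fontSize : fontSize * 0.5;
  }
  return width;
}

// Install before the first k-skill-rhwp call.
globalThis.measureTextWidth = approximateTextWidth;
```

For real metrics, the body of the function would delegate to whatever measurement backend is available in the environment.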

Known limitations

  • HWPX round-trip is disabled upstream (rhwp #196). HWPX input is accepted, but output is always written as HWP 5.x binary.
  • rhwp v0.7.x is beta. Complex tables, images, charts, or form fields may occasionally lose fidelity on round-trip; verify with info and visual render after non-trivial edits.
  • Windows security modules, Hancom GUI automation, and handling of read-only distribution documents beyond rhwp convert are out of scope.

Upstream references

License

MIT