mirror of
https://github.com/epoko77-ai/im-not-ai.git
synced 2026-06-21 13:18:09 +00:00
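Absorb the eight major translationese types from Korean translation studies
(이근희, 김정우, 김도훈, 곽은주, 김순영, 박옥수, 김혜영, 이영옥) into the main
taxonomy, and introduce the three post-editese axes of Toral 2019
(simplification, normalisation, interference) as 14 new metric tracks.
The monolith and the five agent definitions are unmodified; the
3-tool-call cap (v1.6.1) is preserved.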
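New patterns, 4 (registered in the main taxonomy):
- A-16 literal translation of English pronouns [S1, 김도훈 2009 + Cho et al. 2019 ACL GeBNLP]
- A-18 literal translation of relative clauses: left-branching modification [S2, 박옥수 2018]
- A-19 double particle combination [S2, 김정우 2007, excluding a simple explicit ~의]
- E-7 loss of addressee-honorific consistency [S2 estimated, 김혜영 2019, dialogue guard]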
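Main-taxonomy hold, 1 (result of the v2.0b external round):
- A-17 '-들' attached to inanimate or abstract nouns: 0 positives across
  the 5 v1.6 texts + 6 external wiki texts. The academic anchor, the metric,
  and scholarship.md §4 are preserved; only main-taxonomy registration is
  deferred. To be revived under the same ID in v2.1 after a round on raw NMT
  output. patternID stability is preserved.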
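Main-taxonomy reinforcements, 4 (body text unmodified + prescriptions added):
- A-15 abstract subject: three-axis prescription for causative, cognitive, and speech verbs
- A-7 '가지고 있다': generalised to light verb constructions (have/make/take/give)
- F-4 excessive prefixes and suffixes: folds in English nominalisers -tion/-ment/-ness/-ity
- E-2 identical sentence endings: prescription for automatic mapping of the progressive '~고 있다'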
post-editese metric-only tracks:
- lexical_diversity_ttr, lexical_density, ending_diversity (simplification)
- normalisation_score, da_streak_rate (normalisation)
- inanimate_subject_rate, by/double_passive, pronoun_density, deul_overuse_rate,
  relative_clause_nesting, have_make_literal, double_particle, progressive_aspect
  + composite interference_index (interference, T1~T8)
- None of the 14 is assigned a main-taxonomy pattern ID (caveat C3: no
  quantitative validation for Korean). Split out into metrics_v2.py; the
  70 cells of baseline_v2.json are still placeholders.
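Regression verification:
- Scored input and final for the 5 v1.6 texts (no re-polishing). 0 regressions.
- lexical_diversity rose in all 5 texts (first-pass counter-evidence to the
  post-editese simplification hypothesis).
- External round, 6 wiki texts: A-16 positive 50%, A-18 positive 67%;
  interference_index external mean 0.251 vs 0.05~0.10 for v1.6 (first-pass
  agreement with the Toral hypothesis).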
Academic citations preserved on both sides:
- One-line SSOT meta field (ai-tell-taxonomy.md) + external SSOT full text
  (scholarship.md, 29 scholars, 6 caveats verbatim).
- Rulebook slimness preserved: quick-rules.md 126→133 lines (≤180 limit).
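Four iron rules observed:
- monolith and five agent definitions unmodified (humanize-monolith, detector,
  rewriter, auditor, reviewer: git diff 0 lines).
- Regression without re-polishing (user token control principle).
- Two-sided citation preserved.
- patternID reference stability (A-1~A-15, E-1~E-6 body text unmodified).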
Detailed PR body: _workspace/v2.0-2026-05-07/07_pr/07_pr_draft.md
External round report: _workspace/v2.0-2026-05-07/05_regression/v2_external_samples/H1_revisited.md
255 lines
9.2 KiB
Python
#!/usr/bin/env python3
"""Humanize KR v1.6 - monolith input shim.

Pre-processes user input by computing v1.6 quantitative metrics and
prepending the result to the text the monolith agent reads. The monolith
keeps its 4-tool-call cap (Read input + Read rules + Write final + Write
summary) because the metrics block is folded into the same input file.

Inputs:
    --run-dir DIR    existing run directory containing 01_input.txt
    --text STR       ad-hoc text; if --run-dir is omitted, a new run dir
                     `_workspace/<YYYY-MM-DD>-NNN/` is created and 01_input.txt
                     written.
    --genre STR      essay|column|report|blog|abstract|... (default: essay)

Outputs (in {run_dir}):
    00_metrics.json            - full compute_all() output (or error stub)
    01_input.txt               - original text (created if --text used)
    01_input_with_metrics.txt  - combined file the monolith Reads
    00_metrics.error           - only on graceful-degrade fallback

Hard rules:
    - stdlib only (argparse/json/os/sys/datetime/pathlib/traceback)
    - never modify the original text body inside the combined file
    - on metrics failure, write the combined file *without* the score block
      so the monolith degrades to v1.5 behaviour automatically.
"""

from __future__ import annotations

import argparse
import json
import os
import sys
import traceback
from datetime import date
from pathlib import Path

# Resolve project layout. This file lives at:
#   {project_root}/scripts/prepare_monolith_input.py
# metrics.py is at:
#   {project_root}/.claude/skills/humanize-korean/references/metrics.py
HERE = Path(__file__).resolve().parent
PROJECT_ROOT = HERE.parent
METRICS_DIR = PROJECT_ROOT / ".claude" / "skills" / "humanize-korean" / "references"

# Make metrics.py importable without polluting global state.
sys.path.insert(0, str(METRICS_DIR))
# Prefer the v2.0 import; its compute_all alias keeps v1.6 compatibility.
# If metrics_v2 is missing or fails to load, fall back to the v1.6 metrics
# module. Graceful degrade guarantees the monolith always runs.
try:
    import metrics_v2 as _metrics_mod  # type: ignore  # v2.0 (post-editese 14 metric)
except Exception:  # pragma: no cover
    try:
        import metrics as _metrics_mod  # type: ignore  # v1.6 fallback
    except Exception:
        _metrics_mod = None


# ---------------------------------------------------------------------------
# Run directory discovery / creation
# ---------------------------------------------------------------------------


def _next_run_dir(workspace: Path) -> Path:
    """Allocate _workspace/<today>-NNN/ with the smallest free NNN."""
    workspace.mkdir(parents=True, exist_ok=True)
    today = date.today().isoformat()
    n = 1
    while True:
        candidate = workspace / f"{today}-{n:03d}"
        if not candidate.exists():
            return candidate
        n += 1


def _resolve_run_dir(run_dir_arg: str | None, text_arg: str | None) -> Path:
    if run_dir_arg:
        rd = Path(run_dir_arg)
        if not rd.is_absolute():
            rd = PROJECT_ROOT / rd
        rd.mkdir(parents=True, exist_ok=True)
        return rd
    if text_arg is None:
        raise SystemExit("Either --run-dir or --text is required")
    workspace = PROJECT_ROOT / "_workspace"
    rd = _next_run_dir(workspace)
    rd.mkdir(parents=True, exist_ok=True)
    return rd


# ---------------------------------------------------------------------------
# Combined-file rendering
# ---------------------------------------------------------------------------


def _fmt_z(z: float | None) -> str:
    if z is None:
        return "n/a"
    sign = "+" if z >= 0 else ""
    return f"z={sign}{z:.2f}"


def _z_marker(z: float | None) -> str:
    """Emit a small star for values clearly above the AI band."""
    if z is None:
        return ""
    if z >= 1.5:
        return " ★ S1 트리거"
    if z >= 1.0:
        return " · S2 시그널"
    return ""


def _render_block(metrics_obj: dict) -> str:
    m = metrics_obj.get("metrics", {})
    z = metrics_obj.get("z_scores", {})
    ev = metrics_obj.get("evidence", {})
    pivots = ev.get("conclusion_pivots") or []
    safe = ev.get("safe_balances") or []

    lines: list[str] = []
    lines.append("[정량 사전 점수 v1.6 / KatFish baseline]")
    lines.append(
        f"risk_band: {metrics_obj.get('risk_band', 'unknown')} "
        f"(score {metrics_obj.get('risk_score', 0)})"
    )
    lines.append(f"genre: {metrics_obj.get('genre', 'essay')}")
    lines.append(f"char_count: {metrics_obj.get('char_count', 0)}")
    if metrics_obj.get("warning"):
        lines.append(f"warning: {metrics_obj['warning']}")
    lines.append("")
    lines.append("[지표]")

    def row(key: str, value_fmt: str, with_z: bool = True, suffix: str = "") -> str:
        val = m.get(key)
        if val is None:
            return f"- {key}: n/a"
        z_part = ""
        if with_z:
            z_part = f" ({_fmt_z(z.get(key))} vs {metrics_obj.get('genre','essay')} 인간 baseline){_z_marker(z.get(key))}"
        return f"- {key}: {value_fmt.format(val)}{z_part}{suffix}"

    lines.append(row("comma_inclusion_rate", "{:.2f}"))
    lines.append(row("comma_usage_rate", "{:.2f}"))
    lines.append(row("ending_comma_rate", "{:.2f}"))
    lines.append(row("comma_segment_length", "{:.2f}"))

    pivot_suffix = f" (lexicon 매치: {', '.join(repr(p) for p in pivots)})" if pivots else ""
    lines.append(
        f"- conclusion_pivot_count: {int(m.get('conclusion_pivot_count', 0))}{pivot_suffix}"
    )
    safe_suffix = f" (lexicon 매치: {', '.join(repr(s) for s in safe)})" if safe else ""
    lines.append(
        f"- safe_balance_count: {int(m.get('safe_balance_count', 0))}{safe_suffix}"
    )
    lines.append(row("hanja_nominalizer_density", "{:.3f}"))
    lines.append(row("lexical_diversity", "{:.2f}"))
    lines.append("")
    lines.append("[근거 사용 가이드]")
    lines.append("- 위 점수는 *근거 보조*다. 단독 판정 금지(보고서 명시).")
    lines.append("- z>1.0 지표는 quick-rules.md S1·S2 패턴과 교차 확인 후 윤문할 것.")
    lines.append("- ending_comma_rate가 ★ S1 트리거인 경우 C-11(연결어미 뒤 쉼표) 우선 손질.")
    lines.append("- conclusion_pivot 매치 토큰은 D-1·H-1 처방 적용 대상.")
    lines.append("")
    return "\n".join(lines)


def _render_combined(text: str, metrics_obj: dict | None) -> str:
    parts: list[str] = []
    if metrics_obj is not None:
        parts.append(_render_block(metrics_obj))
    parts.append("[원문 시작]")
    parts.append(text.rstrip("\n"))
    parts.append("[원문 끝]")
    parts.append("")
    return "\n".join(parts)


# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------


def main(argv: list[str] | None = None) -> int:
    p = argparse.ArgumentParser(description="Humanize KR v1.6 monolith input shim")
    p.add_argument("--run-dir", help="Existing run directory (relative ok)")
    p.add_argument("--text", help="Inline text input (creates new run dir)")
    p.add_argument("--genre", default="essay", help="Genre hint (default: essay)")
    p.add_argument(
        "--baseline",
        default=None,
        help="Override baseline JSON path (default: project default)",
    )
    args = p.parse_args(argv)

    run_dir = _resolve_run_dir(args.run_dir, args.text)
    input_path = run_dir / "01_input.txt"

    # Ensure 01_input.txt exists.
    if args.text is not None:
        input_path.write_text(args.text, encoding="utf-8")
    if not input_path.exists():
        raise SystemExit(f"01_input.txt not found in {run_dir}; pass --text to create")

    text = input_path.read_text(encoding="utf-8")

    metrics_obj: dict | None = None
    metrics_path = run_dir / "00_metrics.json"
    error_path = run_dir / "00_metrics.error"

    if _metrics_mod is None:
        error_path.write_text(
            "metrics module import failed; combined file emitted without score block",
            encoding="utf-8",
        )
    else:
        try:
            metrics_obj = _metrics_mod.compute_all(
                text, genre=args.genre, baseline_path=args.baseline
            )
            metrics_path.write_text(
                json.dumps(metrics_obj, ensure_ascii=False, indent=2),
                encoding="utf-8",
            )
            # On success any stale error file is cleared.
            if error_path.exists():
                try:
                    error_path.unlink()
                except OSError:
                    pass
        except Exception as exc:  # noqa: BLE001 - graceful degrade is the point.
            metrics_obj = None
            error_path.write_text(
                f"metrics_failed: {type(exc).__name__}: {exc}\n\n"
                + traceback.format_exc(),
                encoding="utf-8",
            )

    combined_path = run_dir / "01_input_with_metrics.txt"
    combined_path.write_text(_render_combined(text, metrics_obj), encoding="utf-8")

    rb = (metrics_obj or {}).get("risk_band", "absent")
    rs = (metrics_obj or {}).get("risk_score", "absent")
    print(
        f"run_dir={run_dir}\n"
        f"combined={combined_path}\n"
        f"risk_band={rb} risk_score={rs}\n"
        f"degraded={metrics_obj is None}"
    )
    return 0


if __name__ == "__main__":
    sys.exit(main())
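
The degrade path of the shim above can be illustrated in isolation. The following is a minimal sketch, not the project's code: `render_combined` and `z_marker` are hypothetical stand-ins that mirror `_render_combined` and the thresholds in `_z_marker`. The shortened marker labels (`S1`, `S2`), the sample body, and the one-line score block are assumptions made for illustration only.

```python
from __future__ import annotations


def z_marker(z: float | None) -> str:
    """Mirror the shim's thresholds: >= 1.5 is an S1 trigger, >= 1.0 an S2 signal."""
    if z is None:
        return ""
    if z >= 1.5:
        return "S1"  # assumed short label; the shim emits a longer Korean marker
    if z >= 1.0:
        return "S2"
    return ""


def render_combined(text: str, block: str | None) -> str:
    """Wrap the untouched body in markers; prepend the score block only if present."""
    parts: list[str] = []
    if block is not None:
        parts.append(block)
    parts.append("[원문 시작]")
    parts.append(text.rstrip("\n"))
    parts.append("[원문 끝]")
    parts.append("")
    return "\n".join(parts)


# Degraded run: metrics unavailable, so only the bracketed body is emitted.
degraded = render_combined("본문입니다.\n", None)

# Normal run: a (made-up) score block precedes the same untouched body.
scored = render_combined("본문입니다.\n", "[지표]\n- ending_comma_rate: 0.42")

print(degraded)
print(scored)
```

When the metrics module fails to import or `compute_all` raises, the shim passes `None` for the metrics object, so the monolith reads only the bracketed body, as in the first `print` here. The body text is identical in both outputs, which is the "never modify the original text body" rule from the docstring.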