Keep dev aligned with main after the Naver blog skill landed

Merged origin/main into dev so the branch regains direct ancestry with the
mainline release history while preserving newer dev-side docs, proxy, and
helper updates. Conflict resolution kept the richer dev documentation and
proxy behavior, took the mainline Naver blog skill files as the canonical
source, and preserved the released package-version/changelog updates from
main where they were the authoritative record of published state.

Constraint: dev already contained newer docs/proxy work that would have regressed if main-side older variants won
Constraint: main carried release metadata that should remain authoritative for published package versions
Rejected: Cherry-pick main-only commits again | fixes content but does not restore formal dev<-main merge ancestry
Rejected: Prefer main on all conflicts | would drop newer dev-only skill and proxy guidance
Confidence: high
Scope-risk: moderate
Reversibility: clean
Directive: On future dev<-main syncs, preserve newer dev docs/proxy behavior first, then re-run full workspace CI before concluding
Tested: npm run ci; inspected README/proxy/setup/package metadata after conflict resolution
Not-tested: push to origin/dev; GitHub mergeability after remote update
Jeffrey (Dongkyu) Kim 2026-04-13 16:18:19 +09:00
commit 7b82d4d409
20 changed files with 1074 additions and 7 deletions


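@@ -57,6 +57,7 @@ Claude Code, Codex, OpenCode, OpenClaw/ClawHub and other coding agents
| Bunjang search | Bunjang search, detail lookup, optional favorites/chat, AI TOON export | Not required | [Bunjang search guide](docs/features/bunjang-search.md) |
| Used car price lookup | Compare used-car buyout prices and monthly rental fees | Not required | [Used car price lookup guide](docs/features/used-car-price-search.md) |
| Korean spell check | Check spelling and grammar of Korean text and summarize suggested corrections | Not required | [Korean spell check guide](docs/features/korean-spell-check.md) |
| Naver blog research | Naver blog search, full-post reading, image download, cross-checking of Korean content | Not required | [Naver blog research guide](docs/features/naver-blog-research.md) |
| Korean character count | Deterministically count characters, lines, and UTF-8/NEIS bytes of Korean text | Not required | [Korean character count guide](docs/features/korean-character-count.md) |
> ## ⚠️ Nearby Blue Ribbon restaurants skill: support discontinued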
@@ -126,6 +127,7 @@ Claude Code, Codex, OpenCode, OpenClaw/ClawHub and other coding agents
- [Bunjang search guide](docs/features/bunjang-search.md)
- [Used car price lookup guide](docs/features/used-car-price-search.md)
- [Korean spell check guide](docs/features/korean-spell-check.md)
- [Naver blog research guide](docs/features/naver-blog-research.md)
- [Korean character count guide](docs/features/korean-character-count.md)
- [Release/deployment guide](docs/releasing.md)


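@@ -0,0 +1,63 @@
# Naver Blog Research Guide

## What you can do with this feature

- Search Naver blogs by keyword (sorted by relevance or by date)
- Extract the full text of a blog post
- Extract image URLs from a blog post and download the images locally
- Cross-check Korean-language content against Google search results

## Requirements

- `python3` 3.8+
- Internet connection
- No API key required

## Inputs

- Search: a query string (e.g. `"서울 맛집 추천"`)
- Read post: a Naver blog post URL (PC or mobile)
- Download images: a list of image URLs, or piped `naver_read.py` output

## Official surfaces

- Search: `https://search.naver.com/search.naver?where=blog&query={query}`
- Blog post (mobile): `https://m.blog.naver.com/{userId}/{postId}`
- Image CDN: `blogfiles.naver.net`, `postfiles.pstatic.net`

## Basic flow

1. Run a Naver blog search with `naver_search.py`
2. Pick the top 3 to 5 posts from the search results
3. Read the full text of the selected posts with `naver_read.py`
4. If needed, save images locally with `naver_download_images.py`
5. Cross-check against Google search (WebSearch) results to establish how reliable the information is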

## Examples

Search blogs:

```bash
python3 scripts/naver_search.py "제주도 여행 코스" --count 5 --sort sim
```

Read a blog post:

```bash
python3 scripts/naver_read.py "https://blog.naver.com/user123/224212849946"
```

Download images:

```bash
python3 scripts/naver_read.py "https://blog.naver.com/user123/224212849946" \
  | python3 scripts/naver_download_images.py --output ./images/ --max 5
```
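
## Notes

- Requests go directly to Naver's search engine, so heavy automation can get your IP blocked. Avoid excessive requests in one session.
- This skill is designed for low-volume, non-commercial content research.
- Parsing can fail when Naver changes its HTML structure. If errors occur, the scripts need to be updated.
- The PC version (`blog.naver.com`) wraps posts in an iframe, so the scripts use the mobile version (`m.blog.naver.com`).
- Always give the user the blog source (URL, author) along with the content.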
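
The first note asks callers to keep request volume low. One way to enforce that in a calling script is sketched below. `Throttle` is a hypothetical helper, not part of the skill's scripts, and the 0.5-second gap and 30-request cap are illustrative assumptions.

```python
import time


class Throttle:
    """Hypothetical helper: minimum gap between requests plus a per-session cap."""

    def __init__(self, min_interval=0.5, max_requests=30,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.max_requests = max_requests
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any
        self.count = 0

    def wait(self):
        """Block until the next request is allowed, or raise once the budget is spent."""
        if self.count >= self.max_requests:
            raise RuntimeError("session request budget exhausted")
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
        self.count += 1
```

Calling `wait()` before each script invocation spaces the requests out and stops the session once the cap is reached.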

naver-blog-research/.gitignore

@@ -0,0 +1,3 @@
__pycache__/
*.pyc
naver-images/


@@ -0,0 +1,138 @@
---
name: naver-blog-research
description: Search Naver blogs, read full post content, and download images using only python3 stdlib — no API key required.
license: MIT
metadata:
  category: research
  locale: ko-KR
  phase: v1
---
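
# Naver Blog Research

## What this skill does

Searches Naver blogs, reads the full text of individual posts, and downloads images locally.

- Runs on the `python3` standard library alone, with no API key.
- Prints search results as structured JSON.
- Uses the mobile version (`m.blog.naver.com`) to extract the post body directly, without the iframe.
- Downloads images from the blog image CDNs (`blogfiles.naver.net`, `postfiles.pstatic.net`).

## When to use

- "네이버 블로그에서 결혼식 체크리스트 검색해줘" ("Search Naver blogs for a wedding checklist")
- "네이버 블로그 리서치 해줘" ("Do some Naver blog research")
- "한국 블로그에서 관련 정보 조사해줘" ("Look up related information on Korean blogs")
- "네이버 블로그 글 읽어줘" ("Read this Naver blog post")
- "이 네이버 블로그 포스트에서 이미지 다운로드해줘" ("Download the images from this Naver blog post")
- Korean-language content research that needs Naver blog sources in addition to Google

## When not to use

- Searching Naver services other than blogs, such as Naver News, Cafe, or Knowledge iN
- Bulk crawling or scraping (dozens of requests or more in one session)
- Commercial data collection

## Prerequisites

- Internet connection
- `python3` 3.8+
- The helper scripts bundled in this skill directory's `scripts/` folder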

## Workflow

### 1. Search Naver blogs

```bash
python3 scripts/naver_search.py "search terms" --count 10 --sort sim
```

| Argument | Required | Description | Default |
|------|------|------|--------|
| query | Yes | Search query | - |
| --count | No | Number of results (max 30) | 10 |
| --sort | No | sim (relevance), date (newest) | sim |
| --timeout | No | Request timeout (seconds) | 15 |

Example output:

```json
{
  "query": "결혼식 체크리스트",
  "total_results": 7,
  "results": [
    {
      "title": "결혼식 체크리스트 총정리",
      "url": "https://blog.naver.com/user123/224212849946",
      "mobile_url": "https://m.blog.naver.com/user123/224212849946",
      "snippet": "결혼식 1주일 전에 반드시 확인해야 할...",
      "author": "user123"
    }
  ]
}
```
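
Because the output is plain JSON, a caller can turn it into source citations directly. The sketch below assumes only the keys shown in the example output; `format_sources` is a hypothetical helper, not part of the skill.

```python
import json


def format_sources(search_json):
    """Turn naver_search.py output into one citation line per result."""
    payload = json.loads(search_json)
    lines = []
    for item in payload.get("results", []):
        title = item.get("title") or "(untitled)"
        lines.append(f'{title} (by {item["author"]}): {item["url"]}')
    return lines
```

Each line carries the author and URL, the two pieces of source information this skill asks the agent to pass on to the user.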

### 2. Read the full post

Pick the URL of a post of interest from the search results and read its full text.

```bash
python3 scripts/naver_read.py "https://blog.naver.com/user123/224212849946"
```

| Argument | Required | Description | Default |
|------|------|------|--------|
| url | Yes | Blog post URL (PC or mobile) | - |
| --no-images | No | Exclude image URLs | false |
| --max-length | No | Maximum body length in characters (0 = unlimited) | 0 |
| --timeout | No | Request timeout (seconds) | 20 |

A PC URL is converted to the mobile URL automatically before the request is made.
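
The PC-to-mobile conversion can be pictured as a small URL rewrite. This is a simplified sketch, not the script's own code: the function name and regex are assumptions, and it covers only the `/{userId}/{postId}` URL form.

```python
import re

# Matches both blog.naver.com and m.blog.naver.com post URLs.
_POST_RE = re.compile(r"(?:m\.)?blog\.naver\.com/([A-Za-z0-9_]+)/(\d+)")


def to_mobile_url_sketch(url):
    """Rewrite a Naver blog post URL to its https mobile form; leave other URLs as-is."""
    url = url.strip()
    match = _POST_RE.search(url)
    if not match:
        return url
    user_id, post_id = match.groups()
    return f"https://m.blog.naver.com/{user_id}/{post_id}"
```

Normalizing to one canonical form also means an `http://` or already-mobile URL ends up at the same address.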

### 3. Download images (if needed)

```bash
python3 scripts/naver_download_images.py --urls "url1,url2,url3" --output ./images/
```

Or pipe the output of `naver_read.py`:

```bash
python3 scripts/naver_read.py "https://..." | python3 scripts/naver_download_images.py --output ./images/
```

| Argument | Required | Description | Default |
|------|------|------|--------|
| --urls | No | Comma-separated image URLs | - |
| --output | No | Output directory | ./naver-images/ |
| --max | No | Maximum number of downloads | 10 |
| --timeout | No | Request timeout (seconds) | 15 |
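
In pipe mode the downloader takes its URL list from the `images` array of the `naver_read.py` JSON and caps it at `--max`. The sketch below is a simplified illustration of that selection step; `urls_from_read_output` is a hypothetical name, not the script's own function.

```python
from __future__ import annotations

import json


def urls_from_read_output(text: str, max_count: int = 10) -> list[str]:
    """Pick image URLs out of naver_read.py JSON, keeping at most max_count."""
    data = json.loads(text)
    images = data.get("images", []) if isinstance(data, dict) else []
    urls = [img["url"] for img in images if isinstance(img, dict) and img.get("url")]
    return urls[: max(1, max_count)]
```

If `naver_read.py` was run with `--no-images`, the `images` key is absent and the list comes back empty.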
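
### Recommended workflow

1. Search with `naver_search.py`, then review the top 3 to 5 results
2. Read the most relevant posts in full with `naver_read.py`
3. If needed, save images with `naver_download_images.py`
4. Cross-check against WebSearch (Google) results to raise confidence in the information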
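
Steps 1 and 2 of this workflow can be chained by turning the search JSON into the follow-up read commands. This is a minimal sketch: `plan_read_commands` and its `top_n` default of 3 are illustrative assumptions, and nothing here performs a network request.

```python
from __future__ import annotations

import json
import shlex


def plan_read_commands(search_json: str, top_n: int = 3) -> list[str]:
    """Build one naver_read.py command line per top search result."""
    results = json.loads(search_json).get("results", [])
    return [
        shlex.join(["python3", "scripts/naver_read.py", item["url"]])
        for item in results[:top_n]
    ]
```

`shlex.join` (Python 3.8+) quotes any argument that needs it, so the returned strings can be pasted into a shell.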
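
## Response policy

- Summarize search results and post bodies for the user.
- Always include the blog source (URL, author).
- Avoid excessive requests (dozens or more) in one session.
- When downloading images, tell the user where they were saved.

## Done when

- Search results are printed as valid JSON.
- The post's full text is extracted.
- The needed images are saved locally.
- Sources are cited.

## Notes

- Requests go directly to Naver's search engine, so heavy or automated use can get your IP blocked.
- This skill is designed for low-volume, non-commercial content research.
- Naver's HTML structure can change. If parsing fails, check the error message; the scripts may need to be updated.
- The PC version (`blog.naver.com`) wraps posts in an iframe, so the scripts use the mobile version (`m.blog.naver.com`).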


@@ -0,0 +1,58 @@
"""Shared HTTP utilities for Naver blog scripts (SSL handling, URL validation, urlopen wrapper)."""

from __future__ import annotations

import re
import ssl
import sys
import urllib.error
import urllib.parse
import urllib.request

TAG_RE = re.compile(r"<[^>]+>")

_ssl_ctx_secure: ssl.SSLContext | None = None
_ssl_ctx_insecure: ssl.SSLContext | None = None


def _get_ssl_context(*, insecure: bool = False) -> ssl.SSLContext:
    global _ssl_ctx_secure, _ssl_ctx_insecure
    if insecure:
        if _ssl_ctx_insecure is None:
            ctx = ssl.create_default_context()
            ctx.check_hostname = False
            ctx.verify_mode = ssl.CERT_NONE
            _ssl_ctx_insecure = ctx
        return _ssl_ctx_insecure
    if _ssl_ctx_secure is None:
        _ssl_ctx_secure = ssl.create_default_context()
    return _ssl_ctx_secure


_NAVER_DOMAINS = (".naver.com", ".naver.net", ".pstatic.net")


def is_naver_url(url: str) -> bool:
    host = urllib.parse.urlparse(url).hostname or ""
    return any(host == d.lstrip(".") or host.endswith(d) for d in _NAVER_DOMAINS)


def urlopen(request: urllib.request.Request, timeout: int, *, insecure: bool = False):
    """urlopen with explicit SSL insecure mode for Naver domains.

    When *insecure* is True and the target is a Naver domain, SSL certificate
    verification is skipped. A warning is printed to stderr on every call so
    the caller is always aware.
    """
    if insecure:
        if not is_naver_url(request.full_url):
            raise ValueError("insecure 모드는 네이버 도메인에만 사용할 수 있습니다.")
        print(
            "[warn] SSL 인증서 검증이 비활성화되었습니다. 연결이 안전하지 않을 수 있습니다.",
            file=sys.stderr,
        )
        return urllib.request.urlopen(
            request, timeout=timeout, context=_get_ssl_context(insecure=True),
        )
    return urllib.request.urlopen(request, timeout=timeout, context=_get_ssl_context())
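
The leading dot in each `_NAVER_DOMAINS` entry is what keeps look-alike hosts out: a bare `endswith("naver.com")` would also accept `evilnaver.com`. The following is a standalone sketch of the same suffix check. `ALLOWED_SUFFIXES` and `is_allowed_host` are illustrative names, and the non-Naver hosts are made-up test inputs.

```python
from urllib.parse import urlparse

ALLOWED_SUFFIXES = (".naver.com", ".naver.net", ".pstatic.net")


def is_allowed_host(url):
    """Accept the apex domain itself or any subdomain of an allowed suffix."""
    host = urlparse(url).hostname or ""
    return any(host == s.lstrip(".") or host.endswith(s) for s in ALLOWED_SUFFIXES)


samples = [
    "https://m.blog.naver.com/user123/1",   # subdomain: allowed
    "https://naver.com/",                   # apex: allowed
    "https://evilnaver.com/x",              # look-alike: rejected
    "https://naver.com.evil.example/x",     # suffix spoof: rejected
]
verdicts = [is_allowed_host(u) for u in samples]
```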


@@ -0,0 +1,233 @@
from __future__ import annotations

import argparse
import json
import os
import sys
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from _naver_http import is_naver_url, urlopen

DEFAULT_OUTPUT_DIR = "./naver-images"
DEFAULT_MAX = 10
DEFAULT_TIMEOUT = 15

DEFAULT_HEADERS = {
    "Accept": "image/webp,image/apng,image/*,*/*;q=0.8",
    "Accept-Language": "ko,en-US;q=0.9,en;q=0.8",
    "Referer": "https://m.blog.naver.com/",
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36"
    ),
}

CONTENT_TYPE_TO_EXT = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/gif": ".gif",
    "image/webp": ".webp",
    "image/bmp": ".bmp",
    "image/svg+xml": ".svg",
}

_MAGIC_BYTES = (
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"GIF87a", ".gif"),
    (b"GIF89a", ".gif"),
    (b"RIFF", ".webp"),  # WebP: RIFF....WEBP (check first 4 bytes)
    (b"BM", ".bmp"),
)


def guess_extension(url: str, content_type: str | None = None, data: bytes | None = None) -> str:
    if content_type:
        ct = content_type.split(";")[0].strip().lower()
        if ct in CONTENT_TYPE_TO_EXT:
            return CONTENT_TYPE_TO_EXT[ct]
    lower_url = url.lower().split("?")[0]
    for ext in (".jpg", ".jpeg", ".png", ".gif", ".webp", ".bmp", ".svg"):
        if lower_url.endswith(ext):
            return ".jpg" if ext == ".jpeg" else ext
    if data:
        for magic, ext in _MAGIC_BYTES:
            if data[:len(magic)] == magic:
                if ext == ".webp" and data[8:12] != b"WEBP":
                    continue
                return ext
        if data[:2] in (b"\xff\xd8",):
            return ".jpg"
    return ".jpg"


def download_image(url: str, output_path: str, output_dir: str, timeout: int = DEFAULT_TIMEOUT, *, insecure: bool = False) -> dict:
    """Download a single image from a Naver CDN URL.

    *output_dir* is used solely for path-traversal protection: the resolved
    *output_path* must reside inside *output_dir*.
    """
    if not is_naver_url(url):
        return {"url": url, "error": "Not a Naver CDN URL. Skipped."}
    real_dir = os.path.realpath(output_dir)
    if not os.path.realpath(output_path).startswith(real_dir + os.sep):
        return {"url": url, "error": "Output path escapes target directory. Skipped."}
    request = urllib.request.Request(url, headers=DEFAULT_HEADERS)
    try:
        with urlopen(request, timeout, insecure=insecure) as response:
            data = response.read()
            content_type = response.headers.get("Content-Type", "")
    except (urllib.error.HTTPError, urllib.error.URLError, OSError) as error:
        return {"url": url, "error": str(error)}
    ext = guess_extension(url, content_type, data)
    if not os.path.splitext(output_path)[1]:
        output_path += ext
    os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
    with open(output_path, "wb") as f:
        f.write(data)
    size_kb = round(len(data) / 1024, 1)
    return {"url": url, "path": output_path, "size_kb": size_kb}


def download_images(
    urls: list[str],
    output_dir: str = DEFAULT_OUTPUT_DIR,
    max_count: int = DEFAULT_MAX,
    timeout: int = DEFAULT_TIMEOUT,
    *,
    insecure: bool = False,
) -> dict:
    os.makedirs(output_dir, exist_ok=True)
    max_count = max(1, max_count)
    targets = urls[:max_count]
    downloaded: list[dict] = []
    failed: list[dict] = []
    # Map index -> result so the original order can be restored afterwards
    results_by_index: dict[int, dict] = {}
    with ThreadPoolExecutor(max_workers=min(4, max(1, len(targets)))) as executor:
        future_to_index = {}
        for i, url in enumerate(targets, start=1):
            filename = f"{i:03d}"
            output_path = os.path.join(output_dir, filename)
            future = executor.submit(download_image, url, output_path, output_dir, timeout, insecure=insecure)
            future_to_index[future] = i
        for future in as_completed(future_to_index):
            idx = future_to_index[future]
            try:
                results_by_index[idx] = future.result()
            except Exception as exc:
                results_by_index[idx] = {"url": targets[idx - 1], "error": str(exc)}
    # Sort back into the original order
    for idx in sorted(results_by_index):
        result = results_by_index[idx]
        if "error" in result:
            failed.append(result)
        else:
            downloaded.append(result)
    return {
        "downloaded": len(downloaded),
        "files": downloaded,
        "failed": failed,
    }


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Download images from Naver blog CDN URLs."
    )
    parser.add_argument(
        "--urls", type=str, default="",
        help="Comma-separated image URLs.",
    )
    parser.add_argument(
        "--output", type=str, default=DEFAULT_OUTPUT_DIR,
        help=f"Output directory. Default: {DEFAULT_OUTPUT_DIR}",
    )
    parser.add_argument(
        "--max", type=int, default=DEFAULT_MAX,
        help=f"Maximum number of images to download. Default: {DEFAULT_MAX}",
    )
    parser.add_argument(
        "--timeout", type=int, default=DEFAULT_TIMEOUT,
        help=f"HTTP request timeout in seconds. Default: {DEFAULT_TIMEOUT}",
    )
    parser.add_argument(
        "--insecure", action="store_true",
        help="Skip SSL certificate verification (use only when certificate errors occur).",
    )
    return parser.parse_args(argv)


def read_urls_from_stdin() -> list[str]:
    try:
        data = json.load(sys.stdin)
        if isinstance(data, dict) and "images" in data:
            return [img["url"] for img in data["images"] if isinstance(img, dict) and img.get("url")]
        if isinstance(data, list):
            return [
                u for item in data
                if (u := (item if isinstance(item, str) else item.get("url", "")))
            ]
        if isinstance(data, dict):
            print(
                "[warn] stdin JSON에 'images' 키가 없습니다. "
                "naver_read.py 실행 시 --no-images 플래그를 사용하지 않았는지 확인하세요.",
                file=sys.stderr,
            )
    # AttributeError covers list items that are neither str nor dict
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError) as exc:
        print(f"[warn] stdin JSON 파싱 실패: {exc}", file=sys.stderr)
        return []
    return []


def main(argv: list[str] | None = None) -> int:
    # Fall back to sys.argv only when argv is None, so main([]) stays an empty argument list
    args = parse_args(sys.argv[1:] if argv is None else argv)
    urls: list[str] = []
    if args.urls:
        urls = [u.strip() for u in args.urls.split(",") if u.strip()]
    if not urls and not sys.stdin.isatty():
        urls = read_urls_from_stdin()
    if not urls:
        print(
            json.dumps({"error": "No image URLs provided. Use --urls or pipe naver_read.py output via stdin."}, ensure_ascii=False),
            file=sys.stderr,
        )
        return 1
    result = download_images(
        urls,
        output_dir=args.output,
        max_count=args.max,
        timeout=args.timeout,
        insecure=args.insecure,
    )
    print(json.dumps(result, ensure_ascii=False, indent=2))
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
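
When neither the `Content-Type` header nor the URL suffix settles the file type, `guess_extension` falls back to the leading bytes of the payload. That fallback is sketched in isolation below. `sniff_image_ext` is a hypothetical name, and unlike the script it returns `None` rather than defaulting to `.jpg`.

```python
from typing import Optional


def sniff_image_ext(data: bytes) -> Optional[str]:
    """Guess an image extension from well-known magic bytes, or None if unknown."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return ".png"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return ".gif"
    # RIFF is a generic container (also WAV/AVI), so confirm the WEBP tag at offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return ".webp"
    if data[:2] == b"BM":
        return ".bmp"
    if data[:2] == b"\xff\xd8":
        return ".jpg"
    return None
```

The extra check at offset 8 matters because a `RIFF` prefix alone would also match audio and video containers.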


@@ -0,0 +1,256 @@
from __future__ import annotations

import argparse
import json
import os
import re
import sys
import urllib.error
import urllib.request
from html import unescape

sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from _naver_http import TAG_RE, is_naver_url, urlopen

MOBILE_UA = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Mobile/15E148 Safari/604.1"
)

DEFAULT_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "ko,en-US;q=0.9,en;q=0.8",
    "User-Agent": MOBILE_UA,
}

BR_RE = re.compile(r"<br\s*/?>", re.IGNORECASE)
BLOCK_END_RE = re.compile(r"</(p|div|li)>", re.IGNORECASE)
WHITESPACE_RE = re.compile(r"[ \t]+")
BLANK_LINES_RE = re.compile(r"\n{3,}")

_IMG_CDN_HOSTS = r"(?:blogfiles\.naver\.net|postfiles\.pstatic\.net|mblogthumb-phinf\.pstatic\.net)"
IMAGE_LAZY_PATTERN = re.compile(
    rf'data-lazy-src="(https?://{_IMG_CDN_HOSTS}[^"]+)"'
)
IMAGE_SRC_PATTERN = re.compile(
    rf'src="(https?://{_IMG_CDN_HOSTS}[^"]+)"'
)
IMAGE_ALT_PATTERN = re.compile(
    r'alt="([^"]*)"'
)
TITLE_PATTERN = re.compile(
    r'<title[^>]*>(.*?)</title>', re.DOTALL | re.IGNORECASE
)
SCRIPT_STYLE_RE = re.compile(r"<(script|style|noscript)[^>]*>.*?</\1>", re.DOTALL | re.IGNORECASE)

PC_BLOG_RE = re.compile(r"^https?://blog\.naver\.com/")
BLOG_ID_RE = re.compile(r"blog\.naver\.com/([a-zA-Z0-9_]+)/(\d+)")


def to_mobile_url(url: str) -> str:
    url = url.strip()
    url = PC_BLOG_RE.sub("https://m.blog.naver.com/", url)
    if not url.startswith("https://m.blog.naver.com/"):
        match = BLOG_ID_RE.search(url)
        if match:
            url = f"https://m.blog.naver.com/{match.group(1)}/{match.group(2)}"
    return url


def fetch_blog_page(url: str, timeout: int = 20, *, insecure: bool = False) -> str:
    mobile_url = to_mobile_url(url)
    if not is_naver_url(mobile_url):
        raise ValueError(f"Not a Naver blog URL: {url}")
    request = urllib.request.Request(mobile_url, headers=DEFAULT_HEADERS)
    try:
        with urlopen(request, timeout, insecure=insecure) as response:
            return response.read().decode("utf-8", "ignore")
    except urllib.error.HTTPError as error:
        raise RuntimeError(
            f"Naver blog returned HTTP {error.code} for {mobile_url}. "
            "The post may not exist or access may be restricted."
        ) from error


def extract_title(html: str) -> str:
    match = TITLE_PATTERN.search(html)
    if not match:
        return ""
    title = unescape(TAG_RE.sub("", match.group(1))).strip()
    title = re.sub(r"\s*[-:|]?\s*네이버\s*블로그$", "", title).strip()
    return title


def _extract_div_block(html: str, start_pos: int) -> str:
    tag_start = html.rfind("<div", 0, start_pos)
    if tag_start < 0:
        tag_start = start_pos
    depth = 0
    pos = tag_start
    started = False
    length = len(html)
    while pos < length:
        # Skip HTML comments
        if html[pos : pos + 4] == "<!--":
            end = html.find("-->", pos + 4)
            pos = end + 3 if end >= 0 else length
            continue
        if html[pos : pos + 4] == "<div" and (pos + 4 >= length or html[pos + 4] in (" ", ">", "\t", "\n", "/")):
            depth += 1
            started = True
        elif html[pos : pos + 6] == "</div>":
            depth -= 1
            if started and depth == 0:
                return html[tag_start : pos + 6]
        pos += 1
    return html[tag_start:]


def extract_content_area(html: str) -> str:
    cleaned = SCRIPT_STYLE_RE.sub("", html)
    match = re.search(r'class="[^"]*\bse-main-container\b[^"]*"', cleaned)
    if match:
        return _extract_div_block(cleaned, match.start())
    for class_name in ("post_ct", "postViewArea", "post-view"):
        match = re.search(rf'class="[^"]*\b{re.escape(class_name)}\b[^"]*"', cleaned)
        if match:
            return _extract_div_block(cleaned, match.start())
    marker = cleaned.find('id="viewTypeSelector"')
    if marker >= 0:
        return _extract_div_block(cleaned, marker)
    return ""


def extract_text(html_fragment: str) -> str:
    text = BR_RE.sub("\n", html_fragment)
    text = BLOCK_END_RE.sub("\n", text)
    text = TAG_RE.sub("", text)
    text = unescape(text)
    lines = []
    for line in text.split("\n"):
        stripped = WHITESPACE_RE.sub(" ", line).strip()
        if stripped:
            lines.append(stripped)
    result = "\n".join(lines)
    result = BLANK_LINES_RE.sub("\n\n", result)
    return result.strip()


def extract_images(html_fragment: str) -> list[dict]:
    images: list[dict] = []
    seen_base: set[str] = set()
    img_tags = re.finditer(r"<img\s[^>]+>", html_fragment, re.IGNORECASE)
    for img_match in img_tags:
        img_tag = img_match.group(0)
        lazy_match = IMAGE_LAZY_PATTERN.search(img_tag)
        src_match = IMAGE_SRC_PATTERN.search(img_tag)
        url_match = lazy_match or src_match
        if not url_match:
            continue
        url = url_match.group(1)
        base_url = re.sub(r"\?type=.*$", "", url)
        if base_url in seen_base:
            continue
        seen_base.add(base_url)
        if "?type=" not in url:
            url = base_url
        elif "_blur" in url:
            url = re.sub(r"\?type=w\d+_blur", "?type=w800", url)
        alt_match = IMAGE_ALT_PATTERN.search(img_tag)
        alt = unescape(alt_match.group(1)).strip() if alt_match else ""
        images.append({"url": url, "alt": alt})
    return images


def read_blog(url: str, include_images: bool = True, max_length: int = 0, timeout: int = 20, *, insecure: bool = False) -> dict:
    html = fetch_blog_page(url, timeout=timeout, insecure=insecure)
    mobile_url = to_mobile_url(url)
    title = extract_title(html)
    content_area = extract_content_area(html)
    content = extract_text(content_area)
    if max_length > 0 and len(content) > max_length:
        content = content[:max_length] + "..."
    result: dict = {
        "url": mobile_url,
        "title": title,
        "content": content,
        "char_count": len(content),
    }
    if not content:
        result["warning"] = "본문 영역을 찾지 못했습니다. 네이버 HTML 구조가 변경되었을 수 있습니다."
    if include_images:
        result["images"] = extract_images(content_area)
    return result


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Read a Naver blog post and extract text content and images."
    )
    parser.add_argument("url", help="Naver blog post URL (PC or mobile).")
    parser.add_argument(
        "--no-images", action="store_true",
        help="Exclude image URLs from output.",
    )
    parser.add_argument(
        "--max-length", type=int, default=0,
        help="Maximum content length in characters (0 = unlimited). Default: 0.",
    )
    parser.add_argument(
        "--timeout", type=int, default=20,
        help="HTTP request timeout in seconds. Default: 20.",
    )
    parser.add_argument(
        "--insecure", action="store_true",
        help="Skip SSL certificate verification (use only when certificate errors occur).",
    )
    return parser.parse_args(argv)


def main(argv: list[str] | None = None) -> int:
    # Fall back to sys.argv only when argv is None, so main([]) stays an empty argument list
    args = parse_args(sys.argv[1:] if argv is None else argv)
    try:
        result = read_blog(
            args.url,
            include_images=not args.no_images,
            max_length=args.max_length,
            timeout=args.timeout,
            insecure=args.insecure,
        )
    except (RuntimeError, ValueError) as error:
        print(json.dumps({"error": str(error)}, ensure_ascii=False), file=sys.stderr)
        return 1
    print(json.dumps(result, ensure_ascii=False, indent=2))
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
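
The text-extraction step in `extract_text` is a short regex pipeline: turn `<br>` and block-closing tags into newlines, strip the remaining tags, decode entities, then drop empty lines. A condensed standalone sketch of those steps follows. `html_to_text` is an illustrative name, and the patterns mirror the ones defined in the script.

```python
import re
from html import unescape


def html_to_text(fragment):
    """Convert an HTML fragment to newline-separated plain text."""
    text = re.sub(r"<br\s*/?>", "\n", fragment, flags=re.IGNORECASE)
    text = re.sub(r"</(p|div|li)>", "\n", text, flags=re.IGNORECASE)
    text = re.sub(r"<[^>]+>", "", text)  # strip every remaining tag
    text = unescape(text)                # decode entities such as &amp;
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n"))
    return "\n".join(line for line in lines if line)
```

Entities are decoded only after the tags are stripped, so an escaped `&lt;b&gt;` in the post body survives as literal text instead of being removed as a tag.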


@@ -0,0 +1,192 @@
from __future__ import annotations

import argparse
import json
import os
import re
import sys
import time
import urllib.error  # used explicitly in fetch_search_page
import urllib.parse
import urllib.request
from html import unescape

sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from _naver_http import TAG_RE, urlopen

SEARCH_URL = "https://search.naver.com/search.naver"
DEFAULT_COUNT = 10
MAX_COUNT = 30
FIRST_PAGE_START = 1
RESULTS_PER_PAGE = 15

DEFAULT_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "ko,en-US;q=0.9,en;q=0.8",
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36"
    ),
}

BLOG_ANCHOR_PATTERN = re.compile(
    r'<a[^>]*href="(https?://blog\.naver\.com/([a-zA-Z0-9_]+)/(\d+))"[^>]*>(.*?)</a>',
    re.DOTALL,
)


def strip_html(text: str) -> str:
    return unescape(TAG_RE.sub("", text)).strip()


def build_search_params(query: str, start: int = FIRST_PAGE_START, sort: str = "sim") -> dict[str, str]:
    return {
        "query": query,
        "ssc": "tab.blog.all",
        "sm": "tab_jum" if start <= FIRST_PAGE_START else "tab_pge",
        "start": str(start),
        "nso": {"sim": "so:r,p:all,a:all", "date": "so:dd,p:all,a:all"}.get(sort, "so:r,p:all,a:all"),
    }


def fetch_search_page(query: str, start: int = 1, sort: str = "sim", timeout: int = 15, *, insecure: bool = False) -> str:
    params = build_search_params(query, start=start, sort=sort)
    url = f"{SEARCH_URL}?{urllib.parse.urlencode(params)}"
    request = urllib.request.Request(url, headers=DEFAULT_HEADERS)
    try:
        with urlopen(request, timeout, insecure=insecure) as response:
            return response.read().decode("utf-8", "ignore")
    except urllib.error.HTTPError as error:
        raise RuntimeError(
            f"Naver search returned HTTP {error.code}. "
            "The request may have been blocked. Retry later or reduce request volume."
        ) from error


def parse_search_results(html: str) -> list[dict]:
    results: list[dict] = []
    anchors = BLOG_ANCHOR_PATTERN.findall(html)
    pending: dict[str, dict] = {}
    for full_url, user_id, post_id, inner_html in anchors:
        if full_url not in pending:
            pending[full_url] = {
                "url": full_url,
                "mobile_url": f"https://m.blog.naver.com/{user_id}/{post_id}",
                "author": user_id,
                "title": "",
                "snippet": "",
            }
        text = strip_html(inner_html)
        if not text:
            continue
        entry = pending[full_url]
        if "headline1" in inner_html or "text-type-headline" in inner_html:
            if not entry["title"]:
                entry["title"] = text
        elif "body1" in inner_html or "text-type-body" in inner_html:
            if not entry["snippet"]:
                entry["snippet"] = text
        else:
            if not entry["title"]:
                entry["title"] = text
    for entry in pending.values():
        results.append(entry)
    return results


def search(query: str, count: int = DEFAULT_COUNT, sort: str = "sim", timeout: int = 15, *, insecure: bool = False) -> dict:
    count = max(1, min(count, MAX_COUNT))
    all_results: list[dict] = []
    seen_urls: set[str] = set()
    start = FIRST_PAGE_START
    # Naver may not return exactly RESULTS_PER_PAGE results per page, so allow a few spare pages
    max_pages = (count // RESULTS_PER_PAGE) + 3
    for page_num in range(max_pages):
        if len(all_results) >= count:
            break
        if page_num > 0:
            time.sleep(0.5)
        html = fetch_search_page(query, start=start, sort=sort, timeout=timeout, insecure=insecure)
        page_results = parse_search_results(html)[:RESULTS_PER_PAGE]
        if not page_results:
            if start == 1:
                print("[warn] 검색 결과 파싱 실패. 네이버 HTML 구조가 변경되었을 수 있습니다.", file=sys.stderr)
            break
        new_count = 0
        for result in page_results:
            if result["url"] not in seen_urls:
                seen_urls.add(result["url"])
                all_results.append(result)
                new_count += 1
                if len(all_results) >= count:
                    break
        if new_count == 0:
            break
        start += RESULTS_PER_PAGE
    return {
        "query": query,
        "total_results": len(all_results),
        "results": all_results,
    }


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Search Naver blogs and return structured JSON results."
    )
    parser.add_argument("query", help="Search query string.")
    parser.add_argument(
        "--count", type=int, default=DEFAULT_COUNT,
        help=f"Number of results to return (max {MAX_COUNT}, default {DEFAULT_COUNT}).",
    )
    parser.add_argument(
        "--sort", choices=["sim", "date"], default="sim",
        help="Sort order: sim (relevance) or date (newest first). Default: sim.",
    )
    parser.add_argument(
        "--timeout", type=int, default=15,
        help="HTTP request timeout in seconds. Default: 15.",
    )
    parser.add_argument(
        "--insecure", action="store_true",
        help="Skip SSL certificate verification (use only when certificate errors occur).",
    )
    return parser.parse_args(argv)


def main(argv: list[str] | None = None) -> int:
    # Fall back to sys.argv only when argv is None, so main([]) stays an empty argument list
    args = parse_args(sys.argv[1:] if argv is None else argv)
    try:
        result = search(
            args.query,
            count=args.count,
            sort=args.sort,
            timeout=args.timeout,
            insecure=args.insecure,
        )
    except RuntimeError as error:
        print(json.dumps({"error": str(error)}, ensure_ascii=False), file=sys.stderr)
        return 1
    print(json.dumps(result, ensure_ascii=False, indent=2))
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
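
To see what `build_search_params` contributes to a request, it helps to assemble the full URL and parse it back. This is a standalone sketch that copies the parameter values from the function in this file. `build_search_url` is an illustrative name, and the meaning of each Naver query parameter is inferred from the code rather than documented here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(query, start=1, sort="sim"):
    """Assemble a blog-tab search URL; page 2 starts at result 16."""
    params = {
        "query": query,
        "ssc": "tab.blog.all",
        "sm": "tab_jum" if start <= 1 else "tab_pge",
        "start": str(start),
        "nso": "so:dd,p:all,a:all" if sort == "date" else "so:r,p:all,a:all",
    }
    return "https://search.naver.com/search.naver?" + urlencode(params)


# Round-trip the second page of a date-sorted query.
decoded = parse_qs(urlsplit(build_search_url("서울 맛집", start=16, sort="date")).query)
```

`urlencode` percent-encodes the Korean query and the `:` and `,` characters in `nso`, and `parse_qs` restores them, which makes the round trip a convenient check on the parameters.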


@@ -9,9 +9,9 @@
],
"scripts": {
"build": "npm run build --workspaces --if-present",
"lint": "node --check scripts/skill-docs.test.js scripts/korean_character_count.js scripts/test_korean_character_count.js && python3 -m py_compile scripts/fine_dust.py scripts/test_fine_dust.py scripts/ktx_booking.py scripts/test_ktx_booking.py scripts/sillok_search.py scripts/test_sillok_search.py scripts/korean_spell_check.py scripts/test_korean_spell_check.py scripts/patent_search.py scripts/test_patent_search.py scripts/mfds_drug_safety.py scripts/test_mfds_drug_safety.py scripts/mfds_food_safety.py scripts/test_mfds_food_safety.py scripts/zipcode_search.py scripts/test_zipcode_search.py scripts/subway_lost_property.py scripts/test_subway_lost_property.py scripts/geeknews_search.py scripts/test_geeknews_search.py && npm run lint --workspaces --if-present && ./scripts/validate-skills.sh",
"lint": "node --check scripts/skill-docs.test.js scripts/korean_character_count.js scripts/test_korean_character_count.js && python3 -m py_compile scripts/fine_dust.py scripts/test_fine_dust.py scripts/ktx_booking.py scripts/test_ktx_booking.py scripts/sillok_search.py scripts/test_sillok_search.py scripts/korean_spell_check.py scripts/test_korean_spell_check.py scripts/patent_search.py scripts/test_patent_search.py scripts/mfds_drug_safety.py scripts/test_mfds_drug_safety.py scripts/mfds_food_safety.py scripts/test_mfds_food_safety.py scripts/zipcode_search.py scripts/test_zipcode_search.py scripts/subway_lost_property.py scripts/test_subway_lost_property.py scripts/geeknews_search.py scripts/test_geeknews_search.py scripts/test_naver_blog_search.py naver-blog-research/scripts/_naver_http.py naver-blog-research/scripts/naver_search.py naver-blog-research/scripts/naver_read.py naver-blog-research/scripts/naver_download_images.py && npm run lint --workspaces --if-present && ./scripts/validate-skills.sh",
"typecheck": "tsc --noEmit",
"test": "node --test scripts/skill-docs.test.js scripts/test_korean_character_count.js && PYTHONPATH=.:scripts python3 -m unittest scripts.test_fine_dust scripts.test_ktx_booking scripts.test_sillok_search scripts.test_korean_spell_check scripts.test_patent_search scripts.test_mfds_drug_safety scripts.test_mfds_food_safety scripts.test_zipcode_search scripts.test_subway_lost_property scripts.test_geeknews_search && npm run test --workspaces --if-present && ./scripts/validate-skills.sh",
"test": "node --test scripts/skill-docs.test.js scripts/test_korean_character_count.js && PYTHONPATH=.:scripts python3 -m unittest scripts.test_fine_dust scripts.test_ktx_booking scripts.test_sillok_search scripts.test_korean_spell_check scripts.test_patent_search scripts.test_mfds_drug_safety scripts.test_mfds_food_safety scripts.test_zipcode_search scripts.test_subway_lost_property scripts.test_geeknews_search scripts.test_naver_blog_search && npm run test --workspaces --if-present && ./scripts/validate-skills.sh",
"pack:dry-run": "npm pack --workspace k-lotto --dry-run && npm pack --workspace daiso-product-search --dry-run && npm pack --workspace market-kurly-search --dry-run && npm pack --workspace blue-ribbon-nearby --dry-run && npm pack --workspace kakao-bar-nearby --dry-run && npm pack --workspace cheap-gas-nearby --dry-run && npm pack --workspace kleague-results --dry-run && npm pack --workspace lck-analytics --dry-run && npm pack --workspace toss-securities --dry-run && npm pack --workspace hipass-receipt --dry-run && npm pack --workspace used-car-price-search --dry-run",
"ci": "npm run lint && npm run typecheck && npm run test && npm run pack:dry-run",
"version-packages": "changeset version",


@@ -1,5 +1,11 @@
# blue-ribbon-nearby
## 0.2.2
### Patch Changes
- 1be3f44: Handle Blue Ribbon `PREMIUM_REQUIRED` nearby responses with a domain error and document the current premium gate on live nearby results.
## 0.2.1
### Patch Changes


@@ -1,6 +1,6 @@
{
"name": "blue-ribbon-nearby",
"version": "0.2.1",
"version": "0.2.2",
"description": "Official Blue Ribbon Survey nearby restaurant client for asking a user's location and finding nearby ribbon picks",
"license": "MIT",
"main": "src/index.js",


@@ -1,5 +1,11 @@
# cheap-gas-nearby
## 0.3.0
### Minor Changes
- 1be3f44: Publish the first official Opinet-powered nearby cheapest gas station lookup package and skill docs.
## 0.2.0
### Minor Changes


@@ -1,6 +1,6 @@
{
"name": "cheap-gas-nearby",
"version": "0.2.0",
"version": "0.3.0",
"description": "Official Opinet based nearby cheapest gas station lookup for Korean location queries",
"license": "MIT",
"main": "src/index.js",


@@ -0,0 +1,7 @@
# hipass-receipt
## 0.2.0
### Minor Changes
- 1be3f44: Publish the first logged-in-session helper package and skill docs for Hi-Pass receipt workflows.


@@ -1,6 +1,6 @@
{
"name": "hipass-receipt",
"version": "0.1.0",
"version": "0.2.0",
"description": "Hi-Pass logged-in browser-session helper for usage-history and receipt workflows",
"license": "MIT",
"main": "src/index.js",


@@ -1,5 +1,11 @@
# lck-analytics
## 0.3.0
### Minor Changes
- 1be3f44: Add the first LCK analytics package and skill pack adapted from jerjangmin's original upstream implementation.
## 0.2.0
### Minor Changes


@@ -1,6 +1,6 @@
{
"name": "lck-analytics",
"version": "0.2.0",
"version": "0.3.0",
"description": "LCK match analytics and insights powered by Riot LoL Esports data",
"license": "MIT",
"main": "src/index.js",


@@ -1,5 +1,11 @@
# used-car-price-search
## 0.4.0
### Minor Changes
- 1be3f44: Publish the first reusable used-car-price-search package with the SK direct inventory parser and skill docs.
## 0.3.0
### Minor Changes


@@ -1,6 +1,6 @@
{
"name": "used-car-price-search",
"version": "0.3.0",
"version": "0.4.0",
"description": "SK렌터카 다이렉트 타고BUY 기반 중고차 가격 조회 client",
"license": "MIT",
"main": "src/index.js",


@@ -0,0 +1,91 @@
import importlib.util
import pathlib
import unittest
from unittest import mock

MODULE_PATH = pathlib.Path(__file__).resolve().parents[1] / "naver-blog-research" / "scripts" / "naver_search.py"
MODULE_SPEC = importlib.util.spec_from_file_location("naver_search", MODULE_PATH)
naver_search = importlib.util.module_from_spec(MODULE_SPEC)
assert MODULE_SPEC.loader is not None
MODULE_SPEC.loader.exec_module(naver_search)


def make_result(index: int) -> dict[str, str]:
    return {
        "url": f"https://blog.naver.com/author{index}/{200000000000 + index}",
        "mobile_url": f"https://m.blog.naver.com/author{index}/{200000000000 + index}",
        "author": f"author{index}",
        "title": f"title-{index}",
        "snippet": f"snippet-{index}",
    }


class RequestBuilderTest(unittest.TestCase):
    def test_build_search_params_target_blog_tab_and_switch_sm_for_paging(self):
        page_one = naver_search.build_search_params("서울 맛집", start=1, sort="sim")
        page_two = naver_search.build_search_params("서울 맛집", start=16, sort="date")

        self.assertEqual(page_one["ssc"], "tab.blog.all")
        self.assertEqual(page_one["sm"], "tab_jum")
        self.assertEqual(page_one["start"], "1")
        self.assertEqual(page_one["nso"], "so:r,p:all,a:all")

        self.assertEqual(page_two["ssc"], "tab.blog.all")
        self.assertEqual(page_two["sm"], "tab_pge")
        self.assertEqual(page_two["start"], "16")
        self.assertEqual(page_two["nso"], "so:dd,p:all,a:all")


class SearchWorkflowTest(unittest.TestCase):
    def test_search_uses_15_result_pages_and_ignores_extra_anchors_beyond_page_window(self):
        fetch_starts: list[int] = []
        parsed_pages = {
            "page-1": [make_result(index) for index in range(1, 16)] + [make_result(101), make_result(102)],
            "page-16": [make_result(index) for index in range(16, 31)] + [make_result(101), make_result(102)],
        }

        def fake_fetch(query: str, start: int = 1, sort: str = "sim", timeout: int = 15, *, insecure: bool = False) -> str:
            self.assertEqual(query, "서울 맛집")
            self.assertEqual(sort, "sim")
            self.assertEqual(timeout, 15)
            self.assertFalse(insecure)
            fetch_starts.append(start)
            return f"page-{start}"

        def fake_parse(html: str) -> list[dict]:
            return parsed_pages[html]

        with (
            mock.patch.object(naver_search, "fetch_search_page", side_effect=fake_fetch),
            mock.patch.object(naver_search, "parse_search_results", side_effect=fake_parse),
            mock.patch.object(naver_search.time, "sleep"),
        ):
            result = naver_search.search("서울 맛집", count=20)

        self.assertEqual(fetch_starts, [1, 16])
        self.assertEqual(result["total_results"], 20)
        self.assertEqual(
            [item["url"] for item in result["results"]],
            [make_result(index)["url"] for index in range(1, 21)],
        )

    def test_search_passes_date_sort_through_to_fetcher(self):
        captured_sorts: list[str] = []

        def fake_fetch(query: str, start: int = 1, sort: str = "sim", timeout: int = 15, *, insecure: bool = False) -> str:
            captured_sorts.append(sort)
            return "page-1"

        with (
            mock.patch.object(naver_search, "fetch_search_page", side_effect=fake_fetch),
            mock.patch.object(naver_search, "parse_search_results", return_value=[make_result(1)]),
        ):
            result = naver_search.search("서울 맛집", count=1, sort="date")

        self.assertEqual(captured_sorts, ["date"])
        self.assertEqual(result["results"][0]["url"], make_result(1)["url"])


if __name__ == "__main__":
    unittest.main()