Clarify crawler skill discovery guidance

Constraint: Crawling k-skills need site-dependent recipes, but should derive them through a reusable discovery pass. Rejected: Leaving guidance only in docs/adding-a-skill.md | AGENTS.md and CLAUDE.md also guide future agents. Confidence: high Scope-risk: narrow Directive: Use site-agnostic discovery to find, then explicitly package, the target site's stable access path. Tested: npm run ci Not-tested: Live crawler behavior; documentation-only change.
Guide crawler skills toward reusable discovery
2026-06-24 02:04:11 +00:00 · 2026-05-01 22:00:57 +09:00 · 2026-05-01 10:35:09 +09:00
3 changed files with 47 additions and 0 deletions
--- a/AGENTS.md
+++ b/AGENTS.md
@ -28,6 +28,13 @@ These rules are repo-specific and apply to everything under this directory.
 - Respect existing home-directory indirection such as symlinks when syncing `~/.agents/skills`.
 - Do **not** create repo-local `.claude` or `.agents` directories for skill installation unless the user explicitly asks for a repository-local test fixture.

+## Crawling/search skill authoring
+
+- For any k-skill that crawls or searches a website, the expected output is a site-dependent recipe packaged into that skill.
+- Before fixing that recipe, use an insane-search-style, site-agnostic discovery pass: identify public entry points, observe browser-visible data flows when needed, prefer stable public/data endpoints over brittle screen scraping, and classify login/CAPTCHA/empty/blocked responses as explicit failure modes.
+- Record the discovered site-dependent access path, fallback order, inputs/outputs, and failure modes in `SKILL.md` and any helper package code. See `docs/adding-a-skill.md` for the canonical checklist.
+- Do not add crawling dependencies by default; first prefer existing runtime capabilities, public endpoints, or narrow allowlisted proxy routes.
+
 ## Free API proxy policy

 - The built-in `k-skill-proxy` is for **free APIs only**.
--- a/CLAUDE.md
+++ b/CLAUDE.md
@ -4,6 +4,13 @@

 - **Never write tests that assert `.changeset/*.md` files exist.** Changesets are consumed (deleted) by `changeset version` during the release flow. Any test guarding changeset file presence will break CI on the version-bump commit and block the release pipeline.

+## Crawling/search skill authoring
+
+- 크롤링/검색 k-skill의 목표는 최종적으로 대상 사이트에 맞는 site-dependent 접근 방법을 스킬에 패키징하는 것이다.
+- 다만 방법을 고정하기 전에 `insane-search`식 site-agnostic discovery를 먼저 수행한다: 공개 입구, 브라우저에서 보이는 데이터 흐름, RSS/sitemap/정적 JSON/모바일 페이지, 차단·빈 응답·로그인벽 실패 모드를 확인한다.
+- 발견한 검색 URL, 필수 입력값, 결과 해석 규칙, fallback 순서, 실패 모드는 `SKILL.md`와 helper 코드에 명확히 남긴다. 자세한 체크리스트는 `docs/adding-a-skill.md`를 따른다.
+- 새 크롤링 dependency는 기본값으로 추가하지 말고 기존 기능, 공개 endpoint, 좁은 proxy route로 해결 가능한지 먼저 확인한다.
+
 ## Proxy server development

 - 개발 repo: `/Users/jeffrey/Projects/k-skill` (이 디렉토리, `dev` 브랜치)
--- a/docs/adding-a-skill.md
+++ b/docs/adding-a-skill.md
@ -156,6 +156,38 @@ upstream API 키를 사용자에게 노출하지 않으려면 `k-skill-proxy`를

 ---

+## 크롤링/검색 스킬을 만들 때: site-agnostic discovery 먼저
+
+웹사이트를 조회하거나 크롤링하는 스킬의 최종 산출물은 결국 **그 사이트에 맞는 site-dependent 접근 방법**이다. 다만 처음부터 특정 화면 구조나 임시 우회법을 감으로 고정하지 않는다. 먼저 `insane-search`식 접근처럼 **사이트에 상관없이 반복 가능한 탐색 절차**를 적용해 대상 사이트에서 실제로 안정적인 경로를 찾아낸 뒤, 그 발견 결과를 해당 스킬의 site-dependent 지식으로 패키징한다.
+
+적용 대상:
+
+- 검색 결과/상세 페이지를 읽어야 하는 스킬
+- 공식 API 문서가 없거나 불완전한 사이트
+- PC 페이지, 모바일 페이지, RSS, sitemap, 정적 JSON, 공개 데이터 호출 등 여러 입구가 있을 수 있는 사이트
+- 브라우저에서는 보이지만 단순 HTTP 요청에서는 빈 화면/차단/로그인 유도만 보이는 사이트
+
+권장 절차:
+
+1. **공개 입구부터 찾기**: 공식 API, 공개 JSON, RSS/Atom, sitemap, 검색 폼, 모바일 페이지, 정적 파일처럼 사이트가 공개적으로 제공하는 경로를 먼저 확인한다.
+2. **브라우저 동작을 관찰하기**: 화면을 직접 긁기 전에 검색/상세 화면이 어떤 공개 데이터 요청을 통해 채워지는지 확인한다.
+3. **안정적인 경로를 우선하기**: 화면 선택자보다 공개 데이터 호출, 문서화된 endpoint, RSS/sitemap처럼 구조가 덜 흔들리는 경로를 선호한다.
+4. **차단과 빈 응답을 실패로 분리하기**: HTTP 성공만으로 완료로 보지 말고, 실제 결과 본문이 있는지 확인한다. 로그인벽, 봇 검사, 빈 껍데기 페이지는 별도 실패 모드로 적는다.
+5. **site-dependent 방법을 명시적으로 패키징하기**: 탐색 과정에서 확인한 검색 URL, 필수 파라미터, 결과 해석 규칙, fallback 순서를 `SKILL.md`와 패키지 코드에 좁고 명확하게 기록한다.
+6. **권한 경계를 지키기**: 인증, 결제, CAPTCHA, 약관상 제한이 필요한 경로는 자동화하지 말고 사용자 개입 또는 실패 모드로 처리한다.
+
+`SKILL.md`에는 최소한 아래 내용을 남긴다.
+
+- 어떤 공개 접근 경로를 선택했는지와 그 이유
+- 검색/상세 조회의 입력값과 출력값
+- 기본 경로가 실패했을 때의 fallback 순서
+- 빈 결과, 차단, 로그인 필요, upstream 변경 등 실패 모드
+- 시크릿/인증이 필요한지 여부와 저장소에 절대 넣지 않을 값
+
+새 dependency는 기본값으로 추가하지 않는다. 기존 Node.js/Python 표준 기능, 이미 있는 패키지, 또는 `k-skill-proxy`의 좁은 allowlist route로 해결할 수 있는지 먼저 확인한다.
+
+---
+
 ## 스킬 등록 & 검증

 스킬은 **별도 레지스트리 없이 디렉토리 스캔으로 자동 발견**된다.
@ -204,6 +236,7 @@ npm run ci
 - [ ] npm 패키지라면 `packages/`에 구현체와 테스트 추가
 - [ ] npm 패키지라면 `.changeset/*.md` 파일 추가 (반드시 **기능 PR에서**, Version Packages PR에서 추가하지 말 것)
 - [ ] 프록시 경유라면 `k-skill-proxy/src/server.js`에 route 추가하고 main에 merge
+- [ ] 크롤링/검색 스킬이라면 공개 접근 경로, fallback 순서, 차단/로그인/빈 결과 실패 모드 문서화
 - [ ] 시크릿이 있다면 `KSKILL_` 접두사 규칙 준수 및 `docs/setup.md` 업데이트
 - [ ] `docs/features/my-new-skill.md` 작성 (선택, 상세 가이드)