mirror of
https://github.com/NomaDamas/k-skill.git
synced 2026-06-24 02:04:11 +00:00
Honor explicit public crawl budgets
Keep broad triathlon searches bounded by applying one detail budget across selected year lists and exposing the same budget control in the CLI.

Constraint: PR #222 review requested shared triathlon crawl budget and CLI access to maxDetailsPerSource.
Rejected: Per-year triathlon budget counters | they can exceed the documented per-source crawl cap on multi-year ranges.
Confidence: high
Scope-risk: narrow
Directive: Keep public-source crawl caps source-scoped and documented when adding more list partitions.
Tested: npm test --workspace korean-marathon-schedule; npm run lint --workspace korean-marathon-schedule; live CLI 고령 smoke; CLI help grep; npm run ci; git diff --check; architect verification CLEAR
Not-tested: Live multi-year low-budget triathlon crawl against upstream beyond mocked regression.
parent c28e0a0839
commit ec4875bd3a
5 changed files with 100 additions and 15 deletions
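
The commit message contrasts one shared triathlon budget with the rejected per-year counters. A minimal sketch of that difference follows, assuming each year's list is already parsed into an array of links. The names `planSharedBudgetVisits`, `planPerYearVisits`, and `listsByYear` are hypothetical and do not appear in the package, and the sketch omits the early exit the real loop takes once enough matching results are collected.

```javascript
// Hypothetical sketch: `listsByYear` maps a year to the detail links parsed
// from that year's list page. Neither function exists in the package.

// Shared budget: one counter spans every selected year.
function planSharedBudgetVisits(listsByYear, detailBudget) {
  const visits = []
  let sourceCount = 0
  for (const [year, links] of Object.entries(listsByYear)) {
    sourceCount += links.length
    for (const link of links) {
      if (visits.length >= detailBudget) break
      visits.push({ year, link })
    }
  }
  // Exhausted when the budget was hit while links were still left unvisited.
  const exhausted = sourceCount > visits.length && visits.length >= detailBudget
  return { visits, sourceCount, exhausted }
}

// Rejected alternative: every year gets a fresh slice of the budget,
// so the total can exceed the per-source cap on multi-year ranges.
function planPerYearVisits(listsByYear, detailBudget) {
  return Object.entries(listsByYear).flatMap(([year, links]) =>
    links.slice(0, detailBudget).map((link) => ({ year, link }))
  )
}
```

With two year lists of two links each and a budget of 3, the shared plan stops after three detail pages, while the per-year plan visits all four.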
@@ -75,7 +75,7 @@ console.log(result.items)
 CLI:

 ```bash
-node packages/korean-marathon-schedule/src/cli.js 서울 --from 2026-05-01 --to 2026-12-31 --include-triathlon --limit 10
+node packages/korean-marathon-schedule/src/cli.js 서울 --from 2026-05-01 --to 2026-12-31 --include-triathlon --limit 10 --max-details-per-source 100
 ```

 ### 2. Summarize conservatively
@@ -96,7 +96,7 @@ If no deadline is present, say `신청 마감일을 공개 페이지에서 확
 ### 3. Use fallback order

 1. GoRunning list → same-host GoRunning detail pages for marathon/road-running schedules; continue through the public list until enough matching results are collected, the list is exhausted, or the explicit per-source detail budget is reached.
-2. If the user asks for triathlon or `includeTriathlon` is useful, query the 대한철인3종협회 year list and same-host public detail pages; skip non-competition list entries and continue until enough matching results are collected, the list is exhausted, or the explicit per-source detail budget is reached.
+2. If the user asks for triathlon or `includeTriathlon` is useful, query the 대한철인3종협회 year list and same-host public detail pages; skip non-competition list entries and continue until enough matching results are collected, the selected year lists are exhausted, or the explicit per-source detail budget shared across selected years is reached.
 3. If either source returns an empty, blocked, changed page, or detail-budget warning, report the source-specific failure/warning and return any successfully parsed results from the other source.

 ## Done when
@@ -109,7 +109,7 @@ If no deadline is present, say `신청 마감일을 공개 페이지에서 확
 ## Failure modes

 - 일정/접수 정보는 수시로 바뀔 수 있다; always state results are based on the current public page read.
-- GoRunning or triathlon.or.kr HTML structure may change; then parsing may return empty fields or fail. Off-origin detail links are ignored to keep the lookup bounded to documented public sources. If a public list is larger than the per-source detail budget, results can be partial and a warning is returned.
+- GoRunning or triathlon.or.kr HTML structure may change; then parsing may return empty fields or fail. Off-origin detail links are ignored to keep the lookup bounded to documented public sources. If a public list is larger than the per-source detail budget, results can be partial and a warning is returned; triathlon applies that budget once across all selected years.
 - Some official event websites may be linked only from the detail page; if absent, return the source detail URL.
 - Registration may already be closed even if the event date is upcoming.
 - Login, payment, CAPTCHA, or private member-only pages are outside scope and must not be automated.
@@ -7,7 +7,7 @@ Public Korean marathon and triathlon schedule lookup client for the `korean-mara
 - Marathon/road-running: `https://gorunning.kr/races/` public race list and same-host public race detail pages.
 - Triathlon: `https://triathlon.or.kr/events/tour/?sYear=<year>&vType=list` and same-host public federation detail pages; non-competition education/admin entries are skipped.

-Both sources are unauthenticated public web surfaces. No proxy or API key is required. Off-origin detail links are ignored, and searches continue through source lists until enough matching results are collected, the source list is exhausted, or the configurable per-source detail budget is reached. The default budget is `max(300, limit * 10)`; when a budget is exhausted before the source list ends, a warning is returned.
+Both sources are unauthenticated public web surfaces. No proxy or API key is required. Off-origin detail links are ignored, and searches continue through source lists until enough matching results are collected, the source list is exhausted, or the configurable per-source detail budget is reached. The triathlon budget is shared across all selected year lists. The default budget is `max(300, limit * 10)`; when a budget is exhausted before the source list ends, a warning is returned.

 ## Usage
@@ -29,7 +29,7 @@ console.log(result.items)
 CLI:

 ```bash
-npx korean-marathon-schedule 서울 --from 2026-05-01 --to 2026-12-31 --include-triathlon --limit 5
+npx korean-marathon-schedule 서울 --from 2026-05-01 --to 2026-12-31 --include-triathlon --limit 5 --max-details-per-source 100
 ```

 Returned event fields include `title`, `eventDate`, `region`, `venue`, `registrationDeadline`, `registrationPeriod`, `categories`, `organizer`, `officialUrl`, and source `url`.
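
The README text in this diff gives the default budget as `max(300, limit * 10)` and calls it configurable. One way that resolution could look is sketched below. The helper name `resolveDetailBudget` and the fallback on invalid input are assumptions for illustration; the diff does not show how the package actually normalizes `maxDetailsPerSource`.

```javascript
// Hypothetical helper: not taken from the package source.
// Assumes an explicit positive integer wins, and anything else
// (undefined, NaN from a bad CLI value, zero, negatives) uses the default.
function resolveDetailBudget({ limit = 10, maxDetailsPerSource } = {}) {
  const defaultBudget = Math.max(300, limit * 10) // documented default
  const explicit = Number(maxDetailsPerSource)
  return Number.isInteger(explicit) && explicit > 0 ? explicit : defaultBudget
}
```

Under these assumptions, `--limit 5` alone resolves to 300 detail fetches per source, while adding `--max-details-per-source 100` lowers the cap to 100.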
@@ -15,6 +15,7 @@ function parseArgs(argv) {
     else if (arg === "--from") options.from = argv[++i]
     else if (arg === "--to") options.to = argv[++i]
     else if (arg === "--limit") options.limit = Number(argv[++i])
+    else if (arg === "--max-details-per-source") options.maxDetailsPerSource = Number(argv[++i])
     else if (arg === "--include-triathlon") options.includeTriathlon = true
     else if (arg === "--help" || arg === "-h") {
       printHelp()
@@ -27,10 +28,16 @@ function parseArgs(argv) {
 }

 function printHelp() {
-  console.log(`Usage: korean-marathon-schedule [query] [options]\n\nOptions:\n -q, --query <text> Filter by title, region, venue, or category\n --from <YYYY-MM-DD> Earliest event date\n --to <YYYY-MM-DD> Latest event date\n --limit <number> Maximum results (default: 10)\n --include-triathlon Include 대한철인3종협회 triathlon events when possible\n`)
+  console.log(`Usage: korean-marathon-schedule [query] [options]\n\nOptions:\n -q, --query <text> Filter by title, region, venue, or category\n --from <YYYY-MM-DD> Earliest event date\n --to <YYYY-MM-DD> Latest event date\n --limit <number> Maximum results (default: 10)\n --max-details-per-source <number>\n Detail crawl budget for each public source\n --include-triathlon Include 대한철인3종협회 triathlon events when possible\n`)
 }

-main().catch((error) => {
-  console.error(error && error.stack ? error.stack : String(error))
-  process.exitCode = 1
-})
+function run() {
+  return main().catch((error) => {
+    console.error(error && error.stack ? error.stack : String(error))
+    process.exitCode = 1
+  })
+}
+
+if (require.main === module) run()
+
+module.exports = { parseArgs, printHelp, main }
@@ -42,13 +42,17 @@ async function searchEvents(options = {}) {
   }

   if (includeTriathlon) {
+    let triathlonDetailCount = 0
+    let triathlonSourceCount = 0
     for (const year of years) {
       const listUrl = `${TRIATHLON_TOUR_URL}?sYear=${encodeURIComponent(year)}&vType=list`
       try {
         const triListHtml = await fetchText(fetcher, listUrl)
         const triListItems = parseTriathlonList(triListHtml)
-        const triBudgetedItems = triListItems.slice(0, detailBudget)
-        for (const listItem of triBudgetedItems) {
+        triathlonSourceCount += triListItems.length
+        for (const listItem of triListItems) {
+          if (triathlonDetailCount >= detailBudget) break
+          triathlonDetailCount += 1
           try {
             const detailHtml = await fetchText(fetcher, listItem.url)
             const event = parseTriathlonDetail(detailHtml, listItem.url, listItem)
@@ -58,14 +62,14 @@ async function searchEvents(options = {}) {
           }
           if (items.length >= normalizedLimit) break
         }
-        if (items.length < normalizedLimit && triListItems.length > triBudgetedItems.length) {
-          warnings.push(`triathlon detail budget exhausted after ${triBudgetedItems.length} of ${triListItems.length} source links for ${year}`)
-        }
       } catch (error) {
         warnings.push(`triathlon source failed for ${listUrl}: ${error.message}`)
       }
       if (items.length >= normalizedLimit) break
     }
+    if (items.length < normalizedLimit && triathlonSourceCount > triathlonDetailCount && triathlonDetailCount >= detailBudget) {
+      warnings.push(`triathlon detail budget exhausted after ${triathlonDetailCount} of ${triathlonSourceCount} source links`)
+    }
   }

   items.sort((a, b) => String(a.eventDate || "").localeCompare(String(b.eventDate || "")))
@@ -1,5 +1,6 @@
 const test = require("node:test")
 const assert = require("node:assert/strict")
+const { spawnSync } = require("node:child_process")

 const {
   parseGorunningList,
@@ -203,6 +204,79 @@ test("searchEvents warns when a configurable detail budget is exhausted before s
   assert.match(result.warnings.join("\n"), /gorunning detail budget exhausted after 3 of 6 source links/)
 })

+test("searchEvents applies one triathlon detail budget across selected years", async () => {
+  const seenDetails = []
+  const fetcher = async (url) => {
+    const textUrl = String(url)
+    if (textUrl === "https://gorunning.kr/races/") return htmlResponse("")
+    if (textUrl === "https://triathlon.or.kr/events/tour/?sYear=2026&vType=list") {
+      return htmlResponse(`<!doctype html><html><body><table>
+        <tr><td><a href="/events/tour/overview/?mode=overview&tourcd=6101">2026 서울 철인3종 대회</a> 장소: 서울 코스: 스탠다드</td></tr>
+      </table></body></html>`)
+    }
+    if (textUrl === "https://triathlon.or.kr/events/tour/?sYear=2027&vType=list") {
+      return htmlResponse(`<!doctype html><html><body><table>
+        <tr><td><a href="/events/tour/overview/?mode=overview&tourcd=7101">2027 제주 철인3종 대회</a> 장소: 제주 코스: 스탠다드</td></tr>
+      </table></body></html>`)
+    }
+    if (textUrl.includes("/events/tour/overview/")) {
+      seenDetails.push(textUrl)
+      const tourcd = new URL(textUrl).searchParams.get("tourcd")
+      return htmlResponse(`<!doctype html><html><body>
+        <h2>${tourcd} 철인3종 대회</h2>
+        <table>
+          <tr><th>대회명</th><td>${tourcd} 철인3종 대회</td></tr>
+          <tr><th>대회기간</th><td>${tourcd === "6101" ? "2026" : "2027"}-07-01</td></tr>
+          <tr><th>대회장소</th><td>서울 한강</td></tr>
+        </table>
+      </body></html>`)
+    }
+    return new Response("not found", { status: 404 })
+  }
+
+  const result = await searchEvents({
+    query: "부산",
+    from: "2026-01-01",
+    to: "2027-12-31",
+    includeTriathlon: true,
+    limit: 5,
+    maxDetailsPerSource: 1,
+    fetcher
+  })
+
+  assert.equal(result.items.length, 0)
+  assert.equal(seenDetails.length, 1)
+  assert.deepEqual(seenDetails, [
+    "https://triathlon.or.kr/events/tour/overview/?mode=overview&tourcd=6101"
+  ])
+  assert.match(result.warnings.join("\n"), /triathlon detail budget exhausted after 1 of 2 source links/)
+})
+
+test("CLI help documents max-details-per-source budget option", () => {
+  const result = spawnSync(process.execPath, ["src/cli.js", "--help"], {
+    cwd: __dirname + "/..",
+    encoding: "utf8"
+  })
+
+  assert.equal(result.status, 0)
+  assert.match(result.stdout, /--max-details-per-source <number>/)
+})
+
+test("CLI maps max-details-per-source argument to search options", () => {
+  const { parseArgs } = require("../src/cli")
+
+  const options = parseArgs([
+    "고령",
+    "--from",
+    "2026-01-01",
+    "--include-triathlon",
+    "--max-details-per-source",
+    "7"
+  ])
+
+  assert.equal(options.maxDetailsPerSource, 7)
+})
+
 test("parseTriathlonList keeps race rows isolated from neighboring education rows", () => {
   const html = `<!doctype html><html><body><table><tbody>
     <tr><td><a href="/events/tour/overview/?mode=overview&tourcd=3001">2026 철인3종 2차 대회규정 정기 교육</a> 장소: 서울 교육장</td></tr>