Honor explicit public crawl budgets

Keep broad triathlon searches bounded by applying one detail budget across selected year lists and exposing the same budget control in the CLI.

Constraint: PR #222 review requested shared triathlon crawl budget and CLI access to maxDetailsPerSource.

Rejected: Per-year triathlon budget counters | they can exceed the documented per-source crawl cap on multi-year ranges.

Confidence: high

Scope-risk: narrow

Directive: Keep public-source crawl caps source-scoped and documented when adding more list partitions.

Tested: npm test --workspace korean-marathon-schedule; npm run lint --workspace korean-marathon-schedule; live CLI 고령 smoke; CLI help grep; npm run ci; git diff --check; architect verification CLEAR

Not-tested: Live multi-year low-budget triathlon crawl against upstream beyond mocked regression.
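
Review note on the "Rejected" rationale above: the arithmetic below is a minimal sketch of why per-year counters can exceed a per-source cap. The helper names and list sizes are hypothetical, chosen only for illustration, and the sketch ignores the early exit that happens once enough matching results are collected.

```javascript
// Hypothetical helpers; list sizes are illustrative, not measured upstream data.

// Per-year counters (the rejected design): each year list gets its own cap.
function perYearDetailFetches(listSizes, budget) {
  return listSizes.reduce((total, size) => total + Math.min(size, budget), 0)
}

// Shared counter (the design in this commit): one cap across all year lists.
function sharedDetailFetches(listSizes, budget) {
  let used = 0
  for (const size of listSizes) {
    used += Math.min(size, budget - used)
  }
  return used
}

const listSizes = [40, 55, 30] // detail links per selected year list
const budget = 50

console.log(perYearDetailFetches(listSizes, budget)) // 40 + 50 + 30 = 120
console.log(sharedDetailFetches(listSizes, budget)) // 50
```

With three year lists and a budget of 50, the per-year variant would allow up to 120 detail requests against one source, while the shared counter stops at 50.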
Jeffrey (Dongkyu) Kim 2026-05-09 23:23:22 +09:00
commit ec4875bd3a
5 changed files with 100 additions and 15 deletions

@ -75,7 +75,7 @@ console.log(result.items)
CLI:
```bash
-node packages/korean-marathon-schedule/src/cli.js 서울 --from 2026-05-01 --to 2026-12-31 --include-triathlon --limit 10
+node packages/korean-marathon-schedule/src/cli.js 서울 --from 2026-05-01 --to 2026-12-31 --include-triathlon --limit 10 --max-details-per-source 100
```
### 2. Summarize conservatively
@ -96,7 +96,7 @@ If no deadline is present, say `신청 마감일을 공개 페이지에서 확
### 3. Use fallback order
1. GoRunning list → same-host GoRunning detail pages for marathon/road-running schedules; continue through the public list until enough matching results are collected, the list is exhausted, or the explicit per-source detail budget is reached.
-2. If the user asks for triathlon or `includeTriathlon` is useful, query the 대한철인3종협회 year list and same-host public detail pages; skip non-competition list entries and continue until enough matching results are collected, the list is exhausted, or the explicit per-source detail budget is reached.
+2. If the user asks for triathlon or `includeTriathlon` is useful, query the 대한철인3종협회 year list and same-host public detail pages; skip non-competition list entries and continue until enough matching results are collected, the selected year lists are exhausted, or the explicit per-source detail budget shared across selected years is reached.
3. If either source returns an empty, blocked, changed page, or detail-budget warning, report the source-specific failure/warning and return any successfully parsed results from the other source.
## Done when
@ -109,7 +109,7 @@ If no deadline is present, say `신청 마감일을 공개 페이지에서 확
## Failure modes
- Schedule and registration information can change at any time; always state that results are based on the current public page read.
-- GoRunning or triathlon.or.kr HTML structure may change; then parsing may return empty fields or fail. Off-origin detail links are ignored to keep the lookup bounded to documented public sources. If a public list is larger than the per-source detail budget, results can be partial and a warning is returned.
+- GoRunning or triathlon.or.kr HTML structure may change; then parsing may return empty fields or fail. Off-origin detail links are ignored to keep the lookup bounded to documented public sources. If a public list is larger than the per-source detail budget, results can be partial and a warning is returned; triathlon applies that budget once across all selected years.
- Some official event websites may be linked only from the detail page; if absent, return the source detail URL.
- Registration may already be closed even if the event date is upcoming.
- Login, payment, CAPTCHA, or private member-only pages are outside scope and must not be automated.
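
Review note: the fallback and failure-mode text above asks callers to report source-specific warnings next to any results that were parsed. A minimal sketch of that reporting step follows; `formatLookupReport` is a hypothetical helper, and it assumes only the `items` and `warnings` fields that appear elsewhere in this commit (the sample values are illustrative).

```javascript
// Hypothetical helper, not part of the package: render items, then any warnings.
function formatLookupReport(result) {
  const lines = result.items.map(
    (item) => `${item.eventDate || "date unknown"} ${item.title}`
  )
  if (result.warnings.length > 0) {
    // A budget or source warning means the list may be incomplete.
    lines.push("Results may be partial:")
    for (const warning of result.warnings) lines.push(`- ${warning}`)
  }
  return lines.join("\n")
}

// Illustrative input shaped like a searchEvents() result.
const sample = {
  items: [{ title: "Sample road race", eventDate: "2026-07-01" }],
  warnings: ["triathlon detail budget exhausted after 1 of 2 source links"]
}
console.log(formatLookupReport(sample))
```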

@ -7,7 +7,7 @@ Public Korean marathon and triathlon schedule lookup client for the `korean-mara
- Marathon/road-running: `https://gorunning.kr/races/` public race list and same-host public race detail pages.
- Triathlon: `https://triathlon.or.kr/events/tour/?sYear=<year>&vType=list` and same-host public federation detail pages; non-competition education/admin entries are skipped.
-Both sources are unauthenticated public web surfaces. No proxy or API key is required. Off-origin detail links are ignored, and searches continue through source lists until enough matching results are collected, the source list is exhausted, or the configurable per-source detail budget is reached. The default budget is `max(300, limit * 10)`; when a budget is exhausted before the source list ends, a warning is returned.
+Both sources are unauthenticated public web surfaces. No proxy or API key is required. Off-origin detail links are ignored, and searches continue through source lists until enough matching results are collected, the source list is exhausted, or the configurable per-source detail budget is reached. The triathlon budget is shared across all selected year lists. The default budget is `max(300, limit * 10)`; when a budget is exhausted before the source list ends, a warning is returned.
## Usage
@ -29,7 +29,7 @@ console.log(result.items)
CLI:
```bash
-npx korean-marathon-schedule 서울 --from 2026-05-01 --to 2026-12-31 --include-triathlon --limit 5
+npx korean-marathon-schedule 서울 --from 2026-05-01 --to 2026-12-31 --include-triathlon --limit 5 --max-details-per-source 100
```
Returned event fields include `title`, `eventDate`, `region`, `venue`, `registrationDeadline`, `registrationPeriod`, `categories`, `organizer`, `officialUrl`, and source `url`.
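
Review note: the README text above gives the default budget as `max(300, limit * 10)`. The sketch below shows how an explicit value and that default could be reconciled. `resolveDetailBudget` is a hypothetical name, and the fallback for non-positive or non-integer input is an assumption rather than the package's confirmed behavior.

```javascript
// Hypothetical helper; the package may normalize its options differently.
function resolveDetailBudget({ limit = 10, maxDetailsPerSource } = {}) {
  const explicit = Number(maxDetailsPerSource)
  // Assumption: only positive integers count as an explicit budget.
  if (Number.isInteger(explicit) && explicit > 0) return explicit
  // Documented default: max(300, limit * 10).
  return Math.max(300, limit * 10)
}

console.log(resolveDetailBudget({ limit: 5 })) // 300
console.log(resolveDetailBudget({ limit: 50 })) // 500
console.log(resolveDetailBudget({ limit: 5, maxDetailsPerSource: 100 })) // 100
```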

@ -15,6 +15,7 @@ function parseArgs(argv) {
else if (arg === "--from") options.from = argv[++i]
else if (arg === "--to") options.to = argv[++i]
else if (arg === "--limit") options.limit = Number(argv[++i])
+else if (arg === "--max-details-per-source") options.maxDetailsPerSource = Number(argv[++i])
else if (arg === "--include-triathlon") options.includeTriathlon = true
else if (arg === "--help" || arg === "-h") {
printHelp()
@ -27,10 +28,16 @@ function parseArgs(argv) {
}
function printHelp() {
-console.log(`Usage: korean-marathon-schedule [query] [options]\n\nOptions:\n -q, --query <text> Filter by title, region, venue, or category\n --from <YYYY-MM-DD> Earliest event date\n --to <YYYY-MM-DD> Latest event date\n --limit <number> Maximum results (default: 10)\n --include-triathlon Include 대한철인3종협회 triathlon events when possible\n`)
+console.log(`Usage: korean-marathon-schedule [query] [options]\n\nOptions:\n -q, --query <text> Filter by title, region, venue, or category\n --from <YYYY-MM-DD> Earliest event date\n --to <YYYY-MM-DD> Latest event date\n --limit <number> Maximum results (default: 10)\n --max-details-per-source <number>\n Detail crawl budget for each public source\n --include-triathlon Include 대한철인3종협회 triathlon events when possible\n`)
}
-main().catch((error) => {
-console.error(error && error.stack ? error.stack : String(error))
-process.exitCode = 1
-})
+function run() {
+return main().catch((error) => {
+console.error(error && error.stack ? error.stack : String(error))
+process.exitCode = 1
+})
+}
+if (require.main === module) run()
+module.exports = { parseArgs, printHelp, main }
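
Review note: `Number(argv[++i])` evaluates to `NaN` when the flag value is missing or non-numeric, and this diff does not show how `searchEvents` treats a `NaN` budget. One possible guard is sketched below; `parsePositiveInteger` is a hypothetical helper, not something `cli.js` currently defines.

```javascript
// Hypothetical helper, not part of cli.js: reject missing or non-numeric flag values.
function parsePositiveInteger(flag, raw) {
  const value = Number(raw)
  if (raw === undefined || !Number.isInteger(value) || value <= 0) {
    throw new Error(`${flag} expects a positive integer, got ${JSON.stringify(raw)}`)
  }
  return value
}

console.log(parsePositiveInteger("--max-details-per-source", "7")) // 7

try {
  parsePositiveInteger("--max-details-per-source", "many")
} catch (error) {
  console.error(error.message)
}
```

Failing at parse time would surface a mistyped budget as a CLI error instead of letting it reach the crawl loop.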

@ -42,13 +42,17 @@ async function searchEvents(options = {}) {
}
if (includeTriathlon) {
+let triathlonDetailCount = 0
+let triathlonSourceCount = 0
for (const year of years) {
const listUrl = `${TRIATHLON_TOUR_URL}?sYear=${encodeURIComponent(year)}&vType=list`
try {
const triListHtml = await fetchText(fetcher, listUrl)
const triListItems = parseTriathlonList(triListHtml)
-const triBudgetedItems = triListItems.slice(0, detailBudget)
-for (const listItem of triBudgetedItems) {
+triathlonSourceCount += triListItems.length
+for (const listItem of triListItems) {
+if (triathlonDetailCount >= detailBudget) break
+triathlonDetailCount += 1
try {
const detailHtml = await fetchText(fetcher, listItem.url)
const event = parseTriathlonDetail(detailHtml, listItem.url, listItem)
@ -58,14 +62,14 @@ async function searchEvents(options = {}) {
}
if (items.length >= normalizedLimit) break
}
-if (items.length < normalizedLimit && triListItems.length > triBudgetedItems.length) {
-warnings.push(`triathlon detail budget exhausted after ${triBudgetedItems.length} of ${triListItems.length} source links for ${year}`)
-}
} catch (error) {
warnings.push(`triathlon source failed for ${listUrl}: ${error.message}`)
}
if (items.length >= normalizedLimit) break
}
+if (items.length < normalizedLimit && triathlonSourceCount > triathlonDetailCount && triathlonDetailCount >= detailBudget) {
+warnings.push(`triathlon detail budget exhausted after ${triathlonDetailCount} of ${triathlonSourceCount} source links`)
+}
}
items.sort((a, b) => String(a.eventDate || "").localeCompare(String(b.eventDate || "")))

@ -1,5 +1,6 @@
const test = require("node:test")
const assert = require("node:assert/strict")
+const { spawnSync } = require("node:child_process")
const {
parseGorunningList,
@ -203,6 +204,79 @@ test("searchEvents warns when a configurable detail budget is exhausted before s
assert.match(result.warnings.join("\n"), /gorunning detail budget exhausted after 3 of 6 source links/)
})
+test("searchEvents applies one triathlon detail budget across selected years", async () => {
+const seenDetails = []
+const fetcher = async (url) => {
+const textUrl = String(url)
+if (textUrl === "https://gorunning.kr/races/") return htmlResponse("")
+if (textUrl === "https://triathlon.or.kr/events/tour/?sYear=2026&vType=list") {
+return htmlResponse(`<!doctype html><html><body><table>
+<tr><td><a href="/events/tour/overview/?mode=overview&tourcd=6101">2026 서울 철인3종 대회</a> : : </td></tr>
+</table></body></html>`)
+}
+if (textUrl === "https://triathlon.or.kr/events/tour/?sYear=2027&vType=list") {
+return htmlResponse(`<!doctype html><html><body><table>
+<tr><td><a href="/events/tour/overview/?mode=overview&tourcd=7101">2027 제주 철인3종 대회</a> : : </td></tr>
+</table></body></html>`)
+}
+if (textUrl.includes("/events/tour/overview/")) {
+seenDetails.push(textUrl)
+const tourcd = new URL(textUrl).searchParams.get("tourcd")
+return htmlResponse(`<!doctype html><html><body>
+<h2>${tourcd} 철인3종 대회</h2>
+<table>
+<tr><th>대회명</th><td>${tourcd} 3 </td></tr>
+<tr><th>대회기간</th><td>${tourcd === "6101" ? "2026" : "2027"}-07-01</td></tr>
+<tr><th>대회장소</th><td> </td></tr>
+</table>
+</body></html>`)
+}
+return new Response("not found", { status: 404 })
+}
+const result = await searchEvents({
+query: "부산",
+from: "2026-01-01",
+to: "2027-12-31",
+includeTriathlon: true,
+limit: 5,
+maxDetailsPerSource: 1,
+fetcher
+})
+assert.equal(result.items.length, 0)
+assert.equal(seenDetails.length, 1)
+assert.deepEqual(seenDetails, [
+"https://triathlon.or.kr/events/tour/overview/?mode=overview&tourcd=6101"
+])
+assert.match(result.warnings.join("\n"), /triathlon detail budget exhausted after 1 of 2 source links/)
+})
+test("CLI help documents max-details-per-source budget option", () => {
+const result = spawnSync(process.execPath, ["src/cli.js", "--help"], {
+cwd: __dirname + "/..",
+encoding: "utf8"
+})
+assert.equal(result.status, 0)
+assert.match(result.stdout, /--max-details-per-source <number>/)
+})
+test("CLI maps max-details-per-source argument to search options", () => {
+const { parseArgs } = require("../src/cli")
+const options = parseArgs([
+"고령",
+"--from",
+"2026-01-01",
+"--include-triathlon",
+"--max-details-per-source",
+"7"
+])
+assert.equal(options.maxDetailsPerSource, 7)
+})
test("parseTriathlonList keeps race rows isolated from neighboring education rows", () => {
const html = `<!doctype html><html><body><table><tbody>
<tr><td><a href="/events/tour/overview/?mode=overview&tourcd=3001">2026 철인3종 2 대회규정 정기 교육</a> : </td></tr>