Phase 5: Scheduled Website Re-Scraping
Status: ✅ IMPLEMENTED
Prerequisites: Phase 2 complete (client_content_asset must be in use before scheduling its refresh makes sense)
Objective: Keep client_content_asset scraped pages fresh automatically so email generation always has up-to-date product/service content to reference, without requiring manual "Re-crawl" clicks.
Background
The current website crawl is user-triggered: clicking "Re-crawl website" in Settings calls crawlAndUpdateClient(), which scrapes the customer's site and stores the result in client_content_asset. Once Phase 8 is in place, this data directly feeds email generation — but only if it stays current.
This phase adds:
- A configurable re-scrape schedule per customer (off / weekly / monthly)
- A daily cron job that picks up overdue customers and re-crawls them
- A small schema addition to `customer_customer` to track the schedule
- A settings UI control to configure the interval
5.1 — Database Migration
File to create: front-end/supabase/migrations/<timestamp>_website_scrape_schedule.sql
-- Per-customer website re-scrape schedule
ALTER TABLE customer_customer
  ADD COLUMN IF NOT EXISTS website_scrape_interval TEXT NOT NULL DEFAULT 'weekly'
    CHECK (website_scrape_interval IN ('off', 'weekly', 'monthly')),
  ADD COLUMN IF NOT EXISTS next_scrape_at TIMESTAMPTZ;

-- Index for the cron job query (customers due for a scrape)
CREATE INDEX IF NOT EXISTS idx_customer_scrape_schedule
  ON customer_customer (next_scrape_at, website_scrape_interval)
  WHERE website_scrape_interval != 'off';

COMMENT ON COLUMN customer_customer.website_scrape_interval IS
  'How often to automatically re-scrape the customer website: off | weekly | monthly';

COMMENT ON COLUMN customer_customer.next_scrape_at IS
  'UTC timestamp of the next scheduled website scrape. NULL = not yet scheduled.';
Default behaviour: all existing customers default to `website_scrape_interval = 'weekly'` but `next_scrape_at = NULL`. The cron job only processes customers where `next_scrape_at IS NOT NULL AND next_scrape_at <= NOW()`. To activate the schedule for a customer, `next_scrape_at` must be set; this happens when the user first saves an interval other than `'off'` via the Settings UI (5.4).
Tasks
- [ ] Generate migration file with correct timestamp prefix
- [ ] Apply: `cd front-end && pnpm supabase db push`
- [ ] Verify new columns exist in Supabase Studio
5.2 — Cron Job: website-rescrape
Target file: api/src/bot/cron.ts
Add a new cron job that runs daily at 3 AM (UTC by default, respects CRON_TIMEZONE). The job is only registered when BOT_ENABLED=true, consistent with the other cron jobs in this file.
Job specification
{
  name: 'website-rescrape',
  cron: '0 3 * * *', // 3 AM daily
  timezone: env.CRON_TIMEZONE ?? 'UTC',
}
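How the job gets registered depends on the existing pattern in cron.ts. As a minimal sketch, assuming the file uses node-cron and gates registration on `BOT_ENABLED` like its neighbours (the `env` accessor and the node-cron dependency are assumptions; mirror the file's actual conventions):

```ts
// Sketch only: registration shape assumed, not taken from cron.ts
import cron from 'node-cron'

if (env.BOT_ENABLED) {
  cron.schedule('0 3 * * *', runWebsiteRescrape, {
    timezone: env.CRON_TIMEZONE ?? 'UTC',
  })
}
```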
Job logic
async function runWebsiteRescrape() {
  const supabase = createSupabaseClient()
  const now = new Date().toISOString()

  // Find customers due for a scrape
  const { data: customers, error } = await supabase
    .from('customer_customer')
    .select('id, url, website_scrape_interval')
    .neq('website_scrape_interval', 'off')
    .not('next_scrape_at', 'is', null)
    .lte('next_scrape_at', now)
    .limit(20) // cap per run to avoid overwhelming crawl infrastructure

  if (error) {
    logger.error({ error }, 'website-rescrape: failed to query customers')
    return
  }

  for (const customer of customers ?? []) {
    // Skip customers with no URL instead of letting the crawl throw
    if (!customer.url) {
      logger.warn({ customerId: customer.id }, 'website-rescrape: no url set, skipping')
      continue
    }
    const nextScrapeAt = computeNextScrapeAt(
      customer.website_scrape_interval as 'weekly' | 'monthly'
    )
    try {
      logger.info(
        { customerId: customer.id, url: customer.url },
        'website-rescrape: starting crawl'
      )
      await crawlAndUpdateClient(customer.id)
      logger.info({ customerId: customer.id, nextScrapeAt }, 'website-rescrape: complete')
    } catch (err) {
      // Log and fall through: next_scrape_at still advances below, so a
      // failing site is retried on the next interval, not hammered every day
      logger.error({ customerId: customer.id, err }, 'website-rescrape: crawl failed, skipping')
    }
    // Advance the schedule whether the crawl succeeded or failed
    await supabase
      .from('customer_customer')
      .update({ next_scrape_at: nextScrapeAt })
      .eq('id', customer.id)
  }
}

function computeNextScrapeAt(interval: 'weekly' | 'monthly'): string {
  const now = new Date()
  if (interval === 'weekly') {
    now.setUTCDate(now.getUTCDate() + 7)
  } else {
    now.setUTCMonth(now.getUTCMonth() + 1)
  }
  return now.toISOString()
}
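One subtlety in `computeNextScrapeAt`: JavaScript `Date` month arithmetic rolls over rather than clamping, so a monthly scrape computed on the 31st drifts a few days into the following month. Harmless for this schedule, but worth knowing when reading the timestamps:

```ts
// Month arithmetic rolls over: "31 February" becomes early March
const d = new Date('2026-01-31T03:00:00Z')
d.setUTCMonth(d.getUTCMonth() + 1)
console.log(d.toISOString()) // 2026-03-03T03:00:00.000Z (2026 is not a leap year)
```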
Import requirement
import { crawlAndUpdateClient } from '../services/crawl.service.js'
Verify this import path matches the actual location of crawlAndUpdateClient (check api/src/services/crawl.service.ts).
Tasks
- [ ] Add `runWebsiteRescrape()` function to `cron.ts`
- [ ] Add `computeNextScrapeAt()` helper
- [ ] Register the cron job in the existing job array / startup sequence
- [ ] Import `crawlAndUpdateClient` (verify path)
- [ ] Add log lines for start, success, and failure per customer
5.3 — Settings Endpoint: Accept Scrape Interval
Target file: api/src/routes/clients.ts
The customer update endpoint (likely PUT /client/:id or PATCH /client/:id) should accept website_scrape_interval in the request body.
When website_scrape_interval is provided and is not 'off', compute an initial next_scrape_at and persist both values. When set to 'off', clear next_scrape_at.
Logic
if (body.website_scrape_interval !== undefined) {
  updatePayload.website_scrape_interval = body.website_scrape_interval
  if (body.website_scrape_interval === 'off') {
    updatePayload.next_scrape_at = null
  } else {
    // Only set next_scrape_at if it's not already scheduled (don't reset a pending job)
    const { data: existing } = await supabase
      .from('customer_customer')
      .select('next_scrape_at')
      .eq('id', customerId)
      .single()
    if (!existing?.next_scrape_at) {
      updatePayload.next_scrape_at = computeNextScrapeAt(body.website_scrape_interval)
    }
  }
}
`computeNextScrapeAt` can be extracted to a shared utility used by both the cron job and this route.
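A minimal sketch of that shared module, using the `api/src/utils/schedule.ts` path suggested in the tasks below (the path and export names are suggestions, not existing code):

```ts
// api/src/utils/schedule.ts (suggested location)
export type ScrapeInterval = 'weekly' | 'monthly'

// Returns the next scrape time as an ISO string, measured from "now" in UTC
export function computeNextScrapeAt(interval: ScrapeInterval): string {
  const next = new Date()
  if (interval === 'weekly') {
    next.setUTCDate(next.getUTCDate() + 7)
  } else {
    next.setUTCMonth(next.getUTCMonth() + 1)
  }
  return next.toISOString()
}
```

Both `cron.ts` and `clients.ts` would then import from this module instead of keeping private copies.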
Schema update
Add website_scrape_interval to the TypeBox schema for the client update endpoint:
website_scrape_interval: Type.Optional(
  Type.Union([
    Type.Literal('off'),
    Type.Literal('weekly'),
    Type.Literal('monthly'),
  ])
),
Tasks
- [ ] Locate the client update endpoint in `clients.ts`
- [ ] Add `website_scrape_interval` to the request schema
- [ ] Add the interval + `next_scrape_at` update logic
- [ ] Extract `computeNextScrapeAt` to a shared location if needed (e.g. `api/src/utils/schedule.ts`)
5.4 — Frontend: Scrape Schedule Control
Target file: front-end/src/pages/app/client/[id]/settings.vue
Add a "Website Refresh Schedule" control below the existing "Re-crawl website" button in the Settings page.
UI
── Website Content ──────────────────────────────────
[Re-crawl website button] Last crawled: {date}
Automatic refresh schedule
[Off ▾] ← dropdown: Off / Weekly / Monthly
Next refresh: 12 May 2026 ← shown when not 'off' and next_scrape_at is set
← hidden when 'off' or next_scrape_at is null
Behaviour
- On dropdown change: call the client update API with `{ website_scrape_interval: selectedValue }`
- Show a brief success toast on save
- When the user selects "Off": hide the "Next refresh" line immediately
- When the user selects "Weekly" or "Monthly": optimistically show "Next refresh" ~7 days or ~30 days out, then refetch to get the server-computed value
Data binding
Load website_scrape_interval and next_scrape_at from the customer object returned by GET /client/:id/context (or the existing settings load). Ensure the client store exposes both fields.
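A rough sketch of the save handler, assuming Vue 3 script-setup and hypothetical `updateClient`/`refetchClient` helpers (the real API call and store wiring will differ):

```ts
// settings.vue <script setup lang="ts"> excerpt; helper names are hypothetical
import { ref } from 'vue'

type ScrapeInterval = 'off' | 'weekly' | 'monthly'

const scrapeInterval = ref<ScrapeInterval>('weekly')
const nextScrapeAt = ref<string | null>(null)

async function onIntervalChange(value: ScrapeInterval) {
  scrapeInterval.value = value
  if (value === 'off') {
    // Hide the "Next refresh" line immediately
    nextScrapeAt.value = null
  } else {
    // Optimistic estimate (~7 or ~30 days); replaced by the server value on refetch
    const days = value === 'weekly' ? 7 : 30
    nextScrapeAt.value = new Date(Date.now() + days * 86_400_000).toISOString()
  }
  await updateClient({ website_scrape_interval: value }) // hypothetical API helper
  await refetchClient() // hypothetical; pulls the server-computed next_scrape_at
}
```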
Tasks
- [ ] Add `website_scrape_interval` and `next_scrape_at` to the customer type/interface used in settings
- [ ] Add the dropdown control to `settings.vue` below the crawl button
- [ ] Wire save to the client update API call
- [ ] Display "Next refresh: {formatted date}" when schedule is active
- [ ] Visual audit: take a screenshot of the Settings page showing the new control
Testing Checklist
Setup
- Enable `BOT_ENABLED=true` locally and start the API: `cd api && pnpm dev`
Cron job
- [ ] Set a test customer's `next_scrape_at` to a past timestamp (e.g. yesterday) and `website_scrape_interval = 'weekly'`
- [ ] Wait for the 3 AM cron or manually call `runWebsiteRescrape()` in a test script (see the sketch after this checklist)
- [ ] Verify `client_content_asset` rows for the customer are updated (check `created_at`/`updated_at`)
- [ ] Verify `next_scrape_at` was advanced by 7 days after a successful crawl
- [ ] Verify a customer with `website_scrape_interval = 'off'` is NOT processed
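A throwaway script for the manual-trigger step could look like this; it assumes `runWebsiteRescrape` is exported from cron.ts and that the Supabase helper lives where shown (both assumptions to verify):

```ts
// scripts/trigger-rescrape.ts (hypothetical): mark a customer overdue, run the job once
import { runWebsiteRescrape } from '../src/bot/cron.js'
import { createSupabaseClient } from '../src/lib/supabase.js' // assumed helper path

const TEST_CUSTOMER_ID = 'replace-with-a-real-id'

async function main() {
  const supabase = createSupabaseClient()
  const yesterday = new Date(Date.now() - 86_400_000).toISOString()
  await supabase
    .from('customer_customer')
    .update({ website_scrape_interval: 'weekly', next_scrape_at: yesterday })
    .eq('id', TEST_CUSTOMER_ID)
  await runWebsiteRescrape()
}

main().catch(console.error)
```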
Settings UI
- [ ] Open Settings for a test client — confirm the schedule dropdown shows the current value
- [ ] Change to "Weekly" — confirm `next_scrape_at` appears within a minute of saving
- [ ] Change to "Off" — confirm `next_scrape_at` clears and the "Next refresh" line disappears
- [ ] Visual audit screenshot of Settings page
Edge cases
- [ ] Customer with no `url` set: cron job should skip gracefully (not crash)
- [ ] Crawl failure (unreachable URL): `next_scrape_at` still advances so the job doesn't retry every day
- [ ] 20-customer cap: if more than 20 are overdue, the first 20 are processed; the rest are picked up on the next run
Rollback
- Remove the `runWebsiteRescrape` job registration from `cron.ts`
- Revert the client update endpoint (remove `website_scrape_interval` handling)
- Remove the dropdown from `settings.vue`
- Migration down: `DROP INDEX IF EXISTS idx_customer_scrape_schedule; ALTER TABLE customer_customer DROP COLUMN IF EXISTS website_scrape_interval, DROP COLUMN IF EXISTS next_scrape_at;`