Ai-Mee Help Centre

Phase 5: Scheduled Website Re-Scraping

Status: ✅ IMPLEMENTED
Prerequisites: Phase 2 complete (client_content_asset must be in use before scheduling its refresh makes sense)
Objective: Keep client_content_asset scraped pages fresh automatically so email generation always has up-to-date product/service content to reference, without requiring manual "Re-crawl" clicks.


Background

The current website crawl is user-triggered: clicking "Re-crawl website" in Settings calls crawlAndUpdateClient(), which scrapes the customer's site and stores the result in client_content_asset. Once Phase 8 is in place, this data directly feeds email generation — but only if it stays current.

This phase adds:

  1. A configurable re-scrape schedule per customer (off / weekly / monthly)
  2. A daily cron job that picks up overdue customers and re-crawls them
  3. A small schema addition to customer_customer to track the schedule
  4. A settings UI control to configure the interval

11.1 — Database Migration

File to create: front-end/supabase/migrations/<timestamp>_website_scrape_schedule.sql

-- Per-customer website re-scrape schedule
ALTER TABLE customer_customer
  ADD COLUMN IF NOT EXISTS website_scrape_interval TEXT NOT NULL DEFAULT 'weekly'
    CHECK (website_scrape_interval IN ('off', 'weekly', 'monthly')),
  ADD COLUMN IF NOT EXISTS next_scrape_at TIMESTAMPTZ;

-- Index for the cron job query (customers due for a scrape)
CREATE INDEX IF NOT EXISTS idx_customer_scrape_schedule
  ON customer_customer (next_scrape_at, website_scrape_interval)
  WHERE website_scrape_interval != 'off';

COMMENT ON COLUMN customer_customer.website_scrape_interval IS
  'How often to automatically re-scrape the customer website: off | weekly | monthly';
COMMENT ON COLUMN customer_customer.next_scrape_at IS
  'UTC timestamp of the next scheduled website scrape. NULL = not yet scheduled.';

Default behaviour: all existing customers default to website_scrape_interval = 'weekly' with next_scrape_at = NULL. The cron job only processes customers where next_scrape_at IS NOT NULL AND next_scrape_at <= NOW(), so the schedule stays inert until next_scrape_at is set — which happens the first time the user saves a non-'off' interval via the Settings UI (11.4).
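The eligibility rule above can be expressed as a small predicate for reference (a sketch; isDue is an illustrative name, not an existing helper):

```typescript
// Mirrors the cron job's query condition: interval on, schedule activated, and due.
interface ScheduleFields {
  website_scrape_interval: 'off' | 'weekly' | 'monthly'
  next_scrape_at: string | null // ISO timestamp; NULL until first activated
}

function isDue(c: ScheduleFields, now: Date): boolean {
  return (
    c.website_scrape_interval !== 'off' &&
    c.next_scrape_at !== null &&
    new Date(c.next_scrape_at) <= now
  )
}
```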

Tasks

  • [ ] Generate migration file with correct timestamp prefix
  • [ ] Apply: cd front-end && pnpm supabase db push
  • [ ] Verify new columns exist in Supabase Studio

11.2 — Cron Job: website-rescrape

Target file: api/src/bot/cron.ts

Add a new cron job that runs daily at 3 AM (UTC by default, respects CRON_TIMEZONE). The job is only registered when BOT_ENABLED=true, consistent with the other cron jobs in this file.

Job specification

{
  name: 'website-rescrape',
  cron: '0 3 * * *',        // 3 AM daily
  timezone: env.CRON_TIMEZONE ?? 'UTC',
}

Job logic

async function runWebsiteRescrape() {
  const supabase = createSupabaseClient()
  const now = new Date().toISOString()

  // Find customers due for a scrape
  const { data: customers, error } = await supabase
    .from('customer_customer')
    .select('id, url, website_scrape_interval')
    .neq('website_scrape_interval', 'off')
    .not('next_scrape_at', 'is', null)
    .lte('next_scrape_at', now)
    .limit(20) // cap per run to avoid overwhelming crawl infrastructure

  if (error) {
    logger.error({ error }, 'website-rescrape: failed to query customers')
    return
  }

  for (const customer of customers ?? []) {
    // Skip customers without a URL (see edge cases below — don't crash the run)
    if (!customer.url) {
      logger.warn({ customerId: customer.id }, 'website-rescrape: no url set, skipping')
      continue
    }

    try {
      logger.info(
        { customerId: customer.id, url: customer.url },
        'website-rescrape: starting crawl'
      )

      await crawlAndUpdateClient(customer.id)

      const nextScrapeAt = computeNextScrapeAt(
        customer.website_scrape_interval as 'weekly' | 'monthly'
      )

      await supabase
        .from('customer_customer')
        .update({ next_scrape_at: nextScrapeAt })
        .eq('id', customer.id)

      logger.info({ customerId: customer.id, nextScrapeAt }, 'website-rescrape: complete')
    } catch (err) {
      logger.error({ customerId: customer.id, err }, 'website-rescrape: crawl failed, skipping')
      // Still advance next_scrape_at to avoid hammering a failing site every day
      const nextScrapeAt = computeNextScrapeAt(
        customer.website_scrape_interval as 'weekly' | 'monthly'
      )
      await supabase
        .from('customer_customer')
        .update({ next_scrape_at: nextScrapeAt })
        .eq('id', customer.id)
    }
  }
}

function computeNextScrapeAt(interval: 'weekly' | 'monthly'): string {
  const now = new Date()
  if (interval === 'weekly') {
    now.setUTCDate(now.getUTCDate() + 7)
  } else {
    now.setUTCMonth(now.getUTCMonth() + 1)
  }
  return now.toISOString()
}
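Note that plain Date month arithmetic rolls over at month ends: adding one month to 31 January lands in early March rather than clamping to the last day of February. The drift is harmless for a refresh schedule, but worth knowing:

```typescript
// setUTCMonth rolls an out-of-range day into the next month ("31 Feb" → 3 Mar).
const d = new Date(Date.UTC(2026, 0, 31)) // 31 Jan 2026 (month is 0-indexed)
d.setUTCMonth(d.getUTCMonth() + 1)
console.log(d.toISOString()) // 2026-03-03T00:00:00.000Z
```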

Import requirement

import { crawlAndUpdateClient } from '../services/crawl.service.js'

Verify this import path matches the actual location of crawlAndUpdateClient (check api/src/services/crawl.service.ts).

Tasks

  • [ ] Add runWebsiteRescrape() function to cron.ts
  • [ ] Add computeNextScrapeAt() helper
  • [ ] Register the cron job in the existing job array / startup sequence
  • [ ] Import crawlAndUpdateClient (verify path)
  • [ ] Add log lines for start, success, and failure per customer

11.3 — Settings Endpoint: Accept Scrape Interval

Target file: api/src/routes/clients.ts

The customer update endpoint (likely PUT /client/:id or PATCH /client/:id) should accept website_scrape_interval in the request body.

When website_scrape_interval is provided and is not 'off', compute an initial next_scrape_at and persist both values. When set to 'off', clear next_scrape_at.

Logic

if (body.website_scrape_interval !== undefined) {
  updatePayload.website_scrape_interval = body.website_scrape_interval

  if (body.website_scrape_interval === 'off') {
    updatePayload.next_scrape_at = null
  } else {
    // Only set next_scrape_at if it's not already scheduled (don't reset a pending job)
    const { data: existing } = await supabase
      .from('customer_customer')
      .select('next_scrape_at')
      .eq('id', customerId)
      .single()

    if (!existing?.next_scrape_at) {
      updatePayload.next_scrape_at = computeNextScrapeAt(body.website_scrape_interval)
    }
  }
}

computeNextScrapeAt can be extracted to a shared utility used by both the cron job and this route.
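The branching above can also be isolated as a pure helper, which makes the don't-reset-a-pending-job rule easy to unit-test (a sketch; scrapeSchedulePatch is an illustrative name, not an existing function):

```typescript
type ScrapeInterval = 'off' | 'weekly' | 'monthly'

interface SchedulePatch {
  website_scrape_interval: ScrapeInterval
  next_scrape_at?: string | null // omitted = leave the existing value untouched
}

// Decide what to persist given the requested interval and the current schedule.
function scrapeSchedulePatch(
  interval: ScrapeInterval,
  existingNextScrapeAt: string | null,
  computeNext: (i: 'weekly' | 'monthly') => string
): SchedulePatch {
  if (interval === 'off') {
    return { website_scrape_interval: 'off', next_scrape_at: null }
  }
  if (!existingNextScrapeAt) {
    // First activation: schedule the initial scrape.
    return { website_scrape_interval: interval, next_scrape_at: computeNext(interval) }
  }
  // Already scheduled: change the interval but keep the pending timestamp.
  return { website_scrape_interval: interval }
}
```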

Schema update

Add website_scrape_interval to the TypeBox schema for the client update endpoint:

website_scrape_interval: Type.Optional(
  Type.Union([
    Type.Literal('off'),
    Type.Literal('weekly'),
    Type.Literal('monthly'),
  ])
),

Tasks

  • [ ] Locate the client update endpoint in clients.ts
  • [ ] Add website_scrape_interval to the request schema
  • [ ] Add the interval + next_scrape_at update logic
  • [ ] Extract computeNextScrapeAt to a shared location if needed (e.g. api/src/utils/schedule.ts)

11.4 — Frontend: Scrape Schedule Control

Target file: front-end/src/pages/app/client/[id]/settings.vue

Add a "Website Refresh Schedule" control below the existing "Re-crawl website" button in the Settings page.

UI

── Website Content ──────────────────────────────────
[Re-crawl website button]   Last crawled: {date}

Automatic refresh schedule
[Off ▾]  ← dropdown: Off / Weekly / Monthly
Next refresh: 12 May 2026     ← shown when not 'off' and next_scrape_at is set
                              ← hidden when 'off' or next_scrape_at is null

Behaviour

  • On dropdown change: call the client update API with { website_scrape_interval: selectedValue }
  • Show a brief success toast on save
  • When the user selects "Off": hide the "Next refresh" line immediately
  • When the user selects "Weekly" or "Monthly": optimistically show "Next refresh" ~7 days or ~30 days out, then refetch to get the server-computed value
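The optimistic display value can be computed client-side while the refetch is in flight (a sketch; optimisticNextRefresh is an illustrative name, and the server-computed next_scrape_at replaces it once the refetch lands):

```typescript
// Rough client-side estimate of the next refresh, shown immediately after
// saving; superseded by the server-computed next_scrape_at on refetch.
function optimisticNextRefresh(
  interval: 'off' | 'weekly' | 'monthly',
  now: Date = new Date()
): Date | null {
  if (interval === 'off') return null // hide the "Next refresh" line
  const next = new Date(now)
  next.setUTCDate(next.getUTCDate() + (interval === 'weekly' ? 7 : 30))
  return next
}
```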

Data binding

Load website_scrape_interval and next_scrape_at from the customer object returned by GET /client/:id/context (or the existing settings load). Ensure the client store exposes both fields.

Tasks

  • [ ] Add website_scrape_interval and next_scrape_at to the customer type/interface used in settings
  • [ ] Add the dropdown control to settings.vue below the crawl button
  • [ ] Wire save to the client update API call
  • [ ] Display "Next refresh: {formatted date}" when schedule is active
  • [ ] Visual audit: take a screenshot of the Settings page showing the new control

Testing Checklist

Setup

  • Enable BOT_ENABLED=true locally and start the API: cd api && pnpm dev

Cron job

  • [ ] Set a test customer's next_scrape_at to a past timestamp (e.g. yesterday) and website_scrape_interval = 'weekly'
  • [ ] Wait for the 3 AM cron or manually call runWebsiteRescrape() in a test script
  • [ ] Verify client_content_asset rows for the customer are updated (check created_at / updated_at)
  • [ ] Verify next_scrape_at was advanced by 7 days after successful crawl
  • [ ] Verify a customer with website_scrape_interval = 'off' is NOT processed

Settings UI

  • [ ] Open Settings for a test client — confirm the schedule dropdown shows the current value
  • [ ] Change to "Weekly" — confirm next_scrape_at appears within a minute of saving
  • [ ] Change to "Off" — confirm next_scrape_at clears and "Next refresh" line disappears
  • [ ] Visual audit screenshot of Settings page

Edge cases

  • [ ] Customer with no url set: cron job should skip gracefully (not crash)
  • [ ] Crawl failure (unreachable URL): next_scrape_at still advances so the job doesn't retry every day
  • [ ] 20-customer cap: if more than 20 are overdue, the first 20 are processed; the rest are picked up on the next run

Rollback

  1. Remove the runWebsiteRescrape job registration from cron.ts
  2. Revert the client update endpoint (remove website_scrape_interval handling)
  3. Remove the dropdown from settings.vue
  4. Migration down:
    DROP INDEX IF EXISTS idx_customer_scrape_schedule;
    ALTER TABLE customer_customer
      DROP COLUMN IF EXISTS website_scrape_interval,
      DROP COLUMN IF EXISTS next_scrape_at;