F-017: Listing Sync & Update Detection

Status: ✅ Done · Priority: P1 · Branch: feature/F-017-listing-update-detection · Updated: Mar 4, 2026

Summary

Keep public listings in sync with what's actually on the platform. Three related concerns:

  1. Update detection — when re-scraped data differs from the public listing (price, description, images, amenities), propagate changes
  2. Staleness verification — before archiving a listing not seen in 72h, verify it's actually been removed from Kijiji by visiting its URL
  3. Re-scraping existing listings — periodically re-visit platformUrl of imported listings that haven't appeared in recent search results

Problem

Previously:

  • importOne() had a hard skip: if (raw.importedListingId) return null — public listings never got updated
  • Staleness checker blindly archived after 48h without verifying whether the listing was actually gone
  • Scraper only covers search pages 1-25 (~500-1000 listings). Kijiji Montreal has 6000+ active listings. Listings beyond page 25 that are still active get falsely archived.

Implementation

Importer Update Logic (importer.ts)

importOne() now calls updateExistingListing() instead of returning null for already-imported listings:

  • Loads the public listing, compares all fields (price, title, description, bedrooms, bathrooms, area, amenities, contact, availability)
  • Builds a diff of changed fields
  • Applies updates to public listing
  • Skips claimed listings (don't overwrite landlord edits)
  • Reactivates archived listings that are re-scraped (status: archived → active)
  • Returns { isUpdate: true, changedFields: [...] } for logging
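
The field-diff step above can be sketched as follows. This is a minimal illustration, not the actual implementation: the entity shapes and the `diffListingFields` name are assumptions; only the compared field names come from the list above.

```typescript
// Hypothetical sketch of the diff step inside updateExistingListing().
type ListingFields = Record<string, unknown>;

const COMPARED_FIELDS = [
  "price", "title", "description", "bedrooms", "bathrooms",
  "area", "amenities", "contact", "availability",
] as const;

// Returns the names of fields whose freshly scraped value differs from the
// current public listing. JSON comparison handles array fields like
// `amenities` as well as scalars.
function diffListingFields(current: ListingFields, scraped: ListingFields): string[] {
  const changed: string[] = [];
  for (const field of COMPARED_FIELDS) {
    if (JSON.stringify(current[field]) !== JSON.stringify(scraped[field])) {
      changed.push(field);
    }
  }
  return changed;
}
```

An empty result means the re-scrape found no changes and the update can be skipped entirely.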

Staleness Verification (staleness-checker.ts)

Before archiving, now verifies the listing is actually gone:

  • verifyListingRemoved(platformUrl) — lightweight fetch() (no Playwright)
  • Detects removal: 404/410 status, redirect to search results, "no longer available" markers
  • If still live → refresh lastScrapedAt (prevents re-checking for another 72h)
  • If gone → archive as before
  • On verification error → fail safe (don't archive)
  • 2.5s delay between verifications (rate limiting)
  • Threshold increased: 48h → 72h
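
The removal-detection rules above reduce to a pure classification over the fetch result, which keeps the network call thin and the "fail safe on error" behavior in the caller. A sketch, with assumptions: the removal-marker phrases and the `/b-` search-path prefix are illustrative guesses, not confirmed values from the implementation.

```typescript
// Assumed marker phrases that indicate a removed Kijiji ad.
const REMOVED_MARKERS = ["no longer available", "listing has been removed"];

// Classifies a completed fetch: status, final URL after redirects, body text.
// Any thrown error before this point is treated by the caller as
// "verification failed, don't archive".
function classifyRemoval(
  status: number,
  finalUrl: string,
  body: string,
): { removed: boolean; reason?: string } {
  // Hard removal: the ad page itself is gone.
  if (status === 404 || status === 410) {
    return { removed: true, reason: `http ${status}` };
  }
  // Redirect back to search results ("/b-" is an assumed search-path prefix).
  if (finalUrl.includes("/b-")) {
    return { removed: true, reason: "redirected to search" };
  }
  // Soft removal: the page still renders but carries a removal notice.
  const lower = body.toLowerCase();
  if (REMOVED_MARKERS.some((m) => lower.includes(m))) {
    return { removed: true, reason: "removal marker" };
  }
  return { removed: false };
}
```

`verifyListingRemoved()` would then be a plain `fetch()` with `redirect: "follow"` feeding `res.status`, `res.url`, and `res.text()` into this function.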

Re-scrape Batch (kijiji-continuous.ts)

Every 4th scrape cycle (~20min), visits 10 existing imported listings:

  • Queries raw_listings whose lastScrapedAt is older than 24h, oldest first
  • Visits platformUrl with Playwright, extracts Apollo state
  • Upserts raw data and publishes through the normal pipeline (normalize → import)
  • The importer's new update logic handles the rest
  • Reuses the existing browser context (no extra Playwright instance)
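
The batch-selection rule (older than 24h, oldest first, capped at 10) can be sketched as a pure function; the row shape and function name are assumptions for illustration:

```typescript
// Assumed minimal shape of a raw_listings row for this purpose.
interface RawListingRow {
  platformUrl: string;
  lastScrapedAt: Date;
}

// Picks up to `limit` imported listings whose last scrape is older than 24h,
// oldest first, mirroring the query described above.
function selectRescrapeBatch(
  rows: RawListingRow[],
  now: Date,
  limit = 10,
): RawListingRow[] {
  const cutoff = now.getTime() - 24 * 60 * 60 * 1000;
  return rows
    .filter((r) => r.lastScrapedAt.getTime() < cutoff)
    .sort((a, b) => a.lastScrapedAt.getTime() - b.lastScrapedAt.getTime())
    .slice(0, limit);
}
```

Oldest-first ordering means every imported listing is eventually revisited, even ones that fell off page 25 months ago.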

Image Update Detection

On re-scrape, detects when images have changed and re-downloads them:

  • upsertListing() replaces raw_images with fresh set on update (delete + reinsert)
  • updateExistingListing() compares raw_images count vs listing_images count
  • Returns hasNewImages flag when counts differ
  • Importer consumer publishes to images.pending queue for updates with new images
  • Image downloader clears old listing_images before re-importing (avoids duplicates)
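
The change check itself is deliberately coarse. A minimal sketch (the function name is an assumption):

```typescript
// Count-based image change detection, per the design note: only counts are
// compared, not individual URLs. A same-count image swap is therefore missed,
// but the check avoids URL normalization issues and stays a single comparison.
function imagesNeedRefresh(rawImages: string[], listingImages: string[]): boolean {
  return rawImages.length !== listingImages.length;
}
```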

Pipeline Event Logging & Telegram Reports

  • Importer consumer distinguishes insert vs update in logs
  • isUpdate and changedFields included in pipeline events
  • Stats collector tracks importUpdated separately from imported
  • Update categories tracked per field (price, description, images, etc.)
  • Hourly Telegram report shows update field breakdown: Updated fields: price: 5, images: 3, description: 2
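
The per-field tracking feeding that report line can be sketched as a small counter; the class name and collector interface are assumptions, but the output format matches the example above:

```typescript
// Accumulates changedFields arrays from update events and renders the hourly
// "Updated fields:" breakdown, most-frequent field first.
class UpdateCategoryCounter {
  private counts = new Map<string, number>();

  record(changedFields: string[]): void {
    for (const field of changedFields) {
      this.counts.set(field, (this.counts.get(field) ?? 0) + 1);
    }
  }

  report(): string {
    const parts = [...this.counts.entries()]
      .sort((a, b) => b[1] - a[1])
      .map(([field, n]) => `${field}: ${n}`);
    return parts.length ? `Updated fields: ${parts.join(", ")}` : "No updates";
  }
}
```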

Files Modified

  • services/scraper/src/pipeline/importer.ts: added updateExistingListing() with image count diff, hasNewImages flag
  • services/scraper/src/pipeline/staleness-checker.ts: added verifyListingRemoved(), verify-before-archive
  • services/scraper/src/pipeline/image-downloader.ts: clear old listing_images before re-import
  • services/scraper/src/config.ts: STALE_THRESHOLD_HOURS 48 → 72
  • services/scraper/src/scrapers/kijiji-continuous.ts: re-scrape batch + replace raw_images on update
  • services/scraper/src/consumers/importer-consumer.ts: publish images for updates, track update categories
  • services/scraper/src/stats/collector.ts: importUpdated counter + updateCategories tracking
  • services/scraper/src/stats/hourly-reporter.ts: update field breakdown in Telegram report

Discussion Notes

Mar 4, 2026

  • Identified the skip in importOne() as the root cause of stale public listings
  • Decided to combine update detection + staleness verification + re-scraping into one feature
  • Using fetch() for staleness verification (lightweight) and Playwright for re-scraping (need Apollo state)
  • Re-scrape batch runs inside the continuous scraper loop to reuse the browser context
  • Claimed listings are never overwritten by the scraper
  • Image update detection uses count diff (raw vs public) — simpler than comparing individual URLs
  • On image change: delete old raw_images → insert fresh set → delete old listing_images → re-download all