F-017: Listing Sync & Update Detection

Status: ✅ Done · Priority: P1 · Branch: feature/F-017-listing-update-detection · Updated: Mar 4, 2026

Summary

Keep public listings in sync with what's actually on the platform. Three related concerns:

  1. Update detection — when re-scraped data differs from the public listing (price, description, images, amenities), propagate changes
  2. Staleness verification — before archiving a listing not seen in 72h, verify it's actually been removed from Kijiji by visiting its URL
  3. Re-scraping existing listings — periodically re-visit platformUrl of imported listings that haven't appeared in recent search results

Problem

Previously:

  • importOne() had a hard skip: if (raw.importedListingId) return null — public listings never got updated
  • Staleness checker blindly archived after 48h without verifying whether the listing was actually gone
  • Scraper only covers search pages 1-25 (~500-1000 listings). Kijiji Montreal has 6000+ active listings. Listings beyond page 25 that are still active get falsely archived.

Implementation

Importer Update Logic (importer.ts)

importOne() now calls updateExistingListing() instead of returning null for already-imported listings:

  • Loads the public listing, compares all fields (price, title, description, bedrooms, bathrooms, area, amenities, contact, availability)
  • Builds a diff of changed fields
  • Applies updates to public listing
  • Skips claimed listings (don't overwrite landlord edits)
  • Reactivates archived listings that are re-scraped (status: archived → active)
  • Returns { isUpdate: true, changedFields: [...] } for logging
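
The field-diff step above can be sketched as follows. This is a minimal illustration, not the actual implementation: the entity shapes and the `diffListingFields` name are assumptions; only the compared field names come from the list above.

```typescript
// Hypothetical sketch of the diff step inside updateExistingListing().
type ListingFields = Record<string, unknown>;

const COMPARED_FIELDS = [
  "price", "title", "description", "bedrooms", "bathrooms",
  "area", "amenities", "contact", "availability",
] as const;

// Returns the names of fields whose freshly scraped value differs from the
// current public listing. JSON comparison handles array fields like
// `amenities` as well as scalars.
function diffListingFields(current: ListingFields, scraped: ListingFields): string[] {
  const changed: string[] = [];
  for (const field of COMPARED_FIELDS) {
    if (JSON.stringify(current[field]) !== JSON.stringify(scraped[field])) {
      changed.push(field);
    }
  }
  return changed;
}
```

An empty result means the re-scrape found no changes and the update can be skipped entirely.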

Staleness Verification (staleness-checker.ts)

Before archiving, now verifies the listing is actually gone:

  • verifyListingRemoved(platformUrl) — lightweight fetch() (no Playwright)
  • Detects removal: 404/410 status, redirect to search results, "no longer available" markers
  • If still live → refresh lastScrapedAt (prevents re-checking for another 72h)
  • If gone → archive as before
  • On verification error → fail safe (don't archive)
  • 2.5s delay between verifications (rate limiting)
  • Threshold increased: 48h → 72h
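
The removal-detection rules above reduce to a pure classification over the fetch result, which keeps the network call thin and the "fail safe on error" behavior in the caller. A sketch, with assumptions: the removal-marker phrases and the `/b-` search-path prefix are illustrative guesses, not confirmed values from the implementation.

```typescript
// Assumed marker phrases that indicate a removed Kijiji ad.
const REMOVED_MARKERS = ["no longer available", "listing has been removed"];

// Classifies a completed fetch: status, final URL after redirects, body text.
// Any thrown error before this point is treated by the caller as
// "verification failed, don't archive".
function classifyRemoval(
  status: number,
  finalUrl: string,
  body: string,
): { removed: boolean; reason?: string } {
  // Hard removal: the ad page itself is gone.
  if (status === 404 || status === 410) {
    return { removed: true, reason: `http ${status}` };
  }
  // Redirect back to search results ("/b-" is an assumed search-path prefix).
  if (finalUrl.includes("/b-")) {
    return { removed: true, reason: "redirected to search" };
  }
  // Soft removal: the page still renders but carries a removal notice.
  const lower = body.toLowerCase();
  if (REMOVED_MARKERS.some((m) => lower.includes(m))) {
    return { removed: true, reason: "removal marker" };
  }
  return { removed: false };
}
```

`verifyListingRemoved()` would then be a plain `fetch()` with `redirect: "follow"` feeding `res.status`, `res.url`, and `res.text()` into this function.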

Re-scrape Batch (kijiji-continuous.ts)

Every 4th scrape cycle (~20min), visits 10 existing imported listings:

  • Queries raw_listings whose lastScrapedAt is older than 24h, oldest first
  • Visits platformUrl with Playwright, extracts Apollo state
  • Upserts raw data and publishes through the normal pipeline (normalize → import)
  • The importer's new update logic handles the rest
  • Reuses the existing browser context (no extra Playwright instance)
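
The batch-selection rule (older than 24h, oldest first, capped at 10) can be sketched as a pure function; the row shape and function name are assumptions for illustration:

```typescript
// Assumed minimal shape of a raw_listings row for this purpose.
interface RawListingRow {
  platformUrl: string;
  lastScrapedAt: Date;
}

// Picks up to `limit` imported listings whose last scrape is older than 24h,
// oldest first, mirroring the query described above.
function selectRescrapeBatch(
  rows: RawListingRow[],
  now: Date,
  limit = 10,
): RawListingRow[] {
  const cutoff = now.getTime() - 24 * 60 * 60 * 1000;
  return rows
    .filter((r) => r.lastScrapedAt.getTime() < cutoff)
    .sort((a, b) => a.lastScrapedAt.getTime() - b.lastScrapedAt.getTime())
    .slice(0, limit);
}
```

Oldest-first ordering means every imported listing is eventually revisited, even ones that fell off page 25 months ago.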

Image Update Detection

On re-scrape, detects when images have changed and re-downloads them:

  • upsertListing() replaces raw_images with fresh set on update (delete + reinsert)
  • updateExistingListing() compares raw_images count vs listing_images count
  • Returns hasNewImages flag when counts differ
  • Importer consumer publishes to images.pending queue for updates with new images
  • Image downloader clears old listing_images before re-importing (avoids duplicates)
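
The change check itself is deliberately coarse. A minimal sketch (the function name is an assumption):

```typescript
// Count-based image change detection, per the design note: only counts are
// compared, not individual URLs. A same-count image swap is therefore missed,
// but the check avoids URL normalization issues and stays a single comparison.
function imagesNeedRefresh(rawImages: string[], listingImages: string[]): boolean {
  return rawImages.length !== listingImages.length;
}
```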

Pipeline Event Logging & Telegram Reports

  • Importer consumer distinguishes insert vs update in logs
  • isUpdate and changedFields included in pipeline events
  • Stats collector tracks importUpdated separately from imported
  • Update categories tracked per field (price, description, images, etc.)
  • Hourly Telegram report shows update field breakdown: Updated fields: price: 5, images: 3, description: 2
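
The per-field tracking feeding that report line can be sketched as a small counter; the class name and collector interface are assumptions, but the output format matches the example above:

```typescript
// Accumulates changedFields arrays from update events and renders the hourly
// "Updated fields:" breakdown, most-frequent field first.
class UpdateCategoryCounter {
  private counts = new Map<string, number>();

  record(changedFields: string[]): void {
    for (const field of changedFields) {
      this.counts.set(field, (this.counts.get(field) ?? 0) + 1);
    }
  }

  report(): string {
    const parts = [...this.counts.entries()]
      .sort((a, b) => b[1] - a[1])
      .map(([field, n]) => `${field}: ${n}`);
    return parts.length ? `Updated fields: ${parts.join(", ")}` : "No updates";
  }
}
```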

Files Modified

  • services/scraper/src/pipeline/importer.ts: added updateExistingListing() with image count diff, hasNewImages flag
  • services/scraper/src/pipeline/staleness-checker.ts: added verifyListingRemoved(), verify-before-archive
  • services/scraper/src/pipeline/image-downloader.ts: clear old listing_images before re-import
  • services/scraper/src/config.ts: STALE_THRESHOLD_HOURS 48 → 72
  • services/scraper/src/scrapers/kijiji-continuous.ts: re-scrape batch + replace raw_images on update
  • services/scraper/src/consumers/importer-consumer.ts: publish images for updates, track update categories
  • services/scraper/src/stats/collector.ts: importUpdated counter + updateCategories tracking
  • services/scraper/src/stats/hourly-reporter.ts: update field breakdown in Telegram report

Discussion Notes

Mar 4, 2026

  • Identified the skip in importOne() as the root cause of stale public listings
  • Decided to combine update detection + staleness verification + re-scraping into one feature
  • Using fetch() for staleness verification (lightweight) and Playwright for re-scraping (need Apollo state)
  • Re-scrape batch runs inside the continuous scraper loop to reuse the browser context
  • Claimed listings are never overwritten by the scraper
  • Image update detection uses count diff (raw vs public) — simpler than comparing individual URLs
  • On image change: delete old raw_images → insert fresh set → delete old listing_images → re-download all