# F-017: Listing Sync & Update Detection

Status: ✅ Done · Priority: P1 · Branch: `feature/F-017-listing-update-detection` · Updated: Mar 4, 2026
## Summary
Keep public listings in sync with what's actually on the platform. Three related concerns:
- Update detection — when re-scraped data differs from the public listing (price, description, images, amenities), propagate changes
- Staleness verification — before archiving a listing not seen in 72h, verify it's actually been removed from Kijiji by visiting its URL
- Re-scraping existing listings — periodically re-visit the `platformUrl` of imported listings that haven't appeared in recent search results
## Problem
Previously:
- `importOne()` had a hard skip: `if (raw.importedListingId) return null` — public listings never got updated
- Staleness checker blindly archived after 48h without checking whether the listing was actually gone
- Scraper only covers search pages 1-25 (~500-1000 listings). Kijiji Montreal has 6000+ active listings. Listings beyond page 25 that are still active get falsely archived.
## Implementation

### Importer Update Logic (`importer.ts`)

`importOne()` now calls `updateExistingListing()` instead of returning `null` for already-imported listings:
- Loads the public listing, compares all fields (price, title, description, bedrooms, bathrooms, area, amenities, contact, availability)
- Builds a diff of changed fields
- Applies updates to public listing
- Skips claimed listings (don't overwrite landlord edits)
- Reactivates archived listings that are re-scraped (status: archived → active)
- Returns `{ isUpdate: true, changedFields: [...] }` for logging
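The diff-building step above can be sketched as a pure field comparison. This is a minimal illustration, not the real implementation: the `Listing` type, `COMPARED_FIELDS` list, and `diffListing()` helper are hypothetical stand-ins for the actual schema, though the compared fields follow the list above.

```typescript
// Simplified stand-in for the public listing shape (hypothetical).
type Listing = {
  price: number;
  title: string;
  description: string;
  bedrooms: number;
  bathrooms: number;
  area: number | null;
};

// Fields compared between the public listing and the re-scraped data.
const COMPARED_FIELDS: (keyof Listing)[] = [
  "price", "title", "description", "bedrooms", "bathrooms", "area",
];

// Returns the changedFields reported back to the caller: every field
// whose scraped value differs from the current public listing.
function diffListing(current: Listing, scraped: Listing): (keyof Listing)[] {
  return COMPARED_FIELDS.filter((f) => current[f] !== scraped[f]);
}
```

An empty diff means the update is skipped entirely; a non-empty diff drives both the field updates and the per-field stats described below.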
### Staleness Verification (`staleness-checker.ts`)

Before archiving, the checker now verifies that the listing is actually gone:
- `verifyListingRemoved(platformUrl)` — lightweight `fetch()` (no Playwright)
- Detects removal: 404/410 status, redirect to search results, "no longer available" markers
- If still live → refresh `lastScrapedAt` (prevents re-checking for another 72h)
- If gone → archive as before
- On verification error → fail safe (don't archive)
- 2.5s delay between verifications (rate limiting)
- Threshold increased: 48h → 72h
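A minimal sketch of the verification logic, split into a pure detection step and a fail-safe wrapper. The `looksRemoved()` helper is hypothetical, and the concrete heuristics (the `/b-` search-page path prefix, the exact marker string) are assumptions; the source only specifies 404/410, a redirect to search results, and "no longer available" markers.

```typescript
// Pure detection step (hypothetical helper): decides "removed" from the
// response status, the final URL after redirects, and the page body.
function looksRemoved(status: number, finalUrl: string, body: string): boolean {
  if (status === 404 || status === 410) return true;
  // Assumed heuristic: dead ads redirect to a search/browse page.
  if (new URL(finalUrl).pathname.startsWith("/b-")) return true;
  // Assumed marker text for a delisted ad.
  return /no longer available/i.test(body);
}

// Fail-safe wrapper: any network or parse error reports "still live",
// so the staleness checker never archives on a verification error.
async function verifyListingRemoved(platformUrl: string): Promise<boolean> {
  try {
    const res = await fetch(platformUrl, { redirect: "follow" });
    return looksRemoved(res.status, res.url, await res.text());
  } catch {
    return false; // fail safe: don't archive
  }
}
```

Keeping the detection pure makes it testable without any network access; only the thin wrapper touches `fetch()`.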
### Re-scrape Batch (`kijiji-continuous.ts`)
Every 4th scrape cycle (~20min), visits 10 existing imported listings:
- Queries `raw_listings` where `lastScrapedAt > 24h`, oldest first
- Visits `platformUrl` with Playwright, extracts Apollo state
- Upserts raw data and publishes through the normal pipeline (normalize → import)
- The importer's new update logic handles the rest
- Reuses the existing browser context (no extra Playwright instance)
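The batch selection step can be sketched as follows. This is an in-memory illustration of the query semantics, not the real database call: `RawListing` and `pickRescrapeBatch()` are hypothetical names, while the 24h cutoff, oldest-first ordering, and batch size of 10 come from the notes above.

```typescript
// Minimal stand-in for a raw_listings row (hypothetical shape).
type RawListing = { platformUrl: string; lastScrapedAt: Date };

// Selects imported listings not scraped in the last 24h, oldest
// first, capped at the per-cycle batch size (10 in the feature).
function pickRescrapeBatch(
  rows: RawListing[],
  now: Date,
  limit = 10,
): RawListing[] {
  const cutoff = now.getTime() - 24 * 60 * 60 * 1000;
  return rows
    .filter((r) => r.lastScrapedAt.getTime() < cutoff)
    .sort((a, b) => a.lastScrapedAt.getTime() - b.lastScrapedAt.getTime())
    .slice(0, limit);
}
```

Oldest-first ordering guarantees every imported listing is eventually revisited, even with only 10 slots per cycle.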
### Image Update Detection
On re-scrape, detects when images have changed and re-downloads them:
- `upsertListing()` replaces `raw_images` with a fresh set on update (delete + reinsert)
- `updateExistingListing()` compares `raw_images` count vs `listing_images` count
- Returns `hasNewImages` flag when counts differ
- Importer consumer publishes to `images.pending` queue for updates with new images
- Image downloader clears old `listing_images` before re-importing (avoids duplicates)
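The count-based check above can be sketched as a small pure function. The `detectImageUpdate()` name and `ImageUpdate` type are hypothetical; the count comparison and the `hasNewImages` flag are from the notes.

```typescript
type ImageUpdate = { hasNewImages: boolean; toDownload: string[] };

// Count-based change detection: any difference in set size marks the
// listing for a full image refresh. On change, old listing_images are
// cleared and every fresh URL is re-downloaded, avoiding duplicates.
function detectImageUpdate(
  freshUrls: string[],
  publicImageCount: number,
): ImageUpdate {
  const hasNewImages = freshUrls.length !== publicImageCount;
  return { hasNewImages, toDownload: hasNewImages ? freshUrls : [] };
}
```

Comparing counts rather than individual URLs trades some precision (a same-size swap goes undetected) for a much simpler check, which is the trade-off the discussion notes call out.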
### Pipeline Event Logging & Telegram Reports
- Importer consumer distinguishes insert vs update in logs
- `isUpdate` and `changedFields` included in pipeline events
- Stats collector tracks `importUpdated` separately from `imported`
- Update categories tracked per field (price, description, images, etc.)
- Hourly Telegram report shows the update field breakdown: `Updated fields: price: 5, images: 3, description: 2`
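Formatting that report line can be sketched as below. The `formatUpdateBreakdown()` helper is hypothetical; the per-field counters and the output shape match the sample above, and the descending sort by count is an assumption inferred from that sample.

```typescript
// Builds the "Updated fields" line of the hourly Telegram report from
// the per-field update counters tracked by the stats collector.
function formatUpdateBreakdown(categories: Record<string, number>): string {
  const parts = Object.entries(categories)
    .filter(([, n]) => n > 0)          // drop fields with no updates
    .sort(([, a], [, b]) => b - a)     // most-updated fields first
    .map(([field, n]) => `${field}: ${n}`);
  return parts.length ? `Updated fields: ${parts.join(", ")}` : "";
}
```

Returning an empty string when nothing changed lets the reporter omit the line entirely in quiet hours.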
## Files Modified

| File | Change |
|---|---|
| `services/scraper/src/pipeline/importer.ts` | Added `updateExistingListing()` with image count diff, `hasNewImages` flag |
| `services/scraper/src/pipeline/staleness-checker.ts` | Added `verifyListingRemoved()`, verify-before-archive |
| `services/scraper/src/pipeline/image-downloader.ts` | Clear old `listing_images` before re-import |
| `services/scraper/src/config.ts` | `STALE_THRESHOLD_HOURS` 48 → 72 |
| `services/scraper/src/scrapers/kijiji-continuous.ts` | Re-scrape batch + replace `raw_images` on update |
| `services/scraper/src/consumers/importer-consumer.ts` | Publish images for updates, track update categories |
| `services/scraper/src/stats/collector.ts` | `importUpdated` counter + `updateCategories` tracking |
| `services/scraper/src/stats/hourly-reporter.ts` | Update field breakdown in Telegram report |
## Discussion Notes

### Mar 4, 2026
- Identified the skip in `importOne()` as the root cause of stale public listings
- Decided to combine update detection + staleness verification + re-scraping into one feature
- Using `fetch()` for staleness verification (lightweight) and Playwright for re-scraping (need Apollo state)
- Re-scrape batch runs inside the continuous scraper loop to reuse the browser context
- Claimed listings are never overwritten by the scraper
- Image update detection uses count diff (raw vs public) — simpler than comparing individual URLs
- On image change: delete old raw_images → insert fresh set → delete old listing_images → re-download all