F-015: Multi-Scraper Pipeline

Status: 💡 Proposed · Priority: P2 · Updated: Mar 4, 2026

Summary

Scale from the single Kijiji scraper to multi-platform aggregation by adding Kangalou, Craigslist Montreal, and Realtor.ca rentals as sources. The work includes cross-platform deduplication and a multi-queue architecture so that each pipeline stage can set its concurrency independently. Property-level dedup builds on F-014: Properties.

Requirements

  • [ ] Kangalou scraper (Quebec-specific rental platform)
  • [ ] Craigslist Montreal scraper
  • [ ] Realtor.ca rentals scraper
  • [ ] Multi-queue BullMQ architecture (per-stage queues for independent concurrency)
  • [ ] Cross-platform deduplication (same listing on multiple sites → one property)
  • [ ] AI-powered enrichment (amenity extraction, quality scoring, categorization)

Design

Architecture evolution:

  • Current: Single scraper:pipeline queue, sequential job chaining, concurrency 1
  • Target: Per-stage queues (scrape, download, normalize, import) with independent concurrency. Multiple scrapers feed shared downstream queues.
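The target layout can be sketched as a small routing table. This is a minimal illustration, not the planned implementation: the stage names come from the list above, while the concurrency values, the `PipelineJob` shape, and the `nextStage` and `routeNext` helpers are hypothetical.

```typescript
// Stage names mirror the target architecture; one queue per stage is assumed.
type Stage = "scrape" | "download" | "normalize" | "import";

const STAGE_ORDER: readonly Stage[] = ["scrape", "download", "normalize", "import"];

// Illustrative concurrency per stage (assumed values, not measured settings).
const CONCURRENCY: Record<Stage, number> = {
  scrape: 2,     // kept low to stay polite to source sites
  download: 8,   // I/O bound, can run wider
  normalize: 4,
  import: 1,     // serialized writes
};

// Hypothetical job payload shared by all scrapers.
interface PipelineJob {
  source: string;    // e.g. "kijiji", "kangalou"
  listingId: string;
}

// Returns the stage that follows `stage`, or null when the pipeline is done.
function nextStage(stage: Stage): Stage | null {
  const i = STAGE_ORDER.indexOf(stage);
  return STAGE_ORDER[i + 1] ?? null;
}

// Decides which queue a finished job should be added to next.
// Every scraper uses the same routing, so downstream queues are shared.
function routeNext(
  stage: Stage,
  job: PipelineJob,
): { queue: Stage; concurrency: number; job: PipelineJob } | null {
  const next = nextStage(stage);
  return next ? { queue: next, concurrency: CONCURRENCY[next], job } : null;
}
```

With BullMQ, each stage would presumably map to its own `Queue` and a `Worker` created with the matching `concurrency` option, so a slow download stage does not block scraping or importing.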

Deduplication strategy:

  • Property-level matching (F-014): same address + unit = same property
  • Fuzzy matching by lat/lng proximity + bedrooms + price range
  • dedupe_clusters table already exists in scraper schema (stub)
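The two matching rules above could be combined roughly as follows. This is a sketch under stated assumptions: the `Listing` shape, the function names, and the default thresholds (50 m, 5% price tolerance) are hypothetical and would need tuning against real data.

```typescript
// Hypothetical normalized listing shape; field names are assumptions.
interface Listing {
  source: string;
  address: string;
  unit: string | null;
  lat: number;
  lng: number;
  bedrooms: number;
  priceCad: number;
}

// Lowercases, strips accents (common in Quebec addresses), collapses punctuation.
function norm(s: string): string {
  return s
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, " ")
    .trim();
}

// Exact property-level key: same address + unit = same property.
function propertyKey(l: Listing): string {
  return `${norm(l.address)}|${norm(l.unit ?? "")}`;
}

// Great-circle distance in meters (haversine formula).
function distanceMeters(a: Listing, b: Listing): number {
  const R = 6_371_000; // mean Earth radius in meters
  const toRad = (deg: number) => (deg * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLng = toRad(b.lng - a.lng);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLng / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(h));
}

// Exact key match first, then fuzzy match on proximity + bedrooms + price.
function isLikelyDuplicate(
  a: Listing,
  b: Listing,
  opts = { maxMeters: 50, priceTolerance: 0.05 },
): boolean {
  if (propertyKey(a) === propertyKey(b)) return true;
  // Two known but different units are never merged (guards against false positives).
  if (a.unit && b.unit && norm(a.unit) !== norm(b.unit)) return false;
  const priceGap =
    Math.abs(a.priceCad - b.priceCad) / Math.max(a.priceCad, b.priceCad);
  return (
    distanceMeters(a, b) <= opts.maxMeters &&
    a.bedrooms === b.bedrooms &&
    priceGap <= opts.priceTolerance
  );
}
```

Tightening `maxMeters` and `priceTolerance` trades missed duplicates for fewer wrongly merged units.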

Platform-specific notes:

  • Kangalou: Quebec-only, good FR content, analysis doc at docs/dev/scraper/kangalou.md
  • Facebook Marketplace: Analysis doc at docs/dev/scraper/facebook-marketplace.md; auth is challenging, so it is not among the planned sources
  • Craigslist: Simple HTML parsing, low volume in Montreal
  • Realtor.ca: Structured API, but may have anti-scraping measures

Open questions:

  • Priority order for new scrapers? Kangalou likely first (Quebec-focused, analyzed)
  • Concurrent workers per scraper, or shared worker pool?
  • How aggressive should dedup be? It trades false positives (merging different units) against false negatives (missing duplicates)

Discussion Notes

Feb 1, 2026

  • Initial proposal as Phase 13 in roadmap
  • Kangalou analysis completed; Facebook Marketplace analyzed but deferred due to auth challenges

Implementation Notes

Not yet started.