F-015: Multi-Scraper Pipeline
Status: 🟡 Proposed · Priority: P2 · Updated: Mar 4, 2026
Summary
Scale from a single Kijiji scraper to multi-platform aggregation by adding Kangalou, Craigslist Montreal, and Realtor.ca rentals as sources. Includes cross-platform deduplication and a multi-queue architecture so that each stage can run at its own concurrency. Benefits from F-014: Properties for property-level dedup.
Requirements
- [ ] Kangalou scraper (Quebec-specific rental platform)
- [ ] Craigslist Montreal scraper
- [ ] Realtor.ca rentals scraper
- [ ] Multi-queue BullMQ architecture (per-stage queues for independent concurrency)
- [ ] Cross-platform deduplication (same listing on multiple sites → one property)
- [ ] AI-powered enrichment (amenity extraction, quality scoring, categorization)
Design
Architecture evolution:
- Current: Single `scraper:pipeline` queue, sequential job chaining, concurrency 1
- Target: Per-stage queues (scrape, download, normalize, import) with independent concurrency. Multiple scrapers feed shared downstream queues.
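As a rough illustration of the target layout, the sketch below models the four stages, a per-stage concurrency setting, and the hand-off from one stage to the next. It deliberately avoids any queue library so it can run on its own. The concurrency values, the `PipelineJob` shape, and the `nextStage` and `advance` helpers are all assumptions made for this example, not part of the existing pipeline.

```typescript
// Stage names come from the design notes; everything else is illustrative.
type Stage = "scrape" | "download" | "normalize" | "import";

const STAGE_ORDER: readonly Stage[] = ["scrape", "download", "normalize", "import"];

// Hypothetical per-stage concurrency. Real values would need tuning per platform.
const STAGE_CONCURRENCY: Record<Stage, number> = {
  scrape: 2,
  download: 8,
  normalize: 4,
  import: 1,
};

// Hypothetical job payload. The source tag is what lets several scrapers
// share the same downstream queues.
interface PipelineJob {
  source: string; // e.g. "kijiji" or "kangalou"
  listingId: string; // platform-local identifier
}

// Returns the stage that follows the given one, or null after the last stage.
function nextStage(stage: Stage): Stage | null {
  const i = STAGE_ORDER.indexOf(stage);
  return STAGE_ORDER[i + 1] ?? null;
}

// Describes where a finished job should be enqueued next.
// Returns null when the job has completed the final stage.
function advance(
  stage: Stage,
  job: PipelineJob,
): { stage: Stage; job: PipelineJob } | null {
  const next = nextStage(stage);
  return next === null ? null : { stage: next, job };
}
```

In a queue-backed version, each stage would presumably map to its own queue and worker, with the worker's concurrency taken from a table like `STAGE_CONCURRENCY`, and `advance` would be replaced by an enqueue call on the next stage's queue.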
Deduplication strategy:
- Property-level matching (F-014): same address + unit = same property
- Fuzzy matching by lat/lng proximity + bedrooms + price range
- `dedupe_clusters` table already exists in scraper schema (stub)
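The two matching tiers above could be sketched as follows: an exact tier on normalized address plus unit, then a fuzzy tier on distance, bedrooms, and price. The `Listing` fields, the `FuzzyThresholds` parameters, and all function names are hypothetical; the real schema and the F-014 matching rules may differ.

```typescript
// Hypothetical listing shape; field names are assumptions, not the real schema.
interface Listing {
  address: string;
  unit: string | null;
  lat: number;
  lng: number;
  bedrooms: number;
  priceCad: number;
}

// Hypothetical tuning knobs. Looser values merge more listings (more false
// positives); tighter values miss more duplicates (more false negatives).
interface FuzzyThresholds {
  maxDistanceMeters: number;
  maxPriceDiffRatio: number;
}

const EARTH_RADIUS_M = 6_371_000;

// Great-circle distance between two lat/lng points (haversine formula).
function haversineMeters(lat1: number, lng1: number, lat2: number, lng2: number): number {
  const toRad = (deg: number): number => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLng = toRad(lng2 - lng1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLng / 2) ** 2;
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
}

const normalize = (s: string): string => s.trim().toLowerCase().replace(/\s+/g, " ");

function isLikelyDuplicate(a: Listing, b: Listing, t: FuzzyThresholds): boolean {
  const sameAddress = normalize(a.address) === normalize(b.address);
  const unitA = normalize(a.unit ?? "");
  const unitB = normalize(b.unit ?? "");

  // Exact tier: same address + unit = same property.
  if (sameAddress && unitA === unitB) return true;
  // Guard against merging different units of the same building.
  if (sameAddress && unitA !== "" && unitB !== "") return false;

  // Fuzzy tier: proximity + bedrooms + price range.
  if (a.bedrooms !== b.bedrooms) return false;
  if (haversineMeters(a.lat, a.lng, b.lat, b.lng) > t.maxDistanceMeters) return false;
  const priceDiff = Math.abs(a.priceCad - b.priceCad);
  return priceDiff <= t.maxPriceDiffRatio * Math.max(a.priceCad, b.priceCad);
}
```

Keeping the thresholds as an explicit parameter makes it easier to experiment with stricter or looser settings before committing to one.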
Platform-specific notes:
- Kangalou: Quebec-only, good FR content, analysis doc at `docs/dev/scraper/kangalou.md`
- Facebook Marketplace: Analysis doc at `docs/dev/scraper/facebook-marketplace.md`, challenging auth
- Craigslist: Simple HTML parsing, low volume in Montreal
- Realtor.ca: Structured API, but may have anti-scraping measures
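Because each platform delivers data in a different form (HTML for some, a structured API for others), one option is to put every scraper behind a shared adapter contract that emits a single normalized shape for the downstream stages. The sketch below is only an illustration of that idea: `ScraperAdapter`, `NormalizedListing`, `ExampleRawRecord`, and `parsePriceCad` are made-up names, and the raw record fields do not reflect any platform's actual markup or API.

```typescript
// Sources named in this proposal, plus the existing Kijiji scraper.
type Source = "kijiji" | "kangalou" | "craigslist" | "realtor";

// Hypothetical normalized shape consumed by the downstream stages.
interface NormalizedListing {
  source: Source;
  externalId: string;
  title: string;
  priceCad: number | null;
  url: string;
}

// Hypothetical contract that every platform scraper would implement.
interface ScraperAdapter<Raw> {
  readonly source: Source;
  normalize(raw: Raw): NormalizedListing;
}

// Parses a whole-dollar price such as "$1,500" into a number.
// Assumes no decimal part; other formats would need more careful handling.
function parsePriceCad(text: string): number | null {
  const digits = text.replace(/[^0-9]/g, "");
  return digits === "" ? null : Number(digits);
}

// Made-up raw record used only for this example.
interface ExampleRawRecord {
  id: string;
  heading: string;
  priceText: string;
  link: string;
}

const exampleCraigslistAdapter: ScraperAdapter<ExampleRawRecord> = {
  source: "craigslist",
  normalize(raw) {
    return {
      source: "craigslist",
      externalId: raw.id,
      title: raw.heading.trim(),
      priceCad: parsePriceCad(raw.priceText),
      url: raw.link,
    };
  },
};
```

With a contract like this, adding a platform would mostly mean writing one new adapter, while the shared queues and deduplication logic stay unchanged.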
Open questions:
- Priority order for new scrapers? Kangalou likely first (Quebec-focused, analyzed)
- Concurrent workers per scraper, or shared worker pool?
- How aggressive on dedup? False positives (merging different units) vs false negatives (missing duplicates)
Discussion Notes
Feb 1, 2026
- Initial proposal as Phase 13 in roadmap
- Kangalou analysis completed, Facebook Marketplace analyzed (deferred due to auth challenges)
Implementation Notes
Not yet started.