F-048: Kangalou Scraper + Cross-Platform Deduplication

Status: Planning · Priority: P1 · Branch: feature/F-048-kangalou-dedup · Updated: Mar 17, 2026

Summary

Add Kangalou as a second scraping source with cross-platform deduplication and data enrichment. When the same listing exists on multiple platforms, merge the best data from each into one enriched public listing. Currently we scrape only Kijiji (~1,900 active listings). Kangalou has ~7,500 listings with zero anti-bot protection.

Requirements

Kangalou Scraper

  • [ ] HTTP-based scraper (no Playwright) using fetch + cheerio
  • [ ] Search API pagination (POST /:lang/api/search)
  • [ ] Detail page parsing (description, coordinates, amenities, images)
  • [ ] Feed into existing RabbitMQ pipeline: scrape → normalize → import
  • [ ] New normalize queue: normalize.kangalou
  • [ ] Continuous scraping loop (same pattern as Kijiji)
  • [ ] Configurable delay, concurrency, enable/disable
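
The pagination loop might look like the sketch below. The response shape (`results`, `totalPages`) and the request body in the trailing comment are assumptions to verify against docs/dev/scraper/kangalou.md, not the actual API contract; `fetchPage` is injected so the loop runs without the network.

```typescript
// Hypothetical response shape — verify against the Kangalou analysis doc.
interface SearchPage<T> {
  results: T[];
  totalPages: number;
}

// Walk every search page with a configurable inter-request delay.
// fetchPage is injected: tests can stub it, production can wrap fetch.
async function scrapeSearch<T>(
  fetchPage: (page: number) => Promise<SearchPage<T>>,
  delayMs = 1000,
): Promise<T[]> {
  const all: T[] = [];
  let page = 1;
  let totalPages = 1;
  do {
    const res = await fetchPage(page);
    all.push(...res.results);
    totalPages = res.totalPages;
    page += 1;
    // Politeness delay between pages, skipped after the last one.
    if (page <= totalPages) await new Promise((r) => setTimeout(r, delayMs));
  } while (page <= totalPages);
  return all;
}

// A production fetchPage would POST to the /:lang/api/search endpoint,
// e.g. (body fields are a placeholder):
//   const fetchPage = (page: number) =>
//     fetch(`${baseUrl}/en/api/search`, {
//       method: "POST",
//       headers: { "content-type": "application/json" },
//       body: JSON.stringify({ page }),
//     }).then((r) => r.json());
```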

Deduplication System

  • [ ] Multi-signal weighted scoring for match detection
  • [ ] Signals: geo-proximity (lat/lng distance), price similarity, bedroom/bathroom count, address text similarity, title similarity
  • [ ] Configurable confidence thresholds (0.80 = likely match, 0.90 = auto-merge)
  • [ ] Runs during the import stage — check for matches before creating a new public listing
  • [ ] dedupe_clusters table redesign: cluster groups multiple raw_listings across platforms
  • [ ] New listing_sources junction table: one public listing → many raw_listings

Cross-Source Enrichment

  • [ ] When a match is found, merge data into the existing public listing
  • [ ] Merge strategy per field:
    • Photos: union (dedupe by image hash)
    • Address: prefer most specific (street > neighborhood-only)
    • Coordinates: prefer Kangalou (embedded) over geocoded
    • Price: prefer most recent
    • Description: prefer longest
    • Amenities: union
    • Area (sqft): prefer non-null
    • Contact info: union
  • [ ] Track which platform contributed which fields
  • [ ] Never downgrade data (don't overwrite good data with empty)
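
A minimal sketch of that per-field merge, assuming an illustrative listing shape — the field names are stand-ins, not the real schema, and "longer string wins" is only a crude proxy for address specificity:

```typescript
interface Coords { lat: number; lng: number; geocoded: boolean }

// Illustrative shape — not the actual public listing schema.
interface ListingData {
  photoHashes: string[];   // images keyed by content hash
  address?: string;
  coords?: Coords;
  price?: number;
  priceSeenAt?: number;    // epoch ms of the price observation
  description?: string;
  amenities: string[];
  areaSqft?: number | null;
}

// Prefer the longer string; never replace a value with nothing.
function pickLonger(a?: string, b?: string): string | undefined {
  if (!a) return b;
  if (!b) return a;
  return b.length > a.length ? b : a;
}

// Prefer embedded coordinates over geocoded ones.
function pickCoords(a?: Coords, b?: Coords): Coords | undefined {
  if (!a) return b;
  if (!b) return a;
  return a.geocoded && !b.geocoded ? b : a;
}

// Prefer the most recently observed price.
function pickPrice(a: ListingData, b: ListingData) {
  if (b.price == null) return { price: a.price, priceSeenAt: a.priceSeenAt };
  if (a.price == null) return { price: b.price, priceSeenAt: b.priceSeenAt };
  return (b.priceSeenAt ?? 0) >= (a.priceSeenAt ?? 0)
    ? { price: b.price, priceSeenAt: b.priceSeenAt }
    : { price: a.price, priceSeenAt: a.priceSeenAt };
}

function mergeListings(existing: ListingData, incoming: ListingData): ListingData {
  return {
    // Photos: union, deduped by image hash.
    photoHashes: [...new Set([...existing.photoHashes, ...incoming.photoHashes])],
    // Address: longer string as a stand-in for "more specific".
    address: pickLonger(existing.address, incoming.address),
    coords: pickCoords(existing.coords, incoming.coords),
    ...pickPrice(existing, incoming),
    // Description: prefer longest.
    description: pickLonger(existing.description, incoming.description),
    // Amenities: union.
    amenities: [...new Set([...existing.amenities, ...incoming.amenities])],
    // Area: prefer non-null — never downgrade good data to empty.
    areaSqft: incoming.areaSqft ?? existing.areaSqft ?? null,
  };
}
```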

Admin Observability

  • [ ] Dedup matches visible in admin pipeline activity
  • [ ] Cluster view: see which raw listings are grouped together
  • [ ] Match confidence scores visible
  • [ ] Ability to manually merge/split clusters
  • [ ] Per-platform stats in scraper stats dashboard

Design

Dedup Scoring Model

Score = Σ (signal_weight × signal_score)

Signals:
  geo_proximity  (weight: 0.35) — haversine distance
                                   1.0 if < 50m, linear decay to 0 at 500m
  price_match    (weight: 0.25) — 1.0 if within 3%, decay to 0 at 15% diff
  bedrooms_match (weight: 0.15) — 1.0 if exact, 0.5 if ±1, 0 otherwise
  address_sim    (weight: 0.15) — normalized Levenshtein on street address
  title_sim      (weight: 0.10) — token overlap (Jaccard) after stopwords

Thresholds:
  ≥ 0.90 → auto-merge (high confidence)
  0.80–0.90 → flag for admin review
  < 0.80 → treat as separate listing
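
The model above can be sketched directly in code. Function names are illustrative and the stopword list is a placeholder; the naive `\W+` tokenizer also splits accented characters, which the real implementation would need to handle.

```typescript
interface ListingSignals {
  lat: number; lng: number;
  price: number;
  bedrooms: number;
  address: string;
  title: string;
}

// Haversine distance in metres.
function haversineMeters(aLat: number, aLng: number, bLat: number, bLng: number): number {
  const R = 6371000;
  const toRad = (d: number) => (d * Math.PI) / 180;
  const dLat = toRad(bLat - aLat);
  const dLng = toRad(bLng - aLng);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(aLat)) * Math.cos(toRad(bLat)) * Math.sin(dLng / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(h));
}

// 1.0 if < 50 m, linear decay to 0 at 500 m.
function geoScore(d: number): number {
  if (d < 50) return 1;
  if (d >= 500) return 0;
  return 1 - (d - 50) / 450;
}

// 1.0 within 3%, linear decay to 0 at 15% difference.
function priceScore(a: number, b: number): number {
  const diff = Math.abs(a - b) / Math.max(a, b);
  if (diff <= 0.03) return 1;
  if (diff >= 0.15) return 0;
  return 1 - (diff - 0.03) / 0.12;
}

// 1.0 if exact, 0.5 if ±1, 0 otherwise.
function bedroomsScore(a: number, b: number): number {
  const d = Math.abs(a - b);
  return d === 0 ? 1 : d === 1 ? 0.5 : 0;
}

// Normalized Levenshtein similarity (1 = identical, 0 = entirely different).
function levenshteinSim(a: string, b: string): number {
  const m = a.length, n = b.length;
  if (m === 0 && n === 0) return 1;
  const dp = Array.from({ length: m + 1 }, (_, i) => [i, ...Array(n).fill(0)]);
  for (let j = 0; j <= n; j++) dp[0][j] = j;
  for (let i = 1; i <= m; i++)
    for (let j = 1; j <= n; j++)
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,
        dp[i][j - 1] + 1,
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1),
      );
  return 1 - dp[m][n] / Math.max(m, n);
}

// Jaccard token overlap after dropping stopwords (placeholder list).
const STOPWORDS = new Set(["a", "the", "for", "in", "de", "la", "le", "du", "des"]);
function titleSim(a: string, b: string): number {
  const tok = (s: string) =>
    new Set(s.toLowerCase().split(/\W+/).filter((t) => t && !STOPWORDS.has(t)));
  const ta = tok(a), tb = tok(b);
  const inter = [...ta].filter((t) => tb.has(t)).length;
  const union = new Set([...ta, ...tb]).size;
  return union === 0 ? 0 : inter / union;
}

// Score = Σ (signal_weight × signal_score)
function matchScore(a: ListingSignals, b: ListingSignals): number {
  return (
    0.35 * geoScore(haversineMeters(a.lat, a.lng, b.lat, b.lng)) +
    0.25 * priceScore(a.price, b.price) +
    0.15 * bedroomsScore(a.bedrooms, b.bedrooms) +
    0.15 * levenshteinSim(a.address.toLowerCase(), b.address.toLowerCase()) +
    0.10 * titleSim(a.title, b.title)
  );
}

function decideAction(score: number): "auto_merge" | "review" | "separate" {
  if (score >= 0.9) return "auto_merge";
  if (score >= 0.8) return "review";
  return "separate";
}
```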

Schema Changes

scraper.dedupe_clusters — redesign existing table:

id                    UUID PK
canonical_listing_id  UUID FK → public.listings (nullable)
match_method          text — "auto" | "manual"
confidence            numeric — highest pairwise score
created_at            timestamp
updated_at            timestamp

scraper.dedupe_cluster_members — new:

id                UUID PK
cluster_id        UUID FK → dedupe_clusters
raw_listing_id    UUID FK → raw_listings
platform          text
pairwise_score    numeric — score vs primary member
created_at        timestamp

public.listing_sources — new junction table:

id                 UUID PK
listing_id         UUID FK → listings
platform           text — "kijiji", "kangalou"
platform_id        text — external ID on that platform
platform_url       text
contributed_fields text[] — which fields came from this source
last_seen_at       timestamp
created_at         timestamp

Pipeline Flow

Kijiji scraper ──→ normalize.kijiji ──→ importer ──→ [dedup check]
Kangalou scraper → normalize.kangalou ──────↑              ↓
                                                    ┌──────┴──────┐
                                                no match     match found
                                                    ↓             ↓
                                               create new  enrich existing
                                                 listing       listing
                                                    ↓             ↓
                                             listing_sources  listing_sources
                                                    ↓             ↓
                                               image queue   image queue
                                                             (new only)
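
The branch above, sketched with injected dependencies so the flow can be exercised without a database or queue. Every name here is hypothetical; whether a review-flagged listing is still created pending admin action is an open design question (this sketch creates it).

```typescript
type ImportOutcome = "enriched" | "flagged_for_review" | "created";

interface ScoredCandidate { publicListingId: string; score: number }

// Hypothetical importer dependencies, injected for testability.
interface ImporterDeps {
  findScoredCandidates(rawListingId: string): Promise<ScoredCandidate[]>;
  enrichListing(listingId: string, rawListingId: string): Promise<void>;
  createListing(rawListingId: string): Promise<string>;
  recordSource(listingId: string, rawListingId: string, platform: string): Promise<void>;
  enqueueNewImages(listingId: string): Promise<void>;
  flagForReview(listingId: string, rawListingId: string, score: number): Promise<void>;
}

async function importRawListing(
  raw: { id: string; platform: string },
  deps: ImporterDeps,
): Promise<ImportOutcome> {
  const candidates = await deps.findScoredCandidates(raw.id);
  const best = [...candidates].sort((x, y) => y.score - x.score)[0];

  if (best && best.score >= 0.9) {
    // ≥ 0.90: auto-merge — enrich the existing public listing instead of creating one.
    await deps.enrichListing(best.publicListingId, raw.id);
    await deps.recordSource(best.publicListingId, raw.id, raw.platform);
    await deps.enqueueNewImages(best.publicListingId); // new images only
    return "enriched";
  }

  // No confident match: create a fresh public listing.
  const listingId = await deps.createListing(raw.id);
  await deps.recordSource(listingId, raw.id, raw.platform);
  await deps.enqueueNewImages(listingId);

  if (best && best.score >= 0.8) {
    // 0.80–0.90: likely match — keep separate for now, flag for admin review.
    await deps.flagForReview(best.publicListingId, raw.id, best.score);
    return "flagged_for_review";
  }
  return "created";
}
```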

Technical Reference

  • Kangalou technical analysis: docs/dev/scraper/kangalou.md
  • Existing scraper patterns: services/scraper/src/scrapers/kijiji-continuous.ts, kijiji-parser.ts
  • Normalizer: services/scraper/src/consumers/normalizer-consumer.ts
  • Importer: services/scraper/src/consumers/importer-consumer.ts
  • Existing dedup table: services/scraper/src/db/schema.ts (dedupeClusters)
  • RabbitMQ topology: services/scraper/src/mq/topology.ts

Discussion Notes

2026-03-17 — Initial planning session:

  • Kangalou + dedup must ship together — adding a second scraper without dedup would surface duplicate public listings
  • Dedup uses weighted multi-signal scoring because addresses can differ across platforms (different spellings, abbreviations)
  • Admin needs visibility into dedup clusters for QA — see which raw listings are grouped, confidence scores
  • Cross-source enrichment: merge best data from each platform per field
  • Enrichment is a competitive advantage — our listings become richer than any single source
  • Amenities normalization will be a separate feature (F-049) but benefits from multi-source data
  • Existing dedupe_clusters table needs redesign (currently just a skeleton)
  • Need listing_sources junction table to track multi-platform origin per public listing
  • Kangalou is HTTP-only (no Playwright), estimated 3-5 days for scraper alone
  • Phone numbers on Kangalou are auth-gated — skip initially

Implementation Notes

To be filled during implementation.