F-048: Kangalou Scraper + Cross-Platform Deduplication

Status: Planning · Priority: P1 · Branch: feature/F-048-kangalou-dedup · Updated: Mar 17, 2026

Summary

Add Kangalou as a second scraping source with cross-platform deduplication and data enrichment. When the same listing exists on multiple platforms, merge the best data from each into one enriched public listing. Currently we scrape only Kijiji (~1,900 active listings). Kangalou has ~7,500 listings with zero anti-bot protection.

Requirements

Kangalou Scraper

  • [ ] HTTP-based scraper (no Playwright) using fetch + cheerio
  • [ ] Search API pagination (POST /:lang/api/search)
  • [ ] Detail page parsing (description, coordinates, amenities, images)
  • [ ] Feed into existing RabbitMQ pipeline: scrape → normalize → import
  • [ ] New normalize queue: normalize.kangalou
  • [ ] Continuous scraping loop (same pattern as Kijiji)
  • [ ] Configurable delay, concurrency, enable/disable
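
The pagination loop might look like the sketch below. The response shape (`results`, `totalPages`) and the request body in the trailing comment are assumptions to verify against docs/dev/scraper/kangalou.md, not the actual API contract; `fetchPage` is injected so the loop runs without the network.

```typescript
// Hypothetical response shape — verify against the Kangalou analysis doc.
interface SearchPage<T> {
  results: T[];
  totalPages: number;
}

// Walk every search page with a configurable inter-request delay.
// fetchPage is injected: tests can stub it, production can wrap fetch.
async function scrapeSearch<T>(
  fetchPage: (page: number) => Promise<SearchPage<T>>,
  delayMs = 1000,
): Promise<T[]> {
  const all: T[] = [];
  let page = 1;
  let totalPages = 1;
  do {
    const res = await fetchPage(page);
    all.push(...res.results);
    totalPages = res.totalPages;
    page += 1;
    // Politeness delay between pages, skipped after the last one.
    if (page <= totalPages) await new Promise((r) => setTimeout(r, delayMs));
  } while (page <= totalPages);
  return all;
}

// A production fetchPage would POST to the /:lang/api/search endpoint,
// e.g. (body fields are a placeholder):
//   const fetchPage = (page: number) =>
//     fetch(`${baseUrl}/en/api/search`, {
//       method: "POST",
//       headers: { "content-type": "application/json" },
//       body: JSON.stringify({ page }),
//     }).then((r) => r.json());
```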

Deduplication System

  • [ ] Multi-signal weighted scoring for match detection
  • [ ] Signals: geo-proximity (lat/lng distance), price similarity, bedroom/bathroom count, address text similarity, title similarity
  • [ ] Configurable confidence thresholds (0.80 = likely match, 0.90 = auto-merge)
  • [ ] Runs during the import stage — check for matches before creating a new public listing
  • [ ] dedupe_clusters table redesign: cluster groups multiple raw_listings across platforms
  • [ ] New listing_sources junction table: one public listing → many raw_listings

Cross-Source Enrichment

  • [ ] When a match is found, merge data into the existing public listing
  • [ ] Merge strategy per field:
    • Photos: union (dedupe by image hash)
    • Address: prefer most specific (street > neighborhood-only)
    • Coordinates: prefer Kangalou (embedded) over geocoded
    • Price: prefer most recent
    • Description: prefer longest
    • Amenities: union
    • Area (sqft): prefer non-null
    • Contact info: union
  • [ ] Track which platform contributed which fields
  • [ ] Never downgrade data (don't overwrite good data with empty)
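
A minimal sketch of that per-field merge, assuming an illustrative listing shape — the field names are stand-ins, not the real schema, and "longer string wins" is only a crude proxy for address specificity:

```typescript
interface Coords { lat: number; lng: number; geocoded: boolean }

// Illustrative shape — not the actual public listing schema.
interface ListingData {
  photoHashes: string[];   // images keyed by content hash
  address?: string;
  coords?: Coords;
  price?: number;
  priceSeenAt?: number;    // epoch ms of the price observation
  description?: string;
  amenities: string[];
  areaSqft?: number | null;
}

// Prefer the longer string; never replace a value with nothing.
function pickLonger(a?: string, b?: string): string | undefined {
  if (!a) return b;
  if (!b) return a;
  return b.length > a.length ? b : a;
}

// Prefer embedded coordinates over geocoded ones.
function pickCoords(a?: Coords, b?: Coords): Coords | undefined {
  if (!a) return b;
  if (!b) return a;
  return a.geocoded && !b.geocoded ? b : a;
}

// Prefer the most recently observed price.
function pickPrice(a: ListingData, b: ListingData) {
  if (b.price == null) return { price: a.price, priceSeenAt: a.priceSeenAt };
  if (a.price == null) return { price: b.price, priceSeenAt: b.priceSeenAt };
  return (b.priceSeenAt ?? 0) >= (a.priceSeenAt ?? 0)
    ? { price: b.price, priceSeenAt: b.priceSeenAt }
    : { price: a.price, priceSeenAt: a.priceSeenAt };
}

function mergeListings(existing: ListingData, incoming: ListingData): ListingData {
  return {
    // Photos: union, deduped by image hash.
    photoHashes: [...new Set([...existing.photoHashes, ...incoming.photoHashes])],
    // Address: longer string as a stand-in for "more specific".
    address: pickLonger(existing.address, incoming.address),
    coords: pickCoords(existing.coords, incoming.coords),
    ...pickPrice(existing, incoming),
    // Description: prefer longest.
    description: pickLonger(existing.description, incoming.description),
    // Amenities: union.
    amenities: [...new Set([...existing.amenities, ...incoming.amenities])],
    // Area: prefer non-null — never downgrade good data to empty.
    areaSqft: incoming.areaSqft ?? existing.areaSqft ?? null,
  };
}
```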

Admin Observability

  • [ ] Dedup matches visible in admin pipeline activity
  • [ ] Cluster view: see which raw listings are grouped together
  • [ ] Match confidence scores visible
  • [ ] Ability to manually merge/split clusters
  • [ ] Per-platform stats in scraper stats dashboard

Design

Dedup Scoring Model

Score = Σ (signal_weight × signal_score)

Signals:
  geo_proximity  (weight: 0.35) — haversine distance
                                   1.0 if < 50m, linear decay to 0 at 500m
  price_match    (weight: 0.25) — 1.0 if within 3%, decay to 0 at 15% diff
  bedrooms_match (weight: 0.15) — 1.0 if exact, 0.5 if ±1, 0 otherwise
  address_sim    (weight: 0.15) — normalized Levenshtein on street address
  title_sim      (weight: 0.10) — token overlap (Jaccard) after stopwords

Thresholds:
  ≥ 0.90 → auto-merge (high confidence)
  0.80–0.90 → flag for admin review
  < 0.80 → treat as separate listing
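
The model above can be sketched directly in code. Function names are illustrative and the stopword list is a placeholder; the naive `\W+` tokenizer also splits accented characters, which the real implementation would need to handle.

```typescript
interface ListingSignals {
  lat: number; lng: number;
  price: number;
  bedrooms: number;
  address: string;
  title: string;
}

// Haversine distance in metres.
function haversineMeters(aLat: number, aLng: number, bLat: number, bLng: number): number {
  const R = 6371000;
  const toRad = (d: number) => (d * Math.PI) / 180;
  const dLat = toRad(bLat - aLat);
  const dLng = toRad(bLng - aLng);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(aLat)) * Math.cos(toRad(bLat)) * Math.sin(dLng / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(h));
}

// 1.0 if < 50 m, linear decay to 0 at 500 m.
function geoScore(d: number): number {
  if (d < 50) return 1;
  if (d >= 500) return 0;
  return 1 - (d - 50) / 450;
}

// 1.0 within 3%, linear decay to 0 at 15% difference.
function priceScore(a: number, b: number): number {
  const diff = Math.abs(a - b) / Math.max(a, b);
  if (diff <= 0.03) return 1;
  if (diff >= 0.15) return 0;
  return 1 - (diff - 0.03) / 0.12;
}

// 1.0 if exact, 0.5 if ±1, 0 otherwise.
function bedroomsScore(a: number, b: number): number {
  const d = Math.abs(a - b);
  return d === 0 ? 1 : d === 1 ? 0.5 : 0;
}

// Normalized Levenshtein similarity (1 = identical, 0 = entirely different).
function levenshteinSim(a: string, b: string): number {
  const m = a.length, n = b.length;
  if (m === 0 && n === 0) return 1;
  const dp = Array.from({ length: m + 1 }, (_, i) => [i, ...Array(n).fill(0)]);
  for (let j = 0; j <= n; j++) dp[0][j] = j;
  for (let i = 1; i <= m; i++)
    for (let j = 1; j <= n; j++)
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,
        dp[i][j - 1] + 1,
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1),
      );
  return 1 - dp[m][n] / Math.max(m, n);
}

// Jaccard token overlap after dropping stopwords (placeholder list).
const STOPWORDS = new Set(["a", "the", "for", "in", "de", "la", "le", "du", "des"]);
function titleSim(a: string, b: string): number {
  const tok = (s: string) =>
    new Set(s.toLowerCase().split(/\W+/).filter((t) => t && !STOPWORDS.has(t)));
  const ta = tok(a), tb = tok(b);
  const inter = [...ta].filter((t) => tb.has(t)).length;
  const union = new Set([...ta, ...tb]).size;
  return union === 0 ? 0 : inter / union;
}

// Score = Σ (signal_weight × signal_score)
function matchScore(a: ListingSignals, b: ListingSignals): number {
  return (
    0.35 * geoScore(haversineMeters(a.lat, a.lng, b.lat, b.lng)) +
    0.25 * priceScore(a.price, b.price) +
    0.15 * bedroomsScore(a.bedrooms, b.bedrooms) +
    0.15 * levenshteinSim(a.address.toLowerCase(), b.address.toLowerCase()) +
    0.10 * titleSim(a.title, b.title)
  );
}

function decideAction(score: number): "auto_merge" | "review" | "separate" {
  if (score >= 0.9) return "auto_merge";
  if (score >= 0.8) return "review";
  return "separate";
}
```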

Schema Changes

scraper.dedupe_clusters — redesign existing table:

id                    UUID PK
canonical_listing_id  UUID FK → public.listings (nullable)
match_method          text — "auto" | "manual"
confidence            numeric — highest pairwise score
created_at            timestamp
updated_at            timestamp

scraper.dedupe_cluster_members — new:

id                UUID PK
cluster_id        UUID FK → dedupe_clusters
raw_listing_id    UUID FK → raw_listings
platform          text
pairwise_score    numeric — score vs primary member
created_at        timestamp

public.listing_sources — new junction table:

id                 UUID PK
listing_id         UUID FK → listings
platform           text — "kijiji", "kangalou"
platform_id        text — external ID on that platform
platform_url       text
contributed_fields text[] — which fields came from this source
last_seen_at       timestamp
created_at         timestamp

Pipeline Flow

Kijiji scraper ──→ normalize.kijiji ──→ importer ──→ [dedup check]
Kangalou scraper → normalize.kangalou ──────↑              ↓
                                                    ┌──────┴──────┐
                                                no match     match found
                                                    ↓             ↓
                                               create new  enrich existing
                                                 listing       listing
                                                    ↓             ↓
                                             listing_sources  listing_sources
                                                    ↓             ↓
                                               image queue   image queue
                                                             (new only)
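
The branch above, sketched with injected dependencies so the flow can be exercised without a database or queue. Every name here is hypothetical; whether a review-flagged listing is still created pending admin action is an open design question (this sketch creates it).

```typescript
type ImportOutcome = "enriched" | "flagged_for_review" | "created";

interface ScoredCandidate { publicListingId: string; score: number }

// Hypothetical importer dependencies, injected for testability.
interface ImporterDeps {
  findScoredCandidates(rawListingId: string): Promise<ScoredCandidate[]>;
  enrichListing(listingId: string, rawListingId: string): Promise<void>;
  createListing(rawListingId: string): Promise<string>;
  recordSource(listingId: string, rawListingId: string, platform: string): Promise<void>;
  enqueueNewImages(listingId: string): Promise<void>;
  flagForReview(listingId: string, rawListingId: string, score: number): Promise<void>;
}

async function importRawListing(
  raw: { id: string; platform: string },
  deps: ImporterDeps,
): Promise<ImportOutcome> {
  const candidates = await deps.findScoredCandidates(raw.id);
  const best = [...candidates].sort((x, y) => y.score - x.score)[0];

  if (best && best.score >= 0.9) {
    // ≥ 0.90: auto-merge — enrich the existing public listing instead of creating one.
    await deps.enrichListing(best.publicListingId, raw.id);
    await deps.recordSource(best.publicListingId, raw.id, raw.platform);
    await deps.enqueueNewImages(best.publicListingId); // new images only
    return "enriched";
  }

  // No confident match: create a fresh public listing.
  const listingId = await deps.createListing(raw.id);
  await deps.recordSource(listingId, raw.id, raw.platform);
  await deps.enqueueNewImages(listingId);

  if (best && best.score >= 0.8) {
    // 0.80–0.90: likely match — keep separate for now, flag for admin review.
    await deps.flagForReview(best.publicListingId, raw.id, best.score);
    return "flagged_for_review";
  }
  return "created";
}
```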

Technical Reference

  • Kangalou technical analysis: docs/dev/scraper/kangalou.md
  • Existing scraper patterns: services/scraper/src/scrapers/kijiji-continuous.ts, kijiji-parser.ts
  • Normalizer: services/scraper/src/consumers/normalizer-consumer.ts
  • Importer: services/scraper/src/consumers/importer-consumer.ts
  • Existing dedup table: services/scraper/src/db/schema.ts (dedupeClusters)
  • RabbitMQ topology: services/scraper/src/mq/topology.ts

Discussion Notes

2026-03-17 — Initial planning session:

  • Kangalou + dedup must ship together — adding a second scraper without dedup would surface duplicate public listings
  • Dedup uses weighted multi-signal scoring because addresses can differ across platforms (different spellings, abbreviations)
  • Admin needs visibility into dedup clusters for QA — see which raw listings are grouped, confidence scores
  • Cross-source enrichment: merge best data from each platform per field
  • Enrichment is a competitive advantage — our listings become richer than any single source
  • Amenities normalization will be a separate feature (F-049) but benefits from multi-source data
  • Existing dedupe_clusters table needs redesign (currently just a skeleton)
  • Need listing_sources junction table to track multi-platform origin per public listing
  • Kangalou is HTTP-only (no Playwright), estimated 3-5 days for scraper alone
  • Phone numbers on Kangalou are auth-gated — skip initially

Implementation Notes

To be filled during implementation.