F-048: Kangalou Scraper + Cross-Platform Deduplication
Status: Planning · Priority: P1 · Branch: feature/F-048-kangalou-dedup · Updated: Mar 17, 2026
Summary
Add Kangalou as a second scraping source with cross-platform deduplication and data enrichment. When the same listing exists on multiple platforms, merge the best data from each into one enriched public listing. Currently we scrape only Kijiji (~1,900 active listings). Kangalou has ~7,500 listings with zero anti-bot protection.
Requirements
Kangalou Scraper
- [ ] HTTP-based scraper (no Playwright) using fetch + cheerio
- [ ] Search API pagination (POST /:lang/api/search)
- [ ] Detail page parsing (description, coordinates, amenities, images)
- [ ] Feed into existing RabbitMQ pipeline: scrape → normalize → import
- [ ] New normalize queue: normalize.kangalou
- [ ] Continuous scraping loop (same pattern as Kijiji)
- [ ] Configurable delay, concurrency, enable/disable
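A rough sketch of the fetch-based pagination loop described above. Hedged heavily: `KangalouConfig`, the JSON body `{ page, pageSize }`, and the `results` field are all assumptions about the search API, not its confirmed shape, and detail-page parsing with cheerio is omitted.

```typescript
// Hypothetical sketch only: request/response shapes and all names here are
// assumptions, not Kangalou's real API.
type KangalouConfig = {
  baseUrl: string;        // site root
  lang: "fr" | "en";
  pageSize: number;
  delayMs: number;        // configurable politeness delay between pages
};

// POST /:lang/api/search per the requirement above
function searchUrl(cfg: KangalouConfig): string {
  return `${cfg.baseUrl}/${cfg.lang}/api/search`;
}

// Walk the paginated search API until an empty page comes back.
async function* scrapeAllPages(cfg: KangalouConfig): AsyncGenerator<unknown[], void> {
  for (let page = 1; ; page++) {
    const res = await fetch(searchUrl(cfg), {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ page, pageSize: cfg.pageSize }), // assumed payload
    });
    const { results } = (await res.json()) as { results: unknown[] };
    if (results.length === 0) return;   // past the last page
    yield results;                      // hand the batch to the normalize queue
    await new Promise((r) => setTimeout(r, cfg.delayMs));
  }
}
```

Each yielded batch would be published to normalize.kangalou rather than processed inline, matching the existing scrape → normalize → import pipeline.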
Deduplication System
- [ ] Multi-signal weighted scoring for match detection
- [ ] Signals: geo-proximity (lat/lng distance), price similarity, bedroom/bathroom count, address text similarity, title similarity
- [ ] Configurable confidence thresholds (0.80 flags a likely match for review, 0.90 auto-merges)
- [ ] Runs during import stage — before creating a new public listing, check for matches
- [ ] dedupe_clusters table redesign: a cluster groups multiple raw_listings across platforms
- [ ] New listing_sources junction table: one public listing → many raw_listings
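The import-stage decision implied by the thresholds in the Design section could look like the sketch below. The type and function names are hypothetical; only the 0.80/0.90 cutoffs come from this document.

```typescript
// Best-scoring candidate match found during the import-stage dedup check.
type MatchResult = { listingId: string; score: number };

type ImportAction =
  | { kind: "create" }                                   // no confident match
  | { kind: "enrich"; listingId: string }                // auto-merge
  | { kind: "review"; listingId: string; score: number } // flag for admin

function decideImportAction(best: MatchResult | null): ImportAction {
  if (!best || best.score < 0.8) return { kind: "create" };
  if (best.score >= 0.9) return { kind: "enrich", listingId: best.listingId };
  return { kind: "review", listingId: best.listingId, score: best.score };
}
```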
Cross-Source Enrichment
- [ ] When a match is found, merge data into the existing public listing
- [ ] Merge strategy per field:
- Photos: union (dedupe by image hash)
- Address: prefer most specific (street > neighborhood-only)
- Coordinates: prefer Kangalou (embedded) over geocoded
- Price: prefer most recent
- Description: prefer longest
- Amenities: union
- Area (sqft): prefer non-null
- Contact info: union
- [ ] Track which platform contributed which fields
- [ ] Never downgrade data (don't overwrite good data with empty)
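A sketch of the per-field merge under the "never downgrade" rule, covering a subset of the fields above. The field set and helper names are illustrative, and "most specific address" is approximated here as "longer string wins", which is only a stand-in for real specificity scoring.

```typescript
// Illustrative subset of the merged listing fields.
type ListingFields = {
  photos: string[];            // image hashes
  address: string | null;
  description: string | null;
  areaSqft: number | null;
  amenities: string[];
};

// Union with dedupe, preserving first-seen order.
const union = <T>(a: T[], b: T[]): T[] => [...new Set([...a, ...b])];

// Incoming data only wins when it is strictly better; empty values never
// overwrite good ones ("never downgrade").
function mergeFields(existing: ListingFields, incoming: ListingFields): ListingFields {
  return {
    photos: union(existing.photos, incoming.photos),
    // crude proxy for "most specific": longer non-null address wins
    address:
      (incoming.address ?? "").length > (existing.address ?? "").length
        ? incoming.address
        : existing.address,
    // prefer longest description
    description:
      (incoming.description ?? "").length > (existing.description ?? "").length
        ? incoming.description
        : existing.description,
    areaSqft: existing.areaSqft ?? incoming.areaSqft, // prefer non-null
    amenities: union(existing.amenities, incoming.amenities),
  };
}
```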
Admin Observability
- [ ] Dedup matches visible in admin pipeline activity
- [ ] Cluster view: see which raw listings are grouped together
- [ ] Match confidence scores visible
- [ ] Ability to manually merge/split clusters
- [ ] Per-platform stats in scraper stats dashboard
Design
Dedup Scoring Model
Score = Σ (signal_weight × signal_score)
Signals:
geo_proximity (weight: 0.35) — haversine distance
1.0 if < 50m, linear decay to 0 at 500m
price_match (weight: 0.25) — 1.0 if within 3%, decay to 0 at 15% diff
bedrooms_match (weight: 0.15) — 1.0 if exact, 0.5 if ±1, 0 otherwise
address_sim (weight: 0.15) — normalized Levenshtein on street address
title_sim (weight: 0.10) — token overlap (Jaccard) after stopwords
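The signal weights and decay rules above can be expressed as a pure scoring function. A sketch, with one assumption not stated in the design: a missing signal scores 0 here, which is one possible policy (an alternative is renormalizing the remaining weights).

```typescript
// Pairwise signals between two raw listings; null = signal unavailable.
type Signals = {
  geoMeters: number | null;     // haversine distance
  priceDiffPct: number | null;  // |p1 - p2| / min(p1, p2)
  bedroomsDelta: number | null; // |beds1 - beds2|
  addressSim: number | null;    // normalized Levenshtein similarity, 0..1
  titleSim: number | null;      // Jaccard token overlap after stopwords, 0..1
};

const clamp01 = (x: number) => Math.max(0, Math.min(1, x));

function matchScore(s: Signals): number {
  // geo: 1.0 under 50 m, linear decay to 0 at 500 m
  const geo = s.geoMeters === null ? 0
    : s.geoMeters < 50 ? 1
    : clamp01(1 - (s.geoMeters - 50) / (500 - 50));
  // price: 1.0 within 3%, decay to 0 at 15% difference
  const price = s.priceDiffPct === null ? 0
    : s.priceDiffPct <= 0.03 ? 1
    : clamp01(1 - (s.priceDiffPct - 0.03) / (0.15 - 0.03));
  // bedrooms: exact = 1.0, off by one = 0.5, otherwise 0
  const beds = s.bedroomsDelta === null ? 0
    : s.bedroomsDelta === 0 ? 1
    : s.bedroomsDelta === 1 ? 0.5
    : 0;
  return 0.35 * geo + 0.25 * price + 0.15 * beds
       + 0.15 * (s.addressSim ?? 0) + 0.10 * (s.titleSim ?? 0);
}
```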
Thresholds:
≥ 0.90 → auto-merge (high confidence)
0.80–0.90 → flag for admin review
< 0.80 → treat as separate listing
Schema Changes
scraper.dedupe_clusters — redesign existing table:
id UUID PK
canonical_listing_id UUID FK → public.listings (nullable)
match_method text — "auto" | "manual"
confidence numeric — highest pairwise score
created_at timestamp
updated_at timestamp
scraper.dedupe_cluster_members — new:
id UUID PK
cluster_id UUID FK → dedupe_clusters
raw_listing_id UUID FK → raw_listings
platform text
pairwise_score numeric — score vs primary member
created_at timestamp
public.listing_sources — new junction table:
id UUID PK
listing_id UUID FK → listings
platform text — "kijiji", "kangalou"
platform_id text — external ID on that platform
platform_url text
contributed_fields text[] — which fields came from this source
last_seen_at timestamp
created_at timestamp
Pipeline Flow
Kijiji scraper ──→ normalize.kijiji ──→ importer ──→ [dedup check]
Kangalou scraper → normalize.kangalou ─────↑              │
                                           ┌──────────────┴──────────────┐
                                        no match                    match found
                                           ↓                             ↓
                                      create new                  enrich existing
                                        listing                       listing
                                           ↓                             ↓
                                     listing_sources              listing_sources
                                           ↓                             ↓
                                      image queue                   image queue
                                                                    (new only)
Technical Reference
- Kangalou technical analysis: docs/dev/scraper/kangalou.md
- Existing scraper patterns: services/scraper/src/scrapers/kijiji-continuous.ts, kijiji-parser.ts
- Normalizer: services/scraper/src/consumers/normalizer-consumer.ts
- Importer: services/scraper/src/consumers/importer-consumer.ts
- Existing dedup table: services/scraper/src/db/schema.ts → dedupeClusters
- RabbitMQ topology: services/scraper/src/mq/topology.ts
Discussion Notes
2026-03-17 — Initial planning session:
- Kangalou + dedup must ship together (can't have second scraper without dedup)
- Dedup uses weighted multi-signal scoring because addresses can differ across platforms (different spellings, abbreviations)
- Admin needs visibility into dedup clusters for QA — see which raw listings are grouped, confidence scores
- Cross-source enrichment: merge best data from each platform per field
- Enrichment is a competitive advantage — our listings become richer than any single source
- Amenities normalization will be a separate feature (F-049) but benefits from multi-source data
- Existing dedupe_clusters table needs redesign (currently just a skeleton)
- Need listing_sources junction table to track multi-platform origin per public listing
- Kangalou is HTTP-only (no Playwright), estimated 3-5 days for scraper alone
- Phone numbers on Kangalou are auth-gated — skip initially
Implementation Notes
To be filled during implementation.