20M Rubles a Year on Web Scraping

Web scraping gets discussed mostly in the abstract — as a technique, a legal grey zone, or an AI training problem. We've been running a scraping business for over a decade. Here's what it actually looks like at 20M+ RUB per year in revenue.

What we actually do

xmldatafeed.com collects publicly available pricing and catalog data from Russian e-commerce sites, classified portals, and product databases. We sell this data as structured feeds to retailers, price comparison services, analytics firms, and procurement teams.

The model is B2B subscription. Clients define which categories and sources they need. We collect, clean, normalize, and deliver. Most clients receive daily or hourly updates through an XML or JSON API.
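To make "structured feed" concrete, here is a minimal sketch of one normalized record serialized both ways. The Offer type, its field names, and the element names are assumptions made for this illustration, not the actual xmldatafeed.com schema.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass, asdict


@dataclass
class Offer:
    """One normalized price observation (hypothetical field set)."""
    source: str        # site the offer was collected from
    sku: str           # product identifier on that site
    name: str
    price: str         # decimal string, avoids float rounding issues
    currency: str
    collected_at: str  # ISO 8601 timestamp


def to_json(offers):
    """Serialize offers as a JSON array, keeping non-ASCII names readable."""
    return json.dumps([asdict(o) for o in offers], ensure_ascii=False)


def to_xml(offers):
    """Serialize the same offers as an XML document rooted at <offers>."""
    root = ET.Element("offers")
    for o in offers:
        node = ET.SubElement(root, "offer", source=o.source, sku=o.sku)
        for field in ("name", "price", "currency", "collected_at"):
            ET.SubElement(node, field).text = getattr(o, field)
    return ET.tostring(root, encoding="unicode")
```

Both functions take the same list of records, so a client can switch formats without the collection side changing.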

Revenue structure

The 20M figure represents combined annual revenue from recurring subscriptions. No single client is dominant — the portfolio is diversified across retail, logistics, and market research. The largest segment is e-commerce price intelligence for mid-market retailers.

We've never taken venture capital. The business grew from near zero through organic content, Habr publications, and word of mouth. Paid advertising has never been a meaningful channel for us.

"Growing without paid channels is slower. It's also more durable. Every client who finds you through your own writing is already pre-qualified by what you said."
— from the Habr article series, 2023

The legal question

Every discussion of scraping eventually reaches the same question: is it legal? Our answer: we collect only data that is publicly accessible without authentication, we do not bypass any technical access controls, and we keep crawl rates low enough that they don't affect site performance.
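One simple way to hold a crawler to a conservative rate is a minimum delay between requests to the same host. The sketch below is illustrative only: the HostThrottle class and the 2-second default are assumptions for this example, not figures from our production setup.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds; an assumed value
        self._clock = clock               # injectable so it can be tested
        self._sleep = sleep
        self._last = {}                   # host -> time of the last request

    def wait(self, url):
        """Block until the host of `url` may be hit again; return the delay."""
        host = urlsplit(url).netloc
        last = self._last.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (self._clock() - last))
        if delay > 0:
            self._sleep(delay)
        self._last[host] = self._clock()
        return delay
```

Calling throttle.wait(url) before every request keeps the per-host rate bounded no matter how many URLs are queued, while requests to different hosts do not slow each other down.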

Russian law on publicly available information is reasonably clear. Data published without access restrictions can be collected and redistributed, provided it doesn't contain personal data as defined by Federal Law 152-ФЗ (On Personal Data) and doesn't reproduce copyrighted content verbatim in its entirety.

We've operated this way since the beginning and have never faced a legal challenge. We also don't collect pricing data from sites that explicitly prohibit it in their ToS if those sites are operated by clients in our network — that's a business relationship issue, not just a legal one.

Technical architecture

The stack has evolved significantly over ten years, but the core principle hasn't: reliability beats cleverness. Most of our collection infrastructure is based on simple scheduled crawlers with deterministic retry logic. We use Elasticsearch for normalization and deduplication. Delivery is via a straightforward REST API.
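"Deterministic retry logic" can be as plain as a fixed backoff schedule with no random jitter, so that every failed run can be replayed and reasoned about. The following is a minimal sketch under that assumption. The function names, the attempt count, and the delay values are hypothetical, and the fetch callable stands in for whatever HTTP client is used.

```python
import time


def backoff_schedule(attempts=4, base=1.0, factor=2.0, cap=60.0):
    """Delays (seconds) to wait before each retry: fixed, no jitter."""
    return [min(cap, base * factor ** i) for i in range(attempts - 1)]


def fetch_with_retry(fetch, url, attempts=4, sleep=time.sleep, retry_on=(OSError,)):
    """Call fetch(url), retrying on the listed exceptions.

    The same failure sequence always produces the same waits, and the last
    failure is re-raised instead of being swallowed.
    """
    delays = backoff_schedule(attempts)
    for attempt in range(attempts):
        try:
            return fetch(url)
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the scheduler
            sleep(delays[attempt])
```

With the defaults, a source that fails twice and then succeeds is retried after 1 and 2 seconds. OSError covers ConnectionError and TimeoutError in the standard library, and a real crawler would likely add its HTTP client's own exception types to retry_on.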

We've tried ML-based extraction a few times. It performs well on cleanly structured sites, poorly on everything else, and is difficult to debug when it silently degrades. For production scraping at scale, explicit rules and manually maintained extractors are more reliable than a model you can't interrogate.
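The practical difference is that an explicit rule can fail loudly. Here is a minimal sketch of a rule-based extractor that raises as soon as a rule stops matching or a price stops parsing, instead of returning a plausible but wrong value. The CSS class names, the RULES table, and the price format are assumptions for this example. Real extractors would more likely use an HTML parser than regular expressions.

```python
import re
from decimal import Decimal

# One explicit pattern per field for a hypothetical product page layout.
RULES = {
    "name": re.compile(r'<h1[^>]*class="product-title"[^>]*>(.*?)</h1>', re.S),
    "price": re.compile(r'<span[^>]*class="price"[^>]*>(.*?)</span>', re.S),
}


class ExtractionError(Exception):
    """Raised when a rule no longer matches the page it was written for."""


def parse_price(text):
    """Turn a string like '1 299,00 ₽' into Decimal('1299.00'), or raise."""
    cleaned = re.sub(r"[^\d,.]", "", text).replace(",", ".")
    if not re.fullmatch(r"\d+(\.\d{1,2})?", cleaned):
        raise ExtractionError(f"unparseable price: {text!r}")
    return Decimal(cleaned)


def extract(html, rules=RULES):
    """Apply every rule; any miss is an error, never a silent default.

    Assumes the rule set includes a 'price' field.
    """
    record = {}
    for field, pattern in rules.items():
        match = pattern.search(html)
        if match is None:
            raise ExtractionError(f"rule for {field!r} matched nothing")
        record[field] = match.group(1).strip()
    record["price"] = parse_price(record["price"])
    return record
```

When a site changes its layout, this kind of extractor stops with an error naming the broken field, which is exactly the signal a maintainer needs to fix the rule.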

What doesn't work

Several things we've tried and abandoned:

Automated extraction using vision models — impressive demos, production reliability around 70%, which isn't good enough for billing data
Self-service onboarding — most clients need help configuring what they actually need, not what they think they need
Trying to cover every source — depth and reliability in fewer categories beats breadth across many

What actually works

The business works because the problem is unglamorous and persistent. Every retailer needs to know what competitors charge. That need doesn't go away. The market for reliable, clean, structured price data is not going to be replaced by an AI wrapper — it requires operational infrastructure that most companies don't want to build in-house.

That's the actual product: not clever technology, but operational reliability at a price point that makes in-house alternatives look expensive.