Design Google's Real-Time Bidding (RTB) System

Difficulty: Senior/Principal Level

Understanding the Problem

🎯 What is Google's Real-Time Bidding (RTB) System?

Google's Real-Time Bidding system is a programmatic advertising platform that enables advertisers to compete in real-time auctions for ad inventory across the web. When a user visits a website with Google ad slots, an instantaneous auction occurs where multiple advertisers bid to display their advertisements, with the highest bidder securing the ad placement.

In our design, we will treat this system as one of Google's most sophisticated and revenue-critical platforms: it processes billions of ad auction requests daily while maintaining sub-100ms response times, ensuring fair competition, and maximizing both advertiser value and publisher revenue.

In this breakdown, we'll walk through the system step by step, using proven system design principles as our foundation. While we'll cover more technical depth than is typically required in a single interview session, this detailed analysis helps build a thorough understanding of large-scale real-time auction systems.

[Figure: Complete Google RTB System Architecture Overview]


Functional Requirements

Core Requirements

  • Publishers should be able to create ad campaigns with initial bid amounts, spending limits, and campaign timelines
  • Advertisers should be able to place competitive bids on ad inventory, with bids only accepted when they surpass the current leading bid
  • Users should be able to monitor live campaign activity, including current winning bids and time remaining
  • Users should be able to view the complete bidding history for any campaign in real-time

Out of Scope

  • Advanced campaign performance analytics and ROI tracking
  • Automated bid optimization using machine learning algorithms
  • Cross-platform campaign synchronization and management
  • Fraud detection and click validation systems

Non-Functional Requirements

When discussing system requirements with your interviewer, it's essential to establish scale expectations upfront. These requirements will drive fundamental architectural decisions and help prioritize which technical challenges to address first.

Core Requirements

  • The system should scale to handle 10 million concurrent advertising campaigns worldwide
  • The system should provide real-time bid updates so advertisers can make informed bidding decisions
  • The system should guarantee fault tolerance and data durability - no bid data can be lost under any circumstances
  • The system should ensure strong bid consistency so all participants see identical current winning bid information

Out of Scope

  • Advanced security features including encryption and access control
  • Comprehensive monitoring, alerting, and observability infrastructure
  • Automated testing frameworks and continuous deployment pipelines

Core Entities of our System

Before diving into system architecture, we need to establish the fundamental data models that will drive our RTB platform. We'll start with a conceptual overview and progressively add implementation details as we develop our API contracts. These entities form the backbone of our entire system design and will help with the API design.

Our RTB platform centers around five essential entities that fulfill our core functional requirements:

  • Publisher: Represents websites, apps, or content platforms that create advertising campaigns on our platform
  • Advertiser: Represents companies or individuals who purchase ad space through our platform, including account information, billing details, and verification status
  • AdCampaign: Encompasses advertising initiatives with budget constraints, targeting parameters, and performance metrics
  • Bid: Captures individual bidding attempts by advertisers (bidders) with amounts, timestamps, and processing status
  • AdInventory: Defines available advertising slots with placement specifications, audience data, and pricing information

The decision to separate these entities provides several architectural benefits:

  • Scalability: Each entity can be optimized and scaled independently based on access patterns
  • Flexibility: Inventory slots can be reused across different campaign cycles without data duplication
  • Maintainability: Updates to inventory specifications don't impact existing campaign configurations
  • Extensibility: Future features like advanced targeting or performance analytics can be added without major schema changes

During interviews, both normalized and denormalized approaches are valid - the critical aspect is articulating your design rationale and trade-offs clearly.

Let's define these entities with specific fields to guide our implementation:

// Detailed entity definitions for implementation
interface Publisher {
  publisherId: string;
  websiteName: string;
  domain: string;
  email: string;
  accountBalance: number;
  isVerified: boolean;
  monthlyTraffic: number;
  createdAt: Date;
}

interface Advertiser {
  advertiserId: string;
  companyName: string;
  email: string;
  accountBalance: number;
  creditLimit: number;
  isVerified: boolean;
  createdAt: Date;
}

interface AdCampaign {
  campaignId: string;
  publisherId: string;
  name: string;
  startingBid: number;
  currentMaxBid: number;
  dailyBudget: number;
  totalBudget: number;
  spentToday: number;
  totalSpent: number;
  startDate: Date;
  endDate: Date;
  status: 'ACTIVE' | 'PAUSED' | 'COMPLETED' | 'BUDGET_EXHAUSTED';
  inventoryId: string;
  createdAt: Date;
  updatedAt: Date;
}

interface AdInventory {
  inventoryId: string;
  publisherId: string;
  slotDimensions: string;        // "728x90", "300x250", etc.
  slotPosition: string;          // "above-fold", "sidebar", "footer"
  websiteCategory: string;       // "technology", "sports", "news"
  targetAudience: {
    demographics: string[];      // ["25-34", "35-44"]
    interests: string[];         // ["technology", "gaming"]
    geoLocations: string[];      // ["US", "CA", "UK"]
  };
  minimumBid: number;
  estimatedImpressions: number;
  createdAt: Date;
}

interface Bid {
  bidId: string;
  campaignId: string;
  advertiserId: string;
  amount: number;
  status: 'ACCEPTED' | 'REJECTED' | 'OUTBID';
  bidTime: Date;
  ipAddress?: string;
  userAgent?: string;
}

API or System Interface

Before diving into architectural details, we need to establish clear API contracts that define how clients interact with our RTB system. These interfaces will guide our implementation and ensure we address all functional requirements systematically.

For publishers creating ad campaigns, we need a POST endpoint that takes the campaign details and returns the created campaign:

POST /campaigns -> AdCampaign & AdInventory
Authorization: Bearer <publisher_jwt_token>
{
  "name": "Summer Tech Sale Campaign",
  "startingBid": 2.50,
  "dailyBudget": 1000.00,
  "totalBudget": 10000.00,
  "startDate": "2024-06-01T00:00:00Z",
  "endDate": "2024-06-30T23:59:59Z",
  "inventoryId": "inv_tech_banner_001"
}

// Response
{
  "campaignId": "camp_abc123",
  "status": "ACTIVE",
  "currentMaxBid": 2.50,
  "inventory": {
    "slotDimensions": "728x90",
    "websiteCategory": "technology",
    "estimatedImpressions": 50000
  }
}

For advertisers placing bids on campaigns, we need a POST endpoint that takes the bid details and returns the created bid:

POST /campaigns/:campaignId/bids -> Bid
Authorization: Bearer <advertiser_jwt_token>
{
  "amount": 3.75
}

// Response
{
  "bidId": "bid_def456",
  "status": "ACCEPTED",
  "previousMaxBid": 2.50,
  "newMaxBid": 3.75,
  "bidTime": "2024-06-15T14:30:00Z",
  "isCurrentWinner": true
}

For viewing campaigns, we need a GET endpoint that takes a campaignId and returns the campaign and inventory details:

GET /campaigns/:campaignId -> AdCampaign & AdInventory

// Response
{
  "campaign": {
    "campaignId": "camp_abc123",
    "name": "Summer Tech Sale Campaign",
    "currentMaxBid": 3.75,
    "currentWinner": "adv_xyz789",
    "totalBids": 47,
    "timeRemaining": "14d 8h 30m",
    "status": "ACTIVE"
  },
  "inventory": {
    "slotDimensions": "728x90",
    "websiteCategory": "technology",
    "targetAudience": {
      "demographics": ["25-34", "35-44"],
      "interests": ["technology", "gadgets"]
    }
  }
}

For viewing campaign bidding history, we need a GET endpoint that returns paginated bid records:

GET /campaigns/:campaignId/bids?limit=50&offset=0 -> BidHistory

// Response
{
  "bids": [
    {
      "bidId": "bid_def456",
      "amount": 3.75,
      "bidder": "advertiser_xyz",
      "status": "ACCEPTED",
      "bidTime": "2024-06-15T14:30:00Z",
      "isCurrentWinner": true
    },
    {
      "bidId": "bid_ghi789",
      "amount": 3.25,
      "bidder": "advertiser_abc",
      "status": "OUTBID",
      "bidTime": "2024-06-15T14:15:00Z",
      "isCurrentWinner": false
    }
  ],
  "pagination": {
    "total": 47,
    "limit": 50,
    "offset": 0,
    "hasMore": false
  },
  "realTimeUpdates": {
    "websocketUrl": "/campaigns/camp_abc123/bid-stream"
  }
}

High-Level Design

1) Publishers should be able to create ad campaigns with initial bid amounts, spending limits, and campaign timelines

The foundation of our RTB system begins with campaign creation capabilities. Publishers initiate this process through our /campaigns endpoint, providing comprehensive campaign specifications including targeting preferences and budget parameters.

Our initial architecture establishes the communication pathways between client applications and our backend microservices. The "Campaign Service" serves as our primary component, interfacing with persistent storage to manage campaign and inventory data as defined in our entity model. This service specializes in campaign lifecycle management and data retrieval operations.

[Figure: RTB Campaign Creation Workflow - High-Level Design]

Client: Publishers will interact with the system through the client's website or app. All client requests will be routed to the system's backend through an API Gateway.

API Gateway: The API Gateway handles requests from the client, including authentication, rate limiting, and, most importantly, routing to the appropriate service.

Campaign Service: The Campaign Service, at this point, is just a thin wrapper around the database. It takes the campaign details from the request, validates them, and stores them in the database.

Database: The Database stores tables for campaigns and ad inventory.

While this represents a straightforward CRUD implementation, it's valuable to trace the complete data flow for campaign creation to ensure clarity.

Campaign Creation Flow:

  1. Client application (publisher) submits POST request to /campaigns containing campaign specifications
  2. API Gateway authenticates the request and forwards it to the Campaign Service
  3. Campaign Service performs validation and persists campaign and inventory data to the database
  4. Database commits the transaction and returns confirmation to the service layer

Implementation Note: We deliberately exclude ad creative management from this discussion as it represents a separate concern. In production systems, creative assets would typically reside in object storage (S3, GCS) with URL references stored in the campaign metadata. This separation allows for independent scaling of creative delivery and campaign management. During interviews, acknowledging this separation demonstrates awareness of system boundaries and helps maintain focus on core RTB mechanics.

Here's what the database schema might look like for this basic CRUD functionality:

-- Database schema for campaign management
CREATE TABLE publishers (
    publisher_id VARCHAR(36) PRIMARY KEY,
    website_name VARCHAR(255) NOT NULL,
    domain VARCHAR(255) UNIQUE NOT NULL,
    email VARCHAR(255) UNIQUE NOT NULL,
    account_balance DECIMAL(12,2) DEFAULT 0,
    monthly_traffic INTEGER DEFAULT 0,
    is_verified BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE advertisers (
    advertiser_id VARCHAR(36) PRIMARY KEY,
    company_name VARCHAR(255) NOT NULL,
    email VARCHAR(255) UNIQUE NOT NULL,
    account_balance DECIMAL(12,2) DEFAULT 0,
    credit_limit DECIMAL(12,2) DEFAULT 0,
    is_verified BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE ad_inventory (
    inventory_id VARCHAR(36) PRIMARY KEY,
    publisher_id VARCHAR(36) NOT NULL,
    slot_dimensions VARCHAR(20) NOT NULL,
    slot_position VARCHAR(50) NOT NULL,
    website_category VARCHAR(100) NOT NULL,
    target_audience JSONB NOT NULL,
    minimum_bid DECIMAL(10,2) NOT NULL,
    estimated_impressions INTEGER DEFAULT 0,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE campaigns (
    campaign_id VARCHAR(36) PRIMARY KEY,
    publisher_id VARCHAR(36) REFERENCES publishers(publisher_id),
    inventory_id VARCHAR(36) REFERENCES ad_inventory(inventory_id),
    name VARCHAR(255) NOT NULL,
    starting_bid DECIMAL(10,2) NOT NULL,
    current_max_bid DECIMAL(10,2) NOT NULL,
    daily_budget DECIMAL(12,2) NOT NULL,
    total_budget DECIMAL(12,2) NOT NULL,
    spent_today DECIMAL(12,2) DEFAULT 0,
    total_spent DECIMAL(12,2) DEFAULT 0,
    start_date TIMESTAMP NOT NULL,
    end_date TIMESTAMP NOT NULL,
    status VARCHAR(20) DEFAULT 'ACTIVE',
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Indexes for performance
CREATE INDEX idx_campaigns_publisher ON campaigns(publisher_id);
CREATE INDEX idx_campaigns_status ON campaigns(status);
CREATE INDEX idx_campaigns_end_date ON campaigns(end_date);
CREATE INDEX idx_inventory_category ON ad_inventory(website_category);
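
Putting this flow together, here is a minimal sketch of the Campaign Service's create handler. It assumes a Flask app and a psycopg2 connection; the X-Publisher-Id header stands in for identity extracted from the JWT, and the validation rules are illustrative rather than exhaustive.

# Minimal Campaign Service create handler (sketch; assumes Flask + psycopg2)
import uuid
from flask import Flask, request, jsonify
import psycopg2

app = Flask(__name__)

def get_db_connection():
    # Connection string is a placeholder
    return psycopg2.connect("dbname=rtb user=rtb_service")

@app.route("/campaigns", methods=["POST"])
def create_campaign():
    payload = request.get_json()

    # Basic validation mirroring the functional requirements
    if payload["startingBid"] <= 0 or payload["totalBudget"] < payload["dailyBudget"]:
        return jsonify({"error": "Invalid bid or budget values"}), 400

    campaign_id = f"camp_{uuid.uuid4().hex[:12]}"
    conn = get_db_connection()
    try:
        with conn, conn.cursor() as cur:
            # Verify the referenced inventory exists and the bid clears its floor
            cur.execute(
                "SELECT minimum_bid FROM ad_inventory WHERE inventory_id = %s",
                (payload["inventoryId"],),
            )
            row = cur.fetchone()
            if row is None or payload["startingBid"] < float(row[0]):
                return jsonify({"error": "Unknown inventory or bid below minimum"}), 400

            # Persist the campaign; current_max_bid starts at the starting bid
            cur.execute(
                """
                INSERT INTO campaigns (campaign_id, publisher_id, inventory_id, name,
                                       starting_bid, current_max_bid, daily_budget,
                                       total_budget, start_date, end_date)
                VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
                """,
                (campaign_id, request.headers.get("X-Publisher-Id"),
                 payload["inventoryId"], payload["name"],
                 payload["startingBid"], payload["startingBid"],
                 payload["dailyBudget"], payload["totalBudget"],
                 payload["startDate"], payload["endDate"]),
            )
    finally:
        conn.close()

    return jsonify({"campaignId": campaign_id, "status": "ACTIVE",
                    "currentMaxBid": payload["startingBid"]}), 201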

2) Advertisers should be able to place competitive bids on ad inventory, with bids only accepted when they surpass the current leading bid

The bidding mechanism represents the most technically challenging aspect of our RTB system, requiring careful consideration of concurrency, consistency, and real-time communication patterns. This component will demand the majority of our architectural attention and design effort.

Our bidding infrastructure requires a specialized "Bidding Service" that operates independently from campaign management. This dedicated service handles:

  • Bid validation and amount verification against current winning bids
  • Atomic campaign state updates with new maximum bid values
  • Comprehensive bid audit trail maintenance
  • Real-time participant notification systems

Service Separation Rationale:

  • Traffic Volume Asymmetry: Bidding operations typically generate 100x more requests than campaign creation, necessitating independent scaling strategies
  • Domain Complexity: Bidding involves sophisticated concurrency control, race condition prevention, and real-time coordination that warrants isolated business logic
  • Performance Specialization: Bidding services require optimization for high-throughput write operations, while campaign services focus on read-heavy access patterns

[Figure: RTB Bidding Service - Competitive Bid Processing Architecture]

Basic Bidding Flow:

  1. Client application (advertiser) submits POST request to /campaigns/:campaignId/bids containing bid specifications
  2. API Gateway authenticates and routes the request to the Bidding Service
  3. Bidding Service retrieves current maximum bid from database, validates the new bid amount, and persists the bid record with appropriate status ("ACCEPTED" for valid bids exceeding current maximum, "REJECTED" otherwise)
  4. Database stores all bid attempts in dedicated bids table with campaign relationship maintained via campaignId foreign key

Critical Design Decision: Comprehensive Bid Auditing

A common architectural mistake involves storing only the current maximum bid value directly on the campaign entity. While this approach reduces storage complexity, it fundamentally violates data integrity principles by destroying historical information.

Eliminating bid history creates several critical problems:

  • Dispute Resolution: Advertisers will inevitably contest bid processing decisions, requiring complete audit trails for verification
  • Fraud Detection: Suspicious bidding patterns can only be identified through historical analysis
  • Business Intelligence: Campaign performance optimization depends on understanding bidding behavior over time
  • Regulatory Compliance: Financial transaction systems typically require comprehensive audit capabilities

The additional storage and complexity costs of maintaining complete bid history are minimal compared to the business risks of data loss.

Here's the complete database schema for our bidding system:

-- Complete bidding system schema
CREATE TABLE bids (
    bid_id VARCHAR(36) PRIMARY KEY,
    campaign_id VARCHAR(36) REFERENCES campaigns(campaign_id),
    advertiser_id VARCHAR(36) REFERENCES advertisers(advertiser_id),
    amount DECIMAL(10,2) NOT NULL,
    status VARCHAR(20) NOT NULL, -- 'ACCEPTED', 'REJECTED', 'OUTBID'
    bid_time TIMESTAMP DEFAULT NOW(),
    ip_address INET,
    user_agent TEXT,
    
    -- Audit fields
    processed_at TIMESTAMP,
    processing_duration_ms INTEGER,
    previous_max_bid DECIMAL(10,2),
    
    CONSTRAINT valid_bid_amount CHECK (amount > 0),
    CONSTRAINT valid_status CHECK (status IN ('ACCEPTED', 'REJECTED', 'OUTBID'))
);

-- Critical indexes for performance
CREATE INDEX idx_bids_campaign_time ON bids(campaign_id, bid_time DESC);
CREATE INDEX idx_bids_advertiser ON bids(advertiser_id);
CREATE INDEX idx_bids_status ON bids(status);
CREATE INDEX idx_bids_amount ON bids(campaign_id, amount DESC);

-- Partial index for active bids only (performance optimization)
CREATE INDEX idx_active_bids ON bids(campaign_id, amount DESC) 
WHERE status = 'ACCEPTED';

And here's how the bidding service logic would look:

import time

# BidResult is assumed to be a simple result object with status, reason,
# bid_id, previous_max, and new_max fields.
class BiddingService:
    def __init__(self, db_connection, cache_client):
        self.db = db_connection
        self.cache = cache_client
        
    def place_bid(self, campaign_id: str, advertiser_id: str, amount: float) -> BidResult:
        start_time = time.time()
        
        try:
            # 1. Validate campaign is active and accepting bids
            campaign = self.get_campaign(campaign_id)
            if not self.is_campaign_active(campaign):
                return BidResult(status='REJECTED', reason='Campaign not active')
            
            # 2. Check advertiser has sufficient funds
            if not self.check_advertiser_funds(advertiser_id, amount):
                return BidResult(status='REJECTED', reason='Insufficient funds')
            
            # 3. Get current max bid (this is where consistency matters)
            current_max = self.get_current_max_bid(campaign_id)
            
            # 4. Validate bid amount
            if amount <= current_max:
                return BidResult(status='REJECTED', reason=f'Bid must exceed ${current_max}')
            
            # 5. Attempt to place bid (atomic operation)
            bid_result = self.atomic_bid_placement(
                campaign_id, advertiser_id, amount, current_max
            )
            
            # 6. If successful, update campaign max bid and notify
            if bid_result.status == 'ACCEPTED':
                self.update_campaign_max_bid(campaign_id, amount)
                self.notify_bid_update(campaign_id, amount, advertiser_id)
            
            return bid_result
            
        except Exception as e:
            self.log_error(f"Bid placement failed: {e}")
            return BidResult(status='ERROR', reason='Internal server error')
        
        finally:
            processing_time = (time.time() - start_time) * 1000
            self.record_metrics(campaign_id, processing_time)
    
    def atomic_bid_placement(self, campaign_id: str, advertiser_id: str, 
                           amount: float, expected_max: float) -> BidResult:
        """
        Atomic bid placement using optimistic concurrency control
        """
        with self.db.transaction():
            # Re-check max bid within transaction
            current_max = self.get_current_max_bid_for_update(campaign_id)
            
            if current_max != expected_max:
                # Someone else bid in the meantime, retry
                return BidResult(status='RETRY', reason='Concurrent bid detected')
            
            if amount <= current_max:
                return BidResult(status='REJECTED', reason='Bid too low')
            
            # Insert the bid
            bid_id = self.insert_bid(campaign_id, advertiser_id, amount, 'ACCEPTED')
            
            # Mark previous winning bid as 'OUTBID'
            self.mark_previous_bids_outbid(campaign_id, amount)
            
            return BidResult(
                status='ACCEPTED',
                bid_id=bid_id,
                previous_max=current_max,
                new_max=amount
            )

3) Users should be able to monitor live campaign activity, including current winning bids and time remaining

Campaign monitoring serves two distinct user scenarios with different technical requirements:

  1. Discovery Mode: Users browsing available inventory to assess bidding opportunities (read-only access)
  2. Active Bidding: Users preparing to place competitive bids requiring current market information (real-time consistency)

While both scenarios involve campaign data retrieval, the second demands significantly more sophisticated data freshness guarantees to prevent user frustration and bidding errors.

The detailed real-time update mechanisms will be covered in our deep dive sections, but we'll establish the foundational approach for maintaining reasonably current bid information.

[Figure: RTB Campaign Monitoring - Live Bid Status and Activity Tracking]

Initial Campaign Load: Users initiate campaign viewing through GET requests to /campaigns/:campaignId, receiving comprehensive campaign details including inventory specifications, current bid status, and timeline information.

Continuous Data Freshness: The critical challenge emerges after initial page load. Static bid information quickly becomes stale in active campaigns, leading to user confusion when their "winning" bids are rejected due to outdated market data.

Our baseline solution implements periodic polling every few seconds to refresh maximum bid values. While this approach introduces some latency and doesn't guarantee perfect consistency, it significantly reduces the probability of bid rejection due to stale data, providing an acceptable user experience for most scenarios.
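
To make this concrete, here is a simplified polling sketch against the GET /campaigns/:campaignId endpoint defined earlier. A browser client would normally do this in JavaScript with setInterval; the equivalent logic is shown in Python, and the host name is a placeholder.

# Baseline approach: poll the campaign endpoint every few seconds (sketch)
import time
import requests

API_BASE = "https://rtb.example.com"  # placeholder host

def poll_campaign(campaign_id: str, interval_seconds: float = 3.0):
    last_seen_max = None
    while True:
        resp = requests.get(f"{API_BASE}/campaigns/{campaign_id}", timeout=5)
        resp.raise_for_status()
        current_max = resp.json()["campaign"]["currentMaxBid"]

        # Only re-render when the value actually changed
        if current_max != last_seen_max:
            print(f"Current max bid is now ${current_max}")
            last_seen_max = current_max

        time.sleep(interval_seconds)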


4) Users should be able to view the complete bidding history for any campaign in real-time

Comprehensive bid history access serves multiple critical business functions: transparency for participants, dispute resolution capabilities, and competitive intelligence gathering. This requirement demands both historical data retrieval and real-time updates as new bids arrive.

Bidding History Requirements:

  • Complete Audit Trail: All bid attempts (accepted, rejected, outbid) with timestamps and amounts
  • Real-time Updates: Live streaming of new bids as they occur during active campaigns
  • Participant Information: Advertiser details (respecting privacy constraints) and bid status
  • Performance Metrics: Processing times, bid validation results, and system health indicators

API Design:

// Get paginated bid history
GET /campaigns/:campaignId/bids?limit=50&offset=0

// Real-time bid updates via WebSocket
WebSocket /campaigns/:campaignId/bid-stream

Implementation Approach: The bid history functionality leverages our existing bid storage infrastructure while adding real-time streaming capabilities. Since we already maintain comprehensive bid records for consistency and auditing purposes, the primary challenge involves efficient data retrieval and real-time distribution.

Data Flow:

  1. Historical Retrieval: Paginated queries against the bids table with appropriate indexing for performance
  2. Real-time Streaming: WebSocket connections that receive bid updates from our existing notification system
  3. Privacy Filtering: Advertiser information anonymization based on campaign settings and user permissions

This functionality integrates seamlessly with our existing bidding infrastructure, requiring minimal additional complexity while providing significant value for system transparency and user experience.
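
As a sketch of the historical retrieval path, the paginated query below leans on the idx_bids_campaign_time index defined earlier; the connection handling is illustrative.

# Paginated bid history lookup using the (campaign_id, bid_time DESC) index (sketch)
def get_bid_history(conn, campaign_id: str, limit: int = 50, offset: int = 0):
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT bid_id, advertiser_id, amount, status, bid_time
            FROM bids
            WHERE campaign_id = %s
            ORDER BY bid_time DESC
            LIMIT %s OFFSET %s
            """,
            (campaign_id, limit, offset),
        )
        rows = cur.fetchall()

        cur.execute("SELECT COUNT(*) FROM bids WHERE campaign_id = %s", (campaign_id,))
        total = cur.fetchone()[0]

    bids = [
        {"bidId": r[0], "bidder": r[1], "amount": float(r[2]),
         "status": r[3], "bidTime": r[4].isoformat()}
        for r in rows
    ]
    return {
        "bids": bids,
        "pagination": {"total": total, "limit": limit, "offset": offset,
                       "hasMore": offset + len(bids) < total},
    }

At very high bid volumes, cursor-based pagination keyed on bid_time would avoid deep OFFSET scans, but simple offset pagination matches the API contract shown earlier.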

Here is how our high-level system architecture looks at this point. Keep in mind that this is a simplified version that will be improved dramatically in the next sections:

[Figure: RTB Bid History Service - Historical Data and Real-Time Streaming]


Advanced Technical Deep Dives

With our foundational architecture established, we must now address the complex technical challenges that differentiate senior-level system design interviews. These deep-dive discussions typically reflect the seniority expectations for the role, and while we'll cover the most critical architectural decisions here, additional explorations may emerge based on interview dynamics.

1) How do we guarantee bid consistency under high concurrency?

Maintaining accurate bid state represents the cornerstone challenge in our RTB platform design. Consider this scenario that illustrates why robust synchronization mechanisms are absolutely essential.

Concurrency Conflict Example

Initial State:

  • Active campaign shows maximum bid of $25
  • Advertiser Alpha places bid for $150
  • Advertiser Beta simultaneously submits bid for $45

Without adequate concurrency protection, this dangerous execution sequence becomes possible:

Step 1 - Parallel Database Reads:

  • Alpha's bid request retrieves current maximum as $25
  • Beta's bid request simultaneously reads the same stale maximum of $25
  • Both requests see identical database state at the same moment

Step 2 - Independent Validation:

  • Alpha's system validates: $150 > $25 ✓ (bid should be accepted)
  • Beta's system validates: $45 > $25 ✓ (bid should be accepted)
  • Both validations pass using the same outdated reference point

Step 3 - Concurrent Transaction Commits:

  • Alpha's transaction commits: Updates maximum bid to $150, marks bid as ACCEPTED
  • Beta's transaction commits: Updates maximum bid to $45, marks bid as ACCEPTED
  • Database now contains contradictory state

⚠️ Critical System Failure

Our platform now contains contradictory winning states:

  • Advertiser Alpha believes they won with $150 (factually correct)
  • Advertiser Beta believes they won with $45 (incorrectly processed due to race condition)

This concurrency failure violates fundamental data consistency requirements. Beta's submission should have been immediately rejected against Alpha's superior $150 offer, but inadequate synchronization permitted both operations to complete using stale data references.

Let's examine several approaches to resolve this consistency challenge:

❌ Problematic Approach: Broad Row Locking

A common initial strategy involves extensive row-level locking to retrieve current maximum values. This approach attempts to serialize all bid processing for a campaign to prevent concurrent modifications. The implementation would proceed as follows:

  1. Initialize database transaction for atomicity guarantees
  2. Acquire exclusive locks on all campaign bid records using SELECT ... FOR UPDATE
  3. Calculate current maximum bid from locked dataset
  4. Validate incoming bid amount against computed maximum
  5. Insert new bid record if validation succeeds
  6. Complete transaction to persist all changes

The SQL implementation might look like the following:

START TRANSACTION;

-- Lock every existing bid row for this campaign (the problematic part)
SELECT bid_amount
FROM campaign_bids
WHERE campaign_uuid = ?
FOR UPDATE;

-- Insert the new bid only if it exceeds the maximum of the locked rows
INSERT INTO campaign_bids (campaign_uuid, bidder_id, bid_amount, created_timestamp)
SELECT ?, ?, ?, CURRENT_TIMESTAMP
WHERE ? > (
    SELECT COALESCE(MAX(bid_amount), 0)
    FROM campaign_bids
    WHERE campaign_uuid = ?
)
RETURNING bid_uuid;

COMMIT;

Problems with this approach
This extensive locking strategy creates severe system bottlenecks:

Throughput Degradation: Locking entire bid collections for individual campaigns forces sequential processing, eliminating parallelism benefits. As campaign volume and bidding frequency increase, lock contention creates cascading performance failures that can escalate to full table locks under heavy load conditions.

Bad User Experience: Extended lock hold times translate directly to response time increases and timeout failures during bid submission. Real-time advertising platforms require sub-second response guarantees that become impossible under this serialized processing model.

This approach fundamentally violates database locking best practices by: a) locking excessive row counts unnecessarily, and b) maintaining locks for extended durations during computational operations.

⚠️ Improved Approach: External Caching Layer

Recognizing the performance bottleneck of extensive row locking, many engineers pivot toward in-memory caching solutions. This approach acknowledges that repeatedly querying the database for maximum bid calculations while holding locks creates the primary performance constraint.

By leveraging Redis for maximum bid caching, the data flow transforms significantly:

  1. Query Redis for current campaign maximum
  2. Atomically update Redis with new maximum if bid qualifies
  3. Persist bid record to primary database with appropriate status

Emerging Complexities
While this resolves database lock contention, it introduces distributed state consistency challenges:

  1. Atomic Cache Operations: Redis must handle read-modify-write sequences atomically to prevent race conditions. Redis's single-threaded execution model supports this through Lua scripting:
-- Atomic bid validation and update script
local campaign_key = KEYS[1]
local bid_amount = tonumber(ARGV[1])
local current_maximum = tonumber(redis.call('GET', campaign_key) or '0')

if bid_amount > current_maximum then
    redis.call('SET', campaign_key, bid_amount)
    redis.call('EXPIRE', campaign_key, 86400)  -- 24-hour TTL
    return {1, current_maximum, bid_amount}  -- Success response
else
    return {0, current_maximum, bid_amount}  -- Rejection response
end
  2. Cross-System Consistency: Maintaining synchronization between Redis and PostgreSQL introduces several architectural choices:
    • Implement two-phase commit protocols for strict consistency (high complexity, performance overhead)
    • Treat Redis as authoritative during active campaigns with asynchronous database reconciliation (consistency-performance tradeoff)
    • Apply compensating transaction patterns with rollback mechanisms for cache-database conflicts

Needing distributed transactions here is usually a sign of unnecessary architectural complexity (something interviewers are likely to probe), and it points toward redesigning the system to eliminate the cross-system consistency requirement altogether.

✅ Optimal Approach: Consolidated Database State

Rather than managing consistency across separate systems, we can consolidate maximum bid storage within our primary database schema. This approach treats the campaigns table as both the authoritative state store and our performance cache layer.

The streamlined execution flow becomes:

  1. Acquire exclusive lock on single campaign record
  2. Read current maximum from campaigns table
  3. Validate and persist bid record with computed status
  4. Update campaign maximum if new bid qualifies
  5. Release lock and complete transaction
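
A minimal sketch of this flow, assuming a psycopg2-style connection and the campaigns/bids tables from the high-level design (budget checks and error handling omitted):

# Single-row pessimistic locking: lock only the campaign row, never the bid set (sketch)
import uuid

def place_bid_with_row_lock(conn, campaign_id: str, advertiser_id: str, amount: float):
    with conn, conn.cursor() as cur:
        # 1. Lock just the one campaign row
        cur.execute(
            "SELECT current_max_bid FROM campaigns WHERE campaign_id = %s FOR UPDATE",
            (campaign_id,),
        )
        current_max = float(cur.fetchone()[0])

        # 2. Validate against the locked value
        status = "ACCEPTED" if amount > current_max else "REJECTED"

        # 3. Record the bid attempt either way, preserving the audit trail
        cur.execute(
            "INSERT INTO bids (bid_id, campaign_id, advertiser_id, amount, status) "
            "VALUES (%s, %s, %s, %s, %s)",
            (str(uuid.uuid4()), campaign_id, advertiser_id, amount, status),
        )

        # 4. Update the campaign maximum only for accepted bids
        if status == "ACCEPTED":
            cur.execute(
                "UPDATE campaigns SET current_max_bid = %s, updated_at = NOW() "
                "WHERE campaign_id = %s",
                (amount, campaign_id),
            )

    # 5. The lock is released when the transaction commits on context exit
    return status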

This dramatically reduces lock scope to a single row with minimal hold duration. For scenarios requiring lock-free operations, optimistic concurrency control provides an elegant alternative.

Optimistic concurrency excels in our RTB context because simultaneous bid conflicts remain statistically uncommon. The implementation pattern:

  1. Read campaign record including current maximum (serves as version identifier)
  2. Validate incoming bid amount against retrieved maximum
  3. Attempt conditional update using original maximum as constraint:
UPDATE rtb_campaigns 
SET current_maximum_bid = ?, 
    last_modified = CURRENT_TIMESTAMP,
    bid_count = bid_count + 1
WHERE campaign_uuid = ? 
  AND current_maximum_bid = ?
  AND campaign_status = 'ACTIVE';
  4. Retry entire sequence if update affects zero rows (indicating concurrent modification)

This eliminates locking entirely while preserving consistency guarantees, accepting occasional retry overhead when conflicts materialize.
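
As a sketch, the application-side retry loop around that conditional UPDATE might look like the following (it assumes the rtb_campaigns naming used above and a psycopg2-style connection; the retry limit is an arbitrary illustrative choice):

# Optimistic concurrency control with bounded retries (sketch)
def place_bid_optimistic(conn, campaign_uuid: str, amount: float, max_retries: int = 3):
    for _attempt in range(max_retries):
        with conn.cursor() as cur:
            # 1. Read the current maximum; it doubles as our version check
            cur.execute(
                "SELECT current_maximum_bid FROM rtb_campaigns WHERE campaign_uuid = %s",
                (campaign_uuid,),
            )
            current_max = float(cur.fetchone()[0])

            # 2. Validate before attempting the write
            if amount <= current_max:
                return {"status": "REJECTED", "current_max": current_max}

            # 3. Conditional update: succeeds only if nobody changed the max in between
            cur.execute(
                """
                UPDATE rtb_campaigns
                SET current_maximum_bid = %s,
                    last_modified = CURRENT_TIMESTAMP,
                    bid_count = bid_count + 1
                WHERE campaign_uuid = %s
                  AND current_maximum_bid = %s
                  AND campaign_status = 'ACTIVE'
                """,
                (amount, campaign_uuid, current_max),
            )
            if cur.rowcount == 1:
                # The bid record insert would share this transaction in a full implementation
                conn.commit()
                return {"status": "ACCEPTED", "previous_max": current_max}

            # 4. Zero rows affected: a concurrent bid won the race; roll back and retry
            conn.rollback()

    return {"status": "RETRY_EXHAUSTED"}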


2) How do we design for fault tolerance and data durability?

Bid loss represents an unacceptable failure mode for our RTB platform. Consider the business impact: notifying an advertiser that their winning bid "disappeared" during processing would fundamentally undermine platform credibility. We must architect absolute durability guarantees and ensure complete bid processing even during catastrophic system failures.

Our solution centers on immediate bid persistence through durable message queuing, providing multiple architectural advantages:

Immediate Persistence: Upon bid receipt, we instantly commit the bid to persistent storage within our message queue infrastructure. This ensures bid survival even during complete service failures. Consider this analogous to a bank immediately recording deposit slips - the transaction record persists regardless of subsequent processing delays.

Traffic Surge Absorption: High-profile campaigns often experience exponential bid volume increases during closing periods. Without proper buffering, our system faces impossible choices:

  • Reject valid bids (business failure)
  • System collapse under load (operational failure)
  • Massive resource over-provisioning (economic inefficiency)

Message queues provide elastic buffering capacity, gracefully handling traffic surges by temporarily storing excess bids while maintaining processing guarantees. This transforms unpredictable traffic spikes into manageable, distributed processing workloads.

Temporal Ordering Guarantees: Message queues (especially Kafka) provide strict ordering guarantees within partitions. This proves critical for fairness - when multiple advertisers submit identical bid amounts, temporal precedence determines the winner. Queue-based ordering eliminates complex timestamp comparison logic.

[Figure: RTB Fault Tolerance - Kafka Queue Architecture for Data Durability]

Our implementation leverages Kafka for message queue functionality. While alternatives like RabbitMQ or Amazon SQS offer similar capabilities, Kafka provides optimal characteristics for our requirements:

Massive Throughput Capacity: Kafka clusters routinely handle millions of messages per second, easily accommodating our peak bidding volumes during major advertising events.

Multi-Level Durability: Kafka provides configurable replication across multiple brokers with synchronous acknowledgment policies, ensuring zero message loss even during broker failures.

Horizontal Partitioning: Campaign-based partitioning enables parallel processing across multiple consumer instances while maintaining per-campaign ordering guarantees.

The operational flow proceeds as follows:

  1. Advertiser submits bid via client application
  2. API Gateway routes to Kafka producer for immediate persistence
  3. Kafka confirms durable storage and returns receipt to advertiser
  4. Bidding Service consumes messages asynchronously for processing
  5. Valid bids receive database persistence with status updates
  6. Failed processing attempts remain in Kafka for retry handling

Here's our Kafka implementation approach for resilient bid processing:

# Kafka Producer for reliable bid ingestion
class RTBMessageProducer:
    def __init__(self, cluster_config):
        self.kafka_client = KafkaProducer(
            bootstrap_servers=cluster_config['brokers'],
            value_serializer=self._serialize_message,
            key_serializer=str.encode,
            acks='all',  # Require all in-sync replicas
            retries=5,
            batch_size=32768,
            linger_ms=10,  # Micro-batching for throughput
            compression_type='lz4'
        )
    
    def ingest_bid_request(self, bid_payload):
        # Campaign ID serves as partition key for ordering
        routing_key = bid_payload['campaign_uuid']
        
        # Enrich with processing metadata
        enriched_message = {
            'request_id': self._generate_request_id(),
            'ingestion_timestamp': self._current_timestamp(),
            'campaign_uuid': bid_payload['campaign_uuid'],
            'bidder_uuid': bid_payload['bidder_uuid'],
            'bid_amount_cents': bid_payload['bid_amount_cents'],
            'client_metadata': {
                'ip_address': bid_payload.get('client_ip'),
                'user_agent': bid_payload.get('user_agent'),
                'request_origin': bid_payload.get('origin')
            }
        }
        
        # Publish to persistent topic
        send_result = self.kafka_client.send(
            topic='rtb-bid-ingestion',
            key=routing_key,
            value=enriched_message
        )
        
        # Return acknowledgment immediately
        return {
            'request_id': enriched_message['request_id'],
            'processing_status': 'QUEUED',
            'acknowledgment': 'Bid successfully queued for processing'
        }

# Kafka Consumer for asynchronous bid processing
class RTBMessageConsumer:
    def __init__(self, cluster_config, processing_service):
        self.kafka_client = KafkaConsumer(
            'rtb-bid-ingestion',
            bootstrap_servers=cluster_config['brokers'],
            group_id='rtb-processing-cluster',
            value_deserializer=self._deserialize_message,
            enable_auto_commit=False,  # Explicit commit control
            max_poll_records=50,
            session_timeout_ms=45000,
            fetch_max_wait_ms=500
        )
        self.processing_service = processing_service
        
    def consume_and_process(self):
        for kafka_message in self.kafka_client:
            try:
                bid_request = kafka_message.value
                
                # Execute bid validation and processing
                processing_result = self.processing_service.execute_bid_logic(
                    campaign_uuid=bid_request['campaign_uuid'],
                    bidder_uuid=bid_request['bidder_uuid'],
                    amount_cents=bid_request['bid_amount_cents']
                )
                
                # Persist processing outcome
                self.record_processing_result(bid_request['request_id'], processing_result)
                
                # Acknowledge successful processing
                self.kafka_client.commit()
                
                # Trigger real-time notifications if bid accepted
                if processing_result.is_successful():
                    self.broadcast_bid_update(bid_request['campaign_uuid'], processing_result)
                    
            except Exception as processing_error:
                self.handle_processing_failure(kafka_message, processing_error)

# Kafka cluster topology configuration
cluster_topology = {
    'rtb-bid-ingestion': {
        'partition_count': 64,  # High parallelism
        'replication_factor': 3,
        'configuration': {
            'retention.ms': 604800000,  # 7-day retention
            'compression.type': 'lz4',
            'min.insync.replicas': 2,
            'cleanup.policy': 'delete'
        }
    },
    'rtb-notification-events': {
        'partition_count': 32,
        'replication_factor': 3,
        'configuration': {
            'retention.ms': 172800000,  # 2-day retention
            'compression.type': 'lz4'
        }
    }
}

While introducing message queues adds minimal latency overhead (typically 3-8ms in production), this represents an acceptable tradeoff for guaranteed durability. Multiple patterns exist for optimizing user experience with asynchronous processing, each offering different consistency-latency balance points. We'll explore real-time notification strategies in our scaling discussion to address immediate feedback requirements.


3) How do we deliver real-time bid updates to active users?

Revisiting our functional requirement for 'Users should be able to monitor live campaign activity, including current winning bids', our existing polling-based approach reveals several critical limitations:

Latency Concerns: Periodic polling with multi-second intervals cannot satisfy the real-time expectations of competitive bidding environments, particularly during high-activity campaign periods.

Resource Inefficiency: Continuous database polling creates unnecessary load since the vast majority of requests return unchanged maximum bid values, resulting in wasted computational resources and network bandwidth.

To satisfy the non-functional requirement that the system should provide real-time bid updates, we need push-based notification mechanisms.

Here are the primary approaches for achieving real-time updates:

⚠️ Incremental Approach: Extended Polling

Extended polling (commonly known as long polling) improves substantially on the traditional request-response pattern: the client-server connection stays open until either the campaign state changes or a predetermined timeout expires (commonly 45-90 seconds).

Rather than immediate response delivery, servers retain incoming requests in a suspended state. Upon bid acceptance events, servers simultaneously respond to all pending requests with updated maximum bid information. Clients then immediately reinitiate extended polling requests, creating quasi-persistent communication channels.

Client-side implementation follows this pattern:

async function monitorCampaignBids(campaignUuid) {
  try {
    const response = await fetch(`/api/campaigns/${campaignUuid}/bid-stream`, {
      signal: AbortSignal.timeout(45000), // 45-second timeout
      headers: {
        'Accept': 'application/json',
        'Cache-Control': 'no-cache'
      }
    });
    
    if (response.ok) {
      const bidUpdate = await response.json();
      refreshCampaignDisplay(bidUpdate);
    }
  } catch (timeoutError) {
    // Handle timeout gracefully
    console.log('Extended poll timeout, reinitializing...');
  }
  
  // Immediately establish next polling cycle
  setTimeout(() => monitorCampaignBids(campaignUuid), 100);
}

Server-side implementation maintains pending request collections organized by campaign identifier. When bid processing completes successfully, all waiting requests receive immediate responses with updated state information.
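
One possible sketch of that server side, using an asyncio-based framework such as FastAPI (the framework, endpoint path, and 45-second timeout are illustrative assumptions):

# Long-polling endpoint: hold each request until a bid arrives or a timeout fires (sketch)
import asyncio
from collections import defaultdict
from fastapi import FastAPI

app = FastAPI()

# campaign_uuid -> asyncio.Event signalled whenever a new bid is accepted
campaign_events = defaultdict(asyncio.Event)
latest_bid_state = {}

@app.get("/api/campaigns/{campaign_uuid}/bid-stream")
async def long_poll_bids(campaign_uuid: str):
    event = campaign_events[campaign_uuid]
    try:
        # Suspend the request until a bid update occurs, for at most 45 seconds
        await asyncio.wait_for(event.wait(), timeout=45.0)
    except asyncio.TimeoutError:
        return {"status": "NO_CHANGE"}
    return latest_bid_state.get(campaign_uuid, {"status": "NO_CHANGE"})

def notify_bid_accepted(campaign_uuid: str, bid_data: dict):
    # Called from the bid-processing path: wake all waiting requests, then reset
    latest_bid_state[campaign_uuid] = bid_data
    campaign_events[campaign_uuid].set()
    campaign_events[campaign_uuid] = asyncio.Event()  # fresh event for the next cycle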

Operational Limitations
Despite improvements over basic polling, extended polling introduces several architectural constraints:

Resource Consumption: Maintaining numerous open connections for popular campaigns consumes substantial server resources. High-traffic campaigns may create "reconnection storms" where thousands of clients simultaneously attempt to reconnect after receiving updates.

Temporal Inconsistencies: The timeout mechanism creates unavoidable delays - bids arriving immediately after client timeout expiration won't be visible until the subsequent polling cycle completes. This necessitates balancing resource consumption (shorter timeouts increase request frequency) against update latency (longer timeouts delay information delivery).

Horizontal Scaling Complexity: Each server instance maintains isolated connection pools, requiring additional coordination infrastructure (Redis, message queues) for cross-server update propagation when scaling horizontally.

✅ Optimal Approach: Server-Sent Events (SSE)

Server-Sent Events establishes the optimal architecture for real-time bid update delivery. SSE creates unidirectional communication channels from server to client, enabling immediate data push capabilities without requiring continuous client polling or complex connection management.

When users access campaign monitoring interfaces, browsers establish dedicated SSE connections. Servers can then instantly transmit maximum bid updates through these channels whenever state changes occur. This creates authentic real-time experiences while maintaining superior efficiency compared to polling-based alternatives.

Client-side implementation achieves remarkable simplicity:

const campaignEventStream = new EventSource(`/api/campaigns/${campaignUuid}/live-updates`);

campaignEventStream.addEventListener('bid-update', (event) => {
  const bidData = JSON.parse(event.data);
  updateCampaignInterface(bidData);
});

campaignEventStream.addEventListener('campaign-ended', (event) => {
  const finalResults = JSON.parse(event.data);
  displayFinalResults(finalResults);
});

Server-side architecture maintains active SSE connection registries organized by campaign identifiers. Upon bid acceptance, servers broadcast updated information to all registered connections for the affected campaign:

import json
from collections import defaultdict

class RTBEventBroadcaster:
    def __init__(self):
        self.active_streams = defaultdict(set)  # campaign_uuid -> connection_set
        
    def register_client_stream(self, campaign_uuid, sse_connection):
        if campaign_uuid not in self.active_streams:
            self.active_streams[campaign_uuid] = set()
        self.active_streams[campaign_uuid].add(sse_connection)
    
    def propagate_bid_update(self, campaign_uuid, bid_data):
        registered_connections = self.active_streams.get(campaign_uuid, set())
        if registered_connections:
            formatted_event = f"event: bid-update\ndata: {json.dumps(bid_data)}\n\n"
            for connection in registered_connections:
                try:
                    connection.send(formatted_event)
                except ConnectionError:
                    # Mark connection for cleanup
                    pass

WebSocket alternatives exist, but SSE provides superior characteristics for unidirectional updates with reduced implementation complexity and better browser compatibility.

[Figure: RTB Real-Time Updates - SSE Architecture for Live Bid Notifications]

Scaling Considerations

The primary architectural challenge involves coordinating real-time updates across horizontally distributed servers. As user volume increases, multiple server instances become necessary to handle SSE connection loads. This creates coordination challenges: when bid processing occurs on Server Alpha, users connected to Server Beta may not receive immediate updates without additional infrastructure.

Cross-Server Coordination Challenge

Step 1 - Initial Connection Setup:

  • Advertiser X connects to Server Alpha to monitor Campaign Z
  • Advertiser Y connects to Server Beta to monitor the same Campaign Z
  • Both advertisers expect real-time updates for identical campaign

Step 2 - Bid Processing Event:

  • Advertiser W places a new bid on Campaign Z
  • The bid gets processed by Server Alpha (where the request lands)
  • Server Alpha updates the campaign's maximum bid in the database

Step 3 - Inconsistent Notification Delivery:

  • ✅ Advertiser X receives immediate SSE notification (connected to Server Alpha)
  • ❌ Advertiser Y experiences stale data (connected to Server Beta)
  • Server Beta has no knowledge of the bid update without coordination

Step 4 - Data Inconsistency Window:

  • Advertiser Y continues seeing outdated maximum bid information
  • This creates temporary data inconsistency across user experiences
  • Cross-server coordination must occur to synchronize all connected clients

Our scaling discussion will address distributed event coordination solutions for this architectural challenge.


4) How do we architect horizontal scaling for 10 million concurrent campaigns?

Achieving scalability for 10 million concurrent campaigns requires systematic analysis of each architectural layer to identify bottlenecks and implement appropriate scaling patterns. We'll establish baseline requirements and systematically address each critical component.

Capacity Planning & Traffic Estimation

For 10M globally distributed advertising campaigns, we must model realistic usage patterns (sanity-checked in the short calculation after this list):

  • Typical campaign lifecycle: 14-21 days (standard advertising cycles)
  • Expected bids per campaign: ~60-90 bids (varies by campaign attractiveness)
  • Daily bid throughput: 10M campaigns × 75 bids ÷ 17.5 days = ~43M bids/day
  • Peak traffic multiplier: 5x during advertising events (holiday seasons, major launches)
  • Maximum load: 43M × 5 = 215M bids/day = ~2,500 bids/second
  • Burst requirements: 6,000 bids/second for traffic surges
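
These estimates can be verified with a few lines of arithmetic:

# Back-of-the-envelope traffic estimation using the assumptions above (sketch)
CAMPAIGNS = 10_000_000
BIDS_PER_CAMPAIGN = 75        # midpoint of 60-90
AVG_CAMPAIGN_DAYS = 17.5      # midpoint of 14-21 days
PEAK_MULTIPLIER = 5

daily_bids = CAMPAIGNS * BIDS_PER_CAMPAIGN / AVG_CAMPAIGN_DAYS   # ≈ 43M bids/day
peak_daily_bids = daily_bids * PEAK_MULTIPLIER                   # ≈ 215M bids/day
peak_bids_per_second = peak_daily_bids / 86_400                  # ≈ 2,500 bids/second

print(f"{daily_bids / 1e6:.0f}M bids/day, peak {peak_bids_per_second:,.0f} bids/sec")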

Persistent Storage Layer Scaling

Our database infrastructure represents the foundational scaling challenge, requiring analysis of both storage capacity and transaction throughput.

Storage Capacity Planning:

  • Campaign record size: ~1.4KB (including targeting metadata, budget information, creative references)
  • Bid record size: ~500 bytes (optimized schema with proper indexing strategies)
  • Annual campaign turnover: 10M × 25 cycles = 250M campaigns/year
  • Total storage requirement: 250M × (1.4KB + (500 bytes × 75 bids)) = ~9.5TB/year

Transaction Throughput Requirements: Peak load of 6,000 writes/second substantially exceeds single PostgreSQL instance capacity (~1,200 writes/second for complex transactions with consistency guarantees). This mandates horizontal database sharding.

Database Sharding Implementation:

-- Campaign-based horizontal sharding strategy
CREATE TABLE rtb_campaigns_partition_00 (LIKE rtb_campaigns INCLUDING ALL);
CREATE TABLE rtb_campaigns_partition_01 (LIKE rtb_campaigns INCLUDING ALL);
-- Continue through 12 partitions for 12x write capacity

-- Consistent hash routing: sha256(campaign_uuid) % partition_count
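
A sketch of that routing logic in application code (the 12-partition count follows the DDL above; the pool structure and names are illustrative):

# Route a campaign to its database shard via a stable hash (sketch)
import hashlib

PARTITION_COUNT = 12

def shard_for_campaign(campaign_uuid: str) -> int:
    digest = hashlib.sha256(campaign_uuid.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % PARTITION_COUNT

def connection_for_campaign(campaign_uuid: str, shard_pools):
    # shard_pools[i] is a connection pool for rtb_campaigns_partition_<i>
    return shard_pools[shard_for_campaign(campaign_uuid)]

Note that with a fixed modulus, adding shards later forces resharding; a directory table or a consistent hashing ring limits data movement when the partition count grows.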

Read Scaling Strategy: Each shard deploys 3-4 read replicas to handle campaign discovery traffic, which typically generates 15x more read operations than write operations.

Microservices Horizontal Scaling

Bidding Service Auto-Scaling: The bidding service processes our highest-intensity workloads and demands sophisticated auto-scaling mechanisms.

Dynamic Scaling Configuration:

rtb_bidding_service:
  instance_management:
    minimum_capacity: 12
    maximum_capacity: 150
  scaling_triggers:
    - metric: cpu_utilization
      threshold: 70%
    - metric: memory_utilization  
      threshold: 75%
    - metric: kafka_consumer_lag
      threshold: 1500_messages
  scaling_behavior:
    scale_out_cooldown: 240s
    scale_in_cooldown: 720s
    scale_out_increment: 30%
    scale_in_decrement: 15%

Geographic Load Distribution:

  • Regional allocation: 45% North America, 30% Europe, 15% Asia-Pacific, 10% other regions
  • Campaign-consistent routing maintains bid ordering within individual campaigns
  • Circuit breaker patterns prevent cascading failures during regional service disruptions

Campaign Service Performance Optimization: Campaign data exhibits high read-to-write ratios, enabling aggressive caching strategies:

# Hierarchical caching architecture
class RTBCampaignService:
    def __init__(self):
        self.memory_cache = LRUCache(capacity=15000)    # Local memory
        self.distributed_cache = RedisCluster()         # Cross-server cache
        self.persistence_layer = PartitionedDatabase()
    
    async def retrieve_campaign_data(self, campaign_uuid: str):
        # Tier 1: Local memory cache (sub-millisecond)
        if cached_campaign := self.memory_cache.get(campaign_uuid):
            return cached_campaign
            
        # Tier 2: Distributed cache (~3ms lookup)
        if cached_campaign := await self.distributed_cache.get(f"rtb:campaign:{campaign_uuid}"):
            self.memory_cache[campaign_uuid] = cached_campaign
            return cached_campaign
            
        # Tier 3: Database retrieval (~25ms lookup)
        campaign_data = await self.persistence_layer.query_campaign(campaign_uuid)
        await self.distributed_cache.setex(f"rtb:campaign:{campaign_uuid}", 600, campaign_data)
        self.memory_cache[campaign_uuid] = campaign_data
        return campaign_data

Message Queue Scaling Infrastructure

Kafka infrastructure demands precise capacity planning to accommodate traffic bursts while maintaining zero message loss guarantees.

Kafka Cluster Topology:

rtb_kafka_cluster:
  broker_count: 15
  topic_configurations:
    rtb_bid_ingestion:
      partition_count: 60  
      replication_factor: 3
      retention_hours: 168  # 7-day retention
      compression: lz4
    rtb_event_notifications:
      partition_count: 30
      replication_factor: 3
      retention_hours: 48   # 2-day retention
      compression: snappy

Throughput Capacity Analysis:

  • Peak message rate: 6,000 messages/second
  • Average message payload: ~900 bytes (including enrichment metadata)
  • Total ingestion bandwidth: 6,000 × 900 bytes ≈ 5.4MB/s before replication
  • With replication factor 3, aggregate cluster write traffic is ~16MB/s spread across 60 partitions (~270KB/s per partition), leaving ample headroom

High-Performance Producer Configuration:

# Optimized Kafka producer settings
producer_configuration = {
    'bootstrap.servers': kafka_cluster_endpoints,
    'batch.size': 65536,              # 64KB batch size
    'linger.ms': 12,                  # 12ms batching window
    'compression.type': 'lz4',        # High-speed compression
    'acks': 'all',                    # Full durability
    'retries': 7,
    'max.in.flight.requests.per.connection': 5,
    'enable.idempotence': True,       # Exactly-once delivery
    'buffer.memory': 67108864         # 64MB buffer
}

Consumer Group Architecture:

  • A single consumer group (rtb-processing-cluster) with 10 consumer instances for parallel processing
  • Each consumer instance handles 6 partitions (60 ÷ 10 = 6 partitions/instance)
  • Consumer lag monitoring with alerting at 750 message threshold
  • Dead letter topic handling for processing failures (sketched below)
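
A sketch of how the consumer's handle_processing_failure hook could implement this (the DLQ topic name, attempt counter, and broker address are assumptions, not part of the configuration above):

# Dead-letter handling for failed bid processing (sketch)
import json
from kafka import KafkaProducer

dlq_producer = KafkaProducer(
    bootstrap_servers=["kafka-broker-1:9092"],  # placeholder brokers
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

MAX_ATTEMPTS = 3

def handle_processing_failure(kafka_message, error):
    payload = kafka_message.value
    attempt = payload.get("attempt", 0)

    if attempt + 1 < MAX_ATTEMPTS:
        # Re-publish to the main topic with an incremented attempt counter
        dlq_producer.send("rtb-bid-ingestion",
                          key=payload["campaign_uuid"].encode("utf-8"),
                          value={**payload, "attempt": attempt + 1})
    else:
        # Retries exhausted: park the message for inspection and manual replay
        dlq_producer.send("rtb-bid-ingestion-dlq", value={
            "original_message": payload,
            "error": str(error),
            "source_partition": kafka_message.partition,
            "source_offset": kafka_message.offset,
        })
    dlq_producer.flush()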

This completes our core infrastructure scaling approach. The final critical component - real-time connection management for millions of concurrent users - requires its own dedicated architectural discussion.


5) How do we scale real-time updates for millions of concurrent users?

Managing real-time bid updates across 10M campaigns with potentially 200M concurrent viewers presents our most complex distributed systems challenge. A single server cannot maintain millions of persistent connections, and when bids are processed on one server, users connected to other servers must still receive immediate updates.

The Core Scaling Challenge

Step 1 - Connection Distribution Problem:

  • With 10M campaigns averaging 20 viewers each, we need to support ~200M concurrent SSE connections
  • High-profile campaigns can attract 75,000+ simultaneous viewers
  • Single server capacity limit: ~300,000 concurrent connections
  • Required server instances: 600+ servers globally to handle peak load

Step 2 - Cross-Server Coordination Challenge:

  • Advertiser A connects to Server 1 monitoring Campaign X
  • Advertiser B connects to Server 2 monitoring the same Campaign X
  • When Advertiser C places a bid processed by Server 1, only Advertiser A receives immediate updates
  • Advertiser B experiences stale data until coordination occurs

Step 3 - Geographic Distribution Complexity:

  • Users distributed across multiple regions (Americas, Europe, Asia-Pacific)
  • Campaign activity occurs globally around the clock
  • Network latency affects real-time experience quality
  • Regional server clusters needed for optimal performance

Multi-Tier Real-Time Architecture


Tier 1 - Regional SSE Server Clusters

Each geographic region operates dedicated SSE server clusters to minimize latency and provide fault isolation:

Connection Management Strategy:

  • Americas East: 250 server instances, 320,000 connections per server
  • Europe Central: 200 server instances, 300,000 connections per server
  • Asia-Pacific: 150 server instances, 300,000 connections per server

Connection Lifecycle Management:

  • Client establishes SSE connection to nearest regional cluster
  • Load balancer uses consistent hashing by campaign_uuid for sticky sessions
  • Server registers client interest in specific campaign updates
  • Heartbeat mechanism maintains connection health monitoring (see the endpoint sketch below)
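
A minimal sketch of the server side of this lifecycle, assuming an Express-based SSE endpoint. The route path matches the client snippet later in this section, and connectionManager refers to the registry class shown under Tier 3.

// Minimal Express SSE endpoint sketch for a regional cluster node
const express = require('express');
const app = express();

app.get('/api/campaigns/:campaignUuid/live-updates', (req, res) => {
  const { campaignUuid } = req.params;

  // Standard SSE response headers: keep the socket open, disable caching
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
  });

  // Register this client's interest in the campaign's updates
  connectionManager.addConnection(campaignUuid, res);

  // Heartbeat comment frame every 30s keeps load balancers from closing idle connections
  const heartbeat = setInterval(() => res.write(': heartbeat\n\n'), 30000);

  req.on('close', () => {
    clearInterval(heartbeat);
    connectionManager.removeConnection(campaignUuid, res);
  });
});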

Tier 2 - Redis Pub/Sub Event Distribution

Rather than publishing updates to individual campaign channels, we implement shard-based broadcasting for efficiency:

Event Broadcasting Process:

Step 1 - Shard-Based Channel Design:

  • Campaigns distributed across 1,000 logical shards using consistent hashing
  • Each shard maps to dedicated Redis Pub/Sub channel: shard:123:bids
  • SSE servers subscribe only to shards containing campaigns with active connections
  • Reduces subscription overhead from millions of channels to thousands

// Consistent shard calculation across all system components
// NOTE: In prod, we would use consistent hashing (e.g., with virtual nodes) rather than
// a simple modulo, so that changing the shard count doesn't remap every campaign.
const crypto = require('crypto');

function getShardForCampaign(campaignUuid) {
  const hash = crypto.createHash('md5').update(campaignUuid).digest('hex');
  const numValue = parseInt(hash.substring(0, 8), 16);
  return numValue % 1000; // 1000 total shards
}

Step 2 - Intelligent Event Routing:

  • Bidding Service determines shard for processed bid: shard_id = hash(campaign_uuid) % 1000
  • Publishes event to shard channel with campaign identifier included in message
  • All SSE servers subscribed to that shard receive the update
  • Servers filter events locally and forward only to relevant connections

// Publishing bid updates to the appropriate shard channel
// redisClient is an already-connected Redis client shared by the Bidding Service
async function publishBidUpdate(campaignUuid, bidData) {
  const shardId = getShardForCampaign(campaignUuid);
  const channel = `shard:${shardId}:bids`;

  const message = { campaignUuid, ...bidData };
  await redisClient.publish(channel, JSON.stringify(message));
}

Step 3 - Event Message Structure:

{
  "campaign_uuid": "camp_abc123",
  "event_type": "BID_ACCEPTED", 
  "bid_amount": 150.00,
  "timestamp": "2024-06-15T14:30:00Z",
  "originating_server": "bid-processor-7"
}

Tier 3 - Connection Affinity and Failover

Session Affinity Implementation:

  • Load balancer routes users to same server instance consistently
  • Server maintains local connection registry organized by campaign_uuid
  • Minimizes cross-server coordination for repeat connections
  • Enables efficient local filtering of incoming events

// Local connection registry on each SSE server
class ConnectionManager {
  constructor() {
    this.connections = new Map(); // campaignUuid -> Set of SSE connections
  }

  addConnection(campaignUuid, sseConnection) {
    if (!this.connections.has(campaignUuid)) {
      this.connections.set(campaignUuid, new Set());
    }
    this.connections.get(campaignUuid).add(sseConnection);
  }

  // Deregister a client when its SSE connection closes
  removeConnection(campaignUuid, sseConnection) {
    const connections = this.connections.get(campaignUuid);
    if (!connections) return;
    connections.delete(sseConnection);
    if (connections.size === 0) this.connections.delete(campaignUuid);
  }

  // Total connections on this server, used by the health check below
  getTotalConnections() {
    let total = 0;
    this.connections.forEach(set => { total += set.size; });
    return total;
  }

  broadcastToCampaign(campaignUuid, message) {
    const connections = this.connections.get(campaignUuid);
    connections?.forEach(conn => conn.write(`data: ${message}\n\n`));
  }
}

Graceful Failover Mechanism:

Step 1 - Health Monitoring:

  • Regular health checks between load balancer and SSE servers
  • Connection count monitoring prevents server overload
  • Automatic traffic diversion when servers approach capacity limits

// Health check endpoint for load balancer monitoring
app.get('/health', (req, res) => {
  const connectionCount = connectionManager.getTotalConnections();
  const maxConnections = 300000; // Server capacity limit
  
  if (connectionCount > maxConnections * 0.9) {
    return res.status(503).json({ 
      status: 'overloaded', 
      connections: connectionCount 
    });
  }
  
  res.json({ status: 'healthy', connections: connectionCount });
});

Step 2 - Connection Migration:

  • When server fails, affected clients receive connection termination
  • Client automatically reconnects through load balancer to healthy server
  • New server re-establishes campaign subscription and delivers current state
  • Total reconnection time typically under 2-3 seconds

// Client-side automatic reconnection with exponential backoff
function connectToCampaignUpdates(campaignUuid, retryCount = 0) {
  const eventSource = new EventSource(`/api/campaigns/${campaignUuid}/live-updates`);

  // Reset the backoff counter once a connection is successfully established
  eventSource.onopen = () => { retryCount = 0; };

  eventSource.onerror = () => {
    eventSource.close();

    // Exponential backoff: 1s, 2s, 4s, then cap at 5s
    const delay = Math.min(1000 * Math.pow(2, retryCount), 5000);
    setTimeout(() => {
      connectToCampaignUpdates(campaignUuid, retryCount + 1);
    }, delay);
  };

  return eventSource;
}

Step 3 - Coordinated Maintenance:

  • Planned maintenance uses graceful connection draining
  • Server stops accepting new connections while maintaining existing ones (drain mode is sketched below)
  • Clients gradually migrate to other servers through natural reconnection
  • Zero-downtime deployments across server clusters
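
A rough sketch of drain mode under these rules, assuming the same Express app, health endpoint, and reconnection behavior shown earlier; startDraining() and acceptNewConnection() are illustrative helpers, not part of the earlier snippets.

// Drain-mode sketch: refuse new SSE connections while leaving existing streams open.
// Existing clients migrate naturally via their reconnection logic once this instance shuts down.
let draining = false;

// Invoked by deployment tooling before the instance is taken out of rotation
function startDraining() {
  draining = true;
}

// Guard at the top of the SSE route handler (and mirrored in the /health response)
function acceptNewConnection(res) {
  if (draining) {
    res.status(503).json({ status: 'draining' }); // load balancer retries another instance
    return false;
  }
  return true;
}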

Optimizing for High-Traffic Campaigns

Traffic Pattern Analysis:

  • Campaign popularity follows a power-law distribution: a small fraction of campaigns drives most of the traffic
  • The top 2% of campaigns generate roughly 60% of total connection load
  • Viral campaigns can spike from 100 to 50,000 viewers in minutes
  • Geographic concentration varies by campaign type and timing

Dynamic Load Balancing Strategies:

Campaign-Aware Server Assignment:

  • Monitor connection counts per campaign across server cluster
  • Automatically distribute high-traffic campaigns across multiple servers
  • Use consistent hashing with virtual nodes for balanced distribution
  • Prevent single-server bottlenecks during viral campaign events

// Load balancer logic for high-traffic campaign distribution
// hash() is any stable string-hash helper (e.g., the MD5-based shard hash shown earlier)
function selectServerForCampaign(campaignUuid, serverCluster) {
  const connectionCounts = serverCluster.map(s => s.getConnectionCount(campaignUuid));
  const maxConnections = Math.max(...connectionCounts);

  // If any single server holds >10K connections for this campaign, spread new
  // connections pseudo-randomly across the cluster instead of pinning them
  if (maxConnections > 10000) {
    return serverCluster[hash(campaignUuid + Date.now()) % serverCluster.length];
  }

  // Otherwise use consistent hashing for session affinity
  return serverCluster[hash(campaignUuid) % serverCluster.length];
}

Adaptive Connection Throttling:

  • Implement per-campaign connection limits during extreme traffic spikes (see the admission-control sketch after this list)
  • Queue additional connections with estimated wait times
  • Provide alternative update mechanisms (periodic refresh) for overflow users
  • Maintain service quality for existing connections during surge events
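
A sketch of per-campaign admission control along these lines; the 50,000-connection cap, the Retry-After hint, and the getConnectionCount helper are assumptions for illustration.

// Per-campaign admission control during surge events
const MAX_CONNECTIONS_PER_CAMPAIGN = 50000; // illustrative surge limit

function admitConnection(campaignUuid, res) {
  const current = connectionManager.getConnectionCount(campaignUuid); // assumed helper
  if (current >= MAX_CONNECTIONS_PER_CAMPAIGN) {
    // Push overflow clients onto the periodic-refresh fallback and suggest a retry window
    res.status(429)
       .set('Retry-After', '30')
       .json({ fallback: 'periodic-refresh', pollIntervalSeconds: 10 });
    return false;
  }
  return true;
}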

Caching Layer for Popular Campaigns:

  • Cache recent bid history and campaign state in Redis cluster
  • Serve initial connection state from cache rather than the database (see the snapshot sketch below)
  • Reduces database load during high-traffic campaign viewing
  • Enables faster connection establishment for popular campaigns
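
A sketch of the cache-first snapshot read, assuming an ioredis-style client; the key naming, the 5-second TTL, and the loadSnapshotFromDatabase helper are illustrative.

// Serve the initial campaign state from Redis, falling back to the database on a miss
async function getCampaignSnapshot(campaignUuid) {
  const cacheKey = `campaign:${campaignUuid}:snapshot`;

  const cached = await redisClient.get(cacheKey);
  if (cached) return JSON.parse(cached);

  // Cache miss: read current winning bid + recent history from the primary store
  const snapshot = await loadSnapshotFromDatabase(campaignUuid); // hypothetical DAO call
  await redisClient.set(cacheKey, JSON.stringify(snapshot), 'EX', 5); // short TTL keeps it fresh
  return snapshot;
}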

Event Consistency and Ordering

Message Ordering Guarantees:

Within-Campaign Ordering:

  • All events for single campaign processed through same shard
  • Redis Pub/Sub maintains message order within each channel
  • SSE servers deliver events to clients in received order
  • Timestamp verification prevents out-of-order event delivery (see the guard sketch below)
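
The timestamp check can be as small as the guard below, kept per campaign on each SSE server; the field names follow the event message structure shown earlier.

// Drop late-arriving events that are older than the last one already delivered
const lastDeliveredAt = new Map(); // campaignUuid -> timestamp (ms) of last delivered event

function shouldDeliver(event) {
  const ts = Date.parse(event.timestamp);
  const prev = lastDeliveredAt.get(event.campaign_uuid) || 0;
  if (ts < prev) return false; // stale event; the client already has newer state
  lastDeliveredAt.set(event.campaign_uuid, ts);
  return true;
}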

Cross-Campaign Event Coordination:

  • No ordering guarantees needed between different campaigns
  • Each campaign operates independently for performance
  • Simplified architecture without global event sequencing
  • Higher throughput through parallel campaign processing

Handling Network Partitions:

Partition Detection:

  • Redis Cluster monitors node connectivity and health
  • SSE servers detect Redis connection failures through timeouts
  • Client-side reconnection logic handles server unavailability
  • Graceful degradation maintains basic functionality during outages

Recovery Mechanisms:

  • Event replay capability for missed updates during disconnections (sketched below using the SSE Last-Event-ID header)
  • Campaign state synchronization upon reconnection
  • Conflict resolution through timestamp-based ordering
  • Client-side caching provides temporary offline capability
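
Event replay can piggyback on the standard SSE Last-Event-ID header, which browsers resend automatically on reconnection. The sketch below extends the earlier endpoint and assumes recent events per campaign are retained in a Redis list with a monotonically increasing eventId - both are assumptions, not part of the earlier snippets.

// Replay missed events on reconnection using the client's Last-Event-ID header
app.get('/api/campaigns/:campaignUuid/live-updates', async (req, res) => {
  const { campaignUuid } = req.params;
  const lastEventId = Number(req.headers['last-event-id'] || 0);

  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
  });

  // Assumed retention: recent events per campaign stored in a capped Redis list
  const recent = await redisClient.lrange(`campaign:${campaignUuid}:events`, 0, -1);
  for (const raw of recent) {
    const event = JSON.parse(raw);
    if (event.eventId > lastEventId) {
      res.write(`id: ${event.eventId}\ndata: ${raw}\n\n`); // id: lets the browser resume from here
    }
  }

  // Then continue with live updates as usual
  connectionManager.addConnection(campaignUuid, res);
});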

This comprehensive real-time architecture enables Google's RTB system to deliver instantaneous bid updates to millions of concurrent users while maintaining system reliability and consistent user experience across global deployments.

Here is our finalized system architecture diagram showing the scaling of real-time updates to clients:

Google RTB Finalized System Architecture - Complete Distributed Real-Time Bidding Platform with Horizontal Scaling and Multi-Tier Real-Time Updates

Final RTB System Architecture - Complete Distributed Platform

Potential System Extensions

The RTB system design presents numerous opportunities for architectural expansion and enhancement. Here are several commonly explored extensions during senior-level interviews.

Adaptive Campaign Termination: How would you implement dynamic campaign closure based on bidding activity thresholds? A pragmatic approach is to extend the campaign's termination timestamp with each accepted bid and have a scheduled job periodically scan for campaigns that have exceeded their inactivity threshold. For tighter precision, use delayed message processing: each accepted bid schedules a delayed job (for example, via an SQS delay queue or a scheduler built on top of Kafka) that checks whether that bid is still the winner after the timeout window and, if so, closes the campaign - see the sketch below.
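
A rough sketch of that delayed check; scheduleDelayedJob stands in for the delayed message mechanism, and the window length plus the data-access helpers are hypothetical.

// Inactivity-based campaign closure via a delayed "winner still standing" check
const INACTIVITY_WINDOW_MS = 10 * 60 * 1000; // illustrative 10-minute window

function onBidAccepted(campaignUuid, bidUuid) {
  // scheduleDelayedJob stands in for an SQS delay-queue message or a Kafka-based scheduler
  scheduleDelayedJob(INACTIVITY_WINDOW_MS, async () => {
    const campaign = await getCampaign(campaignUuid); // hypothetical DAO call
    if (campaign.status === 'ACTIVE' && campaign.winningBidUuid === bidUuid) {
      // No newer bid arrived during the window, so close the campaign with this winner
      await closeCampaign(campaignUuid, bidUuid); // hypothetical
    }
  });
}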

Post-Campaign Settlement Processing: How would you handle winner notification and payment workflows? The system should automatically notify winning advertisers via email/API webhooks upon campaign completion. Implement a payment grace period (24-72 hours) with automatic fallback to second-highest bidders if payment fails. This requires maintaining sorted bid rankings and payment status tracking.

Historical Bid Analytics: How would you provide comprehensive bid history with real-time updates? This essentially mirrors our real-time notification architecture - maintain paginated bid history APIs with SSE-powered live updates as new bids arrive. Cache recent bid pages aggressively while streaming incremental updates to connected clients.

Comprehensive System Observability: How would you implement end-to-end monitoring and alerting? This is broad enough to be a design question in its own right, covering signals such as bid-processing latency percentiles, Kafka consumer lag, Redis Pub/Sub delivery health, and per-server SSE connection counts.


Conclusion

Designing Google's Real-Time Bidding system presents a fascinating blend of technical challenges that mirror real-world distributed systems problems. The system must balance competing requirements: maintaining strong consistency for financial transactions while achieving ultra-low latency, scaling to handle massive throughput while preserving data integrity, and providing real-time updates while managing resource constraints.

Here is the final system architecture diagram:

Google RTB Final System Architecture - Production-Ready Real-Time Bidding Platform with Microservices, Kafka Queues, Redis Pub-Sub and SSE Real-Time Updates

Final Complete RTB System Architecture

Key Technical Insights:

  1. Consistency vs Performance: The progression from naive database locking to Redis caching to optimistic concurrency control demonstrates how system architects must navigate the fundamental tradeoffs between data consistency and system performance.

  2. Fault Tolerance Through Queuing: Using Kafka as a durable message queue transforms potential data loss scenarios into manageable latency tradeoffs, showcasing how proper system design can eliminate entire classes of failures.

  3. Real-time at Scale: The evolution from polling to long polling to Server-Sent Events, combined with pub/sub coordination across multiple servers, illustrates the complexity of delivering real-time experiences at internet scale.

  4. Horizontal Scaling Strategies: Database sharding by campaign ID, stateless service scaling, and pub/sub coordination demonstrate how large-scale systems achieve linear scalability through careful architectural decisions.

To Succeed in the Interview:

  • Start with Requirements: Always clarify functional and non-functional requirements upfront, especially scale expectations
  • Show Progressive Thinking: Demonstrate multiple approaches (naive → better → optimal) rather than jumping to complex solutions
  • Focus on Critical Path: Identify the most challenging aspects (bid consistency) and spend appropriate time on them
  • Consider Operational Concerns: Address monitoring, fault tolerance, and real-world deployment challenges
  • Justify Design Decisions: Explain tradeoffs and why specific technologies or patterns were chosen

This system design question effectively tests a candidate's ability to reason about distributed systems, handle consistency challenges, design for scale, and balance competing technical requirements - all essential skills for senior engineering roles at companies like Google.


This question has appeared frequently in recent Google interviews.
