How to Set Up Redundant Cron Jobs Across Multiple Servers for High Availability

Learn to implement failover strategies for critical cron jobs using leader election and distributed locking.

The problem with single-server cron jobs

Your payment processing runs at 3 AM every day. The daily backup kicks off at midnight. User reminder emails go out every hour. All critical jobs, all running on one server — until that server dies during a random Tuesday morning reboot.

I learned this the hard way when my database backup job hadn't run for six days. The server hosting it had kernel-panicked and rebooted, and cron never restarted properly. No alerts, no warnings, just six days of missing backups that I only discovered when I needed to restore something.

Single points of failure are everywhere in cron setups. Your server crashes, your disk fills up, a process hangs and blocks the cron daemon — any of these stops every scheduled job dead in its tracks. The worst part? You often don't know until something else breaks. Silent failures are common because cron jobs run in isolation without immediate feedback loops.

That's why I started setting up redundant cron jobs across multiple servers. But there's a catch — most redundancy attempts make things worse.

Why most redundancy attempts backfire spectacularly

The obvious solution is running the same job on multiple servers. Schedule your backup script on server A and server B. If A goes down, B keeps running. Problem solved, right?

Wrong. Now you have two backup jobs writing to the same location. Two payment processors charging customers twice. Two email services sending duplicate notifications. I've seen teams accidentally double-charge users for months because they ran payment cron jobs on multiple servers without coordination.

Race conditions make it worse. When two servers start the same long-running job simultaneously, they compete for resources, corrupt shared data, or create inconsistent states. A database migration running on two servers at once can destroy your schema. File processing jobs can corrupt data when multiple instances write to the same files.

The duplicate execution problem is harder to detect than a missing job. Failed jobs usually trigger errors. Duplicate jobs often succeed — twice. Your monitoring shows green while your customers get charged multiple times or your data gets corrupted.

That's why redundant cron jobs need coordination. Only one server should run the job at any time, but if that server fails, another should take over automatically.

Strategy 1: Leader election with database locks

Database row locking is the simplest way to coordinate cron jobs across servers. The idea is straightforward: whichever server acquires the lock first becomes the leader and runs the job. Other servers check for the lock, see it's taken, and skip execution.

I like this approach because it uses infrastructure you already have. Most applications already connect to a database, so there are no new dependencies or services to manage.

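<?php
// Assumes a cron_locks table with a PRIMARY KEY or UNIQUE index on job_name.
function runWithLock(string $jobName, callable $callable): void
{
    // Placeholder connection details; substitute your own DSN and credentials.
    $pdo = new PDO('mysql:host=your-db-host;dbname=your_db', 'db_user', 'db_password');
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    // Insert the lock row, or take it over if the existing lock has expired
    $stmt = $pdo->prepare("
        INSERT INTO cron_locks (job_name, locked_at, expires_at) 
        VALUES (?, NOW(), DATE_ADD(NOW(), INTERVAL 10 MINUTE))
        ON DUPLICATE KEY UPDATE 
            locked_at = CASE 
                WHEN expires_at < NOW() THEN NOW()
                ELSE locked_at 
            END,
            expires_at = CASE 
                WHEN expires_at < NOW() THEN DATE_ADD(NOW(), INTERVAL 10 MINUTE)
                ELSE expires_at 
            END
    ");
    
    $stmt->execute([$jobName]);
    
    // Check if we got the lock. MySQL reports 1 affected row for a new insert,
    // 2 for a takeover of an expired lock, and 0 when a live lock left the row
    // unchanged (with PDO::MYSQL_ATTR_FOUND_ROWS off, which is the default).
    // This avoids a separate SELECT that two servers could both pass.
    if ($stmt->rowCount() > 0) {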
        try {
            echo "Got lock, running job: $jobName\n";
            $callable();
        } finally {
            // Release lock
            $pdo->prepare("DELETE FROM cron_locks WHERE job_name = ?")
                ->execute([$jobName]);
        }
    } else {
        echo "Another server is running $jobName, skipping\n";
    }
}

// Usage in your cron script
runWithLock('daily-backup', function() {
    // Your backup logic here
    system('/usr/local/bin/backup-database.sh');
});
?>

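This approach handles expiration automatically. If a server crashes while holding the lock, the 10-minute expiration ensures other servers can take over, and the ON DUPLICATE KEY UPDATE logic lets expired locks be reacquired without manual cleanup. The flip side is that a job running longer than 10 minutes can lose its lock to another server, and the unconditional DELETE in the finally block would then remove that server's lock, so set the expiry comfortably above the job's worst-case runtime.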
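The acquire-or-take-over pattern can also be sketched in Python with an ownership check on release, so that a server whose lock has expired cannot delete a lock another server has since taken. This is only an illustration under stated assumptions: it uses SQLite instead of MySQL, adds a hypothetical owner column, and takes the current time as a parameter so the expiry logic is easy to exercise.

```python
import sqlite3

# Hypothetical schema: one row per job, with the owning server recorded.
SCHEMA = """
CREATE TABLE IF NOT EXISTS cron_locks (
    job_name   TEXT PRIMARY KEY,
    owner      TEXT NOT NULL,
    expires_at REAL NOT NULL
)
"""

ACQUIRE_SQL = (
    "INSERT INTO cron_locks (job_name, owner, expires_at) VALUES (?, ?, ?) "
    "ON CONFLICT(job_name) DO UPDATE SET "
    "owner = excluded.owner, expires_at = excluded.expires_at "
    "WHERE cron_locks.expires_at < ?"
)


def try_acquire(conn, job_name, owner, now, ttl_seconds):
    """Insert the lock, or take over an expired one. True if `owner` holds it."""
    cur = conn.execute(ACQUIRE_SQL, (job_name, owner, now + ttl_seconds, now))
    conn.commit()
    # One changed row means we inserted or took over; zero means a live lock exists.
    return cur.rowcount == 1


def release(conn, job_name, owner):
    """Delete the lock only if `owner` still holds it."""
    cur = conn.execute(
        "DELETE FROM cron_locks WHERE job_name = ? AND owner = ?",
        (job_name, owner),
    )
    conn.commit()
    return cur.rowcount == 1
```

In a real job, now would come from time.time() and owner from something like the hostname plus process ID.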

The database approach works well for jobs that run every few minutes or longer. For high-frequency jobs (every few seconds), the database overhead becomes noticeable. That's where Redis shines.

Strategy 2: Redis-based distributed locking

Redis offers faster locking with built-in expiration. The SET command with NX (only set if not exists) and EX (expiration) flags gives you atomic lock acquisition and automatic cleanup.

I prefer Redis for frequent jobs or when you need sub-second coordination. The memory-based storage makes lock acquisition much faster than database queries.

import os
import socket
import subprocess
import sys

import redis

# Delete the lock only if the stored value still matches our server ID
RELEASE_SCRIPT = """
if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
else
    return 0
end
"""

def run_with_redis_lock(job_name, command, timeout_seconds=300):
    r = redis.Redis(host='your-redis-host', port=6379, db=0)
    lock_key = f"cron_lock:{job_name}"
    server_id = f"{socket.gethostname()}:{os.getpid()}"

    # Try to acquire lock: NX = only if absent, EX = expire after the timeout
    if not r.set(lock_key, server_id, nx=True, ex=timeout_seconds):
        current_holder = r.get(lock_key)  # None if the lock expired in between
        holder = current_holder.decode() if current_holder else "unknown"
        print(f"Lock held by {holder}, skipping {job_name}")
        return False

    try:
        print(f"Acquired lock for {job_name}, running on {server_id}")
        # Run the job while the lock is still held
        subprocess.run(command, check=True)
        return True
    except Exception as e:
        print(f"Error during job execution: {e}")
        return False
    finally:
        # Only release if we still own it
        r.eval(RELEASE_SCRIPT, 1, lock_key, server_id)

# Usage
run_with_redis_lock('email-processor', ['/usr/local/bin/process-emails.py'])
sys.exit(0)  # Exit gracefully either way; a skipped run means another server is handling it

The Lua script ensures safe lock release. Without it, you might release a lock that another server acquired after your job timed out. The script only deletes the lock if the stored value matches your server ID.
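To see why the ownership check matters, here is a toy walk-through of that scenario. It is not Redis: ToyLockStore is a hypothetical in-memory stand-in with a manual clock, written only to make the sequence of events visible.

```python
class ToyLockStore:
    """In-memory stand-in for SET NX EX, driven by a manually advanced clock."""

    def __init__(self):
        self.now = 0
        self._locks = {}  # key -> (owner, expires_at)

    def _live(self, key):
        entry = self._locks.get(key)
        if entry and entry[1] > self.now:
            return entry
        self._locks.pop(key, None)  # expired or missing
        return None

    def acquire(self, key, owner, ttl):
        if self._live(key):
            return False
        self._locks[key] = (owner, self.now + ttl)
        return True

    def holder(self, key):
        entry = self._live(key)
        return entry[0] if entry else None

    def release_unchecked(self, key):
        self._locks.pop(key, None)

    def release_if_owner(self, key, owner):
        entry = self._live(key)
        if entry and entry[0] == owner:
            del self._locks[key]
            return True
        return False


KEY = "cron_lock:email-processor"
store = ToyLockStore()
store.acquire(KEY, "server-a:101", ttl=300)
store.now = 301  # server A's job overruns the timeout
store.acquire(KEY, "server-b:202", ttl=300)

# Server A finishes late. A checked release leaves B's lock alone...
released = store.release_if_owner(KEY, "server-a:101")
holder_after_checked = store.holder(KEY)

# ...while an unchecked delete removes B's lock.
store.release_unchecked(KEY)
holder_after_unchecked = store.holder(KEY)
```

After the checked release, server B still holds the lock; after the unchecked delete, nobody does, and a third server could start a duplicate run.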

Redis expiration handles crashed servers automatically. If your process dies while holding the lock, Redis expires it after the timeout, allowing other servers to take over. Choose your timeout based on how long your job typically runs plus a buffer for unexpected delays.

Strategy 3: Load balancer health checks for failover

Load balancers can route cron job traffic based on health checks. Instead of coordinating between servers, you designate one as primary and only fail over when health checks fail.

This approach works well when you want clear primary/secondary roles rather than dynamic leader election. I use it for jobs where consistent execution on the same server matters (like maintaining local state or file handles).

# nginx.conf or similar (these blocks belong inside http { })
upstream cron_servers {
    server 10.0.1.10:8080 max_fails=2 fail_timeout=30s;
    server 10.0.1.11:8080 max_fails=2 fail_timeout=30s backup;
    server 10.0.1.12:8080 max_fails=2 fail_timeout=30s backup;
}

server {
    listen 80;
    server_name cron-lb.internal;
    
    location /health {
        proxy_pass http://cron_servers;
        proxy_connect_timeout 5s;
        proxy_read_timeout 5s;
    }
    
    location /run-cron {
        proxy_pass http://cron_servers;
        proxy_connect_timeout 30s;
        proxy_read_timeout 300s;
    }
}

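Each server exposes a health check endpoint that returns 200 when ready to handle cron jobs and 503 when it should not. The load balancer sends all cron traffic to the primary and uses the backup servers only when the primary is marked unavailable. Note that open-source nginx only checks health passively: max_fails and fail_timeout count failed requests, and nothing polls /health on its own, so an external monitor or an active health check feature has to call that endpoint.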
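A health endpoint of this kind can be very small. The sketch below uses Python's standard library and assumes a hypothetical convention where an operator creates a flag file to take the server out of rotation; the path and the readiness rule are illustrative, not part of the setup described here.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    # Hypothetical convention: the server is "not ready" while this file exists.
    flag_path = "/tmp/cron-maintenance"

    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        ready = not os.path.exists(self.flag_path)
        body = b"ok\n" if ready else b"unavailable\n"
        self.send_response(200 if ready else 503)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging


def serve(port=8080):
    """Serve /health on the port the upstream block points at."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling serve() starts the endpoint; a real readiness check would more likely test disk space, database connectivity, or whatever the jobs depend on.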

Your cron jobs hit the load balancer instead of running locally. This centralizes the failover logic in your infrastructure layer rather than in application code.
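On the calling side, the crontab entry then runs a thin client rather than the job itself. A minimal sketch, assuming the /run-cron route from the config above and a hypothetical job query parameter that the backend servers would have to understand:

```python
import urllib.error
import urllib.parse
import urllib.request


def trigger_job(base_url, job_name, timeout=300):
    """POST to /run-cron via the load balancer and return the HTTP status code."""
    query = urllib.parse.urlencode({"job": job_name})
    request = urllib.request.Request(
        f"{base_url}/run-cron?{query}", data=b"", method="POST"
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The balancer answered with an error, e.g. 502 when no upstream is reachable
        return err.code
```

A call such as trigger_job("http://cron-lb.internal", "daily-backup") would go in the script cron invokes. Connection failures to the balancer itself are left to raise, so cron's own error reporting still sees them.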

The etcd approach for true distributed coordination

For production systems that need bulletproof coordination, etcd provides consensus-based leader election. Unlike simple locking, etcd handles network partitions gracefully and ensures only one leader exists even during split-brain scenarios.

I use etcd when reliability trumps complexity. It's overkill for simple setups, but when you need guarantees that exactly one server runs critical jobs, etcd delivers.

#!/bin/bash
JOB_NAME="payment-processor"
ETCD_ENDPOINTS="http://etcd1:2379,http://etcd2:2379,http://etcd3:2379"
LOCK_TTL=60
SERVER_ID=$(hostname)

# Try to acquire leadership
etcdctl --endpoints=$ETCD_ENDPOINTS \
    lock --ttl=$LOCK_TTL $JOB_NAME \
    /usr/local/bin/run-payment-job.sh

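# The lock command:
# 1. Blocks until it acquires the lock
# 2. Runs the command while holding the lock
# 3. Releases the lock when the command finishes
# A server that was waiting then acquires the lock and runs the command too,
# so the job should skip work already completed for the current period.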

# run-payment-job.sh
#!/bin/bash
echo "$(date): Starting payment processing on $HOSTNAME"

# No manual lease renewal is needed here: the parent etcdctl lock process
# keeps its session lease alive for as long as it is running and connected.

# Run actual job
python /app/process_payments.py
echo "$(date): Payment processing completed on $HOSTNAME"

The etcd lock command handles leader election automatically. If the current lock holder crashes or becomes unreachable, its lease expires and the next waiting candidate acquires the lock. Network partitions don't cause split-brain because etcd requires majority consensus: only the side of the partition that can reach a majority of etcd members can acquire or keep the lock.

Lease renewal keeps the lock active during long-running jobs. If your process crashes or the server becomes unreachable, etcd automatically expires the lease and allows another server to become leader.

Handling the edge cases that will definitely bite you

Clock drift between servers creates race conditions in time-based locking. If server A thinks it's 10:00:30 and server B thinks it's 10:00:25, they might both try to run a job scheduled for 10:00:30.

NTP helps, but you also need logic to handle small time differences. I add random jitter to job start times and use longer lock timeouts than strictly necessary.

# Use absolute timestamps in lock keys instead of relative timing.
# Compute the key before sleeping so jitter cannot push it into the next minute.
LOCK_KEY="job:${JOB_NAME}:$(date +'%Y%m%d%H%M')"

# Add 0-29 second random delay before attempting lock
sleep $((RANDOM % 30))

Network partitions can split your servers into groups that can't communicate. Database-based locking fails if servers can't reach the database. Redis locking breaks if servers can't reach Redis. Plan for these scenarios by monitoring connectivity and having fallback behaviors.

Lock expiration timing is tricky. Too short, and jobs get interrupted when they run longer than expected. Too long, and failover takes forever when servers crash. I start with 2x the typical job runtime and adjust based on monitoring data.
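The "2x the typical runtime" starting point is easy to compute from recent run durations. A small sketch, where the median as the measure of "typical" and the 60-second floor are my assumptions rather than fixed rules:

```python
import statistics


def suggest_lock_ttl(recent_runtimes, factor=2.0, floor_seconds=60):
    """Return a lock timeout in seconds: factor times the median recent runtime."""
    typical = statistics.median(recent_runtimes)
    # The floor keeps very short jobs from getting an unrealistically tight timeout
    return max(floor_seconds, int(typical * factor))


ttl = suggest_lock_ttl([100, 120, 140])  # median 120s, doubled
```

Feeding in durations from your job logs gives a starting value to adjust as monitoring data accumulates.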

Testing these edge cases requires simulating failures. Network issues are particularly hard to debug because they're intermittent and hard to reproduce in development.

Testing your failover setup before production

Chaos testing reveals problems that normal testing misses. I deliberately break servers, disconnect networks, and overload resources to see how the failover behaves under stress.

Start with simple tests: kill the process holding the lock and verify another server takes over. Then escalate to more complex scenarios like network partitions and resource exhaustion.

#!/bin/bash
# Test script for failover scenarios

echo "Testing basic failover..."
# Start job on server A
ssh server-a "/usr/local/bin/test-job.sh" &
JOB_PID=$!

sleep 5

# Verify job is running
if ssh server-a "ps aux | grep test-job | grep -v grep"; then
    echo "✓ Job running on server A"
else
    echo "✗ Job not running on server A"
    exit 1
fi

# Kill the job process
ssh server-a "pkill -f test-job"
echo "Killed job on server A"

# Wait for failover
sleep 10

# Verify job moved to server B
if ssh server-b "ps aux | grep test-job | grep -v grep"; then
    echo "✓ Job failed over to server B"
else
    echo "✗ Job did not fail over to server B"
    exit 1
fi

echo "Basic failover test passed"

# Test network partition
echo "Testing network partition..."
ssh server-a "iptables -A INPUT -s 10.0.1.11 -j DROP"  # Block server B
ssh server-a "iptables -A INPUT -s 10.0.1.12 -j DROP"  # Block server C

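# Run job and see what happens
# Should either run on server A (if it can reach coordination service)
# or fail gracefully without duplicate execution

# Remove the partition rules once the test is done
ssh server-a "iptables -D INPUT -s 10.0.1.11 -j DROP"
ssh server-a "iptables -D INPUT -s 10.0.1.12 -j DROP"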

Load testing shows how the coordination mechanism performs under stress. Run multiple jobs simultaneously and verify that exactly one server executes each job instance, even under high contention.
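A contention check along these lines can be sketched with threads standing in for servers. The in-memory lock below is a hypothetical placeholder; in a real test, try_acquire would call your database, Redis, or etcd lock instead.

```python
import threading


def count_winners(try_acquire, workers=20):
    """Start `workers` threads at once and count how many acquire the lock."""
    barrier = threading.Barrier(workers)
    results = []
    results_lock = threading.Lock()

    def contend(worker_id):
        barrier.wait()  # release all threads together to maximise contention
        won = try_acquire(f"server-{worker_id}")
        with results_lock:
            results.append(won)

    threads = [threading.Thread(target=contend, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(1 for won in results if won)


def make_in_memory_lock():
    """Stand-in for the real lock; replace with a call to your lock service."""
    guard = threading.Lock()
    holder = []

    def try_acquire(owner):
        with guard:
            if holder:
                return False
            holder.append(owner)
            return True

    return try_acquire
```

The assertion to make is that count_winners(...) returns exactly 1 for every job instance, however many workers compete.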

Monitor resource usage during tests. Lock acquisition should be fast (under 100ms for most cases) and shouldn't consume significant CPU or memory. If coordination becomes a bottleneck, you need a different approach.

Monitoring redundant cron jobs without losing your mind

Redundant cron jobs need different monitoring than single-server jobs. You care which server runs each job, how long failover takes, and whether any jobs get skipped during transitions.

Track lock acquisition metrics: how often each server gets the lock, how long it takes to acquire, and how often acquisition fails. Patterns in this data reveal problems before they cause outages.
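One lightweight way to collect these numbers is to wrap the acquisition call. The class below is a sketch with hypothetical names; in practice you would forward the counts and timings to whatever metrics system you already run.

```python
import time
from collections import Counter


class LockMetrics:
    """Counts lock outcomes per server and records acquisition latency."""

    def __init__(self):
        self.outcomes = Counter()  # (server_id, "acquired" | "skipped" | "error")
        self.latencies = []        # seconds spent in each acquisition attempt

    def timed_acquire(self, server_id, try_acquire):
        start = time.monotonic()
        try:
            acquired = try_acquire()
        except Exception:
            self.outcomes[(server_id, "error")] += 1
            raise
        finally:
            self.latencies.append(time.monotonic() - start)
        self.outcomes[(server_id, "acquired" if acquired else "skipped")] += 1
        return acquired
```

Here try_acquire is any zero-argument callable that returns True when the lock was obtained.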

Alert on failover events but tune the sensitivity. Occasional failover during maintenance is normal. Constant failover indicates deeper problems — network issues, resource contention, or timing problems in your coordination logic.

Log job execution with server identifiers. When debugging issues, you need to know which server ran which job instance. Include timestamps, server IDs, and job outcomes in your logs.
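A structured log line makes those fields easy to search later. A minimal sketch, where the field names are my own choice rather than any standard:

```python
import json
import socket


def job_log_record(job_name, outcome, started_at, finished_at, server_id=None):
    """Build one JSON log line describing a job run."""
    return json.dumps({
        "job": job_name,
        "server": server_id or socket.gethostname(),
        "outcome": outcome,
        "started_at": started_at,
        "finished_at": finished_at,
        "duration_seconds": round(finished_at - started_at, 3),
    }, sort_keys=True)
```

Printing the returned string from the cron script is enough if cron output is already collected by your log pipeline.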

The monitoring setup gets complex quickly. You're tracking multiple servers, coordination services, and the jobs themselves. Dead man's switch monitoring helps by focusing on outcomes rather than process details. Heartbeat APIs work well with redundant setups because they don't care which server sends the ping.
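The heartbeat side usually amounts to one HTTP request after a successful run. The sketch below assumes a generic ping URL supplied by whichever monitoring service you use; the URL format is hypothetical and not any particular vendor's API.

```python
import urllib.request


def run_with_heartbeat(job, ping_url, timeout=10):
    """Run `job`; ping `ping_url` only if it finishes without raising."""
    result = job()
    try:
        with urllib.request.urlopen(ping_url, timeout=timeout):
            pass
    except OSError as err:
        # A failed ping should not fail the job itself; the monitor will
        # notice the missing heartbeat on its own.
        print(f"heartbeat failed: {err}")
    return result
```

Because the ping is sent by whichever server actually ran the job, the monitor needs no knowledge of the lock or failover mechanism. If the job raises, no ping is sent and the missed heartbeat triggers the alert.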

Tools like WatchCron handle the complexity of monitoring distributed cron jobs by focusing on the job outcomes rather than the coordination mechanism. Set up monitoring once and get notified when jobs fail, regardless of which server was supposed to run them or how your failover works.

I hope this saves you from the duplicate payment charges and missing backups that taught me these lessons the hard way.