
# Status Monitoring Guide

> Complete guide to effective status monitoring with StatusStack

## Introduction

This guide explains how to monitor your services with StatusStack: organizing Stacks, adding third-party sources, custom monitors, and manual components, configuring notifications, and responding to incidents. It applies whether you're tracking third-party services, monitoring your own infrastructure, or managing status for multiple clients.

***

## Getting Started with Monitoring

### The Monitoring Workflow

<Steps>
  <Step title="Plan Your Stack Structure">
    Decide how to organize your monitoring (by environment, service type, or client)
  </Step>

  <Step title="Create Stacks">
    Set up logical groupings for your services
  </Step>

  <Step title="Add Components">
    Include third-party services, custom monitors, and manual components
  </Step>

  <Step title="Configure Notifications">
    Set up multi-channel alerts for status changes
  </Step>

  <Step title="Monitor & Respond">
    Watch dashboards, respond to alerts, and maintain uptime
  </Step>
</Steps>

***

## Stack Organization Strategies

### By Environment

Separate your monitoring by deployment environment:

```
📦 Production Stack
  ├─ Production API
  ├─ Production Database
  ├─ Production CDN
  └─ Production Payment Gateway

📦 Staging Stack
  ├─ Staging API
  ├─ Staging Database
  └─ Staging CDN

📦 Development Stack
  └─ Dev Services
```

**Benefits:**

* Clear separation of concerns
* Different notification rules per environment
* Easy to identify which environment has issues
* Prevents mixing prod and dev alerts

**Best For:** Development teams with multiple environments

***

### By Service Type

Group services by functional area:

```
📦 Core Infrastructure
  ├─ AWS EC2
  ├─ AWS S3
  ├─ Cloudflare CDN
  └─ DNS Provider

📦 Application Services
  ├─ Web Application
  ├─ API Server
  ├─ Background Workers
  └─ Database

📦 External Dependencies
  ├─ Payment Gateway (Stripe)
  ├─ Email Service (SendGrid)
  ├─ SMS Provider (Twilio)
  └─ Analytics (Google Analytics)
```

**Benefits:**

* Logical grouping by function
* Easy to see which service layer is affected
* Helps identify dependency chains
* Better for incident response

**Best For:** SaaS companies with complex infrastructure

***

### For MSP Clients

Create client-specific monitoring:

```
📦 Client A - Production
  ├─ Client A Website
  ├─ Client A API
  ├─ Client A Database
  └─ Client A Email

📦 Client A - Staging
  └─ Client A Test Environment

📦 Client B - Production
  ├─ Client B Website
  └─ Client B Services

📦 Internal Infrastructure
  ├─ MSP Dashboard
  ├─ Monitoring System
  └─ Backup Services
```

**Benefits:**

* Complete client isolation
* Client-specific notification rules
* Individual client status pages
* Billing per client

**Best For:** MSPs and agencies managing multiple clients

See the [MSP Setup Guide](/guides/msp-setup) for detailed MSP configuration.

***

## Adding Third-Party Services

### Choosing Services to Monitor

StatusStack supports 5,200+ sources. Add services that your infrastructure depends on:

**Cloud Infrastructure:**

* AWS, Google Cloud, Azure, DigitalOcean
* Monitor core services: compute, storage, networking

**Content Delivery:**

* Cloudflare, Fastly, Akamai
* Track CDN and DNS availability

**Developer Tools:**

* GitHub, GitLab, Bitbucket
* Monitor repository and CI/CD availability

**Payment Processing:**

* Stripe, PayPal, Square
* Ensure payment infrastructure is operational

**Communication:**

* SendGrid, Twilio, Slack, Discord
* Track email, SMS, and chat services

### Adding Source Components

<Steps>
  <Step title="Open Your Stack">
    Navigate to the Stack where you want to add components
  </Step>

  <Step title="Click Add Component">
    Click the **"Add Component"** button
  </Step>

  <Step title="Choose Source">
    Browse or search for a service (e.g., "AWS", "Cloudflare")
  </Step>

  <Step title="Select Components">
    Choose which components of that service to monitor:

    * For AWS: EC2, S3, Lambda, RDS, etc.
    * For Cloudflare: CDN, DNS, etc.
  </Step>

  <Step title="Add to Stack">
    Click **"Add to Stack"** to start monitoring
  </Step>
</Steps>

**Automatic Updates:**

* Status updates every 1-5 minutes
* No configuration required
* Reflects official status page data

***

## Custom Website Monitoring

### When to Use Custom Monitors

Use custom monitors for:

* Your own websites and APIs
* Internal services without public status pages
* Services not in our third-party integrations
* Specific endpoints requiring health checks

### Creating Effective Monitors

**Monitor Configuration Example:**

```javascript theme={null}
{
  "name": "Production API Health Check",
  "url": "https://api.yoursite.com/health",
  "interval": 60, // seconds
  "timeout": 10, // seconds
  "alert_threshold": 3, // consecutive failures
  "expected_status": 200,
  "expected_content": "\"status\":\"ok\"",
  "headers": {
    "Authorization": "Bearer health-check-token"
  },
  "ssl_check": true
}
```
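
To make the fields above concrete, here is a minimal sketch of how a single check result might be judged against that configuration. The `evaluateCheck` function and the shape of the `response` object (`status`, `body`, `elapsedSeconds`) are assumptions for illustration, not StatusStack's actual implementation.

```javascript
// Hypothetical helper: decides whether one check passed, given a monitor
// config like the one above and a simplified response object.
function evaluateCheck(config, response) {
  // A response slower than the configured timeout (seconds) counts as a failure.
  if (response.elapsedSeconds > config.timeout) {
    return { ok: false, reason: 'timeout' };
  }
  if (response.status !== config.expected_status) {
    return { ok: false, reason: 'unexpected status ' + response.status };
  }
  // Assumption: expected_content is interpreted as a regular expression.
  if (config.expected_content &&
      !new RegExp(config.expected_content).test(response.body)) {
    return { ok: false, reason: 'content mismatch' };
  }
  return { ok: true, reason: 'passed' };
}

const config = {
  timeout: 10,
  expected_status: 200,
  expected_content: '"status":"ok"',
};

const healthy = evaluateCheck(config, {
  status: 200,
  body: '{"status":"ok","timestamp":"2025-01-19T10:30:00Z"}',
  elapsedSeconds: 0.2,
});
console.log(healthy.ok, healthy.reason);
```

The checks run from cheapest to most specific (timing, status code, then body), so the reported reason points at the first thing that went wrong.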

### Best Practices for Custom Monitors

<AccordionGroup>
  <Accordion title="Check Interval Selection">
    * **30 seconds:** Critical production services
    * **60 seconds:** Standard production services (recommended)
    * **2-5 minutes:** Non-critical or staging services

    **Considerations:**

    * More frequent checks = faster detection but higher costs
    * Balance between responsiveness and efficiency
    * Most services don't need sub-minute monitoring
  </Accordion>

  <Accordion title="Timeout Configuration">
    **Rule of Thumb:** Set timeout to expected response time + 50% buffer

    **Examples:**

    * API typically responds in 200ms → Set 300ms timeout
    * Database query takes 2s → Set 3s timeout
    * Long-running endpoint (10s) → Set 15s timeout

    **Why:** Prevents false positives from occasional slowness
  </Accordion>

  <Accordion title="Alert Thresholds">
    * **1 failure:** Ultra-critical services (use sparingly)
    * **3 failures:** Production services (recommended)
    * **5 failures:** Staging or dev environments

    **Why:** Reduces false alarms from transient network issues
  </Accordion>

  <Accordion title="Expected Content Matching">
    **Use When:**

    * You need to verify response content, not just HTTP status
    * API returns 200 but with error payload
    * Checking for specific success indicators

    **Examples:**

    ```json theme={null}
    "expected_content": "\"healthy\":true"
    "expected_content": "OK"
    "expected_content": "status.*operational"
    ```

    **Regex Support:** Use regex patterns for flexible matching
  </Accordion>

  <Accordion title="Authentication Headers">
    **Protected Endpoints:**

    ```javascript theme={null}
    "headers": {
      "Authorization": "Bearer your-token",
      "X-API-Key": "your-api-key",
      "X-Custom-Header": "value"
    }
    ```

    **Security:**

    * Use dedicated health check tokens (not admin tokens)
    * Rotate tokens regularly
    * Limit token permissions to health check endpoint only
  </Accordion>

  <Accordion title="SSL Certificate Monitoring">
    **Enable SSL Checks:**

    ```javascript theme={null}
    "ssl_check": true
    ```

    **What It Monitors:**

    * Certificate expiration (warns 7 days before)
    * Certificate validity
    * Trust chain verification

    **Benefits:**

    * Prevent unexpected certificate expirations
    * Catch SSL configuration issues
    * Maintain security compliance
  </Accordion>
</AccordionGroup>
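
The timeout rule of thumb and the consecutive-failure threshold described above can be sketched in a few lines. The names `suggestTimeoutMs` and `createFailureTracker` are hypothetical and only illustrate the logic; they are not part of any StatusStack API.

```javascript
// Rule of thumb: expected response time plus a 50% buffer.
function suggestTimeoutMs(expectedResponseMs) {
  return Math.round(expectedResponseMs * 1.5);
}

// Tracks consecutive failures and reports when the alert threshold is reached.
function createFailureTracker(alertThreshold) {
  let consecutiveFailures = 0;
  return function record(checkPassed) {
    // Any successful check resets the streak.
    consecutiveFailures = checkPassed ? 0 : consecutiveFailures + 1;
    // Alert exactly once, on the check that reaches the threshold.
    return consecutiveFailures === alertThreshold;
  };
}

const record = createFailureTracker(3);
const results = [false, false, true, false, false, false];
const alerts = results.map((passed) => record(passed));

console.log(suggestTimeoutMs(2000)); // 3000
console.log(alerts); // only the final check triggers an alert
```

In this run the two early failures are cancelled by the success that follows them, which is how a threshold of 3 filters out transient network blips.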

### Health Check Endpoint Design

If you're creating custom health check endpoints, follow these patterns:

**Simple Health Check:**

```javascript theme={null}
// GET /health
{
  "status": "ok",
  "timestamp": "2025-01-19T10:30:00Z"
}
```

**Detailed Health Check:**

```javascript theme={null}
// GET /health/detailed
{
  "status": "ok",
  "timestamp": "2025-01-19T10:30:00Z",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "queue": "ok",
    "storage": "ok"
  },
  "uptime": 3600000
}
```

**Response Codes:**

* `200 OK`: All systems operational
* `503 Service Unavailable`: System degraded or down
* Other `5xx` errors: Treat as an outage
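
As an illustration of the response-code convention above, this sketch builds the detailed health payload and picks `200` or `503` from the individual check results. `buildHealthResponse` is a hypothetical helper, and the `"degraded"` status value is an assumption; wiring the result into your web framework is left out.

```javascript
// Hypothetical helper: turns per-dependency results into a status code and
// a JSON body shaped like the detailed health check above.
function buildHealthResponse(checks, now = new Date()) {
  const allOk = Object.values(checks).every((result) => result === 'ok');
  return {
    statusCode: allOk ? 200 : 503,
    body: JSON.stringify({
      status: allOk ? 'ok' : 'degraded', // assumed label for the failing case
      timestamp: now.toISOString(),
      checks,
    }),
  };
}

const response = buildHealthResponse({ database: 'ok', redis: 'error' });
console.log(response.statusCode, response.body);
```

Returning `503` whenever any dependency fails lets a monitor rely on `expected_status` alone, while the body still tells a human which dependency broke.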

***

## Manual Component Management

### When to Use Manual Components

Manual components are ideal for:

* Internal services without automated monitoring
* Third-party services not in our integrations
* Scheduled maintenance windows
* Services with manual status updates

### Creating Manual Components

<Steps>
  <Step title="Open Stack">
    Navigate to your Stack
  </Step>

  <Step title="Add Component">
    Click **"Add Component"** → **"Manual Component"**
  </Step>

  <Step title="Configure">
    * **Name:** "Internal Database Cluster"
    * **Description:** Optional details
    * **Initial Status:** Set starting status
  </Step>

  <Step title="Save">
    Component is added to your Stack
  </Step>
</Steps>

### Updating Manual Components

Update status when needed:

1. Open the Stack
2. Click on the manual component
3. Select new status (operational, degraded, outage, maintenance)
4. Add optional message describing the change
5. Click **Update Status**

**Use Cases:**

* Start of scheduled maintenance: Set to "maintenance"
* Detected issue: Set to "degraded" or "outage"
* Issue resolved: Set back to "operational"

***

## Status Monitoring Dashboard

### Understanding Stack Status

Each Stack displays an overall status based on its components:

| Status         | Indicator | Meaning                  |
| -------------- | --------- | ------------------------ |
| 🟢 Operational | Green     | All components healthy   |
| 🟡 Degraded    | Yellow    | Some components degraded |
| 🔴 Critical    | Red       | One or more outages      |
| 🔵 Maintenance | Blue      | Scheduled maintenance    |

**Status Calculation:**

```javascript theme={null}
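// Worst status wins. `componentStatuses` is an array such as ['operational', 'degraded'].
function stackStatus(componentStatuses) {
  if (componentStatuses.includes('outage')) return 'CRITICAL'
  if (componentStatuses.includes('degraded')) return 'DEGRADED'
  if (componentStatuses.includes('maintenance')) return 'MAINTENANCE'
  return 'OPERATIONAL'
}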
```

### Component Details

Click any component to view:

* **Current Status:** Real-time status indicator
* **Last Updated:** When status was last checked
* **Uptime:** Percentage uptime (last 30 days)
* **Status History:** Timeline of all status changes
* **Active Incidents:** Ongoing issues affecting this component

### Real-Time Updates

The dashboard automatically refreshes:

* **Auto-Refresh:** Every 30 seconds
* **Manual Refresh:** Click refresh button anytime
* **Live Indicators:** Shows when data is updating

***

## Notification Configuration

### Creating Notification Rules

<Steps>
  <Step title="Navigate to Notifications">
    Go to **Settings** → **Notifications**
  </Step>

  <Step title="Create Rule">
    Click **"Create Notification Rule"**
  </Step>

  <Step title="Name Rule">
    Use descriptive names:

    * ✅ "Production Outage Alerts - SMS"
    * ✅ "All Stacks - Team Slack"
    * ❌ "Rule 1"
  </Step>

  <Step title="Select Alert Levels">
    * **Info:** Status improvements
    * **Warning:** Degraded performance
    * **Alert:** Outages
  </Step>

  <Step title="Choose Scope">
    * All Stacks
    * Specific Stacks
    * Specific Components
  </Step>

  <Step title="Configure Channels">
    Add notification destinations:

    * Discord webhook
    * Slack webhook
    * Teams webhook
    * Email addresses
    * Phone numbers (SMS)
    * Custom webhooks
  </Step>

  <Step title="Save & Test">
    Save rule and test with a manual component
  </Step>
</Steps>

### Notification Best Practices

**Layered Alerting Strategy:**

```
┌─────────────────────────────────────────┐
│         Critical Production              │
│  Alert Level: OUTAGE only               │
│  Channels: SMS + Slack + Email          │
│  Recipients: On-call engineer           │
└─────────────────────────────────────────┘

┌─────────────────────────────────────────┐
│         Production Warnings              │
│  Alert Level: DEGRADED                  │
│  Channels: Slack + Email                │
│  Recipients: DevOps team                │
└─────────────────────────────────────────┘

┌─────────────────────────────────────────┐
│         All Status Changes               │
│  Alert Level: INFO + WARNING + ALERT    │
│  Channels: Slack                        │
│  Recipients: #status-updates channel    │
└─────────────────────────────────────────┘
```
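
The three layers in the diagram could be expressed as data and matched against a status change roughly as follows. The rule object shape, the `alertLevelFor` mapping (including treating every non-outage, non-degraded change as `info`), and `matchingRules` are assumptions made for this sketch, not StatusStack's rule format.

```javascript
// Hypothetical rule objects mirroring the three layers in the diagram.
// `stacks: null` means the rule applies to all Stacks.
const rules = [
  { name: 'Critical Production', levels: ['alert'], stacks: ['Production'],
    channels: ['sms', 'slack', 'email'] },
  { name: 'Production Warnings', levels: ['warning'], stacks: ['Production'],
    channels: ['slack', 'email'] },
  { name: 'All Status Changes', levels: ['info', 'warning', 'alert'], stacks: null,
    channels: ['slack'] },
];

// Maps a component's new status to an alert level (Info / Warning / Alert).
function alertLevelFor(newStatus) {
  if (newStatus === 'outage') return 'alert';
  if (newStatus === 'degraded') return 'warning';
  return 'info';
}

// Returns the names of every rule that would fire for a status change.
function matchingRules(event, allRules) {
  const level = alertLevelFor(event.newStatus);
  return allRules
    .filter((rule) => rule.levels.includes(level))
    .filter((rule) => rule.stacks === null || rule.stacks.includes(event.stack))
    .map((rule) => rule.name);
}

console.log(matchingRules({ stack: 'Production', newStatus: 'outage' }, rules));
console.log(matchingRules({ stack: 'Staging', newStatus: 'degraded' }, rules));
```

A production outage matches both the critical rule and the catch-all rule, while a staging degradation reaches only the catch-all Slack channel. That is the intent of layering: urgency rises with severity and scope.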

**Channel Selection:**

* **SMS:** Critical outages only (expensive, high urgency)
* **Slack/Discord:** Team awareness, all severity levels
* **Email:** Record keeping, escalations
* **Webhooks:** Integration with other tools (PagerDuty, etc.)

See the [Notifications documentation](/concepts/notifications) for detailed channel setup.

***

## Monitoring Workflows

### Daily Monitoring Routine

<Checklist>
  * [ ] Check dashboard for any active issues
  * [ ] Review overnight status changes
  * [ ] Verify all Stacks show green
  * [ ] Check notification delivery status
  * [ ] Review uptime percentages for critical services
</Checklist>

### Incident Response Workflow

When you receive an alert:

<Steps>
  <Step title="Acknowledge Alert">
    Open dashboard and locate affected Stack/component
  </Step>

  <Step title="Assess Severity">
    * **Outage:** Immediate action required
    * **Degraded:** Investigate and monitor
    * **Info:** Acknowledge and document
  </Step>

  <Step title="Investigate Root Cause">
    * Check component status history
    * Review related services for correlation
    * Check third-party status pages
    * Investigate custom monitor failures
  </Step>

  <Step title="Take Action">
    * Fix issues if within your control
    * Contact third-party support if needed
    * Update manual components if applicable
  </Step>

  <Step title="Document Resolution">
    * Update component status when resolved
    * Add notes to status history
    * Document root cause and fix
  </Step>

  <Step title="Post-Mortem">
    * Review incident timeline
    * Identify improvements
    * Update monitoring if needed
  </Step>
</Steps>

### Maintenance Window Management

**Before Maintenance:**

<Steps>
  <Step title="Plan Timing">
    Schedule during low-traffic periods
  </Step>

  <Step title="Notify Users">
    Update public status page if applicable
  </Step>

  <Step title="Set Component Status">
    Change affected components to "maintenance"
  </Step>

  <Step title="Adjust Notifications">
    Temporarily disable alerts for maintenance window
  </Step>
</Steps>

**During Maintenance:**

* Keep status updated if timeline changes
* Monitor related components for unexpected issues
* Document any problems encountered

**After Maintenance:**

* Return components to "operational"
* Re-enable notifications
* Verify all services healthy
* Update status page with completion

***

## Advanced Monitoring Techniques

### Dependency Mapping

Organize Stacks to reflect service dependencies:

```
📦 Frontend Services (depends on Backend + Infrastructure)
  ├─ Web Application
  └─ CDN

    ↓ depends on

📦 Backend Services (depends on Infrastructure)
  ├─ API Server
  ├─ Background Workers
  └─ Queue System

    ↓ depends on

📦 Core Infrastructure (foundational)
  ├─ Database
  ├─ Redis Cache
  └─ AWS Services
```

**Benefits:**

* Understand impact when infrastructure fails
* Prioritize fixes based on dependency chain
* Communicate outage scope to users
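
To show how a dependency map helps scope an outage, the sketch below walks the three-layer example above and lists every Stack affected when one fails. The `dependsOn` structure and `impactedBy` function are hypothetical; StatusStack itself is not assumed to store dependencies this way.

```javascript
// Hypothetical dependency map: each Stack lists the Stacks it depends on.
const dependsOn = {
  'Frontend Services': ['Backend Services', 'Core Infrastructure'],
  'Backend Services': ['Core Infrastructure'],
  'Core Infrastructure': [],
};

// Returns every Stack that directly or transitively depends on `failedStack`.
function impactedBy(failedStack, graph) {
  const impacted = new Set();
  let changed = true;
  // Repeat until a full pass adds nothing new (a simple fixed-point loop).
  while (changed) {
    changed = false;
    for (const [stack, deps] of Object.entries(graph)) {
      if (impacted.has(stack) || stack === failedStack) continue;
      if (deps.some((dep) => dep === failedStack || impacted.has(dep))) {
        impacted.add(stack);
        changed = true;
      }
    }
  }
  return [...impacted].sort();
}

console.log(impactedBy('Core Infrastructure', dependsOn));
console.log(impactedBy('Backend Services', dependsOn));
```

An infrastructure failure affects both layers above it, whereas a backend failure affects only the frontend. This is the information you need when prioritizing fixes or describing the outage scope to users.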

### Uptime Tracking

Monitor uptime percentages over time:

**Uptime Targets:**

* **99.9% ("Three Nines"):** about 43 minutes of downtime per 30-day month
* **99.95%:** about 22 minutes of downtime per 30-day month
* **99.99% ("Four Nines"):** about 4 minutes of downtime per 30-day month
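
These downtime figures follow from simple arithmetic: the allowed downtime is the fraction of time not covered by the target, multiplied by the minutes in the period. A minimal sketch, assuming a 30-day month (the function name is illustrative):

```javascript
// Allowed downtime in minutes for a given uptime target over a period.
function downtimeBudgetMinutes(uptimePercent, days = 30) {
  const totalMinutes = days * 24 * 60; // 43,200 minutes in 30 days
  const budget = ((100 - uptimePercent) / 100) * totalMinutes;
  return Math.round(budget * 10) / 10; // round to one decimal place
}

for (const target of [99.9, 99.95, 99.99]) {
  console.log(target + '% -> ' + downtimeBudgetMinutes(target) + ' min/month');
}
```

Passing a different `days` value gives the budget for other reporting periods, for example `downtimeBudgetMinutes(99.9, 365)` for a yearly target.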

**How to Track:**

1. View component details in dashboard
2. Check uptime percentage (last 30 days)
3. Review status history for downtime duration
4. Export reports for stakeholders

### Multi-Region Monitoring

For services deployed in multiple regions:

```
📦 Production - US East
  ├─ API Server (us-east-1)
  ├─ Database (us-east-1)
  └─ CDN Edge (US East)

📦 Production - Europe
  ├─ API Server (eu-west-1)
  ├─ Database (eu-west-1)
  └─ CDN Edge (Europe)
```

**Benefits:**

* Detect region-specific outages
* Track regional performance differences
* Better incident response for distributed systems
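
With one Stack per region, telling a regional outage from a global one comes down to comparing Stack statuses. The sketch below assumes you already have each regional Stack's status as a string; `classifyOutage` and its return shape are hypothetical.

```javascript
// Classifies an incident as regional or global from per-region Stack status.
function classifyOutage(regionStatuses) {
  const regions = Object.keys(regionStatuses);
  const down = regions.filter((region) => regionStatuses[region] === 'outage');
  if (down.length === 0) return { scope: 'none', regions: [] };
  if (down.length === regions.length) return { scope: 'global', regions: down };
  return { scope: 'regional', regions: down };
}

const result = classifyOutage({
  'Production - US East': 'outage',
  'Production - Europe': 'operational',
});
console.log(result.scope, result.regions);
```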

***

## Troubleshooting Common Issues

### False Positives

**Problem:** Monitor reports outages when service is actually operational

**Solutions:**

1. **Increase Alert Threshold:** Change from 1 to 3 consecutive failures
2. **Adjust Timeout:** Give service more time to respond
3. **Check Network Path:** Verify monitoring from correct region
4. **Review Expected Content:** Ensure regex/content match is correct

### Missing Alerts

**Problem:** Service went down but no notification received

**Checklist:**

<Checklist>
  * [ ] Notification rule is enabled
  * [ ] Alert level matches status change (e.g., "degraded" won't trigger "alert"-only rule)
  * [ ] Stack/component is in rule scope
  * [ ] Webhook URLs are correct and active
  * [ ] Email not in spam folder
  * [ ] SMS: Twilio account funded and numbers verified
</Checklist>

### Delayed Status Updates

**Problem:** Dashboard shows old status, not reflecting current state

**Solutions:**

1. **Manual Refresh:** Click refresh button in dashboard
2. **Check Update Frequency:** Third-party sources update every 1-5 minutes
3. **Custom Monitor Interval:** Verify check interval is appropriate
4. **Cache Issue:** Clear browser cache and reload

### Component Stuck in Maintenance

**Problem:** Component still shows maintenance after completion

**Solution:**

1. Click on the component
2. Manually update status to "operational"
3. Verify that the new status appears on the dashboard

***

## Monitoring Limits & Performance

### Current Limits

| Plan         | Stacks    | Components/Stack | Custom Monitors | Update Frequency |
| ------------ | --------- | ---------------- | --------------- | ---------------- |
| Free         | 3         | 10               | 5               | 5 minutes        |
| Starter      | 10        | 20               | 25              | 1 minute         |
| Professional | 50        | 50               | 100             | 30 seconds       |
| Enterprise   | Unlimited | Unlimited        | Unlimited       | 30 seconds       |
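
When planning a deployment, the table can be checked programmatically. The sketch below copies the limits from the table and returns the first plan, in table order, that fits a planned setup. The object shape and the `smallestPlanFor` name are assumptions for illustration.

```javascript
// Limits copied from the table above; Infinity stands in for "Unlimited".
const PLAN_LIMITS = {
  Free: { stacks: 3, componentsPerStack: 10, customMonitors: 5 },
  Starter: { stacks: 10, componentsPerStack: 20, customMonitors: 25 },
  Professional: { stacks: 50, componentsPerStack: 50, customMonitors: 100 },
  Enterprise: { stacks: Infinity, componentsPerStack: Infinity, customMonitors: Infinity },
};

// Returns the first plan (in table order) whose limits cover the usage.
// `usage.largestStack` is the component count of the biggest planned Stack.
function smallestPlanFor(usage) {
  for (const [plan, limits] of Object.entries(PLAN_LIMITS)) {
    const fits = usage.stacks <= limits.stacks &&
      usage.largestStack <= limits.componentsPerStack &&
      usage.customMonitors <= limits.customMonitors;
    if (fits) return plan;
  }
  return null;
}

console.log(smallestPlanFor({ stacks: 4, largestStack: 12, customMonitors: 8 }));
```

Here four Stacks exceed the Free limit of three, so the first plan that fits is Starter.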

### Optimizing Performance

**Tips for Large Deployments:**

* Use Stack organization to group related services
* Limit components per Stack to 20-30 for best dashboard performance
* Use appropriate check intervals (not everything needs 30s checks)
* Archive old/unused Stacks
* Consolidate duplicate monitors

***

## Next Steps

<CardGroup cols={2}>
  <Card title="Customer Dashboard" icon="gauge" href="/guides/customer-dashboard">
    Explore all dashboard features and shortcuts
  </Card>

  <Card title="MSP Setup" icon="building" href="/guides/msp-setup">
    Set up multi-client monitoring
  </Card>

  <Card title="Stacks Concept" icon="layer-group" href="/concepts/stacks">
    Deep dive into Stack organization
  </Card>

  <Card title="Notifications" icon="bell" href="/concepts/notifications">
    Advanced notification configuration
  </Card>
</CardGroup>

***

**Effective monitoring requires planning, appropriate tooling, and responsive workflows.** Use this guide to build a robust monitoring strategy that keeps your services reliable and your team informed.
