[Troubleshooting] DB Outage After Large-Scale DELETE — Row-Level Lock Explosion and Operational Remedies

1️⃣ Situation Summary

Environment: InnoDB-based production DB / tens of millions of rows
Symptoms: Explosive query wait times, session congestion, sharp service latency drop
Impact: Overall transaction throughput collapse, outage precursor observed

DELETE FROM CQ_SHORT_INFO;

2️⃣ Immediate Checks (Quick Check)

SHOW ENGINE INNODB STATUS;
-- MySQL 5.7 / MariaDB
SELECT * FROM information_schema.innodb_locks;
SELECT * FROM information_schema.innodb_lock_waits;
-- MySQL 8.0+
SELECT * FROM performance_schema.data_locks;

3️⃣ Root Cause Analysis

DELETE is not a simple delete.
It is a heavy operation that simultaneously generates massive Undo Logs, Redo I/O, and Row-Level Exclusive Locks.


4️⃣ Standard Resolution Pattern

Large DELETE operations must be split into batches.

Operational Guidelines

  • Batch size: 1,000 ~ 10,000
  • Single-transaction large DELETE is prohibited

Replication Considerations

  • binlog_format = STATEMENT + DELETE ... LIMIT
    → Non-deterministic risk
    → Master and Slave may delete different rows
  • ROW / MIXED format recommended

Foreign Key Pre-check

  • If ON DELETE CASCADE exists
    → Batch size ≤ 1,000

5️⃣ Advanced Strategies

🅐 Copy & Swap Pattern (The Canonical Large-Delete Strategy)

Deleting is more expensive than rebuilding.

Procedure

  1. Create NEW_CQ_SHORT_INFO with the same schema
  2. [Mandatory Safety Guard — Change Transaction Isolation Level]
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;

[Critical Note]

Under the default REPEATABLE READ isolation level,
a shared lock (S-Lock) is acquired on the source table during copy, blocking all writes.

Risk: With the default setting, all INSERT/UPDATE/DELETE operations are completely blocked during the copy phase.

Solution: Switching to READ COMMITTED allows non-locking reads, enabling safe copying without service disruption.

Data Copy

INSERT INTO NEW_CQ_SHORT_INFO
SELECT * FROM CQ_SHORT_INFO
WHERE <retention condition>;

Pre-swap Validation

SHOW PROCESSLIST; -- Check for long-running SELECTs (to avoid Metadata Lock)

RENAME → DROP

Effects

  • Undo Log: 0
  • Fragmentation eliminated
  • Downtime only at the RENAME moment

🅑 pt-archiver — The Final Weapon for Automated Operations (Percona Toolkit)

pt-archiver \
  --source DSN=... \
  --where "create_date < '2025-01-01'" \
  --limit 1000 \
  --commit-each \
  --sleep 1 \
  --purge \
  --primary-key-only \
  --check-slave-lag DSN=...

Related Article :


6️⃣ Advanced Optimization — PK Walking Iterator

DELETE FROM CQ_SHORT_INFO
WHERE id >= v_current_id
  AND id < v_current_id + p_batch;

7️⃣ Recurrence Prevention

  • Single large DELETE is prohibited
  • Partial deletion → Batch / pt-archiver
  • Massive deletion → Copy & Swap
  • Pre-check Foreign Keys / Isolation Level / Metadata Locks

8️⃣ One-Line Summary

Large DELETE operations are a direct path to database outages.
Delete a little → batch it.
Delete a lot → rebuild it.


📎 Appendix — Safe Batch Delete via Stored Procedure

This is the most stable batch processing method for partial deletion scenarios.

DELIMITER $$

CREATE PROCEDURE purge_cq_short_info(IN p_batch INT, IN p_sleep_ms INT)
BEGIN
  DECLARE v_rows INT DEFAULT 1;

  WHILE v_rows > 0 DO
    START TRANSACTION;

    DELETE FROM CQ_SHORT_INFO
    WHERE id IS NOT NULL
    ORDER BY id
    LIMIT p_batch;

    SET v_rows = ROW_COUNT();

    COMMIT;

    IF v_rows > 0 AND p_sleep_ms > 0 THEN
      DO SLEEP(p_sleep_ms / 1000);
    END IF;
  END WHILE;
END$$

DELIMITER ;
CALL purge_cq_short_info(10000, 50);

🛠 마지막 수정일: 2025.12.31

💡 도움이 필요하신가요?
Zabbix, Kubernetes, 그리고 다양한 오픈소스 인프라 환경에 대한 구축, 운영, 최적화, 장애 분석, 광고 및 협업 제안이 필요하다면 언제든 편하게 연락 주세요.

📧 Contact: jikimy75@gmail.com
💼 Service: 구축 대행 | 성능 튜닝 | 장애 분석 컨설팅

📖 E-BooK [PDF] 전자책 (Gumroad): Zabbix 엔터프라이즈 최적화 핸드북
블로그에서 다룬 Zabbix 관련 글들을 기반으로 실무 중심의 지침서로 재구성했습니다. 운영 환경에서 바로 적용할 수 있는 최적화·트러블슈팅 노하우까지 모두 포함되어 있습니다.


💡 Need Professional Support?
If you need deployment, optimization, or troubleshooting support for Zabbix, Kubernetes, or any other open-source infrastructure in your production environment, or if you are interested in sponsorships, ads, or technical collaboration, feel free to contact me anytime.

📧 Email: jikimy75@gmail.com
💼 Services: Deployment Support | Performance Tuning | Incident Analysis Consulting

📖 PDF eBook (Gumroad): Zabbix Enterprise Optimization Handbook
A single, production-ready PDF that compiles my in-depth Zabbix and Kubernetes monitoring guides.

What are your feelings

Updated on 2025-12-31