Makerble Disaster Recovery and Backup Strategy

At Makerble, our commitment to data security extends beyond regular operations to ensure that we can quickly recover from unforeseen incidents. This Disaster Recovery and Backup Strategy outlines our approach to minimizing service disruptions, safeguarding data, and restoring systems efficiently.

1. Backup Protocol

We maintain a robust backup system to prevent data loss and ensure data availability during recovery. Our protocol includes:

Daily Backups: All environments (staging and production) are backed up daily to AWS S3 and OneDrive.
Cold Storage: Every six months, critical data is transferred to external hard drives for long-term, secure storage.
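
As an illustration only, the schedule above could be expressed in a few lines of Python. This is a minimal sketch, not our production tooling: the object-key layout (`backups/<environment>/<date>.tar.gz`) and the function names are hypothetical, and "every six months" is assumed to mean six calendar months after the previous transfer.

```python
import calendar
from datetime import date

# Environments named in the backup protocol above.
ENVIRONMENTS = ("staging", "production")


def daily_backup_key(environment: str, day: date) -> str:
    """Build a storage key for one day's backup (hypothetical naming scheme)."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment}")
    return f"backups/{environment}/{day.isoformat()}.tar.gz"


def add_months(d: date, months: int) -> date:
    """Return d shifted forward by whole calendar months, clamping the day."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def cold_storage_due(last_transfer: date, today: date) -> bool:
    """True once six calendar months have passed since the last transfer."""
    return today >= add_months(last_transfer, 6)
```

For example, `cold_storage_due(date(2024, 1, 15), date(2024, 7, 15))` evaluates to `True`, while the same check one day earlier evaluates to `False`.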

Figure: Backup workflow.

2. Recovery Process

In the event of a service disruption or disaster, our priority is to restore functionality swiftly while minimizing data loss. The recovery process includes:

Detection and Alerts: We continuously monitor for unusual activity using Wazuh and Cloudflare. Wazuh focuses on security and vulnerability detection rather than service uptime, so uptime is monitored separately with Uptime Kuma and Better Uptime, which notify our response teams via Slack when downtime occurs. This two-layer approach lets us respond quickly to both security threats and service disruptions.
Data Restoration: In a critical incident, we restore affected systems from verified backups stored in AWS S3 and OneDrive. Each backup is checked for integrity before any system is brought back online. Because backups run daily, the maximum expected data loss is 24 hours. This limits disruption while protecting data integrity.
System Isolation: Compromised systems are isolated from the network to prevent the spread of any issues while we initiate the recovery process.
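
The integrity check and the 24-hour data-loss limit described above can be sketched in Python as follows. This is a hedged illustration, not our actual restoration tooling: it assumes that a SHA-256 checksum was recorded when each backup was taken, and the function names are hypothetical.

```python
import hashlib
from datetime import datetime, timedelta

# Maximum expected data loss stated in the recovery process above.
MAX_DATA_LOSS = timedelta(hours=24)


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large backups need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_is_intact(path: str, expected_sha256: str) -> bool:
    """Compare the file's hash with the checksum recorded at backup time
    (the existence of such a recorded checksum is an assumption)."""
    return sha256_of_file(path) == expected_sha256.lower()


def data_loss_window(last_backup_at: datetime, incident_at: datetime) -> timedelta:
    """Time between the last good backup and the incident."""
    if incident_at < last_backup_at:
        raise ValueError("incident precedes the last backup")
    return incident_at - last_backup_at


def within_expected_loss(last_backup_at: datetime, incident_at: datetime) -> bool:
    """True if restoring this backup loses no more than 24 hours of data."""
    return data_loss_window(last_backup_at, incident_at) <= MAX_DATA_LOSS
```

A restore would proceed only if `backup_is_intact(...)` returns `True`. `within_expected_loss(...)` indicates whether the chosen backup keeps the loss inside the 24-hour window.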

3. Incident Response and Containment

When an incident is detected, the following steps are taken:

Isolation: Affected systems are immediately disconnected from the network to contain potential damage.
Access Management: User accounts with suspicious activity are disabled, and password resets are enforced for all potentially compromised accounts.
Communication: Relevant teams and stakeholders are informed of the incident, with updates provided as the recovery progresses.
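
The access management step above can be illustrated with a short Python sketch. The `Account` fields and the action names are hypothetical and stand in for whatever the identity system actually exposes. The sketch only plans actions and does not perform them.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Account:
    """Hypothetical view of a user account during an incident."""
    username: str
    suspicious_activity: bool
    potentially_compromised: bool


def plan_access_actions(accounts: Iterable[Account]) -> Dict[str, List[str]]:
    """Map each affected username to the containment actions it needs."""
    plan: Dict[str, List[str]] = {}
    for account in accounts:
        actions: List[str] = []
        if account.suspicious_activity:
            # Accounts with suspicious activity are disabled.
            actions.append("disable")
        if account.potentially_compromised:
            # Potentially compromised accounts get an enforced password reset.
            actions.append("force_password_reset")
        if actions:
            plan[account.username] = actions
    return plan
```

Accounts that are neither suspicious nor potentially compromised do not appear in the plan, which keeps the resulting list limited to accounts that need attention.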

4. Recovery Testing and Audits

We run recovery drills that simulate failure scenarios, including data loss and system outages. Each drill is followed by an audit that assesses the effectiveness of the response, identifies gaps, and sets out improvements to implement. Audit results are documented and shared internally to inform continuous improvement.

5. Continuous Improvement

We revise our disaster recovery plan on an ongoing basis. We review incidents, feedback, and industry standards to update and refine our strategy so that it remains effective and aligned with best practices.