
Post Mortem: Why Arc Studio Was Down

tl;dr: A mistake on our end caused an integer primary key in our database to overflow, forcing an unscheduled migration that took the Arc Studio backend down for about 21 hours and caused stress and uncertainty for our users. During that time you could still write in the desktop app and export PDFs, but sync and real-time collaboration were offline. No data was lost. Below, we share our learnings and outline the steps we’re taking to ensure this does not happen again.

On December 4, 2025, at 7:43:30pm CET, our database returned the error message “ERROR: integer out of range”. That was the beginning of an extended outage that lasted until December 5, 4:59pm CET, just over 21 hours later.

In this post, I want to share how this happened and what we’re doing to make sure this never happens again.

But first, I want to apologize: As you will see, this was caused by an oversight on my end that could have been avoided. I know the importance of your writing. It is the result of hours, months, and sometimes years of creative labor. For many of you it is one of your most precious possessions. My first responsibility is to protect that work, and my second is to give you confidence that the technical side of your writing practice is solid and predictable. On that second point I clearly failed last week. I understand the fear and frustration that uncertainty caused, and I’m genuinely sorry.

What didn’t happen

I want to clarify what has NOT happened:

  1. We were not hacked: no data was leaked to people who should not have access to it.
  2. Our database did not lose data — our redundant backup systems worked perfectly.
  3. This was not scheduled: We did not merely fail to communicate an upcoming scheduled maintenance — we had to perform unplanned, emergency maintenance.

What went wrong

We use a database (PostgreSQL) to store not only the contents of your documents, but also the changes that were made to create them (if you’ve ever recovered deleted text through your document history, that’s what makes it possible). A change can be adding a piece of text (usually a couple of sentences, but sometimes multiple pages or just a single character), deleting some text, moving text somewhere else, adding a card to the Plot Board, and so on. Each change receives a number to identify it, the so-called “primary key.” Last week, the two billion, one hundred forty-seven million, four hundred eighty-three thousand, six hundred forty-seventh change was performed.

2147483647 is the largest value that a signed 4-byte integer can hold (2^31 - 1).

Unfortunately, our changes table’s primary key was constrained to 4 bytes, meaning no additional changes could be inserted after that point. Had we realized we were about to run out of integers, we could have switched the primary key to “bigint”, an 8-byte format that allows roughly four billion times as many changes. A relatively minor, planned maintenance.
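
To make that concrete: the fix itself is a single schema change. Here is a minimal sketch, assuming a hypothetical table called changes with an integer id column fed by a sequence called changes_id_seq (these names are illustrative, not our actual schema). On a table this large, PostgreSQL rewrites everything under an exclusive lock, which is why even the “minor” version of this maintenance needs a quiet window:

```python
import psycopg2

# Illustrative only: the table, column, and sequence names are assumptions,
# not our actual schema.
conn = psycopg2.connect("dbname=arcstudio")  # hypothetical connection string
with conn, conn.cursor() as cur:
    # Widening the key is one statement, but PostgreSQL rewrites the whole
    # table under an exclusive lock, so it belongs in a maintenance window.
    cur.execute("ALTER TABLE changes ALTER COLUMN id TYPE bigint;")
    # Sequences are already 8-byte by default, so this is usually a no-op,
    # but it makes the intent explicit.
    cur.execute("ALTER SEQUENCE changes_id_seq AS bigint;")
conn.close()
```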

But we didn’t: we were not monitoring the size of this table, and so we ran out of usable key values.

We had to shut down the system while we recreated the affected tables with the correct primary key type (bigint) and then copied all the data from the old tables to the new ones. This involved multiple terabytes of data, which unfortunately takes a lot of time. Our first estimate was about a week, but fortunately we were able to speed up the process so that we were back within a day.

Impact

From December 4, 2025, 7:43pm to December 5, 4:59pm CET, Arc Studio’s web services were down for all users.

  • People could not sign in, and new users could not sign up.
  • Changes and comments did not sync across devices, so real-time collaboration was unavailable.
  • No collaborators could be invited to your script.
  • No new scripts could be created.
  • The web version of the app could not be accessed.

For most users of the desktop app, the following features remained accessible (though without real-time sync between devices):

  • Editing scripts, Plot Boards/outlines, and notes
  • Exporting to PDF (with minor limitations, such as missing images)

No data was lost.

What we did

December 4, 7:43pm CET: We received an alert that a write to our changes table had failed.

7:53pm: We met on a Zoom call to investigate and explore our options.

By 7:58pm, we had confirmed that we had hit the maximum number for our change counter and started taking action:

  • Opened a support case with AWS (our hosting provider)
  • Stopped the crashing sync service (other services were still running)
  • Investigated both short-term workarounds and different long-term fixes

It became clear there was no safe short-term workaround: to keep saving your changes reliably, we had to move that counter to a larger format. That required a database migration and downtime.

We took the remaining services offline and started an in-place migration. This approach had two big risks:

  1. The duration was unknown (could be hours, days, or longer)
  2. It required a lot of disk space, because all data would be duplicated during the migration

We increased disk size and hardware capacity with AWS to reduce these risks, but after a few hours we still ran out of space and had to stop.

In the early morning, we contacted Cybertec, a PostgreSQL consultancy in our network. Within about an hour, a database expert joined our Zoom call and helped us execute a better recovery plan:

  • Provision a new, more powerful server and load data from the latest backup
  • Create a new table with the correct, larger number format
  • Copy data into the new table in parallel batches so we could speed things up and track progress (sketched below)
  • Rebuild the indexes
  • Swap the old and new tables and bring the service back online
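
For readers curious about the batched copy in the third step, here is a minimal sketch, assuming hypothetical tables changes_old and changes_new keyed by id (our real schema, batch size, and number of parallel workers differ):

```python
import psycopg2

# Hypothetical names: changes_old, changes_new, and the id ranges are illustrative.
BATCH = 1_000_000  # rows per batch; tuned to the hardware in practice

def copy_range(dsn, start_id, end_id):
    """Copy one contiguous key range from the old table into the new bigint-keyed table."""
    conn = psycopg2.connect(dsn)
    cur = conn.cursor()
    for lo in range(start_id, end_id, BATCH):
        hi = min(lo + BATCH, end_id)
        cur.execute(
            "INSERT INTO changes_new SELECT * FROM changes_old "
            "WHERE id >= %s AND id < %s",
            (lo, hi),
        )
        conn.commit()  # commit per batch so progress is durable and easy to track
    cur.close()
    conn.close()

# Several workers, each handed a disjoint (start_id, end_id) range, run this in
# parallel processes; indexes on changes_new are rebuilt only after the copy.
```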

This new plan worked smoothly. The sheer volume of data meant the recovery still took many hours, but not the days or weeks we had initially feared.

By around 5pm CET on December 5, the service was fully restored and live again.

Communication

During the outage, we tried to strike a balance between keeping you informed and focusing on getting the service back online as quickly as possible. Looking back, while we weren’t far off, I don’t think we got that balance right.

First, we should have been much clearer that this was unscheduled emergency work. Some people understandably assumed that this was planned maintenance that we had simply failed to announce in advance. That’s on us. This was not a case of forgetting to send an email – it was a genuine incident that forced us to take the service down.

Second, several of you told us that you would have preferred more frequent updates, even if those updates were short and uncertain. I completely get that. It is uncomfortable to send a message that says “we don’t know how long this will take” when you’re in the middle of trying to fix things, but from your perspective, silence is worse. You’re trying to decide whether to wait, switch tools, or rearrange your day, and you need information to make that call.

In future incidents, even if we don’t yet have a clear timeline, we will err on the side of communicating more often, stating clearly what we know, what we don’t know yet, and when you can expect the next update.

What went well

Even though this was a serious incident, a few things worked exactly the way they were supposed to, and they made a big difference.

Internally, communication was fast and clear. The right people were pulled in quickly, decisions were made without delay, and everyone had a shared understanding of the problem and the plan. There was no confusion about who was responsible for what, which meant all of our energy went into fixing the issue.

Our playbook for bringing in external help also paid off. We opened a line with our infrastructure provider early, and when it became clear we needed deep PostgreSQL expertise, we contacted Cybertec consulting. Within an hour, a specialist was on a Zoom call with us, helping design and validate the recovery plan that got us back online much faster than our initial estimates.

Previous investments in the product also softened the impact. The desktop app is built with strong offline support, so while real-time collaboration and sync were affected, most people were still able to keep writing and working on their scripts locally. That doesn’t make the outage acceptable, but it meant your work didn’t grind to a complete halt.

We had also invested heavily in redundancy and backups long before this incident. That part worked perfectly. There was never a moment where we were worried about losing your data; the only question was how quickly we could restore full service. Knowing that allowed us to focus on the safest and most robust path to recovery, rather than desperate attempts to “save” data.

Finally, it’s worth noting that this was our first extended outage. That doesn’t excuse it, but it does highlight that the overall system is generally robust. The lesson from this incident is not that everything is fragile, but that a specific, preventable limit was overlooked. Fixing that, and tightening our processes around it, makes the service stronger going forward.

Where we need to improve

This outage did uncover a few areas worth improving:

First, we need better monitoring of our data itself, not just of the state of our servers and our product (where we already track crashes and issues with specific features). The root cause here was a counter quietly approaching its limit. That should have been visible as a clear, early warning. We’re putting monitoring and alerts in place that track the growth of critical tables and warn us long before any limit becomes a problem.
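
As an illustration of the kind of alert we mean, here is a minimal sketch that reads PostgreSQL’s pg_sequences view and warns when a sequence has consumed most of the 4-byte integer range. The threshold and the use of print instead of a real pager are placeholders, and a production check would also confirm which columns are actually 4-byte integers:

```python
import psycopg2

INT4_MAX = 2**31 - 1  # 2147483647, the limit we hit
THRESHOLD = 0.80      # placeholder: warn once 80% of the range is used

def check_sequences(dsn):
    """Warn when any sequence has consumed most of the 4-byte integer range."""
    conn = psycopg2.connect(dsn)
    with conn.cursor() as cur:
        cur.execute(
            "SELECT schemaname, sequencename, last_value "
            "FROM pg_sequences WHERE last_value IS NOT NULL"
        )
        for schema, name, last_value in cur.fetchall():
            used = last_value / INT4_MAX
            if used >= THRESHOLD:
                # A real check would page someone and would also confirm that the
                # column the sequence feeds is still a 4-byte integer.
                print(f"WARNING: {schema}.{name} is at {used:.0%} of the int4 range")
    conn.close()
```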

Second, we need to adapt our database to the amount of data we’re now storing. Some of our tables have grown very large over time, and we haven’t yet broken them up in the way bigger systems typically do. We’re going to start partitioning these large tables so they are easier to manage, scale, and maintain safely as we grow.
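
For the curious, this is roughly what declarative partitioning looks like in PostgreSQL. The table definition, columns, and quarterly boundaries below are made up for illustration and are not our actual plan:

```python
import psycopg2

# Illustrative only: the table name, columns, and quarterly boundaries are placeholders.
DDL = """
CREATE TABLE changes_partitioned (
    id          bigint NOT NULL,
    document_id bigint NOT NULL,
    payload     jsonb,
    created_at  timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (id, created_at)  -- the partition key must be part of the primary key
) PARTITION BY RANGE (created_at);

-- One partition per quarter keeps each physical table a manageable size.
CREATE TABLE changes_2025_q4 PARTITION OF changes_partitioned
    FOR VALUES FROM ('2025-10-01') TO ('2026-01-01');
CREATE TABLE changes_2026_q1 PARTITION OF changes_partitioned
    FOR VALUES FROM ('2026-01-01') TO ('2026-04-01');
"""

conn = psycopg2.connect("dbname=arcstudio")  # hypothetical connection string
with conn, conn.cursor() as cur:
    cur.execute(DDL)
conn.close()
```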

Third, we want an external pair of eyes on our setup on a regular basis. We’re planning recurring audits with specialized database consultants to review our configuration, data growth, and risk areas. The goal is to catch blind spots early, rather than only discovering them when something fails.

On the product side, we also saw a few isolated weak spots in offline support. While most people were able to keep working locally, a small subset of users had trouble getting back to their scripts after the initial crash. That’s not acceptable. We will be fixing these rough edges so that if the server goes down, the desktop app behaves predictably and your work remains easily accessible.

Finally, we want to make it possible to start new scripts even while completely offline. Today, offline mode is good for continuing existing work, but not as smooth for starting something new. In a future incident, you should still be able to open the app and begin writing, regardless of the state of the servers.

Mitigation plan

The primary key migration is already done. That said, we’re adjusting our processes to make sure this kind of incident doesn’t happen again, and to reduce the impact if anything does go wrong in the future:

First, we’re putting proper monitoring in place for data growth and limits, and setting up regular audits so that an external expert reviews our system on a recurring basis. The goal is to catch problems while they are still just trends in a dashboard, not active outages.

Second, we will use our dedicated staging environment to run simulated outages and failure scenarios to test how the app behaves, especially around offline support and edge cases. Instead of discovering weaknesses during a live incident, we want to discover them in controlled tests and fix them ahead of time.
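
As one sketch of what such a drill could look like, assuming the staging database runs in a Docker container (the container name and the duration are placeholders), a small script can take the database away for a fixed window while the team verifies that the desktop app keeps behaving well offline:

```python
import subprocess
import time

CONTAINER = "staging-postgres"  # placeholder name for the staging database container
OUTAGE_MINUTES = 15             # placeholder duration for the drill

def simulate_database_outage():
    """Pause the staging database, hold the outage window, then resume it."""
    subprocess.run(["docker", "pause", CONTAINER], check=True)
    print(f"Staging database paused for {OUTAGE_MINUTES} minutes: "
          "verify offline editing, PDF export, and error messages in the app.")
    time.sleep(OUTAGE_MINUTES * 60)
    subprocess.run(["docker", "unpause", CONTAINER], check=True)
    print("Staging database resumed: verify that queued changes sync cleanly.")

if __name__ == "__main__":
    simulate_database_outage()
```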

As these changes roll out, we’ll treat them as ongoing work, not a one-time patch. The system will continue to evolve as we grow, and we’ll keep investing in monitoring, testing, and outside review to stay ahead of issues like this.

Thanks

Finally, I want to say thank you.

To the team, for working late, handling a stressful live migration, and keeping communication going while also fixing the problem. Incidents like this are intense, and everyone showed up fully.

To Cybertec, for jumping in quickly and bringing exactly the kind of deep, practical expertise we needed. Having a consultant in our Zoom within an hour, with clear guidance and a realistic plan, made a huge difference.

And to you, our customers. Many of you were understandably stressed and frustrated, but still patient, supportive, and kind in your messages. You trusted us to get your work back online, even as we were the ones who caused the disruption in the first place. We don’t take that for granted, and it’s a big part of why we’re determined not to put you in this position again.

December 9, 2025

Michi Huber, Founder Arc Studio
