RCA – Federation Issues
Terminology
- SSL -- Secure Sockets Layer - older technology for securing a connection, but now commonly used to refer to TLS
- TLS -- Transport Layer Security - the current technology for securing a connection
- Certificate Authority -- a company which Google, Mozilla, Microsoft, Apple and Oracle have collectively decided can be trusted to issue a Certificate - common ones include Let's Encrypt, Comodo and (formerly) StartCom
- Root Certificate -- a Certificate which represents the foundation of a Certificate Authority, which can in turn issue other Certificates or be used to create an Intermediate Certificate
- Intermediate Certificate -- a Certificate issued by a Certificate Authority, which is itself able to issue Certificates. Imagine an Ultra-Obama giving an Intermediate Obama a special award which allows Intermediate Obama to go out and give regular Obamas awards on Ultra-Obama's behalf
- Certificate -- a virtual piece of paper held by a computer, usually a web server or load-balancer, which uniquely identifies it, and which has been given to it by a Certificate Authority
- Certificate Chain -- the hierarchy by which an individual Certificate can be traced back to a Certificate Authority's Root Certificate, usually including one or more Intermediate Certificates between them. This is included with the web server's Certificate when a TLS connection is formed, because the website visitor may trust the Certificate Authority's Root Certificate without explicitly knowing about the Intermediate Certificates. Imagine each Intermediate CA as a chain link, with one end of the chain being the Certificate, and the other end being the Root Certificate
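To make the chain idea concrete, here is a minimal sketch in Python that follows issuer links from a Certificate back to a trusted Root Certificate. It is a toy: real TLS clients verify cryptographic signatures rather than comparing names, and the `Cert` type, the `trace_chain` function and the "Example Root" label are all hypothetical names made up for illustration.

```python
from typing import List, NamedTuple, Optional, Set


class Cert(NamedTuple):
    subject: str  # who the certificate identifies
    issuer: str   # who issued (signed) it


def trace_chain(leaf: Cert, sent_by_server: List[Cert],
                trusted_roots: Set[str]) -> Optional[List[str]]:
    """Follow issuer links from the leaf to a trusted root.

    Returns the list of names visited, or None if a link is missing.
    Name matching only; a real client checks signatures.
    """
    known = {cert.subject: cert for cert in sent_by_server}
    path = [leaf.subject]
    current = leaf
    # Bounded loop so a circular chain cannot spin forever.
    for _ in range(len(sent_by_server) + 1):
        if current.issuer in trusted_roots:
            path.append(current.issuer)
            return path
        next_cert = known.get(current.issuer)
        if next_cert is None:
            return None  # the server never sent this link
        path.append(next_cert.subject)
        current = next_cert
    return None


# Illustrative data only: "Example Root" is a placeholder, not a real root.
leaf = Cert("queer.party", "Let's Encrypt R3")
matching = [Cert("Let's Encrypt R3", "Example Root")]
mismatched = [Cert("Let's Encrypt Authority X3", "Example Root")]
```

With this data, `trace_chain(leaf, matching, {"Example Root"})` finds a complete path, while the `mismatched` list leaves a gap between the Certificate and the Root Certificate, so the function returns `None`.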
Summary of issue
queer.party citizens were not receiving normal messages from elsewhere in the Fediverse, and the rest of the Fediverse was unable to reach queer.party.
Timeline of events
Initial cause
Some time ago, the Let's Encrypt certificate authority announced that they had been operating for long enough that their own root certificate was trusted by all of the companies who are big enough to dictate to the entire world which certificate authorities are trusted. As a result, they were retiring their previous intermediate certificate, which had been temporarily issued by an existing trusted certificate authority (specifically IdenTrust), and would begin issuing certificates from a new intermediate certificate authority at the beginning of 2021, which would lead back only to their own root certificate.
Before this happened, however, Let's Encrypt also needed to create a new intermediate certificate, "Let's Encrypt R3", to replace their previous "Let's Encrypt Authority X3" certificate, which had issued the vast majority of Let's Encrypt certificates in common usage. This was done because the "X3" certificate was expiring, and they were not ready to switch to the new intermediate, named "Let's Encrypt E1". The "R3" intermediate certificate was announced on the 24th of November, 2020, with an ETA of "could be right now, could be in a month".
On the 2nd of December, 2020, Let's Encrypt announced that they had switched to the "R3" intermediate certificate, and patted themselves on the back over what a good job they did.
Where problems arose
On the 3rd of December, 2020, the certificate for the queer.party subdomain under which all media is served, content.queer.party, expired and was manually renewed. This happens once in a blue moon, and I've yet to identify why the automatic renewal sometimes fails. Because content.queer.party is accessed mostly by actual people using web browsers, the initial problem was not noticed.
Where problems were noticed
On the 6th of December, 2020, the certificate for queer.party itself was automatically renewed. Because queer.party itself is accessed frequently by both actual people and by the rest of the Fediverse, things began going wrong and queer.party citizens noticed that federation was broken.
"Federation is broken" in this case meaning "I'm not seeing new toots from other parts of the Fediverse" and "I can't see queer.party communications from my (non queer.party) instance".
The Usual Suspects (1995)
Normally when issues with federation arise, the culprit is that the tasks required for federation to happen are piling up in their queues, because the processors that handle them have gotten stuck in a way that prevents them from being automatically restarted.
This time there didn't seem to be any tasks piled up in the queues, which I thought was Weird But Okay. I manually restarted the task processors anyway, and waited to see if posts started trickling in on the federated timeline.
The thing is, baby, the game was rigged from the start
Federation wasn't fixed, and I had no idea why. I thought, "okay, restarting everything didn't help, let's try some random online ActivityPub tester". The tester failed to speak ActivityPub to queer.party, because it didn't recognise the certificate it got back as trusted. Weird flex but okay. I figured this was a red herring, because I tried loading queer.party myself and It Works For Me™.
Wondering if something weird was going on with remote access to the server, however, I loaded up Qualys SSL Labs and ran a test. Not because I thought something was wrong with the certificate or its configuration, but because it's the first thing I thought of that would tell me if it could speak to the server.
The test finished, and I didn't see a big red message that it couldn't speak to the server. Cool and normal, I went to close the tab, but noticed the security grade was sitting at B. Uncool and abnormal. queer.party (and all other sites I host) usually get an A+, because my web servers are overachievers. I scrolled down, and saw that the SSL Labs test didn't get the right certificate chain from the server. Super weird, but whatever. I restarted the load-balancer, which handles TLS on behalf of everything else, and re-ran the test.
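For anyone wanting to reproduce the "works in my browser, fails for everyone else's server" situation, here is a minimal sketch using only Python's standard library. Python's `ssl` module validates the chain the server actually sends against the local trust store and does not go looking for missing intermediates, whereas many browsers cache or fetch them. That difference is one plausible reason a browser can be happy while a stricter client is not. The function name `check_tls` is made up for this example.

```python
import socket
import ssl


def check_tls(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Try a strict TLS handshake and return a short description of the result."""
    # Default context: system trust store, hostname checking enabled.
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return f"ok ({tls.version()})"
    except ssl.SSLCertVerificationError as err:
        # Raised when the chain cannot be traced back to a trusted root.
        return f"certificate not trusted: {err.verify_message}"
    except OSError as err:
        # Covers refused connections, timeouts and other TLS errors.
        return f"connection failed: {err}"
```

Calling `check_tls("queer.party")` from a machine other than the server itself gives roughly the view a remote instance gets, without a browser's helpful habits getting in the way.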
The Gang Restarts HAProxy
The test still failed. My befuddlement was immeasurable and my evening was ruined. I began reissuing all my certificates and performing Google searches such as "lets encrypt r3 issue" and "despacito 2 activitypub exclusive", tweaking TLS parameters and restarting the load-balancer each time I did anything.
I deleted every rule, condition, backend, frontend, real-server and health-check I could find that was no longer in use. I wrote regular expressions. Nothing worked.
Bapzingo
Crestfallen and confused, I deleted all Let's Encrypt-related certificates from the load-balancer, and reissued the certificates again. I checked the load-balancer configuration and re-added all the certificates (because, of course, they had been deleted), and restarted the load-balancer again.
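A quick sanity check after re-adding certificates is to confirm that each bundle actually contains chain certificates as well as the site's own certificate. The sketch below assumes the load-balancer reads a combined PEM file per site, as HAProxy can, and simply counts the certificate blocks in it. The function name `describe_bundle` is hypothetical, and counting blocks says nothing about whether the intermediate is the right one.

```python
import re

# Matches one PEM-encoded certificate block, including its markers.
CERT_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.+?-----END CERTIFICATE-----",
    re.DOTALL,
)


def describe_bundle(pem_text: str) -> str:
    """Summarise how many certificates a PEM bundle contains."""
    count = len(CERT_BLOCK.findall(pem_text))
    if count == 0:
        return "no certificates found"
    if count == 1:
        return "leaf only: no intermediate would be sent"
    return f"leaf plus {count - 1} chain certificate(s)"
```

Reading a bundle from disk with `open(path).read()` and passing the text to `describe_bundle` is enough to spot a file that holds only the site's own certificate.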
Bronchitis
Shadow Creeper of The Big Bing Query
It was working. As a bonus, queer.party now supports TLS 1.3.
Thank you for coming to my Ted Talk.