This Article is About Rabbits

2023-10-31

I want you to imagine, just for a second, that I had written a breathtaking piece about rabbits. Right here in this very blog post were the most exceptional rabbit facts you had ever seen. They were novel. They were insightful. They were downright inspired. Being the normal internet citizen you are, you decided to share the article with your rabbit-loving friends. “Here!” you say into your legions of followers, “feast your eyes on this!”

Right after you post the link, I pull a bait and switch. Suddenly, a post previously containing nothing but fun rabbit facts turns into a tutorial on how to go rabbit hunting, or how to best cook a rabbit for a delectable dinner. The link stays the same, and there is some semblance of truth to the original intent maintained; after all, the article is still all about rabbits.

This isn’t just a fairy tale. Dynamic content, meaning content that rapidly changes from one moment to the next, is taking shape in many aspects of the internet. Whether it’s for A/B testing, page deletion, or a change in site ownership, it is no longer safe to assume that what we link to will be there when we come back to it. The timeframe in which we have to worry about link rot varies depending on the source and the cause of the change, but the saying “the internet is forever” doesn’t really ring true on the modern web.

Wait, I swear that wasn’t what it was before

I read the New York Times daily, usually first thing in the morning, and usually through the app. Often I’ll do another quick skim over my lunch break to see if there have been any big news stories during the morning office hours. To my surprise, I started to notice that headlines (particularly in the opinion section) would be different than I had remembered from that morning. At first, I thought I needed more coffee, but soon after I began to wonder whether A/B tests were being run on the headlines in the hours immediately following print, with the results informing which one stuck throughout the day. For those unfamiliar with the concept, A/B testing is the process of randomly assigning users to one of two different versions of an app or website and then gathering data on which version meets an objective better. In the case of the New York Times, this objective would presumably be readership.

Sure enough, the Times is doing exactly that, as outlined in When a Headline Makes Headlines of Its Own, an article they did about changing headlines between the digital and print versions of an article.

The Times also makes a practice of running what are called A/B tests on the digital headlines that appear on its homepage: Half of readers will see one headline, and the other half will see an alternative headline, for about half an hour. At the end of the test, The Times will use the headline that attracted more readers.
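As a rough illustration of the mechanics described in that quote, here is a minimal Python sketch of a two-headline test. It is not how the Times actually implements its tests; the function names, the hash-based 50/50 split, and the use of click-through rate as the measure of "attracted more readers" are all assumptions made for the example.

```python
import hashlib


def assign_variant(reader_id: str, test_name: str = "headline-test") -> str:
    """Place a reader in group 'A' or 'B' (hypothetical helper).

    Hashing the reader id gives a stable, roughly even split: the same
    reader always sees the same headline for the duration of the test.
    """
    digest = hashlib.sha256(f"{test_name}:{reader_id}".encode("utf-8")).digest()
    return "A" if digest[0] % 2 == 0 else "B"


def pick_winner(views: dict, clicks: dict) -> str:
    """Return the variant with the higher click-through rate.

    Assumption: "attracted more readers" is measured as clicks / views.
    A tie keeps variant 'A'.
    """
    def rate(variant: str) -> float:
        shown = views.get(variant, 0)
        return clicks.get(variant, 0) / shown if shown else 0.0

    return max(("A", "B"), key=rate)


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    views = {"A": 1000, "B": 1000}
    clicks = {"A": 80, "B": 95}
    print("Reader 42 sees headline", assign_variant("reader-42"))
    print("Headline kept after the test:", pick_winner(views, clicks))
```

A real test would also check whether the difference between the two rates is statistically meaningful before declaring a winner; this sketch skips that step.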

Now, the Times is by no means malicious in switching the headlines, and the overall gist from a glance at the headline will almost always remain. Yet, there still seems to be something off-putting about the paper of record being so fluid. What if, by pure happenstance, I really liked the first title but loathed the second? Would I feel justified in complaining that the version I originally shared is no longer titled the same? Would things be different if the entire article were rewritten and optimized through the same testing, through Google’s AI tool trained to write engaging news stories?

What if the tables were turned, and it was now the readers who were reshaping the headline of the article without any input from the creator? That is exactly what artifact.news seeks to accomplish, using AI to rewrite clickbait headlines into something more informative.

Presumably, if the people an article was shared with were not familiar with the markings that Artifact uses to denote that a headline has been altered after publication, or if they compared the version shared with them against the original publisher’s version, confusion and perhaps some chaos would ensue. Given that many don’t read an article before sharing it, the headline may be all that is left to aid our common understanding of the world around us.

[…] research analyzing sharing and reading of online news articles reveals that more than half of article links shared or reshared on Twitter have never been clicked or read at all (Gabielkov et al., 2016).

404 page not found

Just in the process of researching this article, I came across two 404 errors. Not only are they annoying from a user’s perspective when browsing through search, but they also represent potentially hundreds to thousands of links that may now be broken. A key point that a great researcher wrote is now no longer backed up by the evidence which it drew upon. Links are incredibly hard to maintain, and new standards for persistent identifiers are being developed and proposed all the time. Rather than a technical problem, it is actually a human problem; we like to improve upon things (the boxtickers of the BS jobs hypothesis) and this tends to make things on the internet flaky.

Ahrefs estimates that at least 66.5% of the links on the internet have succumbed to link rot^[1], and thus will not be returned when queried. Even worse, they find that the larger sites are more likely to experience link rot. As the web gets more consolidated, this means that there will eventually be more and more pages that are kept around but link to nowhere.

Link hijacking

Let’s say that we accept a less permanent world, in which links are allowed to slowly fade gently into the good night. Here we are presented with a new problem: link hijacking. This concern applies mostly to domain names, and less to the pages contained within. If a malicious entity were to take over a small or medium-sized domain (any very large ones like wsj.com would be noticed almost instantly) and replace content en masse, then those who linked to the site for images or videos would have a rude awakening. If you link to something, you are placing trust in that link to remain intact and deliver the same content you originally intended. Changing links can move from bothersome to worrisome in this category, and may allow malicious code to enter your server or site.
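One way to make that trust a little more explicit is to record a fingerprint of the content at the moment you link to it, then compare later. The sketch below assumes you already have the bytes of the image, video, or script in hand; the function names are hypothetical, and SHA-256 is simply one reasonable choice of hash.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 hex digest of the linked content."""
    return hashlib.sha256(content).hexdigest()


def has_changed(content: bytes, recorded_fingerprint: str) -> bool:
    """True if the content no longer matches what was originally linked."""
    return fingerprint(content) != recorded_fingerprint


if __name__ == "__main__":
    original = b"Ten fun facts about rabbits"
    recorded = fingerprint(original)  # stored when the link is created

    later = b"Ten delectable ways to cook a rabbit"
    print("Unchanged copy flagged?", has_changed(original, recorded))
    print("Swapped copy flagged?", has_changed(later, recorded))
```

This works best for static assets. A page that legitimately changes a timestamp or an ad on every load would trip the check each time, so it is a tripwire rather than a cure.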

Ways to prevent link rot

If, after reading all of this, you are thinking “wow, I never want to trust the internet again”, there are some tools at your disposal. You could use an archiving service such as archivebox.io to store exact copies of webpages to be able to reference or share at a later time, exactly as you originally found them. Late Night Linux (a podcast) covered this yesterday, and highlighted how it may be useful for journalists in this crazy internet era.

Or, perhaps you are a prepper for the end of the world and want to make sure you still have amazing blog posts (like this one) to read when the computers all stop talking to each other.

For the rest of you, please link and share to your heart’s desire. I pinky swear to keep this post about rabbits…

[1] Note that this statistic is under a section titled “404 Page Not Found”; however, not all the errors they found were technically linked to the HTTP 404 error. For a full breakdown of the various error codes that went into that 66.5% statistic, please refer to the original blog post.