Link rot and the inevitable heat death of the internet
Introduction
Yesterday I was reading the slides for Maciej Ceglowski’s talk on The Website Obesity Crisis. It’s a really good talk. I highly recommend giving it a read (or a watch). You might be a bit skeptical since it was written in 2015, but I think it is still highly relevant, even to HTML kiddies1 such as myself. Much like all good dystopian works, it identified the beginnings of a trend which has now become such a huge issue that the original work seems almost prophetic.
Anyway, in one of the slides he talks about an experiment by some Adam Drake guy.
Adam Drake wrote an engaging blog post about analyzing 2 million chess games. Rather than using a Hadoop cluster, he just piped together some Unix utilities on a laptop, and got a 235-fold performance improvement over the ‘Big Data’ approach.
It seemed pretty interesting, so I clicked the link and… nothing?
The link just took me to his homepage2.
After a bit of detective work I figured out that the server was just redirecting me to the homepage instead of showing a 404. Specifically, the curl output below shows that any request made to his old domain (aadrake.com) is met with a 301 “Moved Permanently” pointing to the index page of his new domain (adamdrake.com), completely disregarding the actual resource being requested.
Curl output
Emphasis mine.
$ curl -v http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
* Trying 192.64.119.137:80...
* Connected to aadrake.com (192.64.119.137) port 80 (#0)
> GET /command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html HTTP/1.1
> Host: aadrake.com
> User-Agent: curl/8.1.1
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Date: Sun, 22 Oct 2023 21:12:20 GMT
< Content-Type: text/html; charset=utf-8
< Content-Length: 56
< Connection: keep-alive
< Location: https://adamdrake.com
< X-Served-By: Namecheap URL Forward
< Server: namecheap-nginx
<
<a href='https://adamdrake.com'>Moved Permanently</a>.
* Connection #0 to host aadrake.com left intact
He is not the only one doing this. I’ve encountered multiple websites with this 404-is-index strategy and I hate it. It’s the HTTP equivalent of going “what? nah bro i never said that”. Why is your server gaslighting me?
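A check for this pattern could be sketched roughly as follows. This is only an illustration, assuming you already have the status code and Location header from a response (like the ones in the curl output above); the function name is_redirect_to_index is made up for this example.

```python
from typing import Optional
from urllib.parse import urlsplit

# Status codes that tell the client to look somewhere else.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def is_redirect_to_index(requested_url: str, status: int,
                         location: Optional[str]) -> bool:
    """Return True when a redirect throws away the requested path.

    Hypothetical helper: it only compares paths, it makes no request itself.
    """
    if status not in REDIRECT_CODES or not location:
        return False
    requested_path = urlsplit(requested_url).path
    target_path = urlsplit(location).path
    # A specific resource was requested, but the target is the site root.
    return requested_path not in ("", "/") and target_path in ("", "/")
```

Fed the request and response headers from the curl output, this flags the redirect, while a 301 that points at a new location for the same article would pass.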
Link rot
Even then, it is not the end of the world. It is just mildly annoying. I do, however, think that it is part of the much larger issue of link rot. The term “link rot” refers to the tendency for hyperlinks to ‘go bad’ over time, as the resources are relocated or taken offline. It is a kind of digital entropy that is slowly engulfing much of the older/indie web.
The 404-is-index strategy is a particularly bad case of link rot. Unlike a usual 404, which presents the reader with some kind of error page, a 301 happens quietly. For example, many bookmark managers will quietly update their references when they encounter a 301. This means that your bookmarks just disappear because they’ve been ‘moved’ to the website’s index page.
Maciej Ceglowski’s talk is yet another example. His link to aadrake.com was quietly broken, and with that another edge was lost from the huge directed graph that is the open web.
What can I do about it?
Okay, so link rot is bad. How can I avoid further contributing to link entropy? Just to be clear, the “I” in the section title does not refer to you, the reader, and that question before wasn’t rhetorical. I’m not going to pretend to have all the answers, and this most certainly isn’t a how-to guide. Think of it more like a kind of public diary of my feeble attempts to fight my own inevitable breaking of this site.
I have tried to be very careful about the layout of the site, URI-wise. For example, all posts are located at /posts/<glob>.html and I never change the glob. I also try to avoid leaking implementation details in the URL. This is rather easy since this is a static site, but in the future I might add dynamic elements. In that case I’ll try not to introduce extensions like .cgi or .php that couple the URL to the underlying tech stack.
Still, I fear it might not be enough. After all, perfection is the enemy of progress. As I learn and become a better web developer I will surely discover flaws in my original layout. Already, I am growing a tad annoyed at the fact that images and CSS are grouped under /assets/. It’s rather arbitrary to decide that images and CSS are “assets” but HTML isn’t.
Maybe then I can use 301 for good; as resources are relocated I can maintain a list of rewrite rules. Perhaps I could model it as database migrations. In the database world, when one modifies a schema (e.g. adding or removing a column) that change is associated with a bit of code specifying how to go from one schema to another. That way, all the data (existing hyperlinks) remains valid as the schema (site layout) changes.
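To make the migration idea a bit more concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the paths in MIGRATIONS are invented, and resolve and flatten are hypothetical helpers, not part of any real tool. Each migration is a table of moves, applied in order, and flatten collapses the whole history into one old-to-current table so every old URL needs only a single 301.

```python
# Each entry is one "migration": a table of old path -> new path.
# The paths below are hypothetical examples, not this site's real layout.
MIGRATIONS = [
    {"/assets/style.css": "/css/style.css"},
    {"/css/style.css": "/static/style.css"},
]


def resolve(path, migrations=MIGRATIONS):
    """Follow a path through every migration, oldest first."""
    for step in migrations:
        path = step.get(path, path)
    return path


def flatten(migrations):
    """Collapse the history into a single old -> current redirect table."""
    table = {}
    for step in migrations:
        # Re-point earlier entries whose target has moved again.
        table = {old: step.get(new, new) for old, new in table.items()}
        # Record the moves introduced by this step.
        for old, new in step.items():
            table.setdefault(old, new)
    return table
```

The flattened table could then be turned into whatever rewrite rules the server understands. This sketch ignores the awkward case where an old path is later reused for something new.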
There are also more radical approaches like IPFS. In this protocol, resources are addressed by a content ID computed from their content rather than their location. That way, multiple peers can host the same resource, reducing the likelihood of the website ever disappearing or going down. It’s all very smart, but I doubt we can convince everyone to switch protocols just like that.
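The core idea of content addressing can be shown with a toy example. To be clear, this is a simplification and not how IPFS actually encodes its content IDs: here the ID is just a hex SHA-256 digest, and ContentStore is a made-up in-memory stand-in for a peer.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Toy content ID: the SHA-256 digest of the bytes (not a real IPFS CID)."""
    return hashlib.sha256(data).hexdigest()


class ContentStore:
    """Hypothetical peer that stores blobs keyed by their content ID."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        cid = content_id(data)
        self._blobs[cid] = data
        return cid

    def get(self, cid: str) -> bytes:
        data = self._blobs[cid]
        # Anyone can verify the bytes match the address they asked for.
        if content_id(data) != cid:
            raise ValueError("content does not match its ID")
        return data
```

Because the address depends only on the bytes, two independent peers storing the same page end up with the same ID, and it doesn’t matter which one serves it.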
Conclusion
The tl;dr is this: There’s an issue on the web where links tend to ‘go bad’. That is, the location of the underlying resource changes and hyperlinks pointing to the old location are broken. This problem is especially prevalent in the IndieWeb community because we don’t have teams of engineers managing our sites and checking for backwards compatibility.
So far my only solution is “be very careful”, which isn’t a very inspiring conclusion. In the coming weeks I’ll look into setting up some automated tests, so I can be notified when I accidentally change the site layout in such a way that it breaks old links.
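Such a test might look something like the sketch below. It rests on assumptions I haven’t settled yet: that the site is built into a directory of static files, that I keep a list of every path ever published, and that redirects live in a simple old-to-new table. The names published_paths and broken_links are hypothetical.

```python
from pathlib import Path
from typing import Dict, Set


def published_paths(site_root: Path) -> Set[str]:
    """List every file under the build directory as a URL path."""
    return {
        "/" + p.relative_to(site_root).as_posix()
        for p in site_root.rglob("*")
        if p.is_file()
    }


def broken_links(known: Set[str], current: Set[str],
                 redirects: Dict[str, str]) -> Set[str]:
    """Return previously published paths that no longer resolve.

    A path is fine if it still exists, or if it redirects to one that does.
    """
    return {
        path for path in known
        if path not in current and redirects.get(path) not in current
    }
```

The build could fail whenever broken_links returns a non-empty set, and append any new paths to the known list whenever it succeeds.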
1. For lack of a better terminology I’m just going to reuse that of script kiddies.
2. A home that included the phrase “While I specialize in executive advising on leadership and process, I can also dive into deep technical problems with Data Science”, a sentence so douchebag-techbro-y it made me reconsider whether tech was really the industry for me.