Sjoerd Visscher's weblog
Last Update
10/16/2005; 1:30:22 AM
Thursday, January 13, 2005
Still bugs in the implementation of HTML hyperlinks?!
Yes. What's the problem? Same-document references are resolved using the base URI in most browsers. (Only Lynx gets it right!) What does that mean? It means that in most browsers same-document links (like <a href="#identifier">) don't point to another location in the current document, but point to the document linked to in the base href.
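The difference can be sketched with Python's `urllib.parse`. This is only an illustration, not how any browser is actually implemented: the two URIs and both function names are made up for the example, with a cached copy of a page carrying a base href that points back at the original.

```python
from urllib.parse import urldefrag, urljoin

# Hypothetical URIs: a cached copy whose <base href> points at the original page.
DOCUMENT_URI = "http://cache.example/view?id=123"
BASE_HREF = "http://original.example/articles/page.html"


def resolve_with_base(href: str) -> str:
    """Resolve every reference against the base URI (what most browsers do)."""
    return urljoin(BASE_HREF, href)


def resolve_same_document(href: str) -> str:
    """Treat a bare '#fragment' as a location in the current document."""
    if href.startswith("#"):
        # Drop any fragment the document URI already has, then append the new one.
        return urldefrag(DOCUMENT_URI).url + href
    return urljoin(BASE_HREF, href)


print(resolve_with_base("#identifier"))
# http://original.example/articles/page.html#identifier
print(resolve_same_document("#identifier"))
# http://cache.example/view?id=123#identifier
```

With the first function a TOC link in the cached copy leads away to the original page; with the second it stays inside the cached copy.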
Is this rare? Not really. For example, the Google cache adds a base href to a page, so the links in tables of contents are wrong. The Wayback Machine uses base href too, and a base href is added when you save an HTML page. Also, most people don't know of this bug; Anne van Kesteren found an example.
So how did this happen? The history is interesting. Fragment identifiers and the base address, and also their interaction, were already clearly specified in the oldest HTML standard I could find, from November 1992. It says:
This allows for the form HREF=#identifier to refer to another anchor in the same document. If the anchor is in another document, the atribute is a relative name , relative to the documents address (or specified base address if any).
This text ultimately ends up in HTML 2.0 (RFC 1866), section 7.4:
Any characters following a ‘#’ character in a hypertext address constitute a fragment identifier. In particular, an address of the form ‘#fragment’ refers to an anchor in the same document.
In HTML 2.0 a same-document reference is not considered to be a relative URL, so RFC 1808 remains vague about them. Then in 1997 HTML 3.2 becomes a W3C Recommendation. It refers to fragment identifiers only in the context of image maps and does not contain “hyperlink” at all! So at that moment fragment identifiers are essentially left unspecified. Unfortunately this is also the time IE 4 and Netscape 4 are finalised and released.
Half a year later the first HTML 4.0 WD is released. In a section about URLs it says:
The URL specification en vigeur at the writing of this document ([RFC1738]) offers a mechanism to refer to a resource, but not to a location within a resource. The Web community has adopted a convention called "fragment URLs" to refer to anchors within an HTML document.
What used to be clearly specified is now a “convention”. In the next WD version this text is removed, and it says that RFC 1808 specifies fragment identifiers. (It doesn't.) The whole problem seems to be solved when RFC 2396 is published, which specifies fragment identifiers and has a special section about same-document references (4.2). But it's too late for HTML 4.0. Even in the 4.01 update from December 1999 this is not corrected. XHTML 1.0 does reference the new RFC, but that reference is informative. Finally, in 2001, XHTML Modularization makes clear that the href attribute of an a element is a URI as defined in RFC 2396.
By this time, however, the URI code for Mozilla was already written and not much later frozen, which makes fixing the bug a problem. But maybe someone finds this a challenge, and the fixer of this bug can proudly say he or she made the web work the way it was intended more than 12 years ago.
Update: Somehow I completely missed (or forgot about) RFC 2396 bis. It completely turns the same-document reference idea around. Same-document references must be resolved using the base URI, as the current browsers do. But if you click on a link that, except for the fragment identifier part, is the same as the base URI, then you should stay in the current document.
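As I read it, the revised rule comes down to something like the sketch below. The function name `follow_link`, its return values and the example URIs are invented for the illustration; it is not taken from the RFC text or from any browser's code.

```python
from urllib.parse import urldefrag, urljoin


def follow_link(base_uri: str, href: str) -> tuple[str, str]:
    """Resolve href against the base URI, then compare with the base URI itself."""
    target = urljoin(base_uri, href)
    target_doc, fragment = urldefrag(target)
    base_doc, _ = urldefrag(base_uri)
    if target_doc == base_doc:
        # Same as the base URI apart from the fragment: stay in the current document.
        return ("stay", fragment)
    return ("load", target)


base = "http://original.example/page.html"  # hypothetical base href
print(follow_link(base, "#toc"))            # ('stay', 'toc')
print(follow_link(base, "page.html#toc"))   # ('stay', 'toc')
print(follow_link(base, "other.html#toc"))  # ('load', 'http://original.example/other.html#toc')
```

Note that the comparison is against the base URI, not the address the document was retrieved from, so a cached copy with a base href would still keep its TOC links inside the copy.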