The History of the URL: Path, Fragment, Query, and Auth

URLs were never intended to be what they’ve become: an arcane way for a user to identify a site on the Web. Unfortunately, we’ve never been able to standardize URNs, which would give us a more useful naming system. Arguing that the current URL system is sufficient is like praising the DOS command line, and stating that most people should simply learn to use command line syntax. The reason we have windowing systems is to make computers easier to use, and more widely used. The same thinking should lead us to a superior way of locating specific sites on the Web.

— Dale Dougherty 1996

There are several different ways to understand the ‘Internet’. One is as a system of computers connected using a computer network. That version of the Internet came into being in 1969 with the creation of the ARPANET. Mail, files and chat all moved over that network before the creation of HTTP, HTML, or the ‘web browser’.

In 1992 Tim Berners-Lee created three things, giving birth to what we consider the Internet: the HTTP protocol, HTML, and the URL. His goal was to bring ‘Hypertext’ to life. Hypertext at its simplest is the ability to create documents which link to one another. At the time it was viewed more as a science fiction panacea, to be complemented by Hypermedia, and any other word you could add ‘Hyper’ in front of.

The key requirement of Hypertext was the ability to link from one document to another. In TBL’s time though, these documents were hosted in a multitude of formats and accessed through protocols like Gopher and FTP. He needed a consistent way to refer to a file which encoded its protocol, its host on the Internet, and where it existed on that host.

At the original World-Wide Web presentation in March of 1992 TBL described it as a ‘Universal Document Identifier’ (UDI). Many different formats were considered for this identifier:

protocol: aftp host: xxx.yyy.edu path: /pub/doc/README
PR=aftp; H=xx.yy.edu; PA=/pub/doc/README;
PR:aftp/xx.yy.edu/pub/doc/README
/aftp/xx.yy.edu/pub/doc/README

This document also explains why spaces must be encoded in URLs (%20):

The use of white space characters has been avoided in UDIs: spaces are not legal characters. This was done because of the frequent introduction of extraneous white space when lines are wrapped by systems such as mail, or sheer necessity of narrow column width, and because of the inter-conversion of various forms of white space which occurs during character code conversion and the transfer of text between applications.

What’s most important to understand is that the URL was fundamentally just an abbreviated way of referring to the combination of scheme, domain, port, credentials and path which previously had to be understood contextually for each different communication system.

It was first officially defined in RFC 1738, published in 1994.

scheme:[//[user:password@]host[:port]][/]path[?query][#fragment]

This system made it possible to refer to different systems from within Hypertext, but now that virtually all content is hosted over HTTP, it may not be as necessary anymore. As early as 1996, browsers were already inserting the http:// and www. for users automatically (rendering any advertisement which still contains them truly ridiculous).
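That grammar survives essentially intact today: modern browsers and runtimes expose each component through the WHATWG URL API. A quick sketch in TypeScript (the example URL and its values are invented for illustration):

const url = new URL("http://zack:shhhhhh@zack.is:8080/history?q=url#bio");

url.protocol; // "http:"
url.username; // "zack"
url.password; // "shhhhhh"
url.hostname; // "zack.is"
url.port;     // "8080"
url.pathname; // "/history"
url.search;   // "?q=url"
url.hash;     // "#bio"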

Path

I do not think the question is whether people can learn the meaning of the URL, I just find it morally abhorrent to force grandma or grandpa to understand what, in the end, are UNIX file system conventions.

— Israel del Rio 1996

The slash-separated path component of a URL should be familiar to any user of any computer built in the last fifty years. The hierarchical filesystem itself was introduced by the MULTICS system. Its creator, in turn, attributes it to a two-hour conversation he had with Albert Einstein in 1952.

MULTICS used the greater than symbol (>) to separate file path components. For example:

>usr>bin>local>awk

That was perfectly logical, but unfortunately the Unix folks decided to use > to represent redirection, delegating path separation to the forward slash (/).

Snapchat the Supreme Court

Wrong. We are I now see clearly disagreeing. You and I.

...

As a person I reserve the right to use different criteria for different purposes. I want to be able to give names to generic works, AND to particular translations AND to particular versions. I want a richer world than you propose. I don’t want to be constrained by your two-level system of “documents” and “variants”.

— Tim Berners-Lee 1993

One half of the URLs referenced by US Supreme Court opinions point to pages which no longer exist. If you were reading an academic paper in 2011, written in 2001, you have better than even odds that any given URL won’t be valid.

There was a fervent belief in 1993 that the URL would die, in favor of the ‘URN’. The Uniform Resource Name is a permanent reference to a given piece of content which, unlike a URL, will never change or break. Tim Berners-Lee first described the “urgent need” for them as early as 1991.

The simplest way to craft a URN might be to simply use a cryptographic hash of the contents of the page, for example: urn:791f0de3cfffc6ec7a0aacda2b147839. This method doesn’t meet the criteria of the web community though, as it wasn’t really possible to figure out who to ask to turn that hash into a piece of real content. It also didn’t account for the format changes which often happen to files (compressed vs uncompressed for example) which nevertheless represent the same content.
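To make that concrete, here is a minimal sketch of minting such a hash-based URN, in TypeScript using Node’s crypto module (MD5 is chosen here only because it matches the 32-hex-digit example above):

import { createHash } from "node:crypto";

// A content-addressed name: identical bytes always produce the
// same URN, but nothing tells you where to resolve it, and any
// reformatting of the file produces a different hash entirely.
function contentUrn(content: string): string {
  const digest = createHash("md5").update(content, "utf8").digest("hex");
  return "urn:" + digest;
}

contentUrn("<html>...</html>"); // "urn:" followed by 32 hex digits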

[Figure: Reliability of URLs]

In 1996 Keith Shafer and several others proposed a solution to the problem of broken URLs. The link to this solution is now broken. Roy Fielding posted an implementation suggestion in July of 1995. The link is now broken.

I was able to find these pages through Google, which has functionally made page titles the URN of today. The URN format was ultimately finalized in 1997, and has essentially never been used since. The implementation is itself interesting. Each URN is composed of two components: an authority who can resolve a given type of URN, and the specific ID of this document in whichever format the authority understands. For example, urn:isbn:0131103628 will identify a book, forming a permanent link which can (hopefully) be turned into a set of URLs by your local ISBN resolver.

Given the power of search engines, it’s possible the best URN format today would be a simple way for files to point to their former URLs. We could allow the search engines to index this information, and link us as appropriate:

<!-- On http://zack.is/history -->
<link rel="past-url" href="http://zackbloom.com/history.html">
<link rel="past-url" href="http://zack.is/history.html">

Query Params

The application/x-www-form-urlencoded format is in many ways an aberrant monstrosity, the result of many years of implementation accidents and compromises leading to a set of requirements necessary for interoperability, but in no way representing good design practices.

— WHATWG URL Spec

If you’ve used the web for any period of time, you are familiar with query parameters. They follow the path portion of the URL, and encode options like ?name=zack&state=mi. It may seem odd to you that queries use the ampersand character (&) which is the same character used in HTML to encode special characters. In fact, if you’ve used HTML for any period of time, you likely have had to encode ampersands in URLs, turning http://host/?x=1&y=2 into http://host/?x=1&amp;y=2 or http://host?x=1&#38;y=2 (that particular confusion has always existed).

You may have also noticed that cookies follow a similar, but different format: x=1;y=2 which doesn’t actually conflict with HTML character encoding at all. This idea was not lost on the W3C, who encouraged implementers to support ; as well as & in query parameters as early as 1995.
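A parser honoring that recommendation needs only to split on either character. Here is a minimal sketch in TypeScript (not how any particular browser actually implemented it):

// Split a query string on either '&' or ';', per the 1995
// W3C recommendation that both be accepted as separators.
function parseQuery(query: string): Map<string, string> {
  const params = new Map<string, string>();
  for (const pair of query.replace(/^\?/, "").split(/[&;]/)) {
    if (!pair) continue;
    const eq = pair.indexOf("=");
    const key = eq === -1 ? pair : pair.slice(0, eq);
    const value = eq === -1 ? "" : pair.slice(eq + 1);
    params.set(decodeURIComponent(key), decodeURIComponent(value));
  }
  return params;
}

parseQuery("x=1;y=2"); // Map { "x" => "1", "y" => "2" }
parseQuery("x=1&y=2"); // the same result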

Originally, this section of the URL was strictly used for searching ‘indexes’. The Web was originally created as (and its funding was based on it being) a method of collaboration for high energy physicists. This is not to say Tim Berners-Lee didn’t know he was really creating a general-purpose communication tool. After all, he didn’t add support for tables for years, which is probably something physicists would have needed.

In any case, these ‘physicists’ needed a way of encoding and linking to information, and a way of searching that information. To provide that, Tim Berners-Lee created the <ISINDEX> tag. If <ISINDEX> appeared on a page, it would inform the browser that this is a page which can be searched. The browser should show a search field, and allow the user to send a query to the server.

That query was formatted as keywords separated by plus characters (+):

http://cernvm/FIND/?sgml+cms

In fantastic Internet fashion, this tag was quickly abused to do all manner of things including providing an input to calculate square roots. It was quickly proposed that perhaps this was too specific, and we really needed a general purpose <input> tag.

That particular proposal actually uses plus signs to separate the components of what otherwise looks like a modern GET query:

http://somehost.somewhere/some/path?x=xxxx+y=yyyy+z=zzzz

This was far from universally acclaimed. Some believed we needed a way of saying that the content on the other side of links should be searchable:

<a HREF="wais://quake.think.com/INFO" INDEX=1>search</a>

Tim Berners-Lee thought we should have a way of defining strongly-typed queries:

<ISINDEX TYPE="iana:/www/classes/query/personalinfo">

I can be somewhat confident in saying, in retrospect, I am glad the more generic solution won out.

The real work on <INPUT> began in January of 1993 based on an older SGML type. It was (perhaps unfortunately) decided that <SELECT> inputs needed a separate, richer, structure:

<select name=FIELDNAME type=CHOICETYPE [value=VALUE] [help=HELPUDI]>
<choice>item 1
<choice>item 2
<choice>item 3
</select>

If you’re curious, reusing <li>, rather than introducing the <option> element, was absolutely considered. There were, of course, alternative form proposals. One included some variable substitution evocative of what Angular might do today:

<ENTRYBLANK TYPE=int LENGTH=length DEFAULT=default VAR=lval>
Prompt </ENTRYBLANK>
<QUESTION TYPE=float DEFAULT=default VAR=lval> Prompt </QUESTION>
<CHOICE DEFAULT=default VAR=lval>
<ALTERNATIVE VAL=value1> Prompt1
...
<ALTERNATIVE VAL=valuen> Promptn
</CHOICE>

In this example the inputs are checked against the type specified in type, and the VAR values are available on the page for use in string substitution in URLs, à la:

http://eager.io/apps/$appId

Additional proposals actually used @, rather than =, to separate query components:

name@value+name@(value&value)

It was Marc Andreessen who suggested our current method based on what he had already implemented in Mosaic:

name=value&name=value&name=value

Just two months later Mosaic would add support for method=POST forms, and ‘modern’ HTML forms were born.
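That name=value&name=value format is with us still as application/x-www-form-urlencoded, and modern runtimes expose it directly through URLSearchParams. Note in this quick sketch how spaces still serialize as plus signs, a fossil of the original keyword queries:

const params = new URLSearchParams();
params.append("name", "zack");
params.append("state", "mi");
params.toString(); // "name=zack&state=mi"

// Spaces become '+', a leftover of the <ISINDEX> keyword format:
new URLSearchParams({ q: "history of the url" }).toString();
// "q=history+of+the+url"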

Of course, it was also Marc Andreessen’s company Netscape who would create the cookie format (using a different separator). Their proposal was itself painfully shortsighted, led to the attempt to introduce a Set-Cookie2 header, and introduced fundamental structural issues we still deal with at Eager to this day.

Fragments

The portion of the URL following the ‘#’ is known as the fragment. Fragments have been a part of URLs since their initial specification, used to link to a specific location on the page being loaded. For example, if I have an anchor on my site:

<a name="bio"></a>

I can link to it:

http://zack.is/#bio

This concept was gradually extended to any element (rather than just anchors), and moved to the id attribute rather than name:

<h1 id="bio">Bio</h1>

Tim Berners-Lee decided to use this character based on its connection to addresses in the United States (despite the fact that he’s British by birth). In his words:

In a snail mail address in the US at least, it is common to use the number sign for an apartment number or suite number within a building. So 12 Acacia Av #12 means “The building at 12 Acacia Av, and then within that the unit known numbered 12”. It seemed to be a natural character for the task. Now, http://www.example.com/foo#bar means “Within resource http://www.example.com/foo, the particular view of it known as bar”.

It turns out that the original Hypertext system, created by Douglas Engelbart, also used the ‘#’ character for the same purpose. This may be coincidental or it could be a case of accidental “idea borrowing”.

Fragments are explicitly not included in HTTP requests, meaning they only live inside the browser. This concept proved very valuable when it came time to implement client-side navigation (before pushState was introduced). Fragments were also very valuable when it came time to think about how we can store state in URLs without actually sending it to the server. What could that mean? Let’s explore:
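As one possibility, here is a minimal sketch of fragment-only state (browser APIs only; the state shape is invented for illustration). The serialized state lives after the ‘#’ and never appears in any HTTP request:

// Keep view state in the fragment, where the server never sees it.
function saveState(state: { page: number }): void {
  location.hash = encodeURIComponent(JSON.stringify(state));
}

function loadState(): { page: number } | null {
  const raw = decodeURIComponent(location.hash.slice(1));
  return raw ? JSON.parse(raw) : null;
}

// Back/forward navigation works too, pre-pushState style:
window.addEventListener("hashchange", () => {
  console.log("state is now", loadState());
});

saveState({ page: 2 }); // URL becomes e.g. http://zack.is/#%7B%22page%22%3A2%7D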

Molehills and Mountains

There is a whole standard, as yukky as SGML, on Electronic data Intercahnge [sic], meaning forms and form submission. I know no more except it looks like fortran backwards with no spaces.

— Tim Berners-Lee 1993

There is a popular perception that the internet standards bodies didn’t do much from the finalization of HTTP 1.1 and HTML 4.01 in 1999 to when HTML 5 really got on track. This period is also known (only by me) as the Dark Age of XHTML. The truth is, though, the standardization folks were fantastically busy. They were just doing things which ultimately didn’t prove all that valuable.

One such effort was the Semantic Web. The dream was to create a Resource Description Framework (editorial note: run away from any team which seeks to create a framework), which would allow metadata about content to be universally expressed. For example, rather than creating a nice web page about my Corvette Stingray, I could make an RDF document describing its size, color, and the number of speeding tickets I had gotten while driving it.

This is, of course, in no way a bad idea. But the format was XML based, and there was a big chicken-and-egg problem between having the entire world documented, and having the browsers do anything useful with that documentation.

It did however provide a powerful environment for philosophical argument. One of the best such arguments lasted at least ten years, and was known by the masterful codename ‘httpRange-14’.

httpRange-14 sought to answer the fundamental question of what a URL is. Does a URL always refer to a document, or can it refer to anything? Can I have a URL which points to my car?

They didn’t attempt to answer that question in any satisfying manner. Instead they focused on how and when we can use 303 redirects to point users from links which aren’t documents to ones which are, and when we can use URL fragments (the bit after the ‘#’) to point users to linked data.

To the pragmatic mind of today, this might seem like a silly question. To many of us, you can use a URL for whatever you manage to use it for, and people will use your thing or they won’t. But the Semantic Web cares for nothing more than semantics, so it was on.

This particular topic was discussed on July 1st 2002, July 15th 2002, July 22nd 2002, July 29th 2002, September 16th 2002, and at least 20 other occasions through 2005. It was resolved by the great ‘httpRange-14 resolution’ of 2005, then reopened by complaints in 2007 and 2011 and a call for new solutions in 2012. The question was heavily discussed by the pedantic web group, which is very aptly named. The one thing which didn’t happen is all that much semantic data getting put on the web behind any sort of URL.

Auth

As you may know, you can include a username and password in URLs:

http://zack:shhhhhh@zack.is

The browser then encodes this authentication data into Base64, and sends it as a header:

Authorization: Basic emFjazpzaGhoaGho

The only reason for the Base64 encoding is to allow characters which might not be valid in a header; it provides no obscurity to the username and password values.
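You can reproduce the header value yourself; a one-line sketch using the Base64 functions built into browsers (and recent Node versions):

// Base64 is an encoding, not encryption: anyone can reverse it.
const value = "Basic " + btoa("zack:shhhhhh");
// value === "Basic emFjazpzaGhoaGho"

atob("emFjazpzaGhoaGho"); // "zack:shhhhhh"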

Particularly over the pre-SSL internet, this was very problematic. Anyone who could snoop on your connection could easily see your password. Many alternatives were proposed, including Kerberos, a security protocol widely used both then and now.

As with so many of these examples though, the simple basic auth proposal was easiest for browser manufacturers (Mosaic) to implement. This made it the first, and ultimately the only, solution until developers were given the tools to build their own authentication systems.

The Web Application

In the world of web applications, it can be a little odd to think of the basis for the web being the hyperlink. It is a method of linking one document to another, which was gradually augmented with styling, code execution, sessions, authentication, and ultimately became the social shared computing experience so many 70s researchers were trying (and failing) to create. Ultimately, the conclusion is just as true for any project or startup today as it was then: all that matters is adoption. If you can get people to use it, however slipshod it might be, they will help you craft it into what they need. The corollary is, of course, that if no one is using it, it doesn’t matter how technically sound it might be. There are countless tools which absorbed millions of hours of work and which precisely no one uses today.


If you haven’t had a chance, feel free to take a look at the first portion of this post, covering the Domain, Protocol and Port.
