Wednesday, March 09, 2011

The Anatomy of a Web Site Address (URL)

Intro

The web address you type in the "address bar" of your internet browser (Internet Explorer, Firefox, Safari etc.) has a subtle structure and several interpretations. The domain name part of a URL (Uniform Resource Locator, not "unique" as it is sometimes called) is read right to left, i.e. backwards: the hierarchy starts with the top level domain designator, the primary one being ".com". Once the browser has picked the domain name out of what you typed, it sends a request for the IP address of that domain name to the local domain name server (more on this below).

Once the browser has this IP address, which may be cached for a while thereafter, it goes direct to that address and sends its request (the domain name and the file path), expecting to open a communication governed most often by the means of transmitting web pages, that is HTTP.
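
As a rough illustration of this structure, Python's standard urllib.parse module can split a web address into its parts. The address here is made up for the example, and the right-to-left list is only meant to show the order in which the domain name hierarchy runs; it is a sketch, not what any particular browser does internally.

```python
from urllib.parse import urlsplit

# Hypothetical address, used only for illustration.
url = "http://www.xyz.com/images/logo.gif"

parts = urlsplit(url)
print(parts.scheme)    # protocol part, before "://"
print(parts.hostname)  # domain name the DNS is asked about
print(parts.path)      # file path handled by the web server

# The domain name hierarchy runs right to left: TLD first.
labels_right_to_left = parts.hostname.split(".")[::-1]
print(labels_right_to_left)
```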


But since we start at the beginning of a web address, and anything after the "dot com" point gets read left to right by the computers serving us information, let us begin at the left hand side (LHS):

http:// HTTP: Hypertext Transfer Protocol. This is the agreed means of transmitting and receiving HTML (Hypertext Markup Language) files, and the related files and smaller packets of information used to make a web page appear in your browser. (For example, the primary web page may be written in XHTML while it calls upon JavaScript files, and PHP programmes running on the server may generate some of the information sent.) The colon and the forward slashes are just a computer convention, and were irritating before browsers autocompleted them for you. However, they serve a purpose: they separate the protocol name from the rest of the address, so the browser knows to open an HTTP dialogue with the server.
Other protocols include HTTPS, an encrypted version of HTTP;
FTP (File Transfer Protocol), which is simpler and relates to moving files from one computer to another, and is nearly always associated with a password and a resulting level of permission as to which folders you get access to; and ICMP, which is used in PING, see below. VoIP is voice over internet protocol, used in software like Skype.
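
To make the idea of an HTTP dialogue a little more concrete, here is a minimal sketch in Python of the text a browser might send to open one. The function name and the host and path are my own assumptions for illustration, and a real browser sends many more header lines than this.

```python
def build_get_request(host, path="/"):
    """Return the text of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",   # what we want and which protocol version
        f"Host: {host}",          # the domain name we typed
        "Connection: close",      # ask the server to hang up afterwards
        "",                       # a blank line ends the request
        "",
    ]
    return "\r\n".join(lines)

# Hypothetical site and page, for illustration only.
request_text = build_get_request("www.xyz.com", "/index.html")
print(request_text)
```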

Different protocols use different PORTS on the server computer, and the server software listening on each port is set up for that type of traffic. HTTP usually uses port 80, and HTTPS port 443.
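
A small Python sketch of the port idea follows. The table holds the well-known default ports for three of the protocols mentioned; the function name port_for and the addresses are assumptions made up for the example.

```python
from urllib.parse import urlsplit

# Well-known default ports for some of the protocols mentioned above.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

def port_for(url):
    """Return the explicit port in the URL, else the protocol's default."""
    parts = urlsplit(url)
    return parts.port or DEFAULT_PORTS.get(parts.scheme)

print(port_for("http://www.xyz.com/"))
print(port_for("https://www.xyz.com/"))
print(port_for("http://www.xyz.com:8080/"))  # a port can be given explicitly
```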

Secure, Encrypted Protocol: https
SSL (Secure Sockets Layer) is the encryption used by the https protocol, for use with credit cards, banking and so on. Domain names wanting to use it have traditionally needed a unique IP address, i.e. only one secure web site can be served from each IP address, which is really very limiting. It also means that the companies which issue SSL certificates and host SSL servers have more power than maybe they should have in controlling and offering paid services for secure access.

Http latest version
The latest version, HTTP/1.1, allows multiple requests and responses over one connection, i.e. there is a virtual pipe (which may or may not follow the same path through the internet nodes), and there is no need for the full connection handshake each time.

Tricks for less HTTP handshaking: download just one combined image or one XML file and have the client side show only the part of it needed at that moment, or only request images and content when the user scrolls down or over to them.

"www." This is just a conventional name for the computer, or part of a site, which serves world wide web pages. It is now a little superfluous: as you will see, you can type many web addresses without it and still get to the site's first page, or the address of the page presented to you will lack the "www.". See below for another comment on this.

Finally there is the annoying colon-slash-slash, which on early browsers had to be typed manually! "http://..." is just the syntax for instructing HTTP communication to and from the web address and, like "www", is largely redundant for the user and could well be excluded from view. I expect browsers to do this in future, especially with the growth in both mobile handset and HD TV based internet browsing, whereby you may be using a key pad with "multi-tap" spelling rather than a proper keyboard, thus making it tedious to even type "www".

"xyz.com" The Top Level Domain

Dot com is just one of many top level domains (TLDs). Some endings have restrictions on who can own them, like .gov.uk, .ac.uk and so on. Dot com remains the most popular in western countries at least, despite the availability of .biz, .info and so on.

"Dot TV", i.e. www.younameit.tv, is actually the country code TLD for Tuvalu, so the islanders there make a lot of cash out of TV firms and streaming web services.

There used to be restrictions on ".net", which was meant for some kind of network provision, and on ".org", but the domain name authorities are not strict on these now. They are of course strict on the uniqueness of the address, i.e. if an address is currently owned by someone else, you cannot own it and get it pointed to "your" server IP address until it expires or they sell it to you.

Some countries' domain name authorities are fairly strict, either beforehand or when approached, if you infringe a trademark or "impersonate" a trading company, registered organisation or person. More on this below.

When the web server receives a request for just the domain name, with no file path after the .com, it sends the default page and information, for example index.html or default.asp, back to your IP address.

www.xyz.com

The domain name per se is now complete, as all modern browsers are able to process it from this. As mentioned, the name and TLD should be chosen carefully so as to be exclusive to your company, while exhaustive in covering TLD variants, abbreviations, brand names, acronyms etc. You may well also want to register common typos, misconceived spellings of your company name, and former and merged company names, to help your customers find you and to protect you against people trying to steal your traffic or misrepresent themselves.

Additional Syntax : Full-Stop-Text: eg www.joebloggs.xyz.com
Your name, as mentioned, needs to be unique, and you need to know the IP address of the server computer it will be held upon. However, because the domain name is read from right to left, from the dot com TLD leftwards, you can arrange interesting sub domain names, like this type which I hypothesised for individual person campaigns (there would be some legal issues in some countries in using this approach). The DNS entry for the domain must point these names at your server, but it is not until the actual web server is reached that a computer decides what to do with them: in Apache server you can set up individual sub domain names, or by default send them all to the home page. They are sometimes used as a slight privacy screen, or for "perishable" messages when you only want people who have received the actual address to be able to get access for a given time.
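
The choice a web server makes for sub domain names can be sketched in a few lines of Python. This is not how Apache itself is configured; the folder names, the function name and the catch-all rule are all assumptions made up for the illustration.

```python
from fnmatch import fnmatch

# Hypothetical set-up: one named sub domain, everything else goes home.
SITE_FOLDERS = {"www.joebloggs.xyz.com": "/campaigns/joebloggs"}
HOME_FOLDER = "/home"

def folder_for(hostname):
    """Pick the folder to serve for a requested host name."""
    hostname = hostname.lower()
    if hostname in SITE_FOLDERS:
        return SITE_FOLDERS[hostname]          # a sub domain set up by name
    if hostname == "xyz.com" or fnmatch(hostname, "*.xyz.com"):
        return HOME_FOLDER                     # default: send to the home page
    return None                                # not one of our names at all

print(folder_for("www.joebloggs.xyz.com"))
print(folder_for("anything-else.xyz.com"))
```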

dot com slash...... .com/
Now we are past the right-to-left part, and what follows is read left to right at the server end. The forward slash works like the separator in a file path in DOS, Windows or Mac and means that the file will be in the next level folder, in this case the top level folder of course.

Apache server and other such programmes direct requests for just the .com to the site's top level folder, to serve up the default "index" file from there or, as the case may be, to let you interact with the server computer through this location.
For most web sites in 1998, this meant that index.html was sent out to you, and usually this would in turn ask for graphic files and maybe a style sheet to help the web site look, erm, nice. Graphics were usually held in a folder called /images, and for fun you could go through this folder in some browsers' "ftp" view.

In the case of more advanced "dynamic" web sites, the information you send, which may just be the initial HTTP request for the dot com name, is processed by a server side programme addressed through this URL folder location, and the result, i.e. the computed answer, is the information given to you. So, for example, if you have a permanent cookie for a web site, the first page may well be personalised just for you or, more likely, advertising may be related to your previous interactions with that web site, with a profile being stored on the server. Or, when you have completed a form, the data may be held: like a user name, or a search term (string).

Behind the scenes a little:

If you own a web site, you either transfer the files up for the first time or write over the existing ones. When you move the web pages and info (usually by FTP, file transfer protocol) to put the web site "up", the server has a structure, often with a unique folder for the "home page": some stipulate that the first file must be called "index.html", while other servers have one folder (directory) where you place only one master file. "Apache" type software knows to send you the contents of this folder first if you have not specified a long URL, i.e. just http://www.zyx.com will serve you up http://www.zyx.com/index.html, a single HTML file which will travel over the internet and be put together by your browser (IE, Firefox, Safari etc.). Conscientious programmers will also direct any non-existent URLs beyond a correct "dot com", or addresses lacking "www." at the front end, to this first page.
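
The "send index.html if no file was asked for" rule can be sketched like this in Python. The default file name is an assumption (some servers use default.asp or similar), and the function name is made up for the example.

```python
import posixpath

DEFAULT_DOCUMENT = "index.html"  # assumed; the real name is set by the server

def document_for(url_path):
    """Map the path part of a URL to the file the server would send."""
    if url_path == "" or url_path.endswith("/"):
        # No file named: fall back to the default document in that folder.
        return posixpath.join(url_path or "/", DEFAULT_DOCUMENT)
    return url_path

print(document_for(""))             # just http://www.zyx.com
print(document_for("/"))            # http://www.zyx.com/
print(document_for("/about.html"))  # a specific page
```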

The HTML source mentions other files, most often images like GIFs or JPEGs, and your computer then goes back to the web server to fetch them: this time it can go direct because it already knows the IP address from the DNS. The IP address is cached locally, and this can be for a notable period of time, even days.
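
To show where those extra requests come from, here is a small Python sketch which reads a page and lists the image files it mentions. The page is a made-up stand-in for index.html, and a real browser also looks for style sheets, scripts and so on.

```python
from html.parser import HTMLParser

class ImageFinder(HTMLParser):
    """Collect the src of every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

# A made-up page, standing in for index.html.
page = ('<html><body><img src="/images/logo.gif"><p>Hello</p>'
        '<img src="/images/photo.jpg"></body></html>')

finder = ImageFinder()
finder.feed(page)
print(finder.images)  # each entry means one more HTTP request
```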

Looking Behind the Scenes:

These small HTTP requests, below the level of the web page itself, cause load on servers, which perhaps prefer a more direct relationship. Firebug is a Firefox plug-in with a network analysis function which lets you see which requests for further information were made. Each one of these HTTP requests is subject to delay and re-packeting issues for the individual user, who may be on a cell phone for example, and on the server side there is much more secondary handshaking going on, which is just low value load on the Apache server.

A web site which waits for JPEGs, Flash etc. to load BEFORE rendering the general structure and text in HTML is badly developed. The latest dynamic web sites actually have timers built in to wait to load content or to rotate content, like advertising or playing a short video.
Having a lot of files needing separate HTTP requests is one load factor; the other is having web sites with a lot of code which is not compressed. If you run a popular web site serving millions of hits, then even spaces, indents and long "parameters" like "query" instead of "Q" add up to paying your ISP for more bandwidth.
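
As a rough feel for how much spaces and indents cost, this Python sketch counts the bytes of a made-up page before and after a crude clean-up, and again after gzip compression. Collapsing all white space like this is only an illustration; it would spoil pre-formatted text on a real page.

```python
import gzip
import re

# A made-up, heavily indented page, repeated to give it some bulk.
page = """<html>
    <body>
        <p>   Hello,   world   </p>
        <p>   Another   paragraph   </p>
    </body>
</html>
""" * 50

# Crude minification: collapse every run of white space to one space.
minified = re.sub(r"\s+", " ", page).strip()
compressed = gzip.compress(minified.encode("utf-8"))

print(len(page.encode("utf-8")), "bytes as written")
print(len(minified.encode("utf-8")), "bytes with white space collapsed")
print(len(compressed), "bytes after gzip")
```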

DNS: Domain Name Servers - What do they do ?


The route to the 'nearest' DNS to your computer is predetermined in your internet connection values: this includes your own router (the WiFi internet box for most of us) and the telephone company's / ISP's DNS (domain name server, AKA domain name system). It works rather like a telephone directory service at a digital exchange, and your ISP will either own it or pay for access to a regional or national DNS.
The DNS IP address is in your internet settings supplied by your ISP, and your computer dials it and uses it as a telephone directory to find out who the dot com name is, i.e. where it is and which computer it is on. The DNS in turn duly sends back the IP address for the web site, if it exists. Thereafter your browser (or FTP client) will store this IP address for use when dialling that web address over a period of time, so as not to hassle the DNS server again.


What actually happens is that the DNS sends back the IP address corresponding to the web address you have written. Now your computer sets up an HTTP request, with its own IP address as the return address, in a little "packet" of data which it then sends out to the internet nodes / routers. In other words, your "dialled" request follows a route which is like a cross between country code, area code, zip code (local postal code) AND a unique telephone line.
DNS servers are powerful in terms of server-client capacity, but in reality all they do is look up a two column flat table: the "dot com" address, www.xyz.com, in one column and the IP address in the other.
When you launch a new "dot com" web site, or park a name somewhere else, for example at a web hotel, the new entry has to reach DNS servers all over the world: the authoritative name servers hold the master record, and every other DNS server picks it up as its old cached copy expires. This takes time to populate of course, and not all ISPs have admin rights to enter these into the process. Several layers of delay are possible: manual handling at your ISP, waiting to send the request to an authorised ISP; queuing at the DNS authorised ISP; queuing at the primary DNS nodes; and populating, and thereafter queuing, at local ISP nodes / routers.
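
The look-up table and the caching described in this section can be sketched in Python as below. Everything here is made up for illustration: the table entry, the one hour lifetime and the class name are assumptions, and the 192.0.2.x addresses are a range reserved for documentation, not a real web site.

```python
import time

# Made-up two column table: name in one column, IP address in the other.
DNS_TABLE = {"www.xyz.com": "192.0.2.10"}

class CachingResolver:
    """Ask the table once, then answer from a local cache until it expires."""

    def __init__(self, table, ttl_seconds=3600, clock=time.monotonic):
        self.table = table
        self.ttl = ttl_seconds
        self.clock = clock
        self.cache = {}    # name -> (ip, expiry time)
        self.lookups = 0   # how often the "DNS" was really asked

    def resolve(self, name):
        now = self.clock()
        cached = self.cache.get(name)
        if cached and cached[1] > now:
            return cached[0]            # still fresh, no need to hassle the DNS
        self.lookups += 1
        ip = self.table.get(name)       # None if the name does not exist
        if ip is not None:
            self.cache[name] = (ip, now + self.ttl)
        return ip

resolver = CachingResolver(DNS_TABLE)
print(resolver.resolve("www.xyz.com"))
print(resolver.resolve("www.xyz.com"))  # answered from the cache
print(resolver.lookups)
```

The expiry time is also why a changed entry takes a while to show up everywhere: each cache keeps handing out the old answer until its own copy runs out.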

Internet Router Machines

After your own home router, there is not just one but a whole string of routers which form the internet nodes for the trafficking of information. There is a large element of bandwidth control even on the "free" internet: different routes are chosen between these nodes, and your HTTP request packet (a little postcard asking the web site to reply) may be deprioritised, sent further afield or wholly re-routed. In the USA it is very common for a request to pass through 14 routers, even for internal traffic.

In days of old, all traffic in and out of the USA went through a transatlantic cable and the first router "node" was in Maryland, hence web stats reported lots of viewers living in that state! I don't know if the US blocked ICMP beyond that point or if it was not fully available anyway at that time, but that is as far as you could "see".

Behind the Scenes of Internet Routing:
Ping, TraceRoute and ICMP: Detective Work on the Often Twisty Route your HTTP Request Takes on the Internet. ICMP (Internet Control Message Protocol) is a simple protocol which the commands PING and Traceroute use (on the command line, e.g. in the DOS prompt window).

Traceroute is a little programme in itself which tries to identify as many nodes, i.e. routers, as possible in the path to the eventual server with the IP address / web site you requested. It will show at least how many steps there are, and the time it took from home to each step. This can help identify whether your ISP is relying on cheap, slow routes on the internet, and also which country, for example, a web site or a client requesting your web site lives in.

Some routers do not allow ICMP for load or security reasons. VoIP suffers in a similar way, which is why the quality can be so rubbish: it often takes tortuous routes and the data packets get lost.

Paths on the Internet
Your data does not need to follow the same path with each data packet: your computer and the eventual server piece the data packets together. So, for example, a JPEG for a web page may take a completely different path back to you, through different routers on the internet.
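
The piecing together can be sketched in Python by numbering the pieces, shuffling them to mimic different routes, and sorting them again at the other end. This is a toy under my own assumptions (the packet size and function names are made up); the real work is done by TCP, which also copes with lost and repeated packets.

```python
import random

def to_packets(data, size):
    """Cut data into (sequence number, chunk) pairs."""
    return [(i, data[start:start + size])
            for i, start in enumerate(range(0, len(data), size))]

def reassemble(packets):
    """Put packets back in order by sequence number and join them."""
    return b"".join(chunk for _, chunk in sorted(packets))

message = b"pretend these bytes are a jpeg for a web page"
packets = to_packets(message, 8)
random.shuffle(packets)                  # packets may arrive in any order
print(reassemble(packets) == message)
```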

Incidentally, domain name registrations, and with them the DNS entries tying a name to an IP address, run out over time (it used to be a year or two), and you then have to renew the registration to keep the DNS entries around the world alive. WHOIS look-up sites offer information on the expiry dates of domain names.

Domain Name Hosts, Web Site Servers, Web Hotels and Hijackers


These usually manage your DNS entry and the "pointing" of your name to their servers, and it can be simplest to buy URLs through them. The best in the USA is arguably "Go Daddy" dot com who, despite a very commercial web interface, offer very good service and technical DNS / URL management features for a low price, from 10 USD! "Network Solutions Inc" used to be the most trusted source of this service.

Web hotels are people who will hold your URL and show a holding screen, or their own home page with a message about your URL. They are useful if you own a good URL name but have no design or current content. It is best to find a reputable one, or use a well known web server company. Some less reputable ones may sell off your web name, or wait in the wings to buy it when it expires without warning you, just having "expiry" or ownership in the small print.

"Squatters" are web hotels and individuals who have registered domain names speculatively: the famous case was "coca cola", where the famous plaintiff of course won over the couple in London.

"Hijackers" are worse: previously hackers and unscrupulous types could use weak internet countries like New Zealand to send e-mail with your domain name, or redirect you away from a web request. These days some ISPs are doing it with misspellings and typos, and if you have a cable provider you can find you go to a site where you must be careful you don't end up buying something: they know who you are and can probably bill your credit card. I once tried to register "cacao calo" dot com and Glof.com, as misspellings of Coke and of a big American / UK middle management waste of time site. Both were refused by my ISP, who blamed the DNS authorities, while I bet the ISP registered them themselves.

This is another REAL danger of nosing out domain names on some sites like "whois", and of your own ISP's router machine sniffing what you are requesting: there is a risk that some companies monitor these and snaffle up the domain names you were checking were clear.

A 404 error does not mean the PDN (primary domain name) is inactive; it just means the server which owns it was reached but does not have the requested page to serve to you.

In the late 90s BT in the UK had issues with their primary DNS being located in Wales, with massive amounts of transatlantic and DVLC data going down the backbone through Wales. What they did was to reroute people, and they actually cached entire HTML web sites, in their thousands, at sites north and east of Wales. We wondered why, when we FTPed a site up and could see it in the file path, we could not get it on the internet. This would not be "possible" in today's dynamic web site environment.

Virtual Host
This is just a name for a single IP address used to serve multiple domain names, usually mediated on the main Apache server or by an intelligent router acting as a 'triage' for larger ISPs and web hotels. This is very normal today of course, but it relies on the browser sending the actual requested domain name (as well as the file path) in the request which "dials" the IP address. Early browsers did not send the domain name, just the file path.
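
Here is a minimal Python sketch of that 'triage': the server reads the Host line of the incoming request and picks the matching customer's folder. The customers, folders and function name are hypothetical, and real server software does a good deal more checking than this.

```python
# Hypothetical customers sharing one server and one IP address.
VIRTUAL_HOSTS = {
    "www.xyz.com": "/sites/xyz",
    "www.zyx.com": "/sites/zyx",
}

def site_folder(request_text):
    """Pick a site folder from the Host header of a raw HTTP request."""
    for line in request_text.split("\r\n")[1:]:   # skip the "GET ..." line
        if line == "":
            break                                  # blank line ends the headers
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            host = value.strip().lower().split(":")[0]   # drop any port number
            return VIRTUAL_HOSTS.get(host)
    return None   # no Host line at all, as with the earliest browsers

print(site_folder("GET / HTTP/1.1\r\nHost: www.zyx.com\r\n\r\n"))
print(site_folder("GET / HTTP/1.0\r\n\r\n"))
```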

Red Indians out There ? Apache?
Apache is very common web server software which runs on the server computers: for example, it sorts out the individual URL requests, dealing out the actual web sites from the folders or servers behind it. The information needed to serve and administer an account, say for a paying customer, is held in a config file: the primary domain name, and any variants / aliases like no www, or .co.uk.

If the files for the web site are down or not present, then the server returns a 404 error page saying so. The config file also sets permissions for types of technical programming and access, e.g. using CFM, ASP or PHP. The top HTML or PHP file for the home page is usually held in a (unix style) folder such as "/public", and the other public access files are in folders which have the permission level to let external requests be answered, i.e. not restricted. Quite often you will be told that you do not have permission to view a folder when you type a "hot wire" file path in the URL.

Apache contains small programming areas, built up of web-master commands, which are called modules. The first and most important of these for hosting the domain name is the "rewrite" module, used for redirects of domain name URL requests: wrong file paths, old deleted pages, making addresses non-case-sensitive (caps lock issues for people browsing), permanent redirects (code 301) and temporary redirects (code 302). Note: if you make www.jim.com go to www2.jim2.com without marking the redirect as temporary, then search engines crawling you will list you as www2; this is the same for web hotelling. You can also capture frequent incorrect URLs, which can show bad links from external web sites or poor indexing on search engines, and it can be worth implementing these as pages close to what is described at the source or listing. HTTPS routing can also be achieved in this module, although of course you need to have an SSL certificate from a recognised certificate authority (VeriSign, for example) in order for users' browsers to accept the https connection. (See the security lecture for non IT managers.)
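
The logic of such redirect rules can be sketched in Python, though this is not the rewrite module's own syntax. The rules, page names and function name are hypothetical; 301 and 302 are the standard HTTP codes for permanent and temporary redirects.

```python
import re

# Hypothetical rules: (pattern, status code, new location).
RULES = [
    # An old deleted page, moved for good; caps lock does not matter.
    (re.compile(r"^/old-page\.html$", re.IGNORECASE), 301, "/new-page.html"),
    # A temporary detour, which search engines should not treat as permanent.
    (re.compile(r"^/sale/?$", re.IGNORECASE), 302, "/offers/this-week.html"),
]

def redirect_for(path):
    """Return (status, location) for the first matching rule, else None."""
    for pattern, status, location in RULES:
        if pattern.search(path):
            return status, location
    return None

print(redirect_for("/OLD-PAGE.html"))
print(redirect_for("/sale"))
print(redirect_for("/index.html"))   # no rule, so serve the page as normal
```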

Usually these days there will also be a primary web site language engine running on the machine, like PHP or Active Server Pages, and often a SQL server programme to facilitate more advanced "dynamic" features on web sites and access to "back end" database computers. This will be the topic of a subsequent lecture.
