Robots.txt Files and What They Reveal

While doing a little research on robots text file tips I Googled “robots.txt disallow tips” and what was the first listing I got?


The White House’s robots.txt file!!! And it has a PageRank of 4. This discovery had my little SEO brain wondering why on earth that specific page would rank #1 for that term. Obviously the page must have a good number of incoming links and/or some good PR juice flowing to it.

So then I was interested in who might be linking to the White House’s robots.txt file – I mean, that’s not very interesting web reading. Or is it?

Well, the good old Beeb gave me some information that I liked to hear: that the new administration is open, and that it was responsible for removing a whopping 2,374 or so lines from the old file, whittling the new robots.txt file down to a trim 3 lines.

I wanted to see how many sites are linking to the WH’s famous, or infamous, file. So using “allinlink: whitehouse.gov/robots.txt” I learned Google had 9,390 indexed pages linking to the White House’s file.

The idiocy of laying out all your ‘secret’ folders for the world, and terrorist hackers, to read was news a few years ago. Still, the older pages I read on the issue certainly made me think about how careless we can be as website owners with divulging information: information you didn’t want anyone [like your competition] to know about.

So you want to know what your rival is working on for their new product launch? Take a peek at their robots.txt file, and if they hire web techs like the old White House did, you might be privy to some juicy, secret information.

Said rivals, hurry – check your robots file now and make sure you aren’t giving away your secrets.
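If you would rather not eyeball the file, here is a rough Python sketch of that check. It simply pulls every Disallow path out of a robots.txt body so you can see at a glance what you are advertising. The function name and the sample file contents are my own inventions for illustration, not anything taken from a real site.

```python
from __future__ import annotations


def disallowed_paths(robots_txt: str) -> list[str]:
    """Return every non-empty Disallow path found in a robots.txt body."""
    paths = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        # An empty "Disallow:" blocks nothing, so skip it.
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Hypothetical file contents, invented for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /admin/
Disallow: /new-product-launch/  # oops
Disallow:
"""

if __name__ == "__main__":
    for path in disallowed_paths(SAMPLE):
        print(path)
```

Paste the text of your own robots.txt in place of SAMPLE and read the output the way a nosy competitor would.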

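What we tend to forget is that just because you have a robots.txt file, it doesn’t mean the pages you list won’t be indexed. A Disallow line only asks well-behaved crawlers not to fetch a page; the URL itself can still show up in search results if other sites link to it.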

Yes, that’s right.
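To see why, it helps to look at robots.txt from the crawler’s side. The sketch below uses Python’s standard urllib.robotparser module with some made-up rules (the bot name, the example.com URLs, and the helper function are all hypothetical). Notice that the crawler is the one asking permission; nothing on your server enforces the answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, invented for illustration.
RULES = [
    "User-agent: *",
    "Disallow: /secret/",
]

parser = RobotFileParser()
parser.parse(RULES)


def polite_bot_may_fetch(url: str) -> bool:
    """Return what a rule-following crawler would decide for this URL."""
    return parser.can_fetch("ExampleBot", url)


if __name__ == "__main__":
    print(polite_bot_may_fetch("https://example.com/secret/plans.html"))
    print(polite_bot_may_fetch("https://example.com/index.html"))
```

A polite bot runs a check like this and stays away from /secret/. A rude one skips the check entirely, and either kind can still learn the URL exists from a link elsewhere.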

That raises the question: what can you do to stop search engines indexing your top secret pages? Well, the obvious thing to do is to use password-protected directories, or simply don’t ever have a link to a page you want hidden. No linkie, no findie.
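Most of us would set up password protection through the web server or hosting control panel rather than write it by hand. Just to show the idea, though, here is a minimal Python sketch of the check that sits behind an HTTP Basic login. The function name and the credentials are assumptions for illustration; this is not a complete or production-ready setup.

```python
import base64
import hmac


def is_authorized(auth_header, username, password):
    """Check an HTTP Basic 'Authorization' header against one expected login."""
    prefix = "Basic "
    if not auth_header or not auth_header.startswith(prefix):
        return False
    try:
        # The header carries base64("username:password").
        decoded = base64.b64decode(auth_header[len(prefix):], validate=True)
        supplied = decoded.decode("utf-8")
    except ValueError:
        # Covers malformed base64 and undecodable bytes alike.
        return False
    expected = f"{username}:{password}"
    # Constant-time comparison avoids leaking how much of the guess matched.
    return hmac.compare_digest(supplied.encode("utf-8"), expected.encode("utf-8"))


if __name__ == "__main__":
    # Hypothetical credentials, invented for illustration.
    header = "Basic " + base64.b64encode(b"editor:s3cret").decode("ascii")
    print(is_authorized(header, "editor", "s3cret"))
    print(is_authorized(None, "editor", "s3cret"))
```

Unlike a Disallow line, a check like this turns away spiders and snoops alike, because the server refuses to hand over the page without the right login.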

I won’t go into all the details here, but will let others more learned in the old robots methodology explain it to you:
An aptly named info site
More on creating a robots.txt file
Managing robots access
Even more help

Now safely make your own robots.txt file to get over any virtual arachnophobia:
Google webmaster tools
Robots.txt generator

And remember – the spiders will be coming soon.
