Search Engine Spiders and Robots - How do they work?

... and why "Paid" search engine optimization is a waste of time, money and effort.

This newsletter was posted in 1999, after being rewritten and updated from prior articles about how search engines function and the difference between the new search engines, like Google, and early human-categorized directories, like Yahoo. We leave this article in the archives because understanding how search engines evaluate web pages is still important and still misunderstood by many webmasters.

Search engine spiders, sometimes called "robots" or "crawlers", are small automated software programs that search engines use to stay up to date with content on the internet. These spiders are constantly seeking out new or updated web pages. Any search engine's results page is only as good as the library database of all the web pages it "knows" exist. Let's look at how these searching programs really work, dispel a few myths and discover how spiders can (and cannot) help your web site become more successful.

What are these things called Spiders, Robots or web Crawlers?

Every search engine company uses small, automated software programs variously called search engine spiders, "robots" or "web crawlers". Think of the internet as a world-wide spiderweb, just like one you see in a dusty corner. Visualize the "spiders" that "crawl" across all the interconnected links, moving quietly, disturbing nothing, but visiting each and every corner of that spider web. (Spiders, the world wide web, crawling and robots... now you see how these "technical" terms evolved. Computer geeks are great at explanatory analogies.) No matter the name (I prefer "spiders"), the assigned job is to constantly roam around the "world wide web" searching for new or updated web pages. Think of how many new pages must have been added to the internet just today. A search engine spider is only a service tool. It helps the search engines index, or "catalog", every web site correctly.

How smart are the search engine spiders?

The reality is that spiders have only minimal, basic abilities. They function much like an early "text only" web browser, such as Lynx, reading only the HTML text code on a web page. Most web surfers don't remember, or have never seen, one of these "text only" browsers. The early browsers and the spiders of today just can't do certain things. They can't read a "Flash" animated intro page. Spiders can't see JPG or GIF images. Some web design tools, like frames or JavaScript code, are completely skipped. Web crawlers also can't invoke interface functions, like entering password-protected areas or clicking the fancy buttons you have on your web site. Some of the showiest web sites use dynamically generated pages inside a single URL. The complete web site might as well be invisible, because it really is "invisible" to the web crawlers. Most of the multimedia and animation-rich content we take for granted now as web surfers is still unseen by the spiders trying to catalog web sites. This is important to know in the design phase of your web site project. It's easier to build a "search engine friendly" web site today than it is to go back next week and "fix" one that isn't.
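To make the "text only" view concrete, here is a minimal sketch in Python of what a simple spider might keep from a page. It is only an illustration of the limits described above, not any search engine's actual code. The sample page, the class name TextOnlyView and the list of skipped tags are all assumptions made up for this example.

```python
from html.parser import HTMLParser


class TextOnlyView(HTMLParser):
    """Collects only the plain text a very simple, text-only reader would keep."""

    # Assumption for this sketch: content inside these tags is ignored entirely.
    SKIPPED = {"script", "style", "object"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def text_only(html):
    parser = TextOnlyView()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page with a script, a Flash object and an image.
SAMPLE_PAGE = """
<html><head><title>Acme Widgets</title>
<script>document.write("Hidden from spiders");</script></head>
<body>
<object data="intro.swf">Flash intro</object>
<img src="logo.jpg">
<h1>Hand-made widgets</h1>
<p>We ship worldwide.</p>
</body></html>
"""

print(text_only(SAMPLE_PAGE))
# prints: Acme Widgets Hand-made widgets We ship worldwide.
```

The script, the Flash object and the image contribute nothing to the output. Only the title, the heading and the paragraph text survive, which is why a page built entirely from those invisible elements gives a spider nothing to catalog.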

How Do Search Engine Spiders Work?

To understand how search engine spiders operate, it's helpful to think of them as automatic data-searching robots. As I've said, spiders travel the web to find as many new or updated web pages and links as possible. When you submit your web pages to a search engine at the "Submit a URL" page, they are added to the spider's list of web pages to visit on its next search mission out onto the internet. Your web pages could be found even if you didn't submit them. Spiders can find you if your web page is linked from any other web page on a "known" web site. It's no surprise, then, that it is important to build a library of links from other popular web sites back to yours. More about links between sites in a moment.

When our friendly search engine spider arrives at your web site, it first looks for a robots.txt file. This is a plain text file (not HTML) that tells spiders/robots which areas of your site are off-limits and shouldn't be cataloged. Why would you work hard to welcome the spiders, and then tell them to "go away"? Some pages contain HTML code or other elements that are a waste of the spider's time (like the JavaScript and Flash pages mentioned above). A robots.txt file can also steer spiders away from pages that are "secret", temporary or for any reason should not be widely available.

The next chore for our spider guest is to collect outbound links from the page. These routes will be followed to other pages later. Think of a spider knocking on and opening every door in a long hallway. Behind each door is another long hallway, with more doors to open. Spiders follow links from one page to another page. The internet is nothing but links from page to page. That was the original idea behind spiders: automating a tool to follow every link and visit the entire internet, learning and storing its content in a catalog.

Do you want the spiders to find you quickly? Have other "known" web sites link to your web pages and make sure your own pages link to other pages both inside and outside your web site. The internal links should be simple, but send the spider to a variety of important areas inside your web site. The VectorInter.Net newsletter links at the bottom of this page are an example.
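The routine described above (check robots.txt, visit a page, collect its links, follow them later) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how any particular search engine builds its spider. The robots.txt rules, the page names, the spider name "ExampleSpider" and the in-memory SITE_LINKS table (standing in for real pages fetched over the network) are all hypothetical.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for an imaginary site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /flash-intro.html
"""

# Hypothetical site: each page maps to the links found on that page.
SITE_LINKS = {
    "/": ["/products.html", "/flash-intro.html", "/newsletter/"],
    "/products.html": ["/", "/private/prices.html"],
    "/newsletter/": ["/newsletter/spiders.html", "/"],
    "/newsletter/spiders.html": ["/newsletter/"],
    "/private/prices.html": [],
    "/flash-intro.html": [],
}


def crawl(start, links, robots_txt, agent="ExampleSpider"):
    """Visit pages link by link, skipping anything robots.txt disallows."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())  # read the rules before visiting anything

    visited = []
    seen = {start}
    queue = deque([start])  # the "doors" still waiting to be opened
    while queue:
        page = queue.popleft()
        if not rules.can_fetch(agent, page):
            continue  # off-limits: a polite spider stays out
        visited.append(page)
        for link in links.get(page, []):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited


print(crawl("/", SITE_LINKS, ROBOTS_TXT))
# prints: ['/', '/products.html', '/newsletter/', '/newsletter/spiders.html']
```

Notice that the two disallowed pages are discovered through links but never visited, and that every other page is reached only because some already-known page links to it. A page with no inbound links would never enter the queue at all.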

How do Spiders "read" your web page, to learn its content?

Every web page (for our simple discussion) is made with HTML code. This is the unseen computer code that tells your browser how to display the page onscreen. The display information describes the colors on the page, from text to borders, and the placement of photos or graphics. HTML describes the page to a browser as you would describe the Eiffel Tower to a painter who had never seen it. The page display is then rendered onscreen in a fraction of a second (or slower if you use a dial-up modem). Search engine robots "read" your pages by looking at the visible text on the page (the content you are reading now). They also "see" the various HTML tags in your page's source code (title tag, meta tags, alt tags, etc.). You can learn more about HTML tags and how they work in the VectorInter.Net newsletter archives. The spiders also make note of the hyperlinks on each page. The words and the links that the spiders find help the search engine decide what your page is all about. This is also where the term "Keywords" comes from. Keywords are the text on each web page that is the most descriptive of the page content.
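As a rough illustration of the tags mentioned above, the following Python sketch pulls the title, the meta tags, the image alt text and the hyperlinks out of a page's source code. It is a minimal example, assuming a made-up sample page. The names PageSummary and summarize are invented for this sketch and say nothing about how a real spider stores what it finds.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Records the title, meta tags, image alt text and links of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.alt_texts = []
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""
        elif tag == "img" and attrs.get("alt"):
            self.alt_texts.append(attrs["alt"])
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()


def summarize(html):
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return parser


# Hypothetical page source.
SAMPLE_PAGE = """
<html><head>
<title>Hand-made Widgets | Acme</title>
<meta name="description" content="Hand-made widgets, shipped worldwide.">
<meta name="keywords" content="widgets, hand-made, gifts">
</head><body>
<img src="widget.jpg" alt="Blue widget">
<p>See our <a href="/products.html">products</a> and
<a href="http://example.com/reviews">reviews</a>.</p>
</body></html>
"""

page = summarize(SAMPLE_PAGE)
print(page.title)      # Hand-made Widgets | Acme
print(page.meta)       # description and keywords, keyed by name
print(page.alt_texts)  # ['Blue widget']
print(page.links)      # ['/products.html', 'http://example.com/reviews']
```

The image itself is never examined. Only its alt text is recorded, which is why descriptive alt text and a meaningful title give the spider something to work with.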

What is this "Search Engine Optimization" I keep hearing about?

We have now arrived at the point in this whole process where a great deal of folklore and rumor begins to join the web page indexing process. After a spider has reported the content of your pages back to the search engine's main library database, each search engine evaluates and processes the information. All web pages delivered to the library database become part of the search engine and the directory ranking process. Remember that each search engine has its own robots, unique processes and page content judgements. When a search engine user submits a search query, the search engine digs through its library database to give the final listing that is displayed. The "results page" from any search engine comes from software engineers, who devise the methods used to evaluate the page content the spiders retrieved. Any query will prompt an automated process using highly secret algorithms to make sure that the results presented are the most relevant matches. We have more information about "How search engines work" in the VectorInter.Net newsletter archives. For now, just understand that Google, MSN Search, Yahoo! and any other search engine are not interested in benefiting or penalizing any one web site. A search engine only grows if it is successful at its task. That task is to simply, quickly and effectively find the most relevant matches to your search query. No search engine I know discriminates for or against any web site based on size, ownership or whether the web site is a paid advertiser. (Note: Be sure you know the difference between true "search engines", like Google, and "Directories", like Yahoo!, that have search capabilities.) The reputation of a search engine is solely based on performance. It's in their interest to catalog every web page and make it available for a query match.
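The real ranking algorithms are secret, so the following Python sketch is only a toy stand-in to show the general shape of the process: words from cataloged pages go into a library database (here, a simple word index), and a query is answered by looking words up in that index. Scoring pages by how often a query word appears is an assumption chosen for simplicity, not the method of Google or any other search engine. The page names and text are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Hypothetical page text, as a spider might have reported it.
PAGES = {
    "/widgets.html": "Hand-made widgets. Blue widgets and red widgets shipped worldwide.",
    "/about.html": "Our family has made widgets since 1950.",
    "/contact.html": "Call or email us about shipping questions.",
}


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the pages containing it, with a count per page."""
    index = defaultdict(Counter)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word][url] += 1
    return index


def search(index, query):
    """Return matching pages, highest toy score (word count) first."""
    scores = Counter()
    for word in tokenize(query):
        for url, count in index.get(word, {}).items():
            scores[url] += count
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [url for url, _ in ranked]


index = build_index(PAGES)
print(search(index, "widgets"))   # ['/widgets.html', '/about.html']
print(search(index, "shipping"))  # ['/contact.html']
```

Even in this toy version, a page can only be returned for words that actually appear in its cataloged text, which is the practical reason page content matters more than any after-the-fact trick.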

"How often do the spiders visit my web pages?"

Every search engine database is different and so the frequency of visits will vary from one search engine to another. The explosive growth of new and unique web pages has slowed the process slightly, but don't worry. Search spiders are tireless drones on a mission. They want to find your web pages. That is why those links from other "known" web pages can really help. Once you are in the library database, the spiders continue to visit, watching for any changes to your pages and updating the database with any new or altered content. Ultimately, the number of times you are visited rarely matters to most web sites. This page you are reading will not change. Its content is editorial and not going to be "wrong" next month. Having each one of your web pages correctly cataloged at each of the search engines should be the goal.

Any web site owner should know which pages the search engine robots have visited. Look at your server log reports or the results from your log statistics program. (If you don't have one, upgrade your web hosting service. VectorInter.Net provides these tools free with every web site hosting contract.) Most spiders / robots are easily identifiable by their "user agent" names. Some are obvious: the Google robot is named "Googlebot". Other spiders have funny names: the Inktomi robot's name is "Slurp". When you run these activity reports, you'll learn the names and know when they visited your web site, which pages were visited and how frequently they visit. Identifying individual robots and counting their visits can also show you aggressive robots you may not want visiting your web site. Some disreputable robots are tools of the "spam" marketers. These "spam spiders" surf the web, indiscriminately grabbing every Email address listed on your web site. Now you know where some of your junk Email traffic originated.

Before I move on, let me give you another practical tip about search engine spiders. Never remove individual web pages from your site without replacing them with the newest updated version. This is a common error. If you remove a web page, with plans to replace it days later, the spiders will "see" a blank spot and remove it from the library database. If the spiders are unable to access your web pages because your site is down (poor web hosting service) or because you are experiencing huge amounts of traffic (not enough bandwidth at your web hosting company), the spider may not be able to update your web pages. When this happens, a specific page, or the entire web site, may not be re-indexed. In most cases, search engines build their spiders smart enough to know that if they cannot access a previously known web page, they should try again later. Even so, this is a needless risk to take with your web site. Make sure that your web pages are always accessible. You can read more about "How web site hosting works" in the VectorInter.Net newsletter archives.

By the way, the answer to the original question is: each spider will visit between 2 and 12 times a month.
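If your hosting service gives you raw server logs rather than a finished report, spotting spider visits by their "user agent" names takes only a short script. The Python sketch below is one possible approach, assuming logs in the common "combined" format where the user agent is the last quoted field. The sample log lines, addresses and page names are invented for illustration, and the list of spider names contains only the two mentioned in this article.

```python
import re
from collections import Counter

# Hypothetical log lines in "combined" format (user agent is the last quoted field).
LOG_LINES = [
    '192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '192.0.2.10 - - [10/Oct/2000:13:55:41 -0700] "GET /newsletter/spiders.html HTTP/1.0" 200 5120 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '192.0.2.22 - - [11/Oct/2000:02:10:05 -0700] "GET /index.html HTTP/1.0" 200 2326 "-" "Mozilla/3.0 (Slurp/si; slurp@inktomi.com)"',
    '192.0.2.77 - - [11/Oct/2000:09:30:12 -0700] "GET /index.html HTTP/1.1" 200 2326 "http://example.com/" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)"',
    '192.0.2.10 - - [17/Oct/2000:14:02:19 -0700] "GET /index.html HTTP/1.0" 200 2326 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
]

# Request line, status code, size, referrer, then the user agent.
LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Spider names mentioned in this article; extend the list for your own reports.
KNOWN_SPIDERS = ("Googlebot", "Slurp")


def spider_visits(lines):
    """Count visits per (spider name, page) pair."""
    visits = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not a line this sketch understands
        agent = match.group("agent").lower()
        for name in KNOWN_SPIDERS:
            if name.lower() in agent:
                visits[(name, match.group("path"))] += 1
    return visits


for (name, path), count in sorted(spider_visits(LOG_LINES).items()):
    print(name, path, count)
```

The ordinary browser visit is ignored, while each spider's visits are counted per page. Keep in mind that a user agent name is whatever the visiting program claims to be, so a disreputable robot can simply lie about it.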

To finish our overview of search engine spiders and "How they work", I said I would explain why "search engine optimization" is a waste of time, money and effort. As a policy, VectorInter.Net does not believe in paid "search engine optimization" programs. If you operate a web site, you have seen the advertisements with wild claims.

(Let us do your Search engine optimization. We are the experts. We can get your web site to the top of the first page in search engine results !! )

With all this advertising noise, you might start to think that someone has figured out a way to "trick" the spiders. Don't fall for the hype.

Web pages that are effective and rank well in search engines (and web directories too) are built with thoughtful content. Your web pages cannot be "fixed", adjusted or "optimized" by a paid service after the web site is "finished". From "Day 1", build your web pages to be valuable, helpful and informative on your subject matter and you'll be found by queries to the search engine. Yes, there are some page design criteria that help the spiders index your pages. Yes, there are things you can do to improve your efficiency in the library database. This work should be done when the pages are originally designed and built, not as an "add-on" or "premium service" inserted later. You can read about a few of the web page design criteria in an article called "Do It Yourself Search Engine Optimization" in the VectorInter.Net newsletter archives. The search engines want to know what your pages have to offer. The spiders are not the route to instant web site success, and hiring a consultant to "trick" the spiders into ranking your web site higher in the query results pages does not work. These "SEO tricks" are like a "system" to beat a casino or the lottery. If I knew how to beat the game, would I tell you? I'm sure you are a nice person, but get in line behind me, my parents, me, my sisters, me, my best friend since college, me, my aunt and uncle, me... you get the point. Yes, the web site designers / developers at VectorInter.Net use the latest information about search engine spider attributes when planning client web pages. We constantly monitor industry trends, but ultimately the real #1 search engine rankings come from quality, helpful web site content.

Did you ever wonder "Who invented the @ symbol in every Email?" What are "The 10 Big mistakes small businesses make online?"
Learn more when you visit the VectorInter.Net newsletter archives.