Robot Control Pro - Official User Guide - FAQ



RCP 4.0 Users Questions Menu

Useragents & Robots





  • What is a WWW robot?

    A robot is a program that automatically traverses the Web's hypertext structure by retrieving a URL or document and then recursively retrieving all documents that it references. Note that "recursive" here doesn't limit the definition to any specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit, and spaces out requests over a long period of time, it is still a robot.

    Normal Web browsers are not robots, because they are operated by a human and don't automatically retrieve referenced documents (other than inline images). Web robots are sometimes referred to as Web Wanderers, Web Crawlers, or Spiders. These names are a bit misleading, as they give the impression that the software itself moves between sites like a virus. This is not the case: a robot simply visits sites by requesting all kinds of documents and information from them on command.
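
    The recursive retrieval described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the SITE dictionary is a made-up, in-memory stand-in for real pages and their links, and crawl is a hypothetical helper name. A real robot would fetch each document over HTTP and extract the links itself. The sketch visits pages breadth-first, but as noted above, any traversal order still makes it a robot.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
SITE = {
    "/index.html": ["/about.html", "/news.html"],
    "/about.html": ["/index.html"],
    "/news.html": ["/about.html", "/archive.html"],
}


def crawl(start, links):
    """Return the pages reachable from `start`, in the order they are visited."""
    visited = []            # pages already "retrieved", in visit order
    seen = {start}          # pages already queued, so none is fetched twice
    queue = deque([start])  # pages waiting to be retrieved
    while queue:
        url = queue.popleft()
        visited.append(url)
        # Follow every reference found in the retrieved document.
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited
```

    Calling crawl("/index.html", SITE) visits every page of the sample site exactly once, even though the pages link back to each other.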



  • What is a User Agent?

    A User-Agent is the technical name (identification signature) of a program used on and made for the Web. User agents are programs that perform networking tasks, either for a user or automatically. Examples are Web browsers such as Netscape Navigator and Microsoft Internet Explorer, email clients such as Eudora, and robots such as Googlebot/2.2 and AltaVista's robot, Scooter. In short, the User-Agent is the name badge for programs and robots.



  • What is a search engine?

    A search engine is a tool that enables users to search for information on the World Wide Web. Search engines use a robot to pick up the keywords that webmasters enter in their HTML pages. On command, a search engine can spider (index) millions of websites and build a searchable index of all the information and keywords found. Your website can be found in the search engine through the keywords in your HTML files. It is therefore important to have a good keywords Meta tag and a good site description. Some search engines are specifically designed to find Web sites intended for children, others cover a specific topic, and others try to cover the entire World Wide Web. Search engines are needed to get visitors and traffic to your site.



  • What kinds of robots are there?

    Robots come and go in all forms, each with a different goal. The most common types of robots are listed below. Not all robots are friendly; many really need to be stopped because they are a risk to your site's privacy. Robots can be used for a number of purposes:

    - Indexing -
    Building a searchable database by reading the meta tags of submitted websites.

    - HTML validation -
    A robot can check your site for errors in its HTML/DHTML code.

    - Link validation -
    A robot can check your site and report any dead links found (404 errors).
    A robot can check whether you have the required link back in order to be listed on a Rank/Top List page.

    - "What's New" monitoring -
    A robot can check sites and report when a site is updated.

    - Mirroring -
    A robot can duplicate your complete site to a server other than yours.




  • What are Spiders, Web Crawlers, Worms, Ants etc?

    Basically they're all names for the same sort of thing, "Web Robots",
    with slightly different connotations:

    Robots - the generic name.
    Spiders - same as robots, but sounds cooler in the press.
    Worms - same as robots, although technically a worm is a replicating program, unlike a robot.
    Web crawlers - same as robots, but note that WebCrawler is also a specific robot User-Agent.
    WebAnts - distributed cooperating robots.



  • Aren't robots bad for the web?

    Yes and no. There are reasons why robots are both good and bad for the Web and for your website or server.

    Certain robot implementations can overload networks and servers, hurting loading speed and bandwidth. This happens especially with badly coded robots from people who are just starting to write a robot, and with bad robots that ignore the rules set in the robots.txt file (your RCP robottxt.out file).
    All robots are operated by humans, who make mistakes in configuration, simply don't consider the implications of their actions, or abuse them. This means people need to be careful, and robot authors need to make it difficult for people to make mistakes with bad effects.

    Search engine indexing robots build a central searchable database of documents and information that is open to the public worldwide. This can be good for bringing traffic and clients to your website. People must be able to find your site. It is therefore important to be listed in all major search engines and updated there frequently. Robots aren't inherently bad or good, nor inherently brilliant, and all need careful attention, because you never know the intentions of the people using a robot to gather information that you might not want to make public.

    Some robots visit millions of sites to find email addresses, which are then used to spam you. Even worse, the addresses found are sold for a few cents each to companies who spam millions of users every day. Other robots are used to steal information, content or hidden files. Some robots are even used by hackers to find holes in a server so they can take control of an entire website. Luckily, RCP allows you to control all robots and robot visits, and RCP will stop hack attempts and security risks. RCP filters the bad from the good and treats the much-needed search engine robots well. RCP makes sure everything goes well, following your own rules.



  • How can I stop a robot from indexing my site or a portion of it?

    You can do this by setting rules for robots in your RCP robottxt.out file, located in your cgi-bin/robot/ folder. This file works the same way as a robots.txt file. RCP does not need a robots.txt file in your htdocs or public folder to control robot behavior.



  • How do I set rules for web robots in the robottxt.out file
    to allow or deny portions of my website?


    The first method used to exclude robots from a server is to create a file on the server which specifies an access policy for robots. On websites without RCP installed, you must create a text file named "robots.txt" and place it in your ~/www/htdocs directory or public directory.

    Robot Control Pro uses a special and unique system to control, allow, deny and guide all robots. Websites using RCP also have a robots.txt-style file, but it is not located in the /htdocs/ or /public/ folder and it has a different name: the RCP file is named "robottxt.out" and is located in your cgi-bin/robot/ folder. The robottxt.out file works exactly the same as an original robots.txt file, and it is served to all robots that request the robots.txt file.

    A robots.txt/robottxt.out file starts with a User-agent line, followed by one or more Disallow lines. This method is usually used where entire directories are to be disallowed. The syntax of these lines is detailed below.

    User-agent: This field contains either the name of a specific robot you want the record to apply to, or '*' to make it apply to all robots.

    Disallow: This field specifies a partial URL that is not to be visited. This can be a full path or a partial path. Any URL that starts with this value will not be retrieved by the robot. For example, "Disallow: /test" would disallow both /test.html and /test/index.html, whereas "Disallow: /test/" would disallow /test/index.html but not /test.html. "Disallow: /index.html" would disallow only /index.html. (Note that there is no "/" after an individual filename.)

    Let's look at a couple of simple and useful examples. This example "/robots.txt" file specifies that no robots should visit any URL starting with "/test/documents/" or "/tmp/":

    User-agent: *
    Disallow: /test/documents/
    Disallow: /tmp/

    This example specifies that no robot should visit "/test/documents/" except for the one called WebCrawler:

    User-agent: *
    Disallow: /test/documents/

    User-agent: WebCrawler
    Disallow:

    This example specifies that no robot should visit "/document.html":

    User-agent: *
    Disallow: /document.html

    This example tells all robots to go away ("*" means all robots; "Disallow: /" denies all files):

    User-agent: *
    Disallow: /
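
    Before uploading a rule set, it can help to check how a rule-abiding robot would read it. The following is a minimal sketch using Python's standard urllib.robotparser module. The RULES text repeats the WebCrawler example above; is_allowed is a hypothetical helper name, and the robot names passed to it are only examples. It shows how a standard parser interprets the rules, which is an assumption about, not a description of, how RCP itself evaluates robottxt.out.

```python
from urllib.robotparser import RobotFileParser

# Same rules as the WebCrawler example: everyone is kept out of
# /test/documents/ except the robot named WebCrawler.
RULES = """\
User-agent: *
Disallow: /test/documents/

User-agent: WebCrawler
Disallow:
"""


def is_allowed(rules_text, user_agent, path):
    """Return True if `user_agent` may fetch `path` under `rules_text`."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines; no network access.
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, path)
```

    For example, is_allowed(RULES, "WebCrawler", "/test/documents/a.html") is true, while the same path is refused for "Googlebot/2.2", which falls under the "*" record.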




  • Second robot rule method using HTML Meta-Tag commands

    Where do I put a Robots META tag in my HTML files? Like any META tag, it should be placed in the HEAD section of an HTML page:

    <html>
    <head>
    <meta name="robots" content="noindex,nofollow">
    <meta name="description" content="This page ....">
    <title>...</title>
    </head>
    <body>
    ... the entire body of your page ...
    </body>
    </html>
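
    To check that a page really carries the tag, you can read it back the way a robot might. The sketch below uses Python's standard html.parser module; RobotsMetaFinder and find_robots_meta are hypothetical names chosen for this illustration, and real robots may locate the tag differently. For simplicity it accepts the tag anywhere in the document, not only inside HEAD.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the content value of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.contents = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        # The tag name "robots" is compared case-insensitively.
        if attributes.get("name", "").lower() == "robots":
            self.contents.append(attributes.get("content", ""))


def find_robots_meta(html):
    """Return the list of Robots META content values found in `html`."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    finder.close()
    return finder.contents
```

    An empty result means the page has no Robots META tag, so the defaults apply.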

    What do I put into the Robots META tag? The content of the Robots META tag contains directives separated by commas. The currently defined (as of July, 2000) directives are [NO]INDEX and [NO]FOLLOW. The INDEX directive specifies whether an indexing robot should index the page or ignore it. The FOLLOW directive specifies whether a robot is to follow the links on the page or ignore them.

    The defaults are INDEX and FOLLOW. If no robots tag is in place, a robot will index the page and follow all links on that page, provided the robot does not find other conditions that make the page illegal or outside the guidelines set by the robot's controller. The values ALL and NONE set all directives on or off: ALL=INDEX,FOLLOW and NONE=NOINDEX,NOFOLLOW. Some examples:

    More normal HTML commands
    <meta name="robots" content="all">
    <meta name="robots" content="none">
    <meta name="robots" content="index,follow">
    <meta name="robots" content="noindex,follow">
    <meta name="robots" content="index,nofollow">
    <meta name="robots" content="noindex,nofollow">

    Be aware that "robots" (the name of the tag) and its content are case insensitive, though not all robot software programmers have followed this directive. Not all robots respond to the last four of the above options. Some robots have attitudes! Some don't like to see multiple directives in the same statement or command. In that case, use multiple commands. TANS! (There ain't no standard!)

    Abnormal HTML commands for robot attitudes
    <meta name="robots" content="index">
    <meta name="robots" content="follow">

    You must take care not to specify conflicting or repeating directives, such as a content value that lists both "index" and "noindex", or that lists "follow" twice.

    Avoid this problem, as the results may be unpredictable. (We do not know of any cases of robots eating a document in anger, but it could happen. Remember the attitude!) Double check each entry for clarity. If you do NOT mind a robot indexing the page and following its links, it is best to make no entry.

    How do I know what works on a Robots META tag? The only way is trial and error. Experiment. Since the guidelines for robots are ambiguous at best, it is hard to follow a standard that really isn't a standard. Since not all servers run UNIX or similar operating systems, even the robots.txt file is not guaranteed. The commands below are as close to a standard as it gets.

    A formal syntax for the Robots META tag content is:

    Directive Syntax
    content = all | none | directives
    all = "ALL"
    none = "NONE"
    directives = directive ["," directives]
    directive = index | follow
    index = "INDEX" | "NOINDEX"
    follow = "FOLLOW" | "NOFOLLOW"
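
    The syntax above is small enough to check mechanically. The following Python sketch reads a content value according to that grammar, applies the INDEX and FOLLOW defaults, and rejects conflicting or repeated directives. The function name parse_robots_meta and the choice to raise ValueError are assumptions made for this illustration; actual robots are free to handle bad input differently. Matching is case insensitive and spaces around commas are ignored.

```python
# Each directive sets one of two switches: "index" or "follow".
DIRECTIVES = {
    "INDEX": ("index", True),
    "NOINDEX": ("index", False),
    "FOLLOW": ("follow", True),
    "NOFOLLOW": ("follow", False),
}


def parse_robots_meta(content):
    """Return {'index': bool, 'follow': bool} for a Robots META content value."""
    tokens = [token.strip().upper() for token in content.split(",")]
    # ALL and NONE are only valid on their own.
    if tokens == ["ALL"]:
        return {"index": True, "follow": True}
    if tokens == ["NONE"]:
        return {"index": False, "follow": False}
    result = {"index": True, "follow": True}  # the documented defaults
    seen = set()
    for token in tokens:
        if token not in DIRECTIVES:
            raise ValueError(f"unknown directive: {token!r}")
        key, value = DIRECTIVES[token]
        if key in seen:
            # e.g. "index,noindex" or "follow,follow"
            raise ValueError(f"conflicting or repeated directive: {token!r}")
        seen.add(key)
        result[key] = value
    return result
```

    With this sketch, "NOINDEX, NOFOLLOW" and "none" give the same result, and a single directive such as "index" leaves the other switch at its default.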

    No Robots At All: an example of a "no-robot" page. No robot that honors the tag will index this page or follow any links on it, even though there are links to it from other pages on our site. We are including the following coined word, norobotpagehere, which would normally be indexed and prepared for a search engine search comparison. You may look at our indexed site search and find that this word does not appear in the results of a search for it, thus showing that no robot has been here. The following line, a META tag, indicates that we do not want any 'bots here:

    A normal HTML command
    <META NAME="ROBOTS" CONTENT="NONE">
    Another normal HTML command
    <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
    However, not all robots recognize the above command as being identical to the META tag above it.

    Most search engines also require the keywords and description commands:
    <META NAME="keywords" CONTENT="Robot, crawler, spider, webcrawler, index">
    <META NAME="description" CONTENT="Information about mechanical and automated indexing for search engines.">

    If the above sample is used, a search will show results but will not show this page. You will not find it in any WWW search engine database. Try it! Enter one of the keywords you used into a local search engine. Entering one of the other words will get results, but will not show this page even though it is obvious that the word is here.



  • Where do I find out more about robots?

    www.robotstxt.org
    www.searchtools.com
    Search Robots and Robots.txt
    Web Robots Database
    Ethical Web Agents
    UMBC AgentWeb
    Webbot - the Libwww Robot
    MOMspider: Multi-owner Maintenance Spider
    Googlebot: Google's Web Crawler








Robot Control Pro - Copyright Webcomposing.com © 1997-2003 - All rights reserved.
Robot Control is a trademark of Webcomposing.com Corporation, registered in the U.S. and other countries.