RCP 4.0 Users Questions Menu
-
What is a WWW robot?
A robot is a program that automatically traverses
the Web's hypertext structure by retrieving a
document and then recursively retrieving all
documents that it references.
Note that "recursive" here doesn't limit the
definition to any specific traversal algorithm;
even if a robot applies some heuristic to the
selection and order of documents to visit and
spaces out requests over a long period of time,
it is still a robot.
Normal Web browsers are not robots, because they
are operated by a human and don't automatically
retrieve referenced documents (other than inline
images). Web robots are sometimes referred to as
Web Wanderers, Web Crawlers, or Spiders. These
names are a bit misleading, as they give the
impression that the software itself moves between
sites like a virus. This is not the case: a robot
simply visits sites by requesting documents and
other information from them on command.
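The "retrieve a document, then retrieve everything it references" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the PAGES dictionary is a made-up, in-memory stand-in for a website, and the breadth-first order is just one possible traversal, since the definition above does not prescribe one.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the URLs it references.
PAGES = {
    "/index.html": ["/about.html", "/news.html"],
    "/about.html": ["/index.html"],
    "/news.html": ["/archive.html", "/about.html"],
    "/archive.html": [],
}

def crawl(start, fetch_links):
    """Visit start, then every document reachable from it, once each."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        # fetch_links stands in for "retrieve the document and read its links".
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

visited = crawl("/index.html", lambda url: PAGES.get(url, []))
print(visited)
```

The seen set is what keeps the robot from requesting the same document twice when pages link back to each other.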
-
What is a User-Agent?
A User-Agent is the technical name (identification
signature) of a program used on and made for the
Web. User-agents are programs that perform
networking tasks, either for a user or
automatically. Examples are Web user-agents such
as Netscape Navigator and Microsoft Internet
Explorer, email user-agents such as Eudora, and
robot user-agents such as Googlebot/2.2 and the
AltaVista robot named Scooter. In short, a
User-Agent is a name badge for programs and
robots.
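A signature such as Googlebot/2.2 is a name followed by a version. The sketch below splits such a signature in Python. The function name parse_user_agent is hypothetical, and the sketch only handles the simple "Name/Version" form used in this answer; real browser signatures are usually longer and would need more work.

```python
def parse_user_agent(signature):
    """Split a simple 'Name/Version' signature into (name, version)."""
    name, _, version = signature.strip().partition("/")
    # A signature without a slash, such as "Scooter", has no version part.
    return name, version or None

print(parse_user_agent("Googlebot/2.2"))
print(parse_user_agent("Scooter"))
```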
-
What is a search engine?
A search engine is a tool that enables users to
search for information on the World Wide Web.
Search engines use a robot to pick up the keywords
that webmasters put in their HTML pages. On
command, a search engine can spider (index)
millions of websites and build a searchable
index of all the information and keywords found.
Your website can be found in the search engine
through the keywords in your HTML files. It is
therefore important to have good META tags with
keywords and a good site description. Some
search engines are specifically designed to find
Web sites intended for children, others cover a
specific topic, and others try to cover the
full World Wide Web. Search engines are needed
to get visitors and traffic to your site.
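A "searchable index of keywords" can be pictured as a table that maps each word to the pages containing it. The Python sketch below is a toy version of that idea and not how any particular search engine works; the DOCS dictionary and both function names are made up for the illustration.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

def search(index, keyword):
    """Return the URLs listed for a keyword, in alphabetical order."""
    return sorted(index.get(keyword.lower(), set()))

# Hypothetical page texts, as a robot might have collected them.
DOCS = {
    "/robots.html": "Robot crawler spider",
    "/kids.html": "Games for children",
    "/faq.html": "What is a robot?",
}
index = build_index(DOCS)
print(search(index, "robot"))
```

A page can only be returned for words that the robot actually found on it, which is why the keywords in your HTML files matter.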
-
What kinds of robots are there?
Robots come and go in all formats, each with a
different goal. The most commonly used types of
robots are listed below. Not all robots are
friendly; many really need to be stopped, as they
are a risk to your site, your privacy, etc. Robots
can be used for a number of purposes:
- Indexing -
Building a searchable database by reading
submitted websites and their meta tags.
- HTML validation -
A robot can check your site for errors in the
HTML/DHTML coding, etc.
- Link validation -
A robot can check your site and report any dead
links found (404 errors).
A robot can check whether you have the required
link back in order to be listed on a Rank/Top List
page.
- "What's New" monitoring -
A robot can check sites and report when a site is
updated.
- Mirroring -
A robot can duplicate your complete site to a
server other than yours.
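As an illustration of the link validation purpose listed above, the Python sketch below collects the links on each page and reports those that point nowhere. It is a simplified sketch: the SITE dictionary is a hypothetical in-memory site, and "not in SITE" stands in for the 404 error a real robot would receive from the server.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def dead_links(site):
    """Return (page, link) pairs whose target is not a page of the site."""
    broken = []
    for page, html in site.items():
        collector = LinkCollector()
        collector.feed(html)
        for link in collector.links:
            if link not in site:
                broken.append((page, link))
    return broken

# Hypothetical two-page site; /old.html does not exist.
SITE = {
    "/index.html": '<a href="/about.html">About</a> <a href="/old.html">Old</a>',
    "/about.html": '<a href="/index.html">Home</a>',
}
print(dead_links(SITE))
```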
-
What are Spiders, Web Crawlers, Worms, Ants,
etc.?
Basically they're all names for the same sort of
thing, "Web Robots", with slightly different
connotations:
Robots - the generic name.
Spiders - same as robots, but sounds cooler in
the press.
Worms - same as robots, although technically a
worm is a replicating program, unlike a
robot.
Web crawlers - same as robots, but note that
WebCrawler is also a specific robot
User-Agent.
WebAnts - distributed cooperating robots.
-
Aren't robots bad for the web?
Yes and no. There are a few reasons why robots
can be both good and bad for the Web and for your
website/server.
Certain robot implementations can overload
networks and degrade server performance (loading
speed and bandwidth). This happens especially with
badly coded robots written by people who are just
starting to write a robot, and often also with bad
robots that ignore the robot rules set in the
robots.txt file (your RCP robottxt.out file).
All robots are operated by humans, who make
mistakes in configuration, simply don't consider
the implications of their actions, or abuse the
robot. This means people need to be careful, and
robot authors need to make it difficult for
people to make mistakes with bad effects.
Search engine indexing robots build a central
searchable database of documents and information
that is open to the worldwide public. This can
be good for bringing traffic and clients to your
website. People must be able to find your site.
It is therefore important to be listed and updated
frequently in all major search engines. Robots
aren't inherently bad or good, nor inherently
brilliant, and all need careful attention, as
you never know the intentions of the people using
a robot; they may be gathering specific
information that you do not want to become
public.
Some robots visit millions of sites to find
email addresses, which are then used to spam you.
Even worse, their operators sell the addresses
found for a few cents each to companies that spam
millions of users every day. Other robots are
used to steal information, content, or hidden
files. Some robots are even used by hackers to
find security holes in a server so they can take
over control of a full website. Luckily, RCP
allows you to control all robots and their visits.
RCP will also stop hack attempts and security
risks. RCP filters out the bad from the good and
treats the much-needed search engine robots well.
RCP will make sure all goes well, following your
own rules.
-
How can I stop a robot from indexing my site
or a portion of it?
You can do this by setting rules for robots in
your RCP robottxt.out file, located in your
cgi-bin/robot/ folder. This file works the same
as a robots.txt file. RCP does not need a
robots.txt file in your htdocs or public folder
to control robot behavior.
-
How do I set rules for web robots in the
robottxt.out file
to allow or deny portions of my website?
The first method used to exclude robots from a
server is to create a file on the server which
specifies an access policy for robots. On
websites without RCP installed, you must create a
text file named "robots.txt" and place it in
your ~/www/htdocs directory or public
directory. It must be called "robots.txt".
Robot Control Pro uses a special and unique
system to control, allow, deny and guide all
robots. Websites using RCP also have a
robots.txt file, but this file is not located in
the /htdocs/ or /public/ folder, and it has a
different name than robots.txt.
The RCP file is named
"robottxt.out" and is located in your
cgi-bin/robot/ folder. The robottxt.out file works
exactly the same as an original robots.txt file.
The robottxt.out file will be provided to all
robots requesting the robots.txt file.
A robots.txt/robottxt.out file consists of one or
more records. Each record starts with a
User-agent line, followed by one or more
Disallow lines. This method is usually used
where entire directories are to be disallowed.
The syntax of these lines is detailed below:
User-agent: This field contains either the name
of a specific robot you want the record to apply
to, or '*', to make it apply to all robots.
Disallow: This field specifies a
partial URL that is not to be visited. This can
be a full path or a partial path. Any URL that
starts with this value will not be retrieved by
the robot. For example, Disallow: /test
would disallow both /test.html and
/test/index.html, whereas Disallow: /test/ would
disallow /test/index.html but not /test.html.
Disallow: /index.html would disallow
/index.html.
(Note there is not a "/" after an individual
filename.) Let's look at a couple of simple and
useful examples. In this example, the "/robots.txt"
file specifies that no robots should visit any
URL starting with "/test/documents/" or
"/tmp/":
User-agent: *
Disallow: /test/documents/
Disallow: /tmp/
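The "starts with this value" rule for Disallow can be sketched as a simple prefix test in Python. The helper name is_disallowed and the path /tmp/cache.dat are hypothetical, and the sketch ignores User-agent lines; it only shows how a single list of Disallow values is compared against a URL path.

```python
def is_disallowed(path, disallow_rules):
    """True if path starts with any non-empty Disallow value."""
    # An empty Disallow value disallows nothing, so it is skipped.
    return any(rule and path.startswith(rule) for rule in disallow_rules)

print(is_disallowed("/test.html", ["/test"]))
print(is_disallowed("/test/index.html", ["/test"]))
print(is_disallowed("/test.html", ["/test/"]))
print(is_disallowed("/test/index.html", ["/test/"]))
print(is_disallowed("/tmp/cache.dat", ["/test/documents/", "/tmp/"]))
```

The first, second, fourth and fifth calls print True; the third prints False, because /test.html does not start with /test/.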
This example specifies that no robot should
visit "/test/documents/" except for the one
called WebCrawler (an empty Disallow value means
that nothing is disallowed for that robot):
User-agent: *
Disallow: /test/documents/

User-agent: WebCrawler
Disallow:
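One way to check how such a file is read is Python's standard urllib.robotparser module, which follows the same exclusion rules. The sketch below feeds it the WebCrawler example; the robot name SomeOtherBot and the file a.html are made up for the test, and other robots may interpret edge cases differently than this module does.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /test/documents/

User-agent: WebCrawler
Disallow:
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# WebCrawler has its own record with an empty Disallow, so it may fetch.
print(parser.can_fetch("WebCrawler", "/test/documents/a.html"))
# Any other robot falls back to the "*" record.
print(parser.can_fetch("SomeOtherBot", "/test/documents/a.html"))
print(parser.can_fetch("SomeOtherBot", "/index.html"))
```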
This example specifies that no robot should
visit "/document.html":
User-agent: *
Disallow: /document.html
This example tells all robots to go away ("*"
means all robots; "Disallow: /" means deny all
files):
User-agent: *
Disallow: /
-
Second robot rule method using HTML Meta-Tag
commands
Where do I put a Robots META tag in my HTML
files?
Like any META tag it should be placed
in the HEAD section of an HTML page:
<html>
<head>
<meta name="robots" content="noindex,nofollow">
<meta name="description" content="This page ....">
<title>...</title>
</head>
<body>
... the entire body of your page ...
</body>
</html>
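To see what a robot might read from such a page, the Python sketch below uses the standard html.parser module to pick out the Robots META tag from the HEAD section. The class name RobotsMetaFinder is hypothetical, and real robots may parse pages differently; this only illustrates why the tag belongs in the HEAD.

```python
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Record the content of <meta name="robots"> tags found inside <head>."""
    def __init__(self):
        super().__init__()
        self.in_head = False
        self.contents = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            # The tag name "robots" is compared without regard to case.
            if (attributes.get("name") or "").lower() == "robots":
                self.contents.append(attributes.get("content") or "")

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False

PAGE = """<html>
<head>
<meta name="robots" content="noindex,nofollow">
<meta name="description" content="This page ....">
<title>...</title>
</head>
<body>... the entire body of your page ...</body>
</html>"""

finder = RobotsMetaFinder()
finder.feed(PAGE)
print(finder.contents)
```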
What do I put into the Robots META
tag?
The content of the Robots META tag
contains directives separated by commas. The
currently defined (as of July, 2000) directives
are [NO]INDEX and [NO]FOLLOW.
The INDEX directive specifies if an indexing
robot should index the page or ignore it. The
FOLLOW directive specifies if a robot is to
follow links on the page or ignore them. The
defaults are INDEX and FOLLOW: if no robots
tag is in place, a robot will index the page and
follow all links on that page, unless the robot
finds other conditions that make the page
unacceptable under the guidelines set by the
robot's controller. The values ALL and NONE set
all directives on or off: ALL=INDEX,FOLLOW and
NONE=NOINDEX,NOFOLLOW. Some examples:
More normal HTML commands
<meta name="robots" content="all">
<meta name="robots" content="none">
<meta name="robots" content="index,follow">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow">
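The meaning of these content values can be sketched as a small Python function that turns a content string into two yes/no answers: may the page be indexed, and may its links be followed. The function name robots_policy is hypothetical, and ignoring unknown words is an assumption of this sketch, not a documented rule.

```python
def robots_policy(content):
    """Return (index, follow) booleans for a Robots META content value."""
    index, follow = True, True  # the defaults are INDEX and FOLLOW
    for token in content.split(","):
        token = token.strip().upper()  # directives are case insensitive
        if token == "ALL":
            index, follow = True, True
        elif token == "NONE":
            index, follow = False, False
        elif token in ("INDEX", "NOINDEX"):
            index = token == "INDEX"
        elif token in ("FOLLOW", "NOFOLLOW"):
            follow = token == "FOLLOW"
        # Unknown words are ignored (an assumption of this sketch).
    return index, follow

for value in ("all", "none", "noindex,follow", "index,nofollow"):
    print(value, robots_policy(value))
```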
Be aware that the "robots" name of the tag and
the subsequent content are case insensitive,
though not all robot programmers
have followed this directive. Not all robots
respond to the last four of the above options.
Some robots have attitudes! Some don't like to
see multiple directives in the same statement or
command. In that case, use multiple commands.
TANS! (There ain't no standard!)
Abnormal HTML commands for robot attitudes
<meta name="robots" content="index">
<meta name="robots" content="follow">
You must take care to not specify conflicting or
repeating directives such as:
A conflicting HTML command
Avoid this problem as the results may be
unpredictable. (We do not know of any cases of
robots eating a document in anger but it could
happen. Remember the attitude!) Double check
each entry for clarity. If you do NOT mind a
robot indexing the page and following its links,
it is best to make no entry.
How do I know what works on a Robots META
tag?
The only way is trial and error.
Experiment. Since the guidelines for robots are
ambiguous at best, it is hard to follow a
standard that really isn't a standard. Since not
all servers are UNIX or similar operating
systems, even the robots.txt file is not
guaranteed. The commands below are as close to a
standard as it gets.
A formal syntax for the Robots META tag content
is:
Directive Syntax
content    = all | none | directives
all        = "ALL"
none       = "NONE"
directives = directive ["," directives]
directive  = index | follow
index      = "INDEX" | "NOINDEX"
follow     = "FOLLOW" | "NOFOLLOW"
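The grammar above can be turned into a quick check with a regular expression. The Python sketch below is a strict reading of it: the function name is_valid_content is hypothetical, matching is case insensitive, and no spaces are accepted around the commas because the grammar does not mention any.

```python
import re

# content = all | none | directives, matched without regard to case.
CONTENT_RE = re.compile(
    r"(?:ALL|NONE|(?:NO)?(?:INDEX|FOLLOW)(?:,(?:NO)?(?:INDEX|FOLLOW))*)",
    re.IGNORECASE,
)

def is_valid_content(value):
    """True if value matches the grammar exactly (no surrounding text)."""
    return CONTENT_RE.fullmatch(value) is not None

print(is_valid_content("noindex,follow"))
print(is_valid_content("all,index"))
```

Note that the grammar only describes the form of the value. A conflicting value such as "index,noindex" still matches it, so passing this check does not mean the directives make sense together.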
No Robots At All
This is an example of a
"no-robot" page. No robots (respectable or
otherwise!) will look at this page for indexing
and will not follow any links on this page, even
though there are links to it from other pages on
our site. We are including the following coined
word, norobotpagehere, that would normally be
indexed and prepared for a search engine search
comparison. You may look at our indexed site
search and find that this word is not evident in
the results of a search for it, thus showing
that no robot has been here. The following line
indicates that we do not want any 'bots here. It
is a META tag:
A normal HTML command
<META NAME="ROBOTS" CONTENT="NONE">
Another normal HTML command
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
However, not all robots recognize the above
command as being identical to the META tag above
it.
Most search engines also require the keywords
and description commands:
<META NAME="keywords" CONTENT="Robot, crawler, spider, webcrawler, index">
<META NAME="description" CONTENT="Information about mechanical and automated indexing for search engines.">
A search will show results but will not show
this page if the above sample is used! You will not
find it in any WWW search engine database. Try
it! Enter one of the keywords you used into a
local search engine. Entering one of the other
words will get results but will not show this
page even though it is obvious that the word is
here.
-
Where do I find out more about
robots?
www.robotstxt.org
www.searchtools.com
Search Robots and Robots.txt
Web Robots Database
Ethical Web Agents
UMBC AgentWeb
Webbot - the Libwww Robot
MOMspider: Multi-owner Maintenance Spider
Googlebot: Google's Web Crawler