Site-Scan Usage

If you want to use our Site-Scan just like any other online-generator for Google sitemaps, you can simply read the final notes at the end of this webpage, proceed to the start Site-Scan page, key in your email address and root-URL, and start the scanning process. But if you want precise control over the content of your sitemap, want to optimise your robots.txt file and robot meta tags, and want to know whether the accessibility of your webpages for search-engines needs improvement, you should spare a few minutes of your time and keep on reading...

Unlike most other online-tools for the generation of Google sitemaps, Site-Scan gives you a number of options to control how your website is crawled and which URLs will be included in your sitemap. In order to receive the results you are expecting, it requires some basic preparation of your website. But don't worry: once you have followed the instructions below and prepared your site accordingly, you don't have to change any of these settings for future scans, unless you apply major changes to your website.

Most sitemap-generators that run directly on the server or scan your website from your PC via an internet connection store special files with different filter options for each website. Site-Scan instead bases the inclusion (or exclusion) of webpages and files - or rather their URLs - in your sitemap on two criteria: the robots.txt file located in the root-directory of your website, and the robot meta tags found on your webpages. This approach has the following advantages:

  • The "filter-settings" for each website are actually an integral part of the respective website, so you don't have to worry how to store and organize your settings when you are taking care of several different websites.

  • It makes sense to ensure that other crawlers will find and index the same pages that you are including in your Google sitemap, and controlling the generation of the sitemap based on these criteria ensures that all search-engines will (within certain limits) index the same pages - well, at least this is our opinion, but we think most of you will agree with us...

  • And as a value-added feature, Site-Scan will parse your old sitemap, copy the values for <changefreq> and <priority> of existing URLs into your new sitemap, and insert default values for new URLs not included in your old sitemap (see the example entry below). This reduces the time required for manual adjustments to a minimum: if you just updated some pages without adding new ones, Site-Scan will simply update the modification dates and you can use your new sitemap straight away, without any manual editing.
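
    For reference, a single URL entry in a Google sitemap looks roughly like the following sketch (the URL and all values are placeholders); <changefreq> and <priority> are the optional values Site-Scan copies over from your old sitemap:

      <url>
        <loc>http://www.example.com/page.html</loc>
        <lastmod>2006-01-01</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
      </url>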

In the following we will explain how to prepare your robots.txt file and robot meta tags in order to control the content of your Google sitemap, and how to set the available options before starting Site-Scan.


Robots.txt file

When search-engines (or robots, or spiders, or crawlers, or whatever you want to call them) start crawling a site, they will first check the root-directory for the existence of a special file called robots.txt. This file contains rules based on the Robots Exclusion Standard, telling them which files and / or directories they are forbidden to crawl or download. (That is how it should be; unfortunately a lot of robots - also known as "Bad-Bots" - do not obey these rules, but luckily the major ones do.) For a more detailed description of the standard, how to set up your robots.txt file and for the verification of your files, you can check out the following links:

So in order to control the content of your sitemap, you define the respective rules in your robots.txt file and save it to your root-directory, and our Site-Scan will exclude these files from your sitemap. Either you prepare a special set of rules for "User-agent: Site-Scan", or Site-Scan will use the rules for "User-agent: *", the anonymous user-agent, if found. In general we recommend using a special set of rules for Site-Scan for the following reasons:

  • If you define a separate list of rules for Site-Scan, you avoid any negative effect on other robots visiting your website, so you are free to define rules meant for optimising the crawling of your website and controlling the content of your sitemap. Furthermore, it gives you the option to exclude unknown robots from crawling your site by adding the lines "User-agent: *", "Disallow: /".

  • If you use a special set of rules for user-agent Site-Scan, you can set up additional rules telling Site-Scan to exclude URLs from scanning altogether. Unlike URLs excluded by the robots exclusion standard, which will still be scanned for links and robot meta tags but excluded from your Google sitemap, URLs excluded by these rules will be skipped completely during the crawling of your website, which leads to fewer URLs listed in your result page and a reduced amount of data transferred during the scan. Unlike the rules defined by the robots exclusion standard, which begin with a "/" (slash), these rules start with a "." (dot). In order to give you maximum flexibility, Site-Scan permits two different types of exclusion, namely exclusion by file-extension and exclusion by file and / or directory name. Exclusions by file-extension start with a "." (dot), followed by the actual file-extension, which can be 1 to 4 characters long. Exclusions by file and / or directory name start with "./" (dot-slash), followed by the file or directory name and an optional trailing slash in case you only want to exclude specific directories. Please note that all these rules work case-insensitively.

Example: In order to show you how the different rules affect the scanning process and the content of your Google sitemap, let's assume your robots.txt file contains the following set of rules:

    User-agent: Site-Scan
    Disallow: /css/
    Disallow: /js/
    Disallow: .jpg
    Disallow: .jpeg
    Disallow: ./test
    Disallow: ./private/

The first line defines the user-agent the rules are meant for; here we use "Site-Scan" to make sure our rules do not interfere with any other robots, e.g. search-engines. The next two lines follow the normal robots exclusion standard; in our example they disallow the two directories containing the cascading style sheets (css) and JavaScript (js) files. If Site-Scan comes across any links pointing to files in these directories, it will scan them and display them in your result page but exclude them from your sitemap. Thus you can verify your links to these files without telling Google to index them.

The next two lines exclude files via file-extension, in this case pictures in the JPEG-format. If Site-Scan comes across links pointing to URLs ending with one of these extensions, it will simply discard them, meaning these URLs will show up neither in your result page nor in your sitemap, nor will Site-Scan send any HTTP-request for these files to your server.

The last two lines contain exclusions via file and / or directory name. The first of them, "Disallow: ./test", tells Site-Scan to discard all URLs pointing to files or directories whose name starts with "test", e.g. test1.html, test2.jpg, files inside the /test/ or /test2/ directories etc. The second, "Disallow: ./private/", limits the exclusion to a specific directory name, i.e. Site-Scan will discard all URLs pointing to files inside the /private/ directory. As with the exclusion via file-extension, all URLs excluded by these rules will not show up in your scan results or your sitemap, nor will Site-Scan issue a request for these URLs.
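
To illustrate the combined effect, here is how a few hypothetical URLs would be treated under the example rules above:

    /css/main.css          scanned and shown in the results, but excluded from the sitemap
    /images/photo.jpg      skipped completely (extension rule ".jpg")
    /test1.html            skipped completely (name rule "./test")
    /private/notes.html    skipped completely (directory rule "./private/")
    /contact.html          scanned and, provided its meta tags permit it, included in the sitemap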

Note: The exclusions via "." (dot) and "./" (dot-slash) rules are meant to reduce the number of URLs listed in your scan results and to reduce the amount of bandwidth and time required for the scanning process. You should use them if your website contains a vast number of links to files you don't want to list in your sitemap, e.g. pictures or mp3-files. However, they only work when defined for user-agent Site-Scan, otherwise Site-Scan will ignore these rules!

Tip: When Site-Scan is parsing your robots.txt file, it first parses it completely to check whether it contains rules for user-agent "Site-Scan" before it switches over to user-agent "*". However, other robots are not necessarily that intelligent and will stop parsing your file when they find the first possible match. That means you should always put the rules for the anonymous user-agent "*" at the very end of your robots.txt file, making sure that other robots really match the rules you defined for them, as in the sketch below.
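
For example, a robots.txt file that combines a special set of rules for Site-Scan with rules for all other robots could be structured as follows (the individual rules are hypothetical and only illustrate the recommended order):

    # rules for Site-Scan come first
    User-agent: Site-Scan
    Disallow: /css/
    Disallow: .jpg

    # rules for the anonymous user-agent come last
    User-agent: *
    Disallow: /cgi-bin/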


Robot meta tags

The second method to control the inclusion (or exclusion) of URLs in your sitemap is the use of robot meta tags. These are special html-tags in the head-section of your webpages and allow you to easily control the handling of each single webpage. For a more detailed explanation of how to use robot meta tags, you can find additional information under the following link:

While the scan results and the generation of your sitemap are only influenced by the use of the "index / noindex" and "follow / nofollow" meta tags, Site-Scan also checks whether your pages include the "noarchive" meta tag. This is only for informative purposes and shows you whether you excluded some pages from being cached by the search-engines. This might be useful in cases where these pages are generated by scripts with varying output, or they just contain redirects to other webpages. As defined by the robots exclusion standard, these meta tags are checked in a case-insensitive way, so you can write "NOFOLLOW", "Nofollow" or "nofollow".

The "index / noindex" meta tag is decisive whether the respective page will be included into your sitemap. In case of "index", Site-Scan will include it, in case of "noindex", it will exclude it. If the page does not contain this meta tag, it will be included, which is in line with the behaviour of most search-engines.

The "follow / nofollow" meta tag is controlling whether Site-Scan will follow the links contained in this page. In case of "follow", Site-Scan will crawl the links found on this page, in case of "nofollow" it will discard them. If the page does not contain this meta tag, it will extract and follow the links, which is in line with the behaviour of most search-engines.

As an alternative, this standard also permits the usage of the "all" and "none" directives, whereby "all" stands for "index, follow" and "none" stands for "noindex, nofollow". If Site-Scan comes across one of these directives, it will translate it into "index, follow" or "noindex, nofollow" accordingly and also show these values in your scan results.
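
For example, if you want a page to be crawled for links but kept out of your sitemap, you could place the first of the following tags in its head-section; the second line shows the shorthand for excluding a page and its links entirely:

  • <meta name="robots" content="noindex, follow" />
  • <meta name="robots" content="none" /> (equivalent to "noindex, nofollow")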

Furthermore, Site-Scan also supports the rel="nofollow" attribute, which was introduced by Google and is now supported by most other major search-engines. When Site-Scan comes across a tag containing a link which also carries this attribute, it will completely exclude this link from scanning. This filter is active by default, but it can be deactivated by deselecting "Include meta tags" in the scan-options before starting Site-Scan. For a more detailed description of this attribute and how to use it please refer to this article on searchenginewatch.
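
A link carrying this attribute could look like the following sketch (the URL is a placeholder):

  • <a href="http://www.example.com/page.html" rel="nofollow">Example link</a>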

Note: The effect of the "nofollow" meta tag is different from disallowing a file or directory in your robots.txt file. As explained earlier, Site-Scan will crawl disallowed pages and extract and follow any links contained inside, without including these pages themselves in your sitemap. In case of the "nofollow" meta tag, it will include these pages in your sitemap (unless you've blocked them via "noindex"), but will not follow any link on these pages.

Tip: You might have noticed that we are using the phrase "most search-engines" more than once. The reason is that search-engines are different, and each follows its own rules and algorithms; this is also the case when it comes to meta tags. While most web-designers and webmasters assume that "index, follow" is the default behaviour for all search-engines if those tags are not found, there is a chance that this is not true, e.g. Inktomi is said to have a default "index, nofollow" behaviour. We can't verify this, but in order to be on the safe side we recommend that you include the following line in the <head>-section of the respective webpages:

  • <meta name="robots" content="index, follow" />

Or, as a valid but slightly shorter alternative:

  • <meta name="robots" content="all" />

It only adds a few bytes, so including it does no harm - but who knows whether leaving it out does?


How to select the options

After you have prepared your website for the Site-Scan, you will proceed to the start Site-Scan page where you have to fill in some data and can select several options for the scanning process and the generation of your sitemap. In the following we will explain what is required and what effects the different options have on the results of your Site-Scan. The optional settings are automatically set to default values which ensure optimum scan results, therefore it is usually not required to change them.

  • Your email-address (required)

    Once the scan request for your website has been completed, Site-Scan will generate a webpage containing detailed scan results as well as a link to the generated Google sitemap. In order to make it accessible to you, the scan-engine will send you an email which contains a link to this webpage, so please make sure that you provide a correct and valid email address.

  • Your root-URL (required)

    This is the URL to the root directory of your website, which contains your start page (e.g. index.html, Default.asp etc.), your robots.txt file as well as your existing Google sitemap. Please make sure you enter the complete URL including the trailing slash, e.g. http://www.mydomain.com/. If Site-Scan encounters a redirect when requesting the root-URL, e.g. because you installed a redirect to ensure usage of the canonical form of your links, it will update the root-URL automatically.
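
    A few examples (mydomain.com is a placeholder):

      correct:    http://www.mydomain.com/
      incomplete: www.mydomain.com (scheme and trailing slash missing)
      incorrect:  http://www.mydomain.com/index.html (points to a file instead of the root directory)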

  • Include default filters (optional, default: Yes)

    URLs pointing to certain applications, images, audio- and video-files etc. should not be included in a sitemap, and this option allows you to tell Site-Scan to ignore these URLs without modifying your robots.txt file. The exclusion is based on registered MIME file-extensions; a list with all file-extensions currently excluded can be found on the webpage Site-Scan news. We added this option for people who are not familiar with editing their robots.txt file, or reluctant to do so. To decide whether to select this option or not, consider the following rule-of-thumb:

    • If you use Site-Scan for generating a sitemap, you should activate this option. It will make sure that Site-Scan only requests URLs which it can parse and which actually belong in your sitemap, minimising the time and bandwidth required and reducing the risk that it will reach one of the limits. You can still define additional rules in your robots.txt file; Site-Scan will apply both filters based on an OR-condition.
    • If you use Site-Scan for checking your internal links, you should deactivate this option. This will ensure that Site-Scan checks all internal URLs and therefore also validates all links pointing to images etc. You can still exclude certain files or directories by means of rules defined in your robots.txt file.
  • Include robots.txt (optional, default: Yes)

    Although the generation of your Google sitemap is mainly based on the rules you defined in the robots.txt file (and the robot meta tags included in your webpages), we give you the option to deselect the inclusion of the rules defined there. This can be useful in cases where you want to receive a sitemap which includes all URLs that can be extracted from your website, without applying any changes to your robots.txt file. Otherwise use the default setting.

  • Include meta tags (optional, default: Yes)

    Similar to robots.txt, you can exclude the usage of meta tags for the crawling of your website and the generation of your sitemap. This is useful if you are not sure whether the usage of "nofollow" on certain pages will stop robots from finding other pages, or if you want to receive a sitemap including the webpages for which you used "noindex". If you want to generate a standard sitemap, leave it at its default value. If you deselect this option, Site-Scan will also ignore the rel="nofollow" attribute while parsing your webpages.

  • Include old sitemap (optional, default: Yes)

    Usually Site-Scan will check for the existence of an old sitemap, parse it and use the values for <changefreq> and <priority> of existing URLs for the generation of the new sitemap, thus saving you the considerable time otherwise required for the manual editing of these values. For new URLs, it will automatically set <changefreq> and <priority> to the default values you selected, allowing you to choose your preferred values for these attributes. However, if for whatever reason you do not want to include these values, but rather want a sitemap built from scratch, you can deselect this option. This is basically the same as scanning a website without an existing sitemap.

  • New sitemap options (optional, default: Yes, Yes, Yes)

    By definition, a Google sitemap should at least contain a list of URLs to be crawled by Googlebot. However, based on the basic idea behind the Google sitemap project, we think it is rather pointless to submit a sitemap which does not contain at least the last modification dates. And as the format also offers optional values for <changefreq> and <priority>, we are of the opinion that you should take full advantage of all available options and include them in your sitemap, whether they currently really make a difference or not. However, you can deselect any single option if desired, and Site-Scan will exclude it from your new sitemap.

    Note: Once you have generated a sitemap without <changefreq> or <priority> and uploaded it to your server, this information will not be available during the next scan and is therefore lost, so you will have to edit these values manually the next time you want to include them and do not want to stick to the values which Site-Scan assigns by default! If you want to reuse these optional values later on, make sure to keep a backup of a sitemap containing them and upload it back to your server before starting a new Site-Scan.

  • Default values (optional, default: never, 0.5)

    These are the default values Site-Scan will use for the optional attributes <changefreq> and <priority> for new URLs, or in case these values are not available from an existing sitemap and therefore cannot be copied over to the new sitemap. You can choose the default values used by Site-Scan according to your requirements. The reason why we decided on "never" as the default for <changefreq> is that it should not be used for an active website, and it therefore makes it easy to scan your new sitemap for any new URLs added: simply search for "never" with a text or XML editor, and it will lead you to the newly added URLs so that you can modify this value as required. For <priority> we decided on 0.5 as this is a neutral value, meaning average priority for this URL. However, as mentioned, you can change these default settings according to your requirements.
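
    With the default settings, a newly added URL would therefore appear in your new sitemap roughly like the following sketch (the URL and date are placeholders), which is exactly the pattern you can search for:

      <url>
        <loc>http://www.example.com/new-page.html</loc>
        <lastmod>2006-01-01</lastmod>
        <changefreq>never</changefreq>
        <priority>0.5</priority>
      </url>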

  • Save scan-options (optional)

    Site-Scan permits you to save the settings of your current session so that they will be retrieved automatically for your next scan. Saving these settings is based on a cookie stored on your computer, so please make sure that you have enabled the storage of persistent cookies from our website if you want to make use of this feature. The data stored in this cookie consists of your email address, root-URL and the other options and default values selected by you.

    Note: In order to ensure your privacy, the default setting for this option is "no", meaning no cookie will be stored on your computer until you select "yes". After this, the default setting will be "yes" as long as Site-Scan can locate this cookie on your computer; however, by selecting "no" again before starting a new scan, this cookie will be deleted immediately and the default setting will return to "no". Please note that the cookie set by Site-Scan will expire automatically 90 days after your last successful scan request. If you selected "yes" and Site-Scan does not store your scan-options, please check your browser's settings to make sure the storage of cookies is permitted.


Abuse prevention

When we turned Site-Scan into an online-tool, we took great care to avoid any possible abuse. Site-Scan will access your server with the maximum bandwidth available in order to provide your scan results and Google sitemap as fast as possible. However, people could use this to target other servers and keep them busy with repeated scan requests, which would result in a DoS (Denial of Service) attack, meaning that your server would be busy with responding to our Site-Scan requests with limited or no resources left for other requests. In order to avoid this, we included two main security features:

  • Scan permission: Once a scan request is passed on to the scan-engine, it will first check the root-directory for the robots.txt file. If it is found, it will parse it for rules, first for user-agent "Site-Scan" and then, if not found, for "*", the anonymous user-agent. If it finds the respective rules, and they contain a line "Disallow: /", Site-Scan will immediately abort the scanning process, delete any data collected so far and proceed with the next scan request. Please note that even if you deselect "Include robots.txt" in your scan options, this safety feature is still active. So if you want to protect your site from being accessed by Site-Scan, simply add the following lines to your robots.txt file:

    User-agent: Site-Scan
    Disallow: /

    If you had some bad experiences with robots before (and some people had really bad experiences...) and your robots.txt file already contains the following lines:

    User-agent: *
    Disallow: /

    you are on the safe side: Site-Scan will not start a scanning-process on your site or generate any scan-results for it unless you define a separate set of rules for "User-agent: Site-Scan".

  • Scan delay: Not every website contains a robots.txt file, so in order to add a second layer of protection against possible abuse we added a minimum scan delay between two scans for the same root-URL. If you try to register another scan request within this period, or a request is still pending for this URL, it will be rejected. This not only limits the possibility of a DoS attack but also helps to prevent people from using too much of the available bandwidth and processing capacity of Site-Scan by scanning their site over and over again, thus helping to share the available resources with others. The current value for this delay can be found in the Site-Scan news.

In addition to these safety measures we reserve the right to exclude certain URLs from scan requests as well as to restrict the access to this website or parts thereof from specific IPs or IP-blocks. Should you think that this online-tool is being abused please contact us so that we can take the appropriate action.

Tip: Even if you are using Site-Scan for the generation of your Google sitemaps, the proper use of the scan permission via "Disallow: /" still makes it possible to almost completely exclude other people from scanning your website. Site-Scan only checks the rules set in your robots.txt file at the moment the scan is started. So before requesting a scan, simply comment the line out by using the hash-sign: "# Disallow: /" (see the sketch below). It might take several minutes before the scan-engine starts with your request, depending on the number of requests pending, but because of the scan delay new requests for this site are blocked for a considerable time, giving other people little chance to register a request for your "unprotected" website. Once you have received your email with the link to your scan results, you restore the value to "Disallow: /", and your site is protected again. However, if you are not worried about other people scanning your website, you don't have to modify your robots.txt file; just leave it alone once you have figured out your optimum settings.
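
In practice, the relevant part of your robots.txt file would simply alternate between these two states:

    # while you are requesting your own scan:
    User-agent: Site-Scan
    # Disallow: /

    # at all other times:
    User-agent: Site-Scan
    Disallow: /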


Final notes:

In order to fulfil as many requests as possible we have to set some limits for each Site-Scan requested; the currently active limits are listed in the Site-Scan news. To perform a successful scan please make sure that you also meet the following additional requirements:

  • Make sure Site-Scan is not disallowed from accessing your site by the rules set in your robots.txt file.
  • Make sure your server is up and running before requesting a Site-Scan.
  • Do not request a Site-Scan during periods you know your server might be extremely busy.
  • If you are using bandwidth-throttling, requests per-IP / minute or any other techniques to limit the access to your website make sure to exclude our IP-address (202.190.199.53) from these restrictions.
  • If Site-Scan reaches the maximum scanning time and therefore cannot finish crawling your website, consider enabling persistent connections on your server if they are not already activated (see the example below).
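
    For example, on an Apache webserver persistent connections are controlled by the KeepAlive directives; the values below are only a sketch, so please consult your server's documentation for settings that suit your site:

      KeepAlive On
      MaxKeepAliveRequests 100
      KeepAliveTimeout 15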

Once you have defined the rules for your robots.txt file, added the correct meta tags to your webpages as required and uploaded everything to your server, your website is prepared for Site-Scan.
