Configuration

You find the configuration in the ./config/ directory. It is a JSON file named crawler.config.json with several sections. By default the project ships with an example file that you can copy and adapt.
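
The file has two top-level sections - “options” and “profiles” - which are described below. Schematically:

{
    "options":{
        (...)
    },
    "profiles":{
        (...)
    }
}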

Section “options”

setup of default options … not specific to a website profile

Variable type Comments
analysis key Values for analysis
… -> MinDescriptionLength integer html check - minimum number of characters in the description
… -> MinKeywordsLength integer html check - minimum number of characters in the keywords
Remark: the importance for SEO is very low; Google does not process the keywords for its search index.
If you use the ahCrawler search, hits in the keywords get a higher ranking than hits in the content.
If you do not use the search engine, you can set the value to 0 (zero) to disable the check for keywords.
… -> MinTitleLength integer html check - minimum number of characters in the document title
… -> MaxPagesize integer html check - limit above which a page is reported as large (bytes)
… -> MaxLoadtime integer html check - limit above which a page is reported as slow loading (ms)
auth key authentication for the backend; you can set up a single user only.
You can disable this (quite simple) internal authentication by removing (or renaming) this section.
If you need several users: disable this section and set up authentication with Apache users based on directory or location.
… -> user string username
… -> password string hash of the desired password, generated with password_hash()
crawler key defaults for crawling
… -> memoryLimit string Memory size for CLI only; the value is a valid memory size for ini_set(‘memory_limit’, [value]); the default is 512M
… -> searchindex key defaults for search index crawler
…… -> simultanousRequests integer default count of simultaneous requests for crawling pages, for all projects you set up.
The crawling for the search index makes http GET requests (it loads the content).
The minimum number is 2.
Hint: Do not overload your own servers. Speed is really not important for a cronjob.
… -> ressources key defaults for resources crawler
…… -> simultanousRequests integer default count of simultaneous requests for the resources scan, for all projects you set up.
The resources scan makes http HEAD requests (it does NOT load the content).
The minimum number is 2.
Hint: Do not overload your own servers. Speed is really not important for a cronjob.
And: you can set up HEAD requests with a much higher value than GET requests. Try 5 … and then higher values.
… -> timeout integer timeout value for all crawling requests (GET and HEAD) in seconds (see the crawler sketch below the configuration example).
… -> userAgent string User agent to use for crawling. Background: a few webservers send an http error code when they detect a crawler. If you set the user agent of a “normal” web browser, the chance is higher to get a valid response.
Hint: in the web GUI, go to the setup page to use the user agent of the browser you are currently using.
database key define database connection
… -> database_type string type of the PDO database connection; so far only sqlite and mysql are supported.
For first tests use “sqlite” - it is OK for websites with a few hundred pages and is simpler to set up. A mysql connection is sketched below the configuration example.
one of “sqlite” | “mysql”
… -> database_file string sqlite only: name of the database file.
default is “__DIR__/data/ahcrawl.db” (where __DIR__ will be replaced with application root directory)
… -> database_name string non-sqlite only: name of the database scheme
… -> server string non-sqlite only: name of database host
… -> username string non-sqlite only: name of database user
… -> password string non-sqlite only: password of database user
… -> charset string non-sqlite only: charset; example: “utf-8”
cache boolean Use a cache in the backend pages to speed them up. The cache expires with a new crawling process for the viewed website profile.
This feature is work in progress and is disabled by default (false).
debug boolean show debug information in the backend pages.
lang string language of the backend interface;
one of “de” | “en”
skin string Name of the current skin. It is a directory name in ./backend/skins/.
menu array hide menu items in the backend
- key is the name of the page (have a look at the url in the address bar: ?page=[name])
- value is one of true|false
menu-public array hide menu items in the public frontend
- key is the name of the page (have a look at the url in the address bar: ?page=[name])
- value is one of true|false
searchindex key settings for the search index
… -> regexToRemove array list of regex to remove from html body for the search index; by default it contains
- html comments
- script and style sections
- link rel
- nav tags
- footer tags
… -> rankingWeights array Define factors to weight the search results in the search form on your website. A direct match of the search term with a whole word should rank higher than a match in the middle of a longer word (the structure is sketched below the configuration example).

The sections are:
- matchWord - Exact hit of a whole word
- WordStart - Hit at the beginning of a word
- any - Hit anywhere in the text

In each of these sections are the places where the search term is scanned. A hit in the url, for example, should have a higher weight than a hit in the content.
- content … in the content
- description… in the meta description
- keywords … in the keywords
- title … in the title tag
- url … in the url
{
    "options":{
        "database":{
            "database_type": "sqlite",
            "database_file": "__DIR__/data/ahcrawl.db"
        },
        "auth": {
            "user": "admin",
            "password": "put-md5-hash-here",
        },
        "lang": "en",
        "crawler": {
            "memoryLimit": false,
            "userAgent": false,
            "searchindex":{
                "simultanousRequests": 2
            },
            "ressources":{
                "simultanousRequests": 2
            }	
        },
        "searchindex": {
            "regexToRemove": [
                "<footer[^>]>.*?<\/footer>",
                "<nav[^>]>.*?<\/nav>",
                "<script[^>]*>.*?<\/script>",
                "<style[^>]*>.*?<\/style>"
            ]
        },
        "analysis": {
            "MinTitleLength": 20,
            "MinDescriptionLength": 40,
            "MinKeywordsLength": 10,
            "MaxPagesize": 150000,
            "MaxLoadtime": 500
        }
    },
    (...)
}
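
The example above leaves “memoryLimit” and “userAgent” at false and does not set a timeout. A hedged sketch of a crawler block with these values filled in - the concrete values are only illustrative, not recommendations:

"crawler": {
    "memoryLimit": "512M",
    "userAgent": "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0",
    "timeout": 10,
    "searchindex": {
        "simultanousRequests": 2
    },
    "ressources": {
        "simultanousRequests": 5
    }
}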
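
The example uses sqlite. For a mysql connection the “database” block uses the non-sqlite keys from the table above; the bracketed values are placeholders you have to replace:

"database": {
    "database_type": "mysql",
    "server": "[db-host]",
    "database_name": "[db-schema]",
    "username": "[db-user]",
    "password": "[db-password]",
    "charset": "utf-8"
}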
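
The “rankingWeights” option is not part of the example above. A minimal sketch of its structure, combining the three match sections with the five scanned places described in the table; the numeric factors are only illustrative:

"searchindex": {
    "rankingWeights": {
        "matchWord": {
            "content": 3,
            "description": 6,
            "keywords": 8,
            "title": 10,
            "url": 10
        },
        "WordStart": {
            "content": 2,
            "description": 3,
            "keywords": 4,
            "title": 5,
            "url": 5
        },
        "any": {
            "content": 1,
            "description": 1,
            "keywords": 2,
            "title": 2,
            "url": 2
        }
    }
}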

Section “profiles”

Setup of the websites to crawl. The first index below is an integer value called the profile id. The table below describes all values of a profile.

{
    "options":{
        (... see above ...)
    },
    "profiles":{
        "[id]":{
            (... profile settings ...)
        }
    }
}
Variable type Comments
label string A short name for the website. It is shown in the admin interface as a tab label at the top.
description string description text for this website
userpwd string optional setting for password-protected websites with basic authentication
The syntax is
[username]:[password]
searchindex key definitions for the crawler of the search index
… -> urls2crawl array Start urls for scan
… -> iDepth integer Maximum path level to scan
… -> iMaxUrls integer For initial tests: set max. count of urls to scan (0 = no limit)
… -> include array Array with regex that will be applied to any detected full url in a link.
The crawler adds a url if it matches one of the regex.
Default: none; any url (matching the sticky url) will be followed.
… -> includepath array Array with regex that will be applied to the url path of any detected link.
The crawler adds a url if it matches one of the regex.
Default: none; any url (matching the sticky url) will be followed.
… -> exclude array Array with regex that will be applied to the url path of any detected link.
The crawler skips a url if it matches one of the regex.
Default: none; any url (matching the sticky url) will be followed.
Remark: Even if it is empty … the crawler follows several disallow options by default:
- disallow for agent “*” in robots.txt
- disallow for agent “ahcrawler” in robots.txt
- meta robots noindex and nofollow in the html head
- nofollow attribute in a link
… -> regexToRemove array list of regex to remove from html body for the search index; it overrides the default in the options section: options -> searchindex -> regexToRemove.
… -> simultanousRequests integer count of simultaneous requests; it overrides the default in the options section: options -> crawler -> searchindex -> simultanousRequests.
frontend key definitions for the frontend (search form for your website)
… -> searchcategories array items for search categories based on url path
Syntax:
[key] - label of the filter
[value] - value for the sql WHERE clause
e.g. “… my blog”: “/blog/%”
… -> searchlang array items for language select box in the search form
e.g. [“de”, “en”]
resources key definitions for the resources crawler
… -> simultanousRequests integer count of simultaneous requests; it overrides the default in the options section: options -> crawler -> ressources -> simultanousRequests
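
A hedged example of a single profile, using the keys from the table above; the id, url, regex, labels and numbers are placeholders that only illustrate the structure:

"profiles":{
    "1":{
        "label": "My website",
        "description": "Crawl of https://www.example.com",
        "searchindex": {
            "urls2crawl": [
                "https://www.example.com/"
            ],
            "iDepth": 5,
            "iMaxUrls": 0,
            "exclude": [
                "\\.pdf$",
                "\/login\/"
            ],
            "simultanousRequests": 2
        },
        "frontend": {
            "searchcategories": {
                "… my blog": "/blog/%"
            },
            "searchlang": ["de", "en"]
        },
        "resources": {
            "simultanousRequests": 5
        }
    }
}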