Get the files

You can download a zip file and extract it

OR

if you use a version control client, you can check out the sources of the project. This makes it easier to keep the files up to date.


Git clone

Check out the sources with a git client.
The following commands create a directory ahcrawler below the webroot and put all files there:

cd [webroot-directory]/[install-dir]/
git clone https://github.com/axelhahn/ahcrawler.git [optional-name-of-subdir]

Leaving [optional-name-of-subdir] empty will create a subdirectory named "ahcrawler".
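
To update an existing checkout later, a git pull inside the created directory is usually enough (replace "ahcrawler" if you chose a different subdirectory name):

cd [webroot-directory]/[install-dir]/ahcrawler
git pull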

Download


Get the latest version: download the zip archive from the project page.

Extract the files somewhere below the webroot of your webserver. You can put them into any subdirectory; it is not a must to have them directly in the webroot.
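
As a command line sketch - assuming you take the zip archive of the master branch on GitHub - the download and extraction could look like this (paths are examples):

	# download and extract below the webroot
	$ cd [webroot-directory]/[install-dir]/
	$ wget https://github.com/axelhahn/ahcrawler/archive/master.zip
	$ unzip master.zip
	# the extracted folder is named ahcrawler-master - rename it if you like
	$ mv ahcrawler-master ahcrawler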

Installation

Open http://localhost/ahcrawler/ in your browser (adjust the path if you installed it into another subdirectory).
You will be guided through the initial setup ... and then, in a second step, through creating a profile for a website.


Select a language

Screenshot: ahcrawler :: installer


Check requirements

If a requirement is missing, you get a warning message.

Screenshot: ahcrawler :: installer


Setup database connection

Sqlite is for small tests or small websites only.
I highly recommend using MySQL/MariaDB - for that you need to create a database and a database user.
You can see whether the required PDO module is installed or not.
Enter the connection data.
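
If you do not have a database yet, a minimal sketch for MySQL/MariaDB looks like this (database name, user and password are examples only - use your own values):

	$ mysql -u root -p
	mysql> CREATE DATABASE ahcrawler CHARACTER SET utf8mb4;
	mysql> CREATE USER 'ahcrawler'@'localhost' IDENTIFIED BY 'use-a-secure-password';
	mysql> GRANT ALL PRIVILEGES ON ahcrawler.* TO 'ahcrawler'@'localhost';
	mysql> FLUSH PRIVILEGES;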

Screenshot: ahcrawler :: installer

When you continue, the connection will be checked.

Screenshot: ahcrawler :: installer


First setup

Backend

Now you are in the backend for the first time.

Screenshot: ahcrawler :: installer


Create a profile

Go to [Profiles] and set up a website.
As a minimum you need:

  • A label (short title)
  • A list of start urls for the crawler - e.g. https://www.example.com/
  • Sticky domain - a domain like www.example.com (DEPRECATED - will be removed soon and then detected from the start urls)
  • Press [Create]

Spider

Crawl a website

Let's go to the command line now. We need to call cli.php with some parameters.
You can call it without parameters to get a help text ... or see the page CLI.

Remark:
The crawling action cannot be triggered from the web GUI.
You need the command line or a cron job to automate the indexing tasks (see the cron example further below).

We start it with the action "index" and let it crawl profile "1":

	# first go to the application directory
	# cd /var/www/ahcrawler/

	$ cd bin
	$ php ./cli.php --action index --data searchindex --profile 1
	

In the output you can follow the crawling process and see how the spider follows detected links in a document ... or not, if an allow/deny rule forbids it.
INSERT lines show that a page's content was added to the search index.
At the end of the crawl process you get a summary with the count of pages in the index and the time used.

If you have reached this point, adjust the profile's include and exclude rules so they fit your needs. Then re-run the indexer with cli.php to update your search index.
You can also safely delete the database; the crawling process recreates it.
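
To automate the re-indexing, a cron job can run the same command. A minimal sketch of a crontab entry (path and schedule are examples only):

	# update the search index of profile 1 every night at 02:30
	30 2 * * * cd /var/www/ahcrawler/bin && php ./cli.php --action index --data searchindex --profile 1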

Screenshot: ahcrawler :: cli :: index the searchindex - the console with the CLI tool starting to index the search index.

Search index in the backend

In the web backend you will see the count of crawled pages and the oldest and newest items in the index.
http://localhost/ahcrawler/backend/?page=status

Crawl resources

TODO:

	# first go to the application directory
	# cd /var/www/ahcrawler/

	$ cd bin
	$ php ./cli.php --action index --data ressources --profile 1
	





