Install tutorial
Install ahCrawler, set up a first website and index it.
You can download a zip file and extract it.
Alternatively, if you use a version control client, you can check out the sources of the project. This makes it easier to keep the files up to date.
Check out the sources with a Git client.
The following commands create a directory ahcrawler below the webroot and put all files there:
git clone https://github.com/axelhahn/ahcrawler.git [optional-name-of-subdir]
Leaving [optional-name-of-subdir] empty will create a subdirectory named "ahcrawler".
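Keeping a Git checkout up to date later only requires a pull inside that directory. A minimal sketch, assuming the default remote and branch and an example path:

```shell
# example path below the webroot; adjust to your server
cd /var/www/ahcrawler
# fetch and apply the latest changes from the repository
git pull
```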
Get the latest version:
Extract the files somewhere below the webroot of your webserver. You can put them into any subdirectory; they do not have to be directly in the webroot.
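A sketch of the download and extraction on the command line, assuming GitHub's standard archive URL for the master branch and an example webroot path:

```shell
# example webroot; adjust to your server
cd /var/www
# download the current sources as a zip archive (assumed standard GitHub archive URL)
wget https://github.com/axelhahn/ahcrawler/archive/refs/heads/master.zip
# extract and rename the created directory
unzip master.zip
mv ahcrawler-master ahcrawler
```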
In your browser open
You will be guided through an initial setup ... and then, in a second step, through creating a profile for a website.
If a requirement is missing, you get a warning message.
SQLite is suitable for small tests or small websites only.
I highly recommend using MySQL/MariaDB; for that you need to create a database and a database user.
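A minimal sketch for creating the database and user on a MySQL/MariaDB server. The database name "ahcrawler", the user name and the password are examples only; pick your own values:

```shell
# run the statements as the database admin user (you will be asked for its password)
mysql -u root -p <<'SQL'
CREATE DATABASE ahcrawler CHARACTER SET utf8mb4;
CREATE USER 'ahcrawler'@'localhost' IDENTIFIED BY 'use-a-strong-password';
GRANT ALL PRIVILEGES ON ahcrawler.* TO 'ahcrawler'@'localhost';
FLUSH PRIVILEGES;
SQL
```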
You will see whether the required PDO module is installed or not.
Enter the connection data.
When you continue, the connection will be checked.
Now you are in the backend for the first time.
Go to [Profiles] and set up a website.
As a minimum you need:
Let's switch to the command line now. We need to call cli.php
with some parameters.
You can call it without parameters to get help ... or see the page CLI.
The crawling action cannot be triggered from the web GUI.
You need the command line, or you can create a cron job to automate indexing tasks.
We start it with the action "index" and let it crawl profile "1":
# first go to the application directory
# cd /var/www/ahcrawler/
$ cd bin
$ php ./cli.php --action index --data searchindex --profile 1
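This indexer run can be automated with a cron job. A sketch of a crontab entry that rebuilds the search index of profile 1 every night at 02:30; the path and schedule are examples, adjust them to your installation:

```shell
# m  h  dom mon dow  command
30   2  *   *   *    cd /var/www/ahcrawler/bin && php ./cli.php --action index --data searchindex --profile 1
```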
In the output you can follow the crawling process and see how the spider follows
detected links in a document ... or does not, if an allow/deny rule forbids it.
INSERT lines show that a page's content was added to the search index.
At the end of the crawl process you get a summary with the number of pages in the index and the time used.
Once you have reached this point, adjust the include and exclude rules of the profile so that they fit your needs. Then re-run the indexer with cli.php to update your search index.
You can also safely delete the database; the crawling process recreates it.
In the web backend you will see the count of crawled pages and the oldest and
newest items in the index.
To index the resources of the profile as well, run the same command with --data resources:

# first go to the application directory
# cd /var/www/ahcrawler/
$ cd bin
$ php ./cli.php --action index --data resources --profile 1