Introduction: CLI tool

ahcrawler ships with a command line tool.
It is located in the ./bin/ subdirectory.
With it you can

  • list current profiles
  • (re)index a website profile
  • delete data of a profile
  • flush all data of all profiles

It was written to be used in cronjobs and for manual indexing.

Called without parameters, it shows a help text.

ahcrawler :: cli

Basic rules
Most of the commands you will need follow a structure with three parameter blocks:

cli.php [action] [for which data] [and which profile]

You can use the short variant of each parameter or the long one (which is more readable).
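To illustrate the three parameter blocks, here is the same example call in both variants. This is a dry run that only prints the commands; the action, data item, and profile id 1 are example values:

```shell
#!/bin/sh
# Long and short parameter forms are equivalent; this sketch only
# prints both variants of the same example call (profile id 1):
LONG="php cli.php --action index --data searchindex --profile 1"
SHORT="php cli.php -a index -d searchindex -p 1"
echo "$LONG"
echo "$SHORT"
```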

Actions

--action [name of action] or -a [name of action]

Known actions are:

  • list - list all existing profiles
  • index - start the crawler to reindex the searchindex or the resources
  • reindex - delete existing data of a profile and reindex it (searchindex + resources in one step)
  • update - start the crawler to rescan missed or failed items of the searchindex or resources
  • empty - remove existing data of a profile
  • flush - drop the data of ALL profiles

Data

--data [name] or -d [name]

Valid data items are:

  • searchindex - the database of the web content for a website search; this is always the first data item you need to fill!
  • resources - the resources used on your website (links, images, css, js files)
  • search - the search terms entered by your visitors (if you use the search form)
  • all - short for searchindex + resources

List profiles

With the list action you find out the ids of your profiles.
You will need these ids for the --profile (or -p) parameter in other actions.

Example:

cli.php --action list

(Re-) Create the index of a website

With the reindex action you can delete existing indexed data and start the indexer. This is the simplest way to update a profile. It handles both data stores in a single step: it deletes and reindexes

  • searchindex
  • resources

The --profile parameter defines the profile to handle.

Example:

cli.php --action reindex --profile 1

Remark: On shared hosting with a limited execution time you can split the work into separate actions (empty, then index, then update) and data items (searchindex and resources) while looping over all profiles.
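The splitting described in the remark above can be sketched as a small shell loop. This is a dry run that only prints the commands instead of executing them; the profile ids 1 and 2 are example values:

```shell
#!/bin/sh
# Hypothetical sketch for shared hosting with execution time limits:
# split a full reindex into separate, shorter steps per profile.
# The commands are only collected and printed here; remove the echo
# and run each line for real (e.g. as separate cronjobs).
CMDS=""
for PROFILE in 1 2; do
  CMDS="$CMDS
php cli.php --action empty  --data all         --profile $PROFILE
php cli.php --action index  --data all         --profile $PROFILE
php cli.php --action update --data searchindex --profile $PROFILE
php cli.php --action update --data resources   --profile $PROFILE"
done
echo "$CMDS"
```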

Create index of a website

With the index action you can start the indexer to rescan the searchindex OR the linked resources.
Remark: to delete already indexed data you need to call the "empty" action (see below).

The --profile parameter defines the profile to handle.
The --data parameter is used to tell what to index.

  • searchindex - the database of the web content for a website search; this is always the first data item you need to fill!
  • resources
  • all (searchindex + resources)

Example:

cli.php --action index --data all --profile 1

Remark:
If the website was crawled before, you may want to delete the data of a single profile first (action empty) - or flush all indexed content of all profiles (action flush).

Update index of a website

With the update action you can complete a scan. It starts the indexer to recheck all items that failed in the last run and have an error status.
Run the update command only after a full index of a website profile.

The --profile parameter defines the profile to handle.
The --data parameter is used to tell what to index.

  • searchindex
  • resources

Example:

cli.php --action update --data resources --profile 1

Empty data of a single website profile

With the empty action you can delete all entries of the given profile id. This command initiates a DELETE in the database table(s) for all items with the given profile id.

The --profile parameter defines the profile to handle.
The --data parameter is used to tell what to delete.

  • searchindex
  • resources
  • all (searchindex + resources)
  • search - be careful - in most cases you don't want this
  • full (searchindex + resources + search) - be careful - in most cases you don't want this

Example:

cli.php --action empty --data searchindex --profile 1

Flush data of all website profiles

With the flush action you can delete all data of all profiles. This command initiates a DROP TABLE command in the database.
Use the flush command if you have created a search index and a resources scan and want to rebuild them from scratch.

The --profile parameter is not needed - dropping tables affects all profiles.
The --data parameter is used to tell what to delete.

  • searchindex
  • resources
  • all (searchindex + resources)
  • search - be careful - in most cases you don't want this
  • full (searchindex + resources + search) - be careful - in most cases you don't want this

Example:

cli.php --action flush --data all

Cronjob: reindex all

It is helpful to regenerate the index via a cronjob. For this there is a script that reindexes all data of all profiles you have already created. Don't fiddle with the parameters above :-) ... use the script in the ./cronscripts/ directory.

Show parameters

You can run reindex_all_profiles.php -h or --help to get a list of supported parameters.


===== AhCrawler :: Cronjob - reindex all =====

HELP:
CLI reindexer tool for a cronjob.
It flushes all indexed data of all profiles and then reindexes them.

PARAMETERS:
  -u
  --update (without value)
    update only
    Do not flush and reindex all - only update (=rescan errors and missed items)

  -p
  --profile [value] (value required)
    profile id
    Set a profile id. Do not handle all profiles - just a single one. default: all profiles
    If a value is given then it will be checked against regex /^[0-9]*$/

  -h
  --help (without value)
    show this help

Add cronjob

As an example to reindex every night at 2:15 AM add a crontab entry

15 2 * * *     php [path]/cronscripts/reindex_all_profiles.php 2>/dev/null

Starting the script without parameters once per day should be the most common setup for self-managed servers.
Add the redirection 2>/dev/null to prevent getting emails. You get the log output of the spider process in the backend.


Maybe you want to have a look at my Cronwrapper (on GitHub) ... a daily job has a ttl of 1440 minutes:

15 2 * * * /usr/local/bin/cronwrapper.sh 1440 "php [path]/cronscripts/reindex_all_profiles.php"

Hints for usage on a shared hoster

A limitation on a shared hoster can be the execution timeout.
You can try to run an initial process for a full reindex (the same example as above) ...

15 2     * * *   php [path]/cronscripts/reindex_all_profiles.php          2>/dev/null

and additionally add jobs with --update to finish the crawling of missed items:

15 4,6,8 * * *   php [path]/cronscripts/reindex_all_profiles.php --update 2>/dev/null


If you still get timeouts run the full index and update jobs profile by profile as single tasks.

As an example for two profiles:

15 2     * * *   php [path]/cronscripts/reindex_all_profiles.php --profile 1           2>/dev/null
15 4,6,8 * * *   php [path]/cronscripts/reindex_all_profiles.php --profile 1 --update  2>/dev/null

For your website with profile 2, add its own indexing commands in a separate timeslot:

15 3     * * *   php [path]/cronscripts/reindex_all_profiles.php --profile 2           2>/dev/null
15 5,7,9 * * *   php [path]/cronscripts/reindex_all_profiles.php --profile 2 --update  2>/dev/null

Cronjob: updater

A software update is visible in the backend and can be done interactively. Alternatively, you can install a daily or weekly cronjob with the updater script.

The updater checks if a newer version exists. If so, it will be installed. The script supports command line parameters to check the version only, without installing a newer version (parameter -c), or to force an installation (of the newer or the current version).

The script supports exitcodes if you wish to embed it in other scripts.

Show parameters

You can run updater.php -h or --help to get a list of supported parameters.


===== AhCrawler :: Updater =====

HELP:
CLI updater tool. It checks if a newer version exists and - if so - installs the update.

PARAMETERS:
  -c
  --check (without value)
    Check only.
    If a newer version would exist, you get just a message and exitcode 1 (without installing the update).

  -f
  --force (without value)
    force installation
    If no newer version exists it reinstalls the current version.

  -h
  --help (without value)
    show this help

Add cronjob

As an example to install a software update every night at 5:45 AM add a crontab entry

45 5 * * *     php [path]/cronscripts/updater.php 2>/dev/null

Starting the script without parameters once per day should be the most common setup for self-managed servers.
Add the redirection 2>/dev/null to prevent getting emails. You get the log output of the update process in the backend.


Maybe you want to have a look at my Cronwrapper (on GitHub) ... a daily job has a ttl of 1440 minutes:

45 5 * * * /usr/local/bin/cronwrapper.sh 1440 "php [path]/cronscripts/updater.php"

Exitcodes

The updater script supports exitcodes to be embedded in other scripts.
Especially the parameter "--check" (to check for an update without installing it) can be used to trigger an alert or a notification.

exitcode  Comment
0         Execution was successful. This means either:
            • no newer version exists, or
            • a newer version existed and its installation was successful.
1         A newer version exists.
          This exitcode can occur with parameter "--check" only.
2         A newer version exists, but the download failed.
3         A newer version exists and the download was successful, but the installation failed.
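The exitcodes above can be handled in a wrapper script. A minimal sketch: the real call would be "php [path]/cronscripts/updater.php --check", but here a stub function stands in for it (returning exitcode 1, "a newer version exists") so the logic is runnable on its own:

```shell
#!/bin/sh
# Hypothetical sketch: react to the updater's exitcodes.
# check_update is a stub - replace it with the real call:
#   php [path]/cronscripts/updater.php --check
check_update() { return 1; }   # stub: pretend a newer version exists

# Map an exitcode to a human-readable message (per the table above)
describe_rc() {
  case "$1" in
    0) echo "up to date (or update installed)" ;;
    1) echo "newer version available" ;;
    2) echo "newer version available, but download failed" ;;
    3) echo "download ok, but installation failed" ;;
  esac
}

RC=0
check_update || RC=$?
describe_rc "$RC"   # here you could send a mail or trigger monitoring
```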






Copyright © 2015-2021 Axel Hahn
project page: GitHub (en)
Axels Webseite (de)