ahCrawler is a toolset to implement your own search on your website and
an analyzer for your web content.
It consists of
- crawler (spider) and indexer
- search for your website
- search statistics
- website analyzer (SSL check, http headers, short titles and keywords, link checker, ...)
You need to install it on your own server, so all crawled data
stay in your environment. If you make fixes, the rescan of your
website to update the search index or analyzer data is under your control.
License: GNU GPL v 3.0
The project is hosted on GitHub: ahcrawler
- any webserver with PHP 5.5+ up to PHP 8.0 (PHP 7.3+ or PHP 8 is strongly recommended)
- php-curl (could be included in php-common)
- php-pdo and database extension (sqlite or mysql)
- 2021-01-08: v0.141
- UPDATE: cronscript supports update and single profiles
- 2020-12-30: v0.140
- UPDATE: crawling processes
- UPDATE: cli action "update" uses GET requests to handle errors caused by denying http head requests
- FIX: remove a var_dump output in crawling process
- FIX: remove context box in about page
- 2020-12-28: v0.139
- UPDATE: show done urls in percent
- FIX: writing crawling logs is enabled again
- FIX: crawling resources (http HEAD) runs with PHP 8 (no core dump anymore)
- 2020-12-05: v0.138
- FIX: deny list was not applied on 3xx redirects
- FIX: update code for PHP8 compatibility (work in progress)
- UPDATE: CSS colors
- UPDATE lib: Chart.js 2.9.3 -> 2.9.4
- UPDATE lib: datatables 1.10.20 -> 1.10.21
- UPDATE lib: font-awesome 5.13.0 -> 5.15.1
- 2020-10-04: v0.137
- ADDED: html analyzer - scan for AUDIO and VIDEO sources
- ADDED: html analyzer - add line number in the source code of found items
- FIX: html analyzer - handle urls starting with ? in html content
So, why did I write this tool?
The starting point was to write a crawler and website search in PHP as a replacement for Sphider
that was discontinued and had open bugs.
On my domain I don't use just a CMS ... I also have a blog tool and several small applications for
photos, demos, ...
That's why the internal search of a CMS is not enough: I wanted a single search index covering the content of all these different tools.
So I wrote a crawler and an html parser to index my pages...
If I had a spider ... and an html parser ... and all content information ... the next logical
step was to add more checks for my website. And there are many possibilities:
link checking, metadata checks, http response headers, ssl information ...
You can install it on a shared hoster or on your own server. All data are under your control,
and so is the timing: if you fix something, just reindex and immediately check whether
the error is gone.
- written in PHP; usable on shared hosters
- support of multiple websites in a single backend
- spider to index content: CLI / cronjob
- verify search content
- review the search terms entered by your visitors
- analyze http response headers
- analyze ssl certificate
- analyze html metadata: title, keywords, loading time
- link checker
- built-in web updater
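Because the spider runs from the CLI, indexing can be scheduled as a cronjob. The example below is only a sketch: the installation path, the user, and the CLI script name are assumptions here; use the actual location and script of your ahCrawler installation as documented in the project wiki.

```shell
# Hypothetical /etc/cron.d entry (this format includes a user column):
# reindex nightly at 02:30 as the webserver user.
# Replace /var/www/ahcrawler and the script name with your real paths.
30 2 * * * www-data php /var/www/ahcrawler/bin/cli.php
```

Running the indexer as the same user as the webserver avoids file-permission conflicts on the crawler data.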
Just to get a few first impressions ... :-)
Backend - statistics of the entered search terms
Backend - analysis of the website