ahCrawler



Description

ahCrawler is a toolset to implement your own search on your website, plus an analyzer for your web content.
It consists of:

  • crawler (spider) and indexer
  • search for your website
  • search statistics
  • website analyzer (SSL check, HTTP headers, short titles and keywords, link checker, ...)


You install it on your own server, so all crawled data stay in your environment. If you make fixes, the rescan of your website to update the search index or analyzer data is under your control.

GNU GPL v3.0

The project is hosted on GitHub: ahcrawler

Download

Requirements


  • any webserver with PHP 7.3 up to PHP 8.1
  • php-curl (could be included in php-common)
  • php-pdo and database extension (sqlite or mysql)
  • php-mbstring
  • php-xml

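To verify the modules quickly, a small check script helps. This is a minimal sketch (a hypothetical helper, not shipped with ahCrawler) that tests the extensions listed above:

    <?php
    // check-requirements.php - hypothetical helper, not part of ahCrawler
    // It verifies the PHP modules named in the requirements above.
    $required  = ['curl', 'mbstring', 'xml', 'pdo'];
    $databases = ['pdo_sqlite', 'pdo_mysql'];

    foreach ($required as $ext) {
        printf("%-12s %s\n", $ext, extension_loaded($ext) ? 'OK' : 'MISSING');
    }

    // at least one PDO database driver (sqlite or mysql) must be present
    $found = array_filter($databases, 'extension_loaded');
    printf("%-12s %s\n", 'database', $found ? implode(', ', $found) : 'MISSING');

    // ahCrawler needs PHP 7.3 up to 8.1
    printf("%-12s %s\n", 'php', PHP_VERSION);

Run it on the target host with php check-requirements.php; every extension line should show OK.
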
The tool is designed to run on a local machine, a VM or your own server.
It CAN run on shared hosting (I do it myself), but shared hosting can have limitations, and it is not guaranteed that the tool works for every website with every hoster. Keep an eye on the following troublemakers.

  • PHP interpreter: it must be allowed to start the PHP interpreter in a shell. To verify this, connect to your web hosting via SSH and execute php -v. If you get a "not found" message, there may still be a possibility to start PHP in a cronjob.
    If the PHP interpreter is not available, a crawling process cannot be started.
    Ask your provider how to start PHP as a CLI tool. Maybe the provider allows it in another hosting package.
  • Script timeout: the indexing process needs its time; the more single pages and linked elements you have, the longer it takes. The indexing of the HTML content must finish within the script timeout. The second crawler process, which follows linked items (JavaScript, CSS, images, media, links), can be repeated to handle elements that are still missing. See the sketch after this list.
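
If you can start PHP only through the webserver, the script timeout applies to the crawl. Here is a minimal sketch to inspect and, where the hoster permits, raise the limits from within a PHP script; a shared hoster may ignore these overrides:

    <?php
    // show the current limit; 0 means unlimited (the default on the command line)
    echo 'max_execution_time: ' . ini_get('max_execution_time') . PHP_EOL;

    // try to lift the limits for a long crawl - a shared hoster may ignore this
    set_time_limit(0);               // remove the execution time limit
    ini_set('memory_limit', '512M'); // crawling many pages can be memory hungry

If your hoster offers cronjobs, prefer starting the crawler there: PHP on the command line has no execution time limit by default.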

Last Updates


  • 2022-10-23: v0.155
    • FIX: php error in setup on missing defaultUrls
    • UPDATE: deselect OK status buttons on linked resources only
    • UPDATE: backend search additionally can search in html response

  • 2022-10-18: v0.154
    • UPDATE: css of default theme: move all colors into variables to simplify custom skins
    • UPDATE: link details show switch from secure https to insecure http
    • UPDATE: resource details disable http ok links
    • FIX: http header of a failed page in detail page

  • 2022-09-06: v0.153
    • FIX: add support of git repo outside approot
    • FIX: php error if a project was not crawled
    • FIX: relative redirect urls
    • UPDATE: use session_write_close
    • UPDATE: skips by extension
    • UPDATE: reduce memory usage while crawling
    • UPDATE: log viewer shows filtered view as default
    • UPDATE: jquery 3.6.0 --> jquery 3.6.1
    • UPDATE: pure 2.0.6 --> pure 2.1.0
    • UPDATE: chartjs 3.6.0 --> chartjs 3.9.1

  • 2022-03-17: v0.152
    • FIX: repeat search on page search terms - top N
    • FIX: do not abort if creation of a database index fails
    • ADDED: update detects a git instance and starts a git pull or download+unzip

  • 2022-03-07: v0.151
    • FIX: switch back to language en within content
    • UPDATE: dark theme (work in progress)
    • UPDATE: about page shows PHP version and modules
    • UPDATE: PHP 8.1 compatibility

Why

So, why did I write this tool?

The starting point was to write a crawler and website search in PHP as a replacement for Sphider, which was discontinued and had open bugs.

On my domain I don't use just a CMS ... I also have a blog tool and several small applications for photos, demos, ... That's why the internal search of a CMS is not enough: I wanted a search index covering the content of all these different tools.
So I wrote a crawler and an HTML parser to index my pages ...

Once I had a spider ... and an HTML parser ... and all the content information ... the next logical step was to add more checks for my website. And there are a lot of possibilities: link checking, metadata checks, HTTP response headers, SSL information ...

You can install it on your own server or at a shared hoster. All data are under your control. All timers are under your control. If you fix something, just reindex and immediately check whether the error is gone.

Features

  • written in PHP; usable on shared hosters
  • support of multiple websites in a single backend
  • spider to index content: CLI / cronjob
  • verify search content
  • check the search terms entered by your visitors
  • analyze HTTP response headers (see the sketch below)
  • analyze SSL certificates
  • analyze HTML metadata: title, keywords, loading time
  • link checker
  • built-in web updater
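
To illustrate the kind of check behind the HTTP response header analysis, here is a minimal standalone sketch with php-curl. It is illustrative only, not the code ahCrawler actually uses, and the target URL is a placeholder:

    <?php
    // fetch the response headers of a URL with php-curl (illustrative sketch)
    $url = 'https://www.example.com/';   // placeholder target
    $ch  = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true,  // HEAD request: headers only
        CURLOPT_HEADER         => true,  // include headers in the output
        CURLOPT_RETURNTRANSFER => true,  // return instead of printing
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
    ]);
    $headers = curl_exec($ch);
    $status  = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    echo "HTTP status: $status\n\n$headers";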

Screenshots


Just to get a few first impressions ... :-)

Backend - statistics of the entered search terms


[screenshots: ahcrawler :: backend]

Backend - analysis of the website

[screenshots: ahcrawler :: backend]

Copyright © 2015-2022 Axel Hahn
project page: GitHub (en)
Axels Webseite (de)