ahCrawler



Description

ahCrawler is a set of tools to implement your own search on your website and to analyze your web content.
It consists of

  • crawler (spider) and indexer
  • search for your website
  • search statistics
  • website analyzer (SSL check, http header, short titles and keywords, linkchecker, ...)


You need to install it on your own server, so all crawled data stays in your environment. After you have made fixes, you decide when to rescan your website to update the search index or the analyzer data.

License: GNU GPL v3.0

The project is hosted on GitHub: ahcrawler

Download

Last Updates


  • 2020-05-29: v0.115
    • FIX: error counter increases on failed resources
    • UPDATE: add dummy http code 1: hostname does not exist in DNS
    • UPDATE: show protocol switch in the opposite view of references too.
  • 2020-05-28: v0.114
    • UPDATE: updater: got 60s timeout to download
    • UPDATE: updater: got a timestamp parameter if switching to startpage
    • UPDATE: charts get back a white border
  • 2020-05-27: v0.113
    • FIX: search - detection on word start
    • UPDATE: show info about running crawler in the footer
    • UPDATE: cleaner locking during crawling and scans
    • UPDATE: smaller font for titles in overview pages
    • UPDATE: start page groups messages by check page
  • 2020-05-23: v0.112
    • ADDED: blacklist - per profile you can add several search texts to ignore links
    • ADDED: timeout for all http requests
    • ADDED: home of project shows favicon
    • FIX: labels for attributes in the profile settings were not unique
    • UPDATE: disabled items got special cursor on hover
    • UPDATE: search index test: reset [X] was fixed
    • UPDATE: about does not show (German) project page anymore
    • UPDATE: Medoo to 1.7.10
  • 2020-05-15: v0.111
    • UPDATE: reorder menu items: website related pages are all in the upper part
    • UPDATE: menu items got logic: can be disabled based on available data
    • UPDATE: installer creates the default config (one manual step less in the initial setup)
    • UPDATE: profiles are ordered alphabetically (before: by id)
  • 2020-05-13: v0.110
    • ADDED: identify redirects that switch the protocol from http to https
    • UPDATE: colors
    • UPDATE: SSL check got a timeout of 2 sec (1 sec before)
    • UPDATE: pure to 2.0.3
  • 2020-05-10: v0.109
    • FIX: htmlchecks page: show short titles/ description/ keywords (it was broken in 0.108)
    • UPDATE: linkchecker page: links to resources contain project id now
    • UPDATE: a tile "100.00%" is shown as "100%"
  • 2020-05-09: v0.108
    • UPDATE: database table was changed
      execute "php bin/cli.php -a flush -d searchindex"
      and then index your projects again
    • UPDATE: show count of words in title, keywords, description
    • UPDATE: remove PHP Deprecated: mb_strrpos() in analyzer.html.class.php (PHP 7.4)
    • FIX: htmlchecks page: show number of pages (it was broken in 0.106)
  • 2020-05-06: v0.107
    • FIX: htmlchecks - missing tables for large/ long loading pages
  • 2020-05-06: v0.106
    • NEW: start page showing a project overview with the errors and warnings that were shown only in subpages before
    • FIX: resources page can show empty mime types
    • UPDATE: pure to 2.0.0
    • UPDATE: datatables to 1.10.20
    • UPDATE: font-awesome to 5.13.0
    • UPDATE: jquery to 3.5.0
  • 2020-04-17: v0.105
    • UPDATE: resize overview tiles
    • UPDATE: software update got back buttons
    • UPDATE: software update in a single step
    • UPDATE: login form
  • 2020-04-15: v0.104
    • UPDATE: settings allow to edit ranking multipliers (hardcoded before)
    • UPDATE: settings got more placeholders
    • UPDATE: settings page hides current database password with a dummy
    • UPDATE: showing the login form sends a 401 status code (instead of 200)
    • UPDATE: selected profile tab will be stored for 8 h (instead of 1 h)
    • UPDATE: ssl check for non https items jumps to the middle of the page (instead of staying on top)
  • 2020-04-13: v0.103
    • UPDATE: langedit saves changes
    • UPDATE: colors (do a Shift + Reload after the update to load the new css styles)
  • 2020-02-23: v0.102
    • UPDATE: fix conditions for PHP 7.4 (means: it shows no warning on PHP 7.4 anymore)
    • UPDATE: print css
    • UPDATE: langedit: add comparison of count of specifiers (means: more checks when editing language files)
  • 2020-01-19: v0.101
    • ADDED: backend: page for bookmarklet (moved from about page)
    • UPDATE: page for lang texts
    • UPDATE: css in overview pages
    • UPDATE: cli class (allow cgi-fcgi as cli too)
    • FIX: search class - remove limit before calculation of ranking
    • FIX: typo in German lang textfile
  • 2020-01-05: v0.100
    • UPDATE: search for % char in text
    • ADDED: backend: page to test search index
  • 2020-01-04: v0.99
    • UPDATE: font-awesome to 5.11.2
    • UPDATE: jquery to 3.4.1
    • UPDATE: Chart.js to 2.9.3
    • UPDATE: medoo to 1.7.8
    • UPDATE: ahcache class
    • UPDATE: cli class
    • FIX: ranking counter in search class: it did not detect a searchterm on text end
    • UPDATE: improve details for ranking in backend searchindex search
    • UPDATE: http response headers - added non-standard headers

Requirements


  • any webserver with PHP 5.5+ up to PHP 7.4 (PHP 7 is recommended)
  • php-curl (included in php-common)
  • php-pdo and database extension (sqlite or mysql)
  • php-mbstring
  • php-xml
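
A quick way to verify the list above on a server is to compare it with the output of `php -m`, which prints the loaded PHP modules one per line. The following is only a minimal sketch, not part of ahCrawler: the helper name `missing_modules` is made up for this example, and it assumes the module names `curl`, `pdo`, `mbstring` and `xml` as they appear in `php -m` (compared case-insensitively). The database driver (`pdo_sqlite` or `pdo_mysql`) is not checked here, because only one of the two is needed.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with ahCrawler):
# reads a module list from stdin, one name per line as printed by `php -m`,
# and prints every module name given as argument that is not in that list.
missing_modules() {
    # lowercase the list so that e.g. "PDO" matches "pdo"
    loaded=$(tr '[:upper:]' '[:lower:]')
    for mod in "$@"; do
        # -x: match the whole line, -F: treat the name as a fixed string
        printf '%s\n' "$loaded" | grep -qxF "$mod" || printf '%s\n' "$mod"
    done
}

# Run the check only if a php binary is available on this machine.
if command -v php >/dev/null 2>&1; then
    php -m | missing_modules curl pdo mbstring xml
fi
```

If the script prints nothing, all four modules are loaded in the PHP command line binary. The PHP used by the webserver can have a different configuration, so this is a first hint rather than a full check.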

Screenshots


Just to get a few first impressions ... :-)

Backend - statistics of the entered search terms


ahcrawler :: backend

Backend - analysis of the website

ahcrawler :: backend








Copyright © 2015-2020 Axel Hahn
project page: GitHub (en)
Axels Webseite (de)