ahCrawler



Description

ahCrawler is a toolset for implementing your own search on your website and for analyzing your web content.
It consists of:

  • crawler (spider) and indexer
  • search for your website
  • search statistics
  • website analyzer (SSL check, HTTP header, short titles and keywords, link checker, ...)


You install it on your own server, so all crawled data stays in your environment. After you fix something on your website, you decide when to rescan it to update the search index and the analyzer data.

License: GNU GPL v3.0

The project is hosted on GitHub: ahcrawler

Download

Requirements


  • any web server with PHP 5.5 up to PHP 8.0 (PHP 7.3+ or PHP 8 is strongly recommended)
  • php-curl (may be included in php-common)
  • php-pdo and database extension (sqlite or mysql)
  • php-mbstring
  • php-xml
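
A quick way to verify the extension list above is to compare it against the output of `php -m`, which prints the loaded modules. The following is only a minimal sketch and not part of ahCrawler; it is written in Python for illustration. The function names are hypothetical, and the mapping from package names (php-curl, php-pdo, ...) to the module names printed by `php -m` is an assumption that may differ between distributions.

```python
import shutil
import subprocess

# Module names as printed by `php -m`. Mapping the package names from the
# requirements list to these module names is an assumption.
REQUIRED = {"curl", "pdo", "mbstring", "xml"}
# At least one PDO database driver is needed (SQLite or MySQL).
PDO_DRIVERS = {"pdo_sqlite", "pdo_mysql"}


def parse_modules(php_m_output):
    """Return the lower-cased module names found in `php -m` output."""
    modules = set()
    for line in php_m_output.splitlines():
        name = line.strip()
        # Skip blank lines and section headers such as "[PHP Modules]".
        if name and not name.startswith("["):
            modules.add(name.lower())
    return modules


def missing_requirements(php_m_output):
    """List the required modules that are absent from the given output."""
    installed = parse_modules(php_m_output)
    missing = sorted(REQUIRED - installed)
    if not (PDO_DRIVERS & installed):
        missing.append("pdo_sqlite or pdo_mysql")
    return missing


if __name__ == "__main__":
    php = shutil.which("php")
    if php is None:
        print("php CLI not found on PATH")
    else:
        result = subprocess.run([php, "-m"], capture_output=True, text=True)
        print(missing_requirements(result.stdout) or "all required modules found")
```

Note that the PHP CLI and the PHP used by the web server can load different configurations, so an empty result here is a hint rather than a guarantee.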

Last Updates


  • 2021-09-14: v0.148
    • ADDED: store counter values
      This is a background mechanism; its output will follow in a later release.
    • UPDATE: upgrade Chart.js from v2 to v3 (using 3.5.1)
    • UPDATE: font-awesome 5.15.4
    • UPDATE: pure 2.0.6
    • FIX: remove encoding br (Brotli) in HTTP request headers
      This fixes crawling of websites that use Brotli compression, e.g. WordPress hostings.
    • FIX: comparison with canonical links
    • PATCH: get redirect url from raw http response header if missed in curl data

  • 2021-05-04: v0.147
    • ADDED: ignore noindex tagging
    • ADDED: ignore nofollow tagging (can be dangerous)
    • FIX: PHP 8 compatibility in get.php (removes warnings in statusbar)
    • FIX: visibility of menu item Start -> Crawler log
    • UPDATE: home page has links to crawler start URLs (before: text only)
    • UPDATE: CSS rules (preparing skin support in next versions)

  • 2021-04-25: v0.146
    • ADDED: settings got an entry for custom HTML code (e.g. to add statistics tracking)

  • 2021-04-24: v0.145
    • ADDED: reindex function on the start page "home"
    • ADDED: detection if only one resource was crawled
    • FIXED: homepage and display items in different constellations
    • FIXED: settings changed username without giving current password
    • UPDATE: move the HTTP version check into the HTTP header check
    • UPDATE: CSS of contextbox

  • 2021-04-14: v0.144
    • ADDED: chart of load time over all pages on start page
    • ADDED: HTTP header check for the HTTP version; you get a warning if it is below HTTP/2
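
The v0.148 patch entry above mentions reading the redirect URL from the raw HTTP response header when curl does not provide it. As an illustration of that idea only, here is a minimal sketch in Python; it is not the project's PHP code, and the function name `redirect_url_from_raw_header` and its parameters are hypothetical. It assumes the raw header block of a single response (status line plus header fields) is available as text.

```python
from typing import Optional
from urllib.parse import urljoin


def redirect_url_from_raw_header(raw_header: str, request_url: str) -> Optional[str]:
    """Return the absolute redirect target from a raw response header, or None."""
    lines = raw_header.replace("\r\n", "\n").split("\n")
    if not lines[0].startswith("HTTP/"):
        return None
    parts = lines[0].split()
    # Only 3xx status codes are redirects.
    if len(parts) < 2 or not parts[1].startswith("3"):
        return None
    for line in lines[1:]:
        if line == "":
            break  # a blank line ends the header block
        name, sep, value = line.partition(":")
        # Header field names are case-insensitive.
        if sep and name.strip().lower() == "location":
            # Location may be relative; resolve it against the request URL.
            return urljoin(request_url, value.strip())
    return None
```

For example, a `301` response with `location: /new/page` requested from `https://example.com/old` resolves to `https://example.com/new/page`, while a `200` response yields `None`.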

Why

So, why did I write this tool?

The starting point was to write a crawler and website search in PHP as a replacement for Sphider, which was discontinued and had open bugs.

On my domain I don't use just a CMS ... I also have a blog tool and several small applications for photos, demos, ... That's why the internal search of a CMS is not enough. I wanted a search index that covers all of my content across the different tools.
So I wrote a crawler and an html parser to index my pages...

Once I had a spider ... and an HTML parser ... and all the content information ... the next logical step was to add more checks for my website. And there are a lot of possibilities, like link checks, metadata checks, HTTP response headers, SSL information ...

You can install it on shared hosting or on your own server. All data and all timers are under your control. If you fix something, just reindex and immediately check whether the error is gone.

Features

  • written in PHP; usable on shared hosting
  • support of multiple websites in a single backend
  • spider to index content: CLI / cronjob
  • verify search content
  • review the search terms entered by your visitors
  • analyze HTTP response header
  • analyze SSL certificate
  • analyze HTML metadata: title, keywords, loading time
  • link checker
  • built-in web updater

Screenshots


Just to get a few first impressions ... :-)

Backend - statistics of the entered search terms


ahcrawler :: backend


Backend - analysis of the website

ahcrawler :: backend








Copyright © 2015-2021 Axel Hahn
project page: GitHub (en)
Axel's website (de)