ahCrawler



Description

ahCrawler is a toolset for implementing your own search on your website and for analyzing your web content.
It consists of:

  • crawler (spider) and indexer
  • search for your website
  • search statistics
  • website analyzer (SSL check, HTTP headers, short titles and keywords, link checker, ...)


You need to install it on your own server, so all crawled data stays in your environment. After you fix something on your website, you decide when to rescan it to update the search index or the analyzer data.

License: GNU GPL v3.0

The project is hosted on GitHub: ahcrawler

Download

Last Updates


  • 2020-10-04: v0.137
    • ADDED: html analyzer - scan for AUDIO and VIDEO sources
    • ADDED: html analyzer - add line number in the source code of found items
    • FIX: html analyzer - handle urls starting with ? in html content

  • 2020-09-30: v0.136
    • ADDED: crawlerlog got a paging navi
    • ADDED: crawler follows canonical urls
    • ADDED: show contributors in about page
    • ADDED: pull request of Ozhiganov (Russian language files)

  • 2020-09-23: v0.135
    • ADDED: log cli output of crawling actions in ./data/
    • ADDED: page to view log data of crawling actions
    • FIX: resource scan shows matching regex of the deny list
    • FIX: profile page layout error
    • UPDATE: show hint if a url matches a regex in the deny list
    • UPDATE: show hint if a url switches from https to http

  • 2020-09-11: v0.134
    • ADDED: sslcheck: show certificate chain check
    • UPDATE: rename "ressource" to "resource" in output. IMPORTANT: the cli parameter -d is affected too. --> Check your cronjobs and replace your cli parameter
      from -d ressources to -d resources.
    • UPDATE: profile: file upload got an accept attribute for image files
    • UPDATE: search.class: use param "guilang" for frontend language and "lang" for language in search --> Check integrations/*.php
    • UPDATE: search.class: customize search result output
    • UPDATE: remove unneeded functions

  • 2020-09-05: v0.133
    • ADDED: profile image - has a delete button and file upload too now
    • FIX: index resources with more pages with sqlite engine
    • FIX: searchindex indexer could have false positives in extension detection
    • FIX: cli calls "-a reindex" or "-a index -d all" with sqlite engine locked the database for resources scan
    • UPDATE: cli - show hint if using "-d all"
    • UPDATE: searchindex indexer got a few more extensions
    • UPDATE: profile image uses jpeg instead of png (uses less space)
    • UPDATE: wording changed: blacklist into deny list
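
The v0.134 entry above asks you to replace -d ressources with -d resources in your cronjobs. A minimal sketch of that replacement on the text of a crontab follows. The function name migrate_crontab and the path /path/to/cli.php in the usage note are hypothetical and only serve as illustration; check your real cronjob lines before applying anything like this.

```python
import re

# Matches the old spelling only when it is the value of the -d parameter.
OLD_PARAM = re.compile(r"(-d\s+)ressources\b")


def migrate_crontab(text):
    """Return (new_text, changed_lines) for the content of a crontab.

    migrate_crontab is a hypothetical helper, not part of ahCrawler.
    """
    changed = 0
    result = []
    # keepends=True preserves the original line endings, including a
    # trailing newline at the end of the file.
    for line in text.splitlines(keepends=True):
        new_line = OLD_PARAM.sub(r"\1resources", line)
        if new_line != line:
            changed += 1
        result.append(new_line)
    return "".join(result), changed
```

For example, a line such as `0 2 * * * php /path/to/cli.php -a index -d ressources` would come back ending in `-d resources`, and running the function a second time on the result changes nothing.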

Requirements


  • any webserver with PHP 5.5+ up to PHP 7.4 (PHP 7 is recommended)
  • php-curl (included in php-common)
  • php-pdo and database extension (sqlite or mysql)
  • php-mbstring
  • php-xml
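
The list above can be checked against the output of `php -m` (which lists the loaded modules) and the PHP version string. The following is a minimal sketch: the function names are hypothetical, and the mapping from the package names above to the module names curl, PDO, pdo_sqlite, pdo_mysql, mbstring and xml is an assumption of this example.

```python
import re

# Module names as printed by `php -m` (compared case-insensitively).
# Mapping the package names of the requirements list to these module
# names is an assumption of this sketch.
REQUIRED_MODULES = ["curl", "pdo", "mbstring", "xml"]
DATABASE_MODULES = ["pdo_sqlite", "pdo_mysql"]  # at least one is needed


def parse_modules(php_m_output):
    """Return the lower-cased module names found in `php -m` output."""
    modules = set()
    for line in php_m_output.splitlines():
        name = line.strip()
        # Skip blank lines and section headers such as "[PHP Modules]".
        if name and not name.startswith("["):
            modules.add(name.lower())
    return modules


def version_supported(php_version):
    """True if the version lies between PHP 5.5 and PHP 7.4 (inclusive)."""
    match = re.match(r"(\d+)\.(\d+)", php_version)
    if not match:
        return False
    major_minor = (int(match.group(1)), int(match.group(2)))
    return (5, 5) <= major_minor <= (7, 4)


def missing_requirements(php_version, php_m_output):
    """List the problems found; an empty list means all checks passed."""
    problems = []
    if not version_supported(php_version):
        problems.append("PHP version %s is outside 5.5 to 7.4" % php_version)
    modules = parse_modules(php_m_output)
    for name in REQUIRED_MODULES:
        if name not in modules:
            problems.append("missing module: " + name)
    if not any(name in modules for name in DATABASE_MODULES):
        problems.append("missing database driver: pdo_sqlite or pdo_mysql")
    return problems
```

To use it, pass the text printed by `php -m` together with the version printed by `php -r 'echo PHP_VERSION;'`. Note that the command line PHP and the PHP used by your webserver can load different modules, so a check on the command line is only a first hint.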

Screenshots


Just to get a few first impressions ... :-)

Backend - statistics of the entered search terms


ahcrawler :: backend


Backend - analysis of the website

ahcrawler :: backend








Copyright © 2015-2020 Axel Hahn
project page: GitHub (en)
Axel's website (de)