Introduction

This page describes the views and interactions in the backend.
You reach it by appending the path /backend/ to the url, e.g. https://www.example.com/ahcrawler/backend/


Login

If you have enabled the login with a single user, then every page request is protected by the application and you will be prompted to log in.

ahcrawler :: login
Enter user and password.
Remarks:

  • You can also disable the user and password in the settings - then no login prompt will appear.
  • To give access to several users with their own passwords, you need to protect the backend on webserver level with basic authentication against a .htuser file or a database.

Page elements

Page

ahcrawler :: page elements
Elements of the backend page.
If you follow the numbers:

  1. Logo
    The logo at the top left also shows the version. A click on it always brings you back to the starting page. If an update is available, it will be shown here.

  2. Navigation
    On the left is the navigation menu. The current page or subpage is highlighted for better orientation.

  3. Page title
    At the top is the name of the current page.
    The box below it shows a hint text for the current page.

  4. Logoff
    At the top right is a logoff button to leave the application.

  5. Page content

Tabs

On all project-related pages you will see an additional project selection bar.

ahcrawler :: page elements
Project selection with horizontal tabs.
If you follow the numbers:

  1. Up icon
    Go one level up.

  2. One project per tab
    The information on the page relates to the selected project. Clicking another tab switches the project. The selected project is stored in a cookie, so if you switch to another page (e.g. below Analysis) the last selected project stays active.

  3. Plus icon
    In the project settings you have a plus icon to add a new project.

Tiles

Tiles are shown at the top of a page or at the beginning of a section to give a short summary of important values.

Tiles can be clickable; a click jumps to further details.
Their background follows a color code.

ahcrawler :: page elements
Clickable and non-clickable tiles and their color codes.

Pie charts

Pie charts are used frequently. They visualize how counts relate to each other.
You can click a color code in the legend to show/ hide this value in the chart.

ahcrawler :: page elements
A pie chart with a clickable legend to show/ hide values.

Tables

Most tables can be sorted and filtered.

ahcrawler :: page elements
Tables are sortable (even by multiple columns) and can be filtered.
(1)
Each table has a default sort order; the sorted column is highlighted.
You can sort the table by any column by clicking its name in the table head with the left mouse button. Clicking again reverses the order.
Multi-column sorting is available too: hold the SHIFT key while clicking in the table head.

(2)
Set the number of visible items per table.

(3)
Typing something into the text filter reduces the table to the matching rows.

(4)
Status bar of the visible items on the page. It shows how many rows were filtered by a search text.

(5)
Pagination.

Category: Start

The tiles on top show the totals of the counters of all projects.

The table contains the project specific counters. See the legend below the image.

ahcrawler :: homepage
The start page shows an overview of all your projects and counters for currently aggregated items.

  • Profiles
    Select the profile to switch to.

  • Overview
    The section shows a short overview of the selected profile.
    • Name and description
    • collected items in the database

      • Searchindex
        It contains the number of single (html) pages that were spidered as full text.
        The view button brings you to Searchindex -> Status to see / search / analyze the currently spidered content.

      • Ressources
        It contains the number of all pages, linked media, scripts, css and external links.
        The view button brings you to Analysis -> Ressources to analyze the items of the selected project. You can filter by http status code / MIME type and other criteria.

      • Search terms
        It contains the number of user searches on your website.
        The view button brings you to Searchindex -> Search terms to view the search statistics of the selected project. You need to integrate the website search into your website to get a counter here.
    • starting url(s)
    • The edit button switches to the profile to edit its parameters.

  • Hints for improvement
    Here found errors and warnings are shown. You can see their details by clicking the button on the left.

Page: Status

Status overview

The tiles on top give some basic information.

indexed
The count of indexed web pages.

in the last 24 hours
The count of indexed web pages that were updated in the last 24 h.

Last update
It shows the date and time when the last page was indexed.

oldest data in index
It shows the date and time of the oldest pages. This is just a small check. If you update the complete search index once per week, then the oldest element should never be older than 7 days.

Newest urls in the index

The table shows the 5 newest pages in the search index.
This is a quick check for you whether the indexer is running.

Oldest urls in the index

The table shows the 5 oldest pages in the search index.
This is another check of the indexer for you. If you update your index once per week and your data is older than 7 days, it is a sign that the indexer did not run.

Urls in the search index ([N])

The table shows all pages in the search index.

Search test

Here you can search in the created search index to find out the behaviour of a search placed on your website. The search terms entered here in the backend won't be stored (tracked) in the database.

The search form in the backend automatically adds the search options for subfolders and languages from the profile settings of the selected project.

With this form you can simulate a search in your frontend. You get a result page with some raw data from the database and the details of the ranking calculation that defines the position in the search results.

Page: Search terms

You get statistics of the search terms entered by the visitors of your website. For this you need to add a search form to your website, and at least one search request must have been made there.

Last [N] search requests

A table shows you the last N search terms entered by the visitors of your website.
With the dropdown you can change the count of visible entries.

For each of the last search terms you see the following metadata:

  • Time: the timestamp of the search
  • Search term: what the visitor searched for
  • Search set: was it a limited search within a subdirectory?
  • Results: the number of results that were offered
  • IP: the IP address of the visitor
  • Browser: the user agent of the visitor's web browser
  • Referrer: the website where the search term was entered
  • [Search] button: repeat the same search as your visitor to see his/ her results

Top [N] search terms in a time range

Here you get a list of the top N search terms of a selected time range.
With dropdowns you can change the count of visible results and the time range. The selectable time ranges grow over time as more data is collected.
Based on your selected range you get the from and to timestamps of the first and last search in that range.

You get a table of the top search terms with their counts and numbers of results.
A graph visualizes the quantities in relation to each other.

Page: Profiles

General data

Short title
A short name for the crawled website. It is used e.g. as title text in the tabs on top of the page.

Description of the project
Set a description of the project. It is visible on the starting page.

Screenshot or profile image
You can add a single image for the profile.

There are 2 possibilities to add an image:

(1) paste
You can copy a new image / screenshot into your clipboard, click into the dashed section and paste it. After pasting an image you will see a smaller preview and the image data size.
This method is useful if you have a screenshot tool and want to create a cropped image quickly without saving it as a file first.

(2) upload
Select an image file (JPEG or PNG) from your local system. The classic way.

When you save the profile, the image will be scaled down to max. 600 px in height and width; smaller images keep their size.

Search index

The spider respects the common rules for crawlers, like robots.txt, the X-Robots-Tag header, the robots meta tag in the html head and rel attributes in links.
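
For orientation, these are the standard mechanisms mentioned above (generic examples, not output of ahCrawler):

# robots.txt: disallow a directory for all crawlers
User-agent: *
Disallow: /internal/

<!-- robots meta tag in the html head -->
<meta name="robots" content="noindex,nofollow">

<!-- rel attribute in a link -->
<a href="https://www.example.com/page" rel="nofollow">a link that spiders should not follow</a>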


List of start urls for the crawler
Enter one or more urls (one per line) here. These will be the starting points to scan the website and generate the search index.
The spider stays on the given hostname(s) and marks them as "internal". All elements like css or javascript files or links that do not match any hostname of the starting urls are "external" resources.
In most cases one url is enough because the spider follows all allowed links. You can add more urls if you have several tools on your domain that are not linked, or if you want to merge several domains into one search index - for example a website www.example.com and a blog on blog.example.com.
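
An example that merges the website and the blog mentioned above into one search index (hypothetical urls):

https://www.example.com/
https://blog.example.com/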

List of regex that must match the full url
You can place additional regex rules to describe what a url must match to become part of the search index.
In most cases just leave it empty = all urls are accepted.
If you have just one domain in the list of starting urls, you can use the next option, which is applied to the path only.
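
A hypothetical example that limits the search index to a single subdomain and the https protocol:

^https://blog\.example\.com/.*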

List of regex that need to match to the path of a url
This option is similar to the last one, but the list of regex rules is matched against the path only.
In most cases just leave it empty = all urls are accepted.
Example:
You have several subdirs in the webroot but want to have just a few of them in the index:

^/blog/.*
^/docs/.*
^/pages/.*

List of exclude regex for the path applied after the include regex above
Here you can add more regex lines for fine tuning. This is a list of elements to exclude.
Remark:
The spider cannot detect infinite loops, like calendars with browsing buttons. Add their paths here.
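
A hypothetical exclude list that skips a calendar (to avoid an endless browsing loop) and a temporary directory:

^/calendar/.*
^/tmp/.*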

max. depth of directories for the scan
You can limit the depth of the path. The value is an integer.

max. count of urls to scan
Before scanning a larger website you can make a test with a limited number of pages by entering a low number here, e.g. "3".
Enter "0" (zero) for no limit.

optional: user and password
The main use case is the scan of public websites. Additionally, basic authentication is supported (as the only authentication method). To use a user and password, separate them with ":":
myuser:secretpassword


The next variables override the defaults that apply to all profiles (see program settings).


search index scan - simultanous Requests (overrides the default of [N])
Here you can override the number of allowed parallel GET requests while spidering.
The current global value is given in the label text and in the placeholder text.
Leave the field empty if you don't want to override the default.

Content to remove from searchindex (1 line per regex; not casesensitive)
Here you can override the default rules "Content to remove from searchindex".
Leave the field empty if you don't want to override the default.

Search frontend

These options are only required if you want to add a search form for the website search to your website.

ahcrawler :: search form
example search form
Search areas of the website (in JSON syntax)
You can define the options for the areas to search in.
The JSON keys contain the visible text; the values are the subdirectory patterns to search in.
{
  "everywhere": "%",
  "... Blog": "\/blog\/",
  "... Photos": "\/photos\/",
  "... Docs": "\/docs\/"
}

Items for language filter (one per line)
You can add an option field to search documents in a given language. The language of a document is detected from the lang attribute of the html tag. Language-specific parts inside a page are not detected.
Line by line you can add values from ISO 639 / the IANA registry (mostly 2-letter lowercase codes).
The option to search in all languages will be added automatically.
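
Example for a website with German and English documents:

de
en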

Ressources scan and analysis

Ressources scan - simultanous Requests (overrides the default of [N])
Here you can override the number of allowed parallel HEAD requests while spidering.
The current global value is given in the label text and in the placeholder text.
Leave the field empty if you don't want to override the default.

Deny list
All urls in links, images, ... that match one of these texts will be excluded. There is one search text as regex per line.
Hint:
The easiest way to add a new - and useful - entry is the button in the report view (see page Link checker). The list here in the profile settings page is good for deleting entries or reordering them.
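
A hypothetical deny list that ignores an external ad server and a temporary directory of the own website:

^https://ads\.example\.com/.*$
^https://www\.example\.com/tmp/.*$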

Category: Analysis

ahcrawler :: analysis
Overview over analysis topics

Page: SSL Check

SSL-Certificate

When opening this page, the check of the certificate is performed (live check).
If your page uses plain http, you get a red box stating that no encryption is in use.

For SSL encrypted websites the certificate is checked. The result is green if your certificate is fine.
It shows a warning if the certificate is valid for 30 days or less.
It shows an error if the certificate has expired or the domain name is not included in the DNS names.

ahcrawler :: SSL check - certificate info
SSL check - certificate infos.

The table below shows basic information about the certificate.

  • Common Name
    Default domain for the certificate.
    Remark: a certificate can contain other valid domain names. See DNS names below.

  • Type of certificate
    It can have the value "Business SSL" or "Extended Validation".
    "Extended Validation" should be used for websites with financial transactions (like shops, banks). With this type of certificate the owner of a domain is part of the certificate meta data and can be verified.

  • Issuer
    Company that issued the certificate.

  • CA
    Name of the root certificate (of the issuer) that you must trust in order to trust the domain certificate too.

  • DNS names
    List of DNS names, i.e. the (sub)domains for which the certificate is valid too.
    This is optional. A certificate can be valid for a single domain ... or many ... or all subdomains (wildcard certificate).

  • valid from / to
    Time range when the certificate is valid.

  • still valid (in days)
    The time left until the certificate expires.
    If it is less than 30 days, the status changes to a warning.

Raw data

Click the button to show/ hide readable details in JSON syntax. It's quite plain.

Check non https ressources

If your website uses SSL, you get an overview of all embedded elements that are not SSL encrypted. If your website runs on https, browsers can show a warning if you embed unencrypted resources.

By default you get a list of embedded non-SSL elements in your website; these could be the reason that a browser hides them or warns because of mixed content.

Did you ever try to navigate through all your web pages to find where a browser warns about mixed content? Here you get all http-only resources. If there are some, click one to see where it is used.
With a click on the other tile you get all links that still use http (and not https).
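
For illustration, a hypothetical snippet that triggers a mixed content warning on a page delivered via https:

<!-- page delivered via https://www.example.com/ -->
<img src="http://www.example.com/images/logo.png" alt="logo loaded without encryption">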

ahcrawler :: SSL check - unencrypted ressources
SSL check - unencrypted resources. The green tile on the right shows that no unencrypted resource was found. But there are 27 links with http.

Page: Http Header check

If an http(s) request is sent to a server, it responds with header information (http response header) and the data of a web page or file.
This page gives you an analysis of the header data, which consists of a list of variables and values ... line by line. The http protocol has a set of defined (known) variables. In general it is not easy to understand them or to see what information is missing.
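
For orientation, a shortened http response header with hypothetical values looks like this:

HTTP/1.1 200 OK
Date: Mon, 01 Jun 2020 12:00:00 GMT
Content-Type: text/html; charset=utf-8
Content-Encoding: gzip
Cache-Control: max-age=3600
Server: Apache
X-Powered-By: PHP/7.4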

ahcrawler :: Http response header
Http response header: tiles for a quick status overview. In a table you see colored values.

Http response header

On top of the page is a bar with tiles that give a first overview.

  • total - the count of header information
  • valid variables - the count of header data that match defined keys
  • security headers - the count of security headers. If there is none the tile changes to a warning color.
  • caching information - the count of caching information (it includes settings for no caching)
  • compression - shows if a compression was set. If there is none the tile changes to a warning color.
  • unknown variables - count of variables in the response that do not match the http standard.
  • unwanted data - count of header data that present more internal information than needed.

Below the tiles there is a link Http header (plain) that offers the http response header as plain text. Click it to open or close it.

Then follows a table with all http response data. As a visual help the header items are colored and use icons. Here you can see which variables are valid or invalid.
Additionally there is a check for the availability of

  • compression information - violet
  • caching information - blue
  • unknown/ unwanted information - yellow or orange
  • security headers - green

Warnings for Http headers

In this section you get details about header variables you should verify/ update. You get a tile in a warning color with the found variable and its value followed by a short description text and the source line in the http response header including a line number in brackets.

  • unknown variables
    Check the variables that are unknown in the http standard. In most cases these are debugging information or invalid header data.

  • unwanted data
    You should not show too many details of your system in the http header, even though sniffer tools can analyze several details anyway. You should remove these variables or at least remove the version details in their values.

  • non standard headers
    Header entries are detected that are quite common - but not part of the http standard.

Check of existing security headers

Security headers are handled by modern browsers and increase the security of your application that runs in the browser. It is strongly recommended to use them to mitigate XSS or script injection.
In this section all security headers are shown. The tiles are green if a header was found (including its value) ... or in a warning color if not.
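
For orientation, a few common security headers with typical example values (not a recommendation tailored to your site):

Strict-Transport-Security: max-age=31536000
X-Content-Type-Options: nosniff
X-Frame-Options: DENY
Content-Security-Policy: default-src 'self'
Referrer-Policy: no-referrer-when-downgrade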


Page: Cookies

Saved Cookies

While crawling your website and following its links, cookies can be saved - just as in the browser of your visitors.
Remark: cookies that are set by javascript are not in this list.
The tile shows the total count of all cookies shown in the table below. The table is read-only and for information only. You see all domains that set a cookie, together with its values.

Delete all cookies

You can delete all cookies. After deletion, have a look here again after the next crawling process.

Page: HTML checks

Overview

The html checks detect several properties in the html content of all your pages.

Various values are relevant for SEO. This information is based on simple counts, not on content quality, but it gives a first orientation.
The limit values can be refined in the program settings.

The tiles show ...

  • indexed
    Total count of fully indexed html pages. It is the content in the search index.

  • crawl errors
    Count of crawler errors. If it is not zero, you know that the search index is not complete.

  • too short titles
    Count of too short title meta tag values, based on the given count of chars.

  • too short descriptions
    Count of too short description meta tag values, based on the given count of chars.

  • too short keywords
    Count of too short keyword meta tag values, based on the given count of chars.

  • too long loading pages
    Count of pages that take longer to load than a given limit.
    Remark: slow pages are rated worse by search engines.

  • large pages
    Count of pages that are larger than a given limit.
    Remark: this is just information and has no direct relevance for SEO.

All clickable tiles have a different look. A click on one jumps to a page section with more details.

Metadata

There are 3 sections for checks of html metadata:

  • Too short titles in meta tags
  • Too short description in meta tags
  • Not enough keywords in the meta tags

(1)
These sections start with a warning counter again. If there are empty values (in addition to the items smaller than the given limit), the count of empty values is shown as an error.

(2)
A pie chart visualizes the relevance of the warning counts in comparison to the count of all pages.

(3)
A table lists all urls with the troublemakers.

Non Metadata

Then there are 2 more sections with checks that are not based on any metadata or content.

  • Long loading pages
  • Large pages

You have a counter and a pie chart again. Additionally there is an extra chart where you see the size (or loading time) of all pages of a website.
The graph visualizes the given maximum in comparison with the existing values of your pages. An additional line shows the average over all pages.
This is a tool to estimate whether there is a general problem, whether there are only a few troublemakers, or whether everything is fine.

ahcrawler :: Example page bars
Example page bars I: page size.
This is nearly fine. There are only a few percent of troublemakers that should be optimized. Most pages are below the limit.
Have a detailed look at the slow/ big pages.
ahcrawler :: Example page bars
Example page bars II: loading time.
This seems to be a general problem of the whole site. There are a few top troublemakers, but still a lot of pages are well above the limit.
You need to find general bottlenecks, enable caching or add more server capacity.

Page: Link checker

Overview

The link checker shows errors and warnings for all your internal + external links, linked media, documents, javascripts and css.
A low count of invalid links is relevant for SEO.

The tiles show ...

  • Age of last scan
    ... to know how old the given information is.

  • Count of ressources
    Total count of analyzed resources: html pages including all media, documents, javascripts, css and linked external pages.

  • Redirects only
    Resources (mostly links) that point to another location.

  • todo
    Count of resources that aren't analyzed yet. It is zero if the crawler has finished. If it is non-zero, the crawler is still in progress or was stopped before it finished.

  • Http errors
    Found errors in links and resources. This group contains unreachable resources with an invalid domain or certificate and links that returned an http error code or an invalid http status code.
    You should check / fix all of them.

  • Http warnings
    Found warnings. This group contains resources that returned an http warning code.
    You should check them after fixing the http errors (see previous tile).

  • Http ok
    Just for comparison: count of resources that are OK.

All clickable tiles have a different look. A click on one jumps to a page section with more details.

Sections by http status

ahcrawler :: Linkchecker
Linkcheck section for http warnings.
Each section has this structure:

(1)
A headline with the total count of found items in brackets.
Remark: this headline appears in the navigation on the left too, so the navigation gives you a short status about errors, warnings and good items as well.

(2)
Introduction text.

(3)
A list of tiles that show the counts grouped by http status code.
The tiles are clickable: you can jump to a list of resources that match this status code, including the analysis of where they are referenced, so that you can fix something.

(4)
The bar chart shows the distribution of status codes and their values.

(5)
The legend shows a description text for each found http status code: what it means and how to handle it as a webmaster.

Matching ressources

If you click on a tile with an http status code, you switch to a list of all resources that match the selected status code. Each resource is separated in its own box.
Each resource is displayed in this way: a colored status code + location (internal|external) + type + url.

ahcrawler :: a single item in the ressource list
A single item in the resource list. Each box shows where a redirect jumps to and where the resource is referenced.
(1)
The resource with a short description: status code + internal|external + type + url

(2) For redirects only: where does it redirect to?
If the next hop is a redirect again, all hops will be shown as indented items. There is no limit on the maximum count of redirects.
If a redirect points back to one of the existing hops, this loop will be detected.
Another hint is shown if the same url just switches the protocol from http to https.

(3) Referenced in ... with the count of references in brackets.
Here is a list of all resources that point to the resource (1).

(4)
Text links with the urls point to a detail page of a resource that shows more metadata.

(5)
Add the selected url to the deny list. You get a dialog to modify the new deny entry.
Adding an entry to the deny list has an impact on the next scan(s) - not on the current session. All urls (of links, images, ...) matching one of the text lines in the deny list (these are regex) will be ignored.
The ^ at the beginning anchors the match at the start of the url.
To allow just the given protocol keep ^http:// or ^https://. To allow both of them use ^http[s]*://.
The $ at the end means that a url must end there. You can remove it to include urls with additional characters too.
The deny list is specific for each website and can be changed in its profile settings.
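
A hypothetical deny entry that ignores a tracking pixel regardless of the protocol:

^http[s]*://tracker\.example\.com/pixel\.gif$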

(6)
The buttons open the url in a new tab.

Page: Ressources

ahcrawler :: Ressources
Ressources start page with filter.
ahcrawler :: Ressources
Filtered resources. Enabled filters are shown in the toolbar. The table is reduced to the matching items.

Category: Tools and information

TODO.

Category: Settings

ahcrawler :: settings
The settings consist of 2 sub pages.

Page: Setup

Here are settings for the program.

Backend

Language in the backend
Switch the language by changing it in the dropdown.

Visibility of menu items
The textarea contains a JSON structure with all pages.
The keys are the page names (if you have a look at the urls: they end like /backend/?page=setup). You can disable a page name to hide it in the navigation on the left. This way you can simplify the interface if you don't need a function. But this is just a visual setting - a hidden page is still reachable.
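
A hypothetical sketch of such a structure, assuming one boolean value per page key (the real keys and value format come from the shipped default in the textarea):

{
  "home": true,
  "searchterms": true,
  "vendor": false
}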

Show debugging information
This option is for development only. It enables the measurement of internal modules.

Authentication
You can protect the access to the backend interface with a single user. It is not a must; you can also protect the access by webserver configuration (basic authentication, IP restriction).
To change the username you need to enter the current password.
To set a new user or password you need to enter the current password and the new one twice.

Crawler defaults

search index - simultanous requests Http GET
The search index scan fetches the html content of a website.
Set how many parallel requests you want to allow.
The minimum number is 2.
The default is 2.
A higher value gives you more speed to finish a website scan. On the other hand it generates more traffic on your webserver. Do not overload your own servers; speed is really not important for a cronjob.
This is the default setting for all projects. You can also override this value in each project.

ressources scan - simultanous requests Http HEAD
The resources scan makes just http HEAD requests, which are lighter and faster because only the header information is transferred, without content.
Set how many parallel requests you want to allow.
The minimum number is 2.
The default is 3.
Do not overload your own servers; speed is really not important for a cronjob.
This is the default setting for all projects. You can also override this value in each project.

Timeout in [s]
Defines the timeout of each http request. This is a global setting for all profiles.

Memory (memory_limit) for CLI
The spider started on the command line keeps a list of all crawled urls in memory. If you have a large website or many links, you could get an out-of-memory error. Here you can set a higher value for the spider if needed.

User Agent of the crawler
Some websites block spiders and crawlers. To improve the scan result of external links you can set a user agent that the spider will use while crawling.
As a helper there is a button to place the user agent of your currently used browser there.

Content to remove from searchindex
The search index process extracts text from the content of each html page.
You can remove parts of each page by defining a set of regex. This way you can remove the words of navigation items or other unwanted parts.
Write one line per regex. The regex are not case sensitive.

<footer[^>]*>.*?</footer>
<header[^>]*>.*?</header>
<nav[^>]*>.*?</nav>
<script[^>]*>.*?</script>
<style[^>]*>.*?</style>

Search Frontend

Settings for fine tuning the search result ranking. They are used if you integrate the search frontend into your website. Depending on the place and kind of a match, different multipliers exist.

Constants for the analysis

You can set default parameters for the website checks in Analysis -> Html checks. With these values you control the sensitivity of the checks there.

Minimum count of chars in the document title
Default: 20

Minimum count of chars in the description
Default: 40

Minimum count of chars in the keywords
Default: 10

Limit to show as large page (byte)
Default: 150000

Limit to show as long loading page (ms)
Default: 500

Public services

Enabled services are visible to a public, anonymous visitor without logging in to the backend.
By default all public pages are disabled. This results in a 403 error if the starting page (one level above the backend) is requested.
If you wish to offer a few of the web tools as a service, set them to true. Remark: you should enable home too, to offer a useful starting page.

Database

You can override the current database settings. The new data will be verified; if it is wrong, the new settings won't be saved.

Page: Vendor libs

To save space, the application does not ship the vendor libs for the frontend. They are included via CDN. But if you want, you can download one or all of them.

In the table you see the used vendor libraries and the place where they are loaded from. Using remote or local libs has no functional impact.
Download the libraries to increase the speed of this web app and/ or to run it without additional internet access.
Libraries marked with "(not used anymore)" were replaced and can be deleted locally.

ahcrawler :: vendor libs
Download or delete vendor libs

Category: About

TODO.







Copyright © 2015-2020 Axel Hahn
project page: GitHub (en)
Axels Webseite (de)