Introduce check plugins, use Python requests for http/s connections, and some code cleanups and improvements.
wummel committed Feb 28, 2014
1 parent adc17fb commit 7b34be5
Showing 194 changed files with 4,780 additions and 8,866 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -36,3 +36,4 @@ Changelog.linkchecker*
/todo
/alexa*.log
/testresults.txt
+/linkchecker.prof
6 changes: 3 additions & 3 deletions Makefile
@@ -18,11 +18,11 @@ DEBORIGFILE:=$(DEBUILDDIR)/$(LAPPNAME)_$(VERSION).orig.tar.xz
DEBPACKAGEDIR:=$(DEBUILDDIR)/$(APPNAME)-$(VERSION)
FILESCHECK_URL:=http://localhost/~calvin/
SRCDIR:=${HOME}/src
-PY_FILES_DIRS:=linkcheck tests *.py linkchecker linkchecker-nagios linkchecker-gui cgi-bin config doc
+PY_FILES_DIRS:=linkcheck tests *.py linkchecker linkchecker-nagios linkchecker-gui cgi-bin config doc/examples
MYPY_FILES_DIRS:=linkcheck/HtmlParser linkcheck/checker \
linkcheck/cache linkcheck/configuration linkcheck/director \
linkcheck/htmlutil linkcheck/logger linkcheck/network \
-linkcheck/bookmarks \
+linkcheck/bookmarks linkcheck/plugins linkcheck/parser \
linkcheck/gui/__init__.py \
linkcheck/gui/checker.py \
linkcheck/gui/contextmenu.py \
@@ -192,7 +192,7 @@ filescheck: localbuild
done

update-copyright:
-update-copyright --holder="Bastian Kleineidam"
+update-copyright --holder="Bastian Kleineidam" $(PY_FILES_DIRS)

releasecheck: check update-certificates
@if egrep -i "xx\.|xxxx|\.xx" doc/changelog.txt > /dev/null; then \
2 changes: 1 addition & 1 deletion config/create.sql
@@ -17,7 +17,7 @@ create table linksdb (
name varchar(256),
checktime int,
dltime int,
-dlsize int,
+size int,
cached int,
level int not null,
modified varchar(256)
70 changes: 39 additions & 31 deletions config/linkcheckerrc
@@ -131,32 +131,18 @@
#threads=100
# connection timeout in seconds
#timeout=60
# check anchors?
#anchors=0
# Time to wait for checks to finish after the user aborts the first time
# (with Ctrl-C or the abort button).
#aborttimeout=300
# The recursion level determines how many times links inside pages are followed.
#recursionlevel=1
# supply a regular expression for which warnings are printed if found
# in any HTML files.
#warningregex=(Oracle DB Error|Page Not Found|badsite\.example\.com)
# Basic NNTP server. Overrides NNTP_SERVER environment variable.
# warn if size info exceeds given maximum of bytes
#warnsizebytes=2000
#nntpserver=
# check HTML or CSS syntax with the W3C online validator
#checkhtml=1
#checkcss=1
# scan URL content for viruses with ClamAV
#scanvirus=1
# ClamAV config file
#clamavconf=/etc/clamav/clamd.conf
# Send and store cookies
#cookies=1
# parse a cookiefile for initial cookie data
#cookiefile=/path/to/cookies.txt
# User-Agent header string to send to HTTP web servers
# Note that robots.txt are always checked with the original User-Agent.
#useragent=Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
# Pause the given number of seconds between two subsequent connection
# requests to the same host.
#pause=0
# When checking finishes, write a memory dump to a temporary file.
# The memory dump is written both when checking finishes normally
# and when checking gets canceled.
@@ -175,22 +161,16 @@
# Check SSL certificates. Set to an absolute pathname for a custom
# CA cert bundle to use. Set to zero to disable SSL certificate verification.
#sslverify=1
-# Check that SSL certificates are at least the given number of days valid.
-# The number must not be negative.
-# If the number of days is zero a warning is printed only for certificates
-# that are already expired.
-# The default number of days is 14.
-#sslcertwarndays=14
# Stop checking new URLs after the given number of seconds. Same as if the
# user hits Ctrl-C after X seconds.
#maxrunseconds=600
# Maximum number of URLs to check. New URLs will not be queued after the
# given number of URLs is checked.
#maxnumurls=153
-# Maximum number of connections to one single host for different connection types.
-#maxconnectionshttp=10
-#maxconnectionshttps=10
-#maxconnectionsftp=2
+# Maximum number of requests per second to one host.
+#maxrequestspersecond=10
+# Allowed URL schemes as a comma-separated list.
+#allowedschemes=http,https

##################### filtering options ##########################
[filtering]
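The `sslverify` and `timeout` settings in the [checking] section above resemble two keyword arguments that the Python requests library accepts, `verify` and `timeout`. The following is a minimal sketch of such a mapping, assuming the values arrive as strings from the config file. The helper name `requests_kwargs` is hypothetical and is not taken from this commit, which may wire these options differently.

```python
import os.path


def requests_kwargs(sslverify="1", timeout="60"):
    """Translate config strings into keyword arguments for requests.

    Hypothetical helper for illustration only; not LinkChecker's code.
    """
    value = sslverify.strip()
    if value == "0":
        # Zero disables SSL certificate verification.
        verify = False
    elif value == "1":
        # One enables verification with the default CA bundle.
        verify = True
    elif os.path.isabs(value):
        # An absolute pathname selects a custom CA cert bundle.
        verify = value
    else:
        raise ValueError("sslverify must be 0, 1 or an absolute path: %r" % value)
    # The config comment gives the connection timeout in seconds.
    return {"verify": verify, "timeout": float(timeout)}
```

The resulting dict could then be passed along as `requests.get(url, **requests_kwargs("1", "60"))`.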
@@ -211,11 +191,12 @@
# recognized warnings). Add a comma-separated list of warnings here
# that prevent a valid URL from being logged. Note that the warning
# will be logged in invalid URLs.
-#ignorewarnings=url-unicode-domain,anchor-not-found
+#ignorewarnings=url-unicode-domain
# Regular expression to add more URLs recognized as internal links.
# Default is that URLs given on the command line are internal.

#internlinks=^http://www\.example\.net/
+# Check external links
+#checkextern=1


##################### password authentication ##########################
@@ -247,3 +228,30 @@
#loginextrafields=
# name1:value1
# name 2:value 2

############################ Plugins ###################################
#
# uncomment sections to enable plugins

# Check HTML anchors
#[AnchorCheck]

# Add country info to URLs
#[LocationInfo]

# Run W3C syntax checks
#[CssSyntaxCheck]
#[HtmlSyntaxCheck]

# Search for regular expression in page contents
#[RegexCheck]
#warningregex=Oracle Error

# Search for viruses in page contents
#[VirusCheck]
#clamavconf=/etc/clamav/clam.conf

# Check that SSL certificates are at least the given number of days valid.
#[SslCertificateCheck]
#sslcertwarndays=14
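The plugin block above enables a plugin by uncommenting its section header, with optional settings underneath. As a rough illustration of that convention (not LinkChecker's actual loader), the sketch below uses the standard library `configparser` to list the sections that are present and their options. The `CORE_SECTIONS` set and the `enabled_plugins` helper are assumptions made for this example; apart from `filtering`, the core section names are guesses.

```python
import configparser

# Sample config in the style of the plugin block above: AnchorCheck and
# RegexCheck are uncommented (enabled), SslCertificateCheck is left commented.
SAMPLE = """\
[filtering]
#checkextern=1

# Check HTML anchors
[AnchorCheck]

# Search for regular expression in page contents
[RegexCheck]
warningregex=Oracle Error

# Check that SSL certificates are at least the given number of days valid.
#[SslCertificateCheck]
#sslcertwarndays=14
"""

# Assumed names of non-plugin sections; only "filtering" appears in the diff.
CORE_SECTIONS = {"checking", "filtering", "authentication", "output"}


def enabled_plugins(text):
    """Return {section name: options} for every non-core section present."""
    parser = configparser.RawConfigParser()
    parser.read_string(text)
    return {
        name: dict(parser.items(name))
        for name in parser.sections()
        if name not in CORE_SECTIONS
    }


plugins = enabled_plugins(SAMPLE)
print(sorted(plugins))
```

Because `#` starts a comment for `configparser`, a commented-out header such as `#[SslCertificateCheck]` never becomes a section, so the plugin stays disabled.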

31 changes: 31 additions & 0 deletions doc/changelog.txt
@@ -1,3 +1,34 @@
8.7 "" (released xx.xx.2014)

Features:
- checking: Support connection and content check plugins.
- checking: Move lots of custom checks like Antivirus and syntax
checks into plugins (see upgrading.txt for more info).
- checking: Add options to limit the number of requests per second,
allowed URL schemes and maximum file or download size.

Changes:
- checking: Use the Python requests module for HTTP and HTTPS requests.
- logging: Removed download, domains and robots.txt statistics.
- logging: HTML output is now in HTML5.
- checking: Removed 301 warning since 301 redirects are used
a lot without updating the old URL links.
- checking: Disallowed access by robots.txt is an info now, not
a warning. Otherwise it produces a lot of warnings which
is counter-productive.
- checking: Do not check SMTP connections for mailto: URLs anymore.
It resulted in lots of false warnings since spam prevention
usually disallows direct SMTP connections from unrecognized
client IPs.
- checking: Only internal URLs are checked as default. To check
external urls use --check-extern.

Fixes:
- logging: Status was printed every second regardless of the
configured wait time.
- checking: Several speed and memory usage improvements.
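Two of the new options named in the 8.7 entry, the requests-per-second limit and the allowed URL schemes, can be pictured with a small standalone sketch. This is an assumption about how such limits might work, not the code from this commit; `HostRateLimiter` and `scheme_allowed` are hypothetical names. The clock and sleep functions are injectable so the behaviour can be checked without real waiting.

```python
import time
from urllib.parse import urlsplit


class HostRateLimiter:
    """Space out requests so each host sees at most max_per_second of them."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        # Earliest time at which the next request to each host may start.
        self.next_allowed = {}

    def wait(self, host):
        """Block until a request to host is allowed; return the delay used."""
        now = self.clock()
        ready_at = self.next_allowed.get(host, now)
        delay = max(0.0, ready_at - now)
        if delay:
            self.sleep(delay)
        # Reserve the following slot for this host.
        self.next_allowed[host] = max(now, ready_at) + self.min_interval
        return delay


def scheme_allowed(url, allowed_schemes):
    """Return True if the URL's scheme is in the allowed set."""
    return urlsplit(url).scheme.lower() in allowed_schemes
```

With `HostRateLimiter(10)`, a caller would invoke `limiter.wait(host)` before each request, which mirrors the `maxrequestspersecond=10` example value; `scheme_allowed(url, {"http", "https"})` mirrors `allowedschemes=http,https`.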


8.6 "About Time" (released 8.1.2014)

Changes: