Extensible Link Checker Library
Version 1.0
(C) 2002 Thomas Weinbrenner
API documentation
This software is hosted on SourceForge.
HTML Link Checking
This is an open source checker for HTML links, written entirely in Java.
Functionality
The link checker works as follows:
- It scans the file system for all files. Of course, you must specify which
directories to search.
- You provide one or more URLs for the link checker to start with.
- The link checker maintains a list of "still to visit URLs" and starts checking
with those in this list.
- The checking process consists of making an HTTP request to the appropriate
URL to retrieve the HTML code, and parsing (or rather "lexing") that code.
During the parsing process, links are reported to a report interface;
valid links to files are recognized and added to the list of "still to visit
URLs".
- Once the list of "still to visit URLs" is empty, it reports all files
which have not been visited yet.
- All the files which have not been visited yet are then parsed as well.
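The steps above can be sketched as a simple worklist loop. This is only an illustration, not the library's code: the class CrawlSketch, its Result type, and the in-memory map that stands in for the file system and the web server are all assumptions made for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative worklist sketch; none of these names come from the library. */
class CrawlSketch {

    /** What a run produces: visited URLs, files never reached, broken links. */
    static final class Result {
        final Set<String> visited = new LinkedHashSet<>();
        final List<String> unvisitedFiles = new ArrayList<>();
        final List<String> brokenLinks = new ArrayList<>();
    }

    /**
     * @param site      hypothetical stand-in for file system plus web server:
     *                  every existing page URL mapped to the links found on it
     * @param startUrls the URLs the check starts with
     */
    static Result check(Map<String, List<String>> site, List<String> startUrls) {
        Result result = new Result();
        Deque<String> toVisit = new ArrayDeque<>(startUrls); // "still to visit URLs"
        drain(site, toVisit, result);

        // Report every known page that was never reached from the start URLs ...
        for (String url : site.keySet()) {
            if (!result.visited.contains(url)) {
                result.unvisitedFiles.add(url);
            }
        }
        // ... and check those pages as well.
        toVisit.addAll(result.unvisitedFiles);
        drain(site, toVisit, result);
        return result;
    }

    /** Processes the queue until it is empty, queueing valid links as they are found. */
    private static void drain(Map<String, List<String>> site,
                              Deque<String> toVisit, Result result) {
        while (!toVisit.isEmpty()) {
            String url = toVisit.poll();
            List<String> links = site.get(url);
            if (links == null) {             // only a start URL can be unknown here
                result.brokenLinks.add("(start) -> " + url);
                continue;
            }
            if (!result.visited.add(url)) {
                continue;                    // already checked
            }
            for (String link : links) {
                if (site.containsKey(link)) {
                    toVisit.add(link);       // valid link: visit it later
                } else {
                    result.brokenLinks.add(url + " -> " + link);
                }
            }
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = new LinkedHashMap<>();
        site.put("index.html", Arrays.asList("a.html", "missing.html"));
        site.put("a.html", Arrays.asList("index.html"));
        site.put("orphan.html", Arrays.asList("a.html"));

        Result r = check(site, Arrays.asList("index.html"));
        System.out.println("visited:   " + r.visited);
        System.out.println("unvisited: " + r.unvisitedFiles);
        System.out.println("broken:    " + r.brokenLinks);
    }
}
```

In the real library the links of a page come from an HTTP request followed by lexing the HTML; the map lookup here only replaces that step so the loop itself can be run in isolation.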
What's the benefit of using a library?
In my experience, you sometimes have to take a closer look at the HTML tags
to make link checking complete. I used the library in an intranet which used
frame technology: the left frame was the navigation, the right frame the
content. Some links had to change both the navigation and the content.
We used a JavaScript function for that, as in
<a href="javascript:changeNavigationAndContent ('thisIsAUrl.html')">Link</a>
I used a web-site-specific
class which parsed the URL out of the JavaScript code and added it to the
link checker's list of "still to visit URLs". Additionally, I calculated the
URL of the navigation from that URL and added it as well.
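A minimal sketch of how such a URL could be pulled out of the href value, using only java.util.regex. The class name JavascriptLinkSketch and the method extractUrl are hypothetical and not part of the library; the function name changeNavigationAndContent is taken from the example above. How the navigation URL is derived from the content URL is site specific and is not shown.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of the library. */
class JavascriptLinkSketch {

    // javascript:changeNavigationAndContent ('some/page.html')
    // Whitespace around the parentheses and a trailing semicolon are tolerated.
    private static final Pattern CALL = Pattern.compile(
            "javascript:\\s*changeNavigationAndContent\\s*\\(\\s*'([^']*)'\\s*\\)\\s*;?\\s*");

    /** Returns the quoted URL argument, or empty if the href is not such a call. */
    static Optional<String> extractUrl(String href) {
        Matcher m = CALL.matcher(href.trim());
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String href = "javascript:changeNavigationAndContent ('thisIsAUrl.html')";
        // A site-specific class would add this URL to the "still to visit URLs".
        System.out.println(extractUrl(href).orElse("(no URL found)"));
    }
}
```

Ordinary hrefs do not match the pattern and yield an empty Optional, so they can fall through to the normal link handling.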
What you must do to get it working
Java Development Environment
You need a Java Development Environment. This is not a complete program,
it's a library.
Web Server
You need a web server to get the link checker running, because the HTML
is retrieved using the HTTP protocol. The advantage is that you can even
link-check server pages like JSP or ASP files without using a JSP or ASP code parser.
If you don't have an HTTP server, you can use, for example, Tomcat or Apache from the
Apache Software Foundation.
Demo application
It's best to start with the provided
Demo application.
This application checks the JDK documentation, which a Java developer usually has installed.
Class path
Include the libraries
linkchecker.jar
antlr.jar
in your classpath.
How to configure
Web-Site specific information
Every Web-Site has specific configurations. These include the start URL(s),
and the location of the files in the file system. The link checker also needs
the mapping between a file name and a URL.
All of this information is encapsulated in the interface
WebSiteInfo. There
is a default implementation in class
DefaultWebSiteInfo
which implements a simple one-to-one mapping between URLs and file names.
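To illustrate what a one-to-one mapping between URLs and file names can look like, here is a small sketch. It is not the code of DefaultWebSiteInfo and does not implement WebSiteInfo; the class OneToOneMappingSketch, its two methods, and the base URL and directory used in main are assumptions made for this example.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical one-to-one mapping; not the library's DefaultWebSiteInfo. */
class OneToOneMappingSketch {
    private final String baseUrl; // always ends with "/"
    private final Path rootDir;   // directory that the base URL points to

    OneToOneMappingSketch(String baseUrl, Path rootDir) {
        this.baseUrl = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
        this.rootDir = rootDir.normalize();
    }

    /** Maps a file below the root directory to its URL (no URL encoding is applied). */
    String fileToUrl(Path file) {
        Path relative = rootDir.relativize(file.normalize());
        StringBuilder url = new StringBuilder(baseUrl);
        // Join the path elements with "/" regardless of the platform separator.
        for (int i = 0; i < relative.getNameCount(); i++) {
            if (i > 0) {
                url.append('/');
            }
            url.append(relative.getName(i));
        }
        return url.toString();
    }

    /** Maps a URL below the base URL back to the corresponding file. */
    Path urlToFile(String url) {
        if (!url.startsWith(baseUrl)) {
            throw new IllegalArgumentException("URL is outside of " + baseUrl + ": " + url);
        }
        return rootDir.resolve(url.substring(baseUrl.length())).normalize();
    }

    public static void main(String[] args) {
        // Assumed example values: a documentation tree served by a local web server.
        OneToOneMappingSketch mapping =
                new OneToOneMappingSketch("http://localhost:8080/docs", Paths.get("site"));
        Path file = Paths.get("site", "api", "index.html");
        String url = mapping.fileToUrl(file);
        System.out.println(url);
        System.out.println(mapping.urlToFile(url));
    }
}
```

A site where URLs and file names do not correspond this directly would need its own implementation of the mapping.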
Reporting
All reporting about links is done through the interface
Report. The class
HtmlReport implements
this interface, saves all information in memory, and after the link checking
process is finished, prepares a report in HTML format.
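The pattern described here (collect link events in memory, render HTML at the end) can be sketched as follows. The interface LinkListenerSketch and the class HtmlTableReportSketch are hypothetical; they do not reproduce the method signatures of Report or HtmlReport, for which the API documentation is the reference.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical callback interface; the real Report interface may look different. */
interface LinkListenerSketch {
    void link(String sourceUrl, String targetUrl, boolean ok);
}

/** Keeps every reported link in memory and renders an HTML table on request. */
class HtmlTableReportSketch implements LinkListenerSketch {

    private static final class Entry {
        final String source;
        final String target;
        final boolean ok;

        Entry(String source, String target, boolean ok) {
            this.source = source;
            this.target = target;
            this.ok = ok;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    @Override
    public void link(String sourceUrl, String targetUrl, boolean ok) {
        entries.add(new Entry(sourceUrl, targetUrl, ok)); // nothing is written yet
    }

    /** Called once after the link checking process has finished. */
    String toHtml() {
        StringBuilder html = new StringBuilder();
        html.append("<table>\n<tr><th>Page</th><th>Link</th><th>Status</th></tr>\n");
        for (Entry e : entries) {
            html.append("<tr><td>").append(escape(e.source))
                .append("</td><td>").append(escape(e.target))
                .append("</td><td>").append(e.ok ? "ok" : "broken")
                .append("</td></tr>\n");
        }
        return html.append("</table>\n").toString();
    }

    /** Escapes the characters that are special in HTML text. */
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        HtmlTableReportSketch report = new HtmlTableReportSketch();
        report.link("index.html", "a.html", true);
        report.link("index.html", "missing.html?x=1&y=2", false);
        System.out.print(report.toHtml());
    }
}
```

Because reporting goes through an interface, a different implementation (for example one that only counts broken links) can be swapped in without touching the checking code.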
Acknowledgements
This link checker would not have been written if I were not curious about
that neat lexer/parser tool called ANTLR. So I wrote an HTML lexer, and the
first use for it was a link checker which I did in fact use in a production
intranet environment to check the links.
See ANTLR by Terence Parr. It's really great.
Licence
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
See the GNU General Public License.