HTML::TagReader
---------------

Summary:
TagReader is a perl extension module which allows you to read html/xml
files by tag. That is: in the same way as you can read text files by line
with "while(<>)", you use TagReader::getbytoken to read a file by tag.
You will find a complete description of HTML::TagReader further down.

Installation:
To install run:

  perl Makefile.PL
  make
  make test
  make install

-------------------  -------------------  -------------------
Non-standard Installation:
If you want to install TagReader and the application programs below
/usr/local (/usr/local/bin, /usr/local/lib/5.6.0/i386-linux,
/usr/local/lib/site_perl/5.6.0/i386-linux/auto/HTML/TagReader etc...)
then run:

  perl Makefile.PL PREFIX=/usr/local
  make
  make test
  make install

To install only the perl module in a different location (but e.g. the
man pages in the standard location) use:

  perl Makefile.PL LIB=/you/new/libpath
  make test
  make install

-------------------  -------------------  -------------------
The subdirectory bin contains the following applications of TagReader.
All of them start with the prefix "tr_":

  tr_blck      -- check for broken relative links in html pages
  tr_llnk      -- list links in html files
  tr_xlnk      -- expand links on directories
  tr_mvlnk     -- modify tags in html files with perl commands
  tr_staticssi -- expand SSI directives #include virtual and #exec cmd
  tr_httpcheck -- check if a particular web page exists

tr_httpcheck does not use the TagReader module directly but may be used
as a post-processor for tr_blck.

If you are interested in a link checker that checks links only via the
web server then this is not the right program for you. Other programs
such as
  http://linkchecker.sourceforge.net/
  http://www.linklint.org/
  http://linkchecker.stacken.kth.se/ (web page where you can enter a url
  to check)
  http://htcheck.sourceforge.net/
  http://www.jmarshall.com/tools/cl/
can be used if you want to check your web pages only remotely via a web
server.
Note that the primary design goal of TagReader is to provide a fast way
of reading/processing html files.

These application programs will normally be installed to /usr/bin/

-------------------  -------------------  -------------------
Author: guido(at)linuxfocus.org
Homepage: http://linuxfocus.org/~guido/
Homepage on CPAN: http://cpan.org/authors/id/G/GU/GUS/

Copyright: This program is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.

Installation requirements: Perl and a C compiler (e.g. gcc)

-------------------  -------------------  -------------------
NAME
    HTML::TagReader - Perl extension module for processing html/sgml/xml
    files by tag.

SYNOPSIS
      use HTML::TagReader;
      # open the file and get an obj-ref:
      my $p=new HTML::TagReader "filename";
      # set to zero or undef to omit warnings about html errors:
      my $showerrors=1;

      # get only the tags:
      my $tag = $p->gettag($showerrors);
      # or
      my ($tag,$linenumber) = $p->gettag($showerrors);

      # get the entire file split into tags and text parts:
      my $tag = $p->getbytoken($showerrors);
      # or
      my ($tag,$tagtype,$linenumber) = $p->getbytoken($showerrors);

DESCRIPTION
    The module implements a fast and small object oriented way of
    processing any kind of html/sgml/xml files by tag.

    Calling gettag() or getbytoken() in a loop is similar to while(<>),
    but instead of lines you read tags (gettag) or tags and text
    (getbytoken).

    Here is a program that lists all tags containing href in a html file
    together with their line numbers:

        use HTML::TagReader;
        my $p=new HTML::TagReader "file.html";
        my @tag;
        while(@tag = $p->gettag(1)){
            if ($tag[0]=~/ href ?=/i){
                # remove optional spaces around the equal sign:
                $tag[0]=~s/ ?= ?/=/g;
                print "line: $tag[1]: $tag[0]\n";
            }
        }

    Here is a program that will read a html file tag-wise:

        use HTML::TagReader;
        my $p=new HTML::TagReader "file.html";
        my @tag;
        while(@tag = $p->getbytoken(1)){
            if ($tag[1] eq ""){
                print "line: $tag[2]: not a tag (some text), \"$tag[0]\"\n\n";
            }else{
                print "line: $tag[2]: is a tag, $tag[0]\n\n";
            }
        }

    new HTML::TagReader $file;
        Returns a reference to a TagReader object. This reference can be
        used with gettag() or getbytoken() to read the next tag.

    gettag($showerrors);
        Returns in an array context the tag and its line number. In a
        scalar context just the next tag. An empty string or an empty
        array is returned if the file contains no further tags. html/xml
        comments and any tags inside the comments are ignored. The
        returned tag string has all white space (tab, newline...) reduced
        to a single space; otherwise upper and lower case, quotes etc.
        are as in the original file. The line numbers are those where
        the tag starts.

        You must provide 0 (or undef) or 1 as an argument to gettag. If 0
        is provided then gettag will not print any errors if it finds a
        syntax error in the html/sgml/xml code.

        Currently only the following warning messages are implemented to
        warn about possible html syntax errors:

        - A starting '<' was found but no closing '>' after 300
          characters
        - A single '<' was found which was not followed by [!/a-zA-Z].
          Such a '<' should be written as &lt;
        - A single '>' was found outside a tag.

    getbytoken($showerrors);
        Returns in an array context the tag, the tagtype (a, br, img,...)
        and the line number. In a scalar context just the next tag. An
        empty string or an empty array is returned if the file contains
        no further tags. getbytoken() should be used to process a html
        file and possibly modify tags.
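As a rough sketch of that use of getbytoken(), the following filter copies html files to standard output and changes only the <a ...> tags. The helper fix_href and the ".htm" to ".html" rewrite are made up for this illustration and are not part of HTML::TagReader. The sketch also assumes that printing the unmodified tokens reproduces the rest of the file, as the description of getbytoken() suggests.

```perl
use strict;
use warnings;

# Illustrative helper (an assumption, not part of HTML::TagReader):
# rewrite double-quoted href targets ending in ".htm" to ".html".
sub fix_href {
    my ($tag) = @_;
    $tag =~ s/(\bhref\s*=\s*"[^"]*\.htm)"/${1}l"/gi;
    return $tag;
}

# Copy one file to stdout, token by token, editing only <a ...> tags.
sub filter_file {
    my ($file) = @_;
    require HTML::TagReader;    # loaded here so fix_href can be tried alone
    my $p = HTML::TagReader->new($file);
    # 0 = do not print warnings about html syntax errors
    while (my @tok = $p->getbytoken(0)) {
        my ($text, $tagtype) = @tok;
        # tagtype is "" for plain text and always lower case for tags
        $text = fix_href($text) if $tagtype eq 'a';
        print $text;
    }
}

filter_file($_) for @ARGV;
```

If the script is saved under a name such as fixlinks.pl (a hypothetical name), it could be run as "perl fixlinks.pl file.html > file.new".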
        As opposed to gettag(), getbytoken() does not remove newlines or
        spaces from the data. tagtype is always lower case. The tagtype
        is the string starting the tag such as "a" in <a href="..."> or
        "!--" in <!-- comment -->. tagtype is empty if this is not a tag
        (normal text or newline).

        You must provide 0 (or undef) or 1 as an argument to getbytoken.
        If 0 is provided then getbytoken will not print any errors if it
        finds a syntax error in the html/sgml/xml code.

        Currently only the following warning messages are implemented to
        warn about possible html syntax errors:

        - A starting '<' was found but no closing '>' after 300
          characters
        - A single '<' was found which was not followed by [!/a-zA-Z].
          Such a '<' should be written as &lt;
        - A single '>' was found outside a tag.

    Limitations
        There is no limit on the size of the file. If you need a more
        sophisticated interface you might want to take a look at
        HTML::Parser. HTML::TagReader is fast, generic and
        straightforward to use.

COPYRIGHT
    Copyright (c) Guido Socher [guido(at)linuxfocus.org]

    This program is free software; you can redistribute it and/or modify
    it under the same terms as Perl itself.

SEE ALSO
    homepage of this program: http://linuxfocus.org/~guido/ or
    http://cpan.org/authors/id/G/GU/GUS/

    perl(1) HTML::Parser(3)