HTML::TagReader
---------------
Summary:
TagReader is a perl extension module which allows you to read html/xml
files by tag. That is: in a similar way as you can read textfiles by
line with "while(<>)" you use TagReader::getbytoken to read a file by tag.
You find a complete description of HTML::TagReader further down.
Installation:
To install run:
perl Makefile.PL
make
make test
make install
Note: the application tr_imgaddsize needs Image::Size but
you can install HTML::TagReader without having this
module if you don't need tr_imgaddsize.
You can test if Image::Size is installed on your computer by running
perl -e 'use Image::Size'
This command will print nothing if it is installed.
tr_httpcheck has a dependency on curl but this is also not enforced
during installation.
You can check what the version of a possibly already installed
HTML::TagReader is by running the command:
perl -e 'use HTML::TagReader;print $HTML::TagReader::VERSION."\n"'
------------------- ------------------- -------------------
Non-standard Installation:
If you want to install TagReader and the application programs
below /usr/local (/usr/local/bin , /usr/local/lib/5.6.0/i386-linux
/usr/local/lib/site_perl/5.6.0/i386-linux/auto/HTML/TagReader etc...)
then run
perl Makefile.PL PREFIX=/usr/local
make
make test
make install
To install only the perl module in a different location
(but e.g. the man pages in the standard location) use:
perl Makefile.PL LIB=/you/new/libpath
make test
make install
------------------- ------------------- -------------------
The subdirectory bin contains some applications of TagReader.
All of them start with the prefix "tr_"
tr_blck -- check for broken relative links in html pages
tr_llnk -- list links in html files
tr_xlnk -- expand links on directories
tr_mvlnk -- modify tags in html files with perl commands.
tr_staticssi -- expand SSI directives #include virtual and #exec cmd
tr_delfont -- delete font tags that hardcode the size or font face
tr_delfont -- delete font tags that hardcode the size or font face
tr_httpcheck -- check if a particular web-pages exists
httpcheck does not directyl use the TagReader module
but may be used as post processor for blck
If you are interessted in a link checker to check
links only via the web-server then this is not the
right program for you. Other programs like e.g
http://linkchecker.sourceforge.net/ or
http://www.linklint.org/ or
http://linkchecker.stacken.kth.se/ (webpage where
you can enter a url to check) or
http://htcheck.sourceforge.net/ or
http://www.jmarshall.com/tools/cl/
can be used if you want to check your web-pages only
remotely via a web server.
tr_tagcontentgrep -- grep for a tag e.g "img src"
tr_imgaddsize -- add width and height to
(needs Image::Size from
http://www.cpan.org/authors/id/R/RJ/RJRAY/ )
Note the primary desing goal of TagReader is to provide a fast
way of reading/processing html files.
These application programs will normally be installed to /usr/bin/
------------------- ------------------- -------------------
Author: guidosocher(at)fastmail.com
Homepage: http://linuxfocus.org/~guido/
Homepage on CPAN: http://cpan.org/authors/id/G/GU/GUS/
Copyright: This program is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.
Installation requirements: Perl and a c-compiler (e.g gcc)
Needs Image::Size if you want to use
tr_imgaddsize
------------------- ------------------- -------------------
NAME
TagReader - Perl extension module for reading html/sgml/xml files by
tags.
SYNOPSIS
use HTML::TagReader;
# open then file and get an obj-ref:
my $p=new HTML::TagReader "filename";
# set to zero or undef to omit warnings about html error:
$showerr=1;
# get only the tags:
my $tag = $p->gettag($showerr);
# or
my ($tag,$linenumber,$column)=$p->gettag($showerr);
# get the entire file split into tags and text parts:
my $tagOrText = $p->getbytoken($showerr);
# or
my ($tagOrText,$tagtype,$linenumber,$column)=$p->getbytoken($showerr);
DESCRIPTION
The module implements a fast and small object oriented way of processing
any kind of html/sgml/xml files by tag.
The getbytoken(0) is similar to while(<>) but instead of reading lines
it reads tags or tags and text.
HTML::TagReader makes it easy to keep track of the line number in a file
even though you are not reading the file by line. This important if you
want to implement error messages about html error in your code.
Here is a program that list all href tags in a html file together with
line numbers and column:
use HTML::TagReader;
my $p=new HTML::TagReader "file.html";
my @tag;
while(@tag = $p->gettag(1)){
if ($tag[0]=~/ href ?=/i){
# remove optional space before the equal sign:
$tag[0]=~s/ ?= ?/=/g;
print "line: $tag[1]: col: $tag[2]: $tag[0]\n";
}
}
Here is a program that will read a html file tag wise:
use HTML::TagReader;
my $p=new HTML::TagReader "file.html";
my @tag;
while(@tag = $p->getbytoken(1)){
if ($tag[1] eq ""){
print "line: $tag[2]: not a tag (some text), \"$tag[0]\"\n\n";
}else{
print "line: $tag[2]: col: $tag[2]: is a tag, $tag[0]\n\n";
}
}
new HTML::TagReader $file;
Returns a reference to a TagReader object. This reference can be used
with gettag() or getbytoken() to read the next tag.
gettag($showerr);
Returns in an array context tag, line number and character in the line
(column). In a scalar context just the next tag is returned.
An empty string or and empty array is returned if the file contains no
further tags. html/xml comments and any tags inside the comments are
ignored.
The returned tag string has all white space (tab, newline...) reduced to
just a single space otherwise upper and lower case, quotes etc are as in
the original file. The line numbers are those where the tag starts.
You must provide 0 (or undef) or 1 as an argument to gettag. If 0 is
provided then gettag will not print any errors if it finds a syntax
error in the html/sgml/xml code.
Currently only the following warning messages are implemented to warn
about possible html syntax errors:
- A starting '<' was found but no closing '>' after 300 characters
- A single '<' was found which was not followed by [!/a-zA-Z]. Such a
'<' should be written as <
- A single '>' was found outside a tag.
getbytoken($showerr);
Returns in an array context tag, tagtype (a, br, img,...), line number
and the character position (column) in the line where the tag starts. In
a scalar context just the next tag is returned.
An empty string or and empty array is returned if the file contains no
further tags.
getbytoken() should be used to process a html file and possibly modify
tags. As opposed to gettag() the getbytoken() does not remove newline or
space from the data. getbytoken() gives you access to the entire file
and not only to the tags. That is: you can process the tags and the text
between the tags.
tagtype is always lower case. The tagtype is the string starting the tag
such as "a" in or "!--" in . tagtype is
empty if this is not a tag (normal text or newline).
You must provide 0 (or undef) or 1 as an argument to getbytoken. If 0 is
provided then gettag will not print any errors if it finds a syntax
error in the html/sgml/xml code.
Currently only the following warning messages are implemented to warn
about possible html syntax errors:
- A starting '<' was found but no closing '>' after 300 characters
- A single '<' was found which was not followed by [!/a-zA-Z]. Such a
'<' should be written as <
- A single '>' was found outside a tag.
Working without HTML::TagReader
In special cases it is possible to do processing of files by tag in an
efficient way without the HTML::TagReader package. This can be done by
setting the record separator variable in perl ($/). This causes however
problems with faulty html code where individual '<'-characters appear in
the middle of the text. An example of such a program written in plain
perl (without HTML::TagReader) is the tr_tagcontentgrep program which is
part of the HTML::TagReader distribution. Think first then write your
code! (HTML::TagReader is in most cases the best choice, not in all ;-)
Limitations
There are no limitation to the size of the file.
If you need a more sophisticated interface you might want to take a look
at HTML::Parser. HTML:TagReader is fast generic and straight forward to
use.
COPYRIGHT
Copyright (c) Guido Socher
This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.
SEE ALSO
homepage of this program: http://linuxfocus.org/~guido/ or
http://cpan.org/authors/id/G/GU/GUS/
perl(1) HTML::Parser(3)