=================================
  Script "gen_tree" Version 2.2
=================================

-----------------------------
    Script for Perl 5.002
-----------------------------

(should work with later versions of Perl as well)

Legal stuff:
------------

Copyright (c) 1996, 1997, 1998, 1999 by Steffen Beyer. All rights reserved.

This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

Requirements:
-------------

Perl version 5.002 or higher.

Compatibility of your web pages with the Apache HTTP server (this only
concerns the syntax of server side includes and server side image maps -
see also the FAQ further below!). Should work with other web servers as
well.

What does it do:
----------------

This script scans the tree (better: the directed graph) of HTML pages of
a web site. (It's not always a tree because circles and loops are
possible!)

It starts at the home page of that site (called the "root page" here) and
follows all hyperlinks in a recursive descent (breadth-first, in order to
produce a representation in the expected way). (You can also scan just a
subtree of your web site if you want.)

Since it scans files in the file system of the host bearing the web site,
it is confined to pages lying physically on one host (!). The web server
(HTTP daemon) of the web site is NOT used at all (!).

Circles and loops are recognized through unique identification of each
page by the device and inode numbers of its corresponding file.

(This was the main reason for not using the "libwww" (LWP) module.
Another reason was the wish to be able to use this script whether or not
the HTTP daemon is running. Speed considerations (communication with an
HTTP server is slow compared to direct access to the file system) were
also important.)

Therefore, this script is confined to UNIX hosts or hosts where the
device and inode numbers returned by "stat" serve the same purpose as
with UNIX.

One could abandon this latter restriction if one used checksums (for
instance by using the MD5 module) for identification instead. This is not
100% reliable, however (different files could have the same checksum),
and would require additional checking.
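For illustration, the device/inode bookkeeping described above boils down
to something like this (a sketch only - the hash and subroutine names are
invented here, not taken from the script):

    use strict;

    my %seen = ();              # "device:inode" of every page seen so far

    sub already_visited
    {
        my($file) = @_;
        my($dev,$ino) = (stat($file))[0,1]; # first two values of stat()
        return(0) unless (defined $ino);    # unreadable: let caller deal
        my $key = "$dev:$ino";
        return(1) if (exists $seen{$key});  # circle or loop detected!
        $seen{$key} = 1;                    # remember this page
        return(0);
    }

Note that two different path names (for instance via a symbolic link)
which lead to the same file yield the same device and inode pair, which
is exactly what makes this method immune to circles and loops.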
When scanning of the web site is complete, an HTML page is generated
which contains all the pages found in the form of one hyperlink to each
of them. (The parse tree that is built in memory during the scanning
phase is traversed in a recursive descent, this time depth-first, to
yield a tree that looks the expected way.)

The tree structure of the web site is reflected in this page by the
indentation of these hyperlinks. The text which is displayed in these
hyperlinks is extracted from the <TITLE> ... </TITLE> tags inside the
corresponding page.

Supported features:
-------------------

This script is capable of executing server side includes and of analyzing
server side image maps (client side image maps wouldn't be very hard to
add). Their syntax must be compatible (!) with the Apache HTTP server's.
(This means that the use of the Apache server is NOT necessarily
required!) This way, no important hyperlinks are missed. (Many home pages
consist of an image map and nothing else!)

It is also able to analyze CGI scripts simply by calling them and
analyzing their output. (Therefore, no HTTP server is needed!)

Passing of variable parameters to CGI scripts is not supported, however,
whereas passing of constants (the same for all CGI scripts) via
environment variables is possible.

(Passing of variable parameters (like query strings) is problematic
conceptually: Imagine you get back a list (a possibly quite individual
list at that) of hyperlinks from a full text search CGI script on your
web site!)

While the web site is being scanned, a detailed log file is written. Most
of the time, it's a good idea to read it because it lets you discover
flaws in your web site that often go unnoticed otherwise!

The files generated by this script (log file and output file) are never
overwritten: instead, older versions are archived by appending an ever-
increasing number to their file names. This way, you can always go back
to a previous state if anything goes wrong.

Note that the use of the <BASE> tag to define the base for relative URLs
in an HTML document is not supported. (Again, it shouldn't be much of a
problem to add.)

How to use (and where to get) it:
---------------------------------

Simply install this script wherever you like.

Although the script is quite fast (about 7 seconds on a web site with
about 70 pages on a 486/66 MHz PC with FreeBSD), it's probably best to
run this script once a night (as a "cron" job) or manually whenever you
add or remove pages (or change their <TITLE>s) on your web site.

Why make the visitors of your web site wait by using this script as a CGI
script when they are in need of quick help and orientation?!

This is also the reason why the page which is generated by this script
doesn't use any graphics - it's intended to give your visitors assistance
when they need it, in the fastest possible way!

The configuration of the script is quite simple, just follow the
directives in the script itself!

You'll probably need to change the two subroutines "url_to_file" and
"file_to_url" to reflect the file path conventions at your web site.

At our site, HTML pages do not lie directly in the user home directories,
but in a special, hidden subdirectory named ".www". Also, all user home
directories have the form /u/<login>. The same is true for group home
directories (/g/<login>) and other entity home directories (/e/<login>).
(At our site, there are no other valid URLs than the ones mentioned
above!)

Therefore, "url_to_file" inserts ".www" into URLs and "file_to_url"
removes ".www" from file paths, as sketched below. Delete the
corresponding lines (under the header "transformation for hidden HTML
subdirectories in user home directories") if they don't apply to your
site, or modify them according to your file path conventions.
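A sketch of what these two transformations might look like, based on the
conventions just described (the regular expressions here are illustrative
assumptions, not a verbatim copy of the script):

    sub url_to_file
    {
        my($url) = @_;
        # insert the hidden ".www" subdirectory after /u/<login>,
        # /g/<login> or /e/<login>:
        $url =~ s!^(/[uge]/[^/]+)!$1/.www!;
        return($url);
    }

    sub file_to_url
    {
        my($file) = @_;
        # remove the hidden ".www" subdirectory again:
        $file =~ s!^(/[uge]/[^/]+)/\.www!$1!;
        return($file);
    }

(The real subroutines also have to map between the different root
directories for HTML pages and CGI scripts; see the version 2.0 notes
below.)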
You'll probably also want a different layout of the final page. Change
the two subroutines "html_header" and "html_footer" accordingly!

If your CGI scripts need more environment variables, add them in the
subroutine "setup_for_cgi"!

If you want to see a working example, direct your web browser to the
following site:

    http://sb.fluomedia.org/sitemap/

You can also download this script from that site:

    http://sb.fluomedia.org/download/pkg/gen_tree-2.2.tar.gz

Or download it from any CPAN (= "Comprehensive Perl Archive Network")
mirror server near you:

    http://www.perl.com/CPAN/authors/id/STBEY/gen_tree-2.2.tar.gz

Frequently Asked Questions (FAQ):
---------------------------------

Q: Is it difficult to adapt this script for other HTTP servers?

A: Not really. You just need to change the two regular expressions that
   analyze server side include directives and the lines of a (server
   side) image map in this script:

   while (${$line} =~ m,<!--#include\s+(virtual|file)\s*=\s*"\s*(\S+?)\s*"\s*-->,i)

   while ($line =~ m!\b(?:rect|circle|poly|default)\s+([^<>'"\s]+)\s!i)

   These two regular expressions assume the following syntax (examples):

   <!--#include file="....."-->
   <!--#include virtual="....."-->

   rect     /e/www/       .....
   circle   ../../        .....
   poly     ../info.html  .....
   default  ../none.cgi   .....

   I suppose in fact that many other HTTP servers besides the Apache use
   this same syntax.

Q: Why does the script need to run under root?

A: In order to emulate the HTTP server, which changes its real and
   effective UID and GID to either "nobody" (in the case of a CGI
   program) or the file owner and group of a "secure" CGI program
   (*.scgi) before executing such a (S)CGI program - in order to
   minimize the possible damage a (S)CGI program can do if something
   goes wrong.

   The script needs to be started under root to be able to change its
   own process UID and GID. "root" privilege is also needed to be
   allowed to "chown" the two output files back to their original owner.
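   In outline, the privilege handling amounts to something like this (a
   sketch with assumed variable names such as "$cgi_program" and
   "$orig_uid" - not the script's literal code):

       my($uid,$gid) = (getpwnam("nobody"))[2,3]; # or the *.scgi's owner

       my $pid = open(CGI, "-|");   # fork; child's STDOUT is piped back
       die "fork failed: $!\n" unless (defined $pid);
       unless ($pid)                # child: drop privileges, run the CGI
       {
           $( = $gid;               # real GID
           $) = "$gid $gid";        # effective GID (and group list)
           $< = $uid;               # real UID
           $> = $uid;               # effective UID (give up root last)
           exec($cgi_program) || die "exec failed: $!\n";
       }
       my @output = <CGI>;          # parent (still root) reads the output
       close(CGI);

       chown($orig_uid, $orig_gid, $log_file, $output_file)
           || warn "chown failed: $!\n";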
Q: Why doesn't the script use parameters, i.e. why is the configuration
   information stored in the script itself and not passed to it as
   parameters?

A: Because the script runs under root, it would reject any such
   parameters as being "insecure" (this is a feature of Perl).

   Moreover, in case you only want to scan a subtree (which is probably
   the reason why one would like to be able to use parameters in the
   first place) you probably need to experiment with the script first to
   see if it doesn't "leak" into the rest of your site - you will need
   to include the address(es) of the sideways reference(s) in the list
   of pages to skip to avoid this kind of unwanted "leakage".

   So the configuration information is usually not just one or two
   parameters, which means you wouldn't want to type it in every time
   you run the script - which means you need to store it somewhere - so
   why not in the script itself?

Q: Why isn't the configuration information stored in a configuration
   file, then?

A: This should indeed be possible (unless such items read in from a file
   are considered "insecure" by Perl as well, which I didn't test).

   But you still need to have a separate copy of this script for each
   page you want to generate because you probably want to have different
   layouts for each of them, and you need to specify this layout in the
   two routines "html_header" and "html_footer" inside the script - so
   why bother to have two files (the script itself and its configuration
   file) for one task?

   Unless you let "html_header" and "html_footer" each read in a file of
   their own which is copied to the output - but that makes four files
   instead of one! Do whatever suits you best!

Version history:
----------------

Version 1.0:

Initial release.

Version 1.1:

Added a feature which enables the script to display non-HTML files in
the tree representation (i.e., files for download).

Version 1.2:

Improved the documentation (this README file) and changed the output
format from

    <PRE>
    ...
    ...
    </PRE>

to

    <DL>
    <DD>...
    <DD>...
    </DL>

Version 2.0:

Many, many bug fixes.

A "$tree_root" set to "/" should work now.

Different root directories for HTML pages and CGI scripts are now
supported. Fixed "url_to_file" and "file_to_url" accordingly.

The <TITLE> tag is matched case-insensitively now (I forgot to put the
"i" modifier on that match in previous versions).

Text anchor links (i.e., "#subsection") are now completely ignored, as
they should be (before, they were mistakenly regarded as links to the
default page in the current directory).

The <DT> tag is now used in <DL> ... </DL> lists to satisfy browsers
that require it.

Version 2.1:

It is now possible to have greater control over which pages are shown
and where (the latter when several links exist to the same page(s)) by
specifying the root page of any given subtree and the number of
(topmost) levels to be shown of that subtree, i.e. "0" for hiding the
subtree completely, "1" for showing the root (topmost) page of that
subtree only, "2" for showing the root page and the pages it contains
hyperlinks to, and so on. (The number limits the maximum depth of
hyperlinks to follow.)

Version 2.2:

The layout of the output page has been changed. Code has been added in
order to exclude certain pages and files on my web site.

Credits:
--------

Many thanks to Walter Thyselius <walter@unilog.se> for suggesting the
use of <DL> ... </DL> and <DD> instead of <PRE> ... </PRE> and
indentation with spaces for the output format, as realized in version
1.2!

(Actually he uses <UL> ... </UL> and <LI>, but I find the many different
little enumeration bullets disturbing and confusing - use whatever you
like best!)

Also many thanks to

 - Fabrizio Pivari <Fabrizio.Pivari@rupia.agip.it>
 - Xin Liu <xliu@merit.edu>
 - Pete Wenzel <pete@stc.com>
 - Detlev Droege <droege@informatik.uni-koblenz.de>
 - Winfield S. Heagy <heagy@csgrad.cs.vt.edu>

for contributing the many suggestions, bug reports and bug fixes that
went into version 2.0!

Special thanks to Michael Bruns <bruns@linmbr.mpae.gwdg.de> for
suggesting the finer tuning capability for excluding pages and subtrees
realized in version 2.1!

Final note:
-----------

If you need any assistance (for example in finding the right
configuration for your site) or have any comments, problems,
suggestions, findings, complaints, questions, insights, compliments or
donations to give ;-) then please don't hesitate to send me some mail:

    sb@fluomedia.org (Steffen Beyer)

Best regards,
--
Steffen Beyer <sb@fluomedia.org>
http://sb.fluomedia.org/download/           (Free Perl and C Software
http://www.perl.com/CPAN/authors/id/STBEY/   for Download)

New: Build'n'Play 2.1.0 (all-purpose Unix batch installation tool)
http://www.oreilly.de/catalog/perlmodger/bnp.html