The drawing of a zipper is used here courtesy of |
A script that does the opposite of a database "join" on two text database files (sort of).
The name of this tool is "disjoin" for reasons that I hope will become apparent in a moment.
The tool solves the problem of doing set operations (like intersection, difference, complement) on (plain text) database files.
A text database file is a text file where each line corresponds to one record of the database.
Each record is divided into fields by some field separator character or string.
One or more fields (NOT necessarily adjacent!!!) form the (unique) key of each record (like last name and given name for a person, for instance).
Now suppose you have two such database files and want to know whether they share any key values (for example, whether any people appear in both database files).
And suppose you want to split EACH of your two database files into two parts: one part with the records whose keys do not appear in the other database file, and another part with the records whose keys appear in BOTH database files.
(Meditate on the fact that even for keys appearing in BOTH files, the data associated with them is not necessarily the same!)
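The split described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the tool's actual code: the function name `disjoin`, the helper `name_key`, and the semicolon-separated sample records are all assumptions made up for this example.

```python
def disjoin(lines_a, lines_b, key):
    """Split two record lists into (a_only, a_shared, b_shared, b_only).

    `key` is a function that maps one record (a line) to its key.
    """
    keys_a = {key(line) for line in lines_a}
    keys_b = {key(line) for line in lines_b}
    shared = keys_a & keys_b  # keys that occur in both files

    a_only = [line for line in lines_a if key(line) not in shared]
    a_shared = [line for line in lines_a if key(line) in shared]
    b_shared = [line for line in lines_b if key(line) in shared]
    b_only = [line for line in lines_b if key(line) not in shared]
    return a_only, a_shared, b_shared, b_only


def name_key(line):
    """Hypothetical key: last name and given name (fields 1 and 2)."""
    last, given = line.split(";")[:2]
    return (last, given)


file_a = ["Smith;John;London", "Doe;Jane;Paris"]
file_b = ["Doe;Jane;Berlin", "Miller;Ann;Rome"]

a_only, a_shared, b_shared, b_only = disjoin(file_a, file_b, name_key)
print(a_only)    # ['Smith;John;London']
print(a_shared)  # ['Doe;Jane;Paris']
print(b_shared)  # ['Doe;Jane;Berlin']
print(b_only)    # ['Miller;Ann;Rome']
```

Jane Doe ends up in both "shared" parts, but with a different city in each: the keys match, the associated data does not.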
By the way:
The tool allows you to specify a regular expression (in Perl syntax) for determining the field separator character(s).
See the online help (call "disjoin" without parameters or with a parameter "-h" or "-?") for more details on this (option "-F").
To define which fields form the key, use the "-L" option (it takes a comma-separated list of field numbers, without spaces, as its argument).
Note that counting starts at one, not zero. Field number zero is accepted in the key field number list, but it always yields the empty string, so it does no harm (and serves no purpose).
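To make the key rules concrete, here is a minimal Python sketch of how a key could be built from a record. It is not the tool's code: the function `extract_key` and its parameters are hypothetical, Python's regular expression syntax differs from Perl's in some details, and returning an empty string for a field number beyond the last field is an assumption of this sketch (the text above only states this behaviour for field zero).

```python
import re


def extract_key(line, separator=r"\t", key_fields=(1,)):
    """Return the key of one record as a tuple of field values.

    separator  -- regular expression that splits the record into fields
    key_fields -- 1-based field numbers, not necessarily adjacent;
                  0 contributes an empty string (as does, by assumption,
                  any number past the last field)
    """
    fields = re.split(separator, line.rstrip("\n"))
    key = []
    for number in key_fields:
        if 1 <= number <= len(fields):
            key.append(fields[number - 1])
        else:
            key.append("")
    return tuple(key)


record = "Smith;1970-01-01;John;London"
print(extract_key(record, separator=r";", key_fields=(1, 3)))
# ('Smith', 'John')
print(extract_key(record, separator=r";", key_fields=(0, 1)))
# ('', 'Smith')
```

Because the separator is a regular expression, a pattern such as `\s*,\s*` would also swallow the blanks around each comma.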
To better illustrate what the tool does, two diagrams:
        ______
       /      \  ______
      /        \/      \
     (         /)       \
    (    A    (  )       )
    (        ( C )        )
     (       (  )    B    )
      \       (/         )
       \______/\        /
                \______/
The set ( A + C ) is the set of the keys contained in File_A, the set ( B + C ) is the set of the keys contained in File_B, and the set ( C ) is the set of the keys contained in both File_A and File_B.
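In terms of Python's built-in sets, the three regions of the diagram could be computed as follows. The key values are invented for this example, and the variable names merely mirror the letters in the diagram.

```python
keys_file_a = {"Smith", "Doe", "Jones"}
keys_file_b = {"Doe", "Jones", "Miller"}

C = keys_file_a & keys_file_b  # keys contained in both files
A = keys_file_a - keys_file_b  # keys contained only in File_A
B = keys_file_b - keys_file_a  # keys contained only in File_B

# ( A + C ) and ( B + C ) give back the complete key sets of the two files.
assert A | C == keys_file_a
assert B | C == keys_file_b

print(sorted(A), sorted(C), sorted(B))
# ['Smith'] ['Doe', 'Jones'] ['Miller']
```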
                     File_A       File_B
                        |            |
                         \          /
                          \        /
                           \      /
                            |    |
        comparison ---> ====+====+==== <--- comparison
                          ( Pass 1 )
            |               |    |               |
            |              /      \              |
            V             /        \             V
 controls selector ---> /|          |\ <--- controls selector
                          ( Pass 2 )
                      /  |          |  \
                     /   |          |   \
                    /    |          |    \
                   /     |          |     \
                  /      |          |      \
                 |       |          |       |
                 V       V          V       V
             File_A.1 File_A.0  File_B.0 File_B.1
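Mnemonics: On computers, the zero ("0") is often represented by an "O" with a slash ("/") through it. This is also a symbol for "intersection" in set theory. The one ("1") symbolizes "uniqueness". So the ".0" files receive the records whose keys appear in both database files, and the ".1" files receive the records whose keys appear in only one of them.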
Note that the original two database files are NOT modified, just read!
[Further explanations under construction]