[4mFsdb[24m(3) User Contributed Perl Documentation [4mFsdb[24m(3)
[1mNAME[0m
Fsdb - a flat-text database for shell scripting
[1mSYNOPSIS[0m
Fsdb, the flatfile streaming database, is a package of commands for
manipulating flat-ASCII databases from shell scripts. Fsdb is useful
to process medium amounts of data (with very little data you'd do it by
hand, with megabytes you might want a real database). Fsdb was known
as Jdb from 1991 to Oct. 2008.
Fsdb is very good at doing things like:
+o extracting measurements from experimental output
+o examining data to address different hypotheses
+o joining data from different experiments
+o eliminating/detecting outliers
+o computing statistics on data (mean, confidence intervals,
correlations, histograms)
+o reformatting data for graphing programs
Fsdb is built around the idea of a flat text file as a database. Fsdb
files (by convention, with the extension [4m.fsdb[24m), have a header
documenting the schema (what the columns mean), and then each line
represents a database record (or row).
For example:
#fsdb experiment duration
ufs_mab_sys 37.2
ufs_mab_sys 37.3
ufs_rcp_real 264.5
ufs_rcp_real 277.9
This is a simple file with four experiments (the rows), each with a
description and run time in the first and second columns.
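The header is what lets a script refer to a column by name rather than by position. The following is a minimal sketch in plain awk, not Fsdb's own implementation; the function name fsdb_values is hypothetical. It reads the "#fsdb" line to locate the requested column, skips comment lines, and prints that column's values. It assumes the default whitespace separator and a header without options.

```shell
# fsdb_values COLUMN: print the values of COLUMN from an Fsdb-style table on stdin.
# Hypothetical helper for illustration; real Fsdb tools do far more checking.
fsdb_values() {
    awk -v want="$1" '
        NR == 1 {
            # Header: "#fsdb name1 name2 ..."; field 1 is the "#fsdb" tag.
            for (i = 2; i <= NF; i++)
                if ($i == want) col = i - 1
            if (!col) { print "no such column: " want > "/dev/stderr"; exit 1 }
            next
        }
        /^#/ { next }      # comment lines carry no data
        { print $col }
    '
}
```

Because the lookup is by name, a call such as "fsdb_values duration" keeps working if another column is later added in front of duration.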
Rather than hand-code scripts to do each special case, Fsdb provides
higher-level functions. Although it's often easy to throw together a
custom script to do any single task, I believe that there are several
advantages to using Fsdb:
+o these programs provide a higher level interface than plain Perl, so
** Fewer lines of simpler code:
dbrow '_experiment eq "ufs_mab_sys"' | dbcolstats duration
Picks out just one type of experiment and computes statistics
on it, rather than:
while (<>) { split; $sum+=$F[1]; $ss+=$F[1]**2; $n++; }
$mean = $sum / $n; $std_dev = ...
in dozens of places.
+o the library uses names for columns, so
** No more $F[1], use "_duration".
** New or different order columns? No changes to your scripts!
Thus if your experiment gets more complicated with a size
parameter, so your log changes to:
#fsdb experiment size duration
ufs_mab_sys 1024 37.2
ufs_mab_sys 1024 37.3
ufs_rcp_real 1024 264.5
ufs_rcp_real 1024 277.9
ufs_mab_sys 2048 45.3
ufs_mab_sys 2048 44.2
Then the previous scripts still work, even though duration is now
the third column, not the second.
+o A series of actions is self-documenting (the provenance of
processing done to produce each output is recorded in comments).
** No more wondering what hacks were used to compute the final
data, just look at the comments at the end of the output.
For example, the commands
dbrow '_experiment eq "ufs_mab_sys"' | dbcolstats duration
add to the end of the output the lines
# | dbrow _experiment eq "ufs_mab_sys"
# | dbcolstats duration
+o The library is mature, supporting large datasets (more than 100GB),
parallelism, corner cases, error handling, backed by an automated
test suite.
** No more puzzling about bad output because your custom script
skimped on error checking.
** No more memory thrashing when you try to sort ten million
records.
** Makes use of multiple cores in your computer when it can,
because each pipeline component runs in parallel, and because
key tools (dbsort, dbmapreduce) run in parallel when possible.
+o Fsdb-2.x supports Perl scripting (in addition to shell scripting),
with libraries to do Fsdb input and output, and easy support for
pipelines. The shell script
dbcol name test1 | dbroweval '_test1 += 5;'
can be written in perl as:
dbpipeline(dbcol(qw(name test1)), dbroweval('_test1 += 5;'));
(The disadvantage is that you need to learn what functions Fsdb
provides.)
Fsdb is built on flat-ASCII databases. By storing data in simple text
files and processing it with pipelines it is easy to experiment (in the
shell) and look at the output. To the best of my knowledge, the
original implementation of this idea was "/rdb", a commercial product
described in the book UNIX relational database management: application
development in the UNIX environment by Rod Manis, Evan Schaffer, and
Robert Jorgensen (1988 by Prentice Hall, and also at the web page
). Fsdb is an incompatible re-implementation of
their idea without any accelerated indexing or forms support. (But
it's free, and probably has better statistics!).
Fsdb-2.x will exploit multiple processors or cores, and provides Perl-
level support for input, output, and threaded-pipelines. (As of
Fsdb-2.44 it no longer uses Perl threading, just processes, since they
are faster.)
Installation instructions follow at the end of this document. Fsdb-2.x
requires Perl 5.8 to run. All commands have manual pages and provide
usage with the "--help" option. All commands are backed by an
automated test suite.
The most recent version of Fsdb is available on the web at
.
[1mWHAT'S NEW[0m
[1m3.4, tbd tbd[0m
ENHANCEMENT
dbcolsdecimate now has examples in its documentation.
BUG FIX
dbcolstats, dbmapreduce, dbcolpercentile, dbfilepivot, and
dbmultistats now correctly propagate the temporary directory into
the sort routine, if required. (Previously, while they collected
tmpdir, they did not propagate it. This problem only applied if n-tiles
were requested.)
[1mREADME CONTENTS[0m
executive summary
what's new
README CONTENTS
installation
basic data format
basic data manipulation
list of commands
another example
a gradebook example
a password example
history
related work
release notes
copyright
comments
[1mINSTALLATION[0m
Fsdb now uses the standard Perl build and installation from
ExtUtils::MakeMaker(3), so the quick answer to installation is to type:
perl Makefile.PL
make
make test
sudo make install
Or, if you want to install it somewhere else, change the first line to
perl Makefile.PL PREFIX=$HOME
then run the other commands ("make; make test; make install"; but now
without the sudo), and it will go in your home directory's bin, etc.
(See ExtUtils::MakeMaker(3) for more details.)
Fsdb requires perl 5.8 or later.
A test-suite is available, run it with
make test
In the past, ports existed for FreeBSD and MacOS. If someone
running one of those OSes wants to contribute a new port, please let me
know.
[1mBASIC DATA FORMAT[0m
These programs are based on the idea of storing data in simple ASCII
files. A database is a file with one header line and then data or
comment lines. For example:
#fsdb account passwd uid gid fullname homedir shell
johnh * 2274 134 John_Heidemann /home/johnh /bin/bash
greg * 2275 134 Greg_Johnson /home/greg /bin/bash
root * 0 0 Root /root /bin/bash
# this is a simple database
The header line must be first and begins with "#fsdb". There are rows
(records) and columns (fields), just like in a normal database.
Comment lines begin with "#". Column names are any string not
containing spaces or single quotes (although it is prudent to keep them
alphanumeric with underscore).
Columns can optionally include type annotations by following the name
with :t where t is some type. (Types are not used in Perl, but are
relevant in Python and Go Fsdb bindings.) Types use a subset of perl
pack specifiers: c, s, l, q are signed 8, 16, 32, and 64-bit integers,
f is a float, d is double float, a is utf-8 string, and > and < can
force big or little endianness.
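As a rough illustration of the annotation syntax, the sketch below splits each header field into a name and an optional type. It is plain awk rather than anything taken from Fsdb, the function name fsdb_schema is made up for this example, and printing "-" for an untyped column is this sketch's own convention. It assumes a header without -F or -R options.

```shell
# fsdb_schema: list "name type" for each column named in the header on stdin.
# Hypothetical helper; "-" marks a column with no type annotation.
fsdb_schema() {
    awk 'NR == 1 && $1 == "#fsdb" {
        for (i = 2; i <= NF; i++) {
            n = split($i, part, ":")      # "score:d" -> part[1]="score", part[2]="d"
            print part[1], (n > 1 ? part[2] : "-")
        }
        exit                              # only the header matters here
    }'
}
```

Since typing is optional, a header that mixes typed and untyped columns is handled the same way as a fully typed one.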
By default, columns are delimited by any amount of whitespace. With
this default configuration, the contents of a field cannot contain
whitespace. However, this limitation can be relaxed by changing the
field separator as described below.
The big advantage of simple flat-text databases is that it is usually
easy to massage data into this format, and it's reasonably easy to take
data out of this format into other (text-based) programs, like gnuplot,
jgraph, and LaTeX. Think Unix. Think pipes. (Or even output to Excel
and HTML if you prefer.)
Since no-whitespace in columns was a problem for some applications,
there's an option which relaxes this rule. You can specify the field
separator in the table header with "-F x" where "x" is a code for the
new field separator. A full list of codes is at dbfilealter(1), but
two common special values are "-F t" which is a separator of a single
tab character, and "-F S", a separator of two spaces. Both allow
(single) spaces in fields. An example:
#fsdb -F S account passwd uid gid fullname homedir shell
johnh  *  2274  134  John Heidemann  /home/johnh  /bin/bash
greg  *  2275  134  Greg Johnson  /home/greg  /bin/bash
root  *  0  0  Root  /root  /bin/bash
# this is a simple database
See [1mdbfilealter[22m(1) for more details. Regardless of what the column
separator is for the body of the data, it's always whitespace in the
header.
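To make the separator codes concrete, here is a small sketch that reads the "-F" code from the header and splits the data rows accordingly. Only the two codes described above (t and S) are handled, with any amount of whitespace as the fallback. It is an awk illustration under those assumptions, and fsdb_field is a hypothetical name, not an Fsdb command.

```shell
# fsdb_field N: print the Nth field (counting from 1) of each data row on stdin.
fsdb_field() {
    awk -v n="$1" '
        NR == 1 {
            # The header itself is always whitespace-separated.
            for (i = 2; i < NF; i++)
                if ($i == "-F") code = $(i + 1)
            if (code == "t") FS = "\t"         # a single tab
            else if (code == "S") FS = "  "    # two spaces
            next
        }
        /^#/ { next }                          # skip comments
        { print $n }
    '
}
```

With "-F S" in the header, a value such as "alfred aho" stays in one field because the single space inside it is not a separator.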
There's also a third format: a "list". Because it's often hard to see
what's in columns past the first two, in list format each "column" is
on a separate line. The programs dblistize and dbcolize convert to and
from this format, and all programs work with either format. The command
dbfilealter -R C < DATA/passwd.fsdb
outputs:
#fsdb -R C account passwd uid gid fullname homedir shell
account: johnh
passwd: *
uid: 2274
gid: 134
fullname: John_Heidemann
homedir: /home/johnh
shell: /bin/bash
account: greg
passwd: *
uid: 2275
gid: 134
fullname: Greg_Johnson
homedir: /home/greg
shell: /bin/bash
account: root
passwd: *
uid: 0
gid: 0
fullname: Root
homedir: /root
shell: /bin/bash
# this is a simple database
# | dblistize
See [1mdbfilealter[22m(1) for more details.
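The conversion to list format is mechanical, which the following sketch tries to show. It is a plain awk approximation, not the code behind dbfilealter or dblistize, and to_list is a hypothetical name. It assumes a whitespace-separated input whose header has no options, and it separates records with an empty line, which is an assumption of this sketch.

```shell
# to_list: rewrite a whitespace-separated Fsdb table on stdin as "name: value" lines.
to_list() {
    awk '
        NR == 1 {
            for (i = 2; i <= NF; i++) name[i - 1] = $i
            ncols = NF - 1
            $1 = "#fsdb -R C"       # mark the output as list format
            print
            next
        }
        /^#/ { print; next }        # pass comments through unchanged
        {
            for (i = 1; i <= ncols; i++) print name[i] ": " $i
            print ""                # assumed record separator
        }
    '
}
```

Each output line carries its own column name, so wide records stay readable without counting fields.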
[1mBASIC DATA MANIPULATION[0m
A number of programs exist to manipulate databases. Complex functions
can be made by stringing together commands with shell pipelines. For
example, to print the home directories of everyone with ``john'' in
their names, you would do:
cat DATA/passwd | dbrow '_fullname =~ /John/' | dbcol homedir
The output might be:
#fsdb homedir
/home/johnh
/home/greg
# this is a simple database
# | dbrow _fullname =~ /John/
# | dbcol homedir
(Notice that comments are appended to the output listing each command,
providing an automatic audit log.)
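For readers who want to see what such a pipeline stage does, here is a sketch of row selection, column projection, and the trailing audit comment in one awk function. The name select_rows and its argument order are invented for this example; it is not how dbrow or dbcol are implemented, and it assumes the default whitespace separator.

```shell
# select_rows COLUMN REGEX OUTCOL: keep rows whose COLUMN matches REGEX,
# print only OUTCOL, and append a trailer comment describing the step.
select_rows() {
    awk -v col="$1" -v re="$2" -v out="$3" '
        NR == 1 {
            for (i = 2; i <= NF; i++) idx[$i] = i - 1   # column name -> position
            print "#fsdb", out
            next
        }
        /^#/ { print; next }            # keep earlier comments and trailers
        $(idx[col]) ~ re { print $(idx[out]) }
        END { print "# | select_rows " col " " re " " out }
    '
}
```

Run as "select_rows fullname John homedir" on the password table, it keeps johnh and greg (Greg_Johnson also contains "John") and records the step in a trailing comment.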
In addition to typical database functions (select, join, etc.) there
are also a number of statistical functions.
The real power of Fsdb is that one can apply arbitrary code to rows to
do powerful things.
cat DATA/passwd | dbroweval '_fullname =~ s/(\w+)_(\w+)/$2,_$1/'
converts "John_Heidemann" into "Heidemann,_John". Not too much more
work could split fullname into firstname and lastname fields.
(Or:
cat DATA/passwd | dbcolcreate sort | dbroweval -b 'use Fsdb::Support' \
'_sort = _fullname; _sort =~ s/_/ /g; _sort = fullname_to_sort(_sort);'
)
[1mTALKING ABOUT COLUMNS[0m
An advantage of Fsdb is that you can talk about columns by name
(symbolically) rather than simply by their positions. So in the above
example, "dbcol homedir" pulled out the home directory column, and
"dbrow '_fullname =~ /John/'" matched against column fullname.
In general, you can use the name of the column listed on the "#fsdb"
line to identify it in most programs, and _name to identify it in code.
Some alternatives for flexibility:
+o Numeric values identify columns positionally, numbering from 0. So
0 or _0 is the first column, 1 is the second, etc.
+o In code, _last_columnname gets the value from columnname's previous
row.
See [1mdbroweval[22m(1) for more details about writing code.
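The idea behind _last_columnname can be sketched outside Fsdb as well: remember each row's value so that the next row can refer to it. The awk function below (hypothetical name col_delta) prints the change in a named column from one row to the next. Reporting 0 for the first row is a choice made for this sketch, not documented Fsdb behavior.

```shell
# col_delta COLUMN: print COLUMN and its change since the previous row.
col_delta() {
    awk -v want="$1" '
        NR == 1 {
            for (i = 2; i <= NF; i++) if ($i == want) col = i - 1
            print "#fsdb", want, "delta"
            next
        }
        /^#/ { print; next }
        {
            delta = seen ? $col - last : 0   # first row has no predecessor
            print $col, delta
            last = $col; seen = 1            # plays the role of _last_COLUMN
        }
    '
}
```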
[1mLIST OF COMMANDS[0m
Enough said. I'll summarize the commands, and then you can experiment.
For a detailed description of each command, see a summary by running it
with the argument "--help" (or "-?" if you prefer.) Full manual pages
can be found by running the command with the argument "--man", or
running the Unix command "man dbcol" or whatever program you want.
[1mTABLE CREATION[0m
dbcolcreate
add columns to a database
dbcoldefine
set the column headings for a non-Fsdb file
[1mTABLE MANIPULATION[0m
dbcol
select columns from a table
dbrow
select rows from a table
dbsort
sort rows based on a set of columns
dbjoin
compute the natural join of two tables
dbcolrename
rename a column
dbcolmerge
merge two columns into one
dbcolsplittocols
split one column into two or more columns
dbcolsplittorows
split one column into multiple rows
dbfilepivot
"pivots" a file, converting multiple rows corresponding to the same
entity into a single row with multiple columns.
dbfilevalidate
check that db file doesn't have some common errors
[1mCOMPUTATION AND STATISTICS[0m
dbcolstats
compute statistics over a column (mean, etc., optionally median)
dbmultistats
group rows by some key value, then compute stats (mean, etc.) over
each group (equivalent to dbmapreduce with dbcolstats as the
reducer)
dbmapreduce
group rows (map) and then apply an arbitrary function to each group
(reduce)
dbrvstatdiff
compare two sample distributions (mean/conf interval/T-test)
dbcolmovingstats
compute moving statistics over a column of data
dbcolstatscores
compute Z-scores and T-scores over one column of data
dbcolpercentile
compute the rank or percentile of a column
dbcolhisto
compute histograms over a column of data
dbcolscorrelate
compute the coefficient of correlation over several columns
dbcolsdecimate
drop rows selectively, keeping large changes and periodic samples
dbcolsregression
compute linear regression and correlation for two columns
dbrowaccumulate
compute a running sum over a column of data
dbrowcount
count the number of rows (a subset of dbstats)
dbrowdiff
compute differences between columns in each row of a table
dbrowenumerate
number each row
dbroweval
run arbitrary Perl code on each row
dbrowuniq
count/eliminate identical rows (like Unix [1muniq[22m(1))
dbfilediff
compare fields on rows of a file (something like Unix [1mdiff[22m(1))
[1mOUTPUT CONTROL[0m
dbcolneaten
pretty-print columns
dbfilealter
convert between column or list format, or change the column
separator
dbfilestripcomments
remove comments from a table
dbformmail
generate a script that sends form mail based on each row
[1mCONVERSIONS[0m
(These programs convert data into fsdb. See their web pages for
details.)
cgi_to_db
combined_log_format_to_db
html_table_to_db
HTML tables to fsdb (assuming they're reasonably formatted).
kitrace_to_db
ns_to_db
sqlselect_to_db
the output of SQL SELECT tables to db
tabdelim_to_db
spreadsheet tab-delimited files to db
tcpdump_to_db
(see man [1mtcpdump[22m(8) on any reasonable system)
xml_to_db
XML input to fsdb, assuming they're very regular
(And out of fsdb:)
db_to_csv
Comma-separated-value format from fsdb.
db_to_html_table
simple conversion of Fsdb to html tables
[1mSTANDARD OPTIONS[0m
Many programs have common options:
[1m-? [22mor [1m--help[0m
Show basic usage.
-N or --new-name
When a command creates a new column like dbrowaccumulate's "accum",
this option lets one override the default name of that new column.
-T TmpDir
where to put tmp files. Also uses environment variable TMPDIR, if
-T is not specified. Default is /tmp.
[1m-c FRACTION [22mor [1m--confidence FRACTION[0m
Specify confidence interval FRACTION (dbcolstats, dbmultistats,
etc.)
[1m-C S [22mor "--element-separator S"
Specify column separator S (dbcolsplittocols, dbcolmerge).
[1m-d [22mor [1m--debug[0m
Enable debugging (may be repeated for greater effect in some
cases).
[1m-a [22mor [1m--include-non-numeric[0m
Compute stats over all data (treating non-numbers as zeros). (By
default, things that can't be treated as numbers are ignored for
stats purposes)
[1m-S [22mor [1m--pre-sorted[0m
Assume the data is pre-sorted. May be repeated to disable
verification (saving a small amount of work).
[1m-e E [22mor [1m--empty E[0m
give value E as the value for empty (null) records
[1m-i I [22mor [1m--input I[0m
Input data from file I.
[1m-o O [22mor [1m--output O[0m
Write data out to file O.
[1m--header [22mH
Use H as the full Fsdb header, rather than reading a header from
the input. This option is particularly useful when using Fsdb
under Hadoop, where split files don't have headers.
[1m--nolog[22m.
Skip logging the program in a trailing comment.
When giving Perl code (in dbrow and dbroweval), column names can be
embedded if preceded by underscores. See dbrow(1) or dbroweval(1)
for examples.
Most programs run in constant memory and use temporary files if
necessary. Exceptions are dbcolneaten, dbcolpercentile, dbmapreduce,
dbmultistats, dbrowsplituniq.
[1mSTANDARD SORTING OPTIONS[0m
A number of programs do sorting, or depend on defining an ordering of
rows. Such programs use these standard sorting options:
[1m-r [22mor [1m--descending[0m
sort in reverse order (high to low)
[1m-R [22mor [1m--ascending[0m
sort in normal order (low to high)
[1m-t [22mor [1m--type-inferred-sorting[0m
sort fields by type (numeric or lexicographic), automatically
[1m-n [22mor [1m--numeric[0m
sort numerically
[1m-N [22mor [1m--lexical[0m
sort lexicographically
[1mANOTHER EXAMPLE[0m
Take the raw data in "DATA/http_bandwidth", put a header on it
("dbcoldefine size bw"), take statistics of each category
("dbmultistats -k size bw"), pick out the relevant fields ("dbcol size
mean stddev pct_rsd"), and you get:
#fsdb size mean stddev pct_rsd
1024 1.4962e+06 2.8497e+05 19.047
10240 5.0286e+06 6.0103e+05 11.952
102400 4.9216e+06 3.0939e+05 6.2863
# | dbcoldefine size bw
# | /home/johnh/BIN/DB/dbmultistats -k size bw
# | /home/johnh/BIN/DB/dbcol size mean stddev pct_rsd
(The whole command was:
cat DATA/http_bandwidth |
dbcoldefine size bw |
dbmultistats -k size bw |
dbcol size mean stddev pct_rsd
all on one line.)
Then post-process them to get rid of the exponential notation by adding
this to the end of the pipeline:
dbroweval '_mean = sprintf("%8.0f", _mean); _stddev = sprintf("%8.0f", _stddev);'
(Actually, this step is no longer required since dbcolstats now uses a
different default format.)
giving:
#fsdb size mean stddev pct_rsd
1024 1496200 284970 19.047
10240 5028600 601030 11.952
102400 4921600 309390 6.2863
# | dbcoldefine size bw
# | dbmultistats -k size bw
# | dbcol size mean stddev pct_rsd
# | dbroweval { _mean = sprintf("%8.0f", _mean); _stddev = sprintf("%8.0f", _stddev); }
In a few lines, raw data is transformed to processed output.
Suppose you expect there is an odd distribution of results of one
datapoint. Fsdb can easily produce a CDF (cumulative distribution
function) of the data, suitable for graphing:
cat DB/DATA/http_bandwidth | \
dbcoldefine size bw | \
dbrow '_size == 102400' | \
dbcol bw | \
dbsort -n bw | \
dbrowenumerate | \
dbcolpercentile count | \
dbcol bw percentile | \
xgraph
The steps, roughly: 1. get the raw input data and turn it into fsdb
format, 2. pick out just the relevant rows and column (for efficiency)
and sort them, 3. for each data point, assign a CDF percentage to it,
4. pick out the two columns to graph and show them.
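The sort, enumerate, and percentile steps can be compressed into one small sketch. It uses sort and awk only, takes one number per line with no header, and holds all values in memory. Defining the fraction as rank divided by the number of points is one common empirical-CDF convention and an assumption here; dbcolpercentile may compute its values differently. The name cdf is hypothetical.

```shell
# cdf: read one number per line on stdin, print each value with the
# fraction of points at or below its rank in sorted order.
cdf() {
    sort -n | awk '
        { v[NR] = $1 }
        END {
            print "#fsdb value fraction"
            for (i = 1; i <= NR; i++) printf "%s %.3f\n", v[i], i / NR
        }
    '
}
```

The two output columns are what a plotting program needs to draw the curve.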
[1mA GRADEBOOK EXAMPLE[0m
The first commercial program I wrote was a gradebook, so here's how to
do it with Fsdb.
Format your data like DATA/grades.
#fsdb name email id test1
a a@ucla.example.edu 1 80
b b@usc.example.edu 2 70
c c@isi.example.edu 3 65
d d@lmu.example.edu 4 90
e e@caltech.example.edu 5 70
f f@oxy.example.edu 6 90
Or if your students have spaces in their names, use "-F S" and two
spaces to separate each column:
#fsdb -F S name email id test1
alfred aho  a@ucla.example.edu  1  80
butler lampson  b@usc.example.edu  2  70
david clark  c@isi.example.edu  3  65
constantine drovolis  d@lmu.example.edu  4  90
deborah estrin  e@caltech.example.edu  5  70
sally floyd  f@oxy.example.edu  6  90
To compute statistics on an exam, do
cat DATA/grades | dbcolstats test1 | dblistize
giving
#fsdb -R C ...
mean: 77.5
stddev: 10.84
pct_rsd: 13.987
conf_range: 11.377
conf_low: 66.123
conf_high: 88.877
conf_pct: 0.95
sum: 465
sum_squared: 36625
min: 65
max: 90
n: 6
...
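Several of these figures follow directly from the running sum and sum of squares. The sketch below reproduces the mean, sample standard deviation, sum, sum_squared, and n for a single column of numbers. It is an awk illustration with a hypothetical name (col_stats), not the algorithm dbcolstats uses, and it omits the confidence-interval fields.

```shell
# col_stats: summary statistics for one number per line on stdin.
# Lines starting with "#" (header, comments) are skipped.
col_stats() {
    awk '
        /^#/ { next }
        { n++; sum += $1; sumsq += $1 * $1 }
        END {
            if (n < 2) exit 1                       # stddev needs two points
            mean = sum / n
            stddev = sqrt((sumsq - sum * sum / n) / (n - 1))   # sample std. dev.
            printf "mean: %.5g\nstddev: %.5g\n", mean, stddev
            printf "sum: %d\nsum_squared: %d\nn: %d\n", sum, sumsq, n
        }
    '
}
```

Fed the six test1 scores (80, 70, 65, 90, 70, 90), it reports a mean of 77.5 and a standard deviation of 10.84, in line with the listing above.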
To do a histogram:
cat DATA/grades | dbcolhisto -n 5 -g test1
giving
#fsdb low histogram
65 *
70 **
75
80 *
85
90 **
# | /home/johnh/BIN/DB/dbhistogram -n 5 -g test1
Now you want to send out grades to the students by e-mail. Create a
form-letter (in the file [4mtest1.txt[24m):
To: _email (_name)
From: J. Random Professor
Subject: test1 scores
_name, your score on test1 was _test1.
86+ A
75-85 B
70-74 C
0-69 F
Generate the shell script that will send the mail out:
cat DATA/grades | dbformmail test1.txt > test1.sh
And run it:
sh passwd.fsdb
To convert the group file
cat /etc/group | sed 's/:/ /g' | \
dbcoldefine -F S group password gid members \
>group.fsdb
To show the names of the groups that div7-members are in (assuming DIV7
is in the gecos field):
cat passwd.fsdb | dbrow '_gecos =~ /DIV7/' | dbcol login gid | \
dbjoin -i - -i group.fsdb gid | dbcol login group
[1mSHORT EXAMPLES[0m
Which Fsdb programs are the most complicated (based on number of test
cases)?
ls TEST/*.cmd | \
dbcoldefine test | \
dbroweval '_test =~ s@^TEST/([^_]+).*$@$1@' | \
dbrowuniq -c | \
dbsort -nr count | \
dbcolneaten
(Answer: dbmapreduce, then dbcolstats, dbfilealter and dbjoin.)
Stats on an exam (in $FILE, where $COLUMN is the name of the exam)?
cat $FILE | dbcolstats -q 4 $COLUMN | dblistize | dbstripcomments
cat $FILE | dbcolhisto -g -n 20 $COLUMN | dbcolneaten | dbstripcomments
Merging the hw1 column from file hw1.fsdb into grades.fsdb assuming
there's a common student id in column "id":
dbcol id hw1 <hw1.fsdb >t.fsdb
dbjoin -a -e - grades.fsdb t.fsdb id | \
dbsort name | \
dbcolneaten >new_grades.fsdb
Merging two fsdb files with the same rows:
cat file1.fsdb file2.fsdb >output.fsdb
or if you want to clean things up a bit
cat file1.fsdb file2.fsdb | dbstripextraheaders >output.fsdb
or if you want to know where the data came from
for i in 1 2
do
dbcolcreate source $i < file$i.fsdb
done >output.fsdb
(assumes you're using a Bourne-shell compatible shell, not csh).
[1mWARNINGS[0m
As with any tool, one should (which means [4mmust[24m) understand the limits
of the tool.
All Fsdb tools should run in [4mconstant[24m [4mmemory[24m. In some cases (such as
[4mdbcolstats[24m with quartiles, where the whole input must be re-read),
programs will spool data to disk if necessary.
Most tools buffer one or a few lines of data, so memory will scale with
the size of each line. (So lines with many columns, or columns that
hold lots of data, may cause large memory consumption.)
All Fsdb tools should run in constant or at worst "n log n" time.
All Fsdb tools use normal Perl math routines for computation. Although
I make every attempt to choose numerically stable algorithms (although
I also welcome feedback and suggestions for improvement), normal
rounding due to computer floating point approximations can result in
inaccuracies when data spans a large range of precision. (See for
example the [4mdbcolstats_extrema[24m test cases.)
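To illustrate what "numerically stable" can mean in practice, the sketch below uses Welford's online algorithm, which updates the mean and the sum of squared deviations one value at a time instead of subtracting two large sums. This is a well-known technique shown for illustration only; the text above does not say which algorithms Fsdb uses, and stable_stats is a hypothetical name.

```shell
# stable_stats: mean and sample standard deviation of one number per line,
# computed with Welford's online update. Lines starting with "#" are skipped.
stable_stats() {
    awk '
        /^#/ { next }
        {
            n++
            d = $1 - mean
            mean += d / n
            m2 += d * ($1 - mean)     # second factor uses the updated mean
        }
        END {
            if (n < 2) exit 1
            printf "mean: %.5g\nstddev: %.5g\n", mean, sqrt(m2 / (n - 1))
        }
    '
}
```

Adding a large constant to every input shifts the mean but leaves the reported standard deviation unchanged, which is the property a naive sum-of-squares formula tends to lose.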
Any requirements and limitations of each Fsdb tool are documented on its
manual page.
If any Fsdb program violates these assumptions, that is a bug that
should be documented on the tool's manual page or ideally fixed.
Fsdb does depend on Perl's correctness, and Perl (and Fsdb) have some
bugs. Fsdb should work on perl from version 5.10 onward.
[1mHISTORY[0m
There have been four major versions of Fsdb: fsdb-0.x was begun in 1991
for my personal use. Fsdb 1.0 is a complete re-write of the pre-1995
versions, and was distributed from 1995 to 2007. Fsdb 2.0 is a
significant re-write of the 1.x versions to systematically use a
library and threads (although threads were replaced with full processes
in 2.44). Fsdb 3.0 in 2022 adds type specifiers to the schema, mostly
to support use in languages with stronger typing (like Python, Go, and
C).
Fsdb (in its various forms) has been used extensively by its author
since 1991. Since 1995 it's been used by two other researchers at UCLA
and several at ISI. In February 1998 it was announced to the Internet.
Since then it has found a few users, some outside where I work.
Major changes:
0.1 1991: begun for my personal use, to replace awk.
1.0 1997-07-22: first public release.
2.0 2008-01-25: rewrite to use a common library, and starting to use
threads.
2.12 2008-10-16: completion of the rewrite, and first RPM package.
2.44 2013-10-02: replacing threads with processes for improved
performance
3.0 2022-04-04: adding type specifiers to the schema
[1mFsdb 2.0 Rationale[0m
I've thought about fsdb-2.0 for many years, but it was started in
earnest in 2007. Fsdb-2.0 has the following goals:
in-one-process processing
While fsdb is great on the Unix command line as a pipeline between
programs, it should [4malso[24m be possible to set it up to run in a
single process. And if it does so, it should be able to avoid
serializing and deserializing (converting to and from text) data
between each module. (Accomplished in fsdb-2.0: see dbpipeline,
although still needs tuning.)
clean IO API
Fsdb's roots go back to perl4 and 1991, so the fsdb-1.x library is
very, very crufty. More than just being ugly (but it was that
too), this made things like reading from one format file and writing
to another the application's job, when it should be the library's.
(Accomplished in fsdb-1.15 and improved in 2.0: see Fsdb::IO.)
normalized module APIs
Because fsdb modules were added as needed over 10 years, sometimes
the module APIs became inconsistent. (For example, the 1.x
"dbcolcreate" required an empty value following the name of the new
column, but other programs specify empty values with the "-e"
argument.) We should smooth over these inconsistencies.
(Accomplished as each module was ported in 2.0 through 2.7.)
everyone handles all input formats
Given a clean IO API, the distinction between "colized" and
"listized" fsdb files should go away. Any program should be able
to read and write files in any format. (Accomplished in fsdb-2.1.)
Fsdb-2.0 preserves backwards compatibility where possible, but breaks
it where necessary to accomplish the above goals. In August 2008,
Fsdb-2.7 was declared preferred over the 1.x versions. Benchmarking in
2013 showed that threading performed much worse than just using pipes,
because Perl's requirements for data that is shared between multiple
threads are quite heavyweight. Fsdb-2.44 therefore uses threading
"style", but implemented with processes (via my "Freds" library).
Fsdb And Multiple Processors
Fsdb's use of Unix pipelines means Fsdb automatically benefits from
multiprocessor computers---each pipeline stage can run on a separate
core. In addition, compute-intensive Fsdb modules like dbsort and
dbmapreduce are explicitly multi-process and will use as many cores as
they can, up to the number of cores on the local computer.
Although Fsdb takes advantage of as much parallelism as it can, a five
stage pipeline won't necessarily saturate five cores. Pipeline stages
almost always have different amounts of work to do, and some stages are
often data limited. (Dbsort attempts as much parallelism as it can,
and can run 10-way parallel or more over a large enough input dataset.
But it cannot sustain high parallelism because of the requirement that
it produce one global output.)
[1mFsdb 3.0 Rationale[0m
There are two motivations for adding optional typing to Fsdb. First,
languages such as Python and Go would really like type information. As
of 2022 there are now users of those languages, so the basic system
should support them.
Second, while pure text is flexible, it's very inefficient---converting
numbers to and from decimal is thousands of instructions, and binary
encodings are often much smaller than text. In the future, I would
love to have a flag that enables a binary encoding.
Typing is optional---omitting types is never wrong.
One somewhat odd thing about typing is that we reuse the Perl pack
definitions of types, so q (for "quadword") stands for 64-bit integer.
These are perhaps not the most mnemonic choices in 2022, but I would
rather pick someone's existing set than try to define my own.
[1mContributors[0m
Fsdb includes code ported from Geoff Kuenning
("Fsdb::Support::TDistribution").
Fsdb contributors: Ashvin Goel [4mgoel@cse.oge.edu[24m, Geoff Kuenning
[4mgeoff@fmg.cs.ucla.edu[24m, Vikram Visweswariah [4mvisweswa@isi.edu[24m, Kannan
Varadahan [4mkannan@isi.edu[24m, Lars Eggert [4mlarse@isi.edu[24m, Arkadi Gelfond
[4markadig@dyna.com[24m, David Graff [4mgraff@ldc.upenn.edu[24m, Haobo Yu
[4mhaoboy@packetdesign.com[24m, Pavlin Radoslavov [4mpavlin@catarina.usc.edu[24m,
Graham Phillips, Yuri Pradkin, Alefiya Hussain, Ya Xu, Michael
Schwendt, Fabio Silva [4mfabio@isi.edu[24m, Jerry Zhao [4mzhaoy@isi.edu[24m, Ning Xu
[4mnxu@aludra.usc.edu[24m, Martin Lukac [4mmlukac@lecs.cs.ucla.edu[24m, Xue Cai,
Michael McQuaid, Christopher Meng, Calvin Ardi, H. Merijn Brand, Lan
Wei, Hang Guo, Wes Hardaker.
Fsdb includes datasets contributed from NIST ([4mDATA/nist_zarr13.fsdb[24m),
from
, the
NIST/SEMATECH e-Handbook of Statistical Methods, section 1.4.2.8.1.
Background and Data. The source is public domain, and reproduced with
permission.
[1mRELATED WORK[0m
As stated in the introduction, Fsdb is an incompatible reimplementation
of the ideas found in "/rdb". By storing data in simple text files and
processing it with pipelines it is easy to experiment (in the shell)
and look at the output. The original implementation of this idea was
/rdb, a commercial product described in the book UNIX relational
database management: application development in the UNIX environment by
Rod Manis, Evan Schaffer, and Robert Jorgensen (and also at the web
page ).
While Fsdb is inspired by Rdb, it includes no code from it, and Fsdb
makes several different design choices. In particular: rdb attempts to
be closer to a "real" database, with provision for locking, file
indexing. Fsdb focuses on single user use and so eschews these
choices. Rdb also has some support for interactive editing. Fsdb
leaves editing to text editors like emacs or vi.
In August, 2002 I found out Carlo Strozzi extended RDB with his package
NoSQL . According to Mr. Strozzi,
he implemented NoSQL in awk to avoid Perl start-up costs in RDB.
Although I haven't found Perl startup overhead to be a big problem on
my platforms (from old Sparcstation IPCs to 2GHz Pentium-4s), you may
want to evaluate his system. The Linux Journal has a description of
NoSQL at . It seems quite
similar to Fsdb. Like /rdb, NoSQL supports indexing (not present in
Fsdb). Fsdb appears to have richer support for statistics, and, as of
Fsdb-2.x, its support for Perl threading may support faster performance
(one-process, less serialization and deserialization).
[1mRELEASE NOTES[0m
Versions prior to 1.0 were released informally on my web page but were
not announced.
[1m0.0 1991[0m
started for my own research use
[1m0.1 26-May-94[0m
first check-in to RCS
[1m0.2 15-Mar-95[0m
parts now require perl5
[1m1.0, 22-Jul-97[0m
adds autoconf support and a test script.
[1m1.1, 20-Jan-98[0m
support for double space field separators, better tests
[1m1.2, 11-Feb-98[0m
minor changes and release on comp.lang.perl.announce
[1m1.3, 17-Mar-98[0m
+o adds median and quartile options to dbstats
+o adds dmalloc_to_db converter
+o fixes some warnings
+o dbjoin now can run on unsorted input
+o fixes a dbjoin bug
+o some more tests in the test suite
[1m1.4, 27-Mar-98[0m
+o improves error messages (all should now report the program that
makes the error)
+o fixed a bug in dbstats output when the mean is zero
[1m1.5, 25-Jun-98[0m
BUG FIX dbcolhisto, dbcolpercentile now handles non-numeric values like
dbstats
NEW dbcolstats computes zscores and tscores over a column
NEW dbcolscorrelate computes correlation coefficients between two
columns
INTERNAL ficus_getopt.pl has been replaced by DbGetopt.pm
BUG FIX all tests are now ``portable'' (previously some tests ran only
on my system)
BUG FIX you no longer need to have the db programs in your path (fix
arose from a discussion with Arkadi Gelfond)
BUG FIX installation no longer uses cp -f (to work on SunOS 4)
[1m1.6, 24-May-99[0m
NEW dbsort, dbstats, dbmultistats now run in constant memory (using tmp
files if necessary)
NEW dbcolmovingstats does moving means over a series of data
NEW dbcol has a -v option to get all columns except those listed
NEW dbmultistats does quartiles and medians
NEW dbstripextraheaders now also cleans up bogus comments before the
first header
BUG FIX dbcolneaten works better with double-space-separated data
[1m1.7, 5-Jan-00[0m
NEW dbcolize now detects and rejects lines that contain embedded copies
of the field separator
NEW configure tries harder to prevent people from improperly
configuring/installing fsdb
NEW tcpdump_to_db converter (incomplete)
NEW tabdelim_to_db converter: from spreadsheet tab-delimited files to
db
NEW mailing lists for fsdb are "fsdb-announce@heidemann.la.ca.us"
and "fsdb-talk@heidemann.la.ca.us"
To subscribe to either, send mail
to "fsdb-announce-request@heidemann.la.ca.us" or
"fsdb-talk-request@heidemann.la.ca.us" with "subscribe" in the
BODY of the message.
BUG FIX dbjoin used to produce incorrect output if there were extra,
unmatched values in the 2nd table. Thanks to Graham Phillips for
providing a test case.
BUG FIX the sample commands in the usage strings now all should
explicitly include the source of data (typically from "cat foo.fsdb
|"). Thanks to Ya Xu for pointing out this documentation deficiency.
BUG FIX (DOCUMENTATION) dbcolmovingstats had incorrect sample output.
[1m1.8, 28-Jun-00[0m
BUG FIX header options are now preserved when writing with dblistize
NEW dbrowuniq now optionally checks for uniqueness only on certain
fields
NEW dbrowsplituniq makes one pass through a file and splits it into
separate files based on the given fields
NEW converter for "crl" format network traces
NEW anywhere you use arbitrary code (like dbroweval), _last_foo now
maps to the last row's value for field _foo.
OPTIMIZATION comment processing slightly changed so that dbmultistats
now is much faster on files with lots of comments (for example, ~100k
lines of comments and 700 lines of data!) (Thanks to Graham Phillips
for pointing out this performance problem.)
BUG FIX dbstats with median/quartiles now correctly handles singleton
data points.
[1m1.9, 6-Nov-00[0m
NEW dbfilesplit, split a single input file into multiple output files
(based on code contributed by Pavlin Radoslavov).
BUG FIX dbsort now works with perl-5.6
[1m1.10, 10-Apr-01[0m
BUG FIX dbstats now handles the case where there are more n-tiles than
data
NEW dbstats now includes a -S option to optimize work on pre-sorted
data (inspired by code contributed by Haobo Yu)
BUG FIX dbsort now has a better estimate of memory usage when run on
data with very short records (problem detected by Haobo Yu)
BUG FIX cleanup of temporary files is slightly better
[1m1.11, 2-Nov-01[0m
BUG FIX dbcolneaten now runs in constant memory
NEW dbcolneaten now supports "field specifiers" that allow some control
over how wide columns should be
OPTIMIZATION dbsort now tries hard to be filesystem cache-friendly
(inspired by "Information and Control in Gray-box Systems" by the
Arpaci-Dusseau's at SOSP 2001)
INTERNAL t_distr now ported to perl5 module DbTDistr
[1m1.12, 30-Oct-02[0m
BUG FIX dbmultistats documentation typo fixed
NEW dbcolmultiscale
NEW dbcol has -r option for "relaxed error checking"
NEW dbcolneaten has new -e option to strip end-of-line spaces
NEW dbrow finally has a -v option to negate the test
BUG FIX math bug in dbcoldiff fixed by Ashvin Goel (need to check
Scheaffer test cases)
BUG FIX some patches to run with Perl 5.8. Note: some programs
(dbcolmultiscale, dbmultistats, dbrowsplituniq) generate warnings like:
"Use of uninitialized value in concatenation (.)" or "string at
/usr/lib/perl5/5.8.0/FileCache.pm line 98, line 2". Please
ignore this until I figure out how to suppress it. (Thanks to Jerry
Zhao for noticing perl-5.8 problems.)
BUG FIX fixed an autoconf problem where configure would fail to find a
reasonable prefix (thanks to Fabio Silva for reporting the problem)
NEW db_to_html_table: simple conversion to html tables (NO fancy stuff)
NEW dblib now has a function [1mdblib_text2html() [22mthat will do simple
conversion of iso-8859-1 to HTML
[1m1.13, 4-Feb-04[0m
NEW fsdb added to the freebsd ports tree. Maintainer:
"larse@isi.edu"
BUG FIX properly handle trailing spaces when data must be numeric (ex.
dbstats with -FS, see test dbstats_trailing_spaces). Fix from Ning Xu
"nxu@aludra.usc.edu".
NEW dbcolize error message improved (bug report from Terrence Brannon),
and list format documented in the README.
NEW cgi_to_db converts CGI.pm-format storage to fsdb list format
BUG FIX handle numeric synonyms for column names in dbcol properly
ENHANCEMENT "talking about columns" section added to README. Lack of
documentation pointed out by Lars Eggert.
CHANGE dbformmail now defaults to using Mail ("Berkeley Mail") to send
mail, rather than sendmail (sendmail is still an option, but mail
doesn't require running as root)
NEW on platforms that support it (i.e., with perl 5.8), fsdb works fine
with unicode
NEW dbfilevalidate: check a db file for some common errors
[1m1.14, 24-Aug-06[0m
ENHANCEMENT README cleanup
INCOMPATIBLE CHANGE dbcolsplit renamed dbcolsplittocols
NEW dbcolsplittorows split one column into multiple rows
NEW dbcolsregression compute linear regression and correlation for two
columns
ENHANCEMENT csv_to_db: better error handling, normalize field names,
skip blank lines
ENHANCEMENT dbjoin now detects (and fails) if non-joined files have
duplicate names
BUG FIX minor bug fixed in calculation of Student t-distributions
(doesn't change any test output, but may have caused small errors)
[1m1.15, 12-Nov-07[0m
NEW fsdb-1.14 added to the MacOS Fink system. (Thanks to Lars
Eggert for maintaining this port.)
NEW Fsdb::IO::Reader and Fsdb::IO::Writer now provide reasonably clean
OO I/O interfaces to Fsdb files. Highly recommended if you use fsdb
directly from perl. In the fullness of time I expect to reimplement
the entire thing using these APIs to replace the current dblib.pl which
is still hobbled by its roots in perl4.
NEW dbmapreduce now implements a Google-style map/reduce abstraction,
generalizing dbmultistats.
ENHANCEMENT fsdb now uses the Perl build system (Makefile.PL, etc.),
instead of autoconf. This change paves the way to better perl-5-style
modularization, proper manual pages, input of both listize and colize
format for every program, and world peace.
ENHANCEMENT dblib.pl is now moved to Fsdb::Old.pm.
BUG FIX dbmultistats now propagates its format argument (-f). Bug and
fix from Martin Lukac (thanks!).
ENHANCEMENT dbformmail documentation now is clearer that it doesn't
send the mail, you have to run the shell script it writes. (Problem
observed by Unkyu Park.)
ENHANCEMENT adapted to autoconf-2.61 (and then these changes were
discarded in favor of The Perl Way).
BUG FIX dbmultistats memory usage corrected (O(# tags), not O(1))
ENHANCEMENT dbmultistats can now optionally run with pre-grouped input
in O(1) memory
ENHANCEMENT dbroweval -N was finally implemented (eat comments)
[1m2.0, 25-Jan-08[0m
A quiet 2.0 release (gearing up towards complete).
ENHANCEMENT: shifting old programs to Perl modules, with the front-end
program as just a wrapper. In the short-term, this change just means
programs have real man pages. In the long-run, it will mean that one
can run a pipeline in a single Perl program. So far: dbcol, dbroweval,
the new dbrowcount, dbsort, the new dbmerge, the old "dbstats" (renamed
dbcolstats), dbcolrename, dbcolcreate.
NEW: Fsdb::Filter::dbpipeline is an internal-only module that lets one
use fsdb commands from within perl (via threads).
It also provides perl function aliases for the internal modules, so
a string of fsdb commands in perl are nearly as terse as in the
shell:
    use Fsdb::Filter::dbpipeline qw(:all);
    dbpipeline(
        dbcol(qw(name test1)),       # keep only the name and test1 columns
        dbroweval('_test1 += 5;')    # add 5 to test1 in every row
    );
INCOMPATIBLE CHANGE: The old dbcolstats has been renamed
dbcolstatscores. The new dbcolstats does the same thing as the old
dbstats. This incompatibility is unfortunate but normalizes program
names.
CHANGE: The new dbcolstats program always outputs "-" (the default
empty value) for statistics it cannot compute (for example, standard
deviation if there is only one row), instead of the old mix of "-" and
"na".
INCOMPATIBLE CHANGE: The old dbcolstats program, now called
dbcolstatscores, also has different arguments. The "-t mean,stddev"
option is now "--tmean mean --tstddev stddev". See dbcolstatscores for
details.
INCOMPATIBLE CHANGE: dbcolcreate now assumes all new columns get the
default value rather than requiring each column to have an initial
constant value. To change the initial value, use the new "-e" option.
NEW: dbrowcount counts rows, an almost-subset of dbcolstats's "n"
output (except without differentiating numeric/non-numeric input), or
the equivalent of "dbstripcomments | wc -l".
NEW: dbmerge merges two sorted files. This functionality was previously
embedded in dbsort.
INCOMPATIBLE CHANGE: dbjoin's "-i" option to include non-matches is now
renamed "-a", so as to not conflict with the new standard option "-i"
for input file.
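The dbrowcount entry above describes row counting as roughly "strip comments, then count lines". As a rough illustration of that idea only, here is a minimal sketch in plain shell and awk that does not use Fsdb itself. The function name count_rows is hypothetical, and the sketch assumes that every line beginning with "#" (the header and any comments) is not a data row.

```shell
#!/bin/sh
# count_rows: hypothetical helper, not part of Fsdb.
# Reads an Fsdb-style stream on stdin and prints the number of data rows.
# Assumption: the header and all comments start with "#".
count_rows() {
    awk '!/^#/ { n++ } END { print n + 0 }'
}

# Example: two data rows, one header line, one comment line.
printf '%s\n' \
    '#fsdb experiment duration' \
    'ufs_mab_sys 37.2' \
    '# a comment in the middle of the data' \
    'ufs_rcp_real 264.5' | count_rows
```

Unlike dbcolstats's "n" output, this sketch (like the description of dbrowcount) makes no distinction between numeric and non-numeric rows.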
[1m2.1, 6-Apr-08[0m
Another alpha 2.0, but now all converted programs understand both
listize and colize format.
ENHANCEMENT: shifting more old programs to Perl modules. New in 2.1:
dbcolneaten, dbcoldefine, dbcolhisto, dblistize, dbcolize, dbrecolize
ENHANCEMENT dbmerge now handles an arbitrary number of input files, not
just exactly two.
NEW dbmerge2 is an internal routine that handles merging exactly two
files.
INCOMPATIBLE CHANGE dbjoin now specifies inputs like dbmerge2, rather
than assuming the first two arguments were tables (as in fsdb-1).
The old dbjoin argument "-i" is now "-a" or "--type=outer".
A minor change: comments in the source files for dbjoin are now
intermixed with output rather than being delayed until the end.
ENHANCEMENT dbsort now no longer produces warnings when null values are
passed to numeric comparisons.
BUG FIX dbroweval now once again works with code that lacks a trailing
semicolon. (This fix corrects a regression from 1.15.)
INCOMPATIBLE CHANGE dbcolneaten's old "-e" option (to avoid end-of-line
spaces) is now "-E" to avoid conflicts with the standard empty field
argument.
INCOMPATIBLE CHANGE dbcolhisto's old "-e" option is now "-E" to avoid
conflicts. And its "-n", "-s", and "-w" are now "-N", "-S", and "-W" to
correspond.
NEW dbfilealter replaces dbrecolize, dblistize, and dbcolize, but with
different options.
ENHANCEMENT The library routines "Fsdb::IO" now understand both list-
format and column-format data, so all converted programs can now
[4mautomatically[24m read either format. This capability was one of the
milestone goals for 2.0, so yea!
[1m2.2, 23-May-08[0m
Release 2.2 is another 2.x alpha release. Now [4mmost[24m of the commands are
ported, but a few remain, and I plan one last incompatible change (to
the file header) before 2.x final.
ENHANCEMENT
shifting more old programs to Perl modules. New in 2.2:
dbrowaccumulate, dbformmail, dbcolmovingstats, dbrowuniq,
dbrowdiff, dbcolmerge, dbcolsplittocols, dbcolsplittorows,
dbmapreduce, dbmultistats, dbrvstatdiff. Also dbrowenumerate
exists only as a front-end (command-line) program.
INCOMPATIBLE CHANGE
The following programs have been dropped from fsdb-2.x:
dbcoltighten, dbfilesplit, dbstripextraheaders,
dbstripleadingspace.
NEW combined_log_format_to_db to convert Apache logfiles
INCOMPATIBLE CHANGE
Options to dbrowdiff are now [1m-B [22mand [1m-I[22m, not [1m-a [22mand [1m-i[22m.
INCOMPATIBLE CHANGE
dbstripcomments is now dbfilestripcomments.
BUG FIXES
dbcolneaten better handles empty columns; dbcolhisto warning
suppressed (actually a bug in high-bucket handling).
INCOMPATIBLE CHANGE
dbmultistats now requires a "-k" option in front of the key (tag)
field, or if none is given, it will group by the first field (both
like dbmapreduce).
KNOWN BUG
dbmultistats with quantile option doesn't work currently.
INCOMPATIBLE CHANGE
dbcoldiff is renamed dbrvstatdiff.
BUG FIXES
dbformmail was leaving its log message as a command, not a
comment. Oops. No longer.
[1m2.3, 27-May-08 (alpha)[0m
Another alpha release, this one just to fix the critical dbjoin bug
listed below (that happens to have blocked my MP3 jukebox :-).
BUG FIX
Dbsort no longer hangs if given an input file with no rows.
BUG FIX
Dbjoin now works with unsorted input coming from a pipeline (like
stdin). Perl-5.8.8 has a bug (?) that was making this case
fail---opening stdin in one thread, reading some, then reading more
in a different thread caused an lseek which works on files, but
fails on pipes like stdin. Go figure.
BUG FIX / KNOWN BUG
The dbjoin fix also fixed dbmultistats -q (it now gives the right
answer). Although a new bug appeared, messages like:
Attempt to free unreferenced scalar: SV 0xa9dd0c4, Perl
interpreter: 0xa8350b8 during global destruction. So the
dbmultistats_quartile test is still disabled.
[1m2.4, 18-Jun-08[0m
Another alpha release, mostly to fix minor usability problems in
dbmapreduce and client functions.
ENHANCEMENT
dbrow now defaults to running user supplied code without warnings
(as with fsdb-1.x). Use "--warnings" or "-w" to turn them back on.
ENHANCEMENT
dbroweval can now write different format output than the input,
using the "-m" option.
KNOWN BUG
dbmapreduce emits warnings on perl 5.10.0 about "Unbalanced string
table refcount" and "Scalars leaked" when run with an external
program as a reducer.
dbmultistats emits the warning "Attempt to free unreferenced
scalar" when run with quartiles.
In each case the output is correct. I believe these can be
ignored.
CHANGE
dbmapreduce no longer logs a line for each reducer that is invoked.
[1m2.5, 24-Jun-08[0m
Another alpha release, fixing more minor bugs in "dbmapreduce" and
lossage in "Fsdb::IO".
ENHANCEMENT
dbmapreduce can now tolerate non-map-aware reducers that pass back
the key column in output. It also passes the current key as the last
argument to external reducers.
BUG FIX
Fsdb::IO::Reader, correctly handle "-header" option again. (Broken
since fsdb-2.3.)
[1m2.6, 11-Jul-08[0m
Another alpha release, needed to fix DaGronk. One new port, small bug
fixes, and important fix to dbmapreduce.
ENHANCEMENT
shifting more old programs to Perl modules. New in 2.6:
dbcolpercentile.
INCOMPATIBLE CHANGE and ENHANCEMENTS dbcolpercentile arguments changed,
use "--rank" to require ranking instead of "-r". Also, "--ascending"
and "--descending" can now be specified separately, both for
"--percentile" and "--rank".
BUG FIX
Sigh, the sense of the --warnings option in dbrow was inverted. No
longer.
BUG FIX
I found and fixed the string leaks (errors like "Unbalanced string
table refcount" and "Scalars leaked") in dbmapreduce and
dbmultistats. (All "IO::Handle"s in threads must be manually
destroyed.)
BUG FIX
The "-C" option to specify the column separator in dbcolsplittorows
now works again (broken since it was ported).
2.7, 30-Jul-08 beta
The beta release of fsdb-2.x. Finally, all programs are ported. As
statistics, the number of lines of non-library code doubled from 7.5k
to 15.5k. The libraries are much more complete, going from 866 to 5164
lines. The overall number of programs is about the same, although 19
were dropped and 11 were added. The number of test cases has grown
from 116 to 175. All programs are now in perl-5, no more shell scripts
or perl-4. All programs now have manual pages.
Although this is a major step forward, I still expect to rename "jdb"
to "fsdb".
ENHANCEMENT
shifting more old programs to Perl modules. New in 2.7:
dbcolscorrelate, dbcolsregression, cgi_to_db, dbfilevalidate,
db_to_csv, csv_to_db, db_to_html_table, kitrace_to_db,
tcpdump_to_db, tabdelim_to_db, ns_to_db.
INCOMPATIBLE CHANGE
The following programs have been dropped from fsdb-2.x: db2dcliff,
dbcolmultiscale, crl_to_db, ipchain_logs_to_db. They may come
back, but seemed overly specialized. The program dbrowsplituniq
was dropped because it is superseded by dbmapreduce.
dmalloc_to_db was dropped pending test cases and examples.
ENHANCEMENT
dbfilevalidate now has a "-c" option to correct errors.
NEW html_table_to_db provides the inverse of db_to_html_table.
[1m2.8, 5-Aug-08[0m
Change header format, preserving forwards compatibility.
BUG FIX
Complete editing pass over the manual, making sure it aligns with
fsdb-2.x.
SEMI-COMPATIBLE CHANGE
The header of fsdb files has changed; it is now #fsdb, not #h (or
#L), and parsing of -F and -R is also different. See dbfilealter
for the new specification. The v1 file format will be read,
compatibly, but not written.
BUG FIX
dbmapreduce now tolerates comments that precede the first key,
instead of failing with an error message.
[1m2.9, 6-Aug-08[0m
Still in beta; just a quick bug-fix for dbmapreduce.
ENHANCEMENT
dbmapreduce now generates plausible output when given no rows of
input.
[1m2.10, 23-Sep-08[0m
Still in beta, but picking up some bug fixes.
ENHANCEMENT
dbmapreduce now generates plausible output when given no rows of
input.
ENHANCEMENT
dbroweval's warnings option was backwards; now corrected. As a
result, warnings in user code now default off (like in fsdb-1.x).
BUG FIX
dbcolpercentile now defaults to assuming the target column is
numeric. The new option "-N" allows selection of a non-numeric
target.
BUG FIX
dbcolscorrelate now includes "--sample" and "--nosample" options to
compute the sample or full population correlation coefficients.
Thanks to Xue Cai for finding this bug.
[1m2.11, 14-Oct-08[0m
Still in beta, but picking up some bug fixes.
ENHANCEMENT
html_table_to_db is now more aggressive about filling in empty
cells with the official empty value, rather than leaving them blank
or as whitespace.
ENHANCEMENT
dbpipeline now catches failures during pipeline element setup and
exits reasonably gracefully.
BUG FIX
dbsubprocess now reaps child processes, thus avoiding running out
of processes when used a lot.
[1m2.12, 16-Oct-08[0m
Finally, a full (non-beta) 2.x release!
INCOMPATIBLE CHANGE
Jdb has been renamed Fsdb, the flatfile-streaming database. This
change affects all internal Perl APIs, but no shell command-level
APIs. While Jdb served well for more than ten years, it is easily
confused with the Java debugger (even though Jdb was there first!).
It also is too generic to work well in web search engines.
Finally, Jdb stands for ``John's database'', and we're a bit beyond
that. (However, some call me the ``file-system guy'', so one could
argue it retains that meaning.)
If you just used the shell commands, this change should not affect
you. If you used the Perl-level libraries directly in your code,
you should be able to rename "Jdb" to "Fsdb" to move to 2.12.
The jdb-announce list has not yet been renamed, but it will be shortly.
With this release I've accomplished everything I wanted to in
fsdb-2.x. I therefore expect to return to boring, bugfix releases.
[1m2.13, 30-Oct-08[0m
BUG FIX
dbrowaccumulate now treats non-numeric data as zero by default.
BUG FIX
Fixed a perl-5.10ism in dbmapreduce that breaks that program under
5.8. Thanks to Martin Lukac for reporting the bug.
[1m2.14, 26-Nov-08[0m
BUG FIX
Improved documentation for dbmapreduce's "-f" option.
ENHANCEMENT
dbcolmovingstats now computes a moving standard deviation in
addition to a moving mean.
[1m2.15, 13-Apr-09[0m
BUG FIX
Fix a [4mmake[24m [4minstall[24m bug reported by Shalindra Fernando.
[1m2.16, 14-Apr-09[0m
BUG FIX
Another minor release bug: on some systems [4mprogramize_module[24m loses
executable permissions. Again reported by Shalindra Fernando.
[1m2.17, 25-Jun-09[0m
TYPO FIXES
Typo in the [4mdbroweval[24m manual fixed.
IMPROVEMENT
There is no longer a comment line to label columns in [4mdbcolneaten[24m,
instead the header line is tweaked to line up. This change
restores the Jdb-1.x behavior, and means that repeated runs of
dbcolneaten no longer add comment lines each time.
BUG FIX
It turns out [4mdbcolneaten[24m was not correctly handling trailing
spaces when given the "-E" option to suppress them. This
regression is now fixed.
EXTENSION
[1mdbroweval[22m(1) can now handle direct references to the last row via
[4m$lfref[24m, a dubious but now documented feature.
BUG FIXES
Separators set with "-C" in [4mdbcolmerge[24m and [4mdbcolsplittocols[24m were
not properly setting the heading, and null fields were not
recognized. The first bug was reported by Martin Lukac.
[1m2.18, 1-Jul-09 A minor release[0m
IMPROVEMENT
Documentation for [4mFsdb::IO::Reader[24m has been improved.
IMPROVEMENT
The package should now be PGP-signed.
[1m2.19, 10-Jul-09[0m
BUG FIX
Internal improvements to debugging output and robustness of
[4mdbmapreduce[24m and [4mdbpipeline[24m. [4mTEST/dbpipeline_first_fails.cmd[24m re-
enabled.
[1m2.20, 30-Nov-09 (A collection of minor bugfixes, plus a build against[0m
[1mFedora 12.)[0m
BUG FIX
Logging for [4mdbmapreduce[24m with code refs is now stable (it no longer
includes a hex pointer to the code reference).
BUG FIX
Better handling of mixed blank lines in [4mFsdb::IO::Reader[24m (see test
case [4mdbcolize_blank_lines.cmd[24m).
BUG FIX
[4mhtml_table_to_db[24m now handles multi-line input better, and handles
tables with COLSPAN.
BUG FIX
[4mdbpipeline[24m now cleans up threads in an "eval" to prevent "cannot
detach a joined thread" errors that popped up in perl-5.10.
Hopefully this prevents a race condition that causes the test
suites to hang about 20% of the time (in [4mdbpipeline_first_fails[24m).
IMPROVEMENT
[4mdbmapreduce[24m now detects and correctly fails when the input and
reducer have incompatible field separators.
IMPROVEMENT
[4mdbcolstats[24m, [4mdbcolhisto[24m, [4mdbcolscorrelate[24m, [4mdbcolsregression[24m, and
[4mdbrowcount[24m now all take an "-F" option to let one specify the
output field separator (so they work better with [4mdbmapreduce[24m).
BUG FIX
An omitted "-k" from the manual page of [4mdbmultistats[24m is now there.
Bug reported by Unkyu Park.
[1m2.21, 17-Apr-10 bug fix release[0m
BUG FIX
[4mFsdb::IO::Writer[24m now no longer fails with -outputheader => never
(an obscure bug).
IMPROVEMENT
[4mFsdb[24m (in the warnings section) and [4mdbcolstats[24m now more carefully
document how they handle (and do not handle) numerical precision
problems, and other general limits. Thanks to Yuri Pradkin for
prompting this documentation.
IMPROVEMENT
"Fsdb::Support::fullname_to_sortkey" is now restored from "Jdb".
IMPROVEMENT
Documentation for multiple styles of input approaches (including
performance description) added to Fsdb::IO.
[1m2.22, 2010-10-31 One new tool [4mdbcolcopylast[24m and several bug fixes for Perl[0m
[1m5.10.[0m
BUG FIX
[4mdbmerge[24m now correctly handles n-way merges. Bug reported by Yuri
Pradkin.
INCOMPATIBLE CHANGE
[4mdbcolneaten[24m now defaults to [4mnot[24m padding the last column.
ADDITION
[4mdbrowenumerate[24m now takes [1m-N NewColumn [22mto give the new column a name
other than "count". Feature requested by Mike Rouch in January
2005.
ADDITION
New program [4mdbcolcopylast[24m copies the last value of a column into a
new column copylast_column of the next row. New program requested
by Fabio Silva; useful for converting dbmultistats output into
dbrvstatdiff input.
BUG FIX
Several tools (particularly [4mdbmapreduce[24m and [4mdbmultistats[24m) would
report errors like "Unbalanced string table refcount: (1) for
"STDOUT" during global destruction" on exit, at least on certain
versions of Perl (for me on 5.10.1), but similar errors have been
off-and-on for several Perl releases. Although I think my code
looked OK, I worked around this problem with a different way of
handling standard IO redirection.
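The dbcolcopylast entry in this release describes copying a column's value from one row into a new column of the next row. The following is a minimal sketch of that idea in plain shell and awk, not the Fsdb implementation. The function name copylast is hypothetical, and the sketch assumes a simple whitespace-separated file whose header is "#fsdb" followed only by column names (no header options), with "-" used as the empty value for the first row.

```shell
#!/bin/sh
# copylast: hypothetical helper, not part of Fsdb.
# usage: copylast COLUMN_NUMBER < input
# Appends a column holding the previous data row's value of COLUMN_NUMBER.
# Assumptions: whitespace-separated fields, header is "#fsdb col1 col2 ...",
# and the first data row gets "-" because there is no previous row.
copylast() {
    awk -v col="$1" '
        # Header: field 1 is "#fsdb", so column N is named by field N+1.
        /^#fsdb/ { print $0 " copylast_" $(col + 1); next }
        # Pass other comment lines through unchanged.
        /^#/     { print; next }
        # Data row: append the saved value, then remember this row.
        { print $0 " " (seen++ ? last : "-"); last = $col }
    '
}

printf '%s\n' \
    '#fsdb experiment duration' \
    'a 1' \
    'b 2' \
    'c 3' | copylast 2
```

With consecutive values side by side in one row, a later stage can compare each row with its predecessor without keeping state of its own.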
[1m2.23, 2011-03-10 Several small portability bugfixes; improved [4mdbcolstats[0m
[1mfor large datasets[0m
IMPROVEMENT
Documentation to [4mdbrvstatdiff[24m was changed to use "sd" to refer to
standard deviation, not "ss" (which might be confused with sum-of-
squares).
BUG FIX
This documentation about [4mdbmultistats[24m was missing the [4m-k[24m option in
some cases.
BUG FIX
[4mdbmapreduce[24m was failing on MacOS-10.6.3 for some tests with the
error
dbmapreduce: cannot run external dbmapreduce reduce program (perl TEST/dbmapreduce_external_with_key.pl)
The problem seemed to be only in the error, not in operation. On
MacOS, the error is now suppressed. Thanks to Alefiya Hussain for
providing access to a Mac system that allowed debugging of this
problem.
IMPROVEMENT
The [4mcsv_to_db[24m command requires an external Perl library
([4mText::CSV_XS[24m). On computers that lack this optional library,
previously Fsdb would configure with a warning and then test cases
would fail. Now those test cases are skipped with an additional
warning.
BUG FIX
The test suite now supports alternative valid output, as a hack to
account for last-digit floating point differences. (Not very
satisfying :-(
BUG FIX
[4mdbcolstats[24m output for confidence intervals on very large datasets
has changed. Previously it failed for more than 2^31-1 records,
and handling of T-Distributions with thousands of rows was a bit
dubious. Now datasets with more than 10000 records are considered
infinitely large and hopefully correctly handled.
[1m2.24, 2011-04-15 Improvements to fix an old bug in dbmapreduce with[0m
[1mdifferent field separators[0m
IMPROVEMENT
The [4mdbfilealter[24m command had a "--correct" option to work around
incompatible field separators, but it did nothing. Now it
does the correct but sad, data-losing thing.
IMPROVEMENT
The [4mdbmultistats[24m command previously failed with an error message
when invoked on input with a non-default field separator. The root
cause was the underlying [4mdbmapreduce[24m that did not handle the case
of reducers that generated output with a different field separator
than the input. We now detect and repair incompatible field
separators. This change corrects a problem originally documented
and detected in Fsdb-2.20. Bug re-reported by Unkyu Park.
[1m2.25, 2011-08-07 Two new tools, [4mxml_to_db[24m and [4mdbfilepivot[24m, and a bugfix for[0m
[1mtwo people.[0m
IMPROVEMENT
[4mkitrace_to_db[24m now supports a [4m--utc[24m option, which also fixes this
test case for users outside of the Pacific time zone. Bug reported
by David Graff, and also by Peter Desnoyers (within a week of each
other :-)
NEW [4mxml_to_db[24m can convert simple, very regular XML files into Fsdb.
NEW [4mdbfilepivot[24m "pivots" a file, converting multiple rows corresponding
to the same entity into a single row with multiple columns.
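To make the "pivot" idea in the dbfilepivot entry concrete, here is a minimal sketch in plain shell and awk. It is not dbfilepivot and does not reproduce its options or exact output format. The function name pivot is hypothetical, and the sketch assumes three whitespace-separated columns (entity, tag, value), a header of the form "#fsdb col1 col2 col3", and "-" as the empty value for missing cells.

```shell
#!/bin/sh
# pivot: hypothetical helper, not part of Fsdb.
# Reads rows of (entity, tag, value) and prints one row per entity with
# one column per distinct tag, in order of first appearance.
# Assumptions: whitespace-separated fields; "-" marks a missing value.
pivot() {
    awk '
        /^#fsdb/ { keyname = $2; next }     # remember the entity column name
        /^#/     { next }                   # drop other comments
        {
            if (!($1 in seen_key)) { seen_key[$1] = 1; keys[++nk] = $1 }
            if (!($2 in seen_tag)) { seen_tag[$2] = 1; tags[++nt] = $2 }
            val[$1, $2] = $3
        }
        END {
            line = "#fsdb " keyname
            for (j = 1; j <= nt; j++) line = line " " tags[j]
            print line
            for (i = 1; i <= nk; i++) {
                line = keys[i]
                for (j = 1; j <= nt; j++) {
                    cell = ((keys[i], tags[j]) in val) ? val[keys[i], tags[j]] : "-"
                    line = line " " cell
                }
                print line
            }
        }
    '
}

printf '%s\n' \
    '#fsdb name tag score' \
    'ashley hw1 90' \
    'ashley hw2 85' \
    'bob hw1 70' | pivot
```

The sketch holds the whole table in memory, which is fine for illustration but is a simplification compared with a streaming tool.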
[1m2.26, 2011-12-12 Bug fixes, particularly for perl-5.14.2.[0m
BUG FIX
Bugs fixed in [1mFsdb::IO::Reader[22m(3) manual page.
BUG FIX
Fixed problems where dbcolstats was truncating floating point
numbers when sorting. This strange behavior happens as of
perl-5.14.2 and it [4mseems[24m like a Perl bug. I've worked around it
for the test suites, but I'm a bit nervous.
[1m2.27, 2012-11-15 Accumulated bug fixes.[0m
IMPROVEMENT
[4mcsv_to_db[24m now reports errors in CSV input with real diagnostics.
IMPROVEMENT
[4mdbcolmovingstats[24m can now compute median, when given the "-m"
option.
BUG FIX
[4mdbcolmovingstats[24m non-numeric handling (the "-a" option) now works
properly.
DOCUMENTATION
The internal [4mt/test_command.t[24m test framework is now documented.
BUG FIX
[4mdbrowuniq[24m now correctly handles the case where there is no input
(previously it output a blank line, which is a malformed fsdb
file). Thanks to Yuri Pradkin for reporting this bug.
[1m2.28, 2012-11-15 A quick release to fix most rpmlint errors.[0m
BUG FIX
Fixed a number of minor release problems (wrong permissions, old
FSF address, etc.) found by rpmlint.
[1m2.29, 2012-11-20 a quick release for CPAN testing[0m
IMPROVEMENT
Tweaked the RPM spec.
IMPROVEMENT
Modified [4mMakefile.PL[24m to fail gracefully on Perl installations that
lack threads. (Without this fix, I get massive failures in the
non-ithreads test system.)
[1m2.30, 2012-11-25 improvements to perl portability[0m
BUG FIX
Removed unicode character in documentation of [4mdbcolscorrelate[24m so pod
tests will pass. (Sigh, that should work :-( )
BUG FIX
Fixed test suite failures on 5 tests ([4mdbcolcreate_double_creation[0m
was the first) due to Carp's addition of a period. This problem
was breaking Fsdb on perl-5.17. Thanks to Michael McQuaid for
helping diagnose this problem.
IMPROVEMENT
The test suite now prints out the names of tests it tries.
[1m2.31, 2012-11-28 A release with actual improvements to dbfilepivot and[0m
[1mdbrowuniq.[0m
BUG FIX
Documentation fixes: typos in dbcolscorrelate, bugs in
dbfilepivot, clarification for comment handling in
Fsdb::IO::Reader.
IMPROVEMENT
Previously dbfilepivot assumed the input was grouped by keys and
didn't verify that pre-condition. Now there is no pre-condition (it
will sort the input by default), and it checks if the invariant is
violated.
BUG FIX
Previously dbfilepivot failed if the input had comments (oops :-);
no longer.
IMPROVEMENT
Now dbrowuniq has the "-L" option to preserve the last unique row
(instead of the first), a common idiom.
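The "preserve the last unique row" idiom mentioned for dbrowuniq can be sketched in plain shell and awk as follows. This is an illustration of the idea only, not dbrowuniq itself. The function name uniq_last is hypothetical, and the sketch assumes whitespace-separated fields and that "unique" means a run of adjacent rows sharing the same value in one key column.

```shell
#!/bin/sh
# uniq_last: hypothetical helper, not part of Fsdb.
# usage: uniq_last KEY_COLUMN_NUMBER < input
# For each run of adjacent rows with the same key, print only the last row.
# Assumptions: whitespace-separated fields; header and comment lines start
# with "#" and are passed through as soon as they are read, so a comment
# inside a run can appear before the row that closes that run.
uniq_last() {
    awk -v col="$1" '
        /^#/ { print; next }
        have && $col != key { print row }     # key changed: emit the held row
        { key = $col; row = $0; have = 1 }    # hold the newest row of the run
        END { if (have) print row }           # emit the final run
    '
}

printf '%s\n' \
    '#fsdb event time' \
    'a 1' \
    'a 2' \
    'b 3' \
    'a 4' | uniq_last 1
```

Because only adjacent rows are compared, the final "a 4" row starts a new run, so unsorted input can yield the same key more than once.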
[1m2.32, 2012-12-21 Test suites should now be more numerically robust.[0m
NEW New dbfilediff does fsdb-aware file differencing. It does not do
smart intuition of add/removes like Unix [1mdiff[22m(1), but it does know
about columns, and with "-E", it does numeric-aware differences.
IMPROVEMENT
Test suites that are numeric now use dbfilediff to do numeric-aware
comparisons, so the test suite should now be robust to slightly
different computers and operating systems and compilers than
[4mexactly[24m what I use.
[1m2.33, 2012-12-23 Minor fixes to some test cases.[0m
IMPROVEMENT
dbfilediff and dbrowuniq now support the "-N" option to give the
new column a different name. (And test cases where this
duplication mattered have been fixed.)
IMPROVEMENT
dbrvstatdiff now shows the t-test breakpoint with a reasonable
number of floating point digits.
BUG FIX
Fixed a numerical stability problem in the [4mdbroweval_last[24m test
case.
[1mWHAT'S NEW[0m
[1m2.34, 2013-02-10 Parallelism in dbmerge.[0m
IMPROVEMENT
Documentation for dbjoin now includes resource requirements.
IMPROVEMENT
Default memory usage for dbsort is now about 256MB. (The world
keeps moving forward.)
IMPROVEMENT
dbmerge now does merging in parallel. As a side-effect, dbsort
should be faster when input overflows memory. The level of
parallelism can be limited with the "--parallelism" option. (There
is more work to do here, but we're off to a start.)
[1m2.35, 2013-02-23 Improvements to dbmerge parallelism[0m
BUG FIX
Fsdb temporary files are now created more securely (with
File::Temp).
IMPROVEMENT
Programs that sort or merge on fields (dbmerge2, dbmerge, dbsort,
dbjoin) now report an error if no fields on which to join or merge
are given.
IMPROVEMENT
Parallelism in dbmerge should now be more consistent, with less
starting and stopping.
IMPROVEMENT In dbmerge, the "--xargs" option lets one give input
filenames on standard input, rather than the command line. This feature
paves the way for faster dbsort for large inputs (by pipelining sorting
and merging), expected in the next release.
[1m2.36, 2013-02-25 dbsort pipelines with dbmerge[0m
IMPROVEMENT For large inputs, dbsort now pipelines sorting and merging,
allowing earlier processing.
BUG FIX Since 2.35, dbmerge delayed cleanup of intermediate files,
thereby requiring extra disk space.
[1m2.37, 2013-02-26 quick bugfix to support parallel sort and merge from[0m
[1mrecent releases[0m
BUG FIX Since 2.35, dbmerge delayed removal of input files given by
"--xargs". This problem is now fixed.
[1m2.38, 2013-04-29 minor bug fixes[0m
CLARIFICATION
Configure now rejects Windows since tests seem to hang on some
versions of Windows. (I would love help from a Windows developer
to get this problem fixed, but I cannot do it.) See
[4mhttps://rt.cpan.org/Ticket/Display.html?id=84201[24m.
IMPROVEMENT
All programs that use temporary files (dbcolpercentile,
dbcolscorrelate, dbcolstats, dbcolstatscores) now take the "-T"
option and set the temporary directory consistently.
In addition, error messages are better when the temporary directory
has problems. Problem reported by Liang Zhu.
BUG FIX
dbmapreduce was failing with external, map-reduce aware reducers
(when invoked with -M and an external program). (Sigh, did this
case ever work?) This case should now work. Thanks to Yuri
Pradkin for reporting this bug (in 2011).
BUG FIX
Fixed perl-5.10 problem with dbmerge. Thanks to Yuri Pradkin for
reporting this bug (in 2013).
[1m2.39, date 2013-05-31 quick release for the dbrowuniq extension[0m
BUG FIX
Actually in 2.38, the Fedora [4m.spec[24m got cleaner dependencies.
Suggestion from Christopher Meng via
.
ENHANCEMENT
Fsdb files are now explicitly set into UTF-8 encoding, unless one
specifies "-encoding" to "Fsdb::IO".
ENHANCEMENT
dbrowuniq now supports "-I" for incremental counting.
[1m2.40, 2013-07-13 small bug fixes[0m
BUG FIX
dbsort now has more respect for a user-given temporary directory;
it no longer is ignored for merging.
IMPROVEMENT
dbrowuniq now has options to output the first, last, and both first
and last rows of a run ("-F", "-L", and "-B").
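The idea of emitting the first, last, or both edge rows of a run can be sketched in Python as follows. This is only an illustration, not dbrowuniq's implementation: the function name `run_edges`, the `mode` strings, and the use of a key function to decide which rows belong to the same run are all assumptions made for the example.

```python
from itertools import groupby


def run_edges(rows, key, mode="first"):
    """Yield edge rows of each run of adjacent rows sharing the same key."""
    for _, run in groupby(rows, key=key):
        run = list(run)
        if mode == "first":
            yield run[0]
        elif mode == "last":
            yield run[-1]
        else:  # "both": first and last, but a one-row run is emitted once
            yield run[0]
            if len(run) > 1:
                yield run[-1]


rows = [("a", 1), ("a", 2), ("a", 3), ("b", 4), ("c", 5), ("c", 6)]
first = list(run_edges(rows, key=lambda r: r[0], mode="first"))
last = list(run_edges(rows, key=lambda r: r[0], mode="last"))
both = list(run_edges(rows, key=lambda r: r[0], mode="both"))
```

Note that in "both" mode the single-row run for "b" appears once, not twice.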
BUG FIX
dbrowuniq now correctly handles "-N". Sigh, it didn't work before.
[1m2.41, 2013-07-29 small bug and packaging fixes[0m
ENHANCEMENT
Documentation to dbrvstatdiff improved (inspired by questions from
Qian Kun).
BUG FIX
dbrowuniq no longer duplicates singleton unique lines when
outputting both (with "-B").
BUG FIX
Add missing "XML::Simple" dependency to [4mMakefile.PL[24m.
ENHANCEMENT
Tests now show the diff of the failing output if run with "make
test TEST_VERBOSE=1".
ENHANCEMENT
dbroweval now includes documentation for how to output extra rows.
Suggestion from Yuri Pradkin.
BUG FIX
Several improvements to the Fedora package from Michael Schwendt
via , and from
the harsh master that is [4mrpmlint[24m. (I am stymied at teaching it
that "outliers" is spelled correctly. Maybe I should send it
Schneier's book. And an unresolvable invalid-spec-name lurks in
the SRPM.)
[1m2.42, 2013-07-31 A bug fix and packaging release.[0m
ENHANCEMENT
Documentation for dbjoin improved to better describe memory usage.
(Based on a problem report by Lin Quan.)
BUG FIX
The [4m.spec[24m is now [4mperl-Fsdb.spec[24m to satisfy [4mrpmlint[24m. Thanks to
Christopher Meng for a specific bug report.
BUG FIX
Test [4mdbroweval_last.cmd[24m no longer has a column that caused failures
because of numerical instability.
BUG FIX
Some tests now better handle bugs in old versions of perl (5.10,
5.12). Thanks to Calvin Ardi for help debugging this on a Mac with
perl-5.12, but the fix should affect other platforms.
[1m2.43, 2013-08-27 Adds in-file compression.[0m
BUG FIX
Changed the sort on [4mTEST/dbsort_merge.cmd[24m to strings (from
numerics) so we're less susceptible to false test-failures due to
floating point IO differences.
EXPERIMENTAL ENHANCEMENT
Yet more parallelism in dbmerge: new "endgame-mode" builds a merge
tree of processes at the end of large merge tasks to get maximal
parallelism. Currently this feature is off by default because it
can hang for some inputs. Enable this experimental feature with
"--endgame".
ENHANCEMENT
"Fsdb::IO" now handles being given "IO::Pipe" objects (as exercised
by dbmerge).
BUG FIX
Handling of NamedTmpfiles now supports concurrency. This fix will
hopefully fix occasional "Use of uninitialized value $_ in string
ne at ...NamedTmpfile.pm line 93." errors.
BUG FIX
Fsdb now requires perl 5.10. This is a bug fix because some test
cases used to require it, but this fact was not properly
documented. (Back-porting to 5.008 would require removing all "//"
operators.)
ENHANCEMENT
Fsdb now handles automatic compression of file contents. Enable
compression with "dbfilealter -Z xz" (or "gz" or "bz2"). All
programs should operate on compressed files and leave the output
with the same level of compression. "xz" is recommended as fastest
and most efficient. "gz" produces unrepeatable output (and so
has no output test); it seems to insist on adding a timestamp.
[1m2.44, 2013-10-02 A major change--all threads are gone.[0m
ENHANCEMENT
Fsdb is now thread free and only uses processes for parallelism.
This change is a big change--the entire motivation for Fsdb-2 was
to exploit parallelism via threading. Parallelism--good, but perl
threading--bad for performance. Horribly bad for performance.
About 20x worse than pipes on my box. (See perl bug #119445 for
the discussion.)
NEW "Fsdb::Support::Freds" provides a thread-like abstraction over
forking, with some nice support for callbacks in the parent upon
child termination.
ENHANCEMENT
Details about removing threads: "dbpipeline" is thread free, and
new tests to verify each of its parts. The easy cases are
"dbcolpercentile", "dbcolstats", "dbfilepivot", "dbjoin", and
"dbcolstatscores", each of which uses it in simple ways
(2013-09-09). "dbmerge" is now thread free (2013-09-13), but was a
significant rewrite, which brought "dbsort" along. "dbmapreduce"
is partly thread free (2013-09-21), again as a rewrite, and it
brings "dbmultistats" along. Full "dbmapreduce" support took much
longer (2013-10-02).
BUG FIX
When running with user-only output ("-n"), dbroweval now resets the
output vector $ofref after it has been output.
NEW dbcolcreate will create all columns at the head of each row with
the "--first" option.
NEW dbfilecat will concatenate two files, verifying that they have the
same schema.
ENHANCEMENT
dbmapreduce now passes comments through, rather than eating them as
before.
Also, dbmapreduce now supports a "--" option to prevent
misinterpreting sub-program parameters as for dbmapreduce.
INCOMPATIBLE CHANGE
dbmapreduce no longer figures out if it needs to add the key to the
output. For multi-key-aware reducers, it never does (and cannot).
For non-multi-key-aware reducers, it defaults to add the key and
will now fail if the reducer adds the key (with error "dbcolcreate:
attempt to create pre-existing column..."). In such cases, one
must disable adding the key with the new option "--no-prepend-key".
INCOMPATIBLE CHANGE
dbmapreduce no longer copies the input field separator by default.
For multi-key-aware reducers, it never does (and cannot). For non-
multi-key-aware reducers, it defaults to [4mnot[24m copying the field
separator, but it will copy it (the old default) with the
"--copy-fs" option.
[1m2.45, 2013-10-07 cleanup from de-thread-ification[0m
BUG FIX
Corrected a fast busy-wait in dbmerge.
ENHANCEMENT
Endgame mode enabled in dbmerge; it (and also large cases of
dbsort) should now exploit greater parallelism.
BUG FIX
Test case with "Fsdb::BoundedQueue" (gone since 2.44) now removed.
[1m2.46, 2013-10-08 continuing cleanup of our no-threads version[0m
BUG FIX
Fixed some packaging details. (Really, threads are no longer
required, missing tests in the MANIFEST.)
IMPROVEMENT
dbsort now better communicates with the merge process to avoid
bursty parallelism.
Fsdb::IO::Writer now can take "-autoflush => 1" for line-buffered
IO.
[1m2.47, 2013-10-12 test suite cleanup for non-threaded perls[0m
BUG FIX
Removed some stray "use threads" in some test cases. We didn't
need them, and these were breaking non-threaded perls.
BUG FIX
Better handling of Fred cleanup; should fix intermittent
dbmapreduce failures on BSD.
ENHANCEMENT
Improved test framework to show output when tests fail. (This
time, for real.)
[1m2.48, 2014-01-03 small bugfixes and improved release engineering[0m
ENHANCEMENT
Test suites now skip tests for libraries that are missing. (Patch
for missing "IO::Compress::Xz" contributed by Calvin Ardi.)
ENHANCEMENT
Removed references to Jdb in the package specification. Since the
name was changed in 2008, there's no longer a huge need for
backwards compatibility. (Suggestion from Petr Šabata.)
ENHANCEMENT
Test suites now invoke the perl using the path from
$Config{perlpath}. Hopefully this helps testing in environments
where there are multiple installed perls and the default perl is
not the same as the perl-under-test (as happens in
cpantesters.org).
BUG FIX
Added specific encoding to this manpage to account for Unicode.
Required to build correctly against perl-5.18.
[1m2.49, 2014-01-04 bugfix to unicode handling in Fsdb IO (plus minor[0m
[1mpackaging fixes)[0m
BUG FIX
Restored a line in the [4m.spec[24m to chmod g-s.
BUG FIX
Unicode decoding is now handled correctly for programs that read
from standard input. (Also: New test scripts cover unicode input
and output.)
BUG FIX
Fix to Fsdb documentation encoding line. Addresses test failure in
perl-5.16 and earlier. (Who knew "encoding" had to be followed by
a blank line.)
[1mWHAT'S NEW[0m
[1m2.50, 2014-05-27 a quick release for spec tweaks[0m
ENHANCEMENT
In dbroweval, the "-N" (no output, even comments) option now
implies "-n", and it now suppresses the header and trailer.
BUG FIX
A few more tweaks to the [4mperl-Fsdb.spec[24m from Petr Šabata.
BUG FIX
Fixed 3 uses of "use v5.10" in test suites that were causing test
failures (due to warnings, not real failures) on some platforms.
[1m2.51, 2014-09-05 Feature enhancements to dbcolmovingstats, dbcolcreate,[0m
[1mdbmapreduce, and new sqlselect_to_db[0m
ENHANCEMENT
dbcolcreate now has a "--no-recreate-fatal" that causes it to
ignore creation of existing columns (instead of failing).
ENHANCEMENT
dbmapreduce once again is robust to reducers that output the key;
"--no-prepend-key" is no longer mandatory.
ENHANCEMENT
dbcolsplittorows can now enumerate the output rows with "-E".
BUG FIX
dbcolmovingstats is more mathematically robust. Previously for
some inputs and some platforms, floating point rounding could
sometimes cause square roots of negative numbers.
NEW sqlselect_to_db converts the output of the MySQL or MariaDB select
command into fsdb format.
INCOMPATIBLE CHANGE
dbfilediff now outputs the [4msecond[24m row when doing sloppy numeric
comparisons, to better support test suites.
[1m2.52, 2014-11-03 Fixing the test suite for line number changes.[0m
ENHANCEMENT
Test suites changed to be robust to exact line numbers of failures,
since different Perl releases fail on different lines.
[1m2.53, 2014-11-26 bug fixes and stability improvements to dbmapreduce[0m
ENHANCEMENT
dbfilediff now supports a "--quiet" option.
ENHANCEMENT
Better documentation of dbpipeline_filter.
BUGFIX
Added groff-base and perl-podlators to the Fedora package spec.
Fixes . (Also
in package 2.52-2.)
BUGFIX
An important stability improvement to dbmapreduce. It, plus
dbmultistats, and dbcolstats now support controlled parallelism
with the "--parallelism=N" option. They default to run with the
number of available CPUs. dbmapreduce also moderates its level of
parallelism. Previously it would create reducers as needed,
causing CPU thrashing if reducers ran much slower than data
production.
BUGFIX
The combination of dbmapreduce with dbrowenumerate now works as it
should. (The obscure bug was an interaction with dbcolcreate with
non-multi-key reducers that output their own key. dbmapreduce has
too many useful corner cases.)
[1m2.54, 2014-11-28 fix for the test suite to correct failing tests on not-my-[0m
[1mplatform[0m
BUGFIX
Sigh, the test suite now has a test suite. Because, yes, I broke
it, causing many incorrect failures at cpantesters. Now fixed.
[1m2.55, 2015-01-05 many spelling fixes and dbcolmovingstats tests are more[0m
[1mrobust to different numeric precision[0m
ENHANCEMENT
dbfilediff now can be extra quiet, as I continue to try to track
down a numeric difference on FreeBSD AMD boxes.
ENHANCEMENT
dbcolmovingstats gave different test output (just reflecting
rounding error) when stddev approaches zero. We now detect and
handle this case. See
and thanks
to H. Merijn Brand for the bug report.
BUG FIX
Many, many spelling bugs found by H. Merijn Brand; thanks for the
bug report.
INCOMPATIBLE CHANGE
A number of programs had misspelled "separator" in
"--fieldseparator" and "--columnseparator" options as "seperator".
These are now correctly spelled.
[1m2.56, 2015-02-03 fix against Getopt::Long-2.43's stricter error checking[0m
BUG FIX
Internal argument parsing uses Getopt::Long, but mixed pass-through
and <>. Bug reported by Petr Pisar at
.a
BUG FIX
Added missing BuildRequires for "XML::Simple".
[1m2.57, 2015-04-29 Minor changes, with better performance from dbmultistats.[0m
BUG FIX
dbfilecat now honors "--remove-inputs" (previously it didn't).
This omission meant that dbmapreduce (and dbmultistats) would
accumulate files in [4m/tmp[24m when running. Bad news for inputs with 4M
keys.
ENHANCEMENT
dbmultistats should be faster with lots of small keys. dbcolstats
now supports "-k" to get some of the functionality of dbmultistats
(if data is pre-sorted and median/quartiles are not required).
[1m2.58, 2015-04-30 Bugfix in dbmerge[0m
BUG FIX
Fixed a case where dbmerge suffered mojibake in endgame mode. This
bug surfaced when dbsort was applied to large files (big enough to
require merging) with unicode in them; the symptom was something
like:
Wide character in print at /usr/lib64/perl5/IO/Handle.pm line
420, line 111.
[1m2.59, 2016-09-01 Collect a few small bug fixes and documentation[0m
[1mimprovements.[0m
BUG FIX
More IO is explicitly marked UTF-8 to avoid Perl's tendency to
mojibake on otherwise valid unicode input. This change helps
html_table_to_db.
ENHANCEMENT
dbcolscorrelate now cross-references dbcolsregression.
ENHANCEMENT
Documentation for dbrowdiff now clarifies that the default is
baseline mode.
BUG FIX
dbjoin now propagates "-T" into the sorting process (if it is
required). Thanks to Lan Wei for reporting this bug.
[1m2.60, 2016-09-04 Adds support for hash joins.[0m
ENHANCEMENT
dbjoin now supports hash joins with "-t lefthash" and "-t
righthash". Hash joins cache a table in memory, but do not require
that the other table be sorted. They are ideal when joining a
large table against a small one.
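The hash-join idea described here can be sketched in a few lines of Python. This is an illustration of the general technique only, not dbjoin's code: the function name `hash_join`, the dictionary row representation, and the extra "fs" column are assumptions made for the example.

```python
from collections import defaultdict


def hash_join(small_rows, large_rows, key):
    """Inner join: cache small_rows in memory, then stream large_rows unsorted."""
    index = defaultdict(list)
    for row in small_rows:
        index[row[key]].append(row)
    for row in large_rows:
        # Look up matches by key; the streamed table needs no particular order.
        for match in index.get(row[key], ()):
            yield {**match, **row}


experiments = [{"experiment": "ufs_mab_sys", "fs": "ufs"}]
durations = [
    {"experiment": "ufs_rcp_real", "duration": "264.5"},
    {"experiment": "ufs_mab_sys", "duration": "37.2"},
    {"experiment": "ufs_mab_sys", "duration": "37.3"},
]
joined = list(hash_join(experiments, durations, "experiment"))
```

Only the cached table costs memory, which is why this approach suits a large table joined against a small one.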
[1m2.61, 2016-09-05 Support left and right outer joins.[0m
ENHANCEMENT
dbjoin now handles left and right outer joins with "-t left" and
"-t right".
ENHANCEMENT
dbjoin hash joins are now selected with "-m lefthash" and "-m
righthash" (not the short-lived "-t righthash" option).
(Technically this change is incompatible with Fsdb-2.60, but no one
but me ever used that version.)
[1m2.62, 2016-11-29 A new yaml_to_db and other minor improvements.[0m
ENHANCEMENT
Documentation for xml_to_db now includes sample output.
NEW yaml_to_db converts a specific form of YAML to fsdb.
BUG FIX
The test suite now uses "diff -c -b" rather than "diff -cb" to make
OpenBSD-5.9 happier, I hope.
ENHANCEMENT
Comments that log operations at the end of each file now do simple
quoting of spaces. (It is not guaranteed to be fully shell-
compliant.)
ENHANCEMENT
There is a new standard option, "--header", allowing one to specify
an Fsdb header for inputs that lack it. Currently it is supported
by dbcoldefine, dbrowuniq, dbmapreduce, dbmultistats, dbsort,
dbpipeline.
ENHANCEMENT
dbfilepivot now allows the [1m--possible-pivots [22moption, and if it is
provided processes the data in one pass.
ENHANCEMENT
dbroweval logs are now quoted.
[1m2.63, 2017-02-03 Re-add some features supposedly in 2.62 but not, and add[0m
[1mmore --header options.[0m
ENHANCEMENT
The option [1m-j [22mis now a synonym for [1m--parallelism[22m. (And several
documentation bugs about this option are fixed.)
ENHANCEMENT
Additional support for "--header" in dbcolmerge, dbcol, dbrow, and
dbroweval.
BUG FIX
Version 2.62 was supposed to have this improvement, but did not
(and now does): dbfilepivot now allows the [1m--possible-pivots[0m
option, and if it is provided processes the data in one pass.
BUG FIX
Version 2.62 was supposed to have this improvement, but did not
(and now does): dbroweval logs are now quoted.
[1m2.64, 2017-11-20 several small bugfixes and enhancements[0m
BUG FIX
In dbroweval, the "next row" option previously did not correctly
set up "_last_fieldname". It now does.
ENHANCEMENT
The csv_to_db converter now has an optional "-F x" option to set
the field separator.
ENHANCEMENT
Finally dbcolsplittocols has a "--header" option, and a new "-N"
option to give the list of resulting output columns.
INCOMPATIBLE CHANGE
Now dbcolstats and dbmultistats produce no output (but a schema)
when given no input but a schema. Previously they gave a null row
of output. The "--output-on-no-input" and
"--no-output-on-no-input" options can control this behavior.
[1m2.65, 2018-02-16 Minor release, bug fix and -F option.[0m
ENHANCEMENT
dbmultistats and dbmapreduce now both take a "-F x" option to set
the field separator.
BUG FIX
Fixed missing "use Carp" in dbcolstats. Also went back and cleaned
up all uses of croak(). Thanks to Zefram for the bug report.
[1m2.66, 2018-12-20 Critical bug fix in dbjoin.[0m
BUG FIX
Removed old tests from MANIFEST. (Thanks to Hang Guo for reporting
this bug.)
IMPROVEMENT
Errors for non-existing input files now include the bad filename
(before: "cannot setup filehandle", now: "cannot open input: cannot
open TEST/bad_filename").
BUG FIX
Hash joins with three identical rows were failing with the
assertion failure "internal error: confused about overflow" due to
a now-fixed bug.
[1m2.67, 2019-07-10 add support for reading and writing hdfs[0m
IMPROVEMENT
dbformmail now has an "mh" mechanism that writes messages to
individual files (an mh-style mailbox).
BUG FIX
dbrow failed to include the Carp library, leading to failures on
croak.
BUG FIX
The dbjoin error message for an unsorted right stream was
incorrect (it said left); it is now fixed.
IMPROVEMENT
All Fsdb programs can now read from and write to HDFS, when files
that start with "hdfs:" are given to -i and -o options.
[1m2.68, 2019-09-19 All programs now support automatic decompression based on[0m
[1mfile extension.[0m
IMPROVEMENT
The omitted-possible-error test case for dbfilepivot now has an
alternative output that I saw on some BSD-running systems (thanks
to CPAN).
IMPROVEMENT
dbmerge and dbmerge2 now support "--header". dbmerge2 now gives
better error messages when presented the wrong number of inputs.
BUG FIX
dbsort now works with "--header" even when the file is big (due to
fixes to dbmerge).
IMPROVEMENT
csv_to_db now processes data with the "binary" option, allowing it
to handle newlines embedded in quoted fields.
IMPROVEMENT
All programs now will transparently decompress input files, if they
are given as an input filename that ends with a standard
extension (.gz, .bz2, or .xz).
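Choosing a decompressor from the filename extension can be sketched as below. This is a minimal Python illustration under stated assumptions, not how the Fsdb programs do it: the `OPENERS` table and the function name `open_input` are hypothetical, and the demonstration file is created only for the example.

```python
import bz2
import gzip
import lzma
import os
import tempfile

# Assumed mapping from standard extensions to opener functions.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_input(path):
    """Open path as UTF-8 text, decompressing if the extension is recognized."""
    for ext, opener in OPENERS.items():
        if path.endswith(ext):
            return opener(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


# Demonstration: write a small compressed file, then read it back transparently.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.fsdb.gz")
    with gzip.open(path, "wt", encoding="utf-8") as out:
        out.write("#fsdb experiment duration\nufs_mab_sys 37.2\n")
    with open_input(path) as stream:
        lines = stream.read().splitlines()
```

A file without a recognized extension falls through to a plain text open, so uncompressed inputs behave as before.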
[1m2.69, 2019-11-22 a small bugfix in dbcolstats[0m
BUG FIX
Filled in the test case for autodecompress, which was missing
for the 2.68 release.
ENHANCEMENT
The groff program is required for build, and the "Makefile.PL"
fails if groff is missing at build time. Thanks to Chris Williams
for suggesting this check, and the CPAN auto-building system for
trying many platforms.
BUG FIX
The dbcolstats program had numerical instability that sometimes
resulted in failing with a square-root of a negative number when
many values varied right at the edge of floating-point precision.
We now detect and report that case as 0 stddev. Thanks to Hang Guo
for providing a test case.
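The failure mode is easy to reproduce with the sum and sum-of-squares formula: when all values are nearly identical, rounding can make the computed variance slightly negative. The Python sketch below shows one way to guard against that by clamping the variance at zero. It is an assumption-laden illustration, not dbcolstats' actual code, and the name `mean_and_stddev` is hypothetical.

```python
import math


def mean_and_stddev(values):
    """Return (mean, sample standard deviation) from running sums."""
    n = len(values)
    total = sum(values)
    sum_sq = sum(v * v for v in values)
    mean = total / n
    if n < 2:
        return mean, 0.0
    variance = (sum_sq - total * total / n) / (n - 1)
    # Rounding can push a true zero variance slightly below zero;
    # clamp so that sqrt never sees a negative number.
    return mean, math.sqrt(max(variance, 0.0))


same_mean, same_sd = mean_and_stddev([0.1] * 10)
mixed_mean, mixed_sd = mean_and_stddev([1.0, 2.0, 3.0, 4.0])
```

With ten copies of 0.1 the standard deviation comes out as zero or a tiny positive number instead of raising an error.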
[1m2.70, 2020-11-12 Some small quality-of-life enhancements and corner-case[0m
[1mbugfixes.[0m
ENHANCEMENT
dbcol can now take an option "-a" to include all columns, allowing
reordering of certain columns while passing the rest through.
ENHANCEMENT
dbrowuniq and dbmerge now buffer comments in a way that the last
row of data output is no longer in the last block of comments.
(The data is identical, but for humans looking at output, this
change makes it less likely to lose the last row.)
BUG FIX
dbmultistats and dbpipeline documentation now indicates that they
support "--header" (something they did since version 2.62 in
2016-11-29, but now documented).
ENHANCEMENT
dbcolcreate now supports "--header".
BUG FIX
Fixed several spelling errors in deprecated programs and removed
information about the no-longer existing FreeBSD and MacOS ports.
Thanks to Calvin Ardi for the patch.
BUG FIX
dbmerge now handles --xargs when only one file is provided (and
passes the file through unchanged). It also throws a clean error
with --xargs if zero files are provided. (To support dbmerge,
dbcol now has an internal "--saveoutput" option.) Thanks to Yuri
Pradkin for reporting the unhandled corner-case.
[1m2.71, 2020-11-16 Fix a race condition breaking test suites.[0m
BUG FIX
Suppressed a race condition in dbcolmerge that was sometimes throwing the
error "Fsdb::Support::Freds: ending, but running process:
dbmerge:xargs" in the dbmerge_0_xargs test case, on exit.
[1m2.72, 2020-12-01 A small bug and a packaging improvement.[0m
BUG FIX
dbcolhisto now handles the degenerate case where everything has the
same value (previously it would throw "illegal division by zero").
ENHANCEMENT
The spec for Fedora now includes "make" as BuildRequires, something
required for Fedora 34.
[1m2.73, 2021-05-18 Updates dbcolpercentile with "--weighted", and with more[0m
[1mipv6.[0m
ENHANCEMENT
dbcolpercentile now has a "--weighted" option.
ENHANCEMENT
The new Fsdb::Support::IPv6 package includes ipv6_normalize,
ipv6_zeroize to rewrite ipv6 print addresses in IPv6 normal form,
with a 0 in each 4-nybble field.
[1m2.74, 2021-06-23 More ipv6.[0m
ENHANCEMENT
Fsdb::Support::IPv6 package includes ipv6_fullhex to rewrite ipv6
print addresses as full, 128-bit hex values.
[1m2.75, 2022-04-02 New type specifications in the schema to better support[0m
[1mtype conversions in python.[0m
ENHANCEMENT
Add optional type specifications to the schema. Types are not used
in Perl, but are relevant in Python and Go Fsdb bindings. Types
use a subset of perl pack specifiers: c, s, l, q are signed 8, 16,
32, and 64-bit integers, f is a float, d is double float, a is
utf-8 string, and > and < can force big or little endianness.
The default type for everything is "a", that is, utf-8 strings.
Thanks to Wes Hardaker for pushing to get this long-desired feature
out the door; his Python bindings need types.
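As a rough illustration of why typed bindings want this information, the Python sketch below converts a text field according to one of the type specifiers listed above. It is not taken from the Fsdb Python bindings: the function name `convert`, the `INT_BITS` table, and the range check on integers are assumptions made for the example, and Python's float is a double, so "f" and "d" are treated alike here.

```python
# Signed integer widths for the integer specifiers named in the text.
INT_BITS = {"c": 8, "s": 16, "l": 32, "q": 64}


def convert(text, spec="a"):
    """Convert one text field using a pack-style type specifier."""
    # Endianness markers do not matter for values stored as text.
    base = spec.strip("<>")
    if base in INT_BITS:
        value = int(text)
        limit = 1 << (INT_BITS[base] - 1)
        if not -limit <= value < limit:
            raise ValueError(f"{text} does not fit in type {base}")
        return value
    if base in ("f", "d"):
        return float(text)
    if base == "a":
        return text  # default type: a utf-8 string
    raise ValueError(f"unknown type specifier: {spec}")


count = convert("37", "c")
duration = convert("37.2", "d")
name = convert("ufs_mab_sys")
```

Leaving the specifier off falls back to "a", matching the stated default of treating everything as a string.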
ENHANCEMENT
dbcol, dbcolcreate, dbcolcopylast, and dbcolrename now understand
and propagate schema types. dbsort, dbjoin, dbmerge, dbmerge2 and
dbfilepivot all take a new option "-t" to sort by type-inferred
comparison, if a type is given.
ENHANCEMENT
dbcolstats, dbmultistats, and dbcolmovingstats now include type
information in their output schema. (They assume input variables
are floats, not integers.)
ENHANCEMENT
Even more IPv6: the functions in Fsdb::Support::IPv6 package now
support strings of hex digits as an alternate encoding for IP
address (and they are already the output of ipv6_fullhex), and
"ip_fullhex_to_normal" converts full hex-encoded IPv4 or IPv6
addresses to their "normal" form (dotted-quad or IPv6 printable
format).
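The two encodings can be illustrated with Python's standard ipaddress module. This sketch is not the Fsdb::Support::IPv6 code. The function names `fullhex` and `fullhex_to_normal` are hypothetical stand-ins, and telling IPv4 from IPv6 by the number of hex digits (8 versus 32) is an assumption made for the example.

```python
import ipaddress


def fullhex(address):
    """Return an IPv6 printable address as 32 hex digits (128 bits)."""
    return ipaddress.IPv6Address(address).packed.hex()


def fullhex_to_normal(hexstr):
    """Convert 8 hex digits to dotted-quad IPv4, or 32 hex digits to IPv6 text."""
    raw = bytes.fromhex(hexstr)
    if len(raw) == 4:
        return str(ipaddress.IPv4Address(raw))
    if len(raw) == 16:
        return str(ipaddress.IPv6Address(raw))
    raise ValueError("expected 8 or 32 hex digits")


as_hex = fullhex("2001:db8::1")
v6_text = fullhex_to_normal(as_hex)
v4_text = fullhex_to_normal("c0a80001")
```

Fixed-width hex strings sort and compare as plain text, which is convenient in a flat-text database.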
[1m3.0, 2022-04-04 Complete type support and accordingly bump major version.[0m
NEW The major version number is now 3.0 to correspond to the addition
of types (although they were actually added in 2.75). Old fsdb
files are supported (Fsdb-3.0 is backwards compatible with
databases), but older versions will confuse types in new files (new
Fsdb files are not forward compatible with old versions).
ENHANCEMENT
Type specifications in a few more programs: dbcolhisto,
dbcolscorrelate, dbcolsregression, dbcolstatscores,
dbrowaccumulate, dbrowcount, dbrowdiff, dbrvstatdiff.
ENHANCEMENT
dbcolhisto now puts an empty value on any empty rows.
NEW dbcoltype redefines column types, or clears them with the "-v"
option.
[1m3.1, 2022-11-22 A post-3.0 cleanup release with minor fixes.[0m
ENHANCEMENT
Type specifications in a few more programs that I missed:
dbrowuniq, dbcolpercentile.
ENHANCEMENT
Minor documentation improvements.
[1m3.2, 2023-10-11 Add new module dbcolsdecimate[0m
NEW dbcolsdecimate reduces density in timeseries data to make graphs
with overly dense points visually similar but smaller.
ENHANCEMENT
yaml_to_db now flattens one level of arrays into comma-separated
lists.
ENHANCEMENT
Clearer installation instructions.
[1m3.3, 2023-10-13 Quickly making dbcolsdecimate more flexible.[0m
INCOMPATIBLE ENHANCEMENT
dbcolsdecimate now takes either relative ([1m-p[22m) or absolute ([1m-P[22m)
precision, and precision now affects only subsequent columns.
Also, if absolute precisions are given for all columns, data is not
buffered.
[1mAUTHOR[0m
John Heidemann, "johnh@isi.edu"
See "Contributors" for the many people who have contributed bug reports
and fixes.
[1mCOPYRIGHT[0m
Fsdb is Copyright (C) 1991-2024 by John Heidemann <johnh@isi.edu>.
This program is free software; you can redistribute it and/or modify it
under the terms of version 2 of the GNU General Public License as
published by the Free Software Foundation.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.
You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
675 Mass Ave, Cambridge, MA 02139, USA.
A copy of the GNU General Public License can be found in the file
``COPYING''.
[1mCOMMENTS and BUG REPORTS[0m
Any comments about these programs should be sent to John Heidemann
"johnh@isi.edu".
perl v5.38.2 2024-01-06 [4mFsdb[24m(3)