=head1 NAME

aie - Automatic Information Extraction

=head1 DESCRIPTION

Attempts to extract regular information from non-binary files.

AIE accepts any non-binary file as input. It tries to find a repeating
sequence in the file and then generalizes a regular expression to extract
the information that varies within the repeating structure.

=head1 SYNOPSIS

  $ aie "./Downloadable NLG systems - ACL Wiki.html"
  Extracting major patterns
  Length: 40136
  . ........................................
  Extracting most useful terms
  Chose token: $VAR1 = ' class="';
  Selected instance 133 of 185
  $VAR1 = [
            '(.*) class\\=\\"(.*)ree\\" (.*)re(.*)\\=\\"(.*)\\"\\>(.*)\\<\\/(.*)re(.*) \\<\\/p\\>\\
\\<(.*)re(.*)\\=\\"(.*)fo(.*)\\"',
            '(.*) class\\=\\"(.*)e\\" (.*)\\=\\"(.*)\\"\\>(.*)\\<\\/(.*)\\>\\<\\/(.*) \\
\\<(.*)re(.*)\\=\\"(.*)fo(.*)\\"',
            '(.*) class\\=\\"(.*)ree\\" (.*)re(.*)\\=\\"(.*)\\"\\>(.*)\\<\\/(.*) \\<\\/p\\>\\
\\<(.*)re(.*)\\=\\"(.*)fo(.*)\\"',
            '(.*) class\\=\\"(.*)ree\\" (.*)re(.*)\\=\\"(.*)\\"\\>(.*)\\<\\/(.*) \\<\\/p\\>\\
(.*)fo(.*)cl(.*)as(.*)la(.*)as(.*)re(.*)as(.*)re(.*)re(.*) c(.*)re(.*) \\<\\/p\\> \\<(.*)\\>',
            '(.*) class\\=\\"(.*)ree\\" (.*)re(.*)\\=\\"(.*)\\"\\>(.*)\\<\\/(.*) \\<\\/p\\>\\
(.*)as(.*)re(.*) c(.*)re(.*)rela(.*)as(.*)fo(.*)as(.*) c(.*)la(.*)re(.*)re(.*)\\" (.*)la(.*)as(.*)fo(.*)la(.*)re(.*)cl(.*)re(.*)\\=\\"(.*)fo(.*)\\"',
            '(.*) class\\=\\"(.*)e\\" (.*)\\=\\"(.*)\\"\\>(.*)\\<\\/(.*)\\>\\<\\/(.*) \\
\\<(.*)re(.*)\\=\\"(.*)fo(.*)\\"',
            ' class\\=\\"(.*)ree\\" (.*)re(.*)\\=\\"(.*)\\"\\>(.*)\\<\\/(.*) \\<\\/p\\>\\
(.*)fo(.*) \\<\\/p\\> \\<(.*)\\>',
            '(.*) class\\=\\"(.*)e\\" (.*)\\=\\"(.*)\\"\\>(.*)\\<\\/(.*)\\>\\<\\/(.*) \\
\\<(.*)re(.*)\\=\\"(.*)fo(.*)\\"'
          ];
  $VAR1 = ' class="(.*)e" (.*)="(.*)">(.*)(.*)
<';
  Extracted 23 records
  $VAR1 = [
            [
              'mw-headlin',
              'id',
              'ASTROGEN',
              'ASTROGEN',
              'h2>'
            ],
            [
              'mw-headlin',
              'id',
              'Chimera',
              'Chimera',
              'h2>'
            ],
            [
              'mw-headlin',
              'id',
              'CRISP',
              'CRISP',
              'h2>'
            ],
  ...
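
The idea described under DESCRIPTION (keep the text that repeats, capture the text that varies) can be illustrated with a small sketch. This is not aie's own code: it is written in Python purely for illustration, it uses the standard C<difflib> module as a stand-in for whatever pattern search aie performs, and the function names, the C<min_len> cutoff, and the sample HTML lines are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher
from typing import List, Tuple


def shared_chunks(a: str, b: str, min_len: int = 2) -> List[str]:
    """Return the substrings that a and b share, in order of appearance."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        a[m.a:m.a + m.size]
        for m in matcher.get_matching_blocks()
        if m.size >= min_len  # also skips the zero-length sentinel block
    ]


def generalize(a: str, b: str) -> str:
    """Build a regex: shared text stays literal, varying text becomes (.*)."""
    return "(.*)".join(re.escape(chunk) for chunk in shared_chunks(a, b))


def extract(pattern: str, lines: List[str]) -> List[Tuple[str, ...]]:
    """Apply the generalized regex to each line and collect capture groups."""
    records = []
    for line in lines:
        match = re.search(pattern, line)
        if match:
            records.append(match.groups())
    return records


# Made-up sample input, modelled on the record values in the output above.
SAMPLE = [
    '<span class="mw-headline" id="ASTROGEN">ASTROGEN</span>',
    '<span class="mw-headline" id="Chimera">Chimera</span>',
    '<span class="mw-headline" id="CRISP">CRISP</span>',
]

if __name__ == "__main__":
    # Generalize from two instances, then extract from every instance.
    pattern = generalize(SAMPLE[0], SAMPLE[1])
    for record in extract(pattern, SAMPLE):
        print(record)
```

Generalizing from the first two sample lines and applying the result to all three yields one record per line, each holding the heading name twice (once from the C<id> attribute and once from the element text). Very short shared chunks tend to be coincidental, which is why the sketch drops anything shorter than C<min_len>; the cutoff value is a guess, not a setting taken from aie.
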

=head1 AUTHOR

Andrew John Dougherty

=head1 LICENSE

GPLv3

=head1 INSTALLATION

Using C