=pod
=head1 NAME
Spreadsheet::Reader::ExcelXML - Read xlsx/xlsm/xml extension Excel files
=encoding UTF-8
=head1 SYNOPSIS
The following uses the 'TestBook.xlsx' file found in the t/test_files/ folder of the package.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Spreadsheet::Reader::ExcelXML;

    my $parser   = Spreadsheet::Reader::ExcelXML->new();
    my $workbook = $parser->parse( 'TestBook.xlsx' );

    if ( !defined $workbook ) {
        die $parser->error(), "\n";
    }

    for my $worksheet ( $workbook->worksheets() ) {

        my ( $row_min, $row_max ) = $worksheet->row_range();
        my ( $col_min, $col_max ) = $worksheet->col_range();

        for my $row ( $row_min .. $row_max ) {
            for my $col ( $col_min .. $col_max ) {

                my $cell = $worksheet->get_cell( $row, $col );
                next unless $cell;

                print "Row, Col = ($row, $col)\n";
                print "Value = ", $cell->value(), "\n";
                print "Unformatted = ", $cell->unformatted(), "\n";
                print "\n";
            }
        }
        last;# In order not to read all sheets
    }

    ###########################
    # SYNOPSIS Screen Output
    # 01: Row, Col = (0, 0)
    # 02: Value = Category
    # 03: Unformatted = Category
    # 04:
    # 05: Row, Col = (0, 1)
    # 06: Value = Total
    # 07: Unformatted = Total
    # 08:
    # 09: Row, Col = (0, 2)
    # 10: Value = Date
    # 11: Unformatted = Date
    # 12:
    # 13: Row, Col = (1, 0)
    # 14: Value = Red
    # 16: Unformatted = Red
    # 17:
    # 18: Row, Col = (1, 1)
    # 19: Value = 5
    # 20: Unformatted = 5
    # 21:
    # 22: Row, Col = (1, 2)
    # 23: Value = 2017-2-14 #(shows as 2/14/2017 in the sheet)
    # 24: Unformatted = 41318
    # 25:
    # More intermediate rows ...
    # 82:
    # 83: Row, Col = (6, 2)
    # 84: Value = 2016-2-6 #(shows as 2/6/2016 in the sheet)
    # 85: Unformatted = 40944
    ###########################

=head1 DESCRIPTION
This is an Excel spreadsheet reading package that should parse all Excel files with the
extensions .xlsx, .xlsm, and .xml (Excel 2003 XML) that can be opened in the Excel 2007+
applications. The quick-start example provided in the SYNOPSIS attempts to follow the
example from L<Spreadsheet::ParseExcel> (.xls binary file reader) as closely as possible.
There are additional methods and other approaches that can be used by this package for
spreadsheet reading but the basic access to data from newer xml based Excel files can be
as simple as above.
This is not the only Perl package able to parse .xlsx files on MetaCPAN. For
now it does appear to be the only package that will parse .xlsm and Excel 2003 .xml
workbooks.
There is some documentation throughout this package for users who intend to extend the
package but the primary documentation is intended for the person who uses the package as
is. Parsing through an Excel workbook is done with three levels of classes:
=head2 Workbook level (This doc)
=over
=item * General attribute settings that affect parsing of the file in general
=item * The place to L
=item * Object methods to retrieve document level metadata and worksheets
=back
=head2 Worksheet level
=over
=item * Object methods to return specific cell instances/L
=item * Access to some worksheet level format information (more access pending)
=item * The place to set custom
data output formats targeting specific cell ranges
=back
=head2 Cell level
=over
=item * Access to the cell contents
=item * Access to the cell formats (more access pending)
=back
There are some differences from the L<Spreadsheet::ParseExcel> package. For instance,
in the SYNOPSIS the '$parser' and the '$workbook' are actually the same
class for this package. You could therefore combine both steps by calling ->new with
the 'file' attribute called out. The test for load success would then rely on the
method file_opened. Afterward it is still possible to call ->error
on the instance. Another difference is the data formatter and specifically date
handling. This package uses a pluggable formatter (see the formatter_inst attribute)
that allows for simple and very flexible custom output formats and also handles dates
in the Excel file older than 1-January-1900. I leveraged type coercions to do this but
anything that follows that general format will work here.
The why and nitty gritty of design choices I made are in the L section. Some pitfalls are outlined in the L
section. Read the full documentation for all opportunities!
=head2 Primary Methods
These are the primary ways to use this class. They can be used to open a workbook,
investigate information at the workbook level, and provide ways to access sheets in
the workbook.
All methods are object methods and should be implemented on the object instance.
B<Example:>

    my @worksheet_array = $workbook_instance->worksheets;

=head3 parse( $file_name|$file_handle, $formatter )
=over
B<Definition:> This is a convenience method to match the parse method of L<Spreadsheet::ParseExcel>.
It is one way to set the file attribute [and the formatter_inst attribute].
B<Accepts:>
$file = see the file attribute for valid options (required)
[$formatter] = see the formatter_inst attribute for valid options (optional)
B<Returns:> the same instance of the package (not cloned) when the file is successfully
opened, or undef on failure.
=back
=head3 worksheets
=over
B<Definition:> This method will return an array (not an array ref) containing a list of references
to all worksheets in the workbook as objects. This is not a recommended method because it builds all
worksheet instances at once and returns an array of objects. It is provided for compatibility with
Spreadsheet::ParseExcel. For alternatives see the worksheet( $name ) method and
the get_worksheet_names method.
B<Accepts:> nothing
B<Returns:> an array (list) of worksheet
objects for all worksheets in the workbook.
=back
=head3 worksheet( $name )
=over
B<Definition:> This method will return an object to read values in the identified
worksheet. If no value is passed to $name then the 'next' worksheet in physical order
is returned. I<'next' will NOT wrap>. It also only iterates through the 'worksheets'
in the workbook (not the 'chartsheets').
B<Accepts:> the $name string representing the name of the worksheet object you
want to open. This name is the word visible on the tab when opening the spreadsheet
in Excel (not the underlying zip member file name, which can be different). It will
not accept chart tab names.
B<Returns:> a worksheet object with the
ability to read the worksheet of that name. It returns undef and sets the error attribute
if a 'chartsheet' is requested. In 'next' mode it returns undef if past the last sheet.
B<Example:> using the implied 'next' worksheet;

    while( my $worksheet = $workbook->worksheet ){
        print "Reading: " . $worksheet->name . "\n";
        # get the data needed from this worksheet
    }

=back
=head3 file_name
=over
B<Definition:> If you pass a file $location/$name string to the file attribute then, before
the package converts it to a file handle, it will store the string. You can retrieve that string
with this method. This is true if you pass a string to the parse method as well.
B<Accepts:> nothing
B<Returns:> the $location/$name file string if available.
=back
=head3 file_opened
=over
B<Definition:> This method is the test for success that should be used when opening a workbook
using the ->new method. This allows the object to store the error without dying
entirely.
B<Accepts:> nothing
B<Returns:> 1 if the workbook file was successfully opened
B<Example:>

    use Spreadsheet::Reader::ExcelXML qw( :just_the_data );

    my $workbook = Spreadsheet::Reader::ExcelXML->new( file => 'TestBook.xlsx' );

    if ( !$workbook->file_opened ) {
        die $workbook->error(), "\n";
    }

    for my $worksheet ( $workbook->worksheets ) {
        print "Reading worksheet named: " . $worksheet->get_name . "\n";
        while( 1 ){
            my $cell = $worksheet->get_next_value;
            print "Cell is: $cell\n";
            last if $cell eq 'EOF';
        }
    }

=back
=head3 get_sheet_names
=over
B<Definition:> This method returns an array ref of all the sheet names (tabs) in the
workbook in order. (It includes chartsheets.)
B<Accepts:> nothing
B<Returns:> an array ref of strings
=back
=head3 worksheet_name( $position )
=over
B<Definition:> This returns the name of the worksheet in that $position (counting from zero).
Interspersed chartsheets in the file are not considered to hold a position by this accounting.
B<Accepts:> $position (an integer)
B<Returns:> the worksheet name
B<Example:> To return only worksheet positions 2 through 4 without building them all at once

    for my $x (2..4){
        my $worksheet = $workbook->worksheet( $workbook->worksheet_name( $x ) );
        # Read the worksheet here
    }

=back
=head3 get_worksheet_names
=over
B<Definition:> This method returns an array ref of all the worksheet names (tabs) in the
workbook in order. (No chartsheets.)
B<Accepts:> nothing
B<Returns:> an array ref of strings
B<Example:> Another way to parse a workbook without building all the sheets at
once is;

    for my $sheet_name ( @{$workbook->get_worksheet_names} ){
        my $worksheet = $workbook->worksheet( $sheet_name );
        # Read the worksheet here
    }

=back
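Because get_sheet_names includes chartsheets and get_worksheet_names does not, the
difference between the two lists identifies the chartsheet tabs. The following is a
minimal sketch of that comparison in plain Perl. The helper name chartsheet_names and
the sample tab names are hypothetical; in real use the two array refs would come from
$workbook->get_sheet_names and $workbook->get_worksheet_names.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names found in the full sheet list
# that are missing from the worksheet-only list (i.e. the chartsheets).
sub chartsheet_names {
    my ( $all_sheets, $worksheets ) = @_;
    my %is_worksheet = map { $_ => 1 } @$worksheets;
    return [ grep { !$is_worksheet{$_} } @$all_sheets ];
}

# Made-up sample data standing in for the two method results
my $all_sheets = [ 'Sheet1', 'Chart1', 'Sheet2' ];    # like get_sheet_names
my $worksheets = [ 'Sheet1', 'Sheet2' ];              # like get_worksheet_names

my $charts = chartsheet_names( $all_sheets, $worksheets );
print "Chartsheets: @$charts\n";
```

The order of the full sheet list is preserved, so the result also shows where the
chartsheets sit relative to each other in the workbook.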
=head3 worksheet_count
=over
B<Definition:> This returns the total number of recorded worksheets
B<Accepts:> nothing
B<Returns:> $total - a count of all worksheets (only)
=back
=head2 Attributes
Data passed to new when creating an instance. For modification of these attributes
see the listed 'attribute methods'. For general information on attributes see
L. For additional lesser used workbook options see
L. There are several grouped default values
for these attributes documented in the FLAGS section.
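B<Example:>

    $workbook_instance = Spreadsheet::Reader::ExcelXML->new( %attributes )

I<Note: if the file is not included in the initial %attributes then it must be set afterward, for
example with the parse( $file_name|$file_handle, $formatter ) method, before the rest of the package
can be used.>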
=head3 file
=over
B<Definition:> This attribute holds the file handle for the top level workbook. If a
file name is passed it is coerced into a file handle and stored that way. The
original file name can be retrieved with the method file_name.
B<Default:> no default
B<Required:> yes
B<Range:> any unencrypted xlsx|xlsm|xml file that can be opened in Microsoft Excel 2007+.
B<attribute methods> Methods provided to adjust this attribute
=over
B
=over
B change the file value in the attribute (this will reboot the workbook instance)
=back
=back
=back
=head3 error_inst
=over
B<Definition:> This attribute holds an 'error' object instance. It should have several
methods for managing errors. Currently no error codes or error language translation
options are available but this should make implementation of that easier.
B<Default:> a Spreadsheet::Reader::ExcelXML::Error instance with the attributes set
as;

    ( should_warn => 0 )

B<Range:> See the 'Exported methods' section below for methods required by the workbook.
The error instance must also be able to extract the error string from a passed error
object. For now the current implementation will attempt ->as_string first
and then ->message if an object is passed.
B<attribute methods> Methods provided to manage this attribute
=over
B
=over
B returns this instance
=back
B
=over
B<Definition:> indicates if the error instance has been set
=back
B
The following methods are exported (delegated) to the workbook level
from the stored instance of this class. Links are provided to the default implementation;
=over
L
L
L
L
L
L
L
L
=back
=back
=back
=head3 formatter_inst
=over
B<Definition:> This attribute holds a 'formatter' object instance. This instance does all
the heavy lifting to transform raw text into desired output. It does include
a role that interprets the Excel format string
into a type coercion. The default case is actually built from a number of
different elements on the fly so you can
just call out the replacement base class or role rather than fully building
the formatter prior to calling new on the workbook. However the naming of the
L<interface|http://www.cs.utah.edu/~germain/PPS/Topics/interfaces.html> is locked and should not be
tampered with since it manages the methods to be imported into the workbook;
B<Default:> An instance built from the following
arguments (note the instance itself is not built here)

    {
        superclasses          => ['Spreadsheet::Reader::ExcelXML::FmtDefault'], # base class
        add_roles_in_sequence => [
            'Spreadsheet::Reader::ExcelXML::ParseExcelFormatStrings', # role containing the heavy lifting methods
            'Spreadsheet::Reader::ExcelXML::FormatInterface',         # the interface
        ],
        package => 'FormatInstance', # a formality more than anything
    }

B<Range:> A replacement formatter instance or a set of arguments that will lead to building an acceptable
formatter instance. See the 'Exported methods' section below for all methods required by the
workbook. The FormatInterface is required by name so a replacement of that role requires the same name.
B<attribute methods> Methods provided to manage this attribute
=over
B
=over
B returns the stored formatter instance
=back
B
=over
B sets the formatter instance
=back
B
Additionally the following methods are exported (delegated) to the workbook level
from the stored instance of this class. Links are provided to the default implementation;
=over
B name_the_workbook_uses_to_access_the_method => B
get_formatter_region => L
has_target_encoding => L
get_target_encoding => L
set_target_encoding => L
change_output_encoding => L
set_defined_excel_formats => L
get_defined_conversion => L
parse_excel_format_string => L
set_date_behavior => L
set_european_first => L
set_formatter_cache_behavior => L
set_workbook_for_formatter => L
=back
=back
=back
=head3 count_from_zero
=over
B<Definition:> Excel spreadsheets count from 1. L<Spreadsheet::ParseExcel>
counts from zero. This allows you to choose either way.
B<Default:> 1
B<Range:> 1 = counting from zero like Spreadsheet::ParseExcel,
0 = counting from 1 like Excel
B<attribute methods> Methods provided to adjust this attribute
=over
B
=over
B a way to check the current attribute setting
=back
=back
=back
=head3 file_boundary_flags
=over
B<Definition:> When you request data to the right of the last column or below
the last row of the data this package can return 'EOR' or 'EOF' to indicate that
state. This is especially helpful in 'while' loops. The other option is to
return 'undef'. This is problematic if some cells in your table are empty, since
those also return undef. The determination of what constitutes the last column and
row is selected with the attributes empty_is_end, values_only, and spaces_are_empty.
B<Default:> 1
B<Range:> 1 = return 'EOR' or 'EOF' flags as appropriate, 0 = return undef when
requesting a position that is out of bounds
B<attribute methods> Methods provided to adjust this attribute
=over
B
=over
B a way to check the current attribute setting
=back
=back
=back
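To illustrate why the flags help, here is a small sketch that walks a simulated stream
of returned values and tells empty cells apart from row and file boundaries. The
@returned list is made-up sample data standing in for values handed back by successive
cell requests with file_boundary_flags => 1, and the helper name classify is
hypothetical. The exact order in which a given worksheet method emits the flags is an
assumption here. With the attribute set to 0 the 'EOR' and 'EOF' entries would be undef
and could not be distinguished from the empty cell.

```perl
use strict;
use warnings;

# Hypothetical helper: label one returned value
sub classify {
    my ($value) = @_;
    return 'empty cell'  if !defined $value;
    return 'end of row'  if $value eq 'EOR';
    return 'end of file' if $value eq 'EOF';
    return 'data';
}

# Made-up sample: two rows, the first containing one empty cell
my @returned = ( 'Category', undef, 'Date', 'EOR', 'Red', 5, 'EOF' );

my ( $rows, $cells ) = ( 0, 0 );
for my $value (@returned) {
    my $kind = classify($value);
    if ( $kind eq 'end of file' ) { $rows++; last }    # final row is closed by EOF
    if ( $kind eq 'end of row' )  { $rows++; next }
    $cells++ if $kind eq 'data';
}
print "rows: $rows, cells with data: $cells\n";
```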
=head3 empty_is_end
=over
B<Definition:> The Excel convention is to read the table left to right and top
to bottom. Some tables have an uneven number of columns with real data from row
to row. This allows the several methods that exercise a 'next' function to wrap
after the last element with data rather than going to the max column. This also
can combine with the attribute file_boundary_flags to
trigger 'EOR' flags after the last data element and before the sheet max column
when not implementing 'next' functionality. It will also return 'EOF' if the
remaining rows are empty even if the max row is farther on.
B<Default:> 0
B<Range:> 0 = treat all columns short of the max column for the sheet as being in
the table, 1 = treat all cells after the last cell with data as past the end of
the row. This will be most visible when 'next' functionality is
used in the context of the attribute file_boundary_flags.
B<attribute methods> Methods provided to adjust this attribute
=over
B
=over
B a way to check the current attribute setting
=back
=back
=back
=head3 values_only
=over
B<Definition:> Excel will store information about a cell even if it only contains
formatting data. In many cases you only want to see cells that actually have
values. This attribute will change the package behaviour regarding cells that have
formatting stored against that cell but no actual value. If values in the cells
exist as zero length strings or spaces only you can also set those to empty with
the attribute spaces_are_empty.
B<Default:> 0
B<Range:> 1 = return 'undef' for cells with formatting only,
0 = return the result of empty_return_type (or cell objects)
for cells that only contain formatting.
B<attribute methods> Methods provided to adjust this attribute
=over
B
=over
B a way to check the current attribute setting
=back
=back
=back
=head3 from_the_edge
=over
B<Definition:> Some data tables start in the top left corner. Others do not. I
don't recommend that practice but when acquiring data in the wild it is often good
to adapt. This attribute sets whether the file perceives the minimum row and minimum
column as the top left edge of the sheet or
as the top row with data and the leftmost column with data.
B<Default:> 1
B<Range:> 1 = treat the top left corner of the sheet as the beginning of rows and
columns even if there is no data in the top row or leftmost column, 0 = set the
minimum row and minimum column to be the first row and first column with data
B<attribute methods> Methods provided to adjust this attribute
=over
B
=over
B returns the attribute state
=back
=back
=back
=head3 cache_positions
=over
B<Definition:> Using the standard architecture this parser would go back and
read the sharedStrings and styles files sequentially from the beginning each
time it had to access a sub element. This trade-off is generally not desired
for these two files since the data is generally stored in a less than sequential
fashion. The solution is to cache these files as they are read the first time so
that a second pass through is not necessary to retrieve an earlier element. The
only time this doesn't make sense is if either of the files would overwhelm RAM if
cached. The package has file size break points below which the files will cache.
The thinking is that above these points the RAM is at risk of being overwhelmed
and that being slow without crashing is better than a possible out-of-memory state.
This attribute allows you to change those break points based on the target machine
you are running on. The breaks are set on the byte size of the sub file, not on the
cached expansion of the sub file. In general the styles file is cached into a hash
and the shared strings file is cached into an array ref. The attribute
group_return_type also affects the size of the cache for the
sharedStrings file since it will not cache the string formats unless the attribute
is set to 'instance'. There is also a setting for caching worksheet data. Some
worksheet row position settings will always be cached in order to speed up multiple
reads over the same sheet or to query meta data about the rows. However, this
cache level is set lower since the row caching creates much deeper data structures.
B<Default:>

    {
        shared_strings_interface => 5242880,# 5 MB
        styles_interface         => 5242880,# 5 MB
        worksheet_interface      => 2097152,# 2 MB
    }

B<attribute methods> Methods provided to adjust this attribute
=over
B
=over
B returns the full attribute settings as a hashref
=back
B
=over
B return the max file size allowed to cache for the indicated interface
=back
B $max_file_size )>
=over
B set the $max_file_size in bytes to be cached for the indicated $target_interface
=back
B
=over
B returns true if the $target_interface has a cache size set
=back
=back
=back
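Since the break points are given in bytes, it can be easier to state them in megabytes
and convert. This is a minimal sketch: the helper mb_to_bytes and the chosen sizes are
illustrative assumptions, while the three keys are the ones listed in the default
above.

```perl
use strict;
use warnings;

# Hypothetical helper: convert megabytes to bytes
sub mb_to_bytes { return $_[0] * 1024 * 1024 }

# Illustrative sizes only; tune these for the target machine
my %cache_positions = (
    shared_strings_interface => mb_to_bytes(20),
    styles_interface         => mb_to_bytes(20),
    worksheet_interface      => mb_to_bytes(4),
);

# Show the byte values that would be handed to the workbook
printf "%-26s %10d bytes\n", $_, $cache_positions{$_}
    for sort keys %cache_positions;
```

The resulting hash reference (\%cache_positions) can then be passed as the
cache_positions attribute when calling ->new.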
=head3 show_sub_file_size
=over
B<Definition:> Especially for zip (xlsx and xlsm) files you may not know how big the
file is and want the package to tell you what size it thinks the file is. This
attribute turns on a warning statement that prints to STDERR with information on the
size of potentially cached files.
B<Default:> 0
B<Range:> 0 = don't warn the file size, 1 = send the potentially cached file sizes to
STDERR for review
=back
=head3 group_return_type
=over
B<Definition:> Traditionally ParseExcel returns a cell object with lots of methods
to reveal information about the cell. In reality the extra information is not used very
much. Because many users don't need or
want the extra cell formatting information it is possible to get either the raw xml value,
the raw visible cell value (seen in the Excel formula bar), or the formatted cell value
returned either the way the Excel file specified or the way you specify, instead of a Cell
instance with all the data. All empty cells return undef no matter what.
B<Default:> instance
B<Range:> instance = returns a populated cell instance,
xml_value = returns the string stored in the xml file (for xml based sheets this can sometimes
be different than the visible value in the cell or formula bar),
unformatted = returns just the raw visible value of the cell shown in the Excel formula bar,
value = returns just the formatted value stored in the Excel cell
B<attribute methods> Methods provided to adjust this attribute
=over
B
=over
B a way to check the current attribute setting
=back
=back
=back
=head3 empty_return_type
=over
B<Definition:> Traditionally L<Spreadsheet::ParseExcel> returns an empty string for cells
with unique formatting but no stored value. It may be that the more accurate way of returning
undef works better for you. This will turn that behaviour on.
B<Default:> empty_string
B<Range:>
empty_string = populates the unformatted value with '' even if it is set to undef
undef_string = if Excel stores undef for an unformatted value it will return undef
B<attribute methods> Methods provided to adjust this attribute
=over
B
=over
B a way to check the current attribute setting
=back
=back
=back
=head3 spread_merged_values
=over
B<Definition:> In Excel you see the value of the primary cell in a merged range displayed
across all the cells. This attribute lets the code see the primary value shown in each of the merged
cells. There is some mandatory caching to pull this off so it will consume more memory.
B<Default:> 0 (To match the Excel formula bar, VBscript, and Spreadsheet::ParseExcel)
B<Range:> 0 = don't spread the primary value, 1 = spread the primary value
B<attribute methods> Methods provided to adjust this attribute
=over
B
=over
B a way to check the current attribute setting
=back
=back
=back
=head3 skip_hidden
=over
B<Definition:> Like the previous attribute this attempts to match a visual effect in Excel.
Even though hidden cells still contain values you can't see them. This allows
you to skip hidden rows and columns (not hidden sheets). The one gotcha is that Excel will
place the primary value in the new primary merged cell (formula bar) if a merge range is
only partially obscured in a way that hides the original primary cell. This package can't do that.
Either spread the primary value to all cells or to none.
B<Default:> 0 (To match VBscript and Spreadsheet::ParseExcel)
B<Range:> 0 = don't skip hidden rows and columns, 1 = skip hidden rows and columns
B<attribute methods> Methods provided to adjust this attribute
=over
B
=over
B a way to check the current attribute setting
=back
=back
=back
=head3 spaces_are_empty
=over
B<Definition:> Some auto file generators tend to add empty strings or strings with spaces to
fill empty cells. There may be some visual value in this but they can slow down parsing scripts.
This attribute allows the sheet to treat spaces as empty or undef instead of cells with values.
B<Default:> 0 (To match Excel and Spreadsheet::ParseExcel)
B<Range:> 0 = cells with zero length strings and spaces are considered to have 'values', 1 = there must
be something other than spaces or a zero length string for the cell to have value.
B<attribute methods> Methods provided to adjust this attribute
=over
B
=over
B a way to check the current attribute setting
=back
=back
=back
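The rule described for spaces_are_empty can be pictured with a small regular expression
test. This is only a sketch of the described behaviour, not the package's
implementation. The helper name has_value and the sample strings are hypothetical, and
treating only the space character (not tabs or other whitespace) as blank is an
assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: does a raw cell string count as a value?
sub has_value {
    my ( $raw, $spaces_are_empty ) = @_;
    return 0 if !defined $raw;          # nothing stored at all
    return 1 if !$spaces_are_empty;     # default: '' and '   ' count as values
    return $raw =~ /\A *\z/ ? 0 : 1;    # zero length or only spaces is empty
}

# Compare both settings for a few made-up cell strings
for my $raw ( '', '   ', ' x ', 'Red' ) {
    printf "'%s' => default: %d, spaces_are_empty: %d\n",
        $raw, has_value( $raw, 0 ), has_value( $raw, 1 );
}
```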
=head3 merge_data
=over
B<Definition:> For zip based worksheets the merge data is stored at the end of the file. In order for
the parser to arrive at that point it has to read through the whole sheet first. For big worksheet
files this is very slow. If you are willing to not know or implement cell merge information then turn
this off and the sheet should load much faster.
B<Default:> 1 (collect merge data)
B<Range:> 1 = The merge data is parsed from the worksheet file when it is opened, 0 = No merge data is
parsed. The effect is equal to the cell merges disappearing.
B<attribute methods> Methods provided to adjust this attribute
=over
B
=over
B a way to check the current attribute setting
=back
=back
=back
=head2 FLAGS
The list of parameters (Attributes) that can be passed to ->new is somewhat long.
Therefore you may want a shortcut that aggregates some set of attribute settings that
are not the defaults but wind up being boilerplate. I have provided possible
alternate sets like this and am open to providing others that are suggested. The
flags will have a : in front of the identifier and will be passed to the class in the
'use' statement for consumption by the import method. The flags can be stacked and
where there is conflict between the flag settings the rightmost passed flag setting is
used. If all but one or two settings in a flag are desirable, still use the flag and
then overwrite those settings when calling new.
Example;

    use Spreadsheet::Reader::ExcelXML v0.2 qw( :alt_default :debug );

=head3 :alt_default
This is intended for a deep look at data that skips formatting-only cells.
=over
B
=over
L => 1
L => 0
L => 1
=back
=back
=head3 :just_the_data
This is intended for a shallow look at data with value formatting implemented
=over
B
=over
L => 0
L => 1
L => 1
L => 'value'
L => 0
L => 'undef_string'
L => 1
L => 0
L => 0
=back
=back
=head3 :just_raw_data
This is intended for a shallow look at raw text and skips all formatting including number formats.
=over
B
=over
L => 0
L => 1
L => 1
L => 'xml_value'
L => 0,
L => 'undef_string'
L => 1
L => 0
L => 0
=back
=back
=head3 :like_ParseExcel
This is a way to force some of the other groups back to instance and count from zero
=over
B
=over
L</count_from_zero> => 1
L</group_return_type> => 'instance'
=back
=back
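The 'rightmost flag wins' rule behaves like merging hashes from left to right, with any
attributes passed to ->new applied last. The sketch below models that with plain
hashes. The hash contents are partial and only illustrative: the attribute names for
each group are inferred from the values and descriptions listed for the flags, and the
package's import method may work differently internally.

```perl
use strict;
use warnings;

# Partial, illustrative stand-ins for two flag groups (assumed contents)
my %just_the_data = (
    group_return_type => 'value',
    empty_return_type => 'undef_string',
);
my %like_ParseExcel = (
    count_from_zero   => 1,
    group_return_type => 'instance',
);

# A setting passed explicitly to ->new overrides whatever the flags set
my %new_args = ( empty_return_type => 'empty_string' );

# Later hashes overwrite earlier ones, like
#   use Spreadsheet::Reader::ExcelXML qw( :just_the_data :like_ParseExcel );
my %effective = ( %just_the_data, %like_ParseExcel, %new_args );

print "$_ => $effective{$_}\n" for sort keys %effective;
```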
=head3 :debug
This is a way to turn on as much reporting as possible
=over
B
=over
L</error_inst> ->

    error_inst =>{
        superclasses => ['Spreadsheet::Reader::ExcelXML::Error'],
        package      => 'ErrorInstance',
        should_warn  => 1,
    }

L</show_sub_file_size> => 1
=back
=back
=head3 :lots_of_ram
This opens the caching size allowances way up
=over
B
=over
L</cache_positions> ->

    cache_positions =>{
        shared_strings_interface => 209715200,# 200 MB
        styles_interface         => 209715200,# 200 MB
        worksheet_interface      => 209715200,# 200 MB
    },

=back
=back
=head3 :less_ram
This tightens caching size allowances way down
=over
B
=over
L</cache_positions> ->

    cache_positions =>{
        shared_strings_interface => 10240,# 10 KB
        styles_interface         => 10240,# 10 KB
        worksheet_interface      => 1024,# 1 KB
    },

=back
=back
=head2 Secondary Methods
These are additional ways to use this class. They can be used to open an .xlsx workbook.
They are also ways to investigate information at the workbook level. For information on
how to retrieve data from the worksheets see the
Worksheet and
Cell documentation. For additional workbook
options see the L