Using Perl to convert an Excel Workbook into a MongoDB, with each Worksheet converted into a separate MongoDB collection

Published on
13,248 Points
1 Endorsement
Last Modified:
A year or so back I was asked to have a play with MongoDB; within half an hour I had downloaded,  installed and started the daemon, and had a console window open.

After an hour or two of playing at the command line I created a database or two, a couple of collections and a number of handcrafted JSON documents. At which point I went in search of a GUI and found RockMongo.

Another half hour of playing and I had a Web based GUI, that's great for ad-hoc queries and admin tasks, but your still to left to manually handcraft and enter your own JSON documents, via a textarea box. At which point I realised that if I was to evaluate the map-reduce functionality,  attempt to join data from two collections, let alone identify and evaluate any Business Reporting tools, that I would need to start cutting some code and convert an existing data source.

Data wasn't really an issue as the Companies business was data, so a simple choice  of either hooking up to a DB or using one or more of the many Excel reports kicking around and picking a language.  After a quick surf of the MongoDB site and read of the Perl tutorial I chose the latter and Perl.

After installing the necessary Perl libraries, enter:
cpan YAML Data::Dumper Spreadsheet::ParseExcel Tie::IxHash Encode Scalar::Util  JSON MongoDB MongoDB::OID File::Basename 

Open in new window

A quick play with both the MongoDB and Spreadsheet::ParseExcel examples and a little bit of thought, I had hacked together a very basic (and slightly naughty - blindly inserts without checking the status) command line tool that will happily convert an Excel Workbook (XLS not XLSX) into:
A Database - named after the file
A Series of collections - One per Worksheet present in the  Workbook, and named accordingly
A Series of Documents in each Collection, where each document represents one row from a Worksheet, with it's Key names taken from the Cell (Column) names in Row 1 of the sheet

Anyway enough of a rant, some code.
#!/usr/bin/perl -w
# Purpose: Insert each Worksheet, in an Excel Workbook, into an existing MongoDB, of the same name as the Excel(.xls).
#          The worksheet names are mapped to the collection names, and the column named to the document hash labels.
#          Assumes each sheet is named and that the first ROW on each sheet contains the hash(field) names.
use strict;
use Spreadsheet::ParseExcel;
use MongoDB;
use MongoDB::OID;
use Tie::IxHash;
die "You must provide a filename to $0 to be parsed as an Excel file" unless @ARGV;
my $sDbName              = $ARGV[0];
   $sDbName              =~ s/\.xls//i;
my $oExcel               = new Spreadsheet::ParseExcel;
my $oBook                = $oExcel->Parse($ARGV[0]);
my $oConn                = MongoDB::Connection->new(host => 'some.server:27017');
my $oDB                  = $oConn->$sDbName;
my ($sColName, %hNewDoc, $hColToInsertInto, $sFieldName, $iR, $iC, $oWkS, $oWkC);
print "FILE  :", $oBook->{File} , "\n";
print "DB: $sDbName\n";
print "Collection Count :", $oBook->{SheetCount} , "\n";
for(my $iSheet=0; $iSheet < $oBook->{SheetCount} ; $iSheet++)
 $oWkS                   = $oBook->{Worksheet}[$iSheet];
 $sColName               = $oWkS->{Name};
 $hColToInsertInto       = $oDB->$sColName;
 print "Collection(WorkSheet name):", $sColName, "\n";
 for(my $iR   = $oWkS->{MinRow} ; defined $oWkS->{MaxRow} && $iR <= $oWkS->{MaxRow} ;  $iR++)
  tie ( %hNewDoc, "Tie::IxHash");
  for(my $iC = $oWkS->{MinCol} ; defined $oWkS->{MaxCol} && $iC <= $oWkS->{MaxCol} ; $iC++)
   $sFieldName           = $oWkS->{Cells}[$oWkS->{MinRow}][$iC]->Value;
   $oWkC                 = $oWkS->{Cells}[$iR][$iC];
   $hNewDoc{$sFieldName} = $oWkC->Value if($oWkC && $sFieldName);
 print "Documents inserted(Rows):", ($oWkS->{MaxRow} - $oWkS->{MinRow}), "\n";

Open in new window

Change the connection ($oConn) string to suit, and if needed add a user-id and password to the arguments.
If you need XLSX support a quick switch to Spreadsheet::XLSX is all that's needed. Alternatively it only takes a few lines of code, to detect the filetype and call the appropriate library.
The above is a simple hack, assumes everything in a cell is a string / scalar, if preserving type is important, a little function with a few regexp can be used in conjunction with a few if statements to ensure numbers / dates remain in the applicable format when written to the DB

Apparently the command line is scarey, so if asked to share consider wrapping your logic in a CGI script / upload form :)

The script should return some output along the following lines:
arober11@wibble:~/src/perl> ./mongoTST.pl testData.xls
FILE  :testData.xls
DB: testData
Collection Count :1
Collection(WorkSheet name):Sheet1
Documents inserted(Rows):244

Open in new window

1 Comment
LVL 26

Author Comment

If your looking for a relational DB / SQL equivalent then the following may possibly be of interest.

Featured Post

Get your problem seen by more experts

Be seen. Boost your question’s priority for more expert views and faster solutions

Join & Write a Comment

Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…
Six Sigma Control Plans

Keep in touch with Experts Exchange

Tech news and trends delivered to your inbox every month