Using Perl to convert an Excel Workbook into a MongoDB, with each Worksheet converted into a separate MongoDB collection

A year or so back I was asked to have a play with MongoDB; within half an hour I had downloaded,  installed and started the daemon, and had a console window open.

After an hour or two of playing at the command line I created a database or two, a couple of collections and a number of handcrafted JSON documents. At which point I went in search of a GUI and found RockMongo.

Another half hour of playing and I had a Web based GUI, that's great for ad-hoc queries and admin tasks, but your still to left to manually handcraft and enter your own JSON documents, via a textarea box. At which point I realised that if I was to evaluate the map-reduce functionality,  attempt to join data from two collections, let alone identify and evaluate any Business Reporting tools, that I would need to start cutting some code and convert an existing data source.

Data wasn't really an issue as the Companies business was data, so a simple choice  of either hooking up to a DB or using one or more of the many Excel reports kicking around and picking a language.  After a quick surf of the MongoDB site and read of the Perl tutorial I chose the latter and Perl.

After installing the necessary Perl libraries, enter:
cpan YAML Data::Dumper Spreadsheet::ParseExcel Tie::IxHash Encode Scalar::Util  JSON MongoDB MongoDB::OID File::Basename 

Open in new window

A quick play with both the MongoDB and Spreadsheet::ParseExcel examples and a little bit of thought, I had hacked together a very basic (and slightly naughty - blindly inserts without checking the status) command line tool that will happily convert an Excel Workbook (XLS not XLSX) into:
A Database - named after the file
A Series of collections - One per Worksheet present in the  Workbook, and named accordingly
A Series of Documents in each Collection, where each document represents one row from a Worksheet, with it's Key names taken from the Cell (Column) names in Row 1 of the sheet

Anyway enough of a rant, some code.
#!/usr/bin/perl -w
                      # Purpose: Insert each Worksheet, in an Excel Workbook, into an existing MongoDB, of the same name as the Excel(.xls).
                      #          The worksheet names are mapped to the collection names, and the column named to the document hash labels.
                      #          Assumes each sheet is named and that the first ROW on each sheet contains the hash(field) names.
                      use strict;
                      use Spreadsheet::ParseExcel;
                      use MongoDB;
                      use MongoDB::OID;
                      use Tie::IxHash;
                      die "You must provide a filename to $0 to be parsed as an Excel file" unless @ARGV;
                      my $sDbName              = $ARGV[0];
                         $sDbName              =~ s/\.xls//i;
                      my $oExcel               = new Spreadsheet::ParseExcel;
                      my $oBook                = $oExcel->Parse($ARGV[0]);
                      my $oConn                = MongoDB::Connection->new(host => 'some.server:27017');
                      my $oDB                  = $oConn->$sDbName;
                      my ($sColName, %hNewDoc, $hColToInsertInto, $sFieldName, $iR, $iC, $oWkS, $oWkC);
                      print "FILE  :", $oBook->{File} , "\n";
                      print "DB: $sDbName\n";
                      print "Collection Count :", $oBook->{SheetCount} , "\n";
                      for(my $iSheet=0; $iSheet < $oBook->{SheetCount} ; $iSheet++)
                       $oWkS                   = $oBook->{Worksheet}[$iSheet];
                       $sColName               = $oWkS->{Name};
                       $hColToInsertInto       = $oDB->$sColName;
                       print "Collection(WorkSheet name):", $sColName, "\n";
                       for(my $iR   = $oWkS->{MinRow} ; defined $oWkS->{MaxRow} && $iR <= $oWkS->{MaxRow} ;  $iR++)
                        tie ( %hNewDoc, "Tie::IxHash");
                        for(my $iC = $oWkS->{MinCol} ; defined $oWkS->{MaxCol} && $iC <= $oWkS->{MaxCol} ; $iC++)
                         $sFieldName           = $oWkS->{Cells}[$oWkS->{MinRow}][$iC]->Value;
                         $oWkC                 = $oWkS->{Cells}[$iR][$iC];
                         $hNewDoc{$sFieldName} = $oWkC->Value if($oWkC && $sFieldName);
                       print "Documents inserted(Rows):", ($oWkS->{MaxRow} - $oWkS->{MinRow}), "\n";

Open in new window

Change the connection ($oConn) string to suit, and if needed add a user-id and password to the arguments.
If you need XLSX support a quick switch to Spreadsheet::XLSX is all that's needed. Alternatively it only takes a few lines of code, to detect the filetype and call the appropriate library.
The above is a simple hack, assumes everything in a cell is a string / scalar, if preserving type is important, a little function with a few regexp can be used in conjunction with a few if statements to ensure numbers / dates remain in the applicable format when written to the DB

Apparently the command line is scarey, so if asked to share consider wrapping your logic in a CGI script / upload form :)

The script should return some output along the following lines:
arober11@wibble:~/src/perl> ./ testData.xls
                      FILE  :testData.xls
                      DB: testData
                      Collection Count :1
                      Collection(WorkSheet name):Sheet1
                      Documents inserted(Rows):244

Open in new window


Comments (1)

Top Expert 2010


If your looking for a relational DB / SQL equivalent then the following may possibly be of interest.

Have a question about something in this article? You can receive help directly from the article author. Sign up for a free trial to get started.