<

Go Premium for a chance to win a PS4. Enter to Win

x

Using Perl to convert an Excel Workbook into a MongoDB, with each Worksheet converted into a separate MongoDB collection

Published on
13,090 Points
6,490 Views
1 Endorsement
Last Modified:
A year or so back I was asked to have a play with MongoDB; within half an hour I had downloaded,  installed and started the daemon, and had a console window open.

After an hour or two of playing at the command line I created a database or two, a couple of collections and a number of handcrafted JSON documents. At which point I went in search of a GUI and found RockMongo.

Another half hour of playing and I had a Web based GUI, that's great for ad-hoc queries and admin tasks, but your still to left to manually handcraft and enter your own JSON documents, via a textarea box. At which point I realised that if I was to evaluate the map-reduce functionality,  attempt to join data from two collections, let alone identify and evaluate any Business Reporting tools, that I would need to start cutting some code and convert an existing data source.

Data wasn't really an issue as the Companies business was data, so a simple choice  of either hooking up to a DB or using one or more of the many Excel reports kicking around and picking a language.  After a quick surf of the MongoDB site and read of the Perl tutorial I chose the latter and Perl.

After installing the necessary Perl libraries, enter:
cpan YAML Data::Dumper Spreadsheet::ParseExcel Tie::IxHash Encode Scalar::Util  JSON MongoDB MongoDB::OID File::Basename 

Open in new window


A quick play with both the MongoDB and Spreadsheet::ParseExcel examples and a little bit of thought, I had hacked together a very basic (and slightly naughty - blindly inserts without checking the status) command line tool that will happily convert an Excel Workbook (XLS not XLSX) into:
A Database - named after the file
A Series of collections - One per Worksheet present in the  Workbook, and named accordingly
A Series of Documents in each Collection, where each document represents one row from a Worksheet, with it's Key names taken from the Cell (Column) names in Row 1 of the sheet

Anyway enough of a rant, some code.
#!/usr/bin/perl -w
# Purpose: Insert each Worksheet, in an Excel Workbook, into an existing MongoDB, of the same name as the Excel(.xls).
#          The worksheet names are mapped to the collection names, and the column named to the document hash labels.
#          Assumes each sheet is named and that the first ROW on each sheet contains the hash(field) names.
#
 
use strict;
use Spreadsheet::ParseExcel;
use MongoDB;
use MongoDB::OID;
use Tie::IxHash;
 
die "You must provide a filename to $0 to be parsed as an Excel file" unless @ARGV;
 
my $sDbName              = $ARGV[0];
   $sDbName              =~ s/\.xls//i;
my $oExcel               = new Spreadsheet::ParseExcel;
my $oBook                = $oExcel->Parse($ARGV[0]);
my $oConn                = MongoDB::Connection->new(host => 'some.server:27017');
my $oDB                  = $oConn->$sDbName;
my ($sColName, %hNewDoc, $hColToInsertInto, $sFieldName, $iR, $iC, $oWkS, $oWkC);
 
print "FILE  :", $oBook->{File} , "\n";
print "DB: $sDbName\n";
print "Collection Count :", $oBook->{SheetCount} , "\n";
 
for(my $iSheet=0; $iSheet < $oBook->{SheetCount} ; $iSheet++)
{
 $oWkS                   = $oBook->{Worksheet}[$iSheet];
 $sColName               = $oWkS->{Name};
 $hColToInsertInto       = $oDB->$sColName;
 print "Collection(WorkSheet name):", $sColName, "\n";
 for(my $iR   = $oWkS->{MinRow} ; defined $oWkS->{MaxRow} && $iR <= $oWkS->{MaxRow} ;  $iR++)
 {
  tie ( %hNewDoc, "Tie::IxHash");
  for(my $iC = $oWkS->{MinCol} ; defined $oWkS->{MaxCol} && $iC <= $oWkS->{MaxCol} ; $iC++)
  {
   $sFieldName           = $oWkS->{Cells}[$oWkS->{MinRow}][$iC]->Value;
   $oWkC                 = $oWkS->{Cells}[$iR][$iC];
   $hNewDoc{$sFieldName} = $oWkC->Value if($oWkC && $sFieldName);
  }
  $hColToInsertInto->insert(\%hNewDoc);
 }
 print "Documents inserted(Rows):", ($oWkS->{MaxRow} - $oWkS->{MinRow}), "\n";
}

Open in new window

Note:  
Change the connection ($oConn) string to suit, and if needed add a user-id and password to the arguments.
If you need XLSX support a quick switch to Spreadsheet::XLSX is all that's needed. Alternatively it only takes a few lines of code, to detect the filetype and call the appropriate library.
The above is a simple hack, assumes everything in a cell is a string / scalar, if preserving type is important, a little function with a few regexp can be used in conjunction with a few if statements to ensure numbers / dates remain in the applicable format when written to the DB

Apparently the command line is scarey, so if asked to share consider wrapping your logic in a CGI script / upload form :)

The script should return some output along the following lines:
arober11@wibble:~/src/perl> ./mongoTST.pl testData.xls
FILE  :testData.xls
DB: testData
Collection Count :1
Collection(WorkSheet name):Sheet1
Documents inserted(Rows):244
arober11@wibble:~/src/perl>

Open in new window

1
Comment
Author:arober11
1 Comment
 
LVL 26

Author Comment

by:arober11
If your looking for a relational DB / SQL equivalent then the following may possibly be of interest.
0

Featured Post

Free Tool: IP Lookup

Get more info about an IP address or domain name, such as organization, abuse contacts and geolocation.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Join & Write a Comment

Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…
Six Sigma Control Plans

Keep in touch with Experts Exchange

Tech news and trends delivered to your inbox every month