Advice needed on storage of unstructured text data

jgrammer42
jgrammer42 used Ask the Experts™
on
I have over 60 stores that use cash registers for sales.  I have a method of collecting every transaction from a cash register by capturing the cash registers electronic journal through the serial port.  The data is all text based, but very unstructured.  (I am including a sample file of this data capture to this post.)

The question I have is this:  IF you were tasked with storing, then retrieving and analyzing these data; what type of database storage system would you use?  And why that one?  

I am waffling on whether to use a SQL or NoSQL solution.

Thank you,
Jeff

Here is a sample of the raw data capture.  (Note: about half of the way down there is something called a "Z report"; which is run when a cash register is closed out.  That zero's out all of the balances and totals during that period of time it was operating.)

DATE  10/09/2019 WED  TIME 02:17
                               
2X             @ 4.29          
Purses/Backpack T1         $8.58
Childrens Clothing T1      $1.99
TAX1                       $0.63
TOTAL                     $11.20
CASH                      $12.00
CHANGE                     $0.80
CLERK 1        No.000072   00000
                               
                               
                               
                               
                               
                               
                               
DATE  10/09/2019 WED  TIME 02:18
                               
5X             @ 5.79          
Jeans T1                  $28.95
Sweater/Sweatshirt T1      $4.59
3X             @ 2.59          
T-Shirt/Cami T1            $7.77
50% OFF                 -50.000%
AMOUNT T1                  -3.89
Furniture T1              $25.00
TAX1                       $3.75
TOTAL                     $66.17
CASH                      $67.00
CHANGE                     $0.83
CLERK 1        No.000073   00000
                               
                               
                               
                               
                               
                               
                               
DATE  10/09/2019 WED  TIME 02:18
                               
Winter Coat T1             $6.29
2X             @ 4.59          
Sweater/Sweatshirt T1      $9.18
Elec/Mech T1              $15.00
TAX1                       $1.83
TOTAL                     $32.30
Credit Card               $32.30
--------------------------------
--------------------------------
CLERK 1        No.000074   00000
                               
                               
                               
                               
                               
                               
                               
DATE  10/09/2019 WED  TIME 02:19
                               
Shoes T1                  $25.00
4X             @ 1.99          
Childrens Clothing T1      $7.96
Newborn - 5t T1            $0.99
Shirt T1                   $4.29
2X             @ 2.59          
T-Shirt/Cami T1            $5.18
DONATION                   $0.92
DONATION                   $0.05
TAX1                       $2.61
TOTAL                     $47.00
Credit Card               $47.00
--------------------------------
--------------------------------
CLERK 1        No.000075   00000
                               
                               
                               
                               
                               
                               
                               
DATE  10/09/2019 WED  TIME 02:19
                               
Dress T1                   $5.79
Jacket/Blazer T1           $6.29
Skirts T1                  $3.99
3X             @ 2.59          
Bra/Slip T1                $7.77
TAX1                       $1.43
TOTAL                     $25.27
CASH                      $30.00
CHANGE                     $4.73
CLERK 1        No.000076   00000
                               
                               
                               
                               
                               
                               
                               
DATE  10/09/2019 WED  TIME 02:20
                               
Z 1  REPORT                00007
LAST REPORT           10/08/2019
--------------------------------
FINANCIAL                      
                               
+PLU TTL                      42
                         $217.61
ADJST TTL                     42
                         $217.61
--------------------------------
NONTAX                     $0.97
Sales                    $216.64
Tax                       $13.01
% 1                            1
                           -3.89
Customer #                     7
                         $230.62
GROSS SALES              $234.51
CASH SALES                     5
                         $151.32
CASH-IN-D                      5
                         $151.32
CHG1-IN-D                      2
                          $79.30
Credit Sales                   2
                          $79.30
DRWR TTL                 $230.62
AVG $/CUST                $32.94
OPEN CHECK TOTAL           $0.00
********************************
                               
GRAND                   $1429.83
                               
CLERK 1        No.000077   00000
                               
                               
                               
                               
                               
                               
                               
                               
                               
DATE  10/09/2019 WED  TIME 02:20
                               
Z 1  REPORT                00007
LAST REPORT           10/08/2019
--------------------------------
ALL CLERK                      
                               
Management                      
NET SALE                       7
                         $230.62
DRWR TTL                 $230.62
--------------------------------
AP-200793                      
--------------------------------
BT-200803                      
--------------------------------
CP-201588                      
--------------------------------
DM-402789                      
--------------------------------
JH-406479                      
--------------------------------
KR-200543                      
--------------------------------
LB-404475                      
--------------------------------
PH-200890                      
--------------------------------
C10                            
--------------------------------
C11                            
--------------------------------
C12                            
--------------------------------
C13                            
--------------------------------
C14                            
--------------------------------
CLERK 15                        
--------------------------------
CLERK 16                        
--------------------------------
CLERK 17                        
--------------------------------
CLERK 18                        
--------------------------------
CLERK 19                        
--------------------------------
CLERK 20                        
--------------------------------
CLERK 21                        
--------------------------------
CLERK 22                        
--------------------------------
CLERK 23                        
--------------------------------
CLERK 24                        
--------------------------------
CLERK 25                        
--------------------------------
                               
CLERK 1        No.000078   00000
                               
                               
                               
                               
                               
                               
                               
DATE  10/09/2019 WED  TIME 02:20
                               
Z 1  REPORT                00007
LAST REPORT           10/08/2019
--------------------------------
GROUP                          
                               
GROUP : 1 TEXTILES              
 CNT                          35
 SALES AMT               $143.06
GROUP : 2 FURNITURE            
 CNT                           1
 SALES AMT                $25.00
GROUP : 3 ELEC/MECH            
 CNT                           1
 SALES AMT                $15.00
GROUP : 4 ACCESSORIES          
 CNT                           3
 SALES AMT                $33.58
GROUP : 10 DONATION            
 CNT                           2
 SALES AMT                 $0.97
********************************
TOTAL CNT                     42
TOTAL AMT                $217.61
                               
CLERK 1        No.000079   00000
                               
                               
                               
                               
                               
                               
                               
DATE  10/09/2019 WED  TIME 02:21
                               
Z 1  REPORT                00007
LAST REPORT           10/08/2019
--------------------------------
ALL PLU                        
                               
Shirt                          
 CNT 1                     $4.29
T-Shirt/Cami                    
 CNT 5                     $9.06
Sweater/Sweatshirt              
 CNT 5                    $22.95
Winter Coat                    
 CNT 5                    $31.45
Jeans                          
 CNT 7                    $40.53
Skirts                          
 CNT 1                     $3.99
Dress                          
 CNT 1                     $5.79
Jacket/Blazer                  
 CNT 1                     $6.29
Bra/Slip                        
 CNT 3                     $7.77
Childrens Clothing              
 CNT 5                     $9.95
Newborn - 5t                    
 CNT 1                     $0.99
Purses/Backpack                
 CNT 2                     $8.58
Furniture                      
 CNT 1                    $25.00
Elec/Mech                      
 CNT 1                    $15.00
Shoes                          
 CNT 1                    $25.00
DONATION                        
 CNT 2                     $0.97
********************************
TOTAL CNT                     42
TOTAL AMT                $217.61
                               
CLERK 1        No.000080   00000
Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
Dustin SaundersCo-Founder and Chief Architect
Top Expert 2016

Commented:
I assume these will be parsed and broken out into actual data?  

But for storing anything like this where you have transactions and are undoubtedly going to need reporting and aggregation, you'll be best served with a relational SQL database like SQL Server or MySQL.  The only advantage NoSQL has on SQL Server is the ability to scale horizontally which isn't going to be a concern for the vast majority of applications (you can even do "schemaless" implementations in SQL Server with a few tables to support the schemaless schema).

How you build those tables would be based on what you need to parse-- but you'd want to store things such as the quantity, individual cost, product name, the discount that was applied, the clerk involved, etc.  and probably the rest of the transaction such as the tax collected and total costs.

Either way, you will want to parse out the data, but on NoSQL you're going to eat it on aggregation and reporting, and you'll end up storing much more data than is required as opposed to doing something like creating a new product record for each existing product and then referencing and indexing that id.

My rule of thumb (and this isn't everyone's):  NoSQL is good for static, non-changing, non-reporting data.  So something like tons of log files that could have varying schemas, or chat conversations that won't be a part of reporting.
John TsioumprisSoftware & Systems Engineer

Commented:
Keeping unstructured data means that in the end you don't have anything more like a folder full of little papers...you can't get any useful reports...you can't monitor stock, client trends...you name it...but if you want to dump them just somehow for future reference even the simple file system would be just fine (thousands of little 1K files...given that you have 60 stores x 300 days = 18000 files per year.. assuming that the stores work 6 days a week minus the holidays roughly )
What you really need is a good "parser" ...an application that will read your input and based on predefined rules split the "data" to useful data and store them in an appropriate database engine (doesn't really matter which one).
60 stores x 300 days = 18000 files per year.. assuming that the stores work 6 days a week minus the holidays roughly

Assuming of course each store only has one register. Out of curiosity, how many registers are at each location (approx)

Do the stores have a "controller/server" which report to a primary?

Or are you pulling each report manually?
Ensure you’re charging the right price for your IT

Do you wonder if your IT business is truly profitable or if you should raise your prices? Learn how to calculate your overhead burden using our free interactive tool and use it to determine the right price for your IT services. Start calculating Now!

If it was me, I'd recommend using a SQL solution (e.g. MySQL) over a NoSQL solution.  As has already been pointed out, NoSQL is mostly about handling scale.  You basically trade transactional safety in return for handling vast amounts of data quickly.  You have small amounts of data, so no need for NoSQL.

The underlying data here is actually pretty structured (a series of items bought at a specific cash register at a particular time).  The problem is that extracting the structure will be hard.

What I would do is start from a super simple database structure - where you simply record each transaction from each register as raw text.  A "raw_transactions" table.  Then alongside that you'd set up a group of tables with the structure you actually want - datetime, register id, cashier, total transaction size, list of items purchased (in another table) etc.

Now the real work begins - which is writing some code to extract each of the raw_transactions and see if you can parse it into the structured form.  If you can, you record the details in the structured tables and mark the raw_transaction as parsed (probably with an index into the parsed table data).  But if you can't, you leave the raw transaction in the original table as something to work on later.

Then you see which you've managed to parse, update your parsing logic (as time allows) and run it again - this time parsing some more into the structured tables.

If you take this approach you're never discarding data - the raw text data is in your database as well as the clean, structured version.  And as your tools for parsing the raw data get better, you can go back and process data you couldn't really support before.

Make sense?

Doug
This makes #5 to support a SQL type solution over NoSQL.  :)

The data looks like it will actually parse rather easily.  Each transaction seems to start with a date/time stamp, followed by a list of sold items (and the transaction close-out items) or a "Z 1 REPORT" header.  There's not a lot of challenge in writing something that would convert the text into something that a load routine could handle.

Use the date/time as a transaction ID (assuming that your cashier isn't so fast that he/she can run multiple customers in under a second).  You might want to prepend that with a store-id.  Then list the sales.  You've got several special items that you'll have to decide how to handle -- "2X @ 2.29", "50 % OFF", etc. but they aren't that much of a challenge.

What you really want is the register to post a UPC or product code on every sales line.  That would give you much better reporting and analysis.  But what you have will still let you categorize sales and volumes.


Kent

Author

Commented:
kenfcamp:
Answers to your questions:   On average each store has 3 cash registers.

What I have at each location is a fanless, miniPC, with 4Gb Ram and 128Gb SSD running Linux with a USB to 4-port serial hydra cable which connects to a serial port of each cash register.   These localized servers can and will upload to a master server.  (but I have not built that yet as I am waffling on just what I should do there as a clearing house DB.)

NOTE TO ALL: This capture is for audit trails.  It will not be used to track inventory or even "live sales data".  How these data will be used on the backend is for historical reporting, trend analysis and financial audits.
John TsioumprisSoftware & Systems Engineer

Commented:
Just setup a cloud VPS with MySQL and push everything there.....for the good data..
For the text part...i would setup a Firebird server...do a bit of parsing and store the Text as BLOBs..

Author

Commented:
Thank you all so far for your very valuable insight and expertise!!

This is greatly helping me think this through!  

I truly cannot thank you all enough.
Top Expert 2014

Commented:
It looks like you have both transaction detail segments and end-of-day reports.  I'd probably separate those two types of data segments.
Tomas Helgi JohannssonDatabase Administrator / Software Engineer

Commented:
Hi,

I would look at the MariaDB database and the Connect Storage Engine and create some sort of parser that can be called using REST API to move the data easily from unstructured to structured data.
The connect storage engine has also other forms to read data like this
Then use InnoDB or Aria storage engines to hold the relational tables.

Regards,
    Tomas Helgi
Should not be too difficult to write something in Perl or some other text processing language to turn this into a series of structured transaction, that will give you much better insight into your sales data than storing as free text

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial