jgrammer42
asked on
Advice needed on storage of unstructured text data
I have over 60 stores that use cash registers for sales. I have a method of collecting every transaction from a cash register by capturing the cash registers electronic journal through the serial port. The data is all text based, but very unstructured. (I am including a sample file of this data capture to this post.)
The question I have is this: IF you were tasked with storing, then retrieving and analyzing these data; what type of database storage system would you use? And why that one?
I am waffling on whether to use a SQL or NoSQL solution.
Thank you,
Jeff
Here is a sample of the raw data capture. (Note: about half of the way down there is something called a "Z report"; which is run when a cash register is closed out. That zero's out all of the balances and totals during that period of time it was operating.)
DATE 10/09/2019 WED TIME 02:17
2X @ 4.29
Purses/Backpack T1 $8.58
Childrens Clothing T1 $1.99
TAX1 $0.63
TOTAL $11.20
CASH $12.00
CHANGE $0.80
CLERK 1 No.000072 00000
DATE 10/09/2019 WED TIME 02:18
5X @ 5.79
Jeans T1 $28.95
Sweater/Sweatshirt T1 $4.59
3X @ 2.59
T-Shirt/Cami T1 $7.77
50% OFF -50.000%
AMOUNT T1 -3.89
Furniture T1 $25.00
TAX1 $3.75
TOTAL $66.17
CASH $67.00
CHANGE $0.83
CLERK 1 No.000073 00000
DATE 10/09/2019 WED TIME 02:18
Winter Coat T1 $6.29
2X @ 4.59
Sweater/Sweatshirt T1 $9.18
Elec/Mech T1 $15.00
TAX1 $1.83
TOTAL $32.30
Credit Card $32.30
-------------------------- ------
-------------------------- ------
CLERK 1 No.000074 00000
DATE 10/09/2019 WED TIME 02:19
Shoes T1 $25.00
4X @ 1.99
Childrens Clothing T1 $7.96
Newborn - 5t T1 $0.99
Shirt T1 $4.29
2X @ 2.59
T-Shirt/Cami T1 $5.18
DONATION $0.92
DONATION $0.05
TAX1 $2.61
TOTAL $47.00
Credit Card $47.00
-------------------------- ------
-------------------------- ------
CLERK 1 No.000075 00000
DATE 10/09/2019 WED TIME 02:19
Dress T1 $5.79
Jacket/Blazer T1 $6.29
Skirts T1 $3.99
3X @ 2.59
Bra/Slip T1 $7.77
TAX1 $1.43
TOTAL $25.27
CASH $30.00
CHANGE $4.73
CLERK 1 No.000076 00000
DATE 10/09/2019 WED TIME 02:20
Z 1 REPORT 00007
LAST REPORT 10/08/2019
-------------------------- ------
FINANCIAL
+PLU TTL 42
$217.61
ADJST TTL 42
$217.61
-------------------------- ------
NONTAX $0.97
Sales $216.64
Tax $13.01
% 1 1
-3.89
Customer # 7
$230.62
GROSS SALES $234.51
CASH SALES 5
$151.32
CASH-IN-D 5
$151.32
CHG1-IN-D 2
$79.30
Credit Sales 2
$79.30
DRWR TTL $230.62
AVG $/CUST $32.94
OPEN CHECK TOTAL $0.00
************************** ******
GRAND $1429.83
CLERK 1 No.000077 00000
DATE 10/09/2019 WED TIME 02:20
Z 1 REPORT 00007
LAST REPORT 10/08/2019
-------------------------- ------
ALL CLERK
Management
NET SALE 7
$230.62
DRWR TTL $230.62
-------------------------- ------
AP-200793
-------------------------- ------
BT-200803
-------------------------- ------
CP-201588
-------------------------- ------
DM-402789
-------------------------- ------
JH-406479
-------------------------- ------
KR-200543
-------------------------- ------
LB-404475
-------------------------- ------
PH-200890
-------------------------- ------
C10
-------------------------- ------
C11
-------------------------- ------
C12
-------------------------- ------
C13
-------------------------- ------
C14
-------------------------- ------
CLERK 15
-------------------------- ------
CLERK 16
-------------------------- ------
CLERK 17
-------------------------- ------
CLERK 18
-------------------------- ------
CLERK 19
-------------------------- ------
CLERK 20
-------------------------- ------
CLERK 21
-------------------------- ------
CLERK 22
-------------------------- ------
CLERK 23
-------------------------- ------
CLERK 24
-------------------------- ------
CLERK 25
-------------------------- ------
CLERK 1 No.000078 00000
DATE 10/09/2019 WED TIME 02:20
Z 1 REPORT 00007
LAST REPORT 10/08/2019
-------------------------- ------
GROUP
GROUP : 1 TEXTILES
CNT 35
SALES AMT $143.06
GROUP : 2 FURNITURE
CNT 1
SALES AMT $25.00
GROUP : 3 ELEC/MECH
CNT 1
SALES AMT $15.00
GROUP : 4 ACCESSORIES
CNT 3
SALES AMT $33.58
GROUP : 10 DONATION
CNT 2
SALES AMT $0.97
************************** ******
TOTAL CNT 42
TOTAL AMT $217.61
CLERK 1 No.000079 00000
DATE 10/09/2019 WED TIME 02:21
Z 1 REPORT 00007
LAST REPORT 10/08/2019
-------------------------- ------
ALL PLU
Shirt
CNT 1 $4.29
T-Shirt/Cami
CNT 5 $9.06
Sweater/Sweatshirt
CNT 5 $22.95
Winter Coat
CNT 5 $31.45
Jeans
CNT 7 $40.53
Skirts
CNT 1 $3.99
Dress
CNT 1 $5.79
Jacket/Blazer
CNT 1 $6.29
Bra/Slip
CNT 3 $7.77
Childrens Clothing
CNT 5 $9.95
Newborn - 5t
CNT 1 $0.99
Purses/Backpack
CNT 2 $8.58
Furniture
CNT 1 $25.00
Elec/Mech
CNT 1 $15.00
Shoes
CNT 1 $25.00
DONATION
CNT 2 $0.97
************************** ******
TOTAL CNT 42
TOTAL AMT $217.61
CLERK 1 No.000080 00000
The question I have is this: IF you were tasked with storing, then retrieving and analyzing these data; what type of database storage system would you use? And why that one?
I am waffling on whether to use a SQL or NoSQL solution.
Thank you,
Jeff
Here is a sample of the raw data capture. (Note: about half of the way down there is something called a "Z report"; which is run when a cash register is closed out. That zero's out all of the balances and totals during that period of time it was operating.)
DATE 10/09/2019 WED TIME 02:17
2X @ 4.29
Purses/Backpack T1 $8.58
Childrens Clothing T1 $1.99
TAX1 $0.63
TOTAL $11.20
CASH $12.00
CHANGE $0.80
CLERK 1 No.000072 00000
DATE 10/09/2019 WED TIME 02:18
5X @ 5.79
Jeans T1 $28.95
Sweater/Sweatshirt T1 $4.59
3X @ 2.59
T-Shirt/Cami T1 $7.77
50% OFF -50.000%
AMOUNT T1 -3.89
Furniture T1 $25.00
TAX1 $3.75
TOTAL $66.17
CASH $67.00
CHANGE $0.83
CLERK 1 No.000073 00000
DATE 10/09/2019 WED TIME 02:18
Winter Coat T1 $6.29
2X @ 4.59
Sweater/Sweatshirt T1 $9.18
Elec/Mech T1 $15.00
TAX1 $1.83
TOTAL $32.30
Credit Card $32.30
--------------------------
--------------------------
CLERK 1 No.000074 00000
DATE 10/09/2019 WED TIME 02:19
Shoes T1 $25.00
4X @ 1.99
Childrens Clothing T1 $7.96
Newborn - 5t T1 $0.99
Shirt T1 $4.29
2X @ 2.59
T-Shirt/Cami T1 $5.18
DONATION $0.92
DONATION $0.05
TAX1 $2.61
TOTAL $47.00
Credit Card $47.00
--------------------------
--------------------------
CLERK 1 No.000075 00000
DATE 10/09/2019 WED TIME 02:19
Dress T1 $5.79
Jacket/Blazer T1 $6.29
Skirts T1 $3.99
3X @ 2.59
Bra/Slip T1 $7.77
TAX1 $1.43
TOTAL $25.27
CASH $30.00
CHANGE $4.73
CLERK 1 No.000076 00000
DATE 10/09/2019 WED TIME 02:20
Z 1 REPORT 00007
LAST REPORT 10/08/2019
--------------------------
FINANCIAL
+PLU TTL 42
$217.61
ADJST TTL 42
$217.61
--------------------------
NONTAX $0.97
Sales $216.64
Tax $13.01
% 1 1
-3.89
Customer # 7
$230.62
GROSS SALES $234.51
CASH SALES 5
$151.32
CASH-IN-D 5
$151.32
CHG1-IN-D 2
$79.30
Credit Sales 2
$79.30
DRWR TTL $230.62
AVG $/CUST $32.94
OPEN CHECK TOTAL $0.00
**************************
GRAND $1429.83
CLERK 1 No.000077 00000
DATE 10/09/2019 WED TIME 02:20
Z 1 REPORT 00007
LAST REPORT 10/08/2019
--------------------------
ALL CLERK
Management
NET SALE 7
$230.62
DRWR TTL $230.62
--------------------------
AP-200793
--------------------------
BT-200803
--------------------------
CP-201588
--------------------------
DM-402789
--------------------------
JH-406479
--------------------------
KR-200543
--------------------------
LB-404475
--------------------------
PH-200890
--------------------------
C10
--------------------------
C11
--------------------------
C12
--------------------------
C13
--------------------------
C14
--------------------------
CLERK 15
--------------------------
CLERK 16
--------------------------
CLERK 17
--------------------------
CLERK 18
--------------------------
CLERK 19
--------------------------
CLERK 20
--------------------------
CLERK 21
--------------------------
CLERK 22
--------------------------
CLERK 23
--------------------------
CLERK 24
--------------------------
CLERK 25
--------------------------
CLERK 1 No.000078 00000
DATE 10/09/2019 WED TIME 02:20
Z 1 REPORT 00007
LAST REPORT 10/08/2019
--------------------------
GROUP
GROUP : 1 TEXTILES
CNT 35
SALES AMT $143.06
GROUP : 2 FURNITURE
CNT 1
SALES AMT $25.00
GROUP : 3 ELEC/MECH
CNT 1
SALES AMT $15.00
GROUP : 4 ACCESSORIES
CNT 3
SALES AMT $33.58
GROUP : 10 DONATION
CNT 2
SALES AMT $0.97
**************************
TOTAL CNT 42
TOTAL AMT $217.61
CLERK 1 No.000079 00000
DATE 10/09/2019 WED TIME 02:21
Z 1 REPORT 00007
LAST REPORT 10/08/2019
--------------------------
ALL PLU
Shirt
CNT 1 $4.29
T-Shirt/Cami
CNT 5 $9.06
Sweater/Sweatshirt
CNT 5 $22.95
Winter Coat
CNT 5 $31.45
Jeans
CNT 7 $40.53
Skirts
CNT 1 $3.99
Dress
CNT 1 $5.79
Jacket/Blazer
CNT 1 $6.29
Bra/Slip
CNT 3 $7.77
Childrens Clothing
CNT 5 $9.95
Newborn - 5t
CNT 1 $0.99
Purses/Backpack
CNT 2 $8.58
Furniture
CNT 1 $25.00
Elec/Mech
CNT 1 $15.00
Shoes
CNT 1 $25.00
DONATION
CNT 2 $0.97
**************************
TOTAL CNT 42
TOTAL AMT $217.61
CLERK 1 No.000080 00000
Keeping unstructured data means that in the end you don't have anything more like a folder full of little papers...you can't get any useful reports...you can't monitor stock, client trends...you name it...but if you want to dump them just somehow for future reference even the simple file system would be just fine (thousands of little 1K files...given that you have 60 stores x 300 days = 18000 files per year.. assuming that the stores work 6 days a week minus the holidays roughly )
What you really need is a good "parser" ...an application that will read your input and based on predefined rules split the "data" to useful data and store them in an appropriate database engine (doesn't really matter which one).
What you really need is a good "parser" ...an application that will read your input and based on predefined rules split the "data" to useful data and store them in an appropriate database engine (doesn't really matter which one).
60 stores x 300 days = 18000 files per year.. assuming that the stores work 6 days a week minus the holidays roughly
Assuming of course each store only has one register. Out of curiosity, how many registers are at each location (approx)
Do the stores have a "controller/server" which report to a primary?
Or are you pulling each report manually?
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
This makes #5 to support a SQL type solution over NoSQL. :)
The data looks like it will actually parse rather easily. Each transaction seems to start with a date/time stamp, followed by a list of sold items (and the transaction close-out items) or a "Z 1 REPORT" header. There's not a lot of challenge in writing something that would convert the text into something that a load routine could handle.
Use the date/time as a transaction ID (assuming that your cashier isn't so fast that he/she can run multiple customers in under a second). You might want to prepend that with a store-id. Then list the sales. You've got several special items that you'll have to decide how to handle -- "2X @ 2.29", "50 % OFF", etc. but they aren't that much of a challenge.
What you really want is the register to post a UPC or product code on every sales line. That would give you much better reporting and analysis. But what you have will still let you categorize sales and volumes.
Kent
The data looks like it will actually parse rather easily. Each transaction seems to start with a date/time stamp, followed by a list of sold items (and the transaction close-out items) or a "Z 1 REPORT" header. There's not a lot of challenge in writing something that would convert the text into something that a load routine could handle.
Use the date/time as a transaction ID (assuming that your cashier isn't so fast that he/she can run multiple customers in under a second). You might want to prepend that with a store-id. Then list the sales. You've got several special items that you'll have to decide how to handle -- "2X @ 2.29", "50 % OFF", etc. but they aren't that much of a challenge.
What you really want is the register to post a UPC or product code on every sales line. That would give you much better reporting and analysis. But what you have will still let you categorize sales and volumes.
Kent
ASKER
kenfcamp:
Answers to your questions: On average each store has 3 cash registers.
What I have at each location is a fanless, miniPC, with 4Gb Ram and 128Gb SSD running Linux with a USB to 4-port serial hydra cable which connects to a serial port of each cash register. These localized servers can and will upload to a master server. (but I have not built that yet as I am waffling on just what I should do there as a clearing house DB.)
NOTE TO ALL: This capture is for audit trails. It will not be used to track inventory or even "live sales data". How these data will be used on the backend is for historical reporting, trend analysis and financial audits.
Answers to your questions: On average each store has 3 cash registers.
What I have at each location is a fanless, miniPC, with 4Gb Ram and 128Gb SSD running Linux with a USB to 4-port serial hydra cable which connects to a serial port of each cash register. These localized servers can and will upload to a master server. (but I have not built that yet as I am waffling on just what I should do there as a clearing house DB.)
NOTE TO ALL: This capture is for audit trails. It will not be used to track inventory or even "live sales data". How these data will be used on the backend is for historical reporting, trend analysis and financial audits.
Just setup a cloud VPS with MySQL and push everything there.....for the good data..
For the text part...i would setup a Firebird server...do a bit of parsing and store the Text as BLOBs..
For the text part...i would setup a Firebird server...do a bit of parsing and store the Text as BLOBs..
ASKER
Thank you all so far for your very valuable insight and expertise!!
This is greatly helping me think this through!
I truly cannot thank you all enough.
This is greatly helping me think this through!
I truly cannot thank you all enough.
It looks like you have both transaction detail segments and end-of-day reports. I'd probably separate those two types of data segments.
Hi,
I would look at the MariaDB database and the Connect Storage Engine and create some sort of parser that can be called using REST API to move the data easily from unstructured to structured data.
The connect storage engine has also other forms to read data like this
Then use InnoDB or Aria storage engines to hold the relational tables.
Regards,
Tomas Helgi
I would look at the MariaDB database and the Connect Storage Engine and create some sort of parser that can be called using REST API to move the data easily from unstructured to structured data.
The connect storage engine has also other forms to read data like this
Then use InnoDB or Aria storage engines to hold the relational tables.
Regards,
Tomas Helgi
Should not be too difficult to write something in Perl or some other text processing language to turn this into a series of structured transaction, that will give you much better insight into your sales data than storing as free text
But for storing anything like this where you have transactions and are undoubtedly going to need reporting and aggregation, you'll be best served with a relational SQL database like SQL Server or MySQL. The only advantage NoSQL has on SQL Server is the ability to scale horizontally which isn't going to be a concern for the vast majority of applications (you can even do "schemaless" implementations in SQL Server with a few tables to support the schemaless schema).
How you build those tables would be based on what you need to parse-- but you'd want to store things such as the quantity, individual cost, product name, the discount that was applied, the clerk involved, etc. and probably the rest of the transaction such as the tax collected and total costs.
Either way, you will want to parse out the data, but on NoSQL you're going to eat it on aggregation and reporting, and you'll end up storing much more data than is required as opposed to doing something like creating a new product record for each existing product and then referencing and indexing that id.
My rule of thumb (and this isn't everyone's): NoSQL is good for static, non-changing, non-reporting data. So something like tons of log files that could have varying schemas, or chat conversations that won't be a part of reporting.