Is WordPress ALL I NEED?

I am considering making a site that can auto-analyze a certain type of uploaded report, and instantly display the results as a PDF. There are various steps involved in the creation of the PDF and I want a feeling for the effort and technology needed for each step.

There are three different steps I will discuss here to see where I can use WordPress plugins and where I need to customize the functionality.

The uploaded report would be a merchant's monthly credit card statement, like the following snippet.

[Image: Statement Example]
1) So, for the first of three steps, I need a WordPress OCR plug-in. Are there many options for that? Is the angle of the text a problem? I cannot guarantee neatness. (I added the underlining to make it easier for me to read.)

I imagine allowing an authorized user to upload a report. And I need this plug-in to convert images to some form of digital data, like a PDF or a CSV file.

2) I need a way to analyze that data, and wonder if there is a configurable WordPress plugin for this? It will query the items by the Description, then use the numeric values in the Number, Amount and Total columns for mathematical computations. There will be some mathematical steps performed on some of the data as it generates the output for the report.

The results should go into some format, like a CSV file.

3) I need a report tool which can import the data results from Step #2 and apply them to various pre-designed fields in the final pre-designed report.

I need to display that report in the browser and enable the user to download and print that report.

I presume steps #1 and #3 can be completely satisfied through the use of a WordPress plugin, with configuration. But Step #2, I am not so sure about.

Must that report analysis logic be created in PHP? It will need to query the data converted by the OCR plug-in, perform math functions, summarize the results and output to some intermediary file format.
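For a sense of scale, the kind of step #2 logic described here could be sketched roughly as follows. This is only an illustration under assumptions: Python is used for the example rather than PHP, the column names `Description`, `Number` and `Amount` follow the statement columns mentioned above, and the sample rows and the "Average" calculation are invented.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal


def summarize(rows):
    """Group rows by Description and add up the Number and Amount columns."""
    summary = defaultdict(lambda: {"Number": 0, "Amount": Decimal("0")})
    for row in rows:
        entry = summary[row["Description"]]
        entry["Number"] += int(row["Number"])
        entry["Amount"] += Decimal(row["Amount"])
    return dict(summary)


def to_csv(summary):
    """Write the summary as CSV text, with a computed average per item."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Description", "Number", "Amount", "Average"])
    for description, entry in sorted(summary.items()):
        if entry["Number"]:
            average = (entry["Amount"] / entry["Number"]).quantize(Decimal("0.01"))
        else:
            average = Decimal("0.00")
        writer.writerow([description, entry["Number"], entry["Amount"], average])
    return out.getvalue()


# Invented sample data standing in for the converted statement.
rows = [
    {"Description": "VISA", "Number": "10", "Amount": "500.00"},
    {"Description": "VISA", "Number": "5", "Amount": "250.00"},
    {"Description": "MC", "Number": "4", "Amount": "100.00"},
]
summary = summarize(rows)
report_csv = to_csv(summary)
```

Amounts are kept as `Decimal` rather than floating point so that currency sums do not pick up rounding noise.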

Would I be creating a WordPress plugin to perform this step #2? If not, where exactly would this logic go?

And lastly, is there an existing WordPress plug-in which can speed up Step #2?

Please let me know.

newbieweb (Sr. Software Engineer) asked:

I would STRONGLY STRONGLY suggest NOT trying to do OCR of financial data via some WP plugin. OCR technology still makes mistakes that need human correction (even using neatly-printed statements). Additional marks like underlines can increase the risk of OCR mistakes, too.

So you have to ask yourself what the impact would be if the OCR read $750 as $150, for example. If there was no human correction, would an automated mistake like that create a bigger problem later? Again, when it comes to financial data, the answer is usually "yes" but you'll have to be the one to decide that.

If it were up to me, I'd completely separate the OCR into a separate step and invest in the highest quality OCR product/process that you can afford. Normally that would mean a good, sheetfed scanner (to eliminate angled scans) and an established software product that handles OCR. I'd also suggest having code that tries to validate the OCR results (e.g. compare the OCR-ed total value against the OCR-ed credit and debit values) so you can be notified if you need to manually correct something.
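A minimal sketch of that kind of validation, assuming the statement total is supposed to equal credits minus debits (that rule, the function name and the sample figures are assumptions for illustration, not taken from any real statement format):

```python
from decimal import Decimal


def totals_match(credits, debits, stated_total, tolerance=Decimal("0.00")):
    """Return True when the line items add up to the total printed on the statement."""
    computed = sum(credits, Decimal("0")) - sum(debits, Decimal("0"))
    return abs(computed - stated_total) <= tolerance


# Hypothetical OCR output: one credit was misread as 150.00 instead of 750.00.
credits = [Decimal("150.00"), Decimal("200.00")]
debits = [Decimal("50.00")]
stated_total = Decimal("900.00")

# Flag the statement for manual correction when the numbers do not reconcile.
needs_review = not totals_match(credits, debits, stated_total)
```

A check like this only catches mistakes that break the arithmetic. If the total itself is misread in a consistent way, it can still slip through, so it supplements human review rather than replacing it.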

Ideally you would just extract the original data from its original source (maybe extract it from a statement PDF) which would give you reliable data every time. Is there a reason why you want to go the OCR route?

If you have to take in scanned pages and they might be at an angle or have marks, you'll probably have greater long-term efficiency by hiring a data entry person.

David Favor (Linux/LXD/WordPress/Hosting Savant) commented:
I'm with gr8gonzo. Do all this work with command line tools, which are debuggable.

Using WordPress for this... Yikes!

OCR is fuzzy, so as gr8gonzo said, you'll always require a manual review phase.

For example, I use an OCR scanner to scan .pdf statements for generating my tax data.

I'd say roughly 95% of the time the OCR converter converts text correctly.

I'd say roughly 80% of the time the OCR converter aligns columns correctly.

Let's think about gr8gonzo's example a bit differently. Let's say your OCR converter worked correctly, so you ended up with $750, except the columns were shifted one space left, so your processing software read the value as $50, rather than $750.

Here's how I run my OCR pipeline.

1) If I don't have a .pdf file or the .pdf file doesn't have a text component (many .pdf files are just one big jpeg image), then I run the document through my .pdf -> OCR conversion, using a ScanSnap scanner, which scans double-sided pages at 1 page/sec. Very fast.

2) Then I convert the .pdf to a .txt file using pdftotext (from the poppler utilities suite).

3) Next, scanner code runs which attempts to auto-correct/align any misaligned records. This code is very important. It's written to die at any point where a single record of data isn't understood.

I add handling for any oddball records until the scanner recognizes every line of a statement.

4) Then I pull every single statement into a text editor and check the physical statement against my text statement.

99% of the time all's correct. The other 1% I hand edit the file till every single record lines up correctly with the credit + debit column.

5) Only now can this data be used.
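The fail-fast scanner from step 3 might look something like the sketch below. It is not the actual code from this pipeline: the line layout (description, count, amount, total), the `RECORD` pattern and the `UnrecognizedRecord` exception are all hypothetical, and a real statement would need its own patterns.

```python
import re
from decimal import Decimal

# Hypothetical line layout: description, item count, amount, total.
RECORD = re.compile(
    r"^(?P<description>.+?)\s+(?P<number>\d+)\s+"
    r"(?P<amount>-?[\d,]+\.\d{2})\s+(?P<total>-?[\d,]+\.\d{2})$"
)


class UnrecognizedRecord(ValueError):
    """Raised when a line does not match any known record layout."""


def parse_statement(text):
    """Parse every non-blank line, stopping at the first one not understood."""
    records = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        match = RECORD.match(line)
        if match is None:
            # Die immediately rather than guess at a misaligned record.
            raise UnrecognizedRecord(f"line {lineno}: {line!r}")
        records.append({
            "description": match["description"],
            "number": int(match["number"]),
            "amount": Decimal(match["amount"].replace(",", "")),
            "total": Decimal(match["total"].replace(",", "")),
        })
    return records
```

In this sketch, each oddball record type found during review would get its own additional pattern, so the parser keeps refusing anything it has not been explicitly taught.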

Likely you won't like hearing how badly data is formatted.

Each tax year, I find new formatted records + randomized formatting of statements that worked perfectly well, the year before.

All this said... If you're running a business related to analyzing data, you'll likely require a very sharp-eyed, deft typist to handle step #4 of this process.
Note that almost every major financial institution today offers CSV or PDF export of statements. They also often offer integration capabilities via APIs but those are usually locked down to only major software developers like the makers of Quicken or similar financial software. Asking users to upload CSVs with headers would probably be THE most reliable and lowest-maintenance route of getting the data into the system.

If the data is coming from different institutions, then you'll have to contend with things like differing header texts (e.g. "DEBIT" vs. "Debit" vs. "DBT" vs. "Out" vs. "débito") or possibly CSV formats where you have only one column for amounts and you can tell whether it's a positive or negative by the formatting (e.g. "(123.00)" and "-123.00" indicate debits while "123.00" indicates a credit). That's all doable, and again, it's a lot more reliable than the "guessing" that OCR does. If there's a new format that the system doesn't understand, you just accept the file and use some "manual processing" status to indicate to the user that they'll have to wait until you manually process the file, and then you develop additional code to handle the new format for future documents.
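As a rough sketch of that normalization, the debit aliases and amount formats below come from the examples above. Everything else is an assumption for illustration: the credit, date and description aliases, the function names, and the "ready" status string.

```python
from decimal import Decimal

# Alias table mapping raw header text to canonical column names.
# Extend it as files from new institutions show up.
HEADER_ALIASES = {
    "debit": "debit", "dbt": "debit", "out": "debit", "débito": "debit",
    "credit": "credit", "in": "credit",
    "date": "date", "description": "description",
}


def normalize_header(header):
    """Map a raw column header to a canonical name, or None if unknown."""
    return HEADER_ALIASES.get(header.strip().lower())


def parse_signed_amount(raw):
    """Return a negative Decimal for debits and a positive one for credits."""
    text = raw.strip().replace(",", "").replace("$", "")
    negative = False
    if text.startswith("(") and text.endswith(")"):
        negative, text = True, text[1:-1]   # "(123.00)" style debit
    elif text.startswith("-"):
        negative, text = True, text[1:]     # "-123.00" style debit
    value = Decimal(text)
    return -value if negative else value


def classify_file(headers):
    """Accept every file, but route unknown layouts to manual processing."""
    unknown = [h for h in headers if normalize_header(h) is None]
    return "manual processing" if unknown else "ready"
```

For example, `classify_file(["Date", "Withdrawal"])` returns `"manual processing"` until someone adds "withdrawal" to the alias table.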

Last thought - I'm "iffy" on using WordPress for anything related to finances. As long as you continually stay up to date on security releases, it's probably fine, but WP is so popular that it's a huge target for hackers. If a new vulnerability is discovered in some version and you're not going in and updating your WP installation regularly, you're going to be exposed until you upgrade your WP site. Luckily, the upgrade process is EXTREMELY simple, but if you're depending on plugins, the plugins might not always be able to be easily updated, which means your process is broken until the plugin author fixes whatever the upgrade problem is.

You also have the possibility that WP is not a security problem but the plugin itself DOES have a security vulnerability, which is usually more dangerous because you don't often have teams of developers working to keep the plugin up to date like WP does.

So if you go the plugin route, you might want to consider hiring someone who knows how to properly develop a WP plugin and knows a fair amount about secure programming and can help ensure that your whole site / process is properly secured. I believe David Favor (previous commenter) does a fair amount of WP work, although I don't know if he develops plugins or not, or if he's even available, but Experts Exchange does have ways for you to connect with and hire an expert developer.

So just be mindful of all this - it sounds like you're working with other people's financial data, which always means you have to be VERY careful. Don't cut corners or try to take routes that seem unusually inexpensive.

newbieweb (Sr. Software Engineer, Author) commented:
Thank you both for your wise commentaries and suggestions. I agree that security is paramount and that creating a garbage report, based on garbage input, is not acceptable. OCR seems like far too hard a road to tackle, when the PDF/CSV route could consistently provide clean data.

IF I used WordPress for this project, and security challenges aside, it sounds as though the primary challenges I face would be to:

1) pre-process the inputted user data into a known intermediary data format, then analyze the data for a final outputted PDF

Does this smaller task fall into the realm of feasibility with WordPress? If so, what kind of plug-in or custom coding might this require?

What other platforms might be better for this project? I am flexible and want to make the best technical choice, one that keeps this as easy as possible.
Yes, WP could do that processing step with a plugin.

I'd suggest posting a gig with the details and your budget. Make sure you are thorough with your requirements. Include the full lifetime of data - who enters the data and how, any kind of variations on data format, how it should be processed, how the processed results should look, how the data should be secured, and how and when the data should be removed from the system (you shouldn't keep it forever). Make sure people submit thorough proposals that address all of those items.
Also, don't skimp on the budget. As a very rough guide, an experienced developer will usually START at around $75 an hour (but could easily be double if they have a lot of experience) and I wouldn't expect less than 2 full days (16 hours) to get this done. Consider how long it will take to get set up (communication overhead - people always underestimate how long it takes to write emails), do the initial coding, test with test data, fix bugs, do the visual layout, and the additional work to do after you test it out and realize you forgot something. So 24 hours might be a more reasonable figure, but that's just my ballpark estimate. Someone might be willing to do it for a flat fee or something (e.g. $500 or $1000) but you'll have to be very explicit and fair with the requirements.