scrubbing data before putting into ms tables

Patricia Timm
Patricia Timm used Ask the Experts™
Need to scrub data for a project. The data might be in excel or a delimited file. Traditionally is the data scrubbed before I place in tables or would I import the file into the access table and then scrub using queries -  by data scrubbing I mean check for completeness, no dups, check for null values, itegrity, etc
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
John TsioumprisSoftware & Systems Engineer
Probably your best option is to try to import the data to an access table (make sure you have enough columns to "accommodate" all the cases) and work your way by using DAO.Recordset for iteration and string function for validation/checking....depending on the complexity you might also use collection objects ( Collection/Dictionaries...etc) or queries.
Scott PletcherSenior DBA
Most Valuable Expert 2018
Top Expert 2014
Typically you do the initial load into staging tables, not the main ("real") table.  As data is scrubbed / confirmed clean, you move it from staging to main / actual production table.
Top Expert 2014
You should define an import table with column/field characteristics that you want to enforce, such as number, string, date.  Also, define whether a column/field can be Null.

You can define a unique index on the columns that define the primary key for the table.
Most Valuable Expert 2015
Distinguished Expert 2018
is the data scrubbed before I place in tables or would I import the file into the access table and then scrub using queries

Neither. Link the file/worksheet.
Then create a normal select query using the linked table as source. Now, step by step, add filters (to rule out invalid records) and expressions to convert the data, for example from text dates to true DateTime values.
Also, apply alias for the fields to change the typical F1, F2, etc. names to meaningful field names.
At any time, you can run the query and study the output and do modifications until the output is exactly as you need it.

Finally, use this query as source for the further processing like updating or appending existing tables.
We call the first table the 'quarantine' table. It has a structure that matches the expected fields from all the sources. We don't put any restrictions like primary keys or validation rules - the goal is to get every record imported.

Once that succeeds, we check for validity and report issues before importing to the main table. We decide whether to allow some of the records to be imported, or reject the whole batch if there's a problem with any of them.

Common scenarios include needing to handle new lookup values, loading parent and child records, incorrect batch totals, or multi-record validation errors.

For diagnostic purposes, we often leave the records in the quarantine table for a time after they've been successfully added to the main table.

Hope this helps!

Armen Stein
J Street Technology
Chief Technology Officer
As you can see from the comments above, the solution you chose can depend on the initial state of your "contaminated data".
If it is so bad that you feel you need to go over it from "scratch" and work it into exactly what you want (which I've had to do several times for client database conversion projects), I use the approach below.  I have to assume that the data is in such a shape that this "cleansing" step is necessary and not an easy matter.
1.  Import the data into Access so you can get full database control of the data.  It is a lot easier to "fix" database data if it is under "database control" vs Excel/csv.
2.  If possible, use a .csv file with an Import Specification where you can define the data types for the columns, being aware that the specification will "cleanse" (convert) the data as specified by the column data types in the specification.  Make SURE that's what you want, or use all text data type so the specification converts nothing, especially if your initial Access import table is all text fields.  You can link the .csv file to Access with an import spec and see if you get what you want.  You can also use the linked file to create a new Access table and see what you get as far as field data types are concerned, etc.
3.  The initial import table should be capable of accepting ALL of the data as-is so you can see what you have and figure out what you want to do with it.  The easiest way to do that is to have a table with all short text fields at 255 size (except where you know the data would require more than 255 characters, use Long Text).  If you use fields with a date/time or numeric data types, you will automatically convert the data in that field to that type, which works if that's what you want, but make sure it converts "properly" and gives you what you want.  One trick is to put a "rowid" autonumber field on the table so you can uniquely identify each record and have an "order of import".
4.  Once you get all that figured out and you have your data in your "quarantine" table, you can then start running query processing on it to "mold" the data into what you want.  Make a copy of your quarantine table before making modifications to the data so you don't have to re-import.  You can also change the field data types with the table in design view and see what you get.
5.  Once you finish that, you can transfer the data into your final table, complete with primary keys, indexes, and specific data types.
6.  Your new table should be perfect at this point.

p.s.  Use Copy-and-Paste from Excel into Access as a very, very "Hail Mary" method of last resort for import, as there can be a lot of problems with this method (such as truncation, invalid conversion, etc.).  If you have to use it, make SURE you get what you want once the data is in Access.


Thanks - appreciate the help

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial