Loading a new database

Posted on 2011-02-13
Last Modified: 2012-05-11
I want to create a new MySQL database that will be loaded from about 50,000 current PayPal records. The PayPal records have the customer's information plus the product information.  I want to split this into a Customer table and an Order table.

My current thinking is to get a CSV from PayPal and input it into a Raw table in MySQL, using phpMyAdmin. Then I could do a SELECT on the Raw table to build a table that just has the customer information fields, and If I add a DISTINCT I should have a table, or "view"  of just the unique customers with their information.  I would then like to feed that data to the Customer table, which will have an auto incrementing  Primary key of Customer_ID and then all the fields for the customer info.  I guess I could output that "view" as a CSV and thne input the CSV file to the new Customer table, but it seems like there should be a better way.

I'm just starting with MySQL and phpMyAdmin so perhaps someone could suggest the "proper" way to build this Customer table from the raw PayPal data.

Then I need to build an Order table, that contains just a single Customer_ID column, in place of all the customer information fields in the Raw data, with the details of each order in the rest of the columns.

Any suggestions on the proper road to down to get this all done would be most welcome.

Question by:steva
  • 3
  • 2

Expert Comment

ID: 34890266
If the raw data is well-defined, i.e. the source fields are all easily identifiable, you can use an INSERT...SELECT statement to extract the data directly from the raw data table into your desired target, e.g. the CUSTOMER table. Not knowing the formats you're using it's hard to be more specific, but here's a simple example;

CREATE TABLE raw_data (raw_text VARCHAR(255);
INSERT INTO raw_data VALUES('John Smith');

This gives us a raw_data table with a single record and a target customer table containing (for the purposes of our simple example) only the custid primary key and first and last name fields.

Now we use an INSERT...SELECT statement:

INSERT INTO customer (id, fname,lname) SELECT DISTINCT NULL, SUBSTRING_INDEX(raw_text,' ',2) FROM raw_data;

Another technique often used by DBAs for handling these types of "one off" conversions is to use SQL to generate SQL. You basically use a combination of queries and literal text strings to get the info you need and build the required INSERT statements in a text file. The basic query, using the above simple example would be:

SELECT CONCAT('INSERT INTO customer (id, fname, lname) VALUES (NULL,\'', LEFT(raw_text,INSTR(raw_text,' ')-1 ),'\',\'',RIGHT(raw_text, INSTR(raw_text,' ')+1),'\');' FROM raw_data;

The benefits of this technique include having a permanent record of the operation (the text file with the SQL In it), and being able to review the intermediate results in the text file and correct any 'gotchas' using your favorite text editor.

Using this technicque along with the data dictionary (e.g. the INFORMATION_SCHEMA database in MySQL) can be very powerful. For example, you can build queries that will create the SQL to recreate your entire schema!

Using this second technique is not quite as easy in MySQL as in some other databases, but there is a good discussion of several ways to use the technique for MySQL given on
Anders Karlsson's blog

Hope that helps!

Expert Comment

ID: 34890327
BTW, building your Orders table can use similar techniques as above. You can join to your new Customer table using whatever unique combination of columns you used to extract the individual Customer data to go and get the corresponding custid primary key to use as a foreign key in your Orders table.

Author Comment

ID: 34893431
Thanks, Paul, for all the information.

I was thinking later, though, why break the raw data into two tables at all? I know if you were starting out with a new database and no data you'd want a customer table that was separate from the order table so you could just have one column in the order table representing the customer.  That is, you wouldn't want to take up the space in every order record for all of the customer info. But space seems to be the only reason not to use the raw PayPal data as is.  It seems that I could certainly do any queries on the single  raw table that I could on the two tables, so if I don't mind using a few extra megaybtes  why bother at all splitting the raw table into two tables?


Accepted Solution

paulwquinn earned 500 total points
ID: 34897493
The process you're describing (splitting the data into two tables, with one record per customer) is known as normalization. The primary reason we normalize a database is to prevent redundant or inaccurate data. With the low cost of storage these days, saving space is of relatively little benefit. But there are many other (more important) reasons for normalizatrion. Databases are typically normalized to make them accurate, fast, and efficient. With normalized data, queries, forms, and reports are generally much easier to design. There are many good references on the Web that explain data normalization, so I won't go into the gory details here (Google is your friend... :^) ), but I'll use a simple example to show some of the issues.

I don't know the exact contents of your PayPal data are, so I'll use a similar (but generic) configuration. Let's suppose that your raw data contains the Customer's address, and you're tasked with preparing a mailing list of customers to send the latest and greatest sales offer. If a Customer has ordered from you three times are you going to send them three mailings? If their address changed between their second and third orders are you going to send to of them to the old (i.e. incorrect) address? If a new order comes in from the same Customer and they've moved again, are you going to store and use that address as well? You could update all of their old orders with the new address, but how can you be sure you got them all or didn't change too many? (Did I mention you have three customers named 'John Doe' in your data?).

So, at the end of the day, it really depends on what you plan to do with the data in the longer term. If yours is a "one off" project, maybe querying the raw data for what you need is OK. In fact, there are legitimate reasons for occasionally denormalizing data to make life easier. However, if your intent is to create a real customer database, I'd recommend splitting up the data. Creating our mailing list (as above) is then straightforward: each customer has only one address. Applying a change of address is easier: the customer's address is only stored in one place. The two "John Does" each have their own customer record,etc.

The important thing is what your plans for using the data are. In the vast majority of cases, if you're going to use the data regularly, normalization is a best practise.If your project is a "one off, fire-and-forget" exercise, you can probably get by without splitting the raw data.

Author Comment

ID: 34899004
Thanks Paul, again,  for the great information.

I was hoping to use DISTINCT in the queries to at least avoid the duplicate problem you mentioned.  But I agree.  Once a customer changes his address it's certainly easier to change it in one record in the customer table than in every record for them in the order table.  

It's going to come down to how difficult it is to split the current data into the two tables. I'll try to split them first, using the suggestions from your first post.

I gave you the points.

I hope you're around the next time I post a MySQL question.


Featured Post

Microsoft Certification Exam 74-409

Veeam® is happy to provide the Microsoft community with a study guide prepared by MVP and MCT, Orin Thomas. This guide will take you through each of the exam objectives, helping you to prepare for and pass the examination.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
Using in clause in query with many values 7 50
MySQL - Limit or Top Records 15 49
RDBMS and No sql database 4 58
How to count in a table in php 22 35
Fore-Foreword Today (2016) Maxmind has a new approach to the distribution of its data sets.  This article may be obsolete.  Instead of using the examples here, have a look at the MaxMind API ( …
I use MySQL for many of my development projects in a Windows environment. To manage my databases (and perform queries) for years I used a tool called MySQL administrator.  This tool has since been replaced by MySQL Workbench. So I decided to m…
This Micro Tutorial will give you a basic overview how to record your screen with Microsoft Expression Encoder. This program is still free and open for the public to download. This will be demonstrated using Microsoft Expression Encoder 4.
Email security requires an ever evolving service that stays up to date with counter-evolving threats. The Email Laundry perform Research and Development to ensure their email security service evolves faster than cyber criminals. We apply our Threat…

776 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question