Loading a new database

I want to create a new MySQL database that will be loaded from about 50,000 current PayPal records. The PayPal records have the customer's information plus the product information.  I want to split this into a Customer table and an Order table.

My current thinking is to get a CSV from PayPal and load it into a Raw table in MySQL, using phpMyAdmin. Then I could do a SELECT on the Raw table to build a table that has just the customer information fields, and if I add a DISTINCT I should have a table, or "view", of just the unique customers with their information. I would then like to feed that data to the Customer table, which will have an auto-incrementing primary key of Customer_ID and then all the fields for the customer info. I guess I could output that "view" as a CSV and then import the CSV file into the new Customer table, but it seems like there should be a better way.
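For the raw-load step, something like the following is what I have in mind (just a rough sketch — the table name and columns are placeholders I made up, and the real PayPal export will have different columns):

```sql
-- Hypothetical raw table; a real PayPal export has many more columns.
CREATE TABLE paypal_raw (
  txn_id     VARCHAR(32),
  name       VARCHAR(100),
  email      VARCHAR(100),
  address    VARCHAR(255),
  item_name  VARCHAR(100),
  amount     DECIMAL(10,2),
  txn_date   DATETIME
);

-- Bulk-load the exported CSV, skipping PayPal's header row.
LOAD DATA LOCAL INFILE 'paypal_export.csv'
INTO TABLE paypal_raw
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
```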

I'm just starting with MySQL and phpMyAdmin so perhaps someone could suggest the "proper" way to build this Customer table from the raw PayPal data.

Then I need to build an Order table that contains just a single Customer_ID column, in place of all the customer information fields in the Raw data, with the details of each order in the rest of the columns.

Any suggestions on the proper road to go down to get this all done would be most welcome.

The process you're describing (splitting the data into two tables, with one record per customer) is known as normalization. The primary reason we normalize a database is to prevent redundant or inaccurate data. With the low cost of storage these days, saving space is of relatively little benefit. But there are many other (more important) reasons for normalization. Databases are typically normalized to make them accurate, fast, and efficient. With normalized data, queries, forms, and reports are generally much easier to design. There are many good references on the Web that explain data normalization, so I won't go into the gory details here (Google is your friend... :^) ), but I'll use a simple example to show some of the issues.

I don't know what the exact contents of your PayPal data are, so I'll use a similar (but generic) configuration. Let's suppose that your raw data contains the Customer's address, and you're tasked with preparing a mailing list of customers to send the latest and greatest sales offer. If a Customer has ordered from you three times, are you going to send them three mailings? If their address changed between their second and third orders, are you going to send two of them to the old (i.e. incorrect) address? If a new order comes in from the same Customer and they've moved again, are you going to store and use that address as well? You could update all of their old orders with the new address, but how can you be sure you got them all or didn't change too many? (Did I mention you have three customers named 'John Doe' in your data?)

So, at the end of the day, it really depends on what you plan to do with the data in the longer term. If yours is a "one off" project, maybe querying the raw data for what you need is OK. In fact, there are legitimate reasons for occasionally denormalizing data to make life easier. However, if your intent is to create a real customer database, I'd recommend splitting up the data. Creating our mailing list (as above) is then straightforward: each customer has only one address. Applying a change of address is easier: the customer's address is only stored in one place. The three 'John Does' each have their own customer record, etc.

The important thing is what your plans for using the data are. In the vast majority of cases, if you're going to use the data regularly, normalization is a best practice. If your project is a "one off, fire-and-forget" exercise, you can probably get by without splitting the raw data.
If the raw data is well-defined, i.e. the source fields are all easily identifiable, you can use an INSERT...SELECT statement to extract the data directly from the raw data table into your desired target, e.g. the CUSTOMER table. Not knowing the formats you're using, it's hard to be more specific, but here's a simple example:

CREATE TABLE raw_data (raw_text VARCHAR(255));
INSERT INTO raw_data VALUES ('John Smith');
CREATE TABLE customer (id INT AUTO_INCREMENT PRIMARY KEY, fname VARCHAR(50), lname VARCHAR(50));

This gives us a raw_data table with a single record and a target customer table containing (for the purposes of our simple example) only the id primary key and first and last name fields.

Now we use an INSERT...SELECT statement:

INSERT INTO customer (id, fname, lname)
SELECT DISTINCT NULL,
       SUBSTRING_INDEX(raw_text, ' ', 1),
       SUBSTRING_INDEX(raw_text, ' ', -1)
FROM raw_data;

Another technique often used by DBAs for handling these types of "one off" conversions is to use SQL to generate SQL. You basically use a combination of queries and literal text strings to get the info you need and build the required INSERT statements in a text file. The basic query, using the above simple example, would be:

SELECT CONCAT('INSERT INTO customer (id, fname, lname) VALUES (NULL,\'',
              SUBSTRING_INDEX(raw_text, ' ', 1), '\',\'',
              SUBSTRING_INDEX(raw_text, ' ', -1), '\');')
FROM raw_data;

The benefits of this technique include having a permanent record of the operation (the text file with the SQL in it), and being able to review the intermediate results in the text file and correct any 'gotchas' using your favorite text editor.

Using this technique along with the data dictionary (e.g. the INFORMATION_SCHEMA database in MySQL) can be very powerful. For example, you can build queries that will create the SQL to recreate your entire schema!
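As a small taste of that, here's a sketch (assuming a schema named 'mydb' — substitute your own) that generates one DROP TABLE statement per table from the data dictionary. As with the INSERT-generating query above, you'd save the output to a file and review it before running anything:

```sql
-- Generate one DROP TABLE statement per table in the 'mydb' schema.
SELECT CONCAT('DROP TABLE ', table_name, ';') AS ddl
FROM INFORMATION_SCHEMA.TABLES
WHERE table_schema = 'mydb';
```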

Using this second technique is not quite as easy in MySQL as in some other databases, but there is a good discussion of several ways to use the technique for MySQL on Anders Karlsson's blog.

Hope that helps!
BTW, building your Orders table can use techniques similar to the above. You can join to your new Customer table, using whatever unique combination of columns you used to extract the individual Customer data, to fetch the corresponding id primary key to use as a foreign key in your Orders table.
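As a sketch (the table and column names here are made up — use whatever combination of fields uniquely identified a customer in your raw data, e.g. the email address if PayPal guarantees one per customer):

```sql
-- Assumed orders table, keyed by the customer's generated id.
CREATE TABLE orders (
  order_id    INT AUTO_INCREMENT PRIMARY KEY,
  customer_id INT,
  item_name   VARCHAR(100),
  amount      DECIMAL(10,2),
  txn_date    DATETIME
);

-- Join each raw row back to customer on the same field(s) used to
-- extract the unique customers, picking up the new id as the FK.
INSERT INTO orders (customer_id, item_name, amount, txn_date)
SELECT c.id, r.item_name, r.amount, r.txn_date
FROM paypal_raw r
JOIN customer c ON c.email = r.email;
```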
stevaAuthor Commented:
Thanks, Paul, for all the information.

I was thinking later, though, why break the raw data into two tables at all? I know if you were starting out with a new database and no data you'd want a customer table that was separate from the order table, so you could have just one column in the order table representing the customer. That is, you wouldn't want to take up the space in every order record for all of the customer info. But space seems to be the only reason not to use the raw PayPal data as is. It seems that I could certainly do any queries on the single raw table that I could on the two tables, so if I don't mind using a few extra megabytes, why bother at all splitting the raw table into two tables?

stevaAuthor Commented:
Thanks Paul, again,  for the great information.

I was hoping to use DISTINCT in the queries to at least avoid the duplicate problem you mentioned. But I agree. Once a customer changes his address, it's certainly easier to change it in one record in the customer table than in every record for them in the order table.
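For example (making up the column name and id), with the split tables a change of address becomes a single statement:

```sql
-- One row changes, and every order for customer 42 sees the new address.
UPDATE customer
SET address = '123 New Street'
WHERE id = 42;
```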

It's going to come down to how difficult it is to split the current data into the two tables. I'll try to split them first, using the suggestions from your first post.

I gave you the points.

I hope you're around the next time I post a MySQL question.

Question has a verified solution.
