Solved

Duplicate ID values in CSV file result in error on import

Posted on 2015-02-04
108 Views
Last Modified: 2015-02-05
I am having problems with a commercial product I purchased, and I'm having trouble getting support from the company. I want to make sure I'm not doing something wrong, so I'm checking here first.

The file I purchased is a large CSV containing postal codes. The first step in the process is to create a table like this:

CREATE TABLE GeoPC_Places (
  ISO VARCHAR(2) NOT NULL,
  Country VARCHAR(50) NOT NULL,
  Language VARCHAR(2) NOT NULL,
  ID BIGINT NOT NULL,
  Region1 VARCHAR(80),
  Region2 VARCHAR(80),
  Region3 VARCHAR(80),
  Region4 VARCHAR(80),
  Locality VARCHAR(80),
  Postcode VARCHAR(15),
  Suburb VARCHAR(80),
  Latitude DECIMAL(10,6),
  Longitude DECIMAL(10,6),
  Elevation INT,
  ISO2 VARCHAR(10),
  FIPS VARCHAR(10),
  NUTS VARCHAR(12),
  HASC VARCHAR(12),
  STAT VARCHAR(20),
  Timezone VARCHAR(30),
  UTC VARCHAR(10),
  DST VARCHAR(10),
  CONSTRAINT GeoPC_Places_pkey PRIMARY KEY CLUSTERED (Language ASC, ID ASC) ON [PRIMARY]
);



I am doing a bulk insert query like this:

BULK INSERT GeoPC_Places
FROM 'C:\Users\bbdesign\documents\GeoPC_EU_Places1.csv'
WITH (
  FIRSTROW = 2,
  FIELDTERMINATOR = ';',
  ROWTERMINATOR = '\n',
  DATAFILETYPE = 'widechar'
);



I get this error message:

Msg 2627, Level 14, State 1, Line 1
Violation of PRIMARY KEY constraint 'GeoPC_Places_pkey'. Cannot insert duplicate key in object 'dbo.GeoPC_Places'. The duplicate key value is (EN, 2018188927).
The statement has been terminated.



When I check the CSV file, there are indeed two records with 2018188927 in the ID column. Upon further checking, I see many records that have non-distinct ID values.

My thinking is that there is something wrong with this file, but since this comes from a seemingly well-established company (with a slow-to-respond support department), I want to make sure the error is not mine.

I am also not sure I actually need the ID column to be distinct. I am mostly concerned with using this file for the postal code, latitude, and longitude values. So if I cannot find any other workaround, maybe I just won't make the ID column a primary key. But removing the CONSTRAINT from the CREATE TABLE statement could make this very large table not perform quite as well?

Any advice?
Thanks!
Question by:bbdesign
6 Comments
 
LVL 65

Expert Comment

by:Jim Horn
ID: 40589818
Can't speak to the company you bought the file from, but a common scenario is to load a file into a temporary 'staging' table without any constraints. Some developers go as far as making every column varchar.

Then, once it's loaded, perform whatever T-SQL validations are necessary to make sure ID values are unique, dates are dates, numbers are numbers, US states are valid US states, etc.

Then deal with the offenders in T-SQL, and finally insert from the 'staging' table into your final production table.
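Roughly, that would look something like this (a sketch only; the staging table name GeoPC_Places_Staging is made up, and the column list is just copied from your CREATE TABLE above):

-- Hypothetical staging table: same columns as GeoPC_Places, but no PRIMARY KEY,
-- so duplicate (Language, ID) pairs load without error.
CREATE TABLE GeoPC_Places_Staging (
  ISO VARCHAR(2) NOT NULL,
  Country VARCHAR(50) NOT NULL,
  Language VARCHAR(2) NOT NULL,
  ID BIGINT NOT NULL,
  Region1 VARCHAR(80),
  Region2 VARCHAR(80),
  Region3 VARCHAR(80),
  Region4 VARCHAR(80),
  Locality VARCHAR(80),
  Postcode VARCHAR(15),
  Suburb VARCHAR(80),
  Latitude DECIMAL(10,6),
  Longitude DECIMAL(10,6),
  Elevation INT,
  ISO2 VARCHAR(10),
  FIPS VARCHAR(10),
  NUTS VARCHAR(12),
  HASC VARCHAR(12),
  STAT VARCHAR(20),
  Timezone VARCHAR(30),
  UTC VARCHAR(10),
  DST VARCHAR(10)
);

-- Same BULK INSERT as in the question, pointed at the staging table instead.
BULK INSERT GeoPC_Places_Staging
FROM 'C:\Users\bbdesign\documents\GeoPC_EU_Places1.csv'
WITH (FIRSTROW = 2, FIELDTERMINATOR = ';', ROWTERMINATOR = '\n', DATAFILETYPE = 'widechar');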
 

Author Comment

by:bbdesign
ID: 40589825
Thanks. Yes, I could definitely do that.

My question is: I'm not doing something stupid here, am I? If their query has that constraint on the table, they shouldn't be providing me with a CSV that contains dozens (if not hundreds) of duplicate ID values, correct?
 
LVL 65

Expert Comment

by:Jim Horn
ID: 40589840
>If their query has that constraint on the table
I see a table definition, not a query.

>they shouldn't be providing me with a CSV that contains dozens (if not hundreds) of duplicate ID values, correct?
Perhaps, but from doing ETL work I've learned not to trust source data unless there's a written contract signed in the vendor's blood, with a firstborn child as collateral, stating that the data is as described by the vendor. So having to validate in this manner is pretty normal.

If this duplication is actually correct data, then they have to back that up, and of course there goes your constraint, with some possible downstream impacts.

>But removing the CONSTRAINT from the CREATE TABLE statement could make this very large table not perform quite as well?
Correct, but based on my first comment, I'd scrub the data so that it conforms to the constraint, then insert it into the table with the constraint.
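For the scrubbing step, a quick way to see the offenders would be something like this (a sketch only, using the made-up GeoPC_Places_Staging name from the earlier example):

-- List every (Language, ID) pair that appears more than once,
-- together with how many rows share that key.
SELECT Language, ID, COUNT(*) AS RowsWithThisKey
FROM GeoPC_Places_Staging
GROUP BY Language, ID
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC, Language, ID;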

 

Author Comment

by:bbdesign
ID: 40589872
I imported the CSV into a table with no constraints. I found there are 90,972 duplicate ID values. I guess I may need to wait for another contact from their support people to see what that is all about (which could be days, if it comes at all). Here is an example (both have ID=2018188927):

GB;United Kingdom;EN;2018188927;England;North West England;Merseyside;Liverpool;Liverpool;L16 3NF;;53.405617;-2.881958;51;GB-LIV;UKH8;UKD52;GB.LV;E08000012;Europe/London;+00:00;+01:00

GB;United Kingdom;EN;2018188927;England;North West England;Merseyside;Liverpool;Liverpool;L16 3NF;;53.405617;-2.881958;22;GB-LIV;UKH8;UKD52;GB.LV;E08000012;Europe/London;+00:00;+01:00

The column header looks like this:

iso;country;language;id;region1;region2;region3;region4;locality;postcode;suburb;latitude;longitude;elevation;iso2;fips;nuts;hasc;stat;timezone;utc;dst

The latitude and longitude values look the same. The only thing that is not a duplicate is the elevation value, and I don't care about elevation. So I suppose I could get rid of the duplicates. I just hate having to modify a supposedly professionally prepared file like this. But as you suggested, perhaps professionally prepared doesn't mean it's going to be right, even if I paid for it.
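Before deduplicating, I suppose I could verify that assumption with something like this against my unconstrained import table (calling it GeoPC_Places_Staging here); if it returns no rows, the duplicates never disagree on the postcode or coordinates:

-- Duplicate (Language, ID) groups where postcode or coordinates differ.
-- An empty result means the duplicates differ only in columns like Elevation.
SELECT Language, ID
FROM GeoPC_Places_Staging
GROUP BY Language, ID
HAVING COUNT(*) > 1
   AND (   COUNT(DISTINCT Postcode)  > 1
        OR COUNT(DISTINCT Latitude)  > 1
        OR COUNT(DISTINCT Longitude) > 1 );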
 
LVL 69

Accepted Solution

by:Scott Pletcher (earned 500 total points)
ID: 40589926
I strongly agree with using a staging table.  

Load the raw file into a staging table, like "GeoPC_Places__Staging", then use a query to load the final "GeoPC_Places" table.

Drop the Elevation column from the GeoPC_Places table, since you don't need it anyway, then try loading it this way and see if you still get dups:

INSERT INTO GeoPC_Places
SELECT DISTINCT
ISO,
Country,
Language,
ID,
Region1,
Region2,
Region3,
Region4,
Locality,
Postcode,
Suburb,
Latitude,
Longitude,
ISO2,
FIPS,
NUTS,
HASC,
STAT,
Timezone,
UTC,
DST
FROM GeoPC_Places__Staging
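
If it turns out some duplicate (Language, ID) pairs differ in more than just Elevation, SELECT DISTINCT alone won't collapse them. In that case, one option (just a sketch; change the ORDER BY to whatever defines the row you want to keep) is to keep exactly one row per key with ROW_NUMBER():

-- Keep exactly one row per (Language, ID); the ORDER BY here is arbitrary
-- (highest Elevation first) and only decides which duplicate survives.
;WITH Ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Language, ID
                              ORDER BY Elevation DESC) AS rn
    FROM GeoPC_Places__Staging
)
INSERT INTO GeoPC_Places
SELECT ISO, Country, Language, ID, Region1, Region2, Region3, Region4,
       Locality, Postcode, Suburb, Latitude, Longitude,
       ISO2, FIPS, NUTS, HASC, STAT, Timezone, UTC, DST
FROM Ranked
WHERE rn = 1;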
 

Author Comment

by:bbdesign
ID: 40591180
Thanks for your advice!
