Solved

Duplicate ID values in CSV file result in error on import

Posted on 2015-02-04
106 Views
Last Modified: 2015-02-05
I am having problems with a commercial product I purchased, and I'm having trouble getting support from the company. I want to make sure I'm not doing something wrong, so I'm checking here.

The file I purchased is a large CSV containing postal codes. The first step in the process is to create a table like this:

CREATE TABLE GeoPC_Places (
  ISO VARCHAR(2) NOT NULL,
  Country VARCHAR(50) NOT NULL,
  Language VARCHAR(2) NOT NULL,
  ID BIGINT NOT NULL,
  Region1 VARCHAR(80),
  Region2 VARCHAR(80),
  Region3 VARCHAR(80),
  Region4 VARCHAR(80),
  Locality VARCHAR(80),
  Postcode VARCHAR(15),
  Suburb VARCHAR(80),
  Latitude DECIMAL(10,6),
  Longitude DECIMAL(10,6),
  Elevation INT,
  ISO2 VARCHAR(10),
  FIPS VARCHAR(10),
  NUTS VARCHAR(12),
  HASC VARCHAR(12),
  STAT VARCHAR(20),
  Timezone VARCHAR(30),
  UTC VARCHAR(10),
  DST VARCHAR(10),
  CONSTRAINT GeoPC_Places_pkey PRIMARY KEY CLUSTERED (Language ASC, ID ASC) ON [PRIMARY]
);

I am doing a bulk insert query like this:

BULK INSERT GeoPC_Places FROM 'C:\Users\bbdesign\documents\GeoPC_EU_Places1.csv' WITH (FIRSTROW = 2, FIELDTERMINATOR = ';', ROWTERMINATOR = '\n', DATAFILETYPE = 'widechar')

I get this error message:

Msg 2627, Level 14, State 1, Line 1
Violation of PRIMARY KEY constraint 'GeoPC_Places_pkey'. Cannot insert duplicate key in object 'dbo.GeoPC_Places'. The duplicate key value is (EN, 2018188927).
The statement has been terminated.

When I check the CSV file, there are indeed two records with 2018188927 in the ID column. Upon further checking, I see many records that have non-distinct ID values.

My thinking is that there is something wrong with this file, but since this comes from a seemingly well-established company (with a slow-to-respond support department), I want to make sure the error is not mine.

I am also not quite sure I actually need the ID column to be unique. I am mostly concerned with using this file for the postal code, latitude, and longitude values. So if I cannot find any other workaround, maybe I could leave the ID column out of the primary key. But could removing the CONSTRAINT part of the "create table" query make this very large table perform worse?

Any advice?
Thanks!
Question by:bbdesign
6 Comments
 
LVL 65

Expert Comment

by:Jim Horn
Can't speak to the company you bought the file from, but a common approach is to load the file into a temporary 'staging' table without any constraints. Some developers even go so far as to make every column varchar.

Then, once it's loaded, perform whatever T-SQL validations are necessary to make sure ID values are unique, dates are dates, numbers are numbers, US states are valid US states, etc.

Then deal with the offenders in T-SQL, and finally insert from the 'staging' table into your final production table.
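
As a rough sketch of what that first load might look like (the staging table name and column sizes here are just placeholders, and the columns are nvarchar only because your file is loaded as widechar):

CREATE TABLE GeoPC_Places_Staging (
  ISO NVARCHAR(10),
  Country NVARCHAR(100),
  Language NVARCHAR(10),
  ID NVARCHAR(30),
  Region1 NVARCHAR(100),
  Region2 NVARCHAR(100),
  Region3 NVARCHAR(100),
  Region4 NVARCHAR(100),
  Locality NVARCHAR(100),
  Postcode NVARCHAR(30),
  Suburb NVARCHAR(100),
  Latitude NVARCHAR(30),
  Longitude NVARCHAR(30),
  Elevation NVARCHAR(30),
  ISO2 NVARCHAR(30),
  FIPS NVARCHAR(30),
  NUTS NVARCHAR(30),
  HASC NVARCHAR(30),
  STAT NVARCHAR(50),
  Timezone NVARCHAR(50),
  UTC NVARCHAR(20),
  DST NVARCHAR(20)
);

-- Same load as before, but into the unconstrained staging table
BULK INSERT GeoPC_Places_Staging FROM 'C:\Users\bbdesign\documents\GeoPC_EU_Places1.csv' WITH (FIRSTROW = 2, FIELDTERMINATOR = ';', ROWTERMINATOR = '\n', DATAFILETYPE = 'widechar');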
 

Author Comment

by:bbdesign
Thanks. Yes, I could definitely do that.

My question is: I'm not doing something stupid here, am I? If their query has that constraint on the table, they shouldn't be providing me with a CSV that contains dozens (if not hundreds) of duplicate ID values, correct?
 
LVL 65

Expert Comment

by:Jim Horn
>If their query has that constraint on the table
I see a table definition, not a query.

>they shouldn't be providing me with a CSV that contains dozens (if not hundreds) of duplicate ID values, correct?
Perhaps, but doing ETL work I've learned not to trust source data unless there's a written contract, signed in the vendor's blood with a firstborn child as collateral, stating that the data is as described by the vendor. So having to validate in this manner is pretty normal.

If this duplication is actually correct data, then they have to back that up, and of course there goes your constraint, with some possible downstream impacts.

>But could removing the CONSTRAINT part of the "create table" query make this very large table perform worse?
Correct, but based on my first comment, I'd scrub the data so that it conforms to the constraint, then insert it into the table with the constraint.
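
As a sketch of the kind of validation I mean (assuming a staging table like the one sketched above; swap in whatever names you actually use):

-- List every (Language, ID) pair that would violate the primary key
SELECT Language, ID, COUNT(*) AS RowsWithThisKey
FROM GeoPC_Places_Staging
GROUP BY Language, ID
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC;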

Author Comment

by:bbdesign
I imported the CSV into a table with no constraints. I found there are 90,972 duplicate ID values. I guess I may need to wait for another contact from their support people to see what that is all about (which could be days, if it comes at all). Here is an example (both have ID=2018188927):

GB;United Kingdom;EN;2018188927;England;North West England;Merseyside;Liverpool;Liverpool;L16 3NF;;53.405617;-2.881958;51;GB-LIV;UKH8;UKD52;GB.LV;E08000012;Europe/London;+00:00;+01:00

GB;United Kingdom;EN;2018188927;England;North West England;Merseyside;Liverpool;Liverpool;L16 3NF;;53.405617;-2.881958;22;GB-LIV;UKH8;UKD52;GB.LV;E08000012;Europe/London;+00:00;+01:00

The column header looks like this:

iso;country;language;id;region1;region2;region3;region4;locality;postcode;suburb;latitude;longitude;elevation;iso2;fips;nuts;hasc;stat;timezone;utc;dst

The latitude and longitude values look the same. The only field that differs is the "elevation" value, and I don't care about elevation. So I suppose I could get rid of the duplicates. I just hate having to modify a supposedly professionally prepared file like this. But as you suggested, perhaps professionally prepared doesn't mean it's going to be right, even if I paid for it.
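
Queries roughly like these, run against the unconstrained import table (using the staging layout sketched above; names are illustrative), reproduce that count and pull one group for a side-by-side comparison:

-- Count how many (Language, ID) keys appear more than once
SELECT COUNT(*) AS DuplicateKeys
FROM (
    SELECT Language, ID
    FROM GeoPC_Places_Staging
    GROUP BY Language, ID
    HAVING COUNT(*) > 1
) AS dups;

-- Pull all rows for one offending ID to compare the columns
SELECT *
FROM GeoPC_Places_Staging
WHERE Language = 'EN' AND ID = '2018188927';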
 
LVL 69

Accepted Solution

by:ScottPletcher (earned 500 total points)
I strongly agree with using a staging table.  

Load the raw file into a staging table, like "GeoPC_Places__Staging", then use a query to load the final "GeoPC_Places" table.

Drop the Elevation column from the GeoPC_Places table, since you don't need it anyway, then try loading the table this way and see if you still get dups:

INSERT INTO GeoPC_Places
SELECT DISTINCT
ISO,
Country,
Language,
ID,
Region1,
Region2,
Region3,
Region4,
Locality,
Postcode,
Suburb,
Latitude,
Longitude,
ISO2,
FIPS,
NUTS,
HASC,
STAT,
Timezone,
UTC,
DST
FROM GeoPC_Places__Staging
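
If some duplicate groups turn out to differ in more than just Elevation (so DISTINCT would still leave duplicate keys), one possible fallback, offered here only as a sketch, is to keep exactly one row per (Language, ID) with ROW_NUMBER:

-- Keep one arbitrary row per (Language, ID); the ORDER BY is just a tie-break
WITH Ranked AS (
    SELECT s.*,
           ROW_NUMBER() OVER (PARTITION BY s.Language, s.ID
                              ORDER BY s.Elevation) AS rn
    FROM GeoPC_Places__Staging AS s
)
INSERT INTO GeoPC_Places
    (ISO, Country, Language, ID, Region1, Region2, Region3, Region4,
     Locality, Postcode, Suburb, Latitude, Longitude,
     ISO2, FIPS, NUTS, HASC, STAT, Timezone, UTC, DST)
SELECT ISO, Country, Language, ID, Region1, Region2, Region3, Region4,
       Locality, Postcode, Suburb, Latitude, Longitude,
       ISO2, FIPS, NUTS, HASC, STAT, Timezone, UTC, DST
FROM Ranked
WHERE rn = 1;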
 

Author Comment

by:bbdesign
Thanks for your advice!
