trevor1940 asked:

PostgreSQL: Copy from File missing data error

I'm trying to build a PostgreSQL/PostGIS database using the Entire Country Files dataset, but I'm getting a "missing data" error.

I'm wondering if the COPY command is affected by diacritics, or if I've not set the database up properly.
I created a new database with UTF8 encoding.

I built the table schema based on the given format (but using type text for NM_MODIFY_DATE rather than varchar; having these columns as dates didn't make a difference).
The PostgreSQL docs state that columns are nullable by default, so:

create table my_table(
  test1 text,          -- can be NULL
  test2 text NOT NULL  -- has to have a value
);


So in theory, every column apart from id can be empty.
I used a large text viewer to open the uncompressed countries.txt file and copied the top 5 rows into a test file.

Using PostgreSQL COPY, this test file imported correctly, so I know my schema is correct:
copy my_table(List of columns ) from '\\Path\\To\\TestFile.txt' WITH delimiter E'\t' csv HEADER;


When I tried to ingest the larger countries.txt (2.9GB) file, I got an error: missing data for column xyz at line 12345 (the last column with data in that row, NM_MODIFY_DATE).

Using the large text viewer again, I located this line, copied it with the preceding and following lines into my test file, and tried the copy again, but I get the same error.

I opened the test file in Excel to see if there were columns missing; although not every row has data in each column, they all match up.
The problem row has UFI -3373955 & UNI 6329083.

I can't edit the large text file to remove the problem row (the software isn't available on that system). Also, the file is UTF8 and I don't want to corrupt the character set.

I contemplated writing a Perl script to parse the file, replacing the tabs with '|' pipes (and removing the problem row). Although Perl natively reads/writes UTF8, I don't want to risk corrupting the file, especially the diacritics.

I'm going to ask the Perl experts.

PostgreSQL doesn't have the ability to skip problem rows; if there is a problem, the whole file fails to copy, so there could be other rows with errors.

Anyone got any ideas?
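One possible way to find all the bad rows up front, sketched here with assumed names and an assumed column count (neither comes from the thread): load each raw line unsplit into a one-column staging table, using a delimiter byte that never occurs in the data, then compare each line's tab count against the expected number of fields.

```sql
-- Sketch only: "staging" and the field count 36 are assumptions,
-- not values from this thread.
create table staging(raw_line text);

-- E'\x01' is assumed never to appear in the file, so COPY leaves each
-- whole line in raw_line instead of splitting it on tabs.
copy staging(raw_line) from '\\Path\\To\\countries.txt' WITH delimiter E'\x01';

-- Lines whose tab-separated field count differs from the expected 36
-- are the ones COPY would reject with "missing data" or "extra data".
SELECT raw_line
FROM   staging
WHERE  array_length(string_to_array(raw_line, E'\t'), 1) <> 36;
```

Note that COPY's default text format still treats backslashes as escapes, so this sketch also assumes the file contains none.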
Martyn Spencer replied:

I am going to suggest that you ask this on the PostgreSQL IRC channel. There are a few people there who are incredibly knowledgeable about this type of question and would probably have an answer for you quite quickly.

Edit: If you want more answers here, it may help to include a file that contains a row that works and one where it does not. I imagine it would also help if you go down the IRC channel route.
trevor1940 replied:


Postgresql IRC channel

Do you mean here?

I don't know if this is relevant, but looking at the database properties, the 'Collation' & 'Character type' are both set to "English_United Kingdom, 1252". I didn't set this, and when creating a new DB the options are "C", "English_United Kingdom, 1252" or "POSIX".

BTW I'm on PostgreSQL 9.0
Yes, the IRC channel is on Freenode. The people there are generally very helpful and also quick to respond. I believe that many of the devs can be found there regularly.

It may well be that if you are using a certain encoding that it would cause problems. This would more likely be the case if you are using a single byte encoding scheme and are attempting to store UTF characters. I have not tested this theory for myself and a quick scan of the docs did not indicate whether or not it would be a problem.
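One quick thing to rule out, as a general check rather than anything specific to this file: COPY interprets the input file using the client encoding and converts it to the server encoding, so it is worth confirming both before importing:

```sql
SHOW server_encoding;            -- what the database stores
SHOW client_encoding;            -- what COPY assumes the input file is in
SET client_encoding = 'UTF8';    -- make COPY read the file as UTF-8
```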

I generally create my databases using something like the following:

CREATE DATABASE "database_name"
  WITH OWNER = owner_name
       ENCODING = 'UTF8'
       TABLESPACE = pg_default
       LC_COLLATE = 'en_GB.UTF-8'
       LC_CTYPE = 'en_GB.UTF-8'


Naturally, the values of each parameter may need changing to suit your environment, but the ENCODING, LC_COLLATE and LC_CTYPE are relevant to you.
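If you are unsure what an existing database actually ended up with, the system catalog records it; a query along these lines shows the encoding and locale per database:

```sql
-- Show encoding, collation and character-type locale for every database.
SELECT datname,
       pg_encoding_to_char(encoding) AS encoding,
       datcollate,
       datctype
FROM   pg_database;
```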

The PostgreSQL help pages do advise which character sets are supported.
Had to do this from home.
I clicked on the link; a pop-up box says mibbit, Enter nickname: Terminated.

I wasn't able to get the IRC channel to work in any browser, so I joined the general mailing list and emailed it instead.
I tried the following:

CREATE DATABASE "database_name"
  WITH OWNER = owner_name
       ENCODING = 'UTF8'
       TABLESPACE = pg_default
       LC_COLLATE = 'en_GB.UTF-8'
       LC_CTYPE = 'en_GB.UTF-8'

CREATE DATABASE "database_name"
  WITH OWNER = owner_name
       ENCODING = 'UTF8'
       TABLESPACE = pg_default
       LC_COLLATE = 'en_GB.UTF8'
       LC_CTYPE = 'en_GB.UTF8'


Both give "ERROR: invalid locale name" (for en_GB.UTF8 / en_GB.UTF-8 respectively).

The doc you linked to suggests  "On Windows, however, UTF-8 encoding can be used with any locale."

This page says it isn't possible to set the locale to en_GB / en_US.

If I'm understanding correctly, you don't have to, because UTF-8 encoding is valid with English_United Kingdom, 1252.

The general mailing list hasn't helped yet; the suggestions were to upgrade, and to use pg_bulkload, which is unstable and isn't supported under Windows.

However, one suggestion is a special character not being escaped properly.

I assumed the COPY command would handle this?




The script that I use works with en_GB.UTF-8, since I am using Linux and it is a valid choice there. LC_COLLATE and LC_CTYPE are probably defaulted and tied to the OS you are using; you could omit them and PostgreSQL will usually infer them. I would have expected the file you are importing to be escaped appropriately, and so long as the database encoding is set to UTF8 I would expect it to handle any UTF characters you wish to import.

Do you have the script that you used to generate the table? Do you mind providing it? I will take a look at importing the file myself and report back my findings.
Do you have the script that you used to generate the table? Do you mind providing it? I will take a look at importing the file myself and report back my findings.

That isn't simple to do; it's on a closed system.

I can retype it here if that will help?
Yes, please. I would like to create the same table in one of my databases and import the same file to see first-hand what is happening, if that's OK?
CREATE DATABASE "database_name"
  WITH OWNER = postgres
       ENCODING = 'UTF8'
       TABLESPACE = pg_default
      TEMPLATE = postgis
       LC_COLLATE = 'English_United Kingdom, 1252'
       LC_CTYPE = 'English_United Kingdom, 1252'

CREATE TABLE  features(
lat REAL,
long REAL,
dms_lat text,
dms_long text,
mgrs CHARACTER(15); 
ccl CHARACTER(255);
adm1 CHARACTER(2);
elev REAL,
cc2 CHARACTER(255);
short_form CHARACTER(128);
generic CHARACTER(28);
short_name_ro  CHARACTER(255);
full_name_ro  CHARACTER(255);
full_name_nd_ro  CHARACTER(255);
short_name_rg  CHARACTER(255);
full_name_rg  CHARACTER(255);
full_name_nd_rg  CHARACTER(255);
note TEXT,
mod_date TEXT;
name_rank  TEXT;
name_link  TEXT;
transl_cd  TEXT;
nm_modify_date  TEXT;
f_effective_dt  TEXT;
f_term_dt TEXT



The above has been typed in, so it may contain typos; it was originally taken from the header row of countrynames.txt.
The column types are from the given format.

I then used the copy command as in my opening remarks.

hope that helps
Thanks. Taking a look now. I have replaced the semicolons with commas in your script. I would also (for your own reference) change your longer "character" columns to "character varying", since the PostgreSQL docs imply that the extra storage required by "character" can impose a small performance penalty. This will not be related to the error you are experiencing.
I have replaced the semicolons with commas in your script

Oops, sorry, I did say there might be typos.

I downloaded the separate country files and found the problem row.

Looking at it, I'm wondering if I've been barking up the wrong tree and it's nothing to do with UTF-8, but more to do with quotes not being escaped.

4	-3373955	6329083	32.739348	64.994051	32:44:22N	64:59:39E	41SPS8683824150	NI41-16	P	PPL		AF	10				V	rus			DEKHYAK	Dekh”yak	Dekh"yak	DEKHYAK	Dekh”yak	Dekh"yak		2010-05-03	1,2,3,4,5				2005-06-01


It may not be obvious from this, but it looks as if there are 2 different double quotes, not 2 singles as I thought. I've uploaded a snip which is clearer.
If this is the issue, I'm guessing the only way to resolve it will be via a Perl script.
It looks as though when the file was generated it used tab delimiters, but they have also duplicated the quotes. With tab delimiters I would not have expected this to be necessary, and I would not have thought it necessary to escape them either, since there is no conflict with the chosen delimiter. What I expect and what is may not prove to be the same thing, though. If the usage of quotes is completely consistent, you could more than likely use sed to modify the file (or a Perl script if you prefer). I will still have a go at importing the file shortly and see if there is a way of handling the file as-is, rather than tweaking it. This is for personal interest now :)
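If the stray quotes do turn out to be the cause, one commonly suggested workaround (a sketch, not necessarily the eventual fix in this thread) is to keep csv mode for the sake of the HEADER option, but set QUOTE to a character that cannot occur in the data, so embedded double quotes are read literally:

```sql
-- QUOTE E'\b' (backspace) is assumed never to appear in the file, so
-- no field is ever treated as quoted and " passes through unchanged.
copy my_table(List of columns) from '\\Path\\To\\countries.txt'
  WITH delimiter E'\t' csv HEADER QUOTE E'\b';
```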
I agree with you. I'd rather use the data in its raw form: first, so my actions don't corrupt the data (I'm confident they won't, but it's a possibility, especially when dealing with diacritics and large data sets); second, we may wish to update this periodically.

"sed" Is that a Linux  command?
sed is a Linux command, yes, but there is a Windows alternative (probably others exist as well), and if you look at Cygwin, you would definitely be able to use it. Cygwin is basically a command interpreter environment that lets you run Linux commands on Windows. I used it a lot before I transitioned to Linux for my main environment.
Martyn Spencer replied with the accepted solution (content available to members only).

Thanks very much!

That appears to be working; I've left it running.
I did use an id field, primarily because it's a requirement of QGIS and makes locating rows simple.
Excellent. Glad to hear that it's working for you :)