Avatar of trevor1940
trevor1940
 asked on

PostgreSQL: Copy from File missing data error

I'm trying to build a PostgreSQL/PostGIS database using Entire country files dataset but I'm getting missing data error

I'm wondering if the copy command is affected by diacritics or I've not set the database up properly
Created a new database with UTF8 encoding

I built the table schema based on the given format (but using type text for NM_MODIFY_DATE not varchar,  having these as dates didn't make a diferance)
PostgreSQL docs state fields can be NULL by default  so

create table my_table(
id SERIAL PRIMARY KEY,
test1 text;  -- can be null
test2 text NOT NULL;  -- Has to have a value
}

Open in new window


In my  theory every column apart from id doesn't have to contain a value
 
I used large text viewer to open the uncompressed countries.txt file and copied the top 5 rows into a test file

Using   PostgreSQL Copy this test file imported correctly so I know my schema is correct
copy my_table(List of columns ) from '\\Path\\To\\TestFile.txt' WITH delimiter E'\t' csv HEADER;

Open in new window



When tried to ingest the larger  countries.txt (2.9GB) file I get an error "missing data" for column xyz at line 12345 (Last column with  data in that row, NM_MODIFY_DATE)

Using large text viewer again I located this line with proceeding and following lines into my test file and tried the copy again but get the same error

I opened the test file in excel to  see if there is columns  missing although not every row has data in each column they do  all match
 
the problem row has UFI -3373955 & UNI 6329083  

I can't edit the large text file to remove the problem row (Software not available on system) also the file is UTF8 and I don't want to corrupt the character set


I contemplated writing a perl script to parse the file replacing the tab with a '|' pipe (and removing the problem row,  although perl nativity reads / writes in UTF8 I don't want to possibly corrupt the file especially the diacritics

I'm going to ask perl experts

PostGreSQL doesn't have the ability to skip problem rows if there is a problem the whole file fails to copy so there could be other rows that errors

Any one got any ideas?
PostgreSQLDatabases

Avatar of undefined
Last Comment
Martyn Spencer

8/22/2022 - Mon
Martyn Spencer

I am going to suggest that you ask this on the Postgresql IRC channel. There are a few guys there who are incredibly knowledgeable about this type of question and would probably have an answer for you quite quickly.

Edit: It may be helpful if you want some more answers here to include a file that contains a row that works and one where it does not. I imagine it would also help if you go down the IRC channel route.
trevor1940

ASKER
Postgresql IRC channel

Do you mean here?

I don't know if this is relevant but looking at the database properties the 'collection' &  'Character type' are both set as "English_United Kingdom, 1252 " I  didn't set this and creating a  new DB the options are "C",  "English_United Kingdom, 1252 " or "POSIX"

BTW I'm on PostgreSQL 9.0
Martyn Spencer

Yes, the IRC channel is on Freenode. The people there are generally very helpful and also quick to respond. I believe that many of the devs can be found there regularly.

It may well be that if you are using a certain encoding that it would cause problems. This would more likely be the case if you are using a single byte encoding scheme and are attempting to store UTF characters. I have not tested this theory for myself and a quick scan of the docs did not indicate whether or not it would be a problem.

I generally create my databases using something like the following:

CREATE DATABASE "database_name"
  WITH OWNER = owner_name
       ENCODING = 'UTF8'
       TABLESPACE = pg_default
       LC_COLLATE = 'en_GB.UTF-8'
       LC_CTYPE = 'en_GB.UTF-8'
       CONNECTION LIMIT = -1;

Open in new window

Naturally, the values of each parameter may need changing to suit your environment, but the ENCODING, LC_COLLATE and LC_CTYPE are relevant to you.

The Postgresql help pages do advise which character sets are supported: https://www.postgresql.org/docs/9.0/static/multibyte.html
Experts Exchange has (a) saved my job multiple times, (b) saved me hours, days, and even weeks of work, and often (c) makes me look like a superhero! This place is MAGIC!
Walt Forbes
trevor1940

ASKER
Hi
Had to do this from home
clicked on the link pop up box says mibbit
Enter nickname
irc.freenode.net: Terminated

I wasn't able to get the irc chanel to work in any browser so I joined and sent email to the general mailing list
trevor1940

ASKER
I tried the following

CREATE DATABASE "database_name"
  WITH OWNER = owner_name
       ENCODING = 'UTF8'
       TABLESPACE = pg_default
       LC_COLLATE = 'en_GB.UTF-8'
       LC_CTYPE = 'en_GB.UTF-8'
       CONNECTION LIMIT = -1;
-- AND THIS

CREATE DATABASE "database_name"
  WITH OWNER = owner_name
       ENCODING = 'UTF8'
       TABLESPACE = pg_default
       LC_COLLATE = 'en_GB.UTF8'
       LC_CTYPE = 'en_GB.UTF8'
       CONNECTION LIMIT = -1;

Open in new window


both give "ERROR: invalid local name en_GB.UTF8  /  en_GB.UTF-8"

The doc you linked to suggests  "On Windows, however, UTF-8 encoding can be used with any locale."

This Page it isn't possible to set local to  en_GB / en_US

If I'm understanding correctly you don't have to because  UTF-8 encoding is valid with English_United Kingdom, 1252

the general mailing list hasn't yet helped with suggestions of upgrading & to use the unstable pg_bukloader and isn't supported under windows

However 1 suggestion is an special character not being escaped properly

I assumed the copy command would  handle this??

eg

\tO'Tool\t

-- doesn't need to be escaped to

\tO\'Tool\t

Open in new window

Martyn Spencer

The script that I use works with en_GB.UTF-8, since I am using Linux and it is a valid choice. The LC_COLLATE and LC_CTYPE are probably a default and tied to the client OS you are using. You could omit these and you will find that Postgresql will usually infer them. I would have expected that the file you are importing would be escaped appropriately. So long as the database encoding is set to UTF8 I would expect that it would handle any UTF characters you wish to import.

Do you have the script that you used to generate the table? Do you mind providing it. I will take a look at importing the file myself and report back my findings.
Get an unlimited membership to EE for less than $4 a week.
Unlimited question asking, solutions, articles and more.
trevor1940

ASKER
Do you have the script that you used to generate the table? Do you mind providing it. I will take a look at importing the file myself and report back my findings.

That isn't simple to do it's on a closed system

I can retype it here if that will help?
Martyn Spencer

Yes, please. I would like to create the same table in one of my databases and import the same file to see first-hand what is happening, if that's OK?
trevor1940

ASKER
CREATE DATABASE geonames
  WITH OWNER = postgres
       ENCODING = 'UTF8'
       TABLESPACE = pg_default
      TEMPLATE = postgis
       LC_COLLATE = 'English_United Kingdom, 1252'
       LC_CTYPE = 'English_United Kingdom, 1252'
       CONNECTION LIMIT = -1;


CREATE TABLE  features(
  id SERIAL PRIMARY KEY,
rc INTEGER,
ufi INTEGER,
uni INTEGER,
lat REAL,
long REAL,
dms_lat text,
dms_long text,
mgrs CHARACTER(15); 
jog CHARACTER(7);
fc CHARACTER(1);
dsg CHARACTER(5);
pc INTEGER,
ccl CHARACTER(255);
adm1 CHARACTER(2);
pop INTERGER,
elev REAL,
cc2 CHARACTER(255);
nt CHARACTER(2);
lc CHARACTER(3);
short_form CHARACTER(128);
generic CHARACTER(28);
short_name_ro  CHARACTER(255);
full_name_ro  CHARACTER(255);
full_name_nd_ro  CHARACTER(255);
short_name_rg  CHARACTER(255);
full_name_rg  CHARACTER(255);
full_name_nd_rg  CHARACTER(255);
note TEXT,
mod_date TEXT;
name_rank  TEXT;
name_link  TEXT;
transl_cd  TEXT;
nm_modify_date  TEXT;
f_effective_dt  TEXT;
f_term_dt TEXT


);

Open in new window


the above has been typed in so may contain typo's it was ordinarily taken from the header row of the countrynames.txt
the column types from given format

I then used the copy command as in opening remarks

hope that helps
I started with Experts Exchange in 2004 and it's been a mainstay of my professional computing life since. It helped me launch a career as a programmer / Oracle data analyst
William Peck
Martyn Spencer

Thanks. Taking a look now. I have replaced the semicolons with commas in your script. I would also (for your own reference) change your longer "character" columns to "character varying" since the Postgresql docs imply that the extra storage required by "character" can impose a small performance penalty. This will not be related to the error you are experiencing.
trevor1940

ASKER
I have replaced the semicolons with commas in your script

Ooops sorry I did say there may be typo's

I down loaded the separate country files and found the problem row

looking at it I'm wondering if I've been barking up the wrong tree and nothing to with UTF-8 but more to do with not escaping the quotes

4	-3373955	6329083	32.739348	64.994051	32:44:22N	64:59:39E	41SPS8683824150	NI41-16	P	PPL		AF	10				V	rus			DEKHYAK	Dekh”yak	Dekh"yak	DEKHYAK	Dekh”yak	Dekh"yak		2010-05-03	1,2,3,4,5				2005-06-01

Open in new window


It may not be obvious from this it looks as if there are 2 different double quotes not 2 singles as I thought I've uploaded a snip which is clearer GeoN.JPG
If this is the issue I'm guessing the only way to resolve this will via a perl script
Martyn Spencer

It looks as though when the file was generated it used tab delimiters but they have also duplicated the quotes. With tab delimiters I would not have expected it to be necessary to do this and I would not have thought it necessary to escape them either, since there is no conflict with the delimiter chosen. What I expect, and what is, may not necessarily prove to be the case. If the usage of quotes is completely consistent, you could more than likely use sed to modify the file (or a perl script if you prefer). I will still have a go at importing the file shortly and see if there is a way of handling the file as-is, rather than tweaking it. This is for personal interest now :)
Get an unlimited membership to EE for less than $4 a week.
Unlimited question asking, solutions, articles and more.
trevor1940

ASKER
I agree with you I'd rater use data in it's raw form 1st so my actions don't corrupt the data, I'm confidant they won't but it's a possibility especially when dealing with diacritics and large data sets 2nd We may wish to update this periodicaly

"sed" Is that a Linux  command?
Martyn Spencer

sed is a Linux command, yes. But there is a Windows alternative here (probably others exist as well) and if you look at Cygwin, you would definitely be able to use it. Cygwin is basically a command interpreter environment that lets you run Linux commands on Windows. I used it a lot before i transitioned to Linux for my main environment.
ASKER CERTIFIED SOLUTION
Martyn Spencer

Log in or sign up to see answer
Become an EE member today7-DAY FREE TRIAL
Members can start a 7-Day Free trial then enjoy unlimited access to the platform
Sign up - Free for 7 days
or
Learn why we charge membership fees
We get it - no one likes a content blocker. Take one extra minute and find out why we block content.
Not exactly the question you had in mind?
Sign up for an EE membership and get your own personalized solution. With an EE membership, you can ask unlimited troubleshooting, research, or opinion questions.
ask a question
trevor1940

ASKER
WOW!

Thanx very much

That appears to be working I've left it running
I did use an id field primarily because it's a requirement of QGIS and simple row location
All of life is about relationships, and EE has made a viirtual community a real community. It lifts everyone's boat
William Peck
Martyn Spencer

Excellent. Glad to hear that it's working for you :)