We help IT Professionals succeed at work.
Troubleshooting Question

PostgreSQL / PostGIS: Joining two tables using nearest neighbor

42 Views
Last Modified: 2020-10-12
I have two tables with point geometry both with SRID 4326  (WGS84)

How do refine the query bellow to loop through every row / point in the pointonsurface  table locating it's  nearest neighbor  in the geonames_org table?

Given this is WGS84 distance is in degrees which is fine
CREATE TABLE pointonsurface_link_geonames(
   id SERIAL PRIMARY KEY,
   pos_id int,
   geoname_id int,
   distance int
   ); 

-- run in psgl
INSERT into pointonsurface_link_geonames (pos_id,geoname_id,distance)
SELECT p.id as pos_id, g.geoname_id as gn_id, p.geometry <-> g.geometry AS dist 
FROM pointonsurface p, geonames_org g
WHERE g.fcode LIKE 'ADM%'
ORDER BY dist LIMIT 1;

UPDATE:

I think this can be done using LATERAL JOIN
Something like this

SELECT id FROM pointonsurface AS pid LIMIT 5 -- LIMIT set just to test
 CROSS JOIN LATERAL (
    INSERT into pointonsurface_link_geonames (pos_id,geoname_id,distance)
      SELECT p.id as pos_id, g.geoname_id as gn_id, p.geometry <-> g.geometry AS dist 
      FROM pointonsurface p, geonames_org g
      WHERE g.fcode LIKE 'ADM%' AND p.id = pid
      ORDER BY dist LIMIT 1;
) g
GROUP BY pid
ORDER BY pid;

Gives:

ERROR:  syntax error at or near "CROSS"
Comment
Watch Question

CERTIFIED EXPERT
Most Valuable Expert 2012
Distinguished Expert 2019

Commented:
Cannot help at all with the GIS aspect.

Check the docs for the syntax on LIMIT:
https://www.postgresql.org/docs/12/queries-limit.html

It goes at the end...

You have it right in the inner query.

Also looks like your insert is in the wrong position.

So try:
INSERT into pointonsurface_link_geonames (pos_id,geoname_id,distance)
SELECT id FROM pointonsurface AS pid
 CROSS JOIN LATERAL (
         SELECT p.id as pos_id, g.geoname_id as gn_id, p.geometry <-> g.geometry AS dist 
      FROM pointonsurface p, geonames_org g
      WHERE g.fcode LIKE 'ADM%' AND p.id = pid
      ORDER BY dist LIMIT 1;
) g
GROUP BY pid
ORDER BY pid
 LIMIT 5 -- LIMIT set just to test;

Open in new window

Author

Commented:
Hi
I'm getting the following error

ERROR:  operator does not exist: integer = pointonsurface
LINE 6:       WHERE g.fcode LIKE 'ADM%' AND p.id = pid
                                                 ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.
SQL state: 42883
Character: 305
I tried to CAST but the pointonsurface id is already an integer
INSERT into pointonsurface_link_geonames (pos_id,geoname_id,distance)
SELECT CAST (id AS INTEGER) FROM pointonsurface AS pid
 CROSS JOIN LATERAL (
         SELECT p.id as pos_id, g.geoname_id as gn_id, p.geometry <-> g.geometry AS dist 
      FROM pointonsurface p, geonames_org g
      WHERE g.fcode LIKE 'ADM%' AND p.id = pid
      ORDER BY dist LIMIT 1
) g
GROUP BY pid
ORDER BY pid
 LIMIT 5 -- LIMIT set just to test;
CERTIFIED EXPERT
Most Valuable Expert 2012
Distinguished Expert 2019

Commented:
>>I tried to CAST but the pointonsurface id is already an integer

I already mentioned I cannot help with the GIS aspects at all.  As far as the error, looks like you might need to rethink your design so you have compatible data types.

BUT, you no longer the the original error...

Author

Commented:
The only GIS part is "p.geometry <-> g.geometry" replacing it with  a number gives the same error
However replacing p.id = pid with p.id = 123 gives the error bellow

INSERT into pointonsurface_link_geonames (pos_id,geoname_id,distance)
SELECT id  FROM pointonsurface AS pid
 CROSS JOIN LATERAL (
         SELECT p.id as pos_id, g.geoname_id as gn_id, 10 AS dist 
      FROM pointonsurface p, geonames_org g
      WHERE g.fcode LIKE 'ADM%' AND p.id = 123
      ORDER BY dist LIMIT 1
) g
GROUP BY pid
ORDER BY pid
 LIMIT 5 -- LIMIT set just to test;

Error
ERROR:  column "pid.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 3: SELECT id FROM pointonsurface AS pid
               ^
SQL state: 42803
Character: 79
CERTIFIED EXPERT
Most Valuable Expert 2012
Distinguished Expert 2019

Commented:
Error is pretty straight forward.

Why do you have a "group by"?  If you have it, ALL non-aggregated columns MUST be in the group by.  I don't see any aggregate functions so not sure why you have the group by.

Author

Commented:
OK removing the GROUP BY and ignoring the insert for now
I think I need to  work out how to pass the id from the outer query into the inner query returning the 3 values back to the outer query
Any idea how I do that? Google hasn't helped
CERTIFIED EXPERT
Most Valuable Expert 2012
Distinguished Expert 2019

Commented:
Not sure.   You really doesn't use values between things with inline views.  What does your query look like now?

Author

Commented:
From this page I've got this

SELECT * FROM (
   SELECT id AS pid FROM pointonsurface 
   ) pos
 LEFT JOIN LATERAL (
         SELECT p.id, g.geoname_id, 10 AS dist 
      FROM pointonsurface p, geonames_org g
      WHERE g.fcode LIKE 'ADM%' AND p.id =pos.pid
      ORDER BY dist LIMIT 1
) 

ORDER BY pos.pid
 LIMIT 5 -- LIMIT set just to test;

How would you of done this is Oracle?
CERTIFIED EXPERT
Most Valuable Expert 2012
Distinguished Expert 2019

Commented:
I'm not all that familiar with the LATERAL join in Postgres so not really sure how to port it to Oracle.

I understand you want the 5 closest but not sure you have to use a lateral join to do so.

If you can provide a simplified test case, with data, and expected results, I can try to provide the SQL on how I would solve it.  If you can simplify it down to the point that doesn't involve GIS points, I would appreciate it.

Author

Commented:
OK
Will try and knock something up in the morning

Author

Commented:
OK this is a bit contrived but hopefully  it shows what I'm trying to do

If I was doing it in an external script like perl or C# I'd do something like this pseudo code
my res = SELECT pos_id FROM loop_tbl;

foreach(my pid in res){
   num1 = SELECT num FROM loop_tbl WHERE pos_id = pid;
  (geoname_id, num2) = SELECT geoname_id, num FROM data_tbl;
   dist = num1 / num2; 
   INSERT INTO link_tbl(pos_id,geoname_id,distance)
   VALUES (pid,geoname_id,dist);
}



SQL

CREATE TABLE loop_tbl(
   pos_id SERIAL PRIMARY KEY,
   num int
);

CREATE TABLE data_tbl(
   geoname_id SERIAL PRIMARY KEY,
   num int
);

CREATE TABLE link_tbl(
   id SERIAL PRIMARY KEY,
   pos_id int,
   geoname_id int,
   distance double precision
);


INSERT INTO loop_tbl(num) VALUES(5);
INSERT INTO loop_tbl(num) VALUES(6);
INSERT INTO loop_tbl(num) VALUES(7);

INSERT INTO data_tbl(num) VALUES(2);

--SELECT round( CAST(float8 '3.1415927' as numeric), 2);

select * FROM loop_tbl;
select * FROM data_tbl;

-- Expected in link_tbl
-- 1,1,2.5
-- 2,1,3
-- 3,1,3.5

-- FROM previous efforts

INSERT INTO link_tbl(pos_id,geoname_id,distance)
SELECT * FROM (
   SELECT pos_id AS pid FROM loop_tbl
   ) pos
 LEFT JOIN LATERAL (
          SELECT p.pos_id, g.geoname_id, p.num / g.num AS dist 
      FROM lopo_tbl p, data_tbl g
      p.pos_id = pos.pid
      ORDER BY dist LIMIT 1 
) 

ORDER BY pos.pid
 LIMIT 5 -- LIMIT set just to test;



NOTE: the math is just to replace the GIS function which finds the distance between points by putting ORDER BY dist LIMIT 1 this finds the closest point


Author

Commented:
I'm wondering if the inner query can or should be done as a function that takes the id from the first table, performs the computations and returns the 3 required values?  
CERTIFIED EXPERT
Most Valuable Expert 2012
Distinguished Expert 2019

Commented:
Since the inner query uses the same table in the outer query, not seeing the need for the outer query.  Creating a function will likely make it more complex than it needs to be.

I saw the test case you posted this morning.  I haven't had time to get around to looking at it yet.
CERTIFIED EXPERT
Most Valuable Expert 2012
Distinguished Expert 2019
Commented:
This one is on us!
(Get your first solution completely free - no credit card required)
UNLOCK SOLUTION

Author

Commented:
Thanx
This works with the mock data unsure how to scale it up for the real data maybe I either write external script or create a  new question
CERTIFIED EXPERT
Most Valuable Expert 2012
Distinguished Expert 2019

Commented:
You still probably don't need an external script or function or anything else other than straight SQL.

We can continue to talk about scaling it up here as long as it doesn't deviate away from the original question.

What don't you think will scale?

Author

Commented:
I need to be able to loop through through the "loop_tbl" so I can use a WHERE clause inside the inner query
...... from opening remarks
-- DO GIS Query 
      WHERE g.fcode LIKE 'ADM%' AND p.id = pid
       ORDER BY dist LIMIT 1; 

The GIS & above basically means the closest point to a given point otherwise it  will  find the distance to every point

if the loop_tbl has 10 points and the data has 100
I'm expecting a 1 to 1 link so the link_tbl will have 10 rows not 1,000

To simulate add another row to the data_tbl

Hope that makes sense 
CERTIFIED EXPERT
Most Valuable Expert 2012
Distinguished Expert 2019

Commented:
You don't want the closest distance to ALL values in the data table?

Even if you don't, just create a CTE to do them 1 by 1 and cross join that?
insert into link_tbl(pos_id,geoname_id, distance)
with single_data_row as (
   select * from data_tbl limit 1
)
select l.pos_id,d.geoname_id, l.num/d.num::decimal
from loop_tbl l cross join single_data_row d;

Open in new window



Fiddle here:
https://dbfiddle.uk/?rdbms=postgres_12&fiddle=132882393a60fb813dc5ce7fb7b3b171

Author

Commented:
>>  You don't want the closest distance to ALL values in the data table?
no
point 1 in loop table find the closest point in the data table The GIS does this without the LIMIT it will return the distance to every point and crash the database

Put another way if  the loop table were a list of bust stops and  the data table a list of pubs with open or closed bool
I'm  at 1 bus stop  find me the nearest open pub I don't need every pub in the data table

Then find the nearest pub to each bus stop


Your query gives

idpos_idgeoname_iddistance
1112.5
2213
3313.5
If
INSERT INTO data_tbl(num) VALUES(999);
INSERT INTO data_tbl(num) VALUES(123);
Assuming 3 row in each table I would expect something like this:  but like I said the data table has a lot more points than the loop table

idpos_idgeoname_iddistance
1112.5
222166.5
33317.57

If my math is correct

In theory 1 point /pub in the data table could be closest  to the same point / bus stop in the loop table but we'll skip that for now 
CERTIFIED EXPERT
Most Valuable Expert 2012
Distinguished Expert 2019

Commented:
>>I'm  at 1 bus stop  find me the nearest open pub I don't need every pub in the data table

But to find the nearest you have to look at ALL.

>>Then find the nearest pub to each bus stop

Doesn't that contradict the "no" response to my question "You don't want the closest distance to ALL values in the data table"?


>>Your query gives

It grabs a random single row from data_tbl.  Could be the first, could be the last, could be any one somewhere in the middle.

>>I would expect something like this

Not sure how you get that and I don't have to.  Take the SQL I provided and apply your logic to it?

Author

Commented:
I think we maybe we have our wires crossed

The GIS query uses the GIST index to find the distance from a bus stop to every pub so you might in assumption 
Then use ORDER BY dist & LIMIT 1 to return the closest

I run your fiddle again I got the same results

CERTIFIED EXPERT
Most Valuable Expert 2012
Distinguished Expert 2019

Commented:
Yes, we have definitely crossed wires.  You understand your exact requirements.  Many times trying to explain them to others in a Q&A site like this can be a problem.  This is why we ask for sample data and the results from that data.  It helps us visualize the requirements you are trying to explain.

You can use ROW_NUMBER to get the "first" of things.  I don't understand the need to use LIMIT and ORDER by in some inner query.

I can only go by the data you provide and base SQL on what I believe your requirements are.

>>Assuming 3 row in each table I would expect something like this:

I would need to see more data in the test case and the expected results from the data you provide.

Author

Commented:
>> I would need to see more data in the test case and the expected results from the data you provide.

Very hard to provide mock geo data when the question isn't directly related and convert it to a non geo example

Taking my bus route scenario
The route through MyTown has 15 stops 8 going north & 7 going south some are open & some are closed
MyTown has 57 pubs

At each stop I need to find 1 open pub with a garden (limit & distance) I have time to have 1 drink then catch the next bus

I'm at stop 1 The Angel is 50m away open  & the Ferret is  5km and closed  The Pie & Eel is 3km but only 600m from stop 3  & opens in 1 hour
At stop 1 I'm not concerned with the Ferret or The Pie & Eel are because they are closed I'm can buy a Drink in The Angel
At stop 2 The Green Man is Open I don't care about the Black Cat it is closed
by the time i get to stop 3 The Pie & Eel  is now open

At each bus stop you effectively search all 57 pubs returning 1 that is open because the pub status changes in  relation to each stop

Now do you see why a new search needs to be made at each stop?
Unsure how I can convert that into an database with out adding more complexity to change the status of The Pie & Eel assuming it happens
https://dbfiddle.uk/?rdbms=postgres_12&fiddle=a3c8f18a4c07df2a6b047035c3d826b6

I'd expect (ignore distance my math not that good)
idpos_idgeoname_idpub_namedistance
111The Angel15724.03777240155
225
The Green Man
141501.61141908905
334
 The Pie & Eel  
298679.3506199334


>> You can use ROW_NUMBER to get the "first" of things.  
Will that not always return the lowest value




CERTIFIED EXPERT
Most Valuable Expert 2012
Distinguished Expert 2019
Commented:
This one is on us!
(Get your first solution completely free - no credit card required)
UNLOCK SOLUTION

Author

Commented:
Took me a while to work out where in the previous SQL you put the insert statement but got there 

with all_open as (
   select l.pos_id,d.geoname_id, pub_name,l.geog <-> d.geog  AS distance
   from loop_tbl l cross join data_tbl d
   where open
),
ranked as (
   select pos_id, geoname_id, pub_name, distance,
      row_number() over(partition by pos_id order by distance asc) rn
   from all_open
)
INSERT INTO link_tbl(pos_id,geoname_id, pub_name,distance) 
select pos_id, geoname_id, pub_name,  round(CAST(distance  As numeric),2) 
from ranked
where rn=1

BTW: I hope you realize the Bus route pub crawl is just my attempt at trying to explain my logic  

I really appreciate your help on this

Gain unlimited access to on-demand training courses with an Experts Exchange subscription.

Get Access
Why Experts Exchange?

Experts Exchange always has the answer, or at the least points me in the correct direction! It is like having another employee that is extremely experienced.

Jim Murphy
Programmer at Smart IT Solutions

When asked, what has been your best career decision?

Deciding to stick with EE.

Mohamed Asif
Technical Department Head

Being involved with EE helped me to grow personally and professionally.

Carl Webster
CTP, Sr Infrastructure Consultant
Empower Your Career
Did You Know?

We've partnered with two important charities to provide clean water and computer science education to those who need it most. READ MORE

Ask ANY Question

Connect with Certified Experts to gain insight and support on specific technology challenges including:

  • Troubleshooting
  • Research
  • Professional Opinions
Unlock the solution to this question.
Join our community and discover your potential

Experts Exchange is the only place where you can interact directly with leading experts in the technology field. Become a member today and access the collective knowledge of thousands of technology experts.

*This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

OR

Please enter a first name

Please enter a last name

8+ characters (letters, numbers, and a symbol)

By clicking, you agree to the Terms of Use and Privacy Policy.