asked on

PostgreSQL / PostGIS: Joining two tables using nearest neighbor

I have two tables with point geometry both with SRID 4326 (WGS84)

How do refine the query bellow to loop through every row / point in the pointonsurface table locating it's nearest neighbor in the geonames_org table?

Given this is WGS84 distance is in degrees which is fine

CREATE TABLE pointonsurface_link_geonames(
   id SERIAL PRIMARY KEY,
   pos_id int,
   geoname_id int,
   distance int
   ); 

-- run in psgl
INSERT into pointonsurface_link_geonames (pos_id,geoname_id,distance)
SELECT p.id as pos_id, g.geoname_id as gn_id, p.geometry <-> g.geometry AS dist 
FROM pointonsurface p, geonames_org g
WHERE g.fcode LIKE 'ADM%'
ORDER BY dist LIMIT 1;

Open in new window

UPDATE:

I think this can be done using LATERAL JOIN
Something like this

SELECT id FROM pointonsurface AS pid LIMIT 5 -- LIMIT set just to test
 CROSS JOIN LATERAL (
    INSERT into pointonsurface_link_geonames (pos_id,geoname_id,distance)
      SELECT p.id as pos_id, g.geoname_id as gn_id, p.geometry <-> g.geometry AS dist 
      FROM pointonsurface p, geonames_org g
      WHERE g.fcode LIKE 'ADM%' AND p.id = pid
      ORDER BY dist LIMIT 1;
) g
GROUP BY pid
ORDER BY pid;

Open in new window

Gives:

ERROR: syntax error at or near "CROSS"

slightwv (䄆 Netminder)

Cannot help at all with the GIS aspect.

Check the docs for the syntax on LIMIT:
https://www.postgresql.org/docs/12/queries-limit.html

It goes at the end...

You have it right in the inner query.

Also looks like your insert is in the wrong position.

So try:

INSERT into pointonsurface_link_geonames (pos_id,geoname_id,distance)
SELECT id FROM pointonsurface AS pid
 CROSS JOIN LATERAL (
         SELECT p.id as pos_id, g.geoname_id as gn_id, p.geometry <-> g.geometry AS dist 
      FROM pointonsurface p, geonames_org g
      WHERE g.fcode LIKE 'ADM%' AND p.id = pid
      ORDER BY dist LIMIT 1;
) g
GROUP BY pid
ORDER BY pid
 LIMIT 5 -- LIMIT set just to test;

Open in new window

trevor1940

ASKER

Hi
I'm getting the following error

ERROR:  operator does not exist: integer = pointonsurface
LINE 6:       WHERE g.fcode LIKE 'ADM%' AND p.id = pid
                                                 ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.
SQL state: 42883
Character: 305

Open in new window

I tried to CAST but the pointonsurface id is already an integer

INSERT into pointonsurface_link_geonames (pos_id,geoname_id,distance)
SELECT CAST (id AS INTEGER) FROM pointonsurface AS pid
 CROSS JOIN LATERAL (
         SELECT p.id as pos_id, g.geoname_id as gn_id, p.geometry <-> g.geometry AS dist 
      FROM pointonsurface p, geonames_org g
      WHERE g.fcode LIKE 'ADM%' AND p.id = pid
      ORDER BY dist LIMIT 1
) g
GROUP BY pid
ORDER BY pid
 LIMIT 5 -- LIMIT set just to test;

Open in new window

slightwv (䄆 Netminder)

>>I tried to CAST but the pointonsurface id is already an integer

I already mentioned I cannot help with the GIS aspects at all. As far as the error, looks like you might need to rethink your design so you have compatible data types.

BUT, you no longer the the original error...

trevor1940

ASKER

The only GIS part is "p.geometry <-> g.geometry" replacing it with a number gives the same error
However replacing p.id = pid with p.id = 123 gives the error bellow

INSERT into pointonsurface_link_geonames (pos_id,geoname_id,distance)
SELECT id  FROM pointonsurface AS pid
 CROSS JOIN LATERAL (
         SELECT p.id as pos_id, g.geoname_id as gn_id, 10 AS dist 
      FROM pointonsurface p, geonames_org g
      WHERE g.fcode LIKE 'ADM%' AND p.id = 123
      ORDER BY dist LIMIT 1
) g
GROUP BY pid
ORDER BY pid
 LIMIT 5 -- LIMIT set just to test;

Open in new window

Error

ERROR:  column "pid.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 3: SELECT id FROM pointonsurface AS pid
               ^
SQL state: 42803
Character: 79

Open in new window

slightwv (䄆 Netminder)

Error is pretty straight forward.

Why do you have a "group by"? If you have it, ALL non-aggregated columns MUST be in the group by. I don't see any aggregate functions so not sure why you have the group by.

trevor1940

ASKER

OK removing the GROUP BY and ignoring the insert for now
I think I need to work out how to pass the id from the outer query into the inner query returning the 3 values back to the outer query
Any idea how I do that? Google hasn't helped

slightwv (䄆 Netminder)

Not sure. You really doesn't use values between things with inline views. What does your query look like now?

trevor1940

ASKER

From this page I've got this

SELECT * FROM (
   SELECT id AS pid FROM pointonsurface 
   ) pos
 LEFT JOIN LATERAL (
         SELECT p.id, g.geoname_id, 10 AS dist 
      FROM pointonsurface p, geonames_org g
      WHERE g.fcode LIKE 'ADM%' AND p.id =pos.pid
      ORDER BY dist LIMIT 1
) 

ORDER BY pos.pid
 LIMIT 5 -- LIMIT set just to test;

Open in new window

How would you of done this is Oracle?

slightwv (䄆 Netminder)

I'm not all that familiar with the LATERAL join in Postgres so not really sure how to port it to Oracle.

I understand you want the 5 closest but not sure you have to use a lateral join to do so.

If you can provide a simplified test case, with data, and expected results, I can try to provide the SQL on how I would solve it. If you can simplify it down to the point that doesn't involve GIS points, I would appreciate it.

trevor1940

ASKER

OK
Will try and knock something up in the morning

trevor1940

ASKER

OK this is a bit contrived but hopefully it shows what I'm trying to do

If I was doing it in an external script like perl or C# I'd do something like this pseudo code

my res = SELECT pos_id FROM loop_tbl;

foreach(my pid in res){
   num1 = SELECT num FROM loop_tbl WHERE pos_id = pid;
  (geoname_id, num2) = SELECT geoname_id, num FROM data_tbl;
   dist = num1 / num2; 
   INSERT INTO link_tbl(pos_id,geoname_id,distance)
   VALUES (pid,geoname_id,dist);
}

Open in new window

SQL

CREATE TABLE loop_tbl(
   pos_id SERIAL PRIMARY KEY,
   num int
);

CREATE TABLE data_tbl(
   geoname_id SERIAL PRIMARY KEY,
   num int
);

CREATE TABLE link_tbl(
   id SERIAL PRIMARY KEY,
   pos_id int,
   geoname_id int,
   distance double precision
);


INSERT INTO loop_tbl(num) VALUES(5);
INSERT INTO loop_tbl(num) VALUES(6);
INSERT INTO loop_tbl(num) VALUES(7);

INSERT INTO data_tbl(num) VALUES(2);

--SELECT round( CAST(float8 '3.1415927' as numeric), 2);

select * FROM loop_tbl;
select * FROM data_tbl;

-- Expected in link_tbl
-- 1,1,2.5
-- 2,1,3
-- 3,1,3.5

-- FROM previous efforts

INSERT INTO link_tbl(pos_id,geoname_id,distance)
SELECT * FROM (
   SELECT pos_id AS pid FROM loop_tbl
   ) pos
 LEFT JOIN LATERAL (
          SELECT p.pos_id, g.geoname_id, p.num / g.num AS dist 
      FROM lopo_tbl p, data_tbl g
      p.pos_id = pos.pid
      ORDER BY dist LIMIT 1 
) 

ORDER BY pos.pid
 LIMIT 5 -- LIMIT set just to test;

Open in new window

NOTE: the math is just to replace the GIS function which finds the distance between points by putting ORDER BY dist LIMIT 1 this finds the closest point

trevor1940

ASKER

I'm wondering if the inner query can or should be done as a function that takes the id from the first table, performs the computations and returns the 3 required values?

slightwv (䄆 Netminder)

Since the inner query uses the same table in the outer query, not seeing the need for the outer query. Creating a function will likely make it more complex than it needs to be.

I saw the test case you posted this morning. I haven't had time to get around to looking at it yet.

ASKER CERTIFIED SOLUTION

slightwv (䄆 Netminder)

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

trevor1940

ASKER

Thanx
This works with the mock data unsure how to scale it up for the real data maybe I either write external script or create a new question

slightwv (䄆 Netminder)

You still probably don't need an external script or function or anything else other than straight SQL.

We can continue to talk about scaling it up here as long as it doesn't deviate away from the original question.

What don't you think will scale?

trevor1940

ASKER

I need to be able to loop through through the "loop_tbl" so I can use a WHERE clause inside the inner query
...... from opening remarks

-- DO GIS Query 
      WHERE g.fcode LIKE 'ADM%' AND p.id = pid
       ORDER BY dist LIMIT 1;

Open in new window

The GIS & above basically means the closest point to a given point otherwise it will find the distance to every point

if the loop_tbl has 10 points and the data has 100
I'm expecting a 1 to 1 link so the link_tbl will have 10 rows not 1,000

To simulate add another row to the data_tbl

Hope that makes sense

slightwv (䄆 Netminder)

You don't want the closest distance to ALL values in the data table?

Even if you don't, just create a CTE to do them 1 by 1 and cross join that?

insert into link_tbl(pos_id,geoname_id, distance)
with single_data_row as (
   select * from data_tbl limit 1
)
select l.pos_id,d.geoname_id, l.num/d.num::decimal
from loop_tbl l cross join single_data_row d;

Open in new window

Fiddle here:
https://dbfiddle.uk/?rdbms=postgres_12&fiddle=132882393a60fb813dc5ce7fb7b3b171

trevor1940

ASKER

>> You don't want the closest distance to ALL values in the data table?
no
point 1 in loop table find the closest point in the data table The GIS does this without the LIMIT it will return the distance to every point and crash the database

Put another way if the loop table were a list of bust stops and the data table a list of pubs with open or closed bool
I'm at 1 bus stop find me the nearest open pub I don't need every pub in the data table

Then find the nearest pub to each bus stop

Your query gives

id	pos_id	geoname_id	distance
1	1	1	2.5
2	2	1	3
3	3	1	3.5

INSERT INTO data_tbl(num) VALUES(999);
INSERT INTO data_tbl(num) VALUES(123);

Open in new window

Assuming 3 row in each table I would expect something like this: but like I said the data table has a lot more points than the loop table

id	pos_id	geoname_id	distance
1	1	1	2.5
2	2	2	166.5
3	3	3	17.57

If my math is correct

In theory 1 point /pub in the data table could be closest to the same point / bus stop in the loop table but we'll skip that for now

slightwv (䄆 Netminder)

>>I'm at 1 bus stop find me the nearest open pub I don't need every pub in the data table

But to find the nearest you have to look at ALL.

>>Then find the nearest pub to each bus stop

Doesn't that contradict the "no" response to my question "You don't want the closest distance to ALL values in the data table"?

>>Your query gives

It grabs a random single row from data_tbl. Could be the first, could be the last, could be any one somewhere in the middle.

>>I would expect something like this

Not sure how you get that and I don't have to. Take the SQL I provided and apply your logic to it?

trevor1940

ASKER

I think we maybe we have our wires crossed

The GIS query uses the GIST index to find the distance from a bus stop to every pub so you might in assumption
Then use ORDER BY dist & LIMIT 1 to return the closest

I run your fiddle again I got the same results

slightwv (䄆 Netminder)

Yes, we have definitely crossed wires. You understand your exact requirements. Many times trying to explain them to others in a Q&A site like this can be a problem. This is why we ask for sample data and the results from that data. It helps us visualize the requirements you are trying to explain.

You can use ROW_NUMBER to get the "first" of things. I don't understand the need to use LIMIT and ORDER by in some inner query.

I can only go by the data you provide and base SQL on what I believe your requirements are.

>>Assuming 3 row in each table I would expect something like this:

I would need to see more data in the test case and the expected results from the data you provide.

trevor1940

ASKER

>> I would need to see more data in the test case and the expected results from the data you provide.

Very hard to provide mock geo data when the question isn't directly related and convert it to a non geo example

Taking my bus route scenario
The route through MyTown has 15 stops 8 going north & 7 going south some are open & some are closed
MyTown has 57 pubs

At each stop I need to find 1 open pub with a garden (limit & distance) I have time to have 1 drink then catch the next bus

I'm at stop 1 The Angel is 50m away open & the Ferret is 5km and closed The Pie & Eel is 3km but only 600m from stop 3 & opens in 1 hour
At stop 1 I'm not concerned with the Ferret or The Pie & Eel are because they are closed I'm can buy a Drink in The Angel
At stop 2 The Green Man is Open I don't care about the Black Cat it is closed
by the time i get to stop 3 The Pie & Eel is now open

At each bus stop you effectively search all 57 pubs returning 1 that is open because the pub status changes in relation to each stop

Now do you see why a new search needs to be made at each stop?
Unsure how I can convert that into an database with out adding more complexity to change the status of The Pie & Eel assuming it happens
https://dbfiddle.uk/?rdbms=postgres_12&fiddle=a3c8f18a4c07df2a6b047035c3d826b6

I'd expect (ignore distance my math not that good)

id	pos_id	geoname_id	pub_name	distance
1	1	1	The Angel	15724.03777240155
2	2	5	The Green Man	141501.61141908905
3	3	4	The Pie & Eel	298679.3506199334

>> You can use ROW_NUMBER to get the "first" of things.
Will that not always return the lowest value

SOLUTION

slightwv (䄆 Netminder)

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

trevor1940

ASKER

Took me a while to work out where in the previous SQL you put the insert statement but got there

with all_open as (
   select l.pos_id,d.geoname_id, pub_name,l.geog <-> d.geog  AS distance
   from loop_tbl l cross join data_tbl d
   where open
),
ranked as (
   select pos_id, geoname_id, pub_name, distance,
      row_number() over(partition by pos_id order by distance asc) rn
   from all_open
)
INSERT INTO link_tbl(pos_id,geoname_id, pub_name,distance) 
select pos_id, geoname_id, pub_name,  round(CAST(distance  As numeric),2) 
from ranked
where rn=1

Open in new window

BTW: I hope you realize the Bus route pub crawl is just my attempt at trying to explain my logic

I really appreciate your help on this