Link to home
Start Free TrialLog in
Avatar of MFredin
MFredinFlag for United States of America

asked on

Removing duplicate cities in huge geo database using cfloop. Is there a faster way to do this?

I've got a huge database of all city names, long, lat, etc for every zip code in North America.  Many cities have multiple zip codes and therefore many entries.  What I'm trying to do is make my database smaller by removing all the duplicate rows based on state and city name.  This way, I will only have 1 record for each city.

Here is what I have now.  The problem is it is deathly slow.  Is there a better way I could be doing this without using Coldfusion to loop over each record?  MySQL 5, Coldfusion 9
<cfquery name="states">
SELECT states
FROM states
</cfquery>

<cfloop query="states">

<cfquery name="cities">
SELECT DISTINCT(city)as city
FROM cities
WHERE region1 = '#states.state#'
ORDER BY city
</cfquery>

<cfoutput query="cities">

<cfquery name="info">
SELECT *
FROM cities
WHERE region1 = '#state#'
AND city = '#cities.city#'
LIMIT 1
</cfquery>

<cfquery>
INSERT INTO cities_new
(country, language, regISO2, Region1, Region2, Region3, Region4, zip, city, area1, area2, lat, lng, tz, utc, dst)
VALUES
('#UCASE(info.country)#', '#UCASE(info.language)#', '#UCASE(info.regiso2)#', '#UCASE(info.region1)#', '#UCASE(info.region2)#', '#UCASE(info.region3)#', '#UCASE(info.region4)#', '#UCASE(info.zip)#', '#UCASE(info.city)#', '#UCASE(info.area1)#', '#UCASE(info.area2)#', #info.lat#, #info.lng#, '#UCASE(info.tz)#', '#UCASE(info.utc)#', '#UCASE(info.dst)#')
</cfquery>

</cfoutput>
</cfloop>

Open in new window

Avatar of Ovunc Tukenmez
Ovunc Tukenmez
Flag of Türkiye image

DELETE FROM your_table WHERE id NOT IN (SELECT id FROM your_table GROUP BY Region1, city)
Avatar of MFredin

ASKER

I tried this... but getting the following error:  You can't specify target table 'cities' for update in FROM clause

<cfquery>
DELETE
FROM cities
WHERE country = 'US'
AND id NOT IN (SELECT id FROM cities WHERE country = 'US' GROUP BY Region1, city)
</cfquery>
> DELETE FROM your_table WHERE id NOT IN (SELECT id FROM your_table GROUP BY Region1, city)

seems to me this will not delete any records, the inner SELECT will fetch every single ID so there are no IDs not included by the inner select... therefore, no records will be removed

how about this code...

I hope you've made a backup or are testing in a copy of the table or something.. !


delete cities
WHERE country = 'US'
and ID In (select min(id)
   from cities
   group by city, region
   having count(*) > 1)

Open in new window

oh, I forgot to mention.   That will deleted one record of the duplicate.  You keep running it until it returns 0 results.   Some cities may have 3 or 4 duplicates, if they do, just run the script until no records are returned.    Of course not the best solution for automation, but an easy way if you're doing it manually just one time.

ASKER CERTIFIED SOLUTION
Avatar of gdemaria
gdemaria
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
wow, that's a nasty restriction for a database

 select distinct... can avoid that, see my previous post
Avatar of MFredin

ASKER

Thanks!  Inserting into a new table was the best solution for me.  Fast too! Thanks!