Link to home
Start Free TrialLog in
Avatar of erikTsomik
erikTsomikFlag for United States of America

asked on

Consolidate the query result

I have the query that return the result properly. What I am trying to do is to get the total per school. However, the schools have different IDs but this is basically the same schools. How can do some type of search to get the calculated as a same school.

	SELECT U.SchoolID,LS.SchoolNameTx
		FROM SessionMap SM
		inner join users u on u.userKey = SM.userKey
		inner join lkup_school LS on LS.SchoolID = U.SchoolID
		order by U.SchoolID

Open in new window


The result returned:
SchoolID	SchoolNameTx
429	       Quince Orchard High School
1189	Walter Johnson High School (WJHS)
1228	Academy of the Holy Cross
1395	Academny of the Holy Cross
1395	Academny of the Holy Cross
2656	Academy of Holy Cross
3144	Academy of the Holy Cross High School (AHC)
3144	Academy of the Holy Cross High School (AHC)
3144	Academy of the Holy Cross High School (AHC)
3144	Academy of the Holy Cross High School (AHC)
3144	Academy of the Holy Cross High School (AHC)
3144	Academy of the Holy Cross High School (AHC)
3450	Academy  of the Holy Cross
4892	Stonehill College

Open in new window

Avatar of _agx_
_agx_
Flag of United States of America image

What's an example of the expected results?
Avatar of erikTsomik

ASKER

the expected result is something like this

       Quince Orchard High School
	Walter Johnson High School (WJHS)
	Academy  of the Holy Cross
	Stonehill College

Open in new window

SOLUTION
Avatar of _agx_
_agx_
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
yes that is exactly what I am trying to accomplish. School and the count
ASKER CERTIFIED SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
how can I make it more dynamic . In example below you can see so I am talking about the same schools

Academy of the Holy Cross  AHC	4
Academy of the Holy Cross (AHC)	29
Academy of the Holy Cross High School	1344
Academy of the Holy Cross High School (ACH)	251

Open in new window


Alber Einstein HS	24
Albert Einstein	4
Albert Einstein High School	314
Albert Einstein HS	4
Albert Einstien High School	375

Open in new window

winston chirchill high school	29
Winston Chruchill High School	1842
Winston Churchihl High School	5
Winston Churchil High School	113
Winston Churchill	128
Winston churchill H.S.	214
Winston Churchill High	18
Winston Churchill High School	6107
Winston Churchill High School (WCHS)	3406
Winston Churchill High School Potomac MD	65
Winston Churchill High School, Potomac, MD	64
Winston Churchill Highschool	10
Winston Churchill HS	128
Winston Churchill HS in Potomac	88

Open in new window

For it to be dynamic but still manageable, there'd need to be more of a common pattern in all of the names, like starts with "ABC School%", but there isn't - not for all of them.  

If there's a bunch of these, I'd probably create some sort of grouping or mapping table instead.  That would make for cleaner SQL. Plus you could even create a web screen to add/remove mappings for easy maintenance (optional).
You may think of a function seeing two names are meaning the same just like you think. SQL Server is quite weak on string functions in general, as most any database is and things like SOUNDEX computing a representation of similarly pronounced words also won't help you to find the equality of names.

You could replace some common abbreviations like H.S., HS and HIGHSCHOOL and either completely remove them or always replace them with one of the ways they're written, perhaps the shortest one, to never get into trouble with prolonging names getting too long.

But in the end you shouldn't consolidate the result, you should consolidate the school name data in lkup_school.

A mapping table is ideal for this, as you can start out with automatism and never have to fear wrong consolidations, as you keep the data as is and only create mappings of several  IDs on lkup_school to point to a new consolidated record, that in the end means one more join in the final query, not limited to this query, you can reuse the mappings, once you have them, and once they are not just computed/estimated but manually confirmed, you may then even really consolidate the records in lkup_school itself and cascade foreign keys in relationships with other tables.

Ideally you'd know some norm and id/number of schools, that's intended for this scenario of data analysis, so you don't have to match data by names.

I guess your current lkup_school is just the result of taking every name very literally and there also is no simple mechanism like you have as a human to see the equivalence.

I see two stages of getting from lkup_school to a mapping table having pairs of (consolidatedSchoolId, lkup_SchoolID):

Step 1: Create a table tmpSchoolNames based on lkup_school.SchoolID and SchoolNameTx as ConsolidatedName
Step 2: Do several string replacements in SchoolNameTx like

UPDATE tmpSchoolNames SET SchoolNameTx = REPLACE(SchoolNameTx,"H.S.", "HS")

Open in new window


Then finally after you made such replacements of varying spellings of similar name parts you can simply get your consolidated data as SELECT min(SchoolID), ConsolidatedName grouped by ConsolidatedName.and at the same time have the list of mappings to all Ids having the same consolidated name.

As we have no idea how large your lkup_school table is, it's hard to recommend how much work to invest, you might never get a 100% solution or it is even simpler doing that manually, but I guess from the examples, there's a lot to go through and so you better find a strategy consolidation by such name parts or also disposable parts. A simple thing to do would drop off anything in parenthesis as unimportant not significant part of a name.

Bye, Olaf.