

Implementing a scalable search solution: mapping two metadata datasets and searching them

Posted on 2011-04-25
Medium Priority
Last Modified: 2012-05-11
Hi folks,

I have a somewhat complicated problem. I am building a web application with PHP/MySQL that lets a user search through the data efficiently, much as Google does, and returns the most relevant results.

Basically, what I want is a scalable solution, preferably a REST service that returns structured XML when you pass it a query parameter. The main problem is the kind of data I have: each dataset contains one XML file per record and has its own schema.

The datasets contain around 200,000 records for now, and in the future that could grow to a million.

So, a couple of questions:
1. How best to map this data so that a keyword search always retrieves the most relevant entries, not just a sequential bunch of entries?
2. I tried using Apache Solr but couldn't figure out whether its schema is flexible enough to let me enter any fields corresponding to the XML I have and define their types. It looks like a certain schema syntax has to be followed.
3. The most important thing is to make this database system scalable, so that different types of metadata datasets can be imported in a plug-and-play fashion, indexed appropriately, and made available for searching. I think this is only possible if a mapping schema can be developed, but I am open to other ideas.
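
To make "most relevant entries, not just a sequential bunch" concrete, here is a minimal sketch of keyword ranking with an inverted index and a simple TF-IDF-style score. It is only an illustration in Python (the application itself is PHP/MySQL), the function names `tokenize`, `build_index` and `rank` are hypothetical, and a real engine such as Solr does far more than this.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index from a dict of doc_id -> text.

    Returns (index, n_docs), where index maps term -> {doc_id: term frequency}.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index, len(docs)


def rank(index, n_docs, query):
    """Return doc ids ordered by a simple TF-IDF-style score, best first."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        # Terms that occur in fewer documents get a higher weight.
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Sort by descending score, then by doc id so ties are deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

The point of the sketch is that ordering comes from a per-document score rather than from storage order; click counts or other signals could later be blended into the same score.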

I have attached the two sample dataset files.

Question by:dsccgl

Accepted Solution

Ray Paseur earned 1336 total points
ID: 35462916
"map this data so that a keyword search always retrieves the most relevant" -- Help us understand what constitutes "relevant", please.

Here is data set #1
		<p source="maker">, Mission Indian</p>
		<p source="materials">Sumac coiled on a deergrass bundle foundation, design in natural juncus, bound-under fag end stitches on the interior</p>
		<p source="date_made">before 1910</p>
		<p source="earliest_year">1890</p>
		<p source="id_number">1000.G.10</p>
		<p source="measurements">10 3/4 in x 3 3/4 in (27 cm x 9.5 cm)</p>
		<p source="credit_line">Southwest Museum of the American Indian Collection, Autry National Center, Tomas Lorenzo Duque Memorial Colllection. Collected: Duque, Mr. Tomas Lorenzo, San Diego County, CA</p>
		<p source="subjects">Mission Indian basket, probably Diegueño, sumac coiled on a deergrass bundle foundation, design in natural juncus, before 1910.
Collected by Tomas Lorenzo Duque, from San Felipe - Santa Ysabel - Warner's Ranch area, San Diego County, California, between 1900-1910.
Subjects: bundle foundation, triangles</p>
		<p source="media_id">http://collections.locationsecond.org/thumb/1000_G_10.jpg</p>


Here is data set #2
       <title label="Object Name">Awl/perforator</title>
       <online_media mediaCount="2">
         <media type="Images" thumbnail="http://www.location/collections//4027/429/139.thumb.jpg">http://www.location/collections//4027/429/139.700x700.jpg</media>
         <media type="Images" thumbnail="http://www.location/collections//42/319/002408.thumb.jpg">http://www.location/collections//42/319/002408.700x700.jpg</media>
       <freetext category="identifier" label="Catalog Number">2408</freetext>
       <freetext category="dataSource" label="Data Source">smith</freetext>
       <freetext category="objectType" label="Object Type">Tools and Equipment (General)</freetext>
       <freetext category="culture" label="Culture/People">Unknown archaeological culture</freetext>
       <freetext category="physicalDescription" label="Media/Materials">Chalcedony</freetext>
       <freetext category="physicalDescription" label="Techniques">Flaked/chipped</freetext>
       <freetext category="date" label="Date Created">10,000 BC-AD 1600</freetext>
       <freetext category="notes" label="Collection History">Collected by archaeologist George Hubbard Pepper (1873-1924, MAI staff member); acquired by George Heye in 1904.</freetext>
       <freetext category="setName" label="See more items in">Archaeological Items</freetext>
       <freetext category="place" label="place">Ganado region; Ganado, Navajo Reservation; Apache County; Arizona; USA</freetext>
       <freetext category="place" label="Site Name">Ganado region</freetext>
       <object_type>Tools and Equipment (General)</object_type>
       <online_media_type>Catalog Cards</online_media_type>
       <culture>American Indians</culture>
       <culture>Native Americans</culture>
       <culture>Unknown archaeological culture</culture>
       <culture>Unknown archaeological culture</culture>
       <place>Apache County</place>
       <place>Ganado, Navajo Reservation</place>
         <L2 type="Country">USA</L2>
         <L3 type="State">Arizona</L3>
         <L4 type="County">Apache County</L4>
         <L5 type="City">Ganado, Navajo Reservation</L5>

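The two samples describe similar things (identifier, materials, date) with different markup, so one possible approach is a small per-dataset mapping table that translates each source field into a shared set of field names before indexing. The sketch below is only an illustration in Python: the common field names, the mapping tables, and the assumption that each record is wrapped in a single root element (here `<record>`) are all hypothetical and based solely on the two fragments above.

```python
import xml.etree.ElementTree as ET

# Hypothetical mappings, derived only from the two sample fragments.
# Data set #1 keys on the "source" attribute of <p>.
DATASET1_MAP = {
    "id_number": "identifier",
    "materials": "materials",
    "date_made": "date",
    "maker": "maker",
    "subjects": "description",
}
# Data set #2 keys on the (category, label) attributes of <freetext>.
DATASET2_MAP = {
    ("identifier", "Catalog Number"): "identifier",
    ("physicalDescription", "Media/Materials"): "materials",
    ("date", "Date Created"): "date",
    ("culture", "Culture/People"): "culture",
}


def normalize_dataset1(xml_text):
    """Map <p source="..."> elements onto the common field names."""
    record = {}
    for p in ET.fromstring(xml_text).iter("p"):
        field = DATASET1_MAP.get(p.get("source"))
        if field and p.text:
            record.setdefault(field, []).append(p.text.strip())
    return record


def normalize_dataset2(xml_text):
    """Map <title> and <freetext> elements onto the common field names."""
    root = ET.fromstring(xml_text)
    record = {}
    title = root.findtext("title")
    if title:
        record.setdefault("title", []).append(title.strip())
    for ft in root.iter("freetext"):
        field = DATASET2_MAP.get((ft.get("category"), ft.get("label")))
        if field and ft.text:
            record.setdefault(field, []).append(ft.text.strip())
    return record


if __name__ == "__main__":
    sample = '<record><p source="id_number">1000.G.10</p></record>'
    print(normalize_dataset1(sample))
```

Adding a third dataset would then mean writing one more mapping table and a small reader, while the search index only ever sees the common field names. Fields with no mapping are silently dropped in this sketch; keeping them in a catch-all field is another option.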


Author Comment

ID: 35463100
That is a very good question. Relevant means returning highly ranked results, which would probably involve storing user interactions. One simplistic approach would be to increment a counter on a record's row in a table every time a user clicks through to find out more about that record, then use those counts to return results sorted by "relevance". The way Amazon returns search results is a good example. A good search based on rankings and/or other variables will be worked on later, though; for now I am more interested in the different technical approaches to solving this problem in a scalable fashion.

I want to know my options for the kind of technology I should use, with the pros and cons and all that, when it comes to making it more scalable.
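
The click-counter idea can be sketched roughly as follows. This is a minimal illustration, assuming a hypothetical `records` table with a `clicks` column; it uses Python's built-in `sqlite3` only to stay self-contained, and the same two SQL statements would carry over to MySQL.

```python
import sqlite3


def open_db():
    """Create an in-memory database with a hypothetical records table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records ("
        " id INTEGER PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " clicks INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def record_click(conn, record_id):
    """Increment the counter when a user opens a record's detail page."""
    conn.execute("UPDATE records SET clicks = clicks + 1 WHERE id = ?", (record_id,))
    conn.commit()


def search(conn, keyword, limit=10):
    """Return matching rows, most-clicked first (ties broken by id).

    The LIKE match is deliberately naive: wildcards in the keyword are not
    escaped, and it scans the table rather than using a full-text index.
    """
    rows = conn.execute(
        "SELECT id, title, clicks FROM records"
        " WHERE title LIKE ?"
        " ORDER BY clicks DESC, id ASC LIMIT ?",
        (f"%{keyword}%", limit),
    )
    return rows.fetchall()
```

Sorting purely by clicks tends to favor records that are already popular, so in practice the counter would more likely be one input to a score than the whole ranking.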

Assisted Solution

by:Ray Paseur
Ray Paseur earned 1336 total points
ID: 35463430
"would probably involve storing user interactions" -- Yep, exactly.

Wow, that is the kind of question that millionaires hire MIT graduates to do research on! It's also kind of a trade secret of the search engines. No disrespect to all the SEO experts, but the search engines have multi-billion-dollar budgets to tackle this problem, and they have hundreds of professional software developers working hard to improve the algorithms.

Let me try to put scalability into perspective for one of the search engines.  A couple of years ago I had a chance to visit the Yahoo East data center in northern Virginia.  The data center manager planned his day based on the number of truckloads of servers that would arrive that day.  You might not want to deal with questions like that.

A good solution might be to contact Google about enterprise-level application support.  They have kind of set the standard in search and relevance programming, and they have very robust commercial and government business offices.

Assisted Solution

msk_apk earned 664 total points
ID: 35467445

Expert Comment

by:Guy Hengel [angelIII / a3]
ID: 37319246
This question has been classified as abandoned and is closed as part of the Cleanup Program. See the recommendation for more details.
