Implementing a scalable search solution: mapping two metadata datasets and searching them

Posted on 2011-04-25
Last Modified: 2012-05-11
Hi folks,

I have a somewhat complicated problem. I am building a web application in PHP/MySQL that lets a user search through the data efficiently, much as Google does, and returns the most relevant results.

Basically, what I want is a scalable solution, preferably a REST service, that returns structured XML when you pass it a query parameter. The main problem is the kind of data I have: each dataset contains one XML file per record, and each dataset has its own schema.

The datasets contain around 200,000 records for now and could grow to a million in the future.

So, a couple of questions:
How best to map this data so that a keyword search always retrieves the most relevant entries, and not just a sequential bunch of entries?
I tried Apache Solr but couldn't figure out whether its schema is flexible enough to let me enter any fields corresponding to the XML I have and define their types. It looks like a certain schema syntax has to be followed.
The most important thing is to make this database system scalable, so that different types of metadata datasets can be imported in a plug-and-play fashion, indexed appropriately, and made available for searching. I think this is only possible if a mapping schema can be developed, but I am open to any other ideas.

I have attached the two sample dataset files.

Question by:dsccgl

    Accepted Solution

    "map this data so that a keyword search always retrieves the most relevant" -- Please help us understand what constitutes "relevant".

    Here is data set #1
    		<p source="maker">, Mission Indian</p>
    		<p source="materials">Sumac coiled on a deergrass bundle foundation, design in natural juncus, bound-under fag end stitches on the interior</p>
    		<p source="date_made">before 1910</p>
    		<p source="earliest_year">1890</p>
    		<p source="id_number">1000.G.10</p>
    		<p source="measurements">10 3/4 in x 3 3/4 in (27 cm x 9.5 cm)</p>
    		<p source="credit_line">Southwest Museum of the American Indian Collection, Autry National Center, Tomas Lorenzo Duque Memorial Colllection. Collected: Duque, Mr. Tomas Lorenzo, San Diego County, CA</p>
    		<p source="subjects">Mission Indian basket, probably Diegueño, sumac coiled on a deergrass bundle foundation, design in natural juncus, before 1910.
    Collected by Tomas Lorenzo Duque, from San Felipe - Santa Ysabel - Warner's Ranch area, San Diego County, California, between 1900-1910.
    Subjects: bundle foundation, triangles</p>
    		<p source="media_id"></p>


    Here is data set #2
           <title label="Object Name">Awl/perforator</title>
           <online_media mediaCount="2">
             <media type="Images" thumbnail="http://www.location/collections//4027/429/139.thumb.jpg">http://www.location/collections//4027/429/139.700x700.jpg</media>
             <media type="Images" thumbnail="http://www.location/collections//42/319/002408.thumb.jpg">http://www.location/collections//42/319/002408.700x700.jpg</media>
           <freetext category="identifier" label="Catalog Number">2408</freetext>
           <freetext category="dataSource" label="Data Source">smith</freetext>
           <freetext category="objectType" label="Object Type">Tools and Equipment (General)</freetext>
           <freetext category="culture" label="Culture/People">Unknown archaeological culture</freetext>
           <freetext category="physicalDescription" label="Media/Materials">Chalcedony</freetext>
           <freetext category="physicalDescription" label="Techniques">Flaked/chipped</freetext>
           <freetext category="date" label="Date Created">10,000 BC-AD 1600</freetext>
           <freetext category="notes" label="Collection History">Collected by archaeologist George Hubbard Pepper (1873-1924, MAI staff member); acquired by George Heye in 1904.</freetext>
           <freetext category="setName" label="See more items in">Archaeological Items</freetext>
           <freetext category="place" label="place">Ganado region; Ganado, Navajo Reservation; Apache County; Arizona; USA</freetext>
           <freetext category="place" label="Site Name">Ganado region</freetext>
           <object_type>Tools and Equipment (General)</object_type>
           <online_media_type>Catalog Cards</online_media_type>
           <culture>American Indians</culture>
           <culture>Native Americans</culture>
           <culture>Unknown archaeological culture</culture>
           <culture>Unknown archaeological culture</culture>
           <place>Apache County</place>
           <place>Ganado, Navajo Reservation</place>
             <L2 type="Country">USA</L2>
             <L3 type="State">Arizona</L3>
             <L4 type="County">Apache County</L4>
             <L5 type="City">Ganado, Navajo Reservation</L5>
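One way to picture a mapping schema for these two samples is a small per-dataset rule that says which element and which attribute carry the field name, plus an alias table that folds source-specific names into shared ones. The following is only a minimal sketch under assumptions: the `<record>` wrapper element, the `DATASET_RULES` and `FIELD_ALIASES` tables, and the alias choices are all hypothetical and not part of either dataset.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-dataset rule: (element tag, attribute that names the field).
DATASET_RULES = {
    "dataset1": ("p", "source"),
    "dataset2": ("freetext", "category"),
}

# Hypothetical aliases that fold dataset-specific names into shared field names.
FIELD_ALIASES = {
    "id_number": "identifier",
    "date_made": "date",
    "materials": "physicalDescription",
}


def normalize(xml_text, dataset):
    """Flatten one XML record into {common_field: [values]}."""
    tag, attr = DATASET_RULES[dataset]
    root = ET.fromstring(xml_text)
    record = {}
    for elem in root.iter(tag):
        name = elem.get(attr)
        value = (elem.text or "").strip()
        if not name or not value:
            continue  # skip unnamed or empty fields, e.g. an empty media_id
        field = FIELD_ALIASES.get(name, name)
        record.setdefault(field, []).append(value)
    return record


# The <record> wrapper is assumed; the samples above show no root element.
rec1 = normalize(
    '<record><p source="id_number">1000.G.10</p>'
    '<p source="date_made">before 1910</p>'
    '<p source="media_id"></p></record>',
    "dataset1",
)
rec2 = normalize(
    '<record>'
    '<freetext category="identifier" label="Catalog Number">2408</freetext>'
    '<freetext category="physicalDescription" label="Media/Materials">Chalcedony</freetext>'
    '<freetext category="physicalDescription" label="Techniques">Flaked/chipped</freetext>'
    '</record>',
    "dataset2",
)
print(rec1)
print(rec2)
```

With this shape, adding a third dataset would mean adding one rule and a few aliases rather than new parsing code. The flattened records could then be handed to whatever indexer is chosen.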



    Author Comment

    That is a very good question. "Relevant" means returning highly ranked results first, which would probably involve storing user interactions. One simplistic approach would be to increment a counter on a record's row in a table every time a user clicks through to find out more about that record, and then to sort results by that counter as a measure of relevance. Amazon's search results are a good example. A good ranking based on clicks and/or other variables will be worked on later, though; for now I am more interested in the different technical approaches to solving this problem in a scalable fashion.
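The click-counter idea described here can be sketched in a few lines. This is only an illustration under assumptions: an in-memory SQLite database stands in for MySQL, the `records` table, its columns, and the `record_click` and `search` helpers are hypothetical, and the record "X-1" is made-up sample data.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    "id TEXT PRIMARY KEY, title TEXT, clicks INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO records (id, title) VALUES (?, ?)",
    [
        ("1000.G.10", "Mission Indian basket"),
        ("2408", "Awl/perforator"),
        ("X-1", "Basket fragment"),  # made-up record for the demo
    ],
)


def record_click(record_id):
    """Increment the counter when a user opens a record's detail page."""
    conn.execute(
        "UPDATE records SET clicks = clicks + 1 WHERE id = ?", (record_id,)
    )


def search(keyword):
    """Return ids of matching records, most-clicked first (ties broken by id)."""
    rows = conn.execute(
        "SELECT id FROM records WHERE title LIKE ? ORDER BY clicks DESC, id",
        (f"%{keyword}%",),
    )
    return [row[0] for row in rows]


before = search("basket")
record_click("X-1")
record_click("X-1")
record_click("1000.G.10")
after = search("basket")
print(before, after)
```

A plain counter like this favours records that are already popular, so it would probably need to be combined with a text-match score once ranking is worked on properly.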

    I want to know my options for the kind of technology I should use, with the pros and cons of each, when it comes to making this more scalable.

    Assisted Solution

    by:Ray Paseur
    "would probably involve storing user interactions" -- Yep, exactly.

    Wow, that is the kind of question that millionaires hire MIT graduates to do research on!  It's also kind of a trade secret of the search engines.  No disrespect to all the SEO experts, but the search engines have multi-billion dollar budgets to tackle this problem, and they have hundreds of professional software developers working hard to improve the algorithms.

    Let me try to put scalability into perspective for one of the search engines.  A couple of years ago I had a chance to visit the Yahoo East data center in northern Virginia.  The data center manager planned his day based on the number of truckloads of servers that would arrive that day.  You might not want to deal with questions like that.

    A good solution might be to contact Google about enterprise-level application support.  They have kind of set the standard in search and relevance programming, and they have very robust commercial and government business offices.

    Expert Comment

    by:Guy Hengel [angelIII / a3]
    This question has been classified as abandoned and is closed as part of the Cleanup Program. See the recommendation for more details.
