Solved

file content search, most algorithm-wise, performance-wise approach.

Posted on 2013-01-22
8
339 Views
Last Modified: 2013-02-11
Hi there;

I have a "sample" file which has several records in it. The actual file is meant to have hundreds and thousands of records.

The sample file content is as follows:

Company A:
1  0.6
44 0.13
46       0.17
4620       0.0
468       0.15
4631       0.15
4673       0.9
46732     3.6
Company B:
1       0.92
44       0.5
46       0.2
467       1.0
48       1.2
....

The blocks in the content are tab seperated.

Now, the aim is to search and find the longest matching input which is for example 4673212.
In company A, we will end up it 46732 and the program should output the corresponding item which is 3.6 with comapny letter and in company B, 467 and the output is 1.0 with comapny letter. If no match, simply output no match for that company.

This is fundamentally is to implement but I have to do it most efficient way.
I am using C# and Java.

Now, for the file search the culprit is this:
For each line read, I am checking whether it starts with "Company" literal or not, and i think this is really slowing. Then, I simply exploit try-catch block in which the program fells into catch and if "Company" literal is caught and I start populating a dictionary of values for that company values and finally, I add company dictionaries into a list.

Since, the minimalizing and fast algorithm is what I need, how can I do the check for each line most feasible way? Is it always an if-check?

Note that the content of the files are just to output, no calculation is expected. I assumed the values as int and float but should I go for a list having chars only in it?

Thanks for the help.

Regards.
0
Comment
Question by:jazzIIIlove
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 4
  • 2
  • 2
8 Comments
 
LVL 84

Assisted Solution

by:ozo
ozo earned 334 total points
ID: 38804576
If you are only going to search the file once, a linear scan over the entire file is the best you can do.
If you are going to do many searches over the same file for different inputs,
then you can sort the lines or build a trie.
0
 
LVL 16

Assisted Solution

by:krakatoa
krakatoa earned 166 total points
ID: 38804617
My comment would be the same as ozo's. Depends how many times you are going to do this, and how often the data will change or be updated. The only other key that seems apparent from the sample data is that the "." could be used as a length offset, giving you a way in to the longest record, but again the data would need to be sorted or sortable.
0
 
LVL 12

Author Comment

by:jazzIIIlove
ID: 38804929
There will be many different inputs, but the inputs are coming from console and the question remains the same, I feel that I have to read file once and I am reading the file only once, and storing the values, but with if-check.

I have to search for the file completely for each individual input that comes from user via console. So, do I still need an if-check? I am trying to get rid of the if check concept in the reading of the file.

Also, which tree are we talking about? Should I convert every number block into a character list and search in that list? or should I store each numerical looking character as char in a tree? And which tree? Should I sort the values? The sample content is just as that.

As you can see,

46       0.17
4620       0.0
468       0.15
4631       0.15

It's not vertically sorted. So, should I sort? and which kind of sort for this input?

Regards.
0
Get 15 Days FREE Full-Featured Trial

Benefit from a mission critical IT monitoring with Monitis Premium or get it FREE for your entry level monitoring needs.
-Over 200,000 users
-More than 300,000 websites monitored
-Used in 197 countries
-Recommended by 98% of users

 
LVL 16

Expert Comment

by:krakatoa
ID: 38804954
I have a "sample" file which has

Time to make your mind up if the data is live from a console. Or file. Or not . . .
0
 
LVL 12

Author Comment

by:jazzIIIlove
ID: 38804978
Sorry for misunderstanding.

The data is the file and this data should be checked against for user input coming from console, and if the user input matches it will give the float looking value with the Company in it as output, if not just indicates that Company has no corresponding key, so no float looking value.

Regards.
0
 
LVL 84

Accepted Solution

by:
ozo earned 334 total points
ID: 38805070
depending on the size of the file, and the number of searches you want to do on it
(and perhaps how much memory you have available), you can have different trade offs between pre-processing time, and search time.
One extreme may be to build an Aho–Corasick  finite state machine to scan the file,
Another extreme may be to build a trie or prefix tree
But in any case you should be able to keep the total time proportional to the length of the file + the length of the input to be found.
0
 
LVL 12

Author Comment

by:jazzIIIlove
ID: 38806061
Hi;

I am able to use Aho-Corasick algorithm. In order to use it efficiently, I also read the file content as bytes, but somehow, I couldn't construct a working algorithm on this:

What I am trying to do is:
I first read the file, to mappedbytebuffer, then, while there is anything left in buffer, I get the content to bytebuffer, then transfer it to a byte array. Then, the construction of my algorithm is as:
in the loop, if there is a blank, then the coming byte can be either a character between A and Z, like:
Company A:
A can be coming after blank.

else a value like 0.17 can come. If so, then i try to find the last carriage return and go for +1 of that to reach 4. So, I can take the bytes until the coming blank and reach 46.

46       0.17

But the problem is that I couldn't construct the algorithm as debugging bytes is futile. Can you help me on this?

Below is the code:

try{
			FileInputStream f = new FileInputStream("List.txt");
			FileChannel fileChannel = f.getChannel();
			MappedByteBuffer mbb = fileChannel.map( FileChannel.MapMode.READ_ONLY,
					0L, fileChannel.size( ) );
			byte [] buffer1 = new byte[24];	        
			int i = 0;
			AhoCorasick tree = new AhoCorasick();
			//tree.prepare();
			TreeMap tm = new TreeMap();
			String output = "4620";
			List <Byte> byteList = new ArrayList<Byte> ();
			while ( mbb.hasRemaining( ) ) {	                
				ByteBuffer bb = mbb.get(buffer1, 0, 24);
				byte[] bytearr = new byte[bb.remaining()];
				//if(bytearr[i++] != 13 || bytearr[i++] != 10)
				//bytearr = bb.array();
				bb.get(bytearr);
				if(bytearr[i] == (byte)' ')
				{
					i++;
					if(bytearr[i] >= (byte)'A' || bytearr[i] <= (byte)'Z')
					{
					
					//Letter
					char letter = (char)bytearr[i];
					System.out.println("letterin");
					}
					else
					{
						i=i+2;
						if(bytearr[i] == (byte)'\n')
								{
							i++;
							int k = i;
									while(bytearr[k] != (byte)' ')
									{										
										byteList.add(bytearr[k]);										
									}
									Byte o []  = (Byte[])byteList.toArray();									 
									tree.add(o, output);
									tree.prepare();
								}						
					}
					
				}
				else i++;				
			}
		}
		catch(Exception ex){}

Open in new window

0
 
LVL 12

Author Comment

by:jazzIIIlove
ID: 38847523
Hi there;

Any update for my last comment with the code?

Regards.
0

Featured Post

Get 15 Days FREE Full-Featured Trial

Benefit from a mission critical IT monitoring with Monitis Premium or get it FREE for your entry level monitoring needs.
-Over 200,000 users
-More than 300,000 websites monitored
-Used in 197 countries
-Recommended by 98% of users

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Entity Framework is a powerful tool to help you interact with the DataBase but still doesn't help much when we have a Stored Procedure that returns more than one resultset. The solution takes some of out-of-the-box thinking; read on!
Basic understanding on "OO- Object Orientation" is needed for designing a logical solution to solve a problem. Basic OOAD is a prerequisite for a coder to ensure that they follow the basic design of OO. This would help developers to understand the b…
Viewers learn about the “for” loop and how it works in Java. By comparing it to the while loop learned before, viewers can make the transition easily. You will learn about the formatting of the for loop as we write a program that prints even numbers…
Viewers learn how to read error messages and identify possible mistakes that could cause hours of frustration. Coding is as much about debugging your code as it is about writing it. Define Error Message: Line Numbers: Type of Error: Break Down…
Suggested Courses

617 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question