# Find a mapping pattern

Hello, attached is a portion of a huge text file
The file only contains four letters:A,G,C,T. I want to find a sequence in the file which has the maximum match. The definition of the maximum match is the total  letters of the occurrence devided by the total number of letters in the file.

For example,
``````AAAGTTACCTTGGAATAAGAAGAAGTCA
``````
. Suppose the above portion is a file, AAG is a sequence. The AAG occuries 4 times, it indicates the total sequence has 12 letters. And the total file has 28 letters, thus the match rate is equal to 12/28.

The question is that we don't know which pattern has the maximum mapping rate except the entire file itself. For a single letter, the rate is about 1/4 when the size of the file increases because of probability.

It involves regular expression and algorithm etc. Thank for any input.
AGCT.txt
###### Who is Participating?

x

Commented:
This version uses your test file.

Here are the best scores:

There were some long patterns, but they did not score well:
``````using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.Diagnostics;

namespace PatternScanner
{
public struct Result
{
public string pattern;      // a pattern that was found more than once
public int count;           // the number of times that pattern occurred
public Result(string p, int c) { pattern = p; count = c; }      // constructor/initializer
}

public partial class frmPatternScanner : Form
{
string fileContents = "";                       // contents of file to be examined
string patternChars = "ACGT";                   // valid characters; characters that can appear in the file or in a pattern
StringBuilder pattern = new StringBuilder();    // (varies) pattern be scanned for now
int maxPatternLen = 0;                          // the maximum allowed pattern length
int fileLen = 0;                                // length of file to be examined
List<Result> results = new List<Result>();      // list of patterns found more than once and the number of times that each occurred

public frmPatternScanner()
{
InitializeComponent();
}

private void btnGo_Click(object sender, EventArgs e)
{
DoScan();
}

private void btnExit_Click(object sender, EventArgs e)
{
Close();
}

private void DoScan()
{
//  Replace this with code to read file
//fileContents = "AAAGTTACCTTGGAATAAGAAGAAGTCA";

//  Read the file into a string
string fileName = @"..\..\..\AGCT.txt";
fileContents = fileContents.Replace("\r", "").Replace("\n", "");    // remove line terminators

fileLen = fileContents.Length;
maxPatternLen = fileContents.Length / 2;

pattern.Append(patternChars[0]);                // the initial pattern is the first character from the list of valid characters

// Loop through all potential patterns and determine how many times each occurs in the file
// Record any patterns that occur more than once
// When an empty pattern is returned, we are done
while (pattern.Length > 0)
{
int occurrenceCount = CountOccurrences(fileContents, pattern);
if (occurrenceCount > 1) RecordResult(pattern, occurrenceCount);
pattern = NextPattern(pattern, occurrenceCount);
}

DisplayResults();
}

//  Find the next potential pattern
//  If the previous pattern (s) was found more than once (lastCount > 1), we try to extend it by appending the first of the valid
//  characters (patternChars[0])
//  If the previous pattern was not found, there is no point in extending it because any extension of a failed pattern will also fail
//  If the previous pattern was not found, we first try a new character in the final position
//  If there are no more characters to try in the final position, we discard the last character and try a new character in the next
//  to last position (if there are no more characters to try in that position, we back up again); we do this by calling NextPattern
//  recursively with the shorter string and a lastCount value of 0, which indicates that the reduced pattern should not be extended
//  Once we have reduced the pattern to an empty string, we are finished because extending the empty string would yield the initial
//  pattern we had when we started out
//  We also do not extend the pattern beyond the permitted maximum length
private StringBuilder NextPattern(StringBuilder s, int lastCount)
{
int n = s.Length;

//  If previous pattern was found, try extending it
if (lastCount > 1 && n < maxPatternLen)
s.Append(patternChars[0]);

//  Otherwise, try a new pattern
else
{
int m = patternChars.IndexOf(s[n - 1]);
Debug.Assert(m >= 0);

//  Try changing the last character
if (m + 1 < patternChars.Length)
s[n - 1] = patternChars[++m];
//  If no more pattern chars, try changing next to last character (recursively)
//  If no previous character to change, we're done (return empty string)
else
{
s.Length -= 1;
if (s.Length > 0) s = NextPattern(s, 0);
}
}

return s;
}

private int CountOccurrences(string fileContents, StringBuilder pattern)
{
string s = pattern.ToString();

//  Initialize to negative values so we only need one test in the loop
int count = -1;
int index = -s.Length;

do
{
++count;                                    // count what we found on previous pass (or set initial value to zero)
index += s.Length;                          // advance past the previous match (or set initial value to beginning of fileContents)
index = fileContents.IndexOf(s, index);     // look for the next occurrence of the pattern
}

return count;
}

private void RecordResult(StringBuilder pattern, int occurrenceCount)
{
//  Construct a new Result from the incoming parameters and add it to the list of Results
}

private void DisplayResults()
{
//  Initialize the ListView
lvResults.View = View.Details;
lvResults.FullRowSelect = true;
lvResults.GridLines = true;
lvResults.Sorting = SortOrder.Descending;       // sort descending by mapping rate

//  Define the columns

//  Add each Result to the ListView
foreach (Result r in results)
{
double d = ((double) r.pattern.Length * r.count) / fileLen;     // calculate the mapping rate for this pattern
ListViewItem lvi = new ListViewItem(d.ToString());
}
}
}
}
``````
0

Commented:
Perhaps I'm just dumb, but if you are discounting the whole file as being the maximum, then wouldn't "whole file - 1" be then next maximum? What determines a valid sequence?
0

IT ConsultantCommented:
I think part of the question is how to identify repeating patterns - so given the input string:
``````AAGAAGCCTACCTACCTACCTACCTACCTAAGCT
``````

How can programmatically identify each repeating pattern, and determine that the pattern CCTA has the larget character count to total string length ratio.

@zhshqzyc
Does a pattern necessarily appear consecutively in the string; i.e. given the string "CCTAAGCCTACTCCTA" is "CCTA" still considered a pattern, even though the the three occurences are separated by "AG" and "CT"?
0

Commented:
I agree with Kaufmed.  Based on the information provided, a pattern consisting of the whole file would have the maximum mapping rate, while the two patterns formed by dropping either the first or the last letter of the whole file pattern would each be the second greatest.

Can you be more precise about the characteristics you are seeking in your candidate patterns?  Should a pattern be a string of individual characters, or could it be an arbitrarily complex regular expression?

If you really have a huge text file, an algorithm that checks for all possible patterns is going to run for a long time.
0

Author Commented:
Hi Kaufmed: Theoretically "whole file - 1" could be the solution. However from the real world experience it is impossible.The usually string pattern has around 20-40 letters. The pattern must repeat at least two times, the who file only repeat one times. Thus the who file or who file -1 can't be the pattern. The pattern's length is less than the half count of total letters.

``````.AAGAAGCCTACCTACCTACCTACCTACCTAAGCT
``````
``````CCTA
``````
is a pattern no matter what letters seperate the occurences. The pattern is a substring appear consectively in the file.About the file's size, we can reduce it to 6000-10000 letters. Don't worry about run time, I just to test the algorithm. C sharp is my best choice. I don't want to use perl/sed/awk etc right now.

Thanks
0

IT ConsultantCommented:
Is there a minimum length to a pattern - can it be two characters, e.g. AC in ACACTACGACACACTAC?
0

Author Commented:
No, there is no minimum length to a pattern. Two caracters could be a pattern theoretically but it can not obtain the maximum match in the real case.
The pattern in the string likes
``````<some letters><pattern><some other letters><pattern><some other letters>...
``````
0

Commented:
This program looks at all patterns no greater than half the size of your file that occur more than once.  It produced the output in the attached picture.

A had a score of 0.464286.
AAG had a score of 0.428571.

The program assumes your entire file and your patterns can be held in memory.

I have also attached all of the program files, should you want those.  Three files were not permitted by Experts Exchange.  This may be for your safety.  I added ".txt" to their names so you could see the contents if necessary, but you should be careful.
The suo file goes in PatternScanner\.
The user file goes in PatternScanner\ PatternScanner\.
The settings file goes in PatternScanner\ PatternScanner\Properties\.

``````using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.Diagnostics;

namespace PatternScanner
{
public struct Result
{
public string pattern;
public int count;
public Result(string p, int c) { pattern = p; count = c; }
}

public partial class frmPatternScanner : Form
{
string fileContents = "";
string patternChars = "ACGT";
StringBuilder pattern = new StringBuilder();
int maxPatternLen = 0;
int fileLen = 0;
List<Result> results = new List<Result>();

public frmPatternScanner()
{
InitializeComponent();
}

private void btnGo_Click(object sender, EventArgs e)
{
DoScan();
}

private void btnExit_Click(object sender, EventArgs e)
{
Close();
}

private void DoScan()
{
//  Replace this with code to read file
fileContents = "AAAGTTACCTTGGAATAAGAAGAAGTCA";

fileLen = fileContents.Length;
maxPatternLen = fileContents.Length / 2;
pattern.Append(patternChars[0]);

while (pattern.Length > 0)
{
int occurrenceCount = CountOccurrences(fileContents, pattern);
if (occurrenceCount > 1) RecordResult(pattern, occurrenceCount);
pattern = NextPattern(pattern, occurrenceCount);
}

DisplayResults();
}

private StringBuilder NextPattern(StringBuilder s, int lastCount)
{
int n = s.Length;

//  If previous pattern was found, try extending it
if (lastCount > 1 && n < maxPatternLen)
s.Append(patternChars[0]);

//  Otherwise, try a new pattern
else
{
int m = patternChars.IndexOf(s[n - 1]);
Debug.Assert(m >= 0);

//  Try changing the last character
if (m + 1 < patternChars.Length)
s[n - 1] = patternChars[++m];
//  If no more pattern chars, try changing next to last character (recursively)
//  If no previous character to change, we're done (return empty string)
else
{
s.Length -= 1;
if (s.Length > 0) s = NextPattern(s, 0);
}
}

return s;
}

private int CountOccurrences(string fileContents, StringBuilder pattern)
{
string s = pattern.ToString();
int count = -1;
int index = -s.Length;

do
{
++count;
index += s.Length;
index = fileContents.IndexOf(s, index);
}
while (index >= 0);

return count;
}

private void RecordResult(StringBuilder pattern, int occurrenceCount)
{
}

private void DisplayResults()
{
lvResults.View = View.Details;
lvResults.FullRowSelect = true;
lvResults.GridLines = true;
lvResults.Sorting = SortOrder.Descending;

foreach (Result r in results)
{
double d = ((double) r.pattern.Length * r.count) / fileLen;
string s = d.ToString();
ListViewItem lvi = new ListViewItem(d.ToString());
}
}
}
}
``````
PatternScannerTest.PNG
PatternScanner.zip
0

Author Commented:
Hi, could you please explain your algorithm a little bit except windows form part? Thank you very much.
0

Commented:
I have added more comments to the code.  If you have any further questions, please let me know.  Please be as specific as you can.  This is a complete copy of the frmPatternScanner.cs file.  None of the other files changed.

``````using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.Diagnostics;

namespace PatternScanner
{
public struct Result
{
public string pattern;      // a pattern that was found more than once
public int count;           // the number of times that pattern occurred
public Result(string p, int c) { pattern = p; count = c; }      // constructor/initializer
}

public partial class frmPatternScanner : Form
{
string fileContents = "";                       // contents of file to be examined
string patternChars = "ACGT";                   // valid characters; characters that can appear in the file or in a pattern
StringBuilder pattern = new StringBuilder();    // (varies) pattern be scanned for now
int maxPatternLen = 0;                          // the maximum allowed pattern length
int fileLen = 0;                                // length of file to be examined
List<Result> results = new List<Result>();      // list of patterns found more than once and the number of times that each occurred

public frmPatternScanner()
{
InitializeComponent();
}

private void btnGo_Click(object sender, EventArgs e)
{
DoScan();
}

private void btnExit_Click(object sender, EventArgs e)
{
Close();
}

private void DoScan()
{
//  Replace this with code to read file
fileContents = "AAAGTTACCTTGGAATAAGAAGAAGTCA";

fileLen = fileContents.Length;
maxPatternLen = fileContents.Length / 2;

pattern.Append(patternChars[0]);                // the initial pattern is the first character from the list of valid characters

// Loop through all potential patterns and determine how many times each occurs in the file
// Record any patterns that occur more than once
// When an empty pattern is returned, we are done
while (pattern.Length > 0)
{
int occurrenceCount = CountOccurrences(fileContents, pattern);
if (occurrenceCount > 1) RecordResult(pattern, occurrenceCount);
pattern = NextPattern(pattern, occurrenceCount);
}

DisplayResults();
}

//  Find the next potential pattern
//  If the previous pattern (s) was found more than once (lastCount > 1), we try to extend it by appending the first of the valid
//  characters (patternChars[0])
//  If the previous pattern was not found, there is no point in extending it because any extension of a failed pattern will also fail
//  If the previous pattern was not found, we first try a new character in the final position
//  If there are no more characters to try in the final position, we discard the last character and try a new character in the next
//  to last position (if there are no more characters to try in that position, we back up again); we do this by calling NextPattern
//  recursively with the shorter string and a lastCount value of 0, which indicates that the reduced pattern should not be extended
//  Once we have reduced the pattern to an empty string, we are finished because extending the empty string would yield the initial
//  pattern we had when we started out
//  We also do not extend the pattern beyond the permitted maximum length
private StringBuilder NextPattern(StringBuilder s, int lastCount)
{
int n = s.Length;

//  If previous pattern was found, try extending it
if (lastCount > 1 && n < maxPatternLen)
s.Append(patternChars[0]);

//  Otherwise, try a new pattern
else
{
int m = patternChars.IndexOf(s[n - 1]);
Debug.Assert(m >= 0);

//  Try changing the last character
if (m + 1 < patternChars.Length)
s[n - 1] = patternChars[++m];
//  If no more pattern chars, try changing next to last character (recursively)
//  If no previous character to change, we're done (return empty string)
else
{
s.Length -= 1;
if (s.Length > 0) s = NextPattern(s, 0);
}
}

return s;
}

private int CountOccurrences(string fileContents, StringBuilder pattern)
{
string s = pattern.ToString();

//  Initialize to negative values so we only need one test in the loop
int count = -1;
int index = -s.Length;

do
{
++count;                                    // count what we found on previous pass (or set initial value to zero)
index += s.Length;                          // advance past the previous match (or set initial value to beginning of fileContents)
index = fileContents.IndexOf(s, index);     // look for the next occurrence of the pattern
}

return count;
}

private void RecordResult(StringBuilder pattern, int occurrenceCount)
{
//  Construct a new Result from the incoming parameters and add it to the list of Results
}

private void DisplayResults()
{
//  Initialize the ListView
lvResults.View = View.Details;
lvResults.FullRowSelect = true;
lvResults.GridLines = true;
lvResults.Sorting = SortOrder.Descending;       // sort descending by mapping rate

//  Define the columns

//  Add each Result to the ListView
foreach (Result r in results)
{
double d = ((double) r.pattern.Length * r.count) / fileLen;     // calculate the mapping rate for this pattern
ListViewItem lvi = new ListViewItem(d.ToString());
}
}
}
}
``````
0

IT ConsultantCommented:
I have a late entry. ;)

I'm not sure how this compares to the suggestion above in terms of performance, memory usage, general logic, etc (sorry, I didn't really understand that solution), but this completes with your sample file in just over 1 second, and the results seem to agree with above:
``````10 Best
Pattern         Total Char Count
CT              1064
TC              938
TT              780
CA              766
TG              760
AT              714
AG              640
AA              596
TA              580
CC              576
Done, 1.1635915 seconds elapsed.
``````

``````using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Collections;
using System.IO;
using System.Text.RegularExpressions;
using System.Diagnostics;

namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
Stopwatch stopwatch = new Stopwatch();
stopwatch.Start();

HashSet<string> patterns = new HashSet<string>();

string dna = File.ReadAllText("dna.txt").Replace("\r\n", "").Replace("\r", "").Replace("\n", ""); // Make sure it's all a single string

if (dna.Length % 2 != 0)
dna = dna + "-"; // Pad the DNA string to an even number of characters

int minPatternLen = 2;
int maxPatternLen = dna.Length / 2;

// Find all possible patterns by looping through the string
// e.g. start with dna.Substring(0, 2) through dna.Substring(n, 2) until n + 2 is greater than dna.Length
// then do dna.Substring(0, 3) though dna.Substring(n, 3) until n + 3 is greater than dna.Length
// all the way through dna.Substring(0, x)...dna.Substring(n, x) where x is one half of dna.Length, until n + x > dna.Length
for (int i = minPatternLen; i < maxPatternLen; i++) // i representing pattern length, 2 through half of dna.lenght
{
// The maximum number of possible unique patterns for any given pattern length
// is n raised to the 4th, where n is the number of chars in a pattern
// so once we've found possiblePatterns patterns, stop looking for patterns
// of this length, and move on to the next longest one
double possiblePatterns = Math.Pow(i, 4);
double patternsFound = 0;
for (int j = 0; i + j < dna.Length; j += i) // j representing an index into the string of dna
{
if (patternsFound >= possiblePatterns)
break;

string pattern = dna.Substring(j, i);
// Sliding down the dna string one char at a time, getting i chars,
// and checking if we've already found this pattern - if not add
// it to the list and increment the number of patterns found
if (!patterns.Contains(pattern))
{
patternsFound++;
}
}
}

Dictionary<string, int> results = new Dictionary<string, int>(); // key = the pattern, val = total char count

// Loop through each pattern to see how many times it occurs
// if more than once, add it to the list of results
Regex patternRegex;
foreach (string pattern in patterns)
{
patternRegex = new Regex(pattern);
MatchCollection matches = patternRegex.Matches(dna);
if (matches.Count > 1)
}

var orderedResults = results.OrderByDescending(x => x.Value).ToArray();
Console.WriteLine("10 Best");
Console.WriteLine("Pattern\t\tTotal Char Count");
for (int i = 0; i < 10; i++)
Console.WriteLine("{0}\t\t{1}", orderedResults[i].Key, orderedResults[i].Value);

stopwatch.Stop();

Console.WriteLine("Done, {0} seconds elapsed.", stopwatch.Elapsed.TotalSeconds);
}
}
}
``````
0

Commented:
tgerbert, that's an interesting approach.

I borrowed your stopwatch idea, but I measured only the search phase of both programs (not the sort or display phases).  Version B (yours) took 8-10 seconds on my computer, while Version A took about 4 seconds.  Clearly, you have a nicer computer; it's much faster than mine.

Also, Version A found 4243 patterns (four of which were single character, which Version B intentionally excluded); Version B found 1855.

I'm not sure line 41 does exactly what you intended.  If you have j += i instead of j += 1, you will skip over many potential patterns.  For example, when i is 2, and the file begins with "TTTATGTCT", TT, TA, TG and TC will be considered; but AT, GT and CT will be omitted.

I tried changing the i to a 1, but the number of potential patterns collected was enormous and it took a very long time and eventually ran out of memory.  In Version A, only successful patterns are stored and XXXXY will not be considered unless XXXX occurred more than once.
0

IT ConsultantCommented:
Yup, you're right - it should be j++, but that causes an OutOfMemoryException on my system. Even if it worked correctly, I'm not surprised that it would run slower - Regex's are noticably slower than build-in string/char handling routines.

I thought it was an interesting question, though, and wanted to take a crack at it. ;)
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.