Solved: Unnecessary white space being written to file....

I can't see a difference between the examples you posted... can you give a few more details about what seems to be happening?

S-Twilley

sorry, i've seen what you meant now... ignore my last message

S-Twilley

does this make any difference...

Public Sub Log(ByVal logMessage As String, ByRef w As TextWriter)
    w.WriteLine(logMessage)
    w.Flush()
End Sub

========================

bit of a dumb question i know

I've rarely used the Flush method; does it make any difference if you remove it? I'm also assuming, since you only posted that bit of code, that the value of logMessage doesn't contain any added whitespace when passed to Log()...

Are those lines all the same... except for the number at the beginning?

addicktz

ASKER

yes, they are all the same...

addicktz

ASKER

that sounds about right

S-Twilley

ok... my idea for this one is to remove all spaces from the string, then split the string up into chunks of 49 chars (49 is the length of what one line should be without spaces)

for each chunk (this is in case you pass more than one line somehow)
    if it's a full chunk... insert the spaces back into the right places and write it to the textwriter
    remove chunk from list
next

any chunks left over (because they weren't complete)... return them to the calling sub so that they can be completed (by adding on more of the incoming data, wherever it is coming from)
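
A minimal sketch of the chunking step, assuming every complete line really is 49 characters once the spaces are gone. The names TakeFullChunks and ChunkLength are made up for illustration, and the "insert the spaces back" step is left out because it depends on the real line layout.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module ChunkSketch
    ' Assumed length of one line with all white space removed.
    Const ChunkLength As Integer = 49

    ' Joins the leftover from the previous call with the new data, strips
    ' white space, returns the complete chunks and hands the incomplete
    ' tail back through leftover so the caller can pass it in next time.
    Function TakeFullChunks(ByVal pending As String, ByVal incoming As String, ByRef leftover As String) As List(Of String)
        Dim stripped As String = Regex.Replace(pending & incoming, "\s", "")
        Dim chunks As New List(Of String)
        Dim pos As Integer = 0
        Do While stripped.Length - pos >= ChunkLength
            chunks.Add(stripped.Substring(pos, ChunkLength))
            pos += ChunkLength
        Loop
        leftover = stripped.Substring(pos)
        Return chunks
    End Function

    Sub Main()
        Dim leftover As String = ""
        Dim chunks As List(Of String) = TakeFullChunks("", New String("a"c, 60), leftover)
        Console.WriteLine(chunks.Count)      ' 1 full chunk
        Console.WriteLine(leftover.Length)   ' 11 chars waiting for more data
    End Sub
End Module
```

The caller would keep leftover between reads and pass it back in as pending on the next call.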

S-Twilley

ok, different line of approach... is the format of the lines something like

ID User (Protocol)

ID -> Number
User -> Somename@SomeOthername
Protocol -> Some other text

Obviously the names might be wrong, but I'm more curious about the layout...

It's easy enough to read up to the end of the last digit...
then the next non-whitespace char is the beginning of User... and the end of User is when the next whitespace char is found
Protocol (or whatever it is) is enclosed in brackets... and assuming that Protocol can't contain a bracket character... it should be easy to work out what the line should be... then reformat it and write that to the text writer.

If you have a reference to what these lines should look like... that'll help
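
As a rough sketch of that parsing, assuming the layout really is ID, then User with no spaces in it, then a bracketed Protocol with no brackets inside: the pattern and the names LinePattern and Reformat are only for illustration, and padding inside the User part is not handled.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module HeaderLineSketch
    ' id = leading digits, user = next run of non-whitespace,
    ' protocol = text inside the final pair of brackets.
    Private ReadOnly LinePattern As New Regex("^\s*(?<id>\d+)\s+(?<user>\S+)\s+\((?<protocol>[^()]*)\)\s*$")

    ' Rebuilds the line with single spaces; lines that do not fit the
    ' assumed layout are returned unchanged.
    Function Reformat(ByVal line As String) As String
        Dim m As Match = LinePattern.Match(line)
        If Not m.Success Then Return line
        Dim protocol As String = Regex.Replace(m.Groups("protocol").Value.Trim(), "\s+", " ")
        Return m.Groups("id").Value & " " & m.Groups("user").Value & " (" & protocol & ")"
    End Function

    Sub Main()
        ' Prints: 42 Somename@SomeOthername (Some other text)
        Console.WriteLine(Reformat("42     Somename@SomeOthername    (Some   other text)"))
    End Sub
End Module
```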

S-Twilley

assuming the above format is correct.... if the ")" character isn't permitted in "User" or "Protocol".. that would help solve this I think, but till you link me to a page with the format of these lines, I can't be sure

addicktz

ASKER

ok, well, it basically goes like this

Article ID -> number
Subject -> text

and the text can have ")" in it. http://www.3nter.net/sample.txt is a reference for what they should look like; if you would like more samples please let me know.

S-Twilley

ok, so it's:

ArticleID(whitespace)Subject(newline)

Well you can read up until you get a newline, and then parse and trim some of the junk out of it... but I think with the definition of subject being so loose... it might be hard to know what spaces are valid, and which are not.

913479 Power-Poster@power-post.org (Power-Post 2000)
^ ^ ^

In the above line we've got 3 spaces... we know that the first one is valid as it separates the ArticleID and Subject... but as for the next two... it gets a bit complicated. We could replace long runs of white space with a single space. I suppose one alternative, if the subjects change rarely, is to keep a list of common subjects (each with all white space removed, paired with what it should look like)... then when parsing, you remove all white space from the subject of that line and look in your list for a matching (cleaned out) subject... if it finds one, it replaces it with the proper subject.

Any other actual definitions like you gave for the other post would be great... but if it is ALWAYS "Power-Poster@power-post.org (Power-Post 2000)" and can't be anything else... or changes so rarely that it's not really too much of a problem, then I think we can code something... but I prefer taking into account a sudden change
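
Both ideas can be sketched together, assuming each line is an ArticleID followed by white space and then the subject. The module and its names (KnownSubjects, Register, CleanLine) are hypothetical, and the lookup table only helps for subjects you have registered beforehand.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module SubjectCleanup
    ' Proper subjects keyed by their whitespace-free form (hypothetical table).
    Private ReadOnly KnownSubjects As New Dictionary(Of String, String)

    Function StripWhitespace(ByVal text As String) As String
        Return Regex.Replace(text, "\s", "")
    End Function

    Sub Register(ByVal properSubject As String)
        KnownSubjects(StripWhitespace(properSubject)) = properSubject
    End Sub

    ' Uses the known spelling when the stripped subject matches one,
    ' otherwise falls back to squeezing runs of white space to one space.
    Function CleanLine(ByVal line As String) As String
        Dim m As Match = Regex.Match(line, "^\s*(?<id>\d+)\s+(?<subject>.*?)\s*$")
        If Not m.Success Then Return line
        Dim subject As String = m.Groups("subject").Value
        Dim proper As String = Nothing
        If Not KnownSubjects.TryGetValue(StripWhitespace(subject), proper) Then
            proper = Regex.Replace(subject, "\s+", " ")
        End If
        Return m.Groups("id").Value & " " & proper
    End Function

    Sub Main()
        Register("Power-Poster@power-post.org (Power-Post 2000)")
        ' Prints: 913479 Power-Poster@power-post.org (Power-Post 2000)
        Console.WriteLine(CleanLine("913479   Power-Poster@power-post.org     (Power- Post   2000)"))
    End Sub
End Module
```

The fallback alone cannot repair a space that was inserted in the middle of a word (like "Power- Post" above); only the lookup can, which is why it only pays off if the subjects rarely change.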

addicktz

ASKER

ok, I am very sorry, I was under the impression for some reason that sample.txt was in fact sample2.txt, I was wondering where you were coming up with user and all that =] i am very sorry, please accept my apology. the sample that we should be looking at is

http://www.3nter.net/sample2.txt

I am pretty sure that subject line can contain any character...

addicktz

ASKER

well to be honest
i need the whitespace taken out of both sample.txt and sample2.txt

addicktz

ASKER

and well, if its easier to deal with the space when reading from the file we can go that route as well

S-Twilley

that sample2 link didnt work for me... although my connection is being worked to the max at the mo

addicktz

ASKER

a perfect example in sample2.txt is 1444757

try the link again.....

addicktz

ASKER

1444757 Attn: Zippy Beyonce 2003 Dangerously In Love [05/15] -05 - Be with you.mp3 (07/19)

S-Twilley

ok... i think the best way of going about this is... as you read in each entry (terminated with a new line char or whatever)... remove ALL white space... see if there are any previous entries (with spaces removed) which match the same pattern as the current one... if the patterns match, then this entry is part of that series.

Now, if we have matched it to a previous series, then we check which of the two (current entry or previous entries) contain the least amount of white space... the one with the least amount I assume to be closer to what the line SHOULD be... then format all entries of that group according to the "better" format. This would take into account the first entry of a group being the "padded" entry and the next one being the proper or at least less padded entry.

If the entry doesn't match, then we have a new group.

e.g.

Line A -> 1447710 ATTN: /3if // - Faith No More - We Care A Lot.mp3 (07/16)
Line A' -> 1447710ATTN:/3if//-FaithNoMore-WeCareALot.mp3(#A/#B)
no matching group for A'
create group

...

Line X -> 1447710 ATTN: /3if // - Faith No More - We Care A Lot.mp3 (15/16)
Line X' -> 1447710ATTN:/3if//-FaithNoMore-WeCareALot.mp3(#A/#B)
matches A' group

Line X is less padded than Line A... Line X therefore overrides as better formatting... reformat all lines in group according to Line X

Line A -> 1447710 ATTN: /3if // - Faith No More - We Care A Lot.mp3 (07/16)
'....
Line X -> 1447710 ATTN: /3if // - Faith No More - We Care A Lot.mp3 (15/16)

===================================================

Now... that's the kinda logic i think would work.. because you can never really tell what white space is unnecessary since there's no strict format to the lines... you can only assume that white space is never removed from a line and therefore the shortest (full) line has the best format.

Now, as for implementing this, I'm a lil pushed for time till Wednesday.. have a final year project due on Wednesday, which was set back in Sept... and I've yet to really start... so it might be a while before I can think about coding this.... feel free to code something yourself or get another expert in on it... but at least now you might have something to work with.

If not, you'll just have to be patient till im back!

addicktz

ASKER

i understand up to the reformatting of the lines, I'm not sure how to code that....

S-Twilley

Hi... as mentioned, I have coursework due on Wednesday... so if you post up a comment on here on Wednesday evening, to remind me.... I'll work on some code.

I don't mind if you bring this to the attention of another expert, and award them the points.

Good luck

addicktz

ASKER

remind

S-Twilley

ok, i knocked this together (its 3am)... and im away for the weekend... try and work around with this but post if you have problems getting it working or understanding it and i'll help when i get back:

===========================================================

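' At the top of the file:
Imports System.Collections
Imports System.Text.RegularExpressions

' Members of the containing class or form:
Dim _reading As Boolean = False      ' guards against re-entrant calls
Dim _oldbuffer As String = ""        ' incomplete line left over from the last read
Dim _newLineChar As String = vbCrLf

Dim regPartGrabber As Regex

' Maps a whitespace-free line format to its GroupData
Dim allData As New Hashtable

Sub SetupRegex()
    ' Matches a part counter such as "07/16", "7 of 16" or "7 de 16"
    Dim sSep() As String = {"of", "de", "/"}
    Dim sSepJoined As String = String.Join("|", sSep)
    sSepJoined = "(?<joiner>(" & sSepJoined & "))"
    ' Lookarounds rather than \D, so the brackets around the counter stay in the line
    Dim sRegex As String = "(?<!\d)(?<partno>\d+)\s*" & sSepJoined & "\s*(?<parts>\d+)(?!\d)"
    regPartGrabber = New Regex(sRegex)
End Sub

Sub ReadingIn()
    If _reading Then Exit Sub
    _reading = True

    Try
        Dim sBuffer As String = ""
        Dim sLine As String

        'sBuffer = Data.Read   ' placeholder: read the next block of incoming data here
        sBuffer = _oldbuffer & sBuffer

        Dim iNewline As Integer = sBuffer.IndexOf(_newLineChar)

        Do While iNewline >= 0
            ' Take one complete line off the front of the buffer
            sLine = sBuffer.Substring(0, iNewline)
            If iNewline + _newLineChar.Length >= sBuffer.Length Then
                sBuffer = ""
                iNewline = -1
            Else
                sBuffer = sBuffer.Substring(iNewline + _newLineChar.Length)
                iNewline = sBuffer.IndexOf(_newLineChar)
            End If

            ' Strip all white space, then look for the part counter
            Dim iLineTrim As String = Regex.Replace(sLine, "\s", "")
            Dim matchNumber As Match = regPartGrabber.Match(iLineTrim)

            If matchNumber.Success Then
                ' Swap the matched counter (at its own position only) for a placeholder
                Dim sPlaceholder As String = "*PARTDATA_" & matchNumber.Groups("joiner").Value & "*"
                Dim iLineFormat As String = iLineTrim.Substring(0, matchNumber.Index) & sPlaceholder & iLineTrim.Substring(matchNumber.Index + matchNumber.Length)
                Dim iPartNo As Integer = Integer.Parse(matchNumber.Groups("partno").Value)
                Dim thisData As GroupData

                If allData.ContainsKey(iLineFormat) Then
                    ' Existing group: record this part number
                    thisData = DirectCast(allData(iLineFormat), GroupData)
                    ' Prefer the shorter format for the group
                    If thisData.LineFormat.Length > iLineFormat.Length Then
                        thisData.LineFormat = iLineFormat
                    End If
                    thisData.Parts.Add(iPartNo)
                Else
                    ' New group
                    thisData = New GroupData(iLineFormat, iPartNo, Integer.Parse(matchNumber.Groups("parts").Value), matchNumber.Groups("joiner").Value)
                    allData.Add(iLineFormat, thisData)
                End If
            End If
        Loop

        ' Whatever is left has no newline yet; keep it for the next call
        _oldbuffer = sBuffer
    Catch ex As Exception
        ' Errors are swallowed here; log ex.Message while testing
    Finally
        _reading = False
    End Try
End Sub

Class GroupData
    Public LineFormat As String          ' line with the counter replaced by a placeholder
    Public Parts As New ArrayList        ' part numbers seen so far (Integer)
    Public NumberOfParts As Integer
    Public Joiner As String              ' "of", "de" or "/"

    Public Sub New(ByVal sLineFormat As String, ByVal iPartNo As Integer, ByVal iParts As Integer, ByVal sJoiner As String)
        LineFormat = sLineFormat
        Parts.Add(iPartNo)
        NumberOfParts = iParts
        Joiner = sJoiner
    End Sub

    ' Rebuilds the line for the part stored at iIndex, or "" if out of range
    Public Function BuildLine(ByVal iIndex As Integer) As String
        If iIndex >= 0 AndAlso iIndex <= Parts.Count - 1 Then
            Dim partData As String = Parts(iIndex).ToString() & " " & Joiner & " " & NumberOfParts.ToString()
            Return LineFormat.Replace("*PARTDATA_" & Joiner & "*", partData)
        Else
            Return ""
        End If
    End Function
End Class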

S-Twilley

Ok, going to be a slight problem with that, will try and post up a correction (i forgot to ignore the ID at the beginning of each line)

S-Twilley

I've gotta rush away for the weekend, I have it sort of working now, but not sufficient to post... have noticed a few problems:

1447139 []-[ #altbin@EFNet ]-[ Leaves - Breathe (192k VBR) ]-[08/26] - "01-leaves-i_go_down-prv.mp3" yEnc (05/30)
1447140 []-[ #altbin@EFNet ]-[ Leaves - Breathe (192k VBR) ]-[08/26] - "01-leaves-i_go_down-prv.mp3" yEnc (10/30)
1447141 []-[ #altbin@EFNet ]-[ Leaves - Breathe (192k VBR) ]-[08/26] - "01-leaves-i_go_down-prv.mp3" yEnc (06/30)
1447142 []-[ #altbin@EFNet ]-[ Leaves - Breathe (192k VBR) ]-[08/26] - "01-leaves-i_go_down-prv.mp3" yEnc (03/30)

if you notice... those lines have other number values which would confuse the parser... it might not know which is the (part no / number of parts) field. I'll have to have another think about it while I'm away (should be back on Sunday/Monday)
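
One possible way around it, assuming the part counter is always the last bracketed "(xx/yy)" at the very end of the line, as it is in the four lines above: anchor the pattern to the end of the line so "[08/26]" and "(192k VBR)" are skipped. This is only a sketch with made-up names, and it only handles the "/" joiner.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module PartCounterSketch
    ' Assumption: the part counter is the final "(n/m)" on the line.
    Private ReadOnly TrailingPart As New Regex("\(\s*(?<partno>\d+)\s*/\s*(?<parts>\d+)\s*\)\s*$")

    Sub Main()
        Dim line As String = "1447139 []-[ #altbin@EFNet ]-[ Leaves - Breathe (192k VBR) ]-[08/26] - ""01-leaves-i_go_down-prv.mp3"" yEnc (05/30)"
        Dim m As Match = TrailingPart.Match(line)
        If m.Success Then
            ' Prints: 05 of 30
            Console.WriteLine(m.Groups("partno").Value & " of " & m.Groups("parts").Value)
            ' Same line with only the trailing counter swapped for a placeholder
            Console.WriteLine(line.Substring(0, m.Index) & "(#A/#B)")
        End If
    End Sub
End Module
```

If some posters put the counter somewhere other than the end of the line, this assumption breaks and the pattern would need another rule.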

S-Twilley

in the sample2.txt file, it says there are only 200 or so groups... but when i group all the entries where the unique part of a group entry is of the form (xx/yy)... im getting a lot more. Should I be ignoring certain entries like m3u, nfo and sfv files?

addicktz

ASKER

m3u is a playlist......all the files in the group should be kept together

S-Twilley

Ok, what I'll do is sort the list manually, and see what data to expect. I also noticed some other lines within the data that weren't file descriptions; they seemed to be some sort of bulletins/alerts which don't have the (xx/yy) values in the lines. How should these be handled?

addicktz

ASKER

well, im actually working on another function to parse what each subject line is, the whitespace is messing up the parsing.......after i get the whitespace problem fixed, i do have a question that will relate to those other lines......

addicktz

ASKER

sorry...i was lost for a second, i forgot about your code up top and that you incorporated from the other question.....ok, so i decided for this part, what I need is each unique subject line. So I guess treat them as (1/1) if that makes sense.
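
A rough sketch of what "each unique subject line" could look like in code, assuming every line starts with the article ID and that two subjects count as the same when they match with all spaces removed. CollectUnique is a made-up name, and keeping the shortest variant follows the "least padded wins" idea from earlier in the thread.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module UniqueSubjects
    ' Returns one entry per distinct subject, in first-seen order,
    ' keeping the least padded spelling of each.
    Function CollectUnique(ByVal lines As IEnumerable(Of String)) As List(Of String)
        Dim indexByKey As New Dictionary(Of String, Integer)
        Dim result As New List(Of String)
        For Each line As String In lines
            ' Drop the leading article ID, then squeeze runs of white space.
            Dim subject As String = Regex.Replace(line, "^\s*\d+\s+", "")
            subject = Regex.Replace(subject, "\s+", " ").Trim()
            If subject.Length = 0 Then Continue For
            Dim key As String = subject.Replace(" ", "")
            Dim index As Integer
            If indexByKey.TryGetValue(key, index) Then
                If subject.Length < result(index).Length Then result(index) = subject
            Else
                indexByKey.Add(key, result.Count)
                result.Add(subject)
            End If
        Next
        Return result
    End Function

    Sub Main()
        Dim lines As String() = { _
            "1444757 Attn:   Zip py   Beyonce 2003", _
            "1444758 Attn: Zippy Beyonce 2003", _
            "1444759 Some   bulletin text"}
        ' Prints "Attn: Zippy Beyonce 2003" and then "Some bulletin text"
        For Each s As String In CollectUnique(lines)
            Console.WriteLine(s)
        Next
    End Sub
End Module
```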