addicktz
asked on
Unnecessary white space being written to file....
I am using the following routine to save to a file, but I am receiving lots of whitespace. Any ideas on how to stop it?
Public Sub Log(ByVal logMessage As String, ByVal w As TextWriter)
    w.Write(logMessage & vbCrLf)
    w.Flush()
End Sub
Sample Output file.
http://www.3nter.net/sample.txt
First example at 913171
Second at 913295
Third at 913349
I can't see a difference around the examples you mentioned... can you give a few more details about what seems to be happening?
Sorry, I've seen what you meant now... ignore my last message.
Does this make any difference...
Public Sub Log(ByVal logMessage As String, ByRef w As TextWriter)
    w.WriteLine(logMessage)
    w.Flush()
End Sub
========================
Bit of a dumb question, I know:
I've rarely used the Flush method; does removing it make any difference? I'm also assuming, since you only posted that bit of code, that the value of logMessage doesn't contain any added whitespace when it is passed to Log()...
Are those lines all the same... except for the number at the beginning?
ASKER
yes, they are all the same...
ASKER
that sounds about right
ok... my idea for this one is to remove all spaces from the string, then split the string up into chunks of 49 chars (49 is the length of what one line should be without spaces)
for each chunk (this is in case you pass more than one line somehow)
    if it's a full chunk... insert the spaces back into the right places and write it to the textwriter
    remove chunk from list
next
any chunks left over (because they weren't complete)... return them to the calling sub so that they can be completed (by adding on more of the incoming data, wherever it is coming from)
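The chunking step above could be sketched roughly as follows. This is only an illustration: the module and function names are made up, the chunk size is passed in rather than fixed at 49, and putting the spaces back "into the right places" is left out because the thread never pins down where they belong.

```vbnet
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module ChunkSketch
    ' Strips all whitespace from "data" and cuts the result into pieces of
    ' chunkSize characters. Full pieces are returned; an incomplete tail is
    ' handed back in "leftover" so the caller can prepend it to the next
    ' batch of incoming data.
    Public Function SplitIntoChunks(ByVal data As String, ByVal chunkSize As Integer, ByRef leftover As String) As List(Of String)
        Dim stripped As String = Regex.Replace(data, "\s", "")
        Dim chunks As New List(Of String)
        Dim pos As Integer = 0
        Do While pos + chunkSize <= stripped.Length
            chunks.Add(stripped.Substring(pos, chunkSize))
            pos += chunkSize
        Loop
        leftover = stripped.Substring(pos)
        Return chunks
    End Function
End Module
```

The caller would keep `leftover` between reads, which is the same idea as handing the incomplete chunk back to the calling sub.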
ok, different line of approach... is the format of the lines something like
ID User (Protocol)
ID -> Number
User -> Somename@SomeOthername
Protocol -> Some other text
Obviously the names might be wrong, but I'm more curious about the layout...
It's easy enough to read up to the end of the last digit...
then the next non-whitespace char is the beginning of User... and the end of User is where the next whitespace char is found.
Protocol (or whatever it is) is enclosed in brackets... and assuming that Protocol can't contain a bracket character, it should be easy to work out what the line should be... then reformat it and write that to the text writer.
If you have a reference to what these lines should look like... that'll help
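A rough sketch of that parse, assuming the line really is a number, one whitespace-free token, and a bracketed tail that contains no brackets itself (the thread has not confirmed this layout yet; all names here are invented for illustration):

```vbnet
Imports System.Text.RegularExpressions

Module LineLayoutSketch
    ' Assumed layout: number, whitespace, one token without whitespace,
    ' whitespace, then text in brackets that contains no bracket characters.
    Private ReadOnly LinePattern As New Regex("^\s*(?<id>\d+)\s+(?<user>\S+)\s+\((?<protocol>[^()]*)\)\s*$")

    ' Returns the line rebuilt with single spaces between the fields.
    ' Lines that do not fit the assumed layout are returned unchanged.
    Public Function NormalizeLine(ByVal line As String) As String
        Dim m As Match = LinePattern.Match(line)
        If Not m.Success Then Return line
        Dim protocol As String = Regex.Replace(m.Groups("protocol").Value.Trim(), "\s+", " ")
        Return m.Groups("id").Value & " " & m.Groups("user").Value & " (" & protocol & ")"
    End Function
End Module
```

Returning unmatched lines untouched is a deliberate choice: if the layout guess is wrong, nothing is lost.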
assuming the above format is correct... if the ")" character isn't permitted in "User" or "Protocol", that would help solve this I think, but until you link me to a page with the format of these lines, I can't be sure
ASKER
ok, well, it basically goes like this
Article ID -> number
Subject -> text
and the text can have ")" in it. http://www.3nter.net/sample.txt is a reference to what they should look like; if you would like more samples please let me know.
ok, so it's:
ArticleID(whitespace)Subject(newline)
Well, you can read up until you get a newline, and then parse and trim some of the junk out of it... but I think with the definition of Subject being so loose, it might be hard to know which spaces are valid and which are not.
913479 Power-Poster@power-post.org (Power-Post 2000)
      ^                           ^           ^
In the above line we've got 3 spaces... we know that the first one is valid as it separates the ArticleID and Subject... but as for the next two, it gets a bit complicated. We could replace long runs of white space with a single space. One alternative, if the subjects change rarely, is to keep a list of common subjects (each stored with all white space removed, paired with what it should look like)... then when parsing, you remove all white space from the subject of that line and look in your list for a matching (cleaned out) subject... if one is found, you replace the subject with the proper version.
Any other actual definitions like you gave for the other post would be great... but if it is ALWAYS "Power-Poster@power-post.org (Power-Post 2000)" and can't be anything else, or changes so rarely that it's not really much of a problem, then I think we can code something... though I prefer to allow for a sudden change.
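Both ideas in that post (collapsing runs of white space, and a lookup table of known subjects keyed by their whitespace-free form) might look something like this. It is a sketch only; the module, the table, and `RegisterSubject` are hypothetical names, and the table would have to be filled from whatever subjects you actually trust.

```vbnet
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module SubjectCleanupSketch
    ' Hypothetical table: whitespace-free subject -> properly spaced subject.
    Private ReadOnly KnownSubjects As New Dictionary(Of String, String)

    ' Adds a correctly spaced subject to the table.
    Public Sub RegisterSubject(ByVal properSubject As String)
        KnownSubjects(StripWhitespace(properSubject)) = properSubject
    End Sub

    ' Removes every whitespace character.
    Public Function StripWhitespace(ByVal value As String) As String
        Return Regex.Replace(value, "\s", "")
    End Function

    ' Trims the ends and turns each run of whitespace into one space.
    Public Function CollapseWhitespace(ByVal value As String) As String
        Return Regex.Replace(value.Trim(), "\s+", " ")
    End Function

    ' Uses the known spelling when there is one, otherwise falls back to
    ' collapsing runs of whitespace.
    Public Function CleanSubject(ByVal subject As String) As String
        Dim proper As String = Nothing
        If KnownSubjects.TryGetValue(StripWhitespace(subject), proper) Then Return proper
        Return CollapseWhitespace(subject)
    End Function
End Module
```

The lookup also repairs a space that landed in the middle of a word, which plain collapsing cannot do; the fallback only covers subjects that are not in the table.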
ASKER
ok, I am very sorry. For some reason I was under the impression that sample.txt was in fact sample2.txt, and I was wondering where you were coming up with User and all that =] Please accept my apology. The sample that we should be looking at is
http://www.3nter.net/sample2.txt
I am pretty sure that subject line can contain any character...
ASKER
well, to be honest,
I need the whitespace taken out of both sample.txt and sample2.txt
ASKER
and well, if it's easier to deal with the space when reading from the file, we can go that route as well
that sample2 link didn't work for me... although my connection is being worked to the max at the moment
ASKER
a perfect example in sample2.txt is 1444757
try the link again.....
ASKER
1444757 Attn: Zippy Beyonce 2003 Dangerously In Love [05/15] -05 - Be with you.mp3 (07/19)
ok... I think the best way of going about this is: as you read in each entry (terminated with a newline char or whatever), remove ALL white space... then see if there are any previous entries (with spaces removed) which match the same pattern as the current one... if the patterns match, then this entry is part of that series.
Now, if we have matched it to a previous series, we check which of the two (the current entry or the previous entries) contains the least white space... the one with the least I assume to be closer to what the line SHOULD be... then format all entries of that group according to the "better" format. This takes into account the first entry of a group being the "padded" entry and the next one being the proper, or at least less padded, entry.
If the entry doesn't match, then we have a new group.
e.g.
Line A -> 1447710 ATTN: /3if // - Faith No More - We Care A Lot.mp3 (07/16)
Line A' -> 1447710ATTN:/3if//-FaithNoMore-WeCareALot.mp3(#A/#B)
no matching group for A'
create group
...
Line X -> 1447710 ATTN: /3if // - Faith No More - We Care A Lot.mp3 (15/16)
Line X' -> 1447710ATTN:/3if//-FaithNoMore-WeCareALot.mp3(#A/#B)
matches A' group
Line X is less padded than Line A... Line X therefore overrides as better formatting... reformat all lines in group according to Line X
Line A -> 1447710 ATTN: /3if // - Faith No More - We Care A Lot.mp3 (07/16)
'....
Line X -> 1447710 ATTN: /3if // - Faith No More - We Care A Lot.mp3 (15/16)
==========================
Now... that's the kind of logic I think would work, because you can never really tell which white space is unnecessary, since there's no strict format to the lines... you can only assume that white space is never removed from a line, and therefore the shortest (full) line has the best format.
Now, as for implementing this, I'm a little pushed for time until Wednesday... I have a final year project due in on Wednesday, which was set back in Sept, and I've yet to really start... so it might be a while before I can think about coding this. Feel free to code something yourself or get another expert in on it... but at least now you might have something to work with.
If not, you'll just have to be patient till I'm back!
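To make the A / A' example concrete, here is one way the group key and the "less padded wins" rule could be written. It is a sketch under assumptions: the part field is taken to be a trailing "(xx/yy)", the article ID is left in the key as in the example lines, and `GroupKey` and `LessPadded` are invented names.

```vbnet
Imports System.Text.RegularExpressions

Module GroupKeySketch
    ' Assumes the part field is a "(xx/yy)" at the very end of the
    ' whitespace-free line.
    Private ReadOnly PartField As New Regex("\((?<partno>\d+)/(?<parts>\d+)\)$")

    ' Builds the comparison pattern: all whitespace removed and the part
    ' field replaced by the fixed placeholder "(#A/#B)".
    Public Function GroupKey(ByVal line As String) As String
        Dim stripped As String = Regex.Replace(line, "\s", "")
        Return PartField.Replace(stripped, "(#A/#B)")
    End Function

    ' Returns whichever of two lines is shorter, on the assumption that
    ' padding is only ever added, never removed. Ties keep the first line.
    Public Function LessPadded(ByVal a As String, ByVal b As String) As String
        If b.Length < a.Length Then Return b
        Return a
    End Function
End Module
```

Two entries belong to the same group when `GroupKey` returns the same string for both; `LessPadded` then picks the one whose spacing should be used for the whole group.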
ASKER
I understand up to the reformatting of the lines; I'm not sure how to code that....
Hi... as mentioned, I have coursework due on Wednesday... so if you post up a comment on here on Wednesday evening, to remind me.... I'll work on some code.
I don't mind if you bring this to the attention of another expert, and award them the points.
Good luck
ASKER
remind
ok, I knocked this together (it's 3am)... and I'm away for the weekend... try working with this, but post if you have problems getting it working or understanding it and I'll help when I get back:
==========================
' Needs at the top of the file:
'   Imports System.Collections
'   Imports System.Text.RegularExpressions

Dim _reading As Boolean = False
Dim _oldbuffer As String = ""
Dim _newLineChar As String = vbCrLf
Dim regPartGrabber As Regex
Dim allData As New Hashtable

' Builds the regex that picks out the "part x of y" field, e.g. (07/16)
Sub SetupRegex()
    Dim sSep() As String = {"of", "de", "/"}
    Dim sSepJoined As String = String.Join("|", sSep)
    sSepJoined = "(?<joiner>(" & sSepJoined & "))"
    Dim sRegex As String = "\D(?<partno>\d+)\s*" & sSepJoined & "\s*(?<parts>\d+)\D"
    regPartGrabber = New Regex(sRegex)
End Sub

Sub ReadingIn()
    If _reading Then Exit Sub
    _reading = True
    Try
        Dim sBuffer As String = ""
        Dim sLine As String
        'sBuffer = Data.Read
        ' Prepend whatever incomplete line was left over from the last call
        sBuffer = _oldbuffer & sBuffer
        Dim iNewline As Integer
        iNewline = sBuffer.IndexOf(_newLineChar)
        Do While iNewline >= 0
            sLine = sBuffer.Substring(0, iNewline)
            If iNewline + _newLineChar.Length >= sBuffer.Length Then
                sBuffer = ""
                iNewline = -1
            Else
                sBuffer = sBuffer.Substring(iNewline + _newLineChar.Length)
                iNewline = sBuffer.IndexOf(_newLineChar)
            End If
            ' Strip ALL whitespace so padded and unpadded copies compare equal
            Dim iLineTrim As String = Regex.Replace(sLine, "\s", "")
            Dim matchNumber As Match = regPartGrabber.Match(iLineTrim)
            If matchNumber.Success Then
                ' Swap the part field for a placeholder to get the group key.
                ' Note: the match includes the one non-digit char on each side.
                Dim iLineFormat As String = iLineTrim.Replace(matchNumber.Value, "*PARTDATA_" & matchNumber.Groups("joiner").Value & "*")
                Dim thisData As GroupData
                If allData.ContainsKey(iLineFormat) Then
                    thisData = CType(allData(iLineFormat), GroupData)
                    ' Fixed: compare lengths, not a length against a string
                    If thisData.LineFormat.Length > iLineFormat.Length Then
                        thisData.LineFormat = iLineFormat
                    End If
                    ' Fixed: store the part number as an Integer, as the constructor does
                    thisData.Parts.Add(Integer.Parse(matchNumber.Groups("partno").Value))
                    allData(iLineFormat) = thisData
                Else
                    thisData = New GroupData(iLineFormat, Integer.Parse(matchNumber.Groups("partno").Value), Integer.Parse(matchNumber.Groups("parts").Value), matchNumber.Groups("joiner").Value)
                    allData.Add(iLineFormat, thisData)
                End If
            End If
        Loop
        _oldbuffer = sBuffer
    Catch ex As Exception
        ' Errors are swallowed here; log ex if you need to see them
    Finally
        _reading = False
    End Try
End Sub

Class GroupData
    Public LineFormat As String
    Public Parts As New ArrayList
    Public NumberOfParts As Integer
    Public Joiner As String

    Public Sub New(ByVal sLineFormat As String, ByVal iPartNo As Integer, ByVal iParts As Integer, ByVal sJoiner As String)
        LineFormat = sLineFormat
        Parts.Add(iPartNo)
        NumberOfParts = iParts
        Joiner = sJoiner
    End Sub

    ' Rebuilds the line for the part stored at iIndex ("" if out of range)
    Public Function BuildLine(ByVal iIndex As Integer) As String
        If iIndex >= 0 AndAlso iIndex <= Parts.Count - 1 Then
            Dim partData As String = Parts(iIndex).ToString() & " " & Joiner & " " & NumberOfParts
            Return LineFormat.Replace("*PARTDATA_" & Joiner & "*", partData)
        Else
            Return ""
        End If
    End Function
End Class
Ok, going to be a slight problem with that, will try and post up a correction (i forgot to ignore the ID at the beginning of each line)
I've gotta rush away for the weekend, I have it sort of working now, but not sufficient to post... have noticed a few problems:
1447139 []-[ #altbin@EFNet ]-[ Leaves - Breathe (192k VBR) ]-[08/26] - "01-leaves-i_go_down-prv.mp3" yEnc (05/30)
1447140 []-[ #altbin@EFNet ]-[ Leaves - Breathe (192k VBR) ]-[08/26] - "01-leaves-i_go_down-prv.mp3" yEnc (10/30)
1447141 []-[ #altbin@EFNet ]-[ Leaves - Breathe (192k VBR) ]-[08/26] - "01-leaves-i_go_down-prv.mp3" yEnc (06/30)
1447142 []-[ #altbin@EFNet ]-[ Leaves - Breathe (192k VBR) ]-[08/26] - "01-leaves-i_go_down-prv.mp3" yEnc (03/30)
If you notice, those lines contain other number values which would confuse the parser... it might not know which one is the (part no / number of parts) field. I'll have to have another think about it while I'm away (should be back on Sunday/Monday)
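One possible way around that ambiguity, assuming the real part field is always the last "(xx/yy)" in round brackets on the line (the sample lines above fit this, but it is a guess, not something the thread confirms), is to search from the right. `TryGetPartInfo` is a made-up name:

```vbnet
Imports System.Text.RegularExpressions

Module PartFieldSketch
    ' RegexOptions.RightToLeft makes Match() return the match nearest the end
    ' of the line, so "[08/26]" and "(192k VBR)" earlier on are never picked.
    Private ReadOnly LastPartField As New Regex("\(\s*(?<partno>\d+)\s*/\s*(?<parts>\d+)\s*\)", RegexOptions.RightToLeft)

    ' Returns True and fills partNo/parts when the line has a "(xx/yy)" field.
    Public Function TryGetPartInfo(ByVal line As String, ByRef partNo As Integer, ByRef parts As Integer) As Boolean
        Dim m As Match = LastPartField.Match(line)
        If Not m.Success Then Return False
        partNo = Integer.Parse(m.Groups("partno").Value)
        parts = Integer.Parse(m.Groups("parts").Value)
        Return True
    End Function
End Module
```

Requiring round brackets is what keeps the "[08/26]" field out; if some posters use a different bracket style for the part field, the pattern would need widening.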
In the sample2.txt file, it says there are only 200 or so groups... but when I group all the entries where the unique part of a group entry is of the form (xx/yy), I'm getting a lot more. Should I be ignoring certain entries like m3u, nfo and sfv files?
ASKER
m3u is a playlist......all the files in the group should be kept together
Ok, what I'll do is sort the list manually and see what data to expect. I also noticed some other lines within the data that weren't file descriptions; they seemed to be some sort of bulletins/alerts which don't have the (xx/yy) values in them. How should these be handled?
ASKER
well, I'm actually working on another function to parse what each subject line is, and the whitespace is messing up the parsing... after I get the whitespace problem fixed, I do have a question that will relate to those other lines...
ASKER
sorry... I was lost for a second, I forgot about your code up top and what you incorporated from the other question... ok, so I decided that for this part, what I need is each unique subject line. So I guess treat them as (1/1), if that makes sense.
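If "each unique subject line" simply means one copy of every distinct subject regardless of padding, a sketch could look like the one below. This reading is an assumption, as are the leading-number rule for the article ID and the name `UniqueSubjects`.

```vbnet
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module UniqueSubjectsSketch
    ' Returns each distinct subject once, in first-seen order. Two subjects
    ' count as the same when they match after all whitespace is removed.
    Public Function UniqueSubjects(ByVal lines As IEnumerable(Of String)) As List(Of String)
        Dim seen As New HashSet(Of String)
        Dim result As New List(Of String)
        For Each line As String In lines
            ' Drop the leading article ID, which differs on every line.
            Dim subject As String = Regex.Replace(line, "^\s*\d+\s+", "")
            ' Tidy the spacing of the copy that will be returned.
            subject = Regex.Replace(subject.Trim(), "\s+", " ")
            If subject.Length = 0 Then Continue For
            ' Compare on the whitespace-free form.
            Dim key As String = Regex.Replace(subject, "\s", "")
            If seen.Add(key) Then result.Add(subject)
        Next
        Return result
    End Function
End Module
```

Lines without an (xx/yy) field, such as the bulletins mentioned earlier, pass through like any other subject, which is one way of treating them as (1/1).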