Ruby - Extract Year from document text

Antworne L. Spann
Hello, I am trying to update a Ruby script that reads HTML files downloaded from the IRS site and finds a specific line of text in each file to extract the year from the document. The script works fine; however, there are now 2 types of documents that can be in the folder, and the line in question is different in each. How can I add an "or" or "if" clause so the script looks for the text formatted either way and pulls out the year based on how the line reads? The line in the current document type is displayed as "TAX PERIOD:    DEC. 31, 2014". The line in the 2nd document type, which I need to add, is displayed as "Tax Period or Periods:  December, 2014". I need the year "2014" extracted from each document. I have attached screenshots of the script in color and the 2 document types.

 
#Open and read the downloaded file
	transcript_html_name = files_in_dir[i]
	File.open(Dir.pwd + "/Transcripts/#{transcript_html_name}", "r") do |f|
		f.each_line do |line|
			 #Search for the Tax Period date to get the year
			if line.include? "TAX PERIOD:" 
				line.gsub!(/(<[^>]*>)|\n|\t/s) {""}
				line.slice!("TAX PERIOD:")
				line.gsub!(/\w\w\w[.]\s\d\d[,]\s/) {""}
				year = line
			end


Attachments: script.jpg, document-sample-1.jpg, document-sample-2.jpg
Top Expert 2016
Commented:
you may try

if line.include? "TAX PERIOD:" 
	line.gsub!(/(<[^>]*>)|\n|\t/s) {""}
	line.slice!("TAX PERIOD:")
	line.gsub!(/\w\w\w[.]\s\d\d[,]\s/) {""}
	year = line
elsif line.include? "Tax Period or Periods:" 
	line.gsub!(/(<[^>]*>)|\n|\t/s) {""}
	line.slice!("Tax Period or Periods:")
	line.gsub!(//\w{3,}[,]/) {""}
	year = line
end



Sara
Antworne L. Spann, Head of IT, Systems

Author

Commented:
Thank you Sara, that did work going through the folder, but I am still getting an error message on the "Tax Period or Periods:" HTML files. They generate the attached error. Thanks for your assistance with this; I did not write the original script and this is my first time using Ruby.
Attachment: Error_Ruby.jpg
Top Expert 2016

Commented:
line.gsub!(//\w{3,}[,]/) {""}

probably there should be only one slash.

try to change it to

line.gsub!(/\w{3,}[,]/) {""}



if that doesn't help, you may manually edit a file containing "Tax Period or Periods: ...." so that it has the alternate text "TAX PERIOD:    DEC. 31, 2014" instead.

if the file works with that change, then there is still a problem with the regular expressions used. unfortunately i don't have a test environment for ruby.
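
One way to sidestep the per-format substitutions is to pull the four-digit year out directly with a single pattern that accepts both labels. The following is a minimal sketch, not the original script's method: `extract_tax_year` and `TAX_PERIOD_PATTERN` are hypothetical names, and it assumes the year is the first standalone four-digit number after the label, as in the two sample lines quoted in the question. It has not been run against the real IRS files.

```ruby
# Hypothetical pattern (not from the original script): matches either
# "TAX PERIOD:" or "Tax Period or Periods:" and captures the first
# standalone four-digit number that follows the label.
TAX_PERIOD_PATTERN = /tax period(?: or periods)?:.*?\b(\d{4})\b/i

# Hypothetical helper: returns the year as a String, or nil if the
# line is not a tax period line.
def extract_tax_year(line)
  # Remove HTML tags (same idea as the original gsub!) and trim whitespace.
  text = line.gsub(/<[^>]*>/, "").strip
  match = TAX_PERIOD_PATTERN.match(text)
  match ? match[1] : nil
end

puts extract_tax_year("TAX PERIOD:    DEC. 31, 2014\n")
puts extract_tax_year("<td>Tax Period or Periods:  December, 2014</td>\n")
```

Inside the existing loop this could replace the whole if/elsif chain with something like `found = extract_tax_year(line)` followed by `year = found if found`, so lines without a tax period label leave `year` untouched.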

Sara
Antworne L. Spann, Head of IT, Systems

Author

Commented:
Thanks again, I ended up hard coding the year for the 2nd file type to 0000.  This works out better so those files can be easily identified.

			# Search for the Tax Period date to get the year
			if line.include? "TAX PERIOD:" 
				line.gsub!(/(<[^>]*>)|\n|\t/s) {""}
				line.slice!("TAX PERIOD:")
				line.gsub!(/\w\w\w[.]\s\d\d[,]\s/) {""}
				year = line
			elsif line.include? "Tax Period or Periods:" 
				year = "0000"
			end


Antworne L. Spann, Head of IT, Systems

Author

Commented:
I've requested that this question be closed as follows:

Accepted answer: 500 points for sarabande's comment #a41488647
Assisted answer: 0 points for sla310's comment #a41488960

for the following reason:

Instead of trying to replace the text, the solution we used was to hard-code the year output for the second file type.
Top Expert 2016

Commented:
thanks.

if the year is at the end of the string you may do

...
elsif line.include? "Tax Period or Periods:" 
      year = line.strip[-4, 4]   # last four characters, ignoring the trailing newline
end



or

...
elsif line.include? "Tax Period or Periods:" 
      year = line.partition(',').last.strip   # text after the first comma, trimmed
end



or

...
elsif line.include? "Tax Period or Periods:" 
      year = line[line.index(',') + 1..-1].strip   # from just after the comma to the end, trimmed
end
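
The three variants above can be compared side by side on the sample line from the question. This is only a sketch: the method names are made up for illustration, and it assumes the line has already had its HTML tags removed and that the year is the last thing on the line. Because `each_line` keeps the trailing newline, each variant trims whitespace before or after slicing.

```ruby
# Hypothetical helpers, one per variant; none of these names exist in the original script.

# Variant 1: take the last four characters of the trimmed line.
def year_from_end(line)
  line.strip[-4, 4]
end

# Variant 2: take everything after the first comma, then trim.
def year_from_partition(line)
  line.partition(",").last.strip
end

# Variant 3: slice from just after the first comma to the end, then trim.
# Raises NoMethodError if the line contains no comma (index returns nil).
def year_from_index(line)
  line[line.index(",") + 1..-1].strip
end

# Sample line as quoted in the question, with the newline each_line would keep.
sample = "Tax Period or Periods:  December, 2014\n"

puts year_from_end(sample)
puts year_from_partition(sample)
puts year_from_index(sample)
```

All three depend on the exact layout of the line, so a check such as `year =~ /\A\d{4}\z/` before using the result may be worthwhile.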



good luck.

Sara
