707Tech asked:

Ruby - Extract Year from document text

Hello, I am trying to update a Ruby script that reads HTML files downloaded from the IRS site and finds a specific line in each file to extract the year from the document. The script works fine, but there are now two types of documents that can be in the folder, and the line in question is different in each. How can I add an "or" or "if" clause so the script recognizes either format and pulls out the year accordingly? The line the script currently handles reads "TAX PERIOD:    DEC. 31, 2014". The line in the second document type reads "Tax Period or Periods:  December, 2014". I need the year "2014" extracted from each document. I have attached screenshots of the script (in color) and of the two document types.

 
#Open and read the downloaded file
	transcript_html_name = files_in_dir[i]
	File.open(Dir.pwd + "/Transcripts/#{transcript_html_name}", "r") do |f|
		f.each_line do |line|
			 #Search for the Tax Period date to get the year
			if line.include? "TAX PERIOD:" 
				line.gsub!(/(<[^>]*>)|\n|\t/s) {""}
				line.slice!("TAX PERIOD:")
				line.gsub!(/\w\w\w[.]\s\d\d[,]\s/) {""}
				year = line
			end


script.jpg
document-sample-1.jpg
document-sample-2.jpg
ASKER CERTIFIED SOLUTION
sarabande
707Tech

ASKER

Thank you Sara, that worked for going through the folder, but I am still getting an error message on the "Tax Period or Periods:" HTML files; they generate the attached error. Thanks for your assistance with this. I did not write the original script, and this is my first time using Ruby.
Error_Ruby.jpg
sarabande

line.gsub!(//\w{3,}[,]/) {""}

there is probably one slash too many at the start of the pattern.

try changing it to

line.gsub!(/\w{3,}[,]/) {""}
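
A minimal sketch of what the single-slash pattern does to the second line format, assuming the HTML tags have already been stripped and the line reads exactly as quoted in the question. The `sample` string and the extra `slice!` and `strip` steps are assumptions for illustration, not part of the original script.

```ruby
# Hypothetical sample line, as quoted in the question (tags already removed).
sample = "Tax Period or Periods:  December, 2014\n"

line = sample.dup
line.slice!("Tax Period or Periods:")   # drop the label
line.gsub!(/\w{3,}[,]/) { "" }          # drop the month name and its comma
year = line.strip                       # trim leftover spaces and the newline

puts year
```

Without the final `strip`, `year` would still carry the leading spaces and the trailing newline left over from the original line.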


if that doesn't help, you could manually edit one of the "Tax Period or Periods: ...." files so that it contains the alternate text "TAX PERIOD:    DEC. 31, 2014" instead.

if the script works with that file, then the problem is still in the regular expressions. unfortunately i don't have a test environment for ruby.

Sara
707Tech

ASKER

Thanks again. I ended up hard-coding the year for the second file type to 0000. This works out better because those files can then be easily identified.

# Search for the Tax Period date to get the year
			if line.include? "TAX PERIOD:" 
				line.gsub!(/(<[^>]*>)|\n|\t/s) {""}
				line.slice!("TAX PERIOD:")
				line.gsub!(/\w\w\w[.]\s\d\d[,]\s/) {""}
				year = line
			elsif line.include? "Tax Period or Periods:" 
				year = "0000"
			end


707Tech

ASKER

I've requested that this question be closed as follows:

Accepted answer: 500 points for sarabande's comment #a41488647
Assisted answer: 0 points for sla310's comment #a41488960

for the following reason:

Instead of trying to replace the text, we hard-coded the year that the script outputs for the second file type.
sarabande

thanks.

if the year is at end of string you may do

...
elsif line.include? "Tax Period or Periods:" 
      year = line.strip[-4, 4]   # last four characters, ignoring the trailing newline
end


or

...
elsif line.include? "Tax Period or Periods:" 
      year = line.partition(',').last.strip
end


or

...
elsif line.include? "Tax Period or Periods:" 
      year = line[line.index(',') + 2, 4]   # assumes exactly one space after the comma
end
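
For comparison, here is a sketch of a single regular expression step that pulls the four-digit year from either line format, so both branches can share one extraction. The helper name `extract_year` and the sample strings are assumptions for illustration; it presumes the year is the first four-digit number after the label.

```ruby
# Hypothetical helper: returns the four-digit year from either line format,
# or nil when the line is not a tax-period line.
def extract_year(line)
  text = line.gsub(/<[^>]*>/, "")                      # strip HTML tags
  m = text.match(/tax period(?: or periods)?:/i)       # either label, any case
  return nil unless m
  m.post_match[/\b\d{4}\b/]                            # first 4-digit number after the label
end

puts extract_year("TAX PERIOD:    DEC. 31, 2014\n")
puts extract_year("<td>Tax Period or Periods:  December, 2014</td>\n")
```

Because the day ("31") has only two digits, it is skipped, and the match does not depend on how the month is written.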


good luck.

Sara