remove rows where duplicate

Hi
I have a large text file and need to remove any duplicate rows - eg
example text (original)
select *
from table1
table 1
table 1
select * from ref where
table1
etc etc
...what I want returned is ....
select *
from table1
table 1
select * from ref where
table1

Regards
PHIL SawyerAsked:
Who is Participating?

[Product update] Infrastructure Analysis Tool is now available with Business Accounts.Learn More

x
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Gertone (Geert Bormans)Information ArchitectCommented:
Hi Philip,

best way to remove duplicate rows is by adding the rows to a hash
in this particular case you need to clear teh hash between packages

integrated in your code

file_out1 = File.open('regex1_output1b.txt','w')
re1 = /create|replace package.*/i
re2 = /.*?from\s+(\w+).*/i
re3 = /.*?\sjoin\s(\w+?)(\s\w+?)?\son.*/i
h = {}
File.foreach("CRYSTAL_PERMAL_SCHEMA.txt"){ |line|
  case line
    when re1
    then
      line = line.gsub(/\s+/,' ').upcase
      h.each_key do
        |k| file_out1.puts k
      end
      puts "#{line}"
      h = {}


      file_out1.puts( "#{line}")
    when re2
    then
      line = line.gsub(re2, '\1').downcase
      puts "#{line}"
      h[line] = 1 unless h.has_key?(line)
    when re3
    then
      line = line.gsub(re3, '\1').downcase
      puts "#{line}"
      h[line] = 1 unless h.has_key?(line)
  end
}
file_out1.close

Open in new window

0
PHIL SawyerAuthor Commented:
Hey Geert
It seems to be removing more rows than expected.  Maybe if I clarify a few points.
First of all I'm trying to list all TABLES used within each Package and discard everything else.
The txt file was downloaded from Oracle/Toad - listing all Packages and sql contained within.
There could be dozens of different queries/joins within each Package. I'm assuming anything between "join" and "on" will be a table and not forgetting any word after "from" - but not for example......
select *
from (
select * from philtable)
Tables will and can be repeated many times within eachPackage as that can be the case - THIS IS WHERE MY PROBLEM IS AS THE FINAL OUTPUT ONLY NEEDS TO SHOW THE ONE TALE NAME WITHIN EACH PACKAGE.
Plus - all lines in the txt file do not necessarily start at the beginning of the line eg
Create Package Phil as
  Select * from
       mytable
           where id in (
                                  select id
         from  myothertable where etc etc
Hope this makes sense.
Regards
Phil
0
Gertone (Geert Bormans)Information ArchitectCommented:
Hi Phil,

There are basically two issues in this question
1. write the regular expression to find table and package names
2. only list each table name once per package

The heart of this question is 2. and that should be covered by using a hash and re-initialising the hash with each package, as I did in the suggested solution... at least if I understood the requirement well. It would be nice if you could confirm that part working

The issue with 1. is that your original code was working on a line base processing,
but you are processing SQL statements from TOAD, and a line-break does not mean anything in SQL (I believe). You should forget about the lines and parse on keywords.
SQL is a very predictable and parsable format, so it should not be too hard to write a parser for it

Let me think about a different way to get about it
0
OWASP: Forgery and Phishing

Learn the techniques to avoid forgery and phishing attacks and the types of attacks an application or network may face.

PHIL SawyerAuthor Commented:
Thanks Geert - welcome back
0
Gertone (Geert Bormans)Information ArchitectCommented:
Hey Phil, canb you send me a decent sized source file I can use for testing? Thanks
0
PHIL SawyerAuthor Commented:
Geert - I can send one later today if ok.
0
Gertone (Geert Bormans)Information ArchitectCommented:
That's OK, working on other things anyway
0
Gertone (Geert Bormans)Information ArchitectCommented:
Hi Phil,

I think the best way to go about this is to still take it line by line,
but keep status when the table is to be expected on the next line,

something along these terms

file_out1 = File.open('regex1_output1b.txt','w')
re0 = /^\s*([^\s]+)\s*.*$/i
re1 = /^\s*CREATE\s+OR\s+REPLACE\s+(PACKAGE)\s+([^\s]+).*$/i
re2a = /^.*?from\s+\(.*$/i
re2b = /^.*?from\s+([^\s]+).*$/i
re2c = /^.*?from\s+$/i
re3a = /^.*?join\s+\(.*$/i
re3b = /^.*?join\s+([^\s]+).*$/i
re3c = /^.*?join\s+$/i
from_bool = false
h = {}
File.foreach("CRYSTAL_PERMAL_SCHEMA.txt"){ |line|
  case line
    when re0 && from_bool == true
    then
      line = line.sub(re0, '\1').downcase
      puts "#{line}"
      h[line] = 1 unless h.has_key?(line)
      from_bool = false
    when re1
    then
      line = line.sub(re1, '\1 \2').upcase
      h.each_key do
        |k| file_out1.puts k
      end
      puts "#{line}"
      h = {}
      file_out1.puts( "#{line}")
    when re2a
    then

    when re2b
    then
      line = line.sub(re2b, '\1').downcase
      puts "#{line}"
      h[line] = 1 unless h.has_key?(line)
    when re2c
    then
      from_bool = true
    when re3a
    then

    when re3b
    then
      line = line.sub(re3b, '\1').downcase
      puts "#{line}"
      h[line] = 1 unless h.has_key?(line)
    when re3c
    then
      from_bool = true
   end
}
file_out1.close

I know there are some unexpected table names and I still have duplicates,
but I think it is a matter of refining the regexes

If things get too complex, you might consider building a parser using RACC
https://rubygems.org/gems/racc
There is a first attempt to do so
https://rubygems.org/gems/sql-parser/versions/0.0.2
and you can base yourself on the grammars found here
http://www.savage.net.au/SQL/

Cheers

Geert
0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
PHIL SawyerAuthor Commented:
Thanks Geert - excellent answer
0
Gertone (Geert Bormans)Information ArchitectCommented:
welcome
0
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Ruby

From novice to tech pro — start learning today.