remove rows where duplicate

Hi
I have a large text file and need to remove any duplicate rows - eg
example text (original)
select *
from table1
table 1
table 1
select * from ref where
table1
etc etc
...what I want returned is ....
select *
from table1
table 1
select * from ref where
table1

Regards
PHIL SawyerAsked:
Who is Participating?
 
Geert BormansInformation ArchitectCommented:
Hi Phil,

I think the best way to go about this is to still take it line by line,
but keep status when the table is to be expected on the next line,

something along these terms

file_out1 = File.open('regex1_output1b.txt','w')
re0 = /^\s*([^\s]+)\s*.*$/i
re1 = /^\s*CREATE\s+OR\s+REPLACE\s+(PACKAGE)\s+([^\s]+).*$/i
re2a = /^.*?from\s+\(.*$/i
re2b = /^.*?from\s+([^\s]+).*$/i
re2c = /^.*?from\s+$/i
re3a = /^.*?join\s+\(.*$/i
re3b = /^.*?join\s+([^\s]+).*$/i
re3c = /^.*?join\s+$/i
from_bool = false
h = {}
File.foreach("CRYSTAL_PERMAL_SCHEMA.txt"){ |line|
  case line
    when re0 && from_bool == true
    then
      line = line.sub(re0, '\1').downcase
      puts "#{line}"
      h[line] = 1 unless h.has_key?(line)
      from_bool = false
    when re1
    then
      line = line.sub(re1, '\1 \2').upcase
      h.each_key do
        |k| file_out1.puts k
      end
      puts "#{line}"
      h = {}
      file_out1.puts( "#{line}")
    when re2a
    then

    when re2b
    then
      line = line.sub(re2b, '\1').downcase
      puts "#{line}"
      h[line] = 1 unless h.has_key?(line)
    when re2c
    then
      from_bool = true
    when re3a
    then

    when re3b
    then
      line = line.sub(re3b, '\1').downcase
      puts "#{line}"
      h[line] = 1 unless h.has_key?(line)
    when re3c
    then
      from_bool = true
   end
}
file_out1.close

I know there are some unexpected table names and I still have duplicates,
but I think it is a matter of refining the regexes

If things get too complex, you might consider building a parser using RACC
https://rubygems.org/gems/racc
There is a first attempt to do so
https://rubygems.org/gems/sql-parser/versions/0.0.2
and you can base yourself on the grammars found here
http://www.savage.net.au/SQL/

Cheers

Geert
0
 
Geert BormansInformation ArchitectCommented:
Hi Philip,

best way to remove duplicate rows is by adding the rows to a hash
in this particular case you need to clear teh hash between packages

integrated in your code

file_out1 = File.open('regex1_output1b.txt','w')
re1 = /create|replace package.*/i
re2 = /.*?from\s+(\w+).*/i
re3 = /.*?\sjoin\s(\w+?)(\s\w+?)?\son.*/i
h = {}
File.foreach("CRYSTAL_PERMAL_SCHEMA.txt"){ |line|
  case line
    when re1
    then
      line = line.gsub(/\s+/,' ').upcase
      h.each_key do
        |k| file_out1.puts k
      end
      puts "#{line}"
      h = {}


      file_out1.puts( "#{line}")
    when re2
    then
      line = line.gsub(re2, '\1').downcase
      puts "#{line}"
      h[line] = 1 unless h.has_key?(line)
    when re3
    then
      line = line.gsub(re3, '\1').downcase
      puts "#{line}"
      h[line] = 1 unless h.has_key?(line)
  end
}
file_out1.close

Open in new window

0
 
PHIL SawyerAuthor Commented:
Hey Geert
It seems to be removing more rows than expected.  Maybe if I clarify a few points.
First of all I'm trying to list all TABLES used within each Package and discard everything else.
The txt file was downloaded from Oracle/Toad - listing all Packages and sql contained within.
There could be dozens of different queries/joins within each Package. I'm assuming anything between "join" and "on" will be a table and not forgetting any word after "from" - but not for example......
select *
from (
select * from philtable)
Tables will and can be repeated many times within eachPackage as that can be the case - THIS IS WHERE MY PROBLEM IS AS THE FINAL OUTPUT ONLY NEEDS TO SHOW THE ONE TALE NAME WITHIN EACH PACKAGE.
Plus - all lines in the txt file do not necessarily start at the beginning of the line eg
Create Package Phil as
  Select * from
       mytable
           where id in (
                                  select id
         from  myothertable where etc etc
Hope this makes sense.
Regards
Phil
0
Cloud Class® Course: MCSA MCSE Windows Server 2012

This course teaches how to install and configure Windows Server 2012 R2.  It is the first step on your path to becoming a Microsoft Certified Solutions Expert (MCSE).

 
Geert BormansInformation ArchitectCommented:
Hi Phil,

There are basically two issues in this question
1. write the regular expression to find table and package names
2. only list each table name once per package

The heart of this question is 2. and that should be covered by using a hash and re-initialising the hash with each package, as I did in the suggested solution... at least if I understood the requirement well. It would be nice if you could confirm that part working

The issue with 1. is that your original code was working on a line base processing,
but you are processing SQL statements from TOAD, and a line-break does not mean anything in SQL (I believe). You should forget about the lines and parse on keywords.
SQL is a very predictable and parsable format, so it should not be too hard to write a parser for it

Let me think about a different way to get about it
0
 
PHIL SawyerAuthor Commented:
Thanks Geert - welcome back
0
 
Geert BormansInformation ArchitectCommented:
Hey Phil, canb you send me a decent sized source file I can use for testing? Thanks
0
 
PHIL SawyerAuthor Commented:
Geert - I can send one later today if ok.
0
 
Geert BormansInformation ArchitectCommented:
That's OK, working on other things anyway
0
 
PHIL SawyerAuthor Commented:
Thanks Geert - excellent answer
0
 
Geert BormansInformation ArchitectCommented:
welcome
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.