john lambert

asked on

Python Dex.txt: how to use this regex in this Python script?

Python Dex: how do I use this regex in this Python script?
^[all-words-from-dex]blah-blah.*



I have a DEX.txt file that contains only words in my country's language.


This is the regex. I want to integrate it into the Dex Python script and use only this rule.


Dex script:

import sqlite3


def lungime_fisier(fisier):
    i = 0
    with open(fisier) as data:
        for line in data:
            i +=1
    return i


def creaza_bazadate(nume_baza_date):
    conn = sqlite3.connect(nume_baza_date)
    conn.execute('''CREATE TABLE IF NOT EXISTS dex
       (id INTEGER PRIMARY KEY AUTOINCREMENT,
       words VARCHAR(50) NOT NULL);''')
    conn.close()
    print("Created database '%s'!" % nume_baza_date)



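def introduc_dictionar(dictionar, nume_baza_date):
    # Load every line of the dictionary file into the dex table.
    conn = sqlite3.connect(nume_baza_date)
    conn.text_factory = str  # lets Python 2 bind non-ASCII byte strings
    # Count the lines of the file that was passed in, not a hard-coded name.
    lungime_dictionar = lungime_fisier(dictionar)
    cursor = conn.cursor()
    with open(dictionar) as fisier:
        for linie in fisier:
            linie = linie.rstrip("\n")
            # Parameterized query: "%r" formatting breaks on words with quotes.
            cursor.execute("INSERT INTO dex (words) VALUES (?)", (linie,))
    conn.commit()
    conn.close()
    print("Inserted %d words into database %s" % (lungime_dictionar, nume_baza_date))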
            

def db(nume_baza_date):
    lista = []
    conn = sqlite3.connect(nume_baza_date)
    cursor = conn.cursor()
    data = cursor.execute("SELECT words from dex")
    for cuvant in data:
        if cuvant[0] not in lista:
            lista.append(str(cuvant[0]))
    return lista


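def potriviri_exacte(cuvinte_mixte, nume_baza_date):
    # Return the lines of cuvinte_mixte that appear verbatim in the dex table.
    lista_potriviri_exacte = []
    conn = sqlite3.connect(nume_baza_date)
    conn.text_factory = str  # lets Python 2 bind non-ASCII byte strings
    cursor = conn.cursor()
    with open(cuvinte_mixte) as fisier:
        for linie in fisier:
            linie = linie.rstrip("\n")
            # Parameterized query: "%r" formatting breaks on words with quotes.
            cauta_cuvant = cursor.execute(
                "SELECT words FROM dex WHERE words = ? LIMIT 1", (linie,))
            for cuvant in cauta_cuvant:
                lista_potriviri_exacte.append(str(cuvant[0]))
    conn.close()
    return lista_potriviri_exacte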
    

def potriviri_derivate(cuvinte_mixte, cuvinte_romanesti, nume_baza_date):
    # Append to cuvinte_romanesti every line of cuvinte_mixte that contains a dex word.
    dex = db(nume_baza_date)
    # Open both files once instead of re-reading the mixed file for every dex word.
    with open(cuvinte_mixte) as fisier_mixt, open(cuvinte_romanesti, "a") as fisier:
        for linie in fisier_mixt:
            linie = linie.rstrip("\n")
            # Write each matching line once, even if several dex words occur in it.
            if any(cuvant in linie for cuvant in dex):
                fisier.write(linie + "\n")


#creaza_bazadate("dex.db")
#introduc_dictionar("DEX.txt", "dex.db")
#potriviri_derivate("cuvinte_mixte.txt", "romanesti.txt", "dex.db")


aikimark

While your question references dex.txt, your code references a database table.  I'm confused.
john lambert

ASKER

The Python script works, but I'm not satisfied with the result. You can modify it or create a new script. Let me explain what I want and give you some examples. This is a regex, and I want to extract all words that start with the name Adina:
^[Aa]dina.* 




I have one big mix_words.txt (100 MB) and I want to extract all the words that start with Adina, so output.txt would look like this:

adina@123
Adina2016
adina123
ADINA01
adina!234
adina12345
Adina-masina1
Adina-blah blah...
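
A note for readers: the pattern ^[Aa]dina.* accepts only "adina" and "Adina" at the start of a line, so it would miss ADINA01 from the list above. A minimal sketch, assuming Python's standard re module is used, that covers every casing with one flag (the sample list is made up for illustration):

```python
import re

# re.IGNORECASE lets one pattern cover adina, Adina and ADINA alike.
adina = re.compile(r"^adina.*", re.IGNORECASE)

# Hypothetical sample words; the first three come from the post above.
words = ["adina@123", "Adina2016", "ADINA01", "madina1", "john"]
selected = [w for w in words if adina.match(w)]
```

Here selected keeps the first three words and drops madina1 and john, because the ^ anchor only allows a match at the start of the string.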



OK, so I have a dictionary of my country's language called dex.txt (800 KB):
manual
profesor
casier
magazie
doctor
vanzator
utilizator
oaspete
gazda
etc....



I want to extract all of these words. This is just an example using a few lines:

manual1
PROFESOR
PROFESOR1
casier@123
magazie$123$
doctor
Doctor010101
vanzator2016
utilizator1234!!!
oaspete,.1!
gazda-01!



It would be great if you could do this. For example, dex.txt contains this line:

accelerat



Now, if you could output all words that match after cutting the last 2 or 3 letters (which would give: accele):

accelerat
acceleration
accelerat@1
ACCELERAT


I don't think you need a regular expression for your pattern matching.  You just need to iterate your list of acceptable word beginnings and use the startswith method.  Here are a couple of examples:
currentword.lower().startswith('Adina'.lower())


currentword.lower().startswith(wordfromlist.lower())
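
For readers who want to see the startswith idea applied to whole lists, here is a minimal sketch. The helper name starts_with_any is hypothetical, and the sample data is borrowed from the lists posted earlier in this thread.

```python
def starts_with_any(word, prefixes):
    """Return True if word starts with any of the prefixes, ignoring case."""
    lowered = word.lower()
    # Skip empty prefixes: every string "starts with" the empty string.
    return any(lowered.startswith(p.lower()) for p in prefixes if p)

# Sample data borrowed from the lists posted earlier in this thread.
wanted = ["manual", "profesor", "doctor"]
candidates = ["manual1", "PROFESOR1", "Doctor010101", "john", "milano"]
matches = [w for w in candidates if starts_with_any(w, wanted)]
```

Lower-casing both sides is what makes PROFESOR1 and Doctor010101 match their lower-case dictionary entries.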


OK, I want to use a mixed_words.txt and my language's dictionary, called Dex.txt. Sorry, but I don't know how to use this code. Can you explain, please? Can you create a script, or modify the one above?
I want lowercase and uppercase too.

currentword.lower().startswith(wordfromlist.lower())
We are still talking in general terms.  I do not have any way to test any code because you haven't posted any files.
The original dex.txt and mix_words.txt? You can use the small lists above just to test.
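# Raw strings keep the backslashes in Windows paths from being read as escapes.
wanted_words = open(r'c:\users\mark\downloads\myCountry.txt', 'r').read().splitlines()
print(wanted_words)
dex = open(r'c:\users\mark\downloads\mix_words.txt', 'r').read().splitlines()
for wd in dex:
    for wanted in wanted_words:
        # Compare case-insensitively; print every wanted word that is a prefix.
        if wd.lower().startswith(wanted.lower()):
            print(wd)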


I receive this error (I have Windows, not Linux, just in case):

C:\Python27>script.py
  File "C:\Python27\script", line 1
    wanted_words = open('C:\Python27\myCountry.txt','r').read().splitlines()
    ^
IndentationError: unexpected indent


Please address the indentation error.
Would TeamViewer be better? I don't know what this error means. I copied and pasted your script and only changed the paths to my lists; when I ran it, I received that error.
Indentations matter in Python.  Unless you want to go to Live or Gigs with this problem, you'd better do some of the lifting.
This will be more efficient for larger sets of lines.  It stops looking when it finds a match.
# Raw strings keep the backslashes in Windows paths from being read as escapes.
wanted_words = open(r'c:\users\mark\downloads\myCountry.txt', 'r').read().splitlines()
print(wanted_words)
dex = open(r'c:\users\mark\downloads\mix_words.txt', 'r').read().splitlines()
for wd in dex:
    for wanted in wanted_words:
        if wd.lower().startswith(wanted.lower()):
            print(wd)
            break  # one match is enough for this line; move on to the next


Also, if the wanted words were lower-cased before the searching began, the comparison would be faster.
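
A minimal sketch of that pre-lowercasing idea, for illustration only. The function name find_matches is hypothetical; the sketch relies on the fact that str.startswith accepts a tuple of prefixes.

```python
def find_matches(lines, wanted_words):
    """Return the lines that start with any wanted word, ignoring case."""
    # Lower-case the wanted words once, up front, and drop empty entries.
    prefixes = tuple(w.lower() for w in wanted_words if w)
    # str.startswith accepts a tuple and is True if any prefix matches.
    return [line for line in lines if line.lower().startswith(prefixes)]

sample = find_matches(["casier@123", "GAZDA-01!", "inter"], ["Casier", "gazda"])
```

Only the candidate line is lower-cased inside the loop; the wanted words are converted once, no matter how many lines are scanned.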

Also, if the wanted words were in some order, the searching might also be optimized.  Hard to tell how much it would help, but thought it worth mentioning.
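
One possible way to organize the wanted words for faster lookup (an assumption on my part, not something tested against the real files) is to keep them in a set and test each leading slice of the candidate word. The helper name has_wanted_prefix and its min_len parameter are hypothetical.

```python
def has_wanted_prefix(word, wanted_set, min_len=1):
    """Return True if some leading slice of word is in wanted_set."""
    lowered = word.lower()
    # One set lookup per slice length, independent of the dictionary size.
    return any(lowered[:n] in wanted_set
               for n in range(min_len, len(lowered) + 1))

wanted_set = {w.lower() for w in ["acces", "lavinia", "romania"]}
hits = [w for w in ["acces@123", "acessu", "Lavinia2016", "milano"]
        if has_wanted_prefix(w, wanted_set)]
```

The cost per candidate then depends on the length of the candidate rather than on the number of wanted words.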
Maybe not in order, but a dictionary would split up the words into smaller lists, based on the first character.  Then each new word only needs to search the list based on its first character (lower case, of course)
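# Raw strings keep the backslashes in Windows paths from being read as escapes.
wanted_words = open(r'c:\users\mark\downloads\myCountry.txt', 'r').read().splitlines()

print(wanted_words)
# Group the lower-cased wanted words by their first character.
d = {}
for wd in wanted_words:
    wd = wd.lower()
    if not wd:
        continue  # skip blank lines, which have no first character
    d.setdefault(wd[0], []).append(wd)
print(d)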


Result:
{'c': ['casier'], 'd': ['doctor'], 'g': ['gazda'], 'm': ['manual', 'magazie'], 'o': ['oaspete'], 'p': ['profesor'], 'u': ['utilizator'], 'v': ['vanzator']}


OK, I will test now. I don't want it to stop when it finds a match; I want to save all matches in Output.txt:

lcasier
doctor
gazda
etc...

Compare mix_words.txt with Dex.txt (my country's dictionary) and output only the words from my country; that's why I'm using Dex.txt.
It only stops looking for a match when it finds one.  It still iterates all the words in the big list.
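
To illustrate that point, here is a minimal sketch that scans every line of the big file and writes each match to an output file. The function name save_matches and the UTF-8 encoding are assumptions; adjust the encoding to whatever the real files use.

```python
import io

def save_matches(mixed_path, wanted_path, output_path):
    """Write every line of mixed_path that starts with a wanted word."""
    # Assumption: all three files are UTF-8 text, one word per line.
    with io.open(wanted_path, encoding="utf-8") as f:
        prefixes = tuple(line.strip().lower() for line in f if line.strip())
    count = 0
    with io.open(mixed_path, encoding="utf-8") as src, \
            io.open(output_path, "w", encoding="utf-8") as out:
        for line in src:  # stream the big file instead of loading it whole
            word = line.rstrip("\n")
            if word.lower().startswith(prefixes):
                out.write(word + u"\n")
                count += 1
    return count
```

A call such as save_matches("mix_words.txt", "dex.txt", "Output.txt") would return the number of lines written. Each matching line is written once, however many wanted words it starts with.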
I tested both scripts and I receive the same error.
I use Python 2.7 for Windows.

C:\Python27>script.py
  File "C:\Python27\script.py", line 1
    wanted_words = open('C:\Python27\myCountry.txt','r').read().splitlines()
    ^
IndentationError: unexpected indent


Maybe we'd better use TeamViewer? I really don't understand why I receive that error; it doesn't work.
If you want to go to Live, I'll meet you there.  Check for a mixture of leading tabs and leading spaces in your lines.
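
For anyone hitting the same IndentationError: Python raises it on line 1 when the first statement of a script begins with spaces or tabs, which easily happens when indented code is pasted into a new file. A minimal sketch of a checker follows; the function name indentation_problems is hypothetical.

```python
def indentation_problems(source):
    """Return (line_number, reason) pairs for suspicious leading whitespace."""
    problems = []
    lines = source.splitlines()
    for number, line in enumerate(lines, 1):
        indent = line[:len(line) - len(line.lstrip())]
        if " " in indent and "\t" in indent:
            problems.append((number, "mixes tabs and spaces"))
    # The first statement of a script must start in column 1.
    for number, line in enumerate(lines, 1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments may be indented freely
        if line[0] in " \t":
            problems.append((number, "first statement is indented"))
        break
    return problems

pasted = "    wanted_words = []\n    print(wanted_words)\n"
report = indentation_problems(pasted)
```

If the whole script is uniformly indented, textwrap.dedent from the standard library removes the common leading whitespace.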

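Example of use of dictionary for faster lookup through shortened lists:
def main():
    # Raw strings keep the backslashes in Windows paths from being read as escapes.
    wanted_words = open(r'c:\users\mark\downloads\myCountry.txt', 'r').read().splitlines()
##    print(wanted_words)
    # Group the lower-cased wanted words by their first character.
    d = {}
    for wd in wanted_words:
        wd = wd.lower()
        if wd:  # skip blank lines, which have no first character
            d.setdefault(wd[0], []).append(wd)
##    print(d)

    dex = open(r'c:\users\mark\downloads\mix_words.txt', 'r').read().splitlines()
    for wd in dex:
        lowered = wd.lower()
        # Only compare against wanted words that share the first character.
        if lowered[:1] in d:
            for wdCandidate in d[lowered[:1]]:
                if lowered.startswith(wdCandidate):
                    print(wd)  # print the line with its original casing
                    break
##        else:
##            print("*** No need to search for: " + wd)

if __name__ == '__main__':
    main()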


Yes, better, but can you make it save to OUTPUT.txt in the same directory? Thank you.
But look: this is dex.txt (100% my country's words; this is the list we use to match):
o
acces
mariana
maria
marian
lavinia
calculator
romania



Mix words:

oana123
john
marcelo
ioana@1
acces@123
acessu
liam2015
lavinia2016
romania1@
milano
inter
intern
lama
craca
pula



Output
oana123
acces@123
lavinia2016
romania1@



Why does the output contain oana123?

My dex.txt contains only "o". If so, will your script save all words that start with the letter O? That's not good; it would then save words like:


Oasis
Olchit
onasis
etc...


This is not good.
It must save oana123 only if myCountry.txt contains oan or oana. Let me give you another example.

myCountry.txt contains the word:
accelerate

Then your script must save words such as accelerite or accelerato1, etc.
So the word accelerate has 10 letters; I want at least 8 letters to match, and the rest of the word can be anything (1$@123 blah blah):

accelera
accelera12342432@#
accelerati@344
acceleram_123456
etc
Why does the output contain oana123?
Because the first word in dex.txt is "o".
This is your script, not mine.  I'm merely helping you.  If you want someone to do the work for you, visit Live or Gigs.
So the word accelerate has 10 letters; I want at least 8 letters to match
This is scope creep.  Stop it.  If you have different requirements, close this question and open a new question.
It must compare my country's words, not letters of the alphabet (a, b, c, d, etc.). "accelerate" has 10 letters and I want to match at least the first 8; "maimuta" has 7 letters and I want to match at least the first 5, which would give: maimuca1, maimute@123, etc.
This is scope creep.  Stop it.
It doesn't matter if it is very, very slow; this is what I want. Who can do that, please?
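
For reference, a minimal sketch of the requirement as described (drop the last two letters of each dictionary word, then match the remainder as a prefix). This is only one reading of the request, not a solution agreed on in this thread; the names build_stems and matches_stem and the cut and min_len defaults are hypothetical.

```python
def build_stems(words, cut=2, min_len=4):
    """Drop the last `cut` letters of each word; skip stems shorter than min_len."""
    stems = set()
    for word in words:
        cleaned = word.strip().lower()
        stem = cleaned[:-cut] if cut else cleaned
        if len(stem) >= min_len:  # filters out entries such as the single letter "o"
            stems.add(stem)
    return tuple(stems)

def matches_stem(candidate, stems):
    """Return True if candidate starts with any stem, ignoring case."""
    return candidate.lower().startswith(stems)

stems = build_stems(["accelerate", "maimuta", "o"])
kept = [w for w in ["accelerati@344", "maimuca1", "oana123", "milano"]
        if matches_stem(w, stems)]
```

With these defaults, accelerate becomes the 8-letter stem accelera and maimuta becomes the 5-letter stem maimu, while the one-letter entry is discarded and no longer matches oana123.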
@John

this is what i want
That's all well and good, but you asked a question and I have answered the question.  If you need something different, then close this question and ask a related question (please use the link).

If you want to hire someone to do that for you, visit Gigs or Live here at EE.
https://www.experts-exchange.com/gigs/
https://www.experts-exchange.com/live/
thanks
Maybe you can try removing single letters from your word list.
My country's word list is 900 KB and the mixed words file is 120 MB: about 70% American words and maybe 20% words from my country. I want to extract them; that's why I need a "nice" match.
Would you consider grep as an alternative to Python, or must this be a pure Python solution?

If grep is an option: you can use a pattern file (your dex.txt) to search for / filter your other (bigger) file. Patterns in your dex.txt must start with ^ to search for patterns at the beginning of the line, case insensitive search is also possible.

Example dex.txt:

^o
^acces
^mariana
^maria
^marian
^lavinia
^calculator
^romania

Example big_mix.txt:

oana123
john
marcelo
ioana@1
acces@123
OOtoo
acessu
liam2015
lavinia2016
romania1@
milano
inter
intern
lama
craca
pula

Command:
grep -f dex.txt big_mix.txt


Result:

oana123
acces@123
lavinia2016
romania1@

And case insensitive:
grep -i -f dex.txt big_mix.txt


Result:

oana123
acces@123
OOtoo
lavinia2016
romania1@
Does this work for Python on Windows too?
My suggestion is to use grep, not Python.

Do you have a Windows or Linux environment for this question? If it is Linux, then I suggest you use the grep solution.

But Python is available on Windows as well:
https://www.python.org/downloads/windows/
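
If grep is not available, a rough Python counterpart to grep -i -f could look like the sketch below. It assumes the dictionary holds plain words (without the leading ^, which the code adds itself); the function name compile_prefix_regex is hypothetical.

```python
import re

def compile_prefix_regex(words):
    """Build one case-insensitive regex that matches any word at line start."""
    cleaned = [w for w in words if w]
    if not cleaned:
        return re.compile(r"(?!)")  # an empty word list should match nothing
    # re.escape keeps characters such as '.' or '$' in a word literal.
    pattern = "^(?:" + "|".join(re.escape(w) for w in cleaned) + ")"
    return re.compile(pattern, re.IGNORECASE)

regex = compile_prefix_regex(["o", "acces", "lavinia", "romania"])
big_mix = ["oana123", "john", "ioana@1", "acces@123", "OOtoo", "acessu",
           "lavinia2016", "romania1@", "milano"]
found = [line for line in big_mix if regex.match(line)]
```

As with the grep -i example, the single-letter entry o also pulls in oana123 and OOtoo.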
I use Windows.
ASKER CERTIFIED SOLUTION
Gerwin Jansen
Thank you. By the way, do you know how to use regular expressions?
Thanks but what regular expression do you mean?