john lambert
asked on
Python Dex.txt: how to use this regex in this Python script?
I have a DEX.txt file, a dictionary of my country's language that contains only words in my language.
This is the regex. I want to integrate it and use only this rule in the Dex Python script:
^[all-words-from-dex]blah-blah.*
Dex script:
import sqlite3

# Count the lines in a file.
def lungime_fisier(fisier):
    i = 0
    with open(fisier) as data:
        for line in data:
            i += 1
    return i

# Create the database and the "dex" table if they do not exist yet.
def creaza_bazadate(nume_baza_date):
    conn = sqlite3.connect(nume_baza_date)
    conn.execute('''CREATE TABLE IF NOT EXISTS dex
        (id INTEGER PRIMARY KEY AUTOINCREMENT,
        words VARCHAR(50) NOT NULL);''')
    conn.close()
    # "Created database '%s'!"
    print "Am creat baza de date '%s'!" % nume_baza_date

# Insert every line of the dictionary file into the "dex" table.
def introduc_dictionar(dictionar, nume_baza_date):
    conn = sqlite3.connect(nume_baza_date)
    lungime_dictionar = lungime_fisier(dictionar)
    with open(dictionar) as fisier:
        for linie in fisier:
            linie = linie.replace("\n", "")
            cursor = conn.cursor()
            cursor.execute("INSERT INTO dex (words) VALUES (%r)" % linie)
    conn.commit()
    conn.close()
    # "Inserted %d words into database %s"
    print "Am introdus %d cuvinte in baza de date %s" % (lungime_dictionar, nume_baza_date)

# Return the distinct words stored in the database.
def db(nume_baza_date):
    lista = []
    conn = sqlite3.connect(nume_baza_date)
    cursor = conn.cursor()
    data = cursor.execute("SELECT words from dex")
    for cuvant in data:
        if cuvant[0] not in lista:
            lista.append(str(cuvant[0]))
    return lista

# Exact matches: lines of the mixed file that are equal to a dictionary word.
def potriviri_exacte(cuvinte_mixte, nume_baza_date):
    lista_potriviri_exacte = []
    conn = sqlite3.connect(nume_baza_date)
    with open(cuvinte_mixte) as fisier:
        for linie in fisier:
            linie = linie.replace("\n", "")
            cursor = conn.cursor()
            cauta_cuvant = cursor.execute("SELECT words FROM dex WHERE words = %r LIMIT 1" % linie)
            for cuvant in cauta_cuvant:
                lista_potriviri_exacte.append(str(cuvant[0]))
    return lista_potriviri_exacte

# Derived matches: lines of the mixed file that contain a dictionary word
# anywhere; the matches are appended to the output file.
def potriviri_derivate(cuvinte_mixte, cuvinte_romanesti, nume_baza_date):
    potriviri_derivate = []
    dex = db(nume_baza_date)
    for cuvant in dex:
        with open(cuvinte_mixte) as fisier:
            for linie in fisier:
                linie = linie.replace("\n", "")
                if cuvant in linie:
                    potriviri_derivate.append(str(linie))
    for cuvant in potriviri_derivate:
        with open(cuvinte_romanesti, "a") as fisier:
            fisier.write(cuvant + "\n")
            fisier.close()

#creaza_bazadate("dex.db")
#introduc_dictionar("DEX.txt", "dex.db")
#potriviri_derivate("cuvinte_mixte.txt", "romanesti.txt", "dex.db")
While your question references dex.txt, your code references a database table. I'm confused.
ASKER
The Python script works, but I'm not satisfied with the result. You can modify it or create a new script. Let me explain what I want with some examples. This is a regex, and I want to extract all words which start with the name Adina:
^[Aa]dina.*
I have one big mix_words.txt (100 MB) and I want to extract all the words that start with Adina, so output.txt would look like this:
adina@123
Adina2016
adina123
ADINA01
adina!234
adina12345
Adina-masina1
Adina-blah blah...
OK, so I have a dictionary of my country's language called dex.txt (800 KB):
manual
profesor
casier
magazie
doctor
vanzator
utilizator
oaspete
gazda
etc....
I want to extract all of these words. This is just an example using a few lines:
manual1
PROFESOR
PROFESOR1
casier@123
magazie$123$
doctor
Doctor010101
vanzator2016
utilizator1234!!!
oaspete,.1!
gazda-01!
It would be great if you could do this. For example, dex.txt contains this line:
accelerat
Now, if you could also output all the words that still match after cutting the last 2 or 3 letters (that would be: accele):
accelerat
acceleration
accelerat@1
ACCELERAT
I don't think you need a regular expression for your pattern matching. You just need to iterate your list of acceptable word beginnings and use the startswith method. Here are a couple of examples:
currentword.lower().startswith('Adina'.lower())
currentword.lower().startswith(wordfromlist.lower())
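A minimal sketch of how those two expressions could be wrapped into something testable. The function name `starts_with_any` and the sample words are assumptions for illustration, and the block is written for Python 3 (the asker's script is Python 2.7):

```python
def starts_with_any(word, prefixes):
    """Return True if `word` begins with any entry of `prefixes`, ignoring case."""
    lowered = word.lower()
    # Empty prefixes are skipped because every string starts with "".
    return any(lowered.startswith(p.lower()) for p in prefixes if p)


# Sample data in the spirit of the question; "madina" must not match.
mixed = ["adina@123", "ADINA01", "john", "Adina-masina1", "madina"]
matches = [w for w in mixed if starts_with_any(w, ["Adina"])]
print(matches)  # ['adina@123', 'ADINA01', 'Adina-masina1']
```

Because both sides are lower-cased before the comparison, the original spelling of each matching word is preserved in the result.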
ASKER
OK, I want to use a mixed_words.txt and a dictionary of my language called Dex.txt. Sorry, but I don't know how to use this code. Can you explain, please? Can you create a script, or modify the one above?
currentword.lower().startswith(wordfromlist.lower())
I want lowercase and uppercase too.
We are still talking in general terms. I do not have any way to test any code because you haven't posted any files.
ASKER
The original dex.txt and mix_words.txt? You can use the small lists above just to test.
wanted_words = open('c:\users\mark\downloads\myCountry.txt','r').read().splitlines()
print wanted_words
dex = open('c:\users\mark\downloads\mix_words.txt','r').read().splitlines()
for wd in dex:
    for wanted in wanted_words:
        if wd.lower().startswith(wanted.lower()):
            print wd
ASKER
I receive this error. I'm on Windows, not Linux, just in case:
C:\Python27>script.py
  File "C:\Python27\script", line 1
    wanted_words = open('C:\Python27\myCountry.txt','r').read().splitlines()
    ^
IndentationError: unexpected indent
Please address the indentation error.
ASKER
Would TeamViewer be better? I don't know what this error means. I copied and pasted your script and only changed the paths to my lists; when I ran it, I received that error.
Indentations matter in Python. Unless you want to go to Live or Gigs with this problem, you'd better do some of the lifting.
This will be more efficient for larger sets of lines. It stops looking when it finds a match
wanted_words = open('c:\users\mark\downloads\myCountry.txt','r').read().splitlines()
print wanted_words
dex = open('c:\users\mark\downloads\mix_words.txt','r').read().splitlines()
for wd in dex:
    for wanted in wanted_words:
        if wd.lower().startswith(wanted.lower()):
            print wd
            break
Also, if the wanted words were lower-cased before the searching began, the comparison would be faster.
Also, if the wanted words were in some order, the searching might also be optimized. Hard to tell how much it would help, but thought it worth mentioning.
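One possible way to act on these two remarks, offered as a sketch rather than as the code posted in this thread: lower-case the wanted words once, keep them in a set, and test each prefix of a mixed word for membership. This uses hashing instead of ordering. The function name `find_matches` and the sample lists are assumptions, and the block targets Python 3:

```python
def find_matches(mixed_words, wanted_words):
    """Return the mixed words whose lower-cased form starts with a wanted word."""
    wanted = set(w.lower() for w in wanted_words if w)  # lower-case only once
    if not wanted:
        return []
    longest = max(len(w) for w in wanted)
    matches = []
    for word in mixed_words:
        low = word.lower()
        limit = min(len(low), longest)
        # Try every prefix of the word; any() stops at the first hit.
        if any(low[:k] in wanted for k in range(1, limit + 1)):
            matches.append(word)
    return matches


wanted_words = ["manual", "profesor", "doctor", "gazda"]
mixed_words = ["manual1", "PROFESOR", "john", "Doctor010101", "gazda-01!", "liam2015"]
print(find_matches(mixed_words, wanted_words))
# ['manual1', 'PROFESOR', 'Doctor010101', 'gazda-01!']
```

The work per mixed word then depends on the length of the longest wanted word, not on how many wanted words there are.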
Maybe not in order, but a dictionary would split up the words into smaller lists, based on the first character. Then each new word only needs to search the list based on its first character (lower case, of course)
wanted_words = open('c:\users\mark\downloads\myCountry.txt','r').read().splitlines()
print wanted_words
d={}
for wd in wanted_words:
    wd=wd.lower()
    if d.has_key(wd[0]):
        d[wd[0]].append(wd)
    else:
        d[wd[0]]=[wd]
print d
Result: {'c': ['casier'], 'd': ['doctor'], 'g': ['gazda'], 'm': ['manual', 'magazie'], 'o': ['oaspete'], 'p': ['profesor'], 'u': ['utilizator'], 'v': ['vanzator']}
ASKER
OK, I will test now. I don't want it to stop when it finds a match; I want to save all matches in Output.txt:
lcasier
doctor
gazda
etc...
Compare mix_words.txt with Dex.txt (my country's dictionary) and output only the words from my country; that's why I'm using Dex.txt.
It only stops looking for a match when it finds one. It still iterates all the words in the big list.
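The point about `break` can be illustrated with a small sketch: `break` leaves only the inner loop over the dictionary words, so the outer loop still visits every mixed word and every matching word is collected. The function name `collect_matches`, the sample data, and the temporary-directory location of Output.txt are assumptions for illustration (Python 3):

```python
import os
import tempfile


def collect_matches(mixed_words, wanted_words):
    """Collect every mixed word that starts with some wanted word."""
    wanted = [w.lower() for w in wanted_words if w]
    matches = []
    for word in mixed_words:      # outer loop: visits every mixed word
        low = word.lower()
        for prefix in wanted:     # inner loop: dictionary words
            if low.startswith(prefix):
                matches.append(word)
                break             # leaves the inner loop only
    return matches


matches = collect_matches(["casier@123", "john", "doctor", "gazda-01!"],
                          ["casier", "doctor", "gazda"])

# Hypothetical output location; adjust the path to your own folder.
out_path = os.path.join(tempfile.gettempdir(), "Output.txt")
with open(out_path, "w") as out:
    for word in matches:
        out.write(word + "\n")

print(matches)  # ['casier@123', 'doctor', 'gazda-01!']
```

Without the `break`, a word that starts with two different dictionary words would be saved twice.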
ASKER
I tested both scripts and I receive the same error. I use Python 2.7 for Windows:
C:\Python27>script.py
  File "C:\Python27\script.py", line 1
    wanted_words = open('C:\Python27\myCountry.txt','r').read().splitlines()
    ^
IndentationError: unexpected indent
ASKER
Maybe we'd better use TeamViewer? I really don't understand why I receive that error; it doesn't work.
If you want to go to Live, I'll meet you there. Check for a mixture of leading tabs and leading spaces in your lines.
Example of use of dictionary for faster lookup through shortened lists
def main():
    wanted_words = open('c:\users\mark\downloads\myCountry.txt','r').read().splitlines()
##    print wanted_words
    d={}
    for wd in wanted_words:
        wd=wd.lower()
        if d.has_key(wd[0]):
            d[wd[0]].append(wd)
        else:
            d[wd[0]]=[wd]
##    print d
    dex = open('c:\users\mark\downloads\mix_words.txt','r').read().splitlines()
    for wd in dex:
        wd=wd.lower()
        if d.has_key(wd[0]):
            for wdCandidate in d[wd[0]]:
                if wd.startswith(wdCandidate):
                    print wd
                    break
##        else:
##            print "*** No need to search for: ",wd

if __name__ == '__main__':
    main()
ASKER
Yes, better, but can you make it save the result as OUTPUT.txt in the same directory? Thank you.
ASKER
But look, this is dex.txt (100% words of my country; this is the list we use for matching):
o
acces
mariana
maria
marian
lavinia
calculator
romania
Mix words:
oana123
john
marcelo
ioana@1
acces@123
acessu
liam2015
lavinia2016
romania1@
milano
inter
intern
lama
craca
pula
Output
oana123
acces@123
lavinia2016
romania1@
Why is oana123 in the output?
My dex.txt contains only "o". If so, will your script save all words that start with the letter O? That's not good; it would then save words like:
Oasis
Olchit
onasis
etc...
This is not good.
ASKER
It must save oana123 only if myCountry.txt contains oan or oana. Let me give you another example.
myCountry.txt contains the word:
accelerate
Then your script must save words such as accelerite or accelerato1, etc. The word accelerate has 10 letters; I want at least the first 8 letters to match, and the rest of the word can be anything (1$@123 blah blah):
accelera
accelera12342432@#
accelerati@344
acceleram_123456
etc
"Why is oana123 in the output?"
Because the first word in dex.txt is "o".
This is your script, not mine. I'm merely helping you. If you want someone to do the work for you, visit Live or Gigs.
"The word accelerate has 10 letters; I want at least 8 letters to match."
This is scope creep. Stop it. If you have different requirements, close this question and open a new question.
ASKER
It must compare against my country's words, not single letters of the alphabet (a, b, c, d, etc.). "accelerate" has 10 letters and I want to match at least the first 8; "maimuta" has 7 letters and I want to match at least the first 5, which would give: maimuca1, maimute@123, etc.
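Read literally, the two examples (8 of 10 letters, 5 of 7 letters) amount to "drop the last two letters of each dictionary word and match the rest as a prefix". A rough sketch of that reading follows. The names `build_stems` and `matches_stem`, the `drop=2` rule, and the `min_stem=3` cut-off are assumptions inferred from the examples, not requirements stated anywhere in the thread (Python 3):

```python
def build_stems(dictionary_words, drop=2, min_stem=3):
    """Return lower-cased stems: each word without its last `drop` letters.

    Entries shorter than `min_stem` (such as "o") are skipped, and a word
    whose stem would be shorter than `min_stem` is kept whole.
    """
    stems = set()
    for entry in dictionary_words:
        word = entry.strip().lower()
        if len(word) < min_stem:
            continue
        stem = word[:len(word) - drop]
        stems.add(stem if len(stem) >= min_stem else word)
    return stems


def matches_stem(candidate, stems):
    """True if the candidate starts with any stem, ignoring case."""
    low = candidate.lower()
    return any(low.startswith(s) for s in stems)


stems = build_stems(["o", "accelerate", "maimuta"])
mixed = ["accelerati@344", "acceleram_123456", "maimuca1", "oana123", "acces@123"]
print([w for w in mixed if matches_stem(w, stems)])
# ['accelerati@344', 'acceleram_123456', 'maimuca1']
```

Shorter stems match more words, so the values of `drop` and `min_stem` trade missed words against false positives.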
This is scope creep. Stop it.
ASKER
It doesn't matter if it is very, very slow; this is what I want. Who can do that, please?
@John
"This is what I want."
That's all well and good, but you asked a question and I have answered the question. If you need something different, then close this question and ask a related question (please use the link).
If you want to hire someone to do that for you, visit Gigs or Live here at EE.
https://www.experts-exchange.com/gigs/
https://www.experts-exchange.com/live/
ASKER
thanks
Maybe you can try removing single letters from your word list.
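A small sketch of that suggestion, using the sample lists from earlier in the thread. The `MIN_LENGTH` threshold is a hypothetical name and value chosen for illustration (Python 3):

```python
dex = ["o", "acces", "mariana", "maria", "marian", "lavinia", "calculator", "romania"]
mixed = ["oana123", "john", "acces@123", "lavinia2016", "romania1@", "milano"]

MIN_LENGTH = 2  # hypothetical threshold: drops one-letter entries only

# Keep only dictionary entries that are long enough, lower-cased once.
prefixes = tuple(w.strip().lower() for w in dex if len(w.strip()) >= MIN_LENGTH)

# str.startswith accepts a tuple of prefixes.
kept = [w for w in mixed if w.lower().startswith(prefixes)]
print(kept)  # ['acces@123', 'lavinia2016', 'romania1@']
```

With the entry "o" filtered out, oana123 no longer appears in the result.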
ASKER
My country's word list is 900 KB and the mixed word list is 120 MB: about 70% American words and maybe 20% words from my country. I want to extract them; that's why I need a "nice" match.
Would you consider grep as an alternative to Python, or must this be a pure Python solution?
If grep is an option: you can use a pattern file (your dex.txt) to search for / filter your other (bigger) file. Patterns in your dex.txt must start with ^ to search for patterns at the beginning of the line, case insensitive search is also possible.
Example dex.txt:
^o
^acces
^mariana
^maria
^marian
^lavinia
^calculator
^romania
Example big_mix.txt:
oana123
john
marcelo
ioana@1
acces@123
OOtoo
acessu
liam2015
lavinia2016
romania1@
milano
inter
intern
lama
craca
pula
Command:
grep -f dex.txt big_mix.txt
Result:
oana123
acces@123
lavinia2016
romania1@
And case insensitive:
grep -i -f dex.txt big_mix.txt
Result:
oana123
acces@123
OOtoo
lavinia2016
romania1@
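For readers without grep at hand, roughly the same filtering can be sketched with Python's `re` module. This is an approximation of the two commands above, not a replacement for grep: grep's default pattern syntax differs from Python's in places, and the function name `grep_lines` is an assumption. The lists are shortened versions of the example files (Python 3):

```python
import re


def grep_lines(patterns, lines, ignore_case=False):
    """Return the lines matched by at least one pattern, in input order."""
    patterns = [p for p in patterns if p]
    if not patterns:
        return []
    flags = re.IGNORECASE if ignore_case else 0
    # Join all patterns into one alternation so each line is scanned once.
    combined = re.compile("|".join("(?:%s)" % p for p in patterns), flags)
    return [line for line in lines if combined.search(line)]


patterns = ["^o", "^acces", "^mariana", "^maria", "^marian",
            "^lavinia", "^calculator", "^romania"]
big_mix = ["oana123", "john", "marcelo", "ioana@1", "acces@123", "OOtoo",
           "acessu", "liam2015", "lavinia2016", "romania1@", "milano"]

print(grep_lines(patterns, big_mix))
# ['oana123', 'acces@123', 'lavinia2016', 'romania1@']
print(grep_lines(patterns, big_mix, ignore_case=True))
# ['oana123', 'acces@123', 'OOtoo', 'lavinia2016', 'romania1@']
```

With a dictionary of several hundred kilobytes the combined pattern becomes very large, so this approach may be slow; the `startswith` loops shown earlier in the thread avoid building such a pattern.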
ASKER
Does this work with Python on Windows too?
My suggestion is to use grep, not Python.
Do you have a Windows or Linux environment for this question? If it is Linux, then I suggest you use the grep solution.
But Python is available on Windows as well:
https://www.python.org/downloads/windows/
ASKER
I use Windows.
ASKER
Thank you. By the way, do you know how to use regular expressions?
Thanks, but what regular expression do you mean?