Solved

Python Dex.txt: how to use this regex in this Python script?

Posted on 2016-11-11
Last Modified: 2016-11-13
Python Dex: how do I use this regex in this Python script?
^[all-words-from-dex]blah-blah.*

I have a dictionary for my country's language that contains only words in my language:
DEX.txt file

That is the regex. I want to integrate it into the Dex Python script and use only this rule.


Dex script:

import sqlite3


def lungime_fisier(fisier):
    i = 0
    with open(fisier) as data:
        for line in data:
            i +=1
    return i


def creaza_bazadate(nume_baza_date):
    conn = sqlite3.connect(nume_baza_date)
    conn.execute('''CREATE TABLE IF NOT EXISTS dex
       (id INTEGER PRIMARY KEY AUTOINCREMENT,
       words VARCHAR(50) NOT NULL);''')
    conn.close()
    print "Am creat baza de date '%s'!" % nume_baza_date



def introduc_dictionar(dictionar, nume_baza_date):
    conn = sqlite3.connect(nume_baza_date)
    lungime_dictionar = lungime_fisier(dictionar)  # count the file that was passed in, not a hard-coded name
    with open(dictionar) as fisier:
        for linie in fisier:
            linie = linie.replace("\n", "")
            cursor = conn.cursor()
            cursor.execute("INSERT INTO dex (words) VALUES (?)", (linie,))  # bound parameter instead of %r string formatting
    conn.commit()
    conn.close()
    print "Am introdus %d cuvinte in baza de date %s" % (lungime_dictionar, nume_baza_date)
            

def db(nume_baza_date):
    lista = []
    conn = sqlite3.connect(nume_baza_date)
    cursor = conn.cursor()
    data = cursor.execute("SELECT words from dex")
    for cuvant in data:
        if cuvant[0] not in lista:
            lista.append(str(cuvant[0]))
    return lista


def potriviri_exacte(cuvinte_mixte, nume_baza_date):
    lista_potriviri_exacte = []
    conn = sqlite3.connect(nume_baza_date)
    with open(cuvinte_mixte) as fisier:
        for linie in fisier:
            linie = linie.replace("\n", "")
            cursor = conn.cursor()
            cauta_cuvant = cursor.execute("SELECT words FROM dex WHERE words = ? LIMIT 1", (linie,))  # bound parameter instead of %r string formatting
            for cuvant in cauta_cuvant:
                lista_potriviri_exacte.append(str(cuvant[0]))
    return lista_potriviri_exacte
    

def potriviri_derivate(cuvinte_mixte, cuvinte_romanesti, nume_baza_date):
    potriviri_derivate = []
    dex = db(nume_baza_date)
    for cuvant in dex:
        with open(cuvinte_mixte) as fisier:
            for linie in fisier:
                linie = linie.replace("\n", "")
                if cuvant in linie:
                    potriviri_derivate.append(str(linie))
                    
    # Open the output file once; the with block closes it, so no explicit close() is needed.
    with open(cuvinte_romanesti, "a") as fisier:
        for cuvant in potriviri_derivate:
            fisier.write(cuvant + "\n")


#creaza_bazadate("dex.db")
#introduc_dictionar("DEX.txt", "dex.db")
#potriviri_derivate("cuvinte_mixte.txt", "romanesti.txt", "dex.db")

Question by:john lambert
40 Comments
 
LVL 45

Expert Comment

by:aikimark
ID: 41883844
While your question references dex.txt, your code references a database table.  I'm confused.
0
 

Author Comment

by:john lambert
ID: 41883933
The Python script works, but I'm not satisfied with the result; you can modify it or create a new script. Let me explain what I want and give you some examples. This is a regex, and I want to extract all words which start with the name Adina:
^[Aa]dina.* 


I have one big mix_words.txt (100 MB) and I want to extract all the words that start with Adina, so output.txt would look like this:

adina@123
Adina2016
adina123
ADINA01
adina!234
adina12345
Adina-masina1
Adina-blah blah...
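
A minimal sketch of the Adina case, assuming one candidate word per line. Note that `^[Aa]dina.*` as written would skip `ADINA01` from the wanted output, so the sketch uses `re.IGNORECASE` instead of the `[Aa]` character class. The function name `lines_starting_with` and the sample list are made up for illustration.

```python
import re

def lines_starting_with(name, lines):
    """Return the lines that begin with `name`, ignoring case."""
    # re.escape protects against regex metacharacters in the name;
    # re.IGNORECASE lets adina, Adina and ADINA all match.
    pattern = re.compile(re.escape(name), re.IGNORECASE)
    # pattern.match() only matches at the start of the string, so no ^ is needed.
    return [line for line in lines if pattern.match(line)]

sample = ["adina@123", "Adina2016", "ADINA01", "madina1", "john"]
matches = lines_starting_with("Adina", sample)
```

For a 100 MB file the same test can be applied line by line while reading, rather than loading everything into a list first.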


OK, so I have a dictionary for my country's language called dex.txt (800 KB):
manual
profesor
casier
magazie
doctor
vanzator
utilizator
oaspete
gazda
etc....


I want to extract all of these words. This is just an example using a few lines:

manual1
PROFESOR
PROFESOR1
casier@123
magazie$123$
doctor
Doctor010101
vanzator2016
utilizator1234!!!
oaspete,.1!
gazda-01!


It would be great if you could do this. For example, dex.txt contains this line:

accelerat


Now, if you could also match after cutting the last 2 or 3 letters (which would leave: accele), the output would be:

accelerat
acceleration
accelerat@1
ACCELERAT
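
One way to read the `^[all-words-from-dex]` idea is a single alternation built from the dictionary. The sketch below assumes dex.txt has already been read into a list of plain words; `build_prefix_regex` is a hypothetical helper, not part of the Dex script. With a dictionary of 800 KB the resulting pattern is very large, so this is likely to be slow compared with plain string methods.

```python
import re

def build_prefix_regex(words):
    """Compile one case-insensitive regex that matches a line starting with any word."""
    # set() drops duplicates; sorting only makes the pattern deterministic.
    escaped = [re.escape(w) for w in sorted(set(words)) if w]
    if not escaped:
        # An empty alternation would match every line, so refuse it.
        raise ValueError("the word list is empty")
    return re.compile("(?:" + "|".join(escaped) + ")", re.IGNORECASE)

dex_words = ["manual", "profesor", "casier", "doctor", "gazda"]
mixed = ["manual1", "PROFESOR1", "casier@123", "Doctor010101",
         "gazda-01!", "john", "milano"]
regex = build_prefix_regex(dex_words)
# regex.match() anchors at the start of each line, like ^ would.
kept = [line for line in mixed if regex.match(line)]
```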

0
 
LVL 45

Expert Comment

by:aikimark
ID: 41883951
I don't think you need a regular expression for your pattern matching.  You just need to iterate your list of acceptable word beginnings and use the startswith method.  Here are a couple of examples:
currentword.lower().startswith('Adina'.lower())

currentword.lower().startswith(wordfromlist.lower())
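
As a small sketch of the same idea: `str.startswith` also accepts a tuple of prefixes, so the whole word list can be tested in one call. The helper name `starts_with_any` is made up for this example.

```python
def starts_with_any(word, prefixes):
    """True if `word` begins with any of the prefixes, ignoring case."""
    # startswith() needs a tuple (not a list) when given several prefixes.
    lowered = tuple(p.lower() for p in prefixes if p)
    return word.lower().startswith(lowered)

wordlist = ["Adina", "doctor"]
checks = [starts_with_any(w, wordlist) for w in ["ADINA01", "Doctor010101", "john"]]
```

In a real loop the lower-cased tuple would be built once, outside the loop, rather than on every call.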

1
 

Author Comment

by:john lambert
ID: 41883956
OK, I want to use a mixed_words.txt and my language dictionary called Dex.txt. Sorry, but I don't know how to use this code; can you explain, please? Can you create a script, or modify the one above?
I want lowercase and uppercase too.

currentword.lower().startswith(wordfromlist.lower())
0
 
LVL 45

Expert Comment

by:aikimark
ID: 41884008
We are still talking in general terms.  I do not have any way to test any code because you haven't posted any files.
1
 

Author Comment

by:john lambert
ID: 41884010
The original dex.txt and mix_words.txt? You can use the small lists above just to test.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 41884045
    wanted_words = open('c:\users\mark\downloads\myCountry.txt','r').read().splitlines()
    print wanted_words
    dex = open('c:\users\mark\downloads\mix_words.txt','r').read().splitlines()
    for wd in dex:
        for wanted in wanted_words:
            if wd.lower().startswith(wanted.lower()):
                print wd

0
 

Author Comment

by:john lambert
ID: 41884052
I receive this error.
I have Windows, not Linux, just in case.

C:\Python27>script.py
  File "C:\Python27\script", line 1
    wanted_words = open('C:\Python27\myCountry.txt','r').read().splitlines()
    ^
IndentationError: unexpected indent

0
 
LVL 45

Expert Comment

by:aikimark
ID: 41884062
please address the indentation error
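
For context: Python reports "unexpected indent" when the first statement of a file does not start in column 0, and the snippet as posted carries four leading spaces on every line, which is the likely cause here. A small sketch with a made-up two-line script, showing the error and one way to strip the common indent with `textwrap.dedent`:

```python
import textwrap

# A made-up script that, like the pasted snippet, starts indented.
indented = "    total = 1 + 1\n    result = total * 2\n"

try:
    compile(indented, "script.py", "exec")
    error = None
except IndentationError as exc:
    error = exc.msg  # the message behind "IndentationError: unexpected indent"

# dedent() removes the whitespace common to all lines,
# so the first statement starts at column 0 again.
fixed = textwrap.dedent(indented)
namespace = {}
exec(compile(fixed, "script.py", "exec"), namespace)
```

In an editor, the equivalent fix is to select the pasted lines and remove the four leading spaces from each of them.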
0
 

Author Comment

by:john lambert
ID: 41884064
Would TeamViewer be better? I don't know what this error means. I copied and pasted your script and only changed the paths to my lists; I ran it and received that error.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 41884078
Indentations matter in Python.  Unless you want to go to Live or Gigs with this problem, you'd better do some of the lifting.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 41884084
This will be more efficient for larger sets of lines.  It stops looking when it finds a match
    wanted_words = open('c:\users\mark\downloads\myCountry.txt','r').read().splitlines()
    print wanted_words
    dex = open('c:\users\mark\downloads\mix_words.txt','r').read().splitlines()
    for wd in dex:
        for wanted in wanted_words:
            if wd.lower().startswith(wanted.lower()):
                print wd
                break

0
 
LVL 45

Expert Comment

by:aikimark
ID: 41884092
Also, if the wanted words were lower-cased before the searching began, the comparison would be faster.

Also, if the wanted words were in some order, the searching might also be optimized.  Hard to tell how much it would help, but thought it worth mentioning.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 41884107
Maybe not in order, but a dictionary would split up the words into smaller lists, based on the first character.  Then each new word only needs to search the list based on its first character (lower case, of course)
    wanted_words = open('c:\users\mark\downloads\myCountry.txt','r').read().splitlines()

    print wanted_words
    d={}
    for wd in wanted_words:
        wd=wd.lower()
        if d.has_key(wd[0]):
            d[wd[0]].append(wd)
        else:
            d[wd[0]]=[wd]
    print d

Result:
{'c': ['casier'], 'd': ['doctor'], 'g': ['gazda'], 'm': ['manual', 'magazie'], 'o': ['oaspete'], 'p': ['profesor'], 'u': ['utilizator'], 'v': ['vanzator']}
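
The same grouping can be sketched with `collections.defaultdict`, which removes the need for the key check. `group_by_first_letter` is a hypothetical name; the word list is the small sample from earlier in the thread.

```python
from collections import defaultdict

def group_by_first_letter(words):
    """Map each first character to the list of lower-cased words that start with it."""
    groups = defaultdict(list)
    for word in words:
        word = word.strip().lower()
        if word:  # skip blank lines, where word[0] would raise IndexError
            groups[word[0]].append(word)
    return dict(groups)

sample = ["manual", "profesor", "casier", "magazie", "doctor",
          "vanzator", "utilizator", "oaspete", "gazda"]
grouped = group_by_first_letter(sample)
```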

0
 

Author Comment

by:john lambert
ID: 41884114
OK, I will test now... I don't want it to stop when it finds a match; I want to save all matches in Output.txt:

casier
doctor
gazda
etc...

I want to compare mix_words.txt with Dex.txt (my country's dictionary) and output only the words from my country; that's why I'm using Dex.txt.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 41884116
It only stops looking for a match when it finds one.  It still iterates all the words in the big list.
0
 

Author Comment

by:john lambert
ID: 41884120
I tested both scripts and I receive the same error.
I use Python 2.7 for Windows.

C:\Python27>script.py
  File "C:\Python27\script.py", line 1
    wanted_words = open('C:\Python27\myCountry.txt','r').read().splitlines()
    ^
IndentationError: unexpected indent

0
 

Author Comment

by:john lambert
ID: 41884125
Maybe we'd better use TeamViewer? I really don't understand why I receive that error; it doesn't work.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 41884146
If you want to go to Live, I'll meet you there.  Check for a mixture of leading tabs and leading spaces in your lines.

Example of use of dictionary for faster lookup through shortened lists
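def main():
    # Raw strings (r'...') keep the backslashes in Windows paths literal.
    wanted_words = open(r'c:\users\mark\downloads\myCountry.txt','r').read().splitlines()
##    print wanted_words
    # Group the lower-cased wanted words by their first character.
    d={}
    for wd in wanted_words:
        wd=wd.lower()
        if not wd:
            continue    # skip blank lines; wd[0] would raise IndexError
        if wd[0] in d:
            d[wd[0]].append(wd)
        else:
            d[wd[0]]=[wd]
##    print d

    dex = open(r'c:\users\mark\downloads\mix_words.txt','r').read().splitlines()
    for wd in dex:
        lowered=wd.lower()    # compare in lower case, but keep the original line
        if lowered and lowered[0] in d:
            for wdCandidate in d[lowered[0]]:
                if lowered.startswith(wdCandidate):
                    print wd    # printed with its original casing
                    break
##        else:
##            print "*** No need to search for: ",wd

if __name__ == '__main__':
    main()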

1
 

Author Comment

by:john lambert
ID: 41884151
Yes, better, but can you make it save the results as OUTPUT.txt in the same directory? Thank you.
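
A minimal sketch of writing the matches to a file instead of printing them. The function names, the UTF-8 encoding and the file layout (one word per line) are assumptions, not something stated in the thread. The big file is streamed line by line, so it never has to fit in memory.

```python
import io
import os

def output_path_next_to(mixed_path, name="OUTPUT.txt"):
    """Build a path for the output file in the same directory as the mixed list."""
    return os.path.join(os.path.dirname(os.path.abspath(mixed_path)), name)

def save_matches(dictionary_path, mixed_path, output_path, encoding="utf-8"):
    """Write every line of mixed_path that starts with a dictionary word; return the count."""
    with io.open(dictionary_path, encoding=encoding) as handle:
        # Lower-case the prefixes once and skip blank lines.
        prefixes = tuple(line.strip().lower() for line in handle if line.strip())
    count = 0
    with io.open(mixed_path, encoding=encoding) as mixed:
        with io.open(output_path, "w", encoding=encoding) as out:
            for line in mixed:
                word = line.rstrip("\r\n")
                # Compare in lower case, but write the line as it appeared.
                if word.lower().startswith(prefixes):
                    out.write(word + u"\n")
                    count += 1
    return count
```

A call could look like `save_matches("dex.txt", "mix_words.txt", output_path_next_to("mix_words.txt"))`. Testing a long tuple of prefixes for every line is simple but not fast; the first-letter dictionary above is one way to shorten the search.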
0
 

Author Comment

by:john lambert
ID: 41884169
But look, this is dex.txt (100% words from my country; this is the list we match against):
o
acces
mariana
maria
marian
lavinia
calculator
romania


Mix words:

oana123
john
marcelo
ioana@1
acces@123
acessu
liam2015
lavinia2016
romania1@
milano
inter
intern
lama
craca
pula


Output
oana123
acces@123
lavinia2016
romania1@


Why does the output contain oana123?

My dex.txt only contains ''o''. If so, will your script save every word that starts with the letter o? That's not good... then it would save words like:


Oasis
Olchit
onasis
etc...

This is not good.
0
 

Author Comment

by:john lambert
ID: 41884178
It must save oana123 only if myCountry.txt contains oan or oana. Let me give you another example.

myCountry.txt contains the word:
accelerate

Then your script must save words like accelerite or accelerato1, etc.
The word accelerate has 10 letters and I want at least the first 8 letters to match; the rest of the word can be anything (1$@123 blah blah):

accelera
accelera12342432@#
accelerati@344
acceleram_123456
etc
0
 
LVL 45

Expert Comment

by:aikimark
ID: 41884201
why output , oana123 ??
because the first word in dex.txt is "o"
0
 
LVL 45

Expert Comment

by:aikimark
ID: 41884205
This is your script, not mine.  I'm merely helping you.  If you want someone to do the work for you, visit Live or Gigs.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 41884208
so the word: accelerate have 10 letters , i want at least 8 letters match
This is scope creep.  Stop it.  If you have different requirements, close this question and open a new question.
0
 

Author Comment

by:john lambert
ID: 41884210
It must compare against my country's words, not single letters of the alphabet (a, b, c, d, etc.). ''accelerate'' has 10 letters and I want to match at least the first 8; ''maimuta'' has 7 letters and I want to match at least the first 5, which would give: maimuca1, maimute@123, etc.
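
A rough sketch of that rule, reading "10 letters, match at least 8" and "7 letters, match at least 5" as "all but the last two letters". The names `is_derived`, `slack` and `min_prefix` are made up, and the floor of 3 letters is an assumption added so that a one-letter entry such as ''o'' can never match. Under this reading accelerite would not be kept, because it shares only 7 leading letters with accelerate; it would need `slack=3`.

```python
def common_prefix_len(a, b):
    """Number of leading characters that a and b have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count

def is_derived(candidate, dictionary_word, slack=2, min_prefix=3):
    """True if candidate shares all but `slack` leading letters of dictionary_word."""
    word = dictionary_word.lower()
    # Require at least min_prefix letters so very short words cannot match.
    needed = max(len(word) - slack, min_prefix)
    return common_prefix_len(candidate.lower(), word) >= needed

results = [is_derived(c, "accelerate") for c in
           ["accelera12342432@#", "accelerati@344", "acceleram_123456", "accelerite"]]
```

Running this check for every dictionary word against every line of a 120 MB file will be very slow; it only illustrates the comparison itself.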
0
 
LVL 45

Expert Comment

by:aikimark
ID: 41884211
This is scope creep.  Stop it.
0
 

Author Comment

by:john lambert
ID: 41884222
It doesn't matter if it is very, very slow; this is what I want. Who can do that, please?
0
 
LVL 45

Expert Comment

by:aikimark
ID: 41884228
@John

this is what i want
That's all well and good, but you asked a question and I have answered the question.  If you need something different, then close this question and ask a related question (please use the link).

If you want to hire someone to do that for you, visit Gigs or Live here at EE.
https://www.experts-exchange.com/gigs/
https://www.experts-exchange.com/live/
1
 

Author Comment

by:john lambert
ID: 41884231
thanks
0
 
LVL 45

Expert Comment

by:aikimark
ID: 41884245
Maybe you can try removing single letters from your word list.
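
A sketch of that filter, using the small dex.txt sample posted above. The cutoff of 3 letters and the name `drop_short_words` are assumptions; pick whatever minimum length suits the dictionary.

```python
def drop_short_words(words, min_length=3):
    """Keep only dictionary words with at least min_length characters."""
    return [w.strip() for w in words if len(w.strip()) >= min_length]

dex_words = ["o", "acces", "mariana", "maria", "marian",
             "lavinia", "calculator", "romania"]
prefixes = tuple(w.lower() for w in drop_short_words(dex_words))

mixed = ["oana123", "john", "acces@123", "lavinia2016", "romania1@"]
# With "o" removed from the prefixes, oana123 is no longer kept.
kept = [line for line in mixed if line.lower().startswith(prefixes)]
```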
0
 

Author Comment

by:john lambert
ID: 41884249
My country's word list is 900 KB and the mixed word list is 120 MB: about 70% American words and maybe 20% words from my country. I want to extract those, which is why I need a ''nice'' match.
0
 
LVL 37

Expert Comment

by:Gerwin Jansen
ID: 41884990
Would you consider grep as an alternative to Python, or must this be a pure Python solution?

If grep is an option: you can use a pattern file (your dex.txt) to search for / filter your other (bigger) file. Patterns in your dex.txt must start with ^ to search for patterns at the beginning of the line, case insensitive search is also possible.

Example dex.txt:

^o
^acces
^mariana
^maria
^marian
^lavinia
^calculator
^romania

Example big_mix.txt:

oana123
john
marcelo
ioana@1
acces@123
OOtoo
acessu
liam2015
lavinia2016
romania1@
milano
inter
intern
lama
craca
pula

Command:
grep -f dex.txt big_mix.txt

Result:

oana123
acces@123
lavinia2016
romania1@

And case insensitive:
grep -i -f dex.txt big_mix.txt

Result:

oana123
acces@123
OOtoo
lavinia2016
romania1@
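
For comparison, a rough Python sketch of the same filtering, for cases where grep is not at hand. `to_grep_patterns` and `grep_lines` are hypothetical helpers; Python's `re` syntax is not identical to grep's, but for plain-letter patterns like these the two should agree. The first helper also shows how the `^` prefixes can be generated from an unmodified dex.txt word list.

```python
import re

def to_grep_patterns(words):
    """Prefix each non-empty word with ^ so it only matches at the start of a line."""
    return ["^" + w.strip() for w in words if w.strip()]

def grep_lines(patterns, lines, ignore_case=False):
    """Return the lines matched by at least one pattern (similar to grep -f, or -i -f)."""
    flags = re.IGNORECASE if ignore_case else 0
    compiled = [re.compile(p, flags) for p in patterns]
    return [line for line in lines if any(c.search(line) for c in compiled)]

patterns = to_grep_patterns(["o", "acces", "mariana", "maria", "marian",
                             "lavinia", "calculator", "romania"])
big_mix = ["oana123", "john", "marcelo", "ioana@1", "acces@123", "OOtoo",
           "acessu", "liam2015", "lavinia2016", "romania1@", "milano",
           "inter", "intern", "lama", "craca", "pula"]
case_sensitive = grep_lines(patterns, big_mix)
case_insensitive = grep_lines(patterns, big_mix, ignore_case=True)
```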
0
 

Author Comment

by:john lambert
ID: 41885493
Does this work with Python on Windows too?
0
 
LVL 37

Expert Comment

by:Gerwin Jansen
ID: 41885529
My suggestion is to use grep, not Python.

Do you have a Windows or Linux environment for this question? If it is Linux, then I suggest you use the grep solution.

But Python is available on Windows as well:
https://www.python.org/downloads/windows/
0
 

Author Comment

by:john lambert
ID: 41885530
I use Windows.
0
 
LVL 37

Accepted Solution

by:
Gerwin Jansen earned 500 total points
ID: 41885562
OK, to run grep on Windows you have a few options:

- run a virtual machine with Linux
- install cygwin (https://cygwin.com/install.html)

But as said, Python runs on Windows just fine.

You could try both and see what's best for your purpose (why do you do this btw?).
0
 

Author Closing Comment

by:john lambert
ID: 41885563
Thank you. By the way, do you know how to use regular expressions?
0
 
LVL 37

Expert Comment

by:Gerwin Jansen
ID: 41885588
Thanks but what regular expression do you mean?
0
