Need to turn documents into data sources

Posted on 2011-10-09
Last Modified: 2012-05-12
I wish to analyze some speeches for frequency of words.  I tried to import the speeches in foxpro as space delimited files but it is not functioning properly.  Tried separating every word with comma, then with quotes around words and commas between.

Quite frustrating.  I am now seeking other methods.  I could write a program to read the file and put them into the database, but there has to easier one step processes out there.

Thanks in advance for the assistance.  No this is not homework.
Question by:jjackson2004
    LVL 12

    Accepted Solution

    I could write a program to read the file and put them into the database, but there has to easier one step processes out there.

    While there is an outside possibility that there might be something out there, I personally doubt it.

    General text such as you would find in speeches is comprised of multiple sentences of varying length and would also contain intermixed punctuation characters.   As such there would be no 'standard' line of text that you could use to create a 'standardized' data table structure into which to APPEND the text data.

    If you just wanted to import every word individually into a data table, you could use FILETOSTR() to put the data into a single character string (as long as it did not exceed the max char limitations).  Then change all of the spaces into CrLf's so that each separate word was represented on its own 'record'/line.  Then write it all back out to a file with STRTOFILE() and do your APPEND FROM all into a single field data table.

    BUT that's a long way around and would most likely exceed the max char count and therefore fail.  It would also be the long way around when, as you say, you could - write a program to read the file and put them into the database   Which I think would be the best approach to use.

    Use low-level commands to open the text file, acquire individual strings of text data and then parse out the individual words and put them into your data table.

    Another advantage of doing it with the low-level parsing approach would be that as you SEEK to put the individual words into the data table, you could, at that time, do your word counting.  

    Otherwise, if you COULD find a way, you would have to import all of the individual words separately including the duplicated words and then, after the fact, do your word counts.

    Good Luck
    LVL 7

    Expert Comment

    If you were loading to SQL server, you could load each line of text into a field and then create full text indexes, then do text searches. Even in Foxpro, it should be easy to import line at a could then do searches across the lines, though this would be a little slow since you would always be doing table space scans.
    LVL 27

    Assisted Solution

    All you need is one table really.

    word.dbf (word C(100), count N(10)) and index it on word.

    Use FOPEN() to read the file low level or FILETOSTR() to read it into a variable in memory and parse it in memory.

    You need to change all ",!?:;." and other punctuation to spaces using STRTRAN. Then you need to change all your line feeds CHR(10) and carriage returns CHR(13) also to spaces. At the end you keep replacing the double spaces with one space.
    STRTRAN(cString,CHR(13)," ")

    Now you will have all your words delimited with space. You can alternately choose anyother delimiter but it's visually most appealing at this stage to have them separated by space.

    Another method is to use IMPORT from a mono-delimited file to append the words into a temporary cursor.

    Others have done this for creating an index to the documents and discarding common words: "a", "the", "I", ...
    LVL 29

    Assisted Solution

    by:Olaf Doschke
    "there has to easier one step processes out there."

    You're not a programmer, are you? A single step process to import a text into a table with a word per record?
    Programming languages, including those of databases, are general ipurpose  languages, if you're thinking a programming language should have such things as a single command, then you could come up with the next thing a language should be able to do in a single command. That would result in computer languages having far more commands than a natural language has words.

    Your description of the problem took you some sentences and you expect a programming language to have a single command for that? What you ask for is a very specialised purpose, this would not fit for any normal purpose data processing.

    If we just talk about ascii data table formats, like csv, the seperation you tried to misuse is the seperation of columns, not of rows, rows are always delimited by lines. So if you format your speeches to a word per line you could import them via import or append. Otherwise you need a table with as many columns as the longes line of your speech has words. And this won't help you, as you then would have non normalized data. You'd have same data (words) in many columns.

    A database can be used to support word processing, but a word processor is a different thing than a database.

    Foxpro supports you in getting at the single words of a text be getwordcount() and getwordnum(). With a larger text you rather would use ALINES() with a space as delimiter to seperate all words into array elements and then put them in a table.

    ALines(aWord,Filetostr("yourspeech.txt"),1+4," ")
    ? aWord(1)
    ? aWord(2)

    Like Cyril already said you will want to preprocess the speech, eg replace line feeds carriage return and other chars, also punctuation with spaces, then remove double spaces and then seperate the text into single words via above alines call.

    Also have a read on

    Bye, Olaf.
    LVL 41

    Expert Comment

    The loading and counting program is really simple:
    CREATE TABLE MyWords (MyWord varchar(20), MyWordCnt integer)
    INDEX ON MyWord TAG MyWord
    lcMySpeach = FILETOSTR("d:\path\speachfile.ext") && Suppose the file is not larger than 16 MB
    lcSeparators = " ,.?!" && add more separators
    lnWordCount = GetWordCount(lcMySpeach, lcSeparators)
    FOR lnI = 1 TO lnWordCount
      lcWord = LOWER(GetWordNum(lcMySpeach, lnI, lcSeparators))
      IF !EMPTY(lcWord)
        IF SEEK(lcWord)
          REPLACE MyWordCnt WITH MyWordCnt+1
          INSERT INTO MyWords VALUES (lcWord, 1)

    Open in new window

    And finally ... you should recognize single word declinations which are in each language and count them together. You may also optimize the loading code as Olaf suggested.

    Author Comment

    I meant to give 150 point to pcelba but he did not show up on the points award page.  Mea Culpa
    LVL 41

    Expert Comment

    It seems I am on the EE black list :-)).
    LVL 41

    Expert Comment

    But you may donate my PayPal account :-))))

    Write Comment

    Please enter a first name

    Please enter a last name

    We will never share this with anyone.

    Featured Post

    What Should I Do With This Threat Intelligence?

    Are you wondering if you actually need threat intelligence? The answer is yes. We explain the basics for creating useful threat intelligence.

    Preface: When I started this series, I used the term CommandBars because that is the Office Object class that it discusses. Unfortunately, when Microsoft introduced Office 2007, they replaced the standard Commandbar menus with "The Ribbon" and rem…
    Entity Framework is a powerful tool to help you interact with the DataBase but still doesn't help much when we have a Stored Procedure that returns more than one resultset. The solution takes some of out-of-the-box thinking; read on!
    This video shows where to find the word count, how to display it, and what it breaks down to in Microsoft Word.
    In a previous video Micro Tutorial here at Experts Exchange (, I explained how to get a free, one-month trial of Office …

    779 members asked questions and received personalized solutions in the past 7 days.

    Join the community of 500,000 technology professionals and ask your questions.

    Join & Ask a Question

    Need Help in Real-Time?

    Connect with top rated Experts

    15 Experts available now in Live!

    Get 1:1 Help Now