
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 523

Need to turn documents into data sources

I wish to analyze some speeches for word frequency.  I tried to import the speeches into FoxPro as space-delimited files, but it is not working properly.  I also tried separating every word with commas, then with quotes around the words and commas between them.

Quite frustrating.  I am now looking for other methods.  I could write a program to read the file and put the words into the database, but there has to be an easier one-step process out there.

Thanks in advance for the assistance.  No this is not homework.
3 Solutions
"I could write a program to read the file and put them into the database, but there has to be an easier one-step process out there."

While there is an outside possibility that there might be something out there, I personally doubt it.

General text such as you would find in speeches is composed of multiple sentences of varying length and also contains intermixed punctuation characters.  As such, there is no 'standard' line of text that you could use to create a 'standardized' data table structure into which to APPEND the text data.

If you just wanted to import every word individually into a data table, you could use FILETOSTR() to read the data into a single character string (as long as it did not exceed the maximum string length).  Then change all of the spaces into CR/LF pairs so that each word sits on its own 'record'/line.  Then write it all back out to a file with STRTOFILE() and APPEND FROM it into a single-field data table.
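That round trip might look roughly like this (a sketch only; the file names "speech.txt" and "words.txt" and the table name AllWords are placeholders, not from the question):

```foxpro
* Sketch of the FILETOSTR/STRTOFILE round trip described above.
* File and table names are placeholders.
lcText = FILETOSTR("speech.txt")
* Turn every space into CR+LF so each word lands on its own line
lcText = STRTRAN(lcText, " ", CHR(13) + CHR(10))
STRTOFILE(lcText, "words.txt")
CREATE TABLE AllWords (Word C(100))
APPEND FROM words.txt SDF  && one word per record
```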

BUT that's the long way around and would most likely exceed the maximum string length and therefore fail.  It would also be the long way around when, as you say, you could "write a program to read the file and put them into the database", which I think would be the best approach to use.

Use low-level commands to open the text file, acquire individual strings of text data and then parse out the individual words and put them into your data table.

Another advantage of the low-level parsing approach is that as you SEEK each word before putting it into the data table, you could do your word counting at that same time.

Otherwise, if you COULD find such a tool, you would have to import all of the individual words separately, duplicates included, and then do your word counts after the fact.
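A minimal sketch of that low-level read-and-count approach (the file name "speech.txt" and the WordFreq table layout are assumptions for illustration):

```foxpro
* Low-level read-and-count sketch; assumes punctuation has already
* been stripped from speech.txt. File and table names are assumed.
CREATE TABLE WordFreq (Word C(100), Cnt N(10))
INDEX ON Word TAG Word

lnHandle = FOPEN("speech.txt")
DO WHILE !FEOF(lnHandle)
   lcLine = FGETS(lnHandle)   && reads one line of the text file
   FOR lnI = 1 TO GETWORDCOUNT(lcLine)
      lcWord = LOWER(GETWORDNUM(lcLine, lnI))
      IF SEEK(lcWord)         && word already present: bump its count
         REPLACE Cnt WITH Cnt + 1
      ELSE
         INSERT INTO WordFreq VALUES (lcWord, 1)
      ENDIF
   ENDFOR
ENDDO
= FCLOSE(lnHandle)
```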

Good Luck
If you were loading into SQL Server, you could load each line of text into a field, create full-text indexes, and then do text searches. Even in FoxPro, it should be easy to import a line at a time; you could then search across the lines, though this would be a little slow since you would always be doing table scans.
All you need is one table really.

word.dbf (word C(100), count N(10)) and index it on word.

Use FOPEN() to read the file low level or FILETOSTR() to read it into a variable in memory and parse it in memory.

You need to change all ",!?:;." and other punctuation to spaces using STRTRAN(). Then you need to change all line feeds CHR(10) and carriage returns CHR(13) to spaces as well. At the end, keep replacing double spaces with single spaces until none remain.
cString = STRTRAN(cString, CHR(13), " ")

Now you will have all your words delimited by spaces. You could choose any other delimiter, but at this stage it is visually most appealing to have them separated by spaces.
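Put together, that cleanup could be written like this (a sketch; lcText is an assumed variable holding the speech, and CHRTRAN() is used here to map all the listed characters to spaces in one call instead of repeated STRTRAN() calls):

```foxpro
* Map punctuation, CR and LF to spaces in one pass, then
* collapse runs of spaces down to single spaces.
* lcText is assumed to already hold the speech text.
lcText = CHRTRAN(lcText, ",!?:;." + CHR(13) + CHR(10), SPACE(8))
DO WHILE "  " $ lcText
   lcText = STRTRAN(lcText, "  ", " ")
ENDDO
```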

Another method is to use IMPORT or APPEND FROM on the single-word-per-line file to append the words into a temporary cursor.

Others have done this for creating an index to the documents and discarding common words: "a", "the", "I", ...

Olaf Doschke, Software Developer, commented:
"there has to be an easier one-step process out there."

You're not a programmer, are you? A single step process to import a text into a table with a word per record?
Programming languages, including those of databases, are general-purpose languages. If you think a programming language should have such a thing as a single command, then you could always come up with the next thing a language should be able to do in a single command. That would result in computer languages having far more commands than a natural language has words.

Your description of the problem took you several sentences, and you expect a programming language to have a single command for that? What you ask for is a very specialised purpose; it would not fit any normal-purpose data processing.

If we just talk about ASCII data table formats like CSV, the separation you tried to misuse is the separation of columns, not of rows; rows are always delimited by lines. So if you format your speeches with one word per line, you can import them via IMPORT or APPEND. Otherwise you would need a table with as many columns as the longest line of your speech has words. And this won't help you, as you would then have non-normalized data: the same kind of data (words) spread across many columns.

A database can be used to support word processing, but a word processor is a different thing than a database.

FoxPro supports getting at the single words of a text via GETWORDCOUNT() and GETWORDNUM(). With a larger text you would rather use ALINES() with a space as delimiter to separate all the words into array elements and then put them into a table.

ALINES(aWord, FILETOSTR("yourspeech.txt"), 1+4, " ")
? aWord[1]
? aWord[2]

As Cyril already said, you will want to preprocess the speech, e.g. replace line feeds, carriage returns and other characters, including punctuation, with spaces, then remove double spaces, and then separate the text into single words via the above ALINES() call.
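After that preprocessing, the ALINES() result can be pushed into a table in one short loop (a sketch; the table name Words and field width are assumptions):

```foxpro
* Split the cleaned text into an array of words, then
* append each element to a one-column table.
* Table name and field width are assumed for illustration.
CREATE TABLE Words (Word C(100))
lnWords = ALINES(aWord, FILETOSTR("yourspeech.txt"), 1+4, " ")
FOR lnI = 1 TO lnWords
   INSERT INTO Words VALUES (aWord[lnI])
ENDFOR
```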

Also have a read on http://stevenblack.com/text.html.

Bye, Olaf.
The loading and counting program is really simple:
CREATE TABLE MyWords (MyWord varchar(20), MyWordCnt integer)
INDEX ON MyWord TAG MyWord  && needed so SEEK() below can find existing words

lcMySpeach = FILETOSTR("d:\path\speachfile.ext") && Suppose the file is not larger than 16 MB

lcSeparators = " ,.?!" && add more separators as needed

lnWordCount = GETWORDCOUNT(lcMySpeach, lcSeparators)

FOR lnI = 1 TO lnWordCount
  lcWord = LOWER(GETWORDNUM(lcMySpeach, lnI, lcSeparators))
  IF !EMPTY(lcWord)
    IF SEEK(lcWord)
      REPLACE MyWordCnt WITH MyWordCnt + 1
    ELSE
      INSERT INTO MyWords VALUES (lcWord, 1)
    ENDIF
  ENDIF
ENDFOR


And finally ... you should recognize inflected forms of words, which exist in every language, and count them together. You may also optimize the loading code as Olaf suggested.
jjackson2004 (Author) commented:
I meant to give 150 points to pcelba, but he did not show up on the points award page.  Mea culpa.
It seems I am on the EE black list :-)).
But you may donate to my PayPal account :-))))
