Solved

How to use PHP to read table data from a Word (.doc) file

Posted on 2013-12-13
Last Modified: 2014-04-06
I need to import data from a Word document using PHP.
The data is in a table inside the Word (.doc) file.
Does anyone know how to do this?

Thanks
Question by:Tim
15 Comments
 
LVL 53

Expert Comment

by:COBOLdinosaur
ID: 39718892
If the Word doc is saved as HTML, a fair amount of parsing may be able to make sense of it. In general, though, a Word document is so loaded with proprietary codes, controls, and formatting that it is not suitable for use by anything outside of Office, and sometimes even other Office components have compatibility problems with it.

Cd&
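
For anyone who wants to try the save-as-HTML route, here is a minimal sketch of what the parsing might look like. It assumes the document has already been saved from Word as HTML and that the data sits in ordinary tr/td elements; the function name htmlTableToArray is made up for this example. It also assumes the HTML declares its own character set, because DOMDocument falls back to ISO-8859-1 otherwise.

```php
<?php
// Sketch only: read table rows from a Word document that was saved as HTML.
// htmlTableToArray is a hypothetical helper name, not part of any library.
function htmlTableToArray(string $html): array
{
    $dom = new DOMDocument();
    // Word's HTML export is rarely valid, so collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows = [];
    foreach ($dom->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->childNodes as $node) {
            if ($node instanceof DOMElement && in_array($node->nodeName, ['td', 'th'], true)) {
                // Collapse the line breaks and padding Word leaves inside cells.
                $cells[] = trim(preg_replace('/\s+/u', ' ', $node->textContent));
            }
        }
        if ($cells) {
            $rows[] = $cells;
        }
    }
    return $rows;
}

$html = '<table><tr><th>Level</th><th>Area/Room</th></tr>'
      . '<tr><td>Platform</td><td>21Y-68</td></tr></table>';
print_r(htmlTableToArray($html));
```

Empty cells stay in the result as empty strings, so every row keeps the same number of columns as the table. Rows from nested tables would also be picked up, so a document with tables inside tables would need extra filtering.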
0
 
LVL 109

Expert Comment

by:Ray Paseur
ID: 39719146
Please post the test data and show us what you want to get for output, thanks. ~Ray
0
 

Author Comment

by:Tim
ID: 39722276
Comment edited to put the code into the code snippet ~Ray

I use this code to read the word document.

function parseWord($userDoc) 
{
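    // Read the whole file as raw bytes; give up if it cannot be read.
    $word_text = @file_get_contents($userDoc);
    if( $word_text === false )
    {
        return "";
    }
    $line = "";
    $tam = strlen($word_text);
    $nulos = 0;
    $caracteres = 0;
    // Start at offset 1536, where this script assumes the text begins.
    for($i=1536; $i<$tam; $i++)
    {
        $line .= $word_text[$i];

        // Compare with the NUL byte itself; "== 0" compares a string with an integer.
        if( $word_text[$i] === "\0" )
        {
            $nulos++;
        }
        else
        {
            $nulos=0;
            $caracteres++;
        }

        // A long run of NUL bytes is taken as the end of the text.
        if( $nulos>1996)
        {   
            break;  
        }
    }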

    //echo $caracteres;

    $lines = explode(chr(0x0D),$line);
    //$outtext = "<pre>";

    $outtext = "";
    foreach($lines as $thisline)
    {
        $tam = strlen($thisline);
        if( !$tam )
        {
            continue;
        }

        $new_line = ""; 
        for($i=0; $i<$tam; $i++)
        {
            $onechar = $thisline[$i];
            if( $onechar > chr(240) )
            {
                continue;
            }

            if( $onechar >= chr(0x20) )
            {
                $caracteres++;
                $new_line .= $onechar;
            }

            if( $onechar == chr(0x14) )
            {
                $new_line .= "</a>";
            }

            if( $onechar == chr(0x07) )
            {
                $new_line .= "\t";
                if( isset($thisline[$i+1]) )
                {
                    if( $thisline[$i+1] == chr(0x07) )
                    {
                        $new_line .= "\n";
                    }
                }
            }
        }
        //troca por hiperlink
        $new_line = str_replace("HYPERLINK" ,"<a href=",$new_line); 
        $new_line = str_replace("\o" ,">",$new_line); 
        $new_line .= "\n";

        //link de imagens
        $new_line = str_replace("INCLUDEPICTURE" ,"<br><img src=",$new_line); 
        $new_line = str_replace("\*" ,"><br>",$new_line); 
        $new_line = str_replace("MERGEFORMATINET" ,"",$new_line); 


        $outtext .= nl2br($new_line);
    }

 return $outtext;
} 

$userDoc = "customers.doc";
$text = parseWord($userDoc);


I also use this code to build the data array:
$olines=explode("<br />", $text);

$slines=array();
foreach( $olines as $orow ){
	// Cells are tab-separated; keep only rows with more than five cells.
	$ovalue=explode("\t", $orow);
	if(count($ovalue)>5)$slines[]=$ovalue;
}
print_r($slines);


This shows the data, but the result is not good enough.
Can anyone help me?

Thanks
0

 
LVL 109

Expert Comment

by:Ray Paseur
ID: 39722914
Thanks for posting that code.  I don't know if it works or can be made to work because I do not have any test data and I do not know exactly what output you want to get from your test data.  Without the test data, we would just be wasting your time by guessing.  Please post the test data and show us what you want to get for output.  This article explains why we want test data.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_7830-A-Quick-Tour-of-Test-Driven-Development.html

Thanks, ~Ray
0
 

Author Comment

by:Tim
ID: 39724086
Please check the attachment; it is the data file.

I would like to get the data as an array,
but the array I get now is not correct.

Array
(
    [0] => Array
        (
            [0] => x €é(¼ê ìÞÞx ÞÞÞÞÞ$$x ÞÞÞy$ÞÞÞÞé(ÞÞÞÞÞÞÞÞÞr ’</a>:Platform
            [1] => 21Y-68
            [2] => Ceiling
            [3] => Plaster
            [4] => 1500sf
            [5] => 0
            [6] => 0
            [7] => PC1.2% Chrysotile
            [8] => No
            [9] => Yes
            [10] => C
            [11] => YUS-S03-AS02
            [12] => Not identified
            [13] => See Photograph #1
            [14] => Condition to be reassessed annually.
            [15] =>
        )

    [1] => Array
        (
            [0] =>

            [1] => Platform
            [2] => 21Y-69
            [3] => Ceiling
            [4] => Plaster
            [5] => 1500sf
            [6] => 0
            [7] => 0
            [8] =>
        )

    [2] => Array
        (
            [0] =>

            [1] => No
            [2] => Yes
            [3] => C
            [4] => Visually similar to YUS-S03-S02
            [5] => Not identified
            [6] =>
        )

    [3] => Array
        (
            [0] =>

            [1] => Platform
            [2] => 21Y-73
            [3] => Ceiling
            [4] => Plaster
            [5] => Xx
            [6] => 0
            [7] => 0
            [8] =>
        )

    [4] => Array
        (
            [0] =>

            [1] => No
            [2] => Yes
            [3] => C
            [4] => Visually similar to YUS-S03-S02
            [5] => Not identified
            [6] =>
        )

    [5] => Array
        (
            [0] =>

            [1] => Platform
            [2] => 21Y-74
            [3] => Ceiling
            [4] => Plaster
            [5] => Xx
            [6] => 0
            [7] => 0
            [8] =>
        )

    [6] => Array
        (
            [0] =>

            [1] => No
            [2] => Yes
            [3] => C
            [4] => Visually similar to YUS-S03-S02
            [5] => Not identified
            [6] =>
        )

    [7] => Array
        (
            [0] =>

            [1] => Platform
            [2] => 21Y-75
            [3] => Ceiling
            [4] => Plaster
            [5] => Xx
            [6] => 0
            [7] => 0
            [8] =>
        )

    [8] => Array
        (
            [0] =>

            [1] => No
            [2] => Yes
            [3] => C
            [4] => Visually similar to YUS-S03-S02
            [5] => Not identified
            [6] =>
        )

    [9] => Array
        (
            [0] =>

            [1] => Concourse
            [2] => Chelsea NY
            [3] => Ceiling
            [4] => Acoustic Ceiling Tile
            [5] => NQ
            [6] => 0
            [7] => 0
            [8] => SACM
            [9] => Yes
            [10] => Yes
            [11] => C
            [12] => Not sampled due to entry restrictions
            [13] =>
        )

    [10] => Array
        (
            [0] =>

            [1] => Concourse
            [2] => Lotto Centre
            [3] => Ceiling
            [4] => Acoustic Ceiling Tile
            [5] => NQ
            [6] => 0
            [7] => 0
            [8] => SACM
            [9] => Yes
            [10] => Yes
            [11] => C
            [12] => Not sampled due to entry restrictions
            [13] =>
        )

    [11] => Array
        (
            [0] =>

            [1] => Concourse
            [2] => Cinnabon
            [3] => Ceiling
            [4] => Acoustic Ceililng Tile
            [5] => 100sf
            [6] => 0
            [7] => 0
            [8] => SACM
            [9] => Yes
            [10] => Yes
            [11] => C
            [12] => Not sampled due to entry restrictions
            [13] =>
        )

    [12] => Array
        (
            [0] =>

            [1] => Concourse
            [2] => Gateway News Stand
            [3] => Ceiling
            [4] => Acoustic Ceiling Tile
            [5] => NQ
            [6] => 0
            [7] => 0
            [8] => SACM
            [9] => Yes
            [10] => Yes
            [11] => C
            [12] => Not sampled due to entry restrictions
            [13] =>
        )

    [13] => Array
        (
            [0] =>

            [1] => Concourse
            [2] => Rainbow ‘n’ Things
            [3] => Ceiling
            [4] => Acoustic Ceiling Tile
            [5] => NQ
            [6] => 0
            [7] => 0
            [8] => SACM
            [9] => Yes
            [10] => Yes
            [11] => C
            [12] => Not sampled due to entry restrictions
            [13] =>
        )

    [14] => Array
        (
            [0] =>

            [1] => Concourse
            [2] => 51P-330 Elevator Machine Rm.
            [3] => Piping
            [4] => TransiteTM
            [5] => xx
            [6] => 0
            [7] => 0
            [8] => SACM
            [9] => No
            [10] => Yes
            [11] => B
            [12] => NS, SACM
            [13] => Not identified
            [14] =>
        )

    [15] => Array
        (
            [0] =>

            [1] => Platform
            [2] => Line Southbound Platform
            [3] => Ceiling above luxalon
            [4] => Sprayed fire proofing
            [5] => 1500sf
            [6] => 0
            [7] => 0
            [8] => PC 4.8% Chrysotile
            [9] => Yes
            [10] => Yes
            [11] => C
            [12] => Not sampled
            [13] => Sampled. Sample # 1759489-008. Asbestos content PC 4.8% Chrysotile
            [14] => Coffey did not resample, but presence was confirmed by visual inspection.
        )

    [16] => Array
        (
            [0] =>

            [1] => Platform
            [2] =>  Line Southbound Platform
            [3] => Ceiling above luxalon
            [4] => Sprayed fire proofing
            [5] => 1500sf
            [6] => 0
            [7] => 0
            [8] => PC 5.1% Chrysotile
            [9] => Yes
            [10] => Yes
            [11] => C
            [12] => Not sampled
            [13] => Sampled.  Sample # 1759495-014. Asbestos content PC 5.1% Chrysotile
            [14] => Coffey did not resample, but presence was confirmed by visual inspection.
        )

    [17] => Array
        (
            [0] =>

            [1] => Level
            [2] => Area/Room
            [3] => System Component
            [4] => Component Material
            [5] => Condition (Estimated Quantity)***
            [6] => Asbestos Content
            [7] => Friable?
            [8] => Visible?
            [9] => Access.
            [10] => Coffey's Sample Number**
            [11] => Pinchin Report Findings
            [12] => Comments/ Notes
            [13] => Recommendations
            [14] =>
        )

)
customers.doc
0
 
LVL 109

Expert Comment

by:Ray Paseur
ID: 39724217
OK, I can read the document.
http://www.laprbass.com/RAY_temp_zcfyhome.php

Now can you please show me what you want to extract from the document?  Thanks, ~Ray
0
 

Author Comment

by:Tim
ID: 39724261
The array has some issues.
array[0][0] starts with " x €é(¼ê ìÞÞx ÞÞÞÞÞ$$x ÞÞÞy$ÞÞÞÞé(ÞÞÞÞÞÞÞÞÞr ’</a>:"
and I am not sure why.

array[1] and array[2] look like one table row broken into two arrays.

The array also does not represent the table data well:
the table has 15 columns, but the array rows do not.

I have a lot of data,
so it is hard to check manually.

Could you help?
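
A possible explanation for the broken rows, offered as a guess from the output rather than a confirmed diagnosis: parseWord() treats two 0x07 bytes in a row as the end of a table row, but an empty cell followed by the next cell mark produces the same two bytes. The row with 21Y-69 has an empty Asbestos Content cell, which is exactly where it splits. One way around this, sketched below, is to split on every 0x07 and regroup the cells by a known column count. It assumes each row is stored as one 0x07 per cell plus one extra 0x07 for the row end, and that the stream passed in contains only table text. The name splitCellsIntoRows is made up for this example.

```php
<?php
// Sketch only: regroup a flat stream of 0x07-terminated cells into rows.
// Assumes every row is $columns cell marks followed by one row-end mark.
// splitCellsIntoRows is a hypothetical helper name.
function splitCellsIntoRows(string $stream, int $columns): array
{
    // Every cell ends with 0x07, so the piece after the last mark is dropped.
    $cells = explode("\x07", $stream);
    array_pop($cells);

    $rows = [];
    foreach (array_chunk($cells, $columns + 1) as $chunk) {
        // Keep complete rows only, and drop the trailing row-end entry.
        if (count($chunk) === $columns + 1) {
            $rows[] = array_map('trim', array_slice($chunk, 0, $columns));
        }
    }
    return $rows;
}

// Two rows of three columns; the first row ends with an empty cell.
$sample = "Platform\x0721Y-69\x07\x07\x07" . "Platform\x0721Y-73\x07PC\x07\x07";
print_r(splitCellsIntoRows($sample, 3));
```

Because the rows are cut by position instead of by looking for a doubled mark, empty cells stay in place as empty strings. The column count has to be right for the whole table, so a table with merged cells would still come out misaligned.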
0
 
LVL 27

Accepted Solution

by:
skullnobrains earned 250 total points
ID: 39726635
Parsing Word docs from PHP is prone to fail when the file was saved with a different Word version or when the user copy-pastes formatted text.

Can you ask the users to use a different format?

Did you try running antiword on the source file to convert it to text, and then parsing that text? It seems likely to produce more stable results.
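
A rough sketch of the antiword route, in case it helps. It assumes the antiword binary is installed and on the PATH, and that it prints table rows as lines of cells separated by "|" characters, which is worth checking against the real output for this document. Both function names are made up for this example.

```php
<?php
// Sketch only: convert the .doc to text with antiword, then split table rows.
// antiwordText and parsePipeRows are hypothetical helper names.
function antiwordText(string $docPath): string
{
    // Assumes the antiword binary is installed and reachable on the PATH.
    $out = shell_exec('antiword ' . escapeshellarg($docPath));
    return is_string($out) ? $out : '';
}

function parsePipeRows(string $text): array
{
    $rows = [];
    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        // Treat only lines that start and end with "|" as table rows.
        if (strlen($line) < 2 || $line[0] !== '|' || substr($line, -1) !== '|') {
            continue;
        }
        $rows[] = array_map('trim', explode('|', substr($line, 1, -1)));
    }
    return $rows;
}

// Demonstration on a hand-written sample instead of real antiword output.
$sample = "Some heading\n|Platform |21Y-68 |Ceiling |\n|Platform |21Y-69 |        |\n";
print_r(parsePipeRows($sample));
```

With the real file the call would be parsePipeRows(antiwordText('customers.doc')). Long cell text is likely to wrap onto several output lines, so consecutive lines may need to be merged back into one row, and a cell that itself contains "|" would be split incorrectly.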
0
 
LVL 109

Assisted Solution

by:Ray Paseur
Ray Paseur earned 250 total points
ID: 39734769
Just a thought... Are you running PHP on Windows?  If so, you may be able to use PHP's COM extension to have Word itself read the document.
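
To illustrate the COM idea, here is a sketch that has not been tested against the customers.doc file. It assumes PHP on Windows with the com_dotnet extension enabled and Microsoft Word installed on the same machine. The function names are made up for this example; the Word calls (Documents->Open, Tables, Cell, Range->Text) come from Word's automation object model.

```php
<?php
// Sketch only: read the first table of a document through Word automation.
// cleanCellText and readFirstTableViaCom are hypothetical helper names.
function cleanCellText(string $text): string
{
    // Word ends each cell's Range->Text with a CR plus a 0x07 end-of-cell mark.
    return trim(str_replace(["\r", "\x07"], ["\n", ''], $text));
}

function readFirstTableViaCom(string $absolutePath): array
{
    $word = new COM('Word.Application'); // Windows only, needs Word installed
    $word->Visible = false;
    $rows = [];
    try {
        $doc = $word->Documents->Open($absolutePath);
        $table = $doc->Tables(1);
        for ($r = 1; $r <= $table->Rows->Count; $r++) {
            $row = [];
            for ($c = 1; $c <= $table->Columns->Count; $c++) {
                try {
                    $row[] = cleanCellText((string) $table->Cell($r, $c)->Range->Text);
                } catch (com_exception $e) {
                    $row[] = null; // merged cells have no (row, column) address
                }
            }
            $rows[] = $row;
        }
        $doc->Close(false); // close without saving changes
    } finally {
        $word->Quit();
    }
    return $rows;
}

// The text clean-up can be tried without Word:
var_dump(cleanCellText("Platform\r\x07"));
```

On a Windows server the call would look like readFirstTableViaCom('C:\\data\\customers.doc'), where the path is only an example. Word needs a full path, and starting Word for every request is slow, so this fits a one-off import better than a live page.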
0
 

Author Closing Comment

by:Tim
ID: 39976478
Thanks all for help
0
 
LVL 109

Expert Comment

by:Ray Paseur
ID: 39976503
We need an explanation of the bad grade.  Please see the grading guidelines here:
http://support.experts-exchange.com/customer/portal/articles/481419
0
 
LVL 27

Expert Comment

by:skullnobrains
ID: 39976957
I don't care about the grade, but feel free to post information about what you ended up with.
0
 

Author Comment

by:Tim
ID: 39977960
Sorry about the grade. I was not sure what the grade meant and just clicked. Anyway, thanks to both of you.
I gave up on this approach after hours of trying; it does not look like a good idea.
0
 

Author Comment

by:Tim
ID: 39977964
If anyone can change the grade, please do.
0

Question has a verified solution.
