Solved

Python scraper

Posted on 2014-12-21
Last Modified: 2014-12-27
I am trying to scrape a website for some information. I found a script and tried to convert it to Python, but the converted code still has errors. Can anyone help me fix them? Thanks.

def scrapeEarningsZacks_(Stock=None,*args,**kwargs):
    varargin = cellarray(args)
    nargin = 1-[Stock].count(None)+len(args)

    s=urlread_(char('http://zacks.thestreet.com/CompanyView.php'),char('post'),[char('ticker'),Stock])
    try:
        etst=strfind_(s,char('Surprise%</strong></div></td>'))
    finally:
        pass
    etend=strfind_(s[etst:end()],char(' </table>'))
    et=s[etst:etst + etend]
    rowend=strfind_(et,char('</tr>'))
    earnings=cell_(length_(rowend) - 2,6)
    for i in arange_(1,(length_(rowend) - 1)).reshape(-1):
        if i == length_(rowend):
            row=et[rowend[i]:end()]
        else:
            row=et[rowend[i]:rowend[i + 1]]
        dst=strfind_(row,char('<td>'))
        for j in arange_(1,6).reshape(-1):
            if j == 6:
                a=row[dst[j]:end() - 23]
            else:
                a=row[dst[j]:dst[j + 1]]
            earnings[i,j]=a[5:(end() - 38)]
    emptyCells=cellfun_(isempty,earnings)
    row,col=find_(emptyCells,nargout=2)
    earnings[row,:]=[]
    return earnings

print scrapeEarningsZacks_(AAPL)

Question by:earngreen
7 Comments
 
LVL 45

Expert Comment

by:aikimark
ID: 40512319
Have you tried passing a string into the function?
print scrapeEarningsZacks_("AAPL")
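
To illustrate why the quotes matter: the call in the question passes AAPL without quotes, so Python treats it as a variable name and looks it up before the function ever runs. A minimal Python 3 sketch, where lookup is a hypothetical stand-in for the scraper and not part of the original script:

```python
def lookup(ticker):
    # Hypothetical stand-in for the scraper: it only echoes its argument.
    return "ticker=" + ticker

try:
    lookup(AAPL)  # bare name: Python searches for a variable called AAPL
except NameError as exc:
    error = str(exc)  # the lookup fails before lookup() is even called

result = lookup("AAPL")  # quoted: an ordinary string argument
```

This only addresses the first error; the converted helper functions in the question would still need to be defined or replaced.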

 
LVL 83

Expert Comment

by:Dave Baldwin
ID: 40512322
It looks like all the data on that page is loaded through JavaScript. Your code will not run the JavaScript to get the data, so it is unlikely that you will be able to scrape that page. In particular, the input for selecting a stock is handled by JavaScript; it is not something you can 'post' to and get a result. This is that code:

<input  type="text" name="search_company"  id="search_company" value="Enter company name" size=18 onFocus="JavaScript:this.value=''" onBlur="JavaScript:Fill_Lookup()" onkeyup="get_ticker_info();">
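
One quick way to check a concern like this is to look for a known marker in the raw HTML the server returns, before any script runs. The sketch below is only an illustration: both page strings are made up (not real Zacks markup), is_in_static_html is a hypothetical helper, and the marker is the 'Surprise%' text the question's code searches for.

```python
def is_in_static_html(html, marker):
    """Return True if marker appears in the raw HTML text as sent by the server."""
    return marker in html

# Hypothetical responses; neither is real Zacks markup.
static_page = "<table><tr><td><strong>Surprise%</strong></td></tr></table>"
script_page = "<div id='earnings'></div><script>loadEarnings();</script>"

static_found = is_in_static_html(static_page, "Surprise%")  # data is in the HTML
script_found = is_in_static_html(script_page, "Surprise%")  # data would arrive via script
```

If the marker is missing from the fetched text, a plain HTTP request will not be enough and the data has to come from wherever the script loads it.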
 
LVL 45

Expert Comment

by:aikimark
ID: 40512591
Are you running this code on Windows or Linux?

Author Comment

by:earngreen
ID: 40513143
This is on Linux.
 
LVL 45

Expert Comment

by:aikimark
ID: 40513222
What libraries have you imported?
 
LVL 25

Accepted Solution

by:clockwatcher (earned 500 total points)
ID: 40514416
It would probably be easier if you told us what you're hoping to return, rather than having us fix whatever is going on with the code you have there.

Given the sample URL

   http://zacks.thestreet.com/CompanyView.php?ticker=AAPL

what would you like your scrapeEarnings to return? The entire table? Here's a Python 3 example that parses it into Python objects using Beautiful Soup:

from bs4 import BeautifulSoup
import urllib.request

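class Earning(object):
    """One row of the earnings table."""
    def __init__(self, table_row):
        # table_row("td") is shorthand for table_row.find_all("td").
        # The unpacking raises ValueError unless the row has exactly six cells.
        (self.date,
         self.period_ending,
         self.estimate,
         self.reported,
         self.surprise,
         self.surprise_percent) = [i.text for i in table_row("td")]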

    def __str__(self):
        return "\t".join((self.date, self.period_ending, self.estimate,
                         self.reported, self.surprise, self.surprise_percent))

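class Earnings(object):
    def __init__(self, soup):
        self.soup = soup
        # Assumes the page layout: the second table inside the element with id "divPrint".
        self.earnings_table = soup.find(id="divPrint")("table")[1]
        # Skip the first row, which holds the column headers.
        self.earnings_rows = self.earnings_table("tr")[1:]
        self.earnings = [Earning(e) for e in self.earnings_rows]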

    def __str__(self):
        return "\n".join([str(e) for e in self.earnings])

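def getEarningsForTicker(ticker):
    url = "http://zacks.thestreet.com/CompanyView.php?ticker={0}".format(ticker)
    # Name the parser explicitly so bs4 does not warn or choose a different one per machine.
    # "html.parser" is the stdlib parser; swap in "lxml" if it is installed and parses the page better.
    with urllib.request.urlopen(url) as response:
        return Earnings(BeautifulSoup(response, "html.parser"))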

def main():
    print(getEarningsForTicker('AAPL'))

if __name__ == '__main__':
    main()
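
The example above needs network access and the bs4 package. As a rough offline sketch of the same idea (collecting the text of the td cells row by row), here is a version that uses only the standard library's html.parser. The RowCollector class and the SAMPLE table are assumptions made up for illustration; the values are invented, not Zacks data.

```python
from html.parser import HTMLParser

class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # rows with no <td> cells (the header) are skipped
                self.rows.append(self._row)
            self._row = None
            self._cell = None

# Hypothetical sample markup; the values are invented.
SAMPLE = """
<table>
  <tr><th>Date</th><th>Period Ending</th><th>Estimate</th>
      <th>Reported</th><th>Surprise</th><th>Surprise%</th></tr>
  <tr><td>01/01/2000</td><td>12/1999</td><td>1.00</td>
      <td>1.10</td><td>0.10</td><td>10.00%</td></tr>
</table>
"""

collector = RowCollector()
collector.feed(SAMPLE)
rows = collector.rows
```

Each entry in rows is a list of six strings in column order, which is the same shape the Earning class unpacks into its attributes.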

 

Author Comment

by:earngreen
ID: 40520120
clockwatcher, that worked out great. Thanks.