Link to home
Start Free TrialLog in
Avatar of cerksees
cerkseesFlag for United States of America

asked on

Reading dirty CSV data with Visual Basic 2008/2010.....

There is no "right" or wrong answer for this question - only more efficient answers than others.

I am interested in studying stock movements and in different ways of graphing the data available for publicly traded stocks.

So, to start with, I needed some stock data.  Luckily, others before me also needed this data and shared the calls needed to gather stock data from Yahoo via their financial pages which allow you to download CSV files containing the requested stock data.

You can get about 75 pieces of data for any given stock in less than 1 second.  Bump it up to 4 stocks and you still get the data back in less than 2 seconds.

Unfortunately, the data returned from Yahoo can be dirty.  By this I mean that the actual data itself can contain commas.  This data is not encased in parentheses or quotes, but is placed right alongside other data fields.  The problem is that CSV files use commas to tell where the data fields are and this makes the returned value 3,000 look like 2 fields (3 and 000) to Excel or any CSV file reading code.

To add to the confusion, the returned data does not contain any field names.  When your 75 requested fields returns an array of 84 items, you know something went wrong, but finding it is a pain in the butt.

So, I split out each of the 75 fields and basically requested a single value CSV file for each of the values being requested.  Doing so, I know all of the data returned is for the single data field that I requested.  The problem with this method is that making 75 HTTPWebResponse calls in a row kicks up the time to get all of the data for a single stock from less than one second to almost 50 seconds.

Multiply 50 seconds by over 7,000 stocks and you're looking at over 90 hours to get all of the needed data.  So, doing one call at a time is not going to work. to get daily stock data for all of these stocks.

The question that I have is what would be the best way to do this?  Should I use 75 threads and launch them all at once?  Should use the thread pool handling built in to VB.Net 2008?

How would you approach this problem?

And, yes, I am aware that there are places that sell historical stock data.  But, they are generally limited to open, high, low, close and volume.  The data available via Yahoo has about 75 variables for each stock (not all of which are available every day - but most of them are).

If you need, I can post the VERY raw and ugly code that I have been playing with to test getting this data.  The code is only important because I need the data.
ASKER CERTIFIED SOLUTION
Avatar of nepaluz
nepaluz
Flag of United Kingdom of Great Britain and Northern Ireland image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of cerksees

ASKER

The code that retrieves the CSV file from yahoo cannot really be altered.  The only thing you can do is add or remove the field tags for the fields that you are requesting.

The allowed tags and fields are as follows....

a       Ask       
a2       Average Daily Volume       
a5       Ask Size
b       Bid       
b4       Book Value       
b6       Bid Size       
c       Change & Percent Change
c1       Change       
c3       Commission       
d       Dividend/Share       
d1       Last Trade Date
d2       Trade Date       
e       Earnings/Share       
e1       Error Indication (returned for symbol changed / invalid)
e7       EPS Estimate Current Year       
e8       EPS Estimate Next Year       
e9       EPS Estimate Next Quarter
f6       Float Shares       
g       Day’s Low       
h       Day’s High
j       52-week Low       
k       52-week High       
g1       Holdings Gain Percent
g3       Annualized Gain       
g4       Holdings Gain       
i       More Info       
j1       Market Capitalization       
j4       EBITDA
j5       Change From 52-week Low       
j6       Percent Change From 52-week Low       
k3       Last Trade Size       
k4       Change From 52-week High
k5       Percebt Change From 52-week High       
l       Last Trade (With Time)       
l1       Last Trade (Price Only)
l2       High Limit       
l3       Low Limit       
m       Day’s Range
m3       50-day Moving Average       
m4       200-day Moving Average
m5       Change From 200-day Moving Average       
m6       Percent Change From 200-day Moving Average       
m7       Change From 50-day Moving Average
m8       Percent Change From 50-day Moving Average       
n       Name       
n4       Notes
o       Open       
p       Previous Close       
p1       Price Paid
p2       Change in Percent       
p5       Price/Sales       
p6       Price/Book
q       Ex-Dividend Date       
r       P/E Ratio       
r1       Dividend Pay Date
r5       PEG Ratio       
r6       Price/EPS Estimate Current Year
r7       Price/EPS Estimate Next Year       
s       Symbol       
s1       Shares Owned
s7       Short Ratio       
t1       Last Trade Time       
t6       Trade Links
t7       Ticker Trend       
t8       1 yr Target Price       
v       Volume
v1       Holdings Value       
w       52-week Range
w1       Day’s Value Change       
x       Stock Exchange
y       Dividend Yield

As explained on http://cliffngan.net/a/13..."http://finance.yahoo.com/d/quotes.csv?s= a BUNCH of STOCK SYMBOLS separated by “+” &f=a bunch of special tags
**s=SYMBOL**: where SYMBOL is the actual stock symbol (e.g. 0001.HK)
**f=REQUESTED_FORMAT_STRING** where the REQUESTED_FORMAT_STRING is a string of letters representing what fields to be requested.

for example: http://finance.yahoo.com/d/quotes.csv?s=XOM+BBDb.TO+JNJ+MSFT&f=snd1l1yr

So, for me to get ALL data for a stock like Microsoft Corporation (symbol MSFT), I would use ...

http://finance.yahoo.com/d/quotes.csv?s=MSFT&f=aa2a5bb4b6cc1c3dd1d2ee1e7e8e9f6ghjkg1g3g4ij1j4j5j6k3k4k5ll1l2l3mm3m4m5m6m7m8nn4opp1p2p5p6qrr1r5r6r7ss1s7t1t6t7t8vv1ww1xy

If you copy this line and put it in your browser's address bar, you can download the CSV file.  You will see that you get 76 columns of data where you should only get 71.

One of the dirty fields is the a5 tag (Ask Size).  If you change the above line to only download the a5 data, you will see that it shows up as 2 columns.  Try it with this http request...

http://finance.yahoo.com/d/quotes.csv?s=MSFT&f=a5

There are other dirty fields too.  These dirty fields contain commas.  Any CSV file reader (like excel) will see these commas as delimiters for different columns.  In the case of the a5 tag, a value of 75,300 shows up as two values (75 and 300).

Knowing this, there may be other dirty data as well that does not show up in testing.

So, I used each field tag in an independent HTTP request to get all of the data....like this....

http://finance.yahoo.com/d/quotes.csv?s=MSFT&f=a
http://finance.yahoo.com/d/quotes.csv?s=MSFT&f=a2
http://finance.yahoo.com/d/quotes.csv?s=MSFT&f=a5
http://finance.yahoo.com/d/quotes.csv?s=MSFT&f=b4
...and so on...

This makes sure that any dirty data transmitted cannot not pollute the data pool.  Unfortunately it also increases the time to get the data from less than a second to over 48 seconds for a single stock - increasing the time to get all data from around 2 hours to over 90 hours.

I do not know what data will yield the best results, so I do need to capture all of it.  Even if I cut out the known dirty fields, that does not mean the more dirty data will come from yahoo in other fields in the future.

The data is not real time and does not need to be.

Thanks for your twopence.  I hope this explanation has made the question clearer.
Oops.....

This makes sure that any dirty data transmitted cannot not pollute the data pool.
 

should have read as "This makes sure that any dirty data transmitted cannot pollute the data pool."

Sorry about that....
As I said....there is no incorrect answer here.  This got me to thinking about things in a different manner an I solved the problem.
cerksees - I intended to get back to you but have been bogged down by some deadlines I had to make. I do have a bit of breathing  space so here goes.

The data you list is the same type of data that you can get from yahoo apis, and would therefore wonder why you don't utilize xml.

The YQL console to construct query strings can be found at:

http://developer.yahoo.com/yql/console/

Below is an example of a returned xml for BAY.L (I build the string in VB.net and call the page using webclient)

http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20yahoo.finance.quotes%20where%20symbol%20in%20(%22BAY.L%22)&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys

The XML:

  <?xml version="1.0" encoding="UTF-8" ?>
- <query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng" yahoo:count="1" yahoo:created="2010-11-25T11:56:34Z" yahoo:lang="en-US">
- <results>
- <quote symbol="BAY.L">
  <Ask>275.30</Ask>
  <AverageDailyVolume>7892740</AverageDailyVolume>
  <Bid>275.10</Bid>
  <AskRealtime>275.30</AskRealtime>
  <BidRealtime>275.10</BidRealtime>
  <BookValue>17.59</BookValue>
  <Change_PercentChange>+1.80 - +0.66%</Change_PercentChange>
  <Change>+1.80</Change>
  <Commission />
  <ChangeRealtime>+1.80</ChangeRealtime>
  <AfterHoursChangeRealtime>N/A - N/A</AfterHoursChangeRealtime>
  <DividendShare>0.00</DividendShare>
  <LastTradeDate>11/25/2010</LastTradeDate>
  <TradeDate />
  <EarningsShare>-1.03</EarningsShare>
  <ErrorIndicationreturnedforsymbolchangedinvalid>N/A</ErrorIndicationreturnedforsymbolchangedinvalid>
  <EPSEstimateCurrentYear>11.88</EPSEstimateCurrentYear>
  <EPSEstimateNextYear>28.48</EPSEstimateNextYear>
  <EPSEstimateNextQuarter>-3.81</EPSEstimateNextQuarter>
  <DaysLow>271.20</DaysLow>
  <DaysHigh>278.50</DaysHigh>
  <YearLow>180.20</YearLow>
  <YearHigh>290.00</YearHigh>
  <HoldingsGainPercent>- - -</HoldingsGainPercent>
  <AnnualizedGain>-</AnnualizedGain>
  <HoldingsGain />
  <HoldingsGainPercentRealtime>N/A - N/A</HoldingsGainPercentRealtime>
  <HoldingsGainRealtime />
  <MoreInfo>cnprIed</MoreInfo>
  <OrderBookRealtime>N/A</OrderBookRealtime>
  <MarketCapitalization>326.4M</MarketCapitalization>
  <MarketCapRealtime />
  <EBITDA>866.0M</EBITDA>
  <ChangeFromYearLow>+95.00</ChangeFromYearLow>
  <PercentChangeFromYearLow>+52.72%</PercentChangeFromYearLow>
  <LastTradeRealtimeWithTime>N/A - <b>275.20</b></LastTradeRealtimeWithTime>
  <ChangePercentRealtime>N/A - +0.66%</ChangePercentRealtime>
  <ChangeFromYearHigh>-14.80</ChangeFromYearHigh>
  <PercebtChangeFromYearHigh>-5.10%</PercebtChangeFromYearHigh>
  <LastTradeWithTime>6:39am - <b>275.20</b></LastTradeWithTime>
  <LastTradePriceOnly>275.20</LastTradePriceOnly>
  <HighLimit />
  <LowLimit />
  <DaysRange>271.20 - 278.50</DaysRange>
  <DaysRangeRealtime>N/A - N/A</DaysRangeRealtime>
  <FiftydayMovingAverage>272.622</FiftydayMovingAverage>
  <TwoHundreddayMovingAverage>229.434</TwoHundreddayMovingAverage>
  <ChangeFromTwoHundreddayMovingAverage>+45.766</ChangeFromTwoHundreddayMovingAverage>
  <PercentChangeFromTwoHundreddayMovingAverage>+19.95%</PercentChangeFromTwoHundreddayMovingAverage>
  <ChangeFromFiftydayMovingAverage>+2.578</ChangeFromFiftydayMovingAverage>
  <PercentChangeFromFiftydayMovingAverage>+0.95%</PercentChangeFromFiftydayMovingAverage>
  <Name>BR.AIRWAYS ORD 25</Name>
  <Notes>-</Notes>
  <Open>272.60</Open>
  <PreviousClose>273.40</PreviousClose>
  <PricePaid />
  <ChangeinPercent>+0.66%</ChangeinPercent>
  <PriceSales>3.89</PriceSales>
  <PriceBook>15.54</PriceBook>
  <ExDividendDate>N/A</ExDividendDate>
  <PERatio />
  <DividendPayDate>N/A</DividendPayDate>
  <PERatioRealtime />
  <PEGRatio>2.24</PEGRatio>
  <PriceEPSEstimateCurrentYear>23.01</PriceEPSEstimateCurrentYear>
  <PriceEPSEstimateNextYear>9.60</PriceEPSEstimateNextYear>
  <Symbol>BAY.L</Symbol>
  <SharesOwned />
  <ShortRatio />
  <LastTradeTime>6:39am</LastTradeTime>
  <TickerTrend>&nbsp;==-===&nbsp;</TickerTrend>
  <OneyrTargetPrice>318.33</OneyrTargetPrice>
  <Volume>1905332</Volume>
  <HoldingsValue />
  <HoldingsValueRealtime />
  <YearRange>180.20 - 290.00</YearRange>
  <DaysValueChange>- - +0.66%</DaysValueChange>
  <DaysValueChangeRealtime>N/A - N/A</DaysValueChangeRealtime>
  <StockExchange>London</StockExchange>
  <DividendYield />
  <PercentChange>+0.66%</PercentChange>
  </quote>
  </results>
  </query>
- <!--  total: 273
  -->
- <!--  yqlengine1.pipes.ch1.yahoo.com compressed/chunked Thu Nov 25 03:56:34 PST 2010
  -->
Using linq to xml, you could easily "spot" empty tags etc, and there would not be an issue with identifying where a field / column ended.
Excellent advice.  Thanks!