Reading dirty CSV data with Visual Basic 2008/2010.....

There is no "right" or wrong answer for this question - only more efficient answers than others.

I am interested in studying stock movements and in different ways of graphing the data available for publicly traded stocks.

So, to start with, I needed some stock data.  Luckily, others before me also needed this data and shared the calls needed to gather stock data from Yahoo via their financial pages which allow you to download CSV files containing the requested stock data.

You can get about 75 pieces of data for any given stock in less than 1 second.  Bump it up to 4 stocks and you still get the data back in less than 2 seconds.

Unfortunately, the data returned from Yahoo can be dirty.  By this I mean that the actual data itself can contain commas.  This data is not encased in parentheses or quotes, but is placed right alongside other data fields.  The problem is that CSV files use commas to tell where the data fields are and this makes the returned value 3,000 look like 2 fields (3 and 000) to Excel or any CSV file reading code.
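To make the problem concrete, here is a minimal sketch (in Python, since the thread's VB.NET code isn't posted) of why an unquoted thousands separator breaks naive CSV splitting, and why the same value parses fine when it is quoted:

```python
import csv

# A row whose third field is really the single value 3,000,
# but Yahoo returns it without quotes.
row = "MSFT,25.10,3,000"

fields = row.split(",")
print(fields)       # ['MSFT', '25.10', '3', '000'] -- 4 fields, not 3

# A properly quoted row parses correctly with any CSV reader:
clean = next(csv.reader(['MSFT,25.10,"3,000"']))
print(clean)        # ['MSFT', '25.10', '3,000'] -- 3 fields
```

The same thing happens in Excel or in any split-on-comma parsing code: without quoting, the reader has no way to tell a delimiter from a digit-group separator.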

To add to the confusion, the returned data does not contain any field names.  When your 75 requested fields return an array of 84 items, you know something went wrong, but finding the problem is a pain in the butt.

So, I split out each of the 75 fields and basically requested a single-value CSV file for each value being requested.  That way, I know all of the data returned is for the single data field that I requested.  The problem with this method is that making 75 HttpWebRequest calls in a row kicks up the time to get all of the data for a single stock from less than one second to almost 50 seconds.

Multiply 50 seconds by over 7,000 stocks and you're looking at over 90 hours to get all of the needed data.  So, doing one call at a time is not going to work for getting daily stock data for all of these stocks.
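For reference, the arithmetic behind that estimate:

```python
seconds_per_stock = 50          # ~75 sequential requests per stock
stocks = 7000                   # universe of symbols to cover

total_seconds = seconds_per_stock * stocks
total_hours = total_seconds / 3600
print(round(total_hours, 1))    # 97.2 -- well over the 90-hour figure quoted
```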

The question that I have is: what would be the best way to do this?  Should I use 75 threads and launch them all at once?  Should I use the thread-pool handling built into VB.NET 2008?

How would you approach this problem?
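For what it's worth, the fan-out pattern being asked about looks roughly like this. The thread is about VB.NET 2008, where `System.Threading.ThreadPool` or a `BackgroundWorker` would be the natural fit; this is a language-neutral sketch in Python, and `fetch_field` is a hypothetical stand-in for the real HTTP call:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the real HTTP call; in the actual program this
# would download the one-field CSV for the given tag.
def fetch_field(tag):
    return tag, "value-for-" + tag

tags = ["a", "a2", "a5", "b", "b4"]  # ...all 75 tags in practice

# A bounded pool rather than 75 raw threads: the OS and the server both
# tend to dislike 75 sockets opened simultaneously, and a pool of ~10
# workers already collapses 75 sequential round-trips into ~8 batches.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = dict(pool.map(fetch_field, tags))

print(results["a5"])
```

The same structure maps onto `ThreadPool.QueueUserWorkItem` in VB.NET, with a countdown or `WaitHandle` to know when all 75 requests have completed.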

And, yes, I am aware that there are places that sell historical stock data.  But, they are generally limited to open, high, low, close and volume.  The data available via Yahoo has about 75 variables for each stock (not all of which are available every day - but most of them are).

If you need, I can post the VERY raw and ugly code that I have been playing with to test getting this data.  The code is only important because I need the data.
nepaluz commented:
Have you thought about revising the code that retrieves the data from Yahoo?
And do you need / require all these fields, or are you factoring in "redundancy" just in case you need them?  And finally, is the data that you are getting hold of realtime (and do you need / require realtime)?

I suggest looking at the code (if at all you can get hold of it) that retrieves the data from Yahoo as a first step after deciding on the former. Then utilize the WebClient in a background worker (cum thread). In my experience retrieving data from Yahoo Finance, once the initial connection is made, subsequent iterated requests are much faster.

(For my twopence!)
cerksees (Author) commented:
The code that retrieves the CSV file from yahoo cannot really be altered.  The only thing you can do is add or remove the field tags for the fields that you are requesting.

The allowed tags and fields are as follows....

a       Ask       
a2       Average Daily Volume       
a5       Ask Size
b       Bid       
b4       Book Value       
b6       Bid Size       
c       Change & Percent Change
c1       Change       
c3       Commission       
d       Dividend/Share       
d1       Last Trade Date
d2       Trade Date       
e       Earnings/Share       
e1       Error Indication (returned for symbol changed / invalid)
e7       EPS Estimate Current Year       
e8       EPS Estimate Next Year       
e9       EPS Estimate Next Quarter
f6       Float Shares       
g       Day’s Low       
h       Day’s High
j       52-week Low       
k       52-week High       
g1       Holdings Gain Percent
g3       Annualized Gain       
g4       Holdings Gain       
i       More Info       
j1       Market Capitalization       
j4       EBITDA
j5       Change From 52-week Low       
j6       Percent Change From 52-week Low       
k3       Last Trade Size       
k4       Change From 52-week High
k5       Percent Change From 52-week High       
l       Last Trade (With Time)       
l1       Last Trade (Price Only)
l2       High Limit       
l3       Low Limit       
m       Day’s Range
m3       50-day Moving Average       
m4       200-day Moving Average
m5       Change From 200-day Moving Average       
m6       Percent Change From 200-day Moving Average       
m7       Change From 50-day Moving Average
m8       Percent Change From 50-day Moving Average       
n       Name       
n4       Notes
o       Open       
p       Previous Close       
p1       Price Paid
p2       Change in Percent       
p5       Price/Sales       
p6       Price/Book
q       Ex-Dividend Date       
r       P/E Ratio       
r1       Dividend Pay Date
r5       PEG Ratio       
r6       Price/EPS Estimate Current Year
r7       Price/EPS Estimate Next Year       
s       Symbol       
s1       Shares Owned
s7       Short Ratio       
t1       Last Trade Time       
t6       Trade Links
t7       Ticker Trend       
t8       1 yr Target Price       
v       Volume
v1       Holdings Value       
w       52-week Range
w1       Day’s Value Change       
x       Stock Exchange
y       Dividend Yield

As explained on" a BUNCH of STOCK SYMBOLS separated by “+” &f=a bunch of special tags
**s=SYMBOL**: where SYMBOL is the actual stock symbol (e.g. 0001.HK)
**f=REQUESTED_FORMAT_STRING** where the REQUESTED_FORMAT_STRING is a string of letters representing what fields to be requested.
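Assembling the query string from those two parameters is straightforward; a sketch (in Python; the host and path are whatever the download page documents, so only the s=/f= assembly is shown):

```python
from urllib.parse import urlencode

def build_query(symbols, tags):
    # s= symbols joined by "+", f= the tag letters concatenated.
    # safe="+" keeps the "+" separator from being percent-encoded.
    return urlencode({"s": "+".join(symbols), "f": "".join(tags)}, safe="+")

q = build_query(["MSFT", "GOOG"], ["n", "s", "l1", "v"])
print(q)  # s=MSFT+GOOG&f=nsl1v
```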

For example, to get ALL data for a stock like Microsoft Corporation (symbol MSFT), I would use ...

If you copy this line and put it in your browser's address bar, you can download the CSV file.  You will see that you get 76 columns of data where you should only get 71.

One of the dirty fields is the a5 tag (Ask Size).  If you change the above line to only download the a5 data, you will see that it shows up as 2 columns.  Try it with this http request...

There are other dirty fields too.  These dirty fields contain commas.  Any CSV file reader (like Excel) will see these commas as delimiters between columns.  In the case of the a5 tag, a value of 75,300 shows up as two values (75 and 300).
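One hedged way to repair such a row after the fact, given that you know how many fields you asked for: when the split yields too many pieces, re-join adjacent pieces that look like the halves of a thousands-separated number. This is a heuristic sketch only, not bulletproof (it could mis-join two legitimately adjacent all-digit fields):

```python
import re

def repair_row(line, expected):
    """Merge split thousands groups until the field count matches `expected`."""
    parts = line.split(",")
    i = 0
    while len(parts) > expected and i < len(parts) - 1:
        # A spilled thousands group is exactly three digits following a
        # number (possibly one we have already re-joined).
        if (re.fullmatch(r"\d{1,3}(,\d{3})*", parts[i])
                and re.fullmatch(r"\d{3}", parts[i + 1])):
            parts[i] = parts[i] + "," + parts[i + 1]
            del parts[i + 1]
        else:
            i += 1
    return parts

print(repair_row("MSFT,75,300", 2))  # ['MSFT', '75,300']
```

The `expected` count is exactly the piece of information the per-field requests were buying; this heuristic trades that certainty for speed.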

Knowing this, there may be other dirty data as well that does not show up in testing.

So, I used each field tag in an independent HTTP request to get all of the data, like this....
...and so on...

This makes sure that any dirty data transmitted cannot not pollute the data pool.  Unfortunately it also increases the time to get the data from less than a second to over 48 seconds for a single stock - increasing the time to get all data from around 2 hours to over 90 hours.

I do not know what data will yield the best results, so I do need to capture all of it.  Even if I cut out the known dirty fields, that does not mean that more dirty data will not come from Yahoo in other fields in the future.

The data is not real time and does not need to be.

Thanks for your twopence.  I hope this explanation has made the question clearer.
cerksees (Author) commented:

This makes sure that any dirty data transmitted cannot not pollute the data pool.

should have read as "This makes sure that any dirty data transmitted cannot pollute the data pool."

Sorry about that....

cerksees (Author) commented:
As I said... there is no incorrect answer here.  This got me thinking about things in a different manner, and I solved the problem.

nepaluz commented:
cerksees - I intended to get back to you but have been bogged down by some deadlines I had to make. I do have a bit of breathing space, so here goes.

The data you list is the same type of data that you can get from the Yahoo APIs, so I would wonder why you don't utilize XML.

The YQL console to construct query strings can be found at:

Below is an example of a returned XML for BAY.L (I build the query string and call the page using WebClient).

The XML:

<?xml version="1.0" encoding="UTF-8" ?>
<query xmlns:yahoo="" yahoo:count="1" yahoo:created="2010-11-25T11:56:34Z" yahoo:lang="en-US">
  <results>
    <quote symbol="BAY.L">
      <Change_PercentChange>+1.80 - +0.66%</Change_PercentChange>
      <Commission />
      <AfterHoursChangeRealtime>N/A - N/A</AfterHoursChangeRealtime>
      <TradeDate />
      <HoldingsGainPercent>- - -</HoldingsGainPercent>
      <HoldingsGain />
      <HoldingsGainPercentRealtime>N/A - N/A</HoldingsGainPercentRealtime>
      <HoldingsGainRealtime />
      <MarketCapRealtime />
      <LastTradeRealtimeWithTime>N/A - <b>275.20</b></LastTradeRealtimeWithTime>
      <ChangePercentRealtime>N/A - +0.66%</ChangePercentRealtime>
      <LastTradeWithTime>6:39am - <b>275.20</b></LastTradeWithTime>
      <HighLimit />
      <LowLimit />
      <DaysRange>271.20 - 278.50</DaysRange>
      <DaysRangeRealtime>N/A - N/A</DaysRangeRealtime>
      <Name>BR.AIRWAYS ORD 25</Name>
      <PricePaid />
      <PERatio />
      <PERatioRealtime />
      <SharesOwned />
      <ShortRatio />
      <HoldingsValue />
      <HoldingsValueRealtime />
      <YearRange>180.20 - 290.00</YearRange>
      <DaysValueChange>- - +0.66%</DaysValueChange>
      <DaysValueChangeRealtime>N/A - N/A</DaysValueChangeRealtime>
      <DividendYield />
    </quote>
  </results>
</query>
<!-- total: 273 -->
<!-- compressed/chunked Thu Nov 25 03:56:34 PST 2010 -->
Using LINQ to XML, you could easily "spot" empty tags etc., and there would be no issue with identifying where a field / column ends.
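The answer names LINQ to XML in VB.NET; the equivalent idea in any XML library is to walk the child elements of `<quote>` and treat empty elements as missing values. A sketch in Python's ElementTree, against a trimmed version of the response above:

```python
import xml.etree.ElementTree as ET

# Trimmed version of the YQL response shown above.
xml = """<query><results><quote symbol="BAY.L">
  <Name>BR.AIRWAYS ORD 25</Name>
  <DaysRange>271.20 - 278.50</DaysRange>
  <Commission />
  <PERatio />
</quote></results></query>"""

quote = ET.fromstring(xml).find("results/quote")
values = {child.tag: child.text for child in quote}

# Empty tags come back as None, so missing values are easy to spot,
# and an embedded comma can never shift a column.
print(values["Name"])        # BR.AIRWAYS ORD 25
print(values["Commission"])  # None
```

Because each value lives inside its own element, the dirty-comma problem disappears entirely: one XML request per stock replaces 75 CSV requests.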
cerksees (Author) commented:
Excellent advice.  Thanks!
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.