Still celebrating National IT Professionals Day with 3 months of free Premium Membership. Use Code ITDAY17


Reading dirty CSV data with Visual Basic 2008/2010.....

Posted on 2010-11-22
Medium Priority
Last Modified: 2012-05-10
There is no "right" or wrong answer for this question - only more efficient answers than others.

I am interested in studying stock movements and in different ways of graphing the data available for publicly traded stocks.

So, to start with, I needed some stock data.  Luckily, others before me also needed this data and shared the calls needed to gather stock data from Yahoo via their financial pages which allow you to download CSV files containing the requested stock data.

You can get about 75 pieces of data for any given stock in less than 1 second.  Bump it up to 4 stocks and you still get the data back in less than 2 seconds.

Unfortunately, the data returned from Yahoo can be dirty.  By this I mean that the actual data itself can contain commas.  This data is not encased in parentheses or quotes, but is placed right alongside other data fields.  The problem is that CSV files use commas to tell where the data fields are and this makes the returned value 3,000 look like 2 fields (3 and 000) to Excel or any CSV file reading code.

To add to the confusion, the returned data does not contain any field names.  When your 75 requested fields returns an array of 84 items, you know something went wrong, but finding it is a pain in the butt.

So, I split out each of the 75 fields and basically requested a single value CSV file for each of the values being requested.  Doing so, I know all of the data returned is for the single data field that I requested.  The problem with this method is that making 75 HTTPWebResponse calls in a row kicks up the time to get all of the data for a single stock from less than one second to almost 50 seconds.

Multiply 50 seconds by over 7,000 stocks and you're looking at over 90 hours to get all of the needed data.  So, doing one call at a time is not going to work. to get daily stock data for all of these stocks.

The question that I have is what would be the best way to do this?  Should I use 75 threads and launch them all at once?  Should use the thread pool handling built in to VB.Net 2008?

How would you approach this problem?

And, yes, I am aware that there are places that sell historical stock data.  But, they are generally limited to open, high, low, close and volume.  The data available via Yahoo has about 75 variables for each stock (not all of which are available every day - but most of them are).

If you need, I can post the VERY raw and ugly code that I have been playing with to test getting this data.  The code is only important because I need the data.
Question by:cerksees
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 4
  • 3
LVL 17

Accepted Solution

nepaluz earned 2000 total points
ID: 34186849
Have you thought about revising the code that retrieves the data from Yahoo?
And doyou need / require all these fields, or are you factoring in "redundancy" just in case you need them? And finally, is the data that you aregetting hold of realtime (and do you need / require realtime)?

I suggest looking at the code (if at all you can get hold of it) that retrieves the data from Yahoo as a first step after deciding on the former. Then utilize the WebClient in a background worker (cum thread). In my experience retrieving data from Yahoo Finance, once the initial connection is made, subsequent iterated requests are much faster.

(For my twopence!)

Author Comment

ID: 34191449
The code that retrieves the CSV file from yahoo cannot really be altered.  The only thing you can do is add or remove the field tags for the fields that you are requesting.

The allowed tags and fields are as follows....

a       Ask       
a2       Average Daily Volume       
a5       Ask Size
b       Bid       
b4       Book Value       
b6       Bid Size       
c       Change & Percent Change
c1       Change       
c3       Commission       
d       Dividend/Share       
d1       Last Trade Date
d2       Trade Date       
e       Earnings/Share       
e1       Error Indication (returned for symbol changed / invalid)
e7       EPS Estimate Current Year       
e8       EPS Estimate Next Year       
e9       EPS Estimate Next Quarter
f6       Float Shares       
g       Day’s Low       
h       Day’s High
j       52-week Low       
k       52-week High       
g1       Holdings Gain Percent
g3       Annualized Gain       
g4       Holdings Gain       
i       More Info       
j1       Market Capitalization       
j4       EBITDA
j5       Change From 52-week Low       
j6       Percent Change From 52-week Low       
k3       Last Trade Size       
k4       Change From 52-week High
k5       Percebt Change From 52-week High       
l       Last Trade (With Time)       
l1       Last Trade (Price Only)
l2       High Limit       
l3       Low Limit       
m       Day’s Range
m3       50-day Moving Average       
m4       200-day Moving Average
m5       Change From 200-day Moving Average       
m6       Percent Change From 200-day Moving Average       
m7       Change From 50-day Moving Average
m8       Percent Change From 50-day Moving Average       
n       Name       
n4       Notes
o       Open       
p       Previous Close       
p1       Price Paid
p2       Change in Percent       
p5       Price/Sales       
p6       Price/Book
q       Ex-Dividend Date       
r       P/E Ratio       
r1       Dividend Pay Date
r5       PEG Ratio       
r6       Price/EPS Estimate Current Year
r7       Price/EPS Estimate Next Year       
s       Symbol       
s1       Shares Owned
s7       Short Ratio       
t1       Last Trade Time       
t6       Trade Links
t7       Ticker Trend       
t8       1 yr Target Price       
v       Volume
v1       Holdings Value       
w       52-week Range
w1       Day’s Value Change       
x       Stock Exchange
y       Dividend Yield

As explained on" a BUNCH of STOCK SYMBOLS separated by “+” &f=a bunch of special tags
**s=SYMBOL**: where SYMBOL is the actual stock symbol (e.g. 0001.HK)
**f=REQUESTED_FORMAT_STRING** where the REQUESTED_FORMAT_STRING is a string of letters representing what fields to be requested.

for example:

So, for me to get ALL data for a stock like Microsoft Corporation (symbol MSFT), I would use ...

If you copy this line and put it in your browser's address bar, you can download the CSV file.  You will see that you get 76 columns of data where you should only get 71.

One of the dirty fields is the a5 tag (Ask Size).  If you change the above line to only download the a5 data, you will see that it shows up as 2 columns.  Try it with this http request...

There are other dirty fields too.  These dirty fields contain commas.  Any CSV file reader (like excel) will see these commas as delimiters for different columns.  In the case of the a5 tag, a value of 75,300 shows up as two values (75 and 300).

Knowing this, there may be other dirty data as well that does not show up in testing.

So, I used each field tag in an independent HTTP request to get all of the this....
...and so on...

This makes sure that any dirty data transmitted cannot not pollute the data pool.  Unfortunately it also increases the time to get the data from less than a second to over 48 seconds for a single stock - increasing the time to get all data from around 2 hours to over 90 hours.

I do not know what data will yield the best results, so I do need to capture all of it.  Even if I cut out the known dirty fields, that does not mean the more dirty data will come from yahoo in other fields in the future.

The data is not real time and does not need to be.

Thanks for your twopence.  I hope this explanation has made the question clearer.

Author Comment

ID: 34191517

This makes sure that any dirty data transmitted cannot not pollute the data pool.

should have read as "This makes sure that any dirty data transmitted cannot pollute the data pool."

Sorry about that....
Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!


Author Closing Comment

ID: 34200670
As I said....there is no incorrect answer here.  This got me to thinking about things in a different manner an I solved the problem.
LVL 17

Expert Comment

ID: 34212095
cerksees - I intended to get back to you but have been bogged down by some deadlines I had to make. I do have a bit of breathing  space so here goes.

The data you list is the same type of data that you can get from yahoo apis, and would therefore wonder why you don't utilize xml.

The YQL console to construct query strings can be found at:

Below is an example of a returned xml for BAY.L (I build the string in and call the page using webclient)*

The XML:

  <?xml version="1.0" encoding="UTF-8" ?>
- <query xmlns:yahoo="" yahoo:count="1" yahoo:created="2010-11-25T11:56:34Z" yahoo:lang="en-US">
- <results>
- <quote symbol="BAY.L">
  <Change_PercentChange>+1.80 - +0.66%</Change_PercentChange>
  <Commission />
  <AfterHoursChangeRealtime>N/A - N/A</AfterHoursChangeRealtime>
  <TradeDate />
  <HoldingsGainPercent>- - -</HoldingsGainPercent>
  <HoldingsGain />
  <HoldingsGainPercentRealtime>N/A - N/A</HoldingsGainPercentRealtime>
  <HoldingsGainRealtime />
  <MarketCapRealtime />
  <LastTradeRealtimeWithTime>N/A - <b>275.20</b></LastTradeRealtimeWithTime>
  <ChangePercentRealtime>N/A - +0.66%</ChangePercentRealtime>
  <LastTradeWithTime>6:39am - <b>275.20</b></LastTradeWithTime>
  <HighLimit />
  <LowLimit />
  <DaysRange>271.20 - 278.50</DaysRange>
  <DaysRangeRealtime>N/A - N/A</DaysRangeRealtime>
  <Name>BR.AIRWAYS ORD 25</Name>
  <PricePaid />
  <PERatio />
  <PERatioRealtime />
  <SharesOwned />
  <ShortRatio />
  <HoldingsValue />
  <HoldingsValueRealtime />
  <YearRange>180.20 - 290.00</YearRange>
  <DaysValueChange>- - +0.66%</DaysValueChange>
  <DaysValueChangeRealtime>N/A - N/A</DaysValueChangeRealtime>
  <DividendYield />
- <!--  total: 273
- <!-- compressed/chunked Thu Nov 25 03:56:34 PST 2010
LVL 17

Expert Comment

ID: 34212104
Using linq to xml, you could easily "spot" empty tags etc, and there would not be an issue with identifying where a field / column ended.

Author Comment

ID: 34215892
Excellent advice.  Thanks!

Featured Post

Concerto Cloud for Software Providers & ISVs

Can Concerto Cloud Services help you focus on evolving your application offerings, while delivering the best cloud experience to your customers? From DevOps to revenue models and customer support, the answer is yes!

Learn how Concerto can help you.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

This tutorial demonstrates one way to create an application that runs without any Forms but still has a GUI presence via an Icon in the System Tray. The magic lies in Inheriting from the ApplicationContext Class and passing that to Application.Ru…
Well, all of us have seen the multiple EXCEL.EXE's in task manager that won't die even if you call the .close, .dispose methods. Try this method to kill any excels in memory. You can copy the kill function to create a check function and replace the …
In this brief tutorial Pawel from AdRem Software explains how you can quickly find out which services are running on your network, or what are the IP addresses of servers responsible for each service. Software used is freeware NetCrunch Tools (https…
In this video, Percona Director of Solution Engineering Jon Tobin discusses the function and features of Percona Server for MongoDB. How Percona can help Percona can help you determine if Percona Server for MongoDB is the right solution for …

670 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question