Reading dirty CSV data with Visual Basic 2008/2010.....
Posted on 2010-11-22
There is no "right" or wrong answer for this question - only more efficient answers than others.
I am interested in studying stock movements and in different ways of graphing the data available for publicly traded stocks.
So, to start with, I needed some stock data. Luckily, others before me also needed this data and shared the calls needed to gather stock data from Yahoo via their financial pages which allow you to download CSV files containing the requested stock data.
You can get about 75 pieces of data for any given stock in less than 1 second. Bump it up to 4 stocks and you still get the data back in less than 2 seconds.
Unfortunately, the data returned from Yahoo can be dirty. By this I mean that the actual data itself can contain commas. This data is not encased in parentheses or quotes, but is placed right alongside other data fields. The problem is that CSV files use commas to tell where the data fields are and this makes the returned value 3,000 look like 2 fields (3 and 000) to Excel or any CSV file reading code.
To add to the confusion, the returned data does not contain any field names. When your 75 requested fields returns an array of 84 items, you know something went wrong, but finding it is a pain in the butt.
So, I split out each of the 75 fields and basically requested a single value CSV file for each of the values being requested. Doing so, I know all of the data returned is for the single data field that I requested. The problem with this method is that making 75 HTTPWebResponse calls in a row kicks up the time to get all of the data for a single stock from less than one second to almost 50 seconds.
Multiply 50 seconds by over 7,000 stocks and you're looking at over 90 hours to get all of the needed data. So, doing one call at a time is not going to work. to get daily stock data for all of these stocks.
The question that I have is what would be the best way to do this? Should I use 75 threads and launch them all at once? Should use the thread pool handling built in to VB.Net 2008?
How would you approach this problem?
And, yes, I am aware that there are places that sell historical stock data. But, they are generally limited to open, high, low, close and volume. The data available via Yahoo has about 75 variables for each stock (not all of which are available every day - but most of them are).
If you need, I can post the VERY raw and ugly code that I have been playing with to test getting this data. The code is only important because I need the data.