cerksees
asked on
Reading dirty CSV data with Visual Basic 2008/2010.....
There is no "right" or wrong answer for this question - only more efficient answers than others.
I am interested in studying stock movements and in different ways of graphing the data available for publicly traded stocks.
So, to start with, I needed some stock data. Luckily, others before me also needed this data and shared the calls needed to gather stock data from Yahoo via their financial pages which allow you to download CSV files containing the requested stock data.
You can get about 75 pieces of data for any given stock in less than 1 second. Bump it up to 4 stocks and you still get the data back in less than 2 seconds.
Unfortunately, the data returned from Yahoo can be dirty. By this I mean that the actual data itself can contain commas. This data is not encased in parentheses or quotes, but is placed right alongside other data fields. The problem is that CSV files use commas to tell where the data fields are and this makes the returned value 3,000 look like 2 fields (3 and 000) to Excel or any CSV file reading code.
To add to the confusion, the returned data does not contain any field names. When your 75 requested fields returns an array of 84 items, you know something went wrong, but finding it is a pain in the butt.
So, I split out each of the 75 fields and basically requested a single value CSV file for each of the values being requested. Doing so, I know all of the data returned is for the single data field that I requested. The problem with this method is that making 75 HTTPWebResponse calls in a row kicks up the time to get all of the data for a single stock from less than one second to almost 50 seconds.
Multiply 50 seconds by over 7,000 stocks and you're looking at over 90 hours to get all of the needed data. So, doing one call at a time is not going to work. to get daily stock data for all of these stocks.
The question that I have is what would be the best way to do this? Should I use 75 threads and launch them all at once? Should use the thread pool handling built in to VB.Net 2008?
How would you approach this problem?
And, yes, I am aware that there are places that sell historical stock data. But, they are generally limited to open, high, low, close and volume. The data available via Yahoo has about 75 variables for each stock (not all of which are available every day - but most of them are).
If you need, I can post the VERY raw and ugly code that I have been playing with to test getting this data. The code is only important because I need the data.
I am interested in studying stock movements and in different ways of graphing the data available for publicly traded stocks.
So, to start with, I needed some stock data. Luckily, others before me also needed this data and shared the calls needed to gather stock data from Yahoo via their financial pages which allow you to download CSV files containing the requested stock data.
You can get about 75 pieces of data for any given stock in less than 1 second. Bump it up to 4 stocks and you still get the data back in less than 2 seconds.
Unfortunately, the data returned from Yahoo can be dirty. By this I mean that the actual data itself can contain commas. This data is not encased in parentheses or quotes, but is placed right alongside other data fields. The problem is that CSV files use commas to tell where the data fields are and this makes the returned value 3,000 look like 2 fields (3 and 000) to Excel or any CSV file reading code.
To add to the confusion, the returned data does not contain any field names. When your 75 requested fields returns an array of 84 items, you know something went wrong, but finding it is a pain in the butt.
So, I split out each of the 75 fields and basically requested a single value CSV file for each of the values being requested. Doing so, I know all of the data returned is for the single data field that I requested. The problem with this method is that making 75 HTTPWebResponse calls in a row kicks up the time to get all of the data for a single stock from less than one second to almost 50 seconds.
Multiply 50 seconds by over 7,000 stocks and you're looking at over 90 hours to get all of the needed data. So, doing one call at a time is not going to work. to get daily stock data for all of these stocks.
The question that I have is what would be the best way to do this? Should I use 75 threads and launch them all at once? Should use the thread pool handling built in to VB.Net 2008?
How would you approach this problem?
And, yes, I am aware that there are places that sell historical stock data. But, they are generally limited to open, high, low, close and volume. The data available via Yahoo has about 75 variables for each stock (not all of which are available every day - but most of them are).
If you need, I can post the VERY raw and ugly code that I have been playing with to test getting this data. The code is only important because I need the data.
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
Oops.....
should have read as "This makes sure that any dirty data transmitted cannot pollute the data pool."
Sorry about that....
This makes sure that any dirty data transmitted cannot not pollute the data pool.
should have read as "This makes sure that any dirty data transmitted cannot pollute the data pool."
Sorry about that....
ASKER
As I said....there is no incorrect answer here. This got me to thinking about things in a different manner an I solved the problem.
cerksees - I intended to get back to you but have been bogged down by some deadlines I had to make. I do have a bit of breathing space so here goes.
The data you list is the same type of data that you can get from yahoo apis, and would therefore wonder why you don't utilize xml.
The YQL console to construct query strings can be found at:
http://developer.yahoo.com/yql/console/
Below is an example of a returned xml for BAY.L (I build the string in VB.net and call the page using webclient)
http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20yahoo.finance.quotes%20where%20symbol%20in%20(%22BAY.L%22)&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys
The XML:
<?xml version="1.0" encoding="UTF-8" ?>
- <query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng" yahoo:count="1" yahoo:created="2010-11-25T 11:56:34Z" yahoo:lang="en-US">
- <results>
- <quote symbol="BAY.L">
<Ask>275.30</Ask>
<AverageDailyVolume>789274 0</Average DailyVolum e>
<Bid>275.10</Bid>
<AskRealtime>275.30</AskRe altime>
<BidRealtime>275.10</BidRe altime>
<BookValue>17.59</BookValu e>
<Change_PercentChange>+1.8 0 - +0.66%</Change_PercentChan ge>
<Change>+1.80</Change>
<Commission />
<ChangeRealtime>+1.80</Cha ngeRealtim e>
<AfterHoursChangeRealtime> N/A - N/A</AfterHoursChangeRealt ime>
<DividendShare>0.00</Divid endShare>
<LastTradeDate>11/25/2010< /LastTrade Date>
<TradeDate />
<EarningsShare>-1.03</Earn ingsShare>
<ErrorIndicationreturnedfo rsymbolcha ngedinvali d>N/A</Err orIndicati onreturned forsymbolc hangedinva lid>
<EPSEstimateCurrentYear>11 .88</EPSEs timateCurr entYear>
<EPSEstimateNextYear>28.48 </EPSEstim ateNextYea r>
<EPSEstimateNextQuarter>-3 .81</EPSEs timateNext Quarter>
<DaysLow>271.20</DaysLow>
<DaysHigh>278.50</DaysHigh >
<YearLow>180.20</YearLow>
<YearHigh>290.00</YearHigh >
<HoldingsGainPercent>- - -</HoldingsGainPercent>
<AnnualizedGain>-</Annuali zedGain>
<HoldingsGain />
<HoldingsGainPercentRealti me>N/A - N/A</HoldingsGainPercentRe altime>
<HoldingsGainRealtime />
<MoreInfo>cnprIed</MoreInf o>
<OrderBookRealtime>N/A</Or derBookRea ltime>
<MarketCapitalization>326. 4M</Market Capitaliza tion>
<MarketCapRealtime />
<EBITDA>866.0M</EBITDA>
<ChangeFromYearLow>+95.00< /ChangeFro mYearLow>
<PercentChangeFromYearLow> +52.72%</P ercentChan geFromYear Low>
<LastTradeRealtimeWithTime >N/A - <b>275.20</b></LastTradeRe altimeWith Time>
<ChangePercentRealtime>N/A - +0.66%</ChangePercentRealt ime>
<ChangeFromYearHigh>-14.80 </ChangeFr omYearHigh >
<PercebtChangeFromYearHigh >-5.10%</P ercebtChan geFromYear High>
<LastTradeWithTime>6:39am - <b>275.20</b></LastTradeWi thTime>
<LastTradePriceOnly>275.20 </LastTrad ePriceOnly >
<HighLimit />
<LowLimit />
<DaysRange>271.20 - 278.50</DaysRange>
<DaysRangeRealtime>N/A - N/A</DaysRangeRealtime>
<FiftydayMovingAverage>272 .622</Fift ydayMoving Average>
<TwoHundreddayMovingAverag e>229.434< /TwoHundre ddayMoving Average>
<ChangeFromTwoHundreddayMo vingAverag e>+45.766< /ChangeFro mTwoHundre ddayMoving Average>
<PercentChangeFromTwoHundr eddayMovin gAverage>+ 19.95%</Pe rcentChang eFromTwoHu ndreddayMo vingAverag e>
<ChangeFromFiftydayMovingA verage>+2. 578</Chang eFromFifty dayMovingA verage>
<PercentChangeFromFiftyday MovingAver age>+0.95% </PercentC hangeFromF iftydayMov ingAverage >
<Name>BR.AIRWAYS ORD 25</Name>
<Notes>-</Notes>
<Open>272.60</Open>
<PreviousClose>273.40</Pre viousClose >
<PricePaid />
<ChangeinPercent>+0.66%</C hangeinPer cent>
<PriceSales>3.89</PriceSal es>
<PriceBook>15.54</PriceBoo k>
<ExDividendDate>N/A</ExDiv idendDate>
<PERatio />
<DividendPayDate>N/A</Divi dendPayDat e>
<PERatioRealtime />
<PEGRatio>2.24</PEGRatio>
<PriceEPSEstimateCurrentYe ar>23.01</ PriceEPSEs timateCurr entYear>
<PriceEPSEstimateNextYear> 9.60</Pric eEPSEstima teNextYear >
<Symbol>BAY.L</Symbol>
<SharesOwned />
<ShortRatio />
<LastTradeTime>6:39am</Las tTradeTime >
<TickerTrend> ==-===& nbsp;</Tic kerTrend>
<OneyrTargetPrice>318.33</ OneyrTarge tPrice>
<Volume>1905332</Volume>
<HoldingsValue />
<HoldingsValueRealtime />
<YearRange>180.20 - 290.00</YearRange>
<DaysValueChange>- - +0.66%</DaysValueChange>
<DaysValueChangeRealtime>N /A - N/A</DaysValueChangeRealti me>
<StockExchange>London</Sto ckExchange >
<DividendYield />
<PercentChange>+0.66%</Per centChange >
</quote>
</results>
</query>
- <!-- total: 273
-->
- <!-- yqlengine1.pipes.ch1.yahoo .com compressed/chunked Thu Nov 25 03:56:34 PST 2010
-->
The data you list is the same type of data that you can get from yahoo apis, and would therefore wonder why you don't utilize xml.
The YQL console to construct query strings can be found at:
http://developer.yahoo.com/yql/console/
Below is an example of a returned xml for BAY.L (I build the string in VB.net and call the page using webclient)
http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20yahoo.finance.quotes%20where%20symbol%20in%20(%22BAY.L%22)&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys
The XML:
<?xml version="1.0" encoding="UTF-8" ?>
- <query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng" yahoo:count="1" yahoo:created="2010-11-25T
- <results>
- <quote symbol="BAY.L">
<Ask>275.30</Ask>
<AverageDailyVolume>789274
<Bid>275.10</Bid>
<AskRealtime>275.30</AskRe
<BidRealtime>275.10</BidRe
<BookValue>17.59</BookValu
<Change_PercentChange>+1.8
<Change>+1.80</Change>
<Commission />
<ChangeRealtime>+1.80</Cha
<AfterHoursChangeRealtime>
<DividendShare>0.00</Divid
<LastTradeDate>11/25/2010<
<TradeDate />
<EarningsShare>-1.03</Earn
<ErrorIndicationreturnedfo
<EPSEstimateCurrentYear>11
<EPSEstimateNextYear>28.48
<EPSEstimateNextQuarter>-3
<DaysLow>271.20</DaysLow>
<DaysHigh>278.50</DaysHigh
<YearLow>180.20</YearLow>
<YearHigh>290.00</YearHigh
<HoldingsGainPercent>- - -</HoldingsGainPercent>
<AnnualizedGain>-</Annuali
<HoldingsGain />
<HoldingsGainPercentRealti
<HoldingsGainRealtime />
<MoreInfo>cnprIed</MoreInf
<OrderBookRealtime>N/A</Or
<MarketCapitalization>326.
<MarketCapRealtime />
<EBITDA>866.0M</EBITDA>
<ChangeFromYearLow>+95.00<
<PercentChangeFromYearLow>
<LastTradeRealtimeWithTime
<ChangePercentRealtime>N/A
<ChangeFromYearHigh>-14.80
<PercebtChangeFromYearHigh
<LastTradeWithTime>6:39am - <b>275.20</b></LastTradeWi
<LastTradePriceOnly>275.20
<HighLimit />
<LowLimit />
<DaysRange>271.20 - 278.50</DaysRange>
<DaysRangeRealtime>N/A - N/A</DaysRangeRealtime>
<FiftydayMovingAverage>272
<TwoHundreddayMovingAverag
<ChangeFromTwoHundreddayMo
<PercentChangeFromTwoHundr
<ChangeFromFiftydayMovingA
<PercentChangeFromFiftyday
<Name>BR.AIRWAYS ORD 25</Name>
<Notes>-</Notes>
<Open>272.60</Open>
<PreviousClose>273.40</Pre
<PricePaid />
<ChangeinPercent>+0.66%</C
<PriceSales>3.89</PriceSal
<PriceBook>15.54</PriceBoo
<ExDividendDate>N/A</ExDiv
<PERatio />
<DividendPayDate>N/A</Divi
<PERatioRealtime />
<PEGRatio>2.24</PEGRatio>
<PriceEPSEstimateCurrentYe
<PriceEPSEstimateNextYear>
<Symbol>BAY.L</Symbol>
<SharesOwned />
<ShortRatio />
<LastTradeTime>6:39am</Las
<TickerTrend> ==-===&
<OneyrTargetPrice>318.33</
<Volume>1905332</Volume>
<HoldingsValue />
<HoldingsValueRealtime />
<YearRange>180.20 - 290.00</YearRange>
<DaysValueChange>- - +0.66%</DaysValueChange>
<DaysValueChangeRealtime>N
<StockExchange>London</Sto
<DividendYield />
<PercentChange>+0.66%</Per
</quote>
</results>
</query>
- <!-- total: 273
-->
- <!-- yqlengine1.pipes.ch1.yahoo
-->
Using linq to xml, you could easily "spot" empty tags etc, and there would not be an issue with identifying where a field / column ended.
ASKER
Excellent advice. Thanks!
ASKER
The allowed tags and fields are as follows....
a Ask
a2 Average Daily Volume
a5 Ask Size
b Bid
b4 Book Value
b6 Bid Size
c Change & Percent Change
c1 Change
c3 Commission
d Dividend/Share
d1 Last Trade Date
d2 Trade Date
e Earnings/Share
e1 Error Indication (returned for symbol changed / invalid)
e7 EPS Estimate Current Year
e8 EPS Estimate Next Year
e9 EPS Estimate Next Quarter
f6 Float Shares
g Day’s Low
h Day’s High
j 52-week Low
k 52-week High
g1 Holdings Gain Percent
g3 Annualized Gain
g4 Holdings Gain
i More Info
j1 Market Capitalization
j4 EBITDA
j5 Change From 52-week Low
j6 Percent Change From 52-week Low
k3 Last Trade Size
k4 Change From 52-week High
k5 Percebt Change From 52-week High
l Last Trade (With Time)
l1 Last Trade (Price Only)
l2 High Limit
l3 Low Limit
m Day’s Range
m3 50-day Moving Average
m4 200-day Moving Average
m5 Change From 200-day Moving Average
m6 Percent Change From 200-day Moving Average
m7 Change From 50-day Moving Average
m8 Percent Change From 50-day Moving Average
n Name
n4 Notes
o Open
p Previous Close
p1 Price Paid
p2 Change in Percent
p5 Price/Sales
p6 Price/Book
q Ex-Dividend Date
r P/E Ratio
r1 Dividend Pay Date
r5 PEG Ratio
r6 Price/EPS Estimate Current Year
r7 Price/EPS Estimate Next Year
s Symbol
s1 Shares Owned
s7 Short Ratio
t1 Last Trade Time
t6 Trade Links
t7 Ticker Trend
t8 1 yr Target Price
v Volume
v1 Holdings Value
w 52-week Range
w1 Day’s Value Change
x Stock Exchange
y Dividend Yield
As explained on http://cliffngan.net/a/13..."http://finance.yahoo.com/d/quotes.csv?s= a BUNCH of STOCK SYMBOLS separated by “+” &f=a bunch of special tags
**s=SYMBOL**: where SYMBOL is the actual stock symbol (e.g. 0001.HK)
**f=REQUESTED_FORMAT_STRIN
for example: http://finance.yahoo.com/d/quotes.csv?s=XOM+BBDb.TO+JNJ+MSFT&f=snd1l1yr"
So, for me to get ALL data for a stock like Microsoft Corporation (symbol MSFT), I would use ...
http://finance.yahoo.com/d/quotes.csv?s=MSFT&f=aa2a5bb4b6cc1c3dd1d2ee1e7e8e9f6ghjkg1g3g4ij1j4j5j6k3k4k5ll1l2l3mm3m4m5m6m7m8nn4opp1p2p5p6qrr1r5r6r7ss1s7t1t6t7t8vv1ww1xy
If you copy this line and put it in your browser's address bar, you can download the CSV file. You will see that you get 76 columns of data where you should only get 71.
One of the dirty fields is the a5 tag (Ask Size). If you change the above line to only download the a5 data, you will see that it shows up as 2 columns. Try it with this http request...
http://finance.yahoo.com/d/quotes.csv?s=MSFT&f=a5
There are other dirty fields too. These dirty fields contain commas. Any CSV file reader (like excel) will see these commas as delimiters for different columns. In the case of the a5 tag, a value of 75,300 shows up as two values (75 and 300).
Knowing this, there may be other dirty data as well that does not show up in testing.
So, I used each field tag in an independent HTTP request to get all of the data....like this....
http://finance.yahoo.com/d/quotes.csv?s=MSFT&f=a
http://finance.yahoo.com/d/quotes.csv?s=MSFT&f=a2
http://finance.yahoo.com/d/quotes.csv?s=MSFT&f=a5
http://finance.yahoo.com/d/quotes.csv?s=MSFT&f=b4
...and so on...
This makes sure that any dirty data transmitted cannot not pollute the data pool. Unfortunately it also increases the time to get the data from less than a second to over 48 seconds for a single stock - increasing the time to get all data from around 2 hours to over 90 hours.
I do not know what data will yield the best results, so I do need to capture all of it. Even if I cut out the known dirty fields, that does not mean the more dirty data will come from yahoo in other fields in the future.
The data is not real time and does not need to be.
Thanks for your twopence. I hope this explanation has made the question clearer.