How to scrape data from a web page using .NET

I need to get data from this web page:
https://personal.vanguard.com/us/FundsAllHoldings?FundId=0986&FundIntExt=INT&tableName=Equity&tableIndex=0

I tried WebClient and HttpWebRequest, but the content is not downloaded in full.
I also tried the async methods, but I still get only some of the data.
How can I scrape the page to get the information for all 141 holdings?
countrymeister Asked:
Fernando Soto (Retired) Commented:
Hi countrymeister;

When you screen scrape a web page, you can only scrape what is currently on that page. If you want all 141 entries, you will need to scrape each of the individual pages. You can do this by changing the last number in the URL. For example, the URL for the first page ends in &tableIndex=0; for page two, use &tableIndex=1, and so on for the other pages.
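
A minimal sketch of that idea, assuming C# 9 top-level statements and 50 holdings per page (the page size is an assumption here, and BuildPageUrl is a hypothetical helper name, not part of any library):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: builds the URL for one page of holdings
// by changing only the trailing tableIndex value.
static string BuildPageUrl(int tableIndex) =>
    "https://personal.vanguard.com/us/FundsAllHoldings?FundId=0986&FundIntExt=INT" +
    "&tableName=Equity&tableIndex=" + tableIndex;

const int totalHoldings = 141; // total stated in the question
const int pageSize = 50;       // assumed number of rows per page

// Round up so that a partial last page is still requested
int pageCount = (totalHoldings + pageSize - 1) / pageSize;

var urls = new List<string>();
for (int i = 0; i < pageCount; i++)
{
    urls.Add(BuildPageUrl(i));
}

// Print the URLs that would need to be scraped
foreach (string u in urls)
{
    Console.WriteLine(u);
}
```

Each URL in the list would then be downloaded and parsed separately.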
countrymeister (Author) Commented:
Hi Fernando Soto,
The first page (tableIndex=0) currently shows holdings 1-50 of the 141.
If you open the link provided, you should see that.

If I right-click on the page and choose View Source, I can see the information for those 50 holdings.
But I cannot see it when I use WebClient or HttpWebRequest to get the data.
The data is truncated: I do not see even one row (<tr>) of the holdings.
Fernando Soto (Retired) Commented:
Please post your code that will show the issue.
countrymeister (Author) Commented:
Hi Fernando Soto,

Here is the code:

// Requires: using System.Net;
string downloadFile = "https://personal.vanguard.com/us/FundsAllHoldings?FundId=0986&FundIntExt=INT&tableName=Equity&tableIndex=0&sort=marketValue&sortOrder=desc";
string downloadLocation = "C:\\Download\\VNQ_All_Holdings.xls";

WebClient webClient = new WebClient();
webClient.DownloadFile(downloadFile, downloadLocation);


I have attached the file that gets downloaded
VNQ-All-Holdings.zip
Fernando Soto (Retired) Commented:
Hi countrymeister;

You state that this is the code, "https://personal.vanguard.com/us/FundsAllHoldings?FundId=0986&FundIntExt=INT&tableName=Equity&tableIndex=0&sort=marketValue&sortOrder=desc", but this is the web page you want to screen scrape. I had asked for the .NET code you used for the screen scrape that does not return the complete page. I want to reproduce the issue here so I can see what is going on.
countrymeister (Author) Commented:
Hi Fernando Soto,

I posted it in my previous post, after the two variables that were defined (downloadFile and downloadLocation).
I also attached the zip file containing what was downloaded by the .NET code.
Here is the .NET code again:

WebClient webClient = new WebClient();
webClient.DownloadFile(downloadFile, downloadLocation);
countrymeister (Author) Commented:
Hi Fernando Soto,

Any luck in helping me?
Fernando Soto (Retired) Commented:
Hi countrymeister;

I used the HtmlAgilityPack library, which needs to be added to your project. Its package name in NuGet is HtmlAgilityPack. The code below will load all 141 items.

using System;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;


// A utility class to get an HTML document over HTTP
HtmlWeb hw = new HtmlWeb();
// Index inserted into the URL string to get each page
string page = "0";
// The URL of the page that has the data
string url = @"https://personal.vanguard.com/us/FundsAllHoldings?FundId=0986&FundIntExt=INT&tableName=Equity&tableIndex=" + page + "&sort=marketValue&sortOrder=desc";
// Load the HTML from the web site
HtmlAgilityPack.HtmlDocument doc = hw.Load(url);
// Convert the HTML document into a string so the total count can be searched for
string docStr = doc.DocumentNode.OuterHtml;
// Get the total number of holdings (the pattern assumes a three-digit count before "Holdings")
int lineCount = int.Parse(Regex.Match(docStr, @"(?'Zero'.*)(?'One'\d\d\d)(?'Two'\s*Holdings)").Groups["One"].Value);

// Each page holds 50 rows, so advance the row count by 50 per page
for (int i = 0, lines = 0; lines < lineCount; i++, lines += 50)
{
    // Load the current page (the first page is already loaded above)
    if (i > 0)
    {
        doc = hw.Load(url);
    }
    // Query the HTML document to get every data row, skipping header rows
    var nodes = (from n in doc.DocumentNode.Descendants("table")
                 where n.GetAttributeValue("class", String.Empty) == "dataTable pad "
                 from row in n.Descendants("tr")
                 where !row.Descendants("th").Any()
                 select row).ToList();

    // Iterate through the rows returned from the query
    foreach (HtmlNode node in nodes)
    {
        // Get the individual cells of each row; Elements("td") skips
        // the whitespace text nodes between cells
        foreach (HtmlNode td in node.Elements("td"))
        {
            // You need to format the string the way you want
            Console.Write(td.InnerText.Trim() + "    ");
        }
        Console.WriteLine();
    }
    // Set up for the next page
    page = (i + 1).ToString();
    // Insert the page number into the URL
    url = @"https://personal.vanguard.com/us/FundsAllHoldings?FundId=0986&FundIntExt=INT&tableName=Equity&tableIndex=" + page + "&sort=marketValue&sortOrder=desc";
}
