• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1674

Getting a connection reset when trying to scrape a website

Hi all, I am getting a connection reset when I try to scrape a website, and I am wondering if there is any workaround for that. I am using Java with Selenium to do this, and we can't seem to catch the error or set the

Error Code ---
Exception in thread "main" java.lang.RuntimeException: java.net.SocketException: Connection reset
	at com.gargoylesoftware.htmlunit.WebClient.download(WebClient.java:2131)
	at com.gargoylesoftware.htmlunit.html.HtmlForm.submit(HtmlForm.java:135)
	at com.gargoylesoftware.htmlunit.html.HtmlImageInput.doClickAction(HtmlImageInput.java:115)
	at com.gargoylesoftware.htmlunit.html.HtmlElement.click(HtmlElement.java:1244)
	at com.gargoylesoftware.htmlunit.html.HtmlElement.click(HtmlElement.java:1195)
	at com.gargoylesoftware.htmlunit.html.HtmlElement.click(HtmlElement.java:1158)
	at com.gargoylesoftware.htmlunit.html.HtmlImageInput.click(HtmlImageInput.java:138)
	at com.gargoylesoftware.htmlunit.html.HtmlImageInput.click(HtmlImageInput.java:99)
	at org.openqa.selenium.htmlunit.HtmlUnitWebElement.submitForm(HtmlUnitWebElement.java:236)
	at org.openqa.selenium.htmlunit.HtmlUnitWebElement.submit(HtmlUnitWebElement.java:191)
	at HerculesImageLookup.hercImageLookup(HerculesImageLookup.java:158)
	at HerculesImageLookup.main(HerculesImageLookup.java:56)
Caused by: java.net.SocketException: Connection reset
	at java.net.SocketInputStream.read(Unknown Source)
	at java.net.SocketInputStream.read(Unknown Source)
	at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:187)
	at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:176)
	at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:197)
	at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:155)
	at com.gargoylesoftware.htmlunit.HttpWebConnection.downloadContent(HttpWebConnection.java:605)
	at com.gargoylesoftware.htmlunit.HttpWebConnection.downloadResponseBody(HttpWebConnection.java:587)
	at com.gargoylesoftware.htmlunit.HttpWebConnection.getResponse(HttpWebConnection.java:153)
	at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseFromWebConnection(WebClient.java:1439)
	at com.gargoylesoftware.htmlunit.WebClient.loadWebResponse(WebClient.java:1358)
	at com.gargoylesoftware.htmlunit.WebClient.download(WebClient.java:2127)
	... 11 more



Java Code

/*
 * Reads in the imagereq file and logs onto the Hercules website to check for
 * the missing image files listed. If they are found they are saved to disk
 * and removed from imagereq. Otherwise the original check date for that pronum
 * is checked and if it is over 30 days old it is removed from the imagereq list.
 */

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.SocketException;
import java.net.URL;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class HerculesImageLookup {

    public static void main(String[] args){

        String input, proNum, need;
        String date;
        boolean bol, pod;

        final long LAST_MONTH = System.currentTimeMillis() - 2592000000L;

        try {
            BufferedReader readIn = new BufferedReader( new FileReader("F:\\tmp\\Rob\\imagereq.txt"));
            BufferedWriter writeOut = new BufferedWriter( new FileWriter("F:\\tmp\\Rob\\tempimagereq.txt") );

            input = readIn.readLine();

            while( input != null){
                need = null;
                proNum = input.substring(0, input.indexOf(" "));
                bol = input.contains("bol");
                pod = input.contains("pod");
                if( !input.contains( "#" ) ){
                    date = Long.toString( System.currentTimeMillis() );
                }
                else{
                    date = input.substring( input.indexOf("#")+1 );
                }

                if( Long.parseLong(date) > LAST_MONTH ){
                    try{
                        need = hercImageLookup( proNum, bol, pod );

                        if( need != null ){
                            System.out.println(proNum + need + " #" + date);
                            writeOut.write(proNum + need + " #" + date);
                            writeOut.newLine();
                            writeOut.flush();
                        }
                    }catch (SocketException e){
                        System.out.println("Connection Reset");
                    }
                    // Check/Update Sql Database

                }

                input = readIn.readLine();
            }

            readIn.close();
            writeOut.close();

            File source = new File("F:\\tmp\\Rob\\tempimagereq.txt");
            File dest = new File("F:\\tmp\\Rob\\imagereq.txt");
            dest.delete();
            source.renameTo(dest);

        } catch (SocketException e) {
            System.out.println("Connection Error");
        } catch (FileNotFoundException e) {
            System.out.println("File Not Found");
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (InterruptedException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

    }

    /*
     * Looks for a Bill of Lading and Delivery Receipt on the Hercules web site
     * Saves the images in the tmp folder on the C drive if they are found
     */
    public static String hercImageLookup( String proNum, Boolean needBol, Boolean needPod ) throws SocketException, InterruptedException{

        int count = 0;

        String output = "";
        boolean foundBol = !needBol;
        boolean foundPod = !needPod;

        String blFileName = "C:\\tmp\\BL.tiff";
        String drFileName = "C:\\tmp\\DR.tiff";

        File blFile = new File(blFileName);
        File drFile = new File(drFileName);

        blFile.delete();
        drFile.delete();

        try{
            WebDriver driver = new HtmlUnitDriver();

            driver.get("http://www.herculesfreight.com/index.php");

            count = 0;
            while( driver.findElements(By.name("username")).size() == 0 ){
                if(count > 10){
                    throw new InterruptedException();
                }
                else{
                    Thread.sleep(1000);
                    count++;
                }
            }

            WebElement userName = driver.findElement(By.name("username"));
            userName.sendKeys("userName");

            WebElement passWord = driver.findElement(By.name("password"));
            passWord.sendKeys("passWord");

            userName.submit();

            count = 0;
            while( driver.findElements(By.name("Bill")).size() == 0 ){
                if(count > 10){
                    throw new InterruptedException();
                }
                else{
                    Thread.sleep(1000);
                    count++;
                }
            }

            WebElement proInput = driver.findElement( By.name("Bill") );
            proInput.sendKeys( proNum );

            //May be timing out.
            proInput.submit();

            count = 0;
            while( driver.findElements(By.id( "dilink" )).size() == 0 ){
                if(count > 10){
                    throw new InterruptedException();
                }
                else{
                    Thread.sleep(1000);
                    count++;
                }
            }

            WebElement imagesLink = driver.findElement(By.id("dilink"));
            driver.get(imagesLink.getAttribute("href"));

            try {
                if( needBol ){
                    if( driver.findElements(By.linkText("Bill of Lading")).size() != 0 ){
                        saveImage(driver.findElement(By.linkText("Bill of Lading")).getAttribute("href"), blFileName);
                        // foundBol = blFile.exists();

                        //Move Image

                    }
                }
                if( needPod ){
                    if( driver.findElements(By.linkText("Delivery Receipt")).size() != 0 ){
                        saveImage(driver.findElement(By.linkText("Delivery Receipt")).getAttribute("href"), drFileName);
                        // foundPod = drFile.exists();

                        //Move Image

                    }
                }
            } catch (IOException e){}
        } catch (InterruptedException e){
            System.out.println("Taking too long. Skip it.");
        }

        if(!foundBol){
            output = output+" bol";
        }
        if(!foundPod){
            output = output+" pod";
        }

        return output;

    }

    /*
     * Saves the image from the given URL to the given File path
     */
    public static void saveImage(String imageUrl, String destinationFile) throws IOException {
        URL url = new URL(imageUrl);
        InputStream is = url.openStream();
        OutputStream os = new FileOutputStream(destinationFile);

        byte[] b = new byte[2048];
        int length;

        while ((length = is.read(b)) != -1) {
            os.write(b, 0, length);
        }

        is.close();
        os.close();
    }
}



**Java code snippet edited to remove username and password to protect company-specific data.** -JARmod101
Asked by: 1030071002
1 Solution
 
tliotta commented:
A 'reset' is reported when the RST bit is set in a TCP packet that was received. The RST bit might be set by any device along the route. The appropriate action for a client is to close the socket and open a new connection. There is no other technical recovery.

Since any device on the route might be the source, a trace along the entire route might be needed to determine where the RST bit is being set.

There is always the possibility that the "Connection reset" condition is being reported erroneously. Some other condition could exist, and the underlying code might have a bug that defaults to "Connection reset" when an error happens somewhere in that code path. For many packages, you can find bug reports that tell you whether the problem has already been reported and perhaps fixed. You can usually rule that out by making sure you're importing up-to-date packages.

However, the first action should be to catch the error, close the connection and open a new one.
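
One detail in the posted trace matters for the "catch the error" step: the SocketException arrives wrapped in a java.lang.RuntimeException, so a `catch (SocketException e)` clause on its own would not see it. Below is a minimal sketch of catching it anyway, assuming that wrapping is why the existing handler never fires. The class name ResetDetector and the helper isConnectionReset are hypothetical, and the throw statement only stands in for the Selenium submit() call.

```java
import java.net.SocketException;

final class ResetDetector {

    // Walks the cause chain looking for a SocketException, because the trace
    // shows it wrapped as "RuntimeException: java.net.SocketException".
    static boolean isConnectionReset(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof SocketException) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        try {
            // Stand-in (assumption) for proInput.submit(), which failed in the trace.
            throw new RuntimeException(new SocketException("Connection reset"));
        } catch (RuntimeException e) {
            if (!isConnectionReset(e)) {
                throw e; // some other failure: do not swallow it
            }
            System.out.println("Connection reset: close this connection and open a new one");
        }
    }
}
```

In the posted program, the "close and reopen" step would presumably mean calling driver.quit() on the failed HtmlUnitDriver and creating a new one before trying the lookup again.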

In most cases, that should be all there is to it. But you might include a counter for how many times it happens. If it turns out to be a significant ongoing problem, then report (log) the condition as an error and start digging deeper. If it only happens a couple of times per session, you might just let your error handling deal with it and move on. If it's indeed chronic, then some network troubleshooting is called for.
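
The counter idea could look roughly like the sketch below. It is only an illustration: the names ResetRetry, withRetry, causedByReset and resetCount are hypothetical, the limit of 5 attempts is an arbitrary assumption, and the lambda in main stands in for a real lookup. It retries only when a SocketException is somewhere in the cause chain and gives up once the attempt limit is reached.

```java
import java.net.SocketException;
import java.util.function.Supplier;

final class ResetRetry {

    // Running total of resets seen, so a chronic problem can be spotted and logged.
    static int resetCount = 0;

    // True if a SocketException appears anywhere in the cause chain.
    static boolean causedByReset(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof SocketException) {
                return true;
            }
        }
        return false;
    }

    // Runs the attempt, retrying only on connection resets, up to maxAttempts times.
    static <T> T withRetry(Supplier<T> attempt, int maxAttempts) {
        for (int attemptNo = 1; ; attemptNo++) {
            try {
                return attempt.get();
            } catch (RuntimeException e) {
                if (!causedByReset(e)) {
                    throw e; // not a reset: do not mask other failures
                }
                resetCount++;
                if (attemptNo >= maxAttempts) {
                    throw e; // looks chronic: let the caller log it as an error
                }
                System.err.println("Connection reset on attempt " + attemptNo + ", retrying");
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Hypothetical lookup that is reset twice and then succeeds.
        String result = withRetry(() -> {
            if (calls[0]++ < 2) {
                throw new RuntimeException(new SocketException("Connection reset"));
            }
            return "found";
        }, 5);
        System.out.println(result + " after " + resetCount + " resets");
    }
}
```

Each attempt passed to withRetry should build its own connection (for the posted code, a new HtmlUnitDriver), so that a retry never reuses the connection that was reset.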

The source of the RST bit would then need to be verified. Any logs from the device that is setting the bit need to be reviewed to learn why it's happening.

Tom

1030071002 (author) commented:
great
