• Status: Solved

Automated procedure for cutting and pasting sequential web pages.

Hi

I'm doing personal research at a certain web site that only allows me to cut and paste one hundred items per page, one page at a time, to Excel or NotePad out of maybe 20 to 60 pages.  I often need to do this tedious process for various searches.

Is there some way to automate this process so that all pages of the search are cut and pasted sequentially to one Excel or NotePad file?


Thanks in advance.

WS
 
SunBow Commented:
(You mean copy and paste.) It depends.
Try initial workaround in DOS (CMD) window.
Convert each page to text.
Number pages sequentially. (page1.txt, page2.txt)
Use copy command to append them together (pages.txt)
Now there's one file to search, but still full of junk.
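
The append step above can also be scripted. What follows is only a minimal sketch in Python, assuming the pages were already saved by hand as page1.txt, page2.txt, and so on in one folder; the function names `merge_pages` and `page_number` and the header line written before each page are illustrative choices, not part of any tool mentioned in this thread. It sorts by the number in the file name, because a plain alphabetical sort would put page10.txt before page2.txt.

```python
import re
from pathlib import Path


def page_number(path):
    """Return the integer in a name like 'page12.txt' (assumed naming scheme)."""
    match = re.search(r"(\d+)", path.stem)
    return int(match.group(1)) if match else 0


def merge_pages(folder, output_name="pages.txt"):
    """Append page1.txt, page2.txt, ... into one file, in numeric order."""
    folder = Path(folder)
    # Skip the output file itself, since 'pages.txt' also matches the pattern.
    pages = [p for p in folder.glob("page*.txt") if p.name != output_name]
    pages.sort(key=page_number)
    out_path = folder / output_name
    with out_path.open("w", encoding="utf-8") as out:
        for page in pages:
            # A marker line so each page can still be found in the merged file.
            out.write(f"--- {page.name} ---\n")
            out.write(page.read_text(encoding="utf-8").rstrip("\n") + "\n")
    return out_path
```

Calling `merge_pages("C:/research")` (a made-up folder) would write C:/research/pages.txt. The junk mentioned above is still in there; this only saves the manual appending.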

Recognize that a web page is an html file that can thus be copied.
Knowledge of source can have some value.
For example, it may already have its source pages numbered sequentially.
It is likely that there is content such as images that are not needed.
Some images may contain text worth searching, which would require an OCR reader, and there may be tables that may or may not be worth searching.
The page may be laid out in left, middle, and right columns, may offer a 'text version', and may carry numerous links and ads.

Maybe go back and (re)define the goal. For example, if you 'select all' and then paste into NotePad, the images and clickable links go away. But the question also mentions Excel, where everything could be pasted with links preserved or removed, possibly keeping the columns and thus the tabular information.

One thing I do not understand is the main limiting factor beyond user time: "only 100 items per page". I'd like to think that a manual Ctrl+A 'select all' can copy a page in bulk easily enough. Is the per-page limit perhaps applied to outsiders scripting against the site online, to keep us from slowing down the network?

Assumption: for 10-20 pages a manual method should suffice. Create a NotePad .txt file, place a sequence of break markers in it (Page 1, Page 2, Page 3 ... up to 20), then paste each page under its marker. Then continue, process, and see what more (or less) should have been done.

For 10-60 pages, where the prior method works out, copy each link into NotePad (or each URL into an Excel file), pasting once each in sequence. You now have a list of files; suppose 50 filenames, all in sequence. You then ask the question here and receive an answer for how to merge them into a single file. Now you have a single file with all the text in sequence, yet this is not the answer sought. Why not? Notably, Excel was the potential goal, not MS Word, even though a goal was to search (presumably for words). Why? Presumably to dispose of ads and images more easily. So OCR may not be part of the question?

How would you know when & how to go from page 21 to 22 etc.? Is that, or could that be automated?
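
One way to picture the page-to-page step is a loop that keeps asking for the next page number until a page comes back empty. This is a sketch only, in Python: `collect_all_pages` and `page_url` are made-up names, the `{page}` URL pattern is an assumption (the site in question may number its pages differently or not expose a page number at all), and `fetch_page` stands for whatever actually retrieves one page and returns its items.

```python
def page_url(template, page):
    """Fill a page number into a URL pattern such as '...?page={page}' (assumed pattern)."""
    return template.format(page=page)


def collect_all_pages(fetch_page, max_pages=60):
    """Call fetch_page(1), fetch_page(2), ... and stop at the first empty page.

    fetch_page is supplied by the caller: it takes a page number and returns
    the list of items found on that page (an empty list when past the end).
    max_pages is a safety cap so a misbehaving site cannot loop forever.
    """
    all_items = []
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:
            break
        all_items.extend(items)
    return all_items


if __name__ == "__main__":
    # Stand-in for a real site: three pages of results, then nothing.
    fake_site = {1: ["item 1", "item 2"], 2: ["item 3"], 3: ["item 4"]}
    results = collect_all_pages(lambda n: fake_site.get(n, []))
    print(len(results), "items collected")
```

Whether a script is allowed to do this at all depends on the site's terms, which is worth checking before automating anything.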

https://en.wikipedia.org/wiki/Screen_scraping#Screen_scraping
 
WaterStreet (Author) Commented:
Isn't there a Java script or Win 8 macro procedure that can do the heavy lifting on this?

Or, for example, any good macro recorder software?

Thanks

WS
 
SunBow Commented:
You may already know how to manage this, but consider the size issue of 60 pages of unknown content. Since Excel has a limit on rows and NotePad a limit on bytes, why not allow Word, or at least something more like a dump file, say .data or .raw?
 
WaterStreet (Author) Commented:
I'm using TextPad as my default text editor instead of NotePad.  It does not have the size limit of NotePad.  But see what I'm saying about Word, below.

According to MS, in Excel 2013, which I use, the maximum worksheet size is 1048576 rows by 16384 columns.

I had just been ignoring Word as a target for the cut and paste.  Tried it. Actually, Word works better for me than the ".txt" editors for the formats of the copied pages.   Easier to clean up certain things there than with Excel.  Thanks.

I use Chrome as my browser.  Don't know how to do a .data or .raw dump from Chrome, if you were suggesting doing that from the browser.

WS
 
SunBow Commented:
I hesitate to use a term like flat file or variable-length file. My guess would be to prefer plain text without confining it to NotePad, not knowing whether the input is text, pictures, 256 bit-char, another language, etc. I'm just refining the question. By raw I mean the original single file, yet to be massaged. That could become .csv, for example, which is not the same as Excel: it remains text only, massaged into variable-length records, and is not formatted for Word. Raw is unprocessed, unknown content, as is a dump, but the latter more often means bit-for-bit and lossless, which I presume is not necessary; it still depends on the goal. Aramaic? Chinese? I doubt it.
 
WaterStreet (Author) Commented:
None of the above comes even close to answering my question in light of the following:  For example, there is commercial software that does scripting (or macros), I think that some macros can be done in Win 8.1, and there might even be Java code that does this.

I don't really expect a full solution, where I simply hit a hot key, but I do expect to find something that will automate most of the keyboard or mouse operations.

WS
 
WaterStreet (Author) Commented:
Maybe I posted this in the wrong Zones, especially for what I last posted, above.  Asking Community Support for help.
 
SunBow Commented:
I had suspected the zones were right except Web Browsers, and that scripting should be first. You might try swapping in http://www.experts-exchange.com/Programming/Languages/Java/ for the browser zone. I suspected Misc would cover OTC prewritten code or packages, perhaps at cost, and that others (e.g. Viki) would soon have contributed on that, and that the Java issue would have been addressed by those wanting to contribute their own code. I also suspected that a macro emulating keystrokes would be a partial but incomplete solution, and that you had considered programming in VB, C, etc. to be outside the answer set.
 
Qlemo (C++ Developer) Commented:
It is difficult to provide something useful, as your question is wide open. All solutions I can think of require programming, be it in a script or a "real programming" language, and closely tailoring the code to the site it is applied to.

For manual scripting, JavaScript, VBScript, and PowerShell come to mind.
For "macro recording", AutoIt / AutoHotkey might work, but I am not sure about that (I don't use them myself).

Another question is where to put the result. A text file is very different from Excel. Excel can use Web Queries; however, that requires the web page to use tables, otherwise parsing the result (to get to the next page) is very, very complicated. The scripting languages are also able to control Word/Excel via COM Automation, and VBA allows for doing similar things from inside Word/Excel.
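
To illustrate why tables matter: when the results sit in an HTML table, pulling the rows out is fairly mechanical. Below is a rough sketch in Python using only the standard library; `TableRows` and `table_to_csv` are hypothetical names, and it assumes simple, well-formed, non-nested `<tr>`/`<td>` markup, which a real site may not provide. It returns CSV text, which Excel can open.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the pieces and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_csv(html):
    """Return the table rows found in html as CSV text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()
```

If the site builds its result list from other markup instead of a table, this approach does not carry over, and the code has to be fitted to that page's structure.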

So, a lot of options, but probably tedious to establish as well.
 
WaterStreet (Author) Commented:
Qlemo,

Looked at and downloaded both AutoIt and AutoHotkey, and read their reviews.  My guess is that AutoHotkey should meet my various needs.  It's robust with good user reviews.  However, it's said to have a steep learning curve, though with a good manual.

Thanks