Manipulate Web Page Via PowerShell

Sam Jacobs
I'm looking to manipulate a web page via PowerShell when there is no id field on the HTML element.
Let's take Google as an example. Some elements have an id:
[Image: element with an id]
... and some don't; they only have one or more class names:
[Image: element with only a class name]
I can use the following code to change the CSS of an element with an ID:
$SiteURL = "https://www.google.com/"   
$google = New-Object -ComObject "InternetExplorer.Application"
$google.visible = $true
$google.Navigate2($SiteURL)
# Wait until the document is loaded and ready (ReadyState 4 = complete)
Write-Host "Waiting for document to load "
while ($google.ReadyState -ne 4) {
    Write-Host "." -NoNewLine
    Start-Sleep 1
}
$doc = $google.Document
$id = [System.__ComObject].InvokeMember("getElementById",[System.Reflection.BindingFlags]::InvokeMethod, $null, $doc, 'lga')
"Current CSS: $($id.style.csstext)"
$id.style.csstext = "display:none;"


... but how can I change the CSS of an element with only a class?
David Favor, Fractional CTO
Distinguished Expert 2018

Commented:
Note: Code you're writing is considered Bot code (non-human interactions)...

First problem is Google blocks Bot interactions with their sites.

Likely code as simple as yours will be caught + blocked in a way which is very difficult to catch + debug.

Here's how to get your code working.

1) Setup your own page somewhere for testing, which allows Bot access to manipulate the page.

2) If you'll only be manipulating pure HTML sites, then your code will work.

3) If you're interacting with normal sites that fire JavaScript, your code will fail... again... in ways that are difficult to catch + debug.

Note: If you're writing a general-purpose tool like this, consider http://phantomjs.org/, a headless WebKit-based browser that runs JavaScript.

Tip: Almost every major site these days checks for Bot interactions + blocks them in various ways... so... a better approach is likely to check the site's docs for API access. For example, Google provides API access for many of its services.
Sam Jacobs, Director of Technology Development, IPM

Author

Commented:
David,

Thanks for your response.  Maybe I wasn't being clear. The code provided above does work (please feel free to try it).
I am quite aware that interacting with Google is best accomplished via their API.
I provided it solely as an example of what I am trying to accomplish with another website (without an API).

I respectfully disagree with your definition of Bot code. The code provided is quite similar to what Google would see coming from an actual human interaction. If I repeated the process many times in a short time span, that would be a different story.

-Sam
Qlemo, "Batchelor", Developer and EE Topic Advisor
Top Expert 2015

Commented:
Indeed, the only way forward might be to walk the collections at some level of the hierarchy: go through all children with a certain name, and count them or check for a particular text, attribute, or whatever else distinguishes items of the same class.
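
One way to read the suggestion above is as a filter over a collection of child elements. The sketch below is only an illustration: `Select-ByClassName` is a hypothetical helper, not part of PowerShell or the IE object model, and the `[pscustomobject]` stand-ins exist only so it runs without a browser. It assumes each element exposes a `className` string, as the elements in this thread do.

```powershell
# Hypothetical helper (not a built-in): return the elements whose class list
# contains $ClassName as a whole, case-sensitive token.
function Select-ByClassName {
    param(
        $Element,             # a collection of elements to search
        [string] $ClassName   # the class name to look for
    )
    foreach ($el in $Element) {
        # className holds every class on the element, separated by whitespace
        $tokens = "$($el.className)" -split '\s+' | Where-Object { $_ }
        if ($tokens -ccontains $ClassName) { $el }
    }
}

# Stand-in objects (assumption) so the sketch runs without a browser
$children = @(
    [pscustomobject]@{ className = 'gb_Q gb_R'; innerText = 'First'  }
    [pscustomobject]@{ className = 'gb_Qx';     innerText = 'Second' }
    [pscustomobject]@{ className = '';          innerText = 'Third'  }
)

$hits = @(Select-ByClassName -Element $children -ClassName 'gb_Q')
$hits | ForEach-Object { $_.innerText }
```

Against a live page this might be called as `Select-ByClassName -Element $doc.getElementsByTagName('div') -ClassName 'gb_Q'`. Unlike a wildcard match such as `-like 'gb_Q*'`, the token comparison would not pick up a different class like `gb_Qx`.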

Scott Fell, Developer & EE Moderator
Fellow 2018
Most Valuable Expert 2013

Commented:
Hi Sam,

I am not well versed in PowerShell. Qlemo is my go-to for PS, so I can only offer an idea and not full code. Perhaps if you want to scrape using VBS I can come up with something. Can you try using
ParsedHtml.body.getElementsByClassName('gb_Q') 


or
ParsedHtml.body.getElementsByTagName('div') |  Where {$_.getAttributeNode('class').Value -eq 'gb_Q'}


Sam Jacobs, Director of Technology Development, IPM

Author

Commented:
Hi Scott,

Thanks for your reply.

Sorry, I should have mentioned that I already have the commands to retrieve the needed DOM objects.
I'm using the following:
$classes = $doc.getElementsByClassName('gb_Q')


I find retrieval by class name to be much faster than by tag name, which could be done with:
$divs = $doc.getElementsByTagName('div') |  Where className -like 'gb_Q*'


What I am seeking assistance with is how to modify the style of the elements (e.g. set to display:none;) once found.
I can modify a get/set attribute like innerText:
$classes[0].innerText = "My text"


However, style seems to be a read-only property.

Thanks!
Sam
Scott Fell, Developer & EE Moderator
Fellow 2018
Most Valuable Expert 2013
Commented:
Again, I don't know PowerShell, but if you can read it, then you can rewrite it.

Perhaps this example may help in your scraping.

Assume a simple page
<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width">
  <title>test</title>
</head>
<body>
  <div class="test">Text</div>
</body>
</html>



There are multiple ways you can hide <div class="test">.
Add an inline style of display:none, which is what the example in your question does:
<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width">
  <title>test</title>
</head>
<body>
  <div class="test" style="display:none;">Text</div>
</body>
</html>


Add CSS between style tags to do the same:
<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width">
  <title>test</title>
  <style>
    .test{display:none;} 
  </style>
</head>
<body>
  <div class="test">Text</div>
</body>
</html>


In pure JavaScript:
<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width">
  <title>test</title>
  <script>
    
    window.onload = function() {
      var x = document.getElementsByClassName('test')[0];
      x.style.display = 'none';
    };
  </script>
</head>
<body>
  <div id="test" class="test">Text</div>
</body>
</html>



This may be part of the clue for what you are doing in PS: getElementsByClassName returns an array-like collection (https://developer.mozilla.org/en-US/docs/Web/API/Element/getElementsByClassName), which is why my code uses document.getElementsByClassName('test')[0]. I already know it is the one and only item, but I can't just use document.getElementsByClassName('test') on its own. Your other option would be to loop:
<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width">
  <title>test</title>
  <script>
    
    window.onload = function() {
      var x = document.getElementsByClassName('test');
      var i;
      for (i = 0; i < x.length; i++) {
        x[i].style.display = 'none';
      }
    };
  </script>
</head>
<body>
  <div id="test" class="test">Text</div>
</body>
</html>


I tested this code in Internet Explorer and it works.

I hope this gives some insight for building your PS script. If I were going to do this in VBS, I would add an inline style
 <div class="test" style="display:none;">Text</div>


by finding <div class="test">Text</div> and replacing it.
Sam Jacobs, Director of Technology Development, IPM

Author

Commented:
Scott ... thanks again for your detailed response.
I am quite familiar with how to do it in JavaScript.
I am also quite familiar with manipulating the DOM in PowerShell, including how to use and iterate through getElementById, getElementsByClassName, and getElementsByTagName in PowerShell.
What I am not familiar with is how to modify the style of a class or a <div> in PowerShell.
Sam Jacobs, Director of Technology Development, IPM

Author

Commented:
Scott ... great minds think alike ... I had also thought about replacing .outerHTML to include the modified style (which would of course override any style sheets). I was just about to try it when I reread your post and saw that you had suggested it as well, so the points go to you!
(I still think there must be some way to modify the attributes of a style directly.) Thanks!
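
The .outerHTML idea mentioned above could be sketched roughly as follows. `Add-InlineStyle` is a hypothetical helper, and it assumes the element's opening tag has no `style` attribute yet and that the style text contains no `$` characters (the regex replacement would treat those specially); a real page may need more careful handling.

```powershell
# Hypothetical helper (not a built-in): insert style="..." into the first
# opening tag of an HTML fragment. Assumes no existing style attribute.
function Add-InlineStyle {
    param(
        [string] $OuterHtml,  # e.g. the value read from $element.outerHTML
        [string] $Style       # e.g. 'display:none;'
    )
    # Match the tag name at the very start of the fragment
    $openTag = [regex] '^\s*<([A-Za-z][A-Za-z0-9]*)'
    # Re-emit the tag name followed by the new attribute, first match only
    $openTag.Replace($OuterHtml, ('<$1 style="{0}"' -f $Style), 1)
}

$html = '<div class="test">Text</div>'
$result = Add-InlineStyle -OuterHtml $html -Style 'display:none;'
$result
```

With a live element, the result would be assigned back, for example `$el.outerHTML = Add-InlineStyle -OuterHtml $el.outerHTML -Style 'display:none;'`. As the rest of the thread shows, setting `$el.style.display` directly turns out to be simpler.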
Qlemo, "Batchelor", Developer and EE Topic Advisor
Top Expert 2015
Commented:
Just cross-checked. There is (of course) no difference between the style object you get from getElementById(...) and the one you get from getElementsByClassName(...)[0]. Not in PowerShell or any other scripting language.
Sam Jacobs, Director of Technology Development, IPM

Author

Commented:
OMG ... you are correct ... I was assuming that because .style.csstext was blank, it wasn't working.
I could've sworn that I had tried it earlier and it failed (maybe I had forgotten to include the index for getElementsByClassName(...) when I tried it last).
But I tried it just now, and it DOES work!
Thanks!
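
Putting the conclusion above into code, a minimal sketch might look like this. `Hide-Element` is a hypothetical helper name, and the `[pscustomobject]` stand-ins are an assumption so the sketch runs without Internet Explorer; with the `$doc` from the question you would pass `$doc.getElementsByClassName('gb_Q')` instead.

```powershell
# Hypothetical helper (not a built-in): hide every element in a collection by
# setting display on its style object, just as one would after getElementById.
function Hide-Element {
    param($Element)   # one element or a collection of elements
    $count = 0
    foreach ($el in $Element) {
        # style is writable property by property, even when cssText reads blank
        $el.style.display = 'none'
        $count++
    }
    $count            # number of elements changed
}

# Stand-in objects (assumption) mimicking two elements that share a class
$found = @(
    [pscustomobject]@{ className = 'gb_Q'; style = [pscustomobject]@{ display = '' } }
    [pscustomobject]@{ className = 'gb_Q'; style = [pscustomobject]@{ display = '' } }
)

$changed = Hide-Element -Element $found
"Elements hidden: $changed"
```

A blank `.style.cssText` only means the element has no inline style yet (its appearance comes from a stylesheet); it does not mean the property is read-only.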
