musickmann

Search and replace, swapping tags

I have thousands of XML files that contain attribute values I need to swap. They appear in the following tag:

<assessmentItem xmlns="http://www.imsglobal.org/xsd/imsqti_v2p1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.imsglobal.org/xsd/imsqti_v2p1  http://www.imsglobal.org/xsd/qti/qtiv2p1/imsqti_v2p1p1.xsd http://www.w3.org/1998/Math/MathML http://www.w3.org/Math/XMLSchema/mathml2/mathml2.xsd" identifier="choice" title="ABC1234567" adaptive="false" timeDependent="false">

I need to swap the identifier and title, or if swapping is too involved, I really just need the identifier to contain the value of the title.

However, further down in the file there are additional occurrences of both identifier= and title=, so I can't just find and replace them globally.

I'm not sure what the best way to go about this is.
Bill Prew

What platform are you working on (Windows, Unix, Apple, etc...)?

So, you want the updated line in your example to read as below afterwards?

<assessmentItem xmlns="http://www.imsglobal.org/xsd/imsqti_v2p1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.imsglobal.org/xsd/imsqti_v2p1  http://www.imsglobal.org/xsd/qti/qtiv2p1/imsqti_v2p1p1.xsd http://www.w3.org/1998/Math/MathML http://www.w3.org/Math/XMLSchema/mathml2/mathml2.xsd" identifier="ABC1234567" title="choice" adaptive="false" timeDependent="false">


»bp
musickmann

ASKER

I'm running a Mac, but have a Win10 VM as well.

That is the correct re-written line.

What I've been doing is making two replacements:
First:
Find: " title
Replace: " identifier
Second:
Find: xsd" identifier
Replace: xsd" title

However, since I can't be 100% sure there won't be an errant match elsewhere in the file, I am doing this in batches of 300 since the files happen to be in folders of 300 each. As long as each Find/Replace only matches 300 occurrences, then I'll be okay to go on that package. But, with over 85,000 files, this will take a long time :)
Can you supply a full example file?

Would this always be the first match in a file? If so, replacing only the first occurrence per file might get you there.


»bp
Interesting thought. It would be the first occurrence in the file, since it is always in the header area. Attached is a sample full file. These are based on the QTI 2.1 standard from IMS Global.

The content I need to swap will always be in the <assessmentItem> area.
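
The first-occurrence idea could be sketched with a regular expression that is limited to the opening <assessmentItem> tag. This is only an illustration, assuming identifier always comes immediately before title in that tag, as in the sample line above. The function name Swap-FirstIdentifierTitle is hypothetical; the .NET Regex.Replace overload with a count of 1 restricts the change to the first match, so later identifier= and title= pairs are left alone.

```powershell
# Hypothetical helper: swaps the identifier and title values in the first
# <assessmentItem ...> opening tag found in a string.
# Assumes the attribute order identifier="..." title="..." (as in the sample).
function Swap-FirstIdentifierTitle {
    param([string]$Text)

    # Groups: 1 = tag start up to identifier=", 2 = identifier value,
    #         3 = closing quote plus title=", 4 = title value, 5 = closing quote
    $pattern = [regex]'(<assessmentItem\b[^>]*?\bidentifier=")([^"]*)("\s+title=")([^"]*)(")'

    # The third argument (1) limits the replacement to the first match only.
    return $pattern.Replace($Text, '${1}${4}${3}${2}${5}', 1)
}
```

Parsing the file as XML avoids the attribute-order assumption, so the regex route is mainly of interest where no XML parser is at hand.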
Sorry, nothing attached.

»bp
Ha, guess it helps to click the upload file button.
QTI-Question-Sample.xml
ASKER CERTIFIED SOLUTION
Bill Prew

This solution is only available to members.
So this would be the first time I've ever used anything in PowerShell, so that's pretty cool on its own.

It appears to be doing as expected, but I wanted to add a few counts as a check to make sure I see the expected numbers. I think I got it right, but would appreciate a second set of eyes.

In looking through this, another thought popped into my head. It would be a different question altogether, but is it possible to have PowerShell parse through XML files and export data from some of the nodes into a txt file?

 
$folder = 'Z:\LocalFiles\test'
$filter = '*.xml'
$item_count = 0
$identifier_count = 0
$title_count = 0
Get-ChildItem $folder -Filter $filter | 
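Foreach-Object {
    Write-Host $_.FullName
    # Load the file as an XML document
    [xml]$xml = (Get-Content $_.FullName)
    # The root <assessmentItem> element holds the two attributes to swap
    $node = $xml.assessmentItem
    $saveTitle = $node.title
    $node.title = $node.identifier
    # Note: both counters go up once per file processed,
    # regardless of whether the attribute was actually found
    $title_count ++
    $node.identifier = $saveTitle
    $identifier_count ++
    # Write the change back to the same file
    $xml.Save($_.FullName)
    Write-Host "New Title:" $node.title
    Write-Host "New Identifier:" $node.identifier
    $item_count ++
}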
Write-Host "Total Items:" $item_count
Write-Host "Total Identifiers:" $identifier_count
Write-Host "Total Titles:" $title_count
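
In the script above, all three counters increase once per file, so they will always agree with each other. One possible way to make the counts a real check is to count a file as swapped only when its root element is <assessmentItem> and both attributes are present. The sketch below assumes PowerShell 3.0 or later (for -File); the function names Swap-ItemAttributes and Update-ItemFolder are hypothetical and not part of the original solution.

```powershell
# Hypothetical helper: swaps identifier and title on the root element.
# Returns $true if the swap was made, $false if the file was not an item.
function Swap-ItemAttributes {
    param([xml]$Xml)

    $node = $Xml.DocumentElement
    # Only act on documents whose root is <assessmentItem> with both attributes.
    if ($node.LocalName -ne 'assessmentItem' -or
        -not $node.HasAttribute('identifier') -or
        -not $node.HasAttribute('title')) {
        return $false
    }

    $saveTitle = $node.GetAttribute('title')
    $node.SetAttribute('title', $node.GetAttribute('identifier'))
    $node.SetAttribute('identifier', $saveTitle)
    return $true
}

# Hypothetical wrapper: processes every *.xml file in a folder and
# reports how many files were swapped and how many were skipped.
function Update-ItemFolder {
    param([string]$Folder)

    $swapped = 0
    $skipped = 0
    Get-ChildItem -Path $Folder -Filter '*.xml' -File | ForEach-Object {
        $xml = New-Object System.Xml.XmlDocument
        $xml.Load($_.FullName)
        if (Swap-ItemAttributes -Xml $xml) {
            # Save only when something actually changed.
            $xml.Save($_.FullName)
            $swapped++
        } else {
            $skipped++
        }
    }
    [pscustomobject]@{ Swapped = $swapped; Skipped = $skipped }
}
```

Calling Update-ItemFolder -Folder $folder would then return an object whose Swapped value can be compared against the expected number of items per package.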

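
On the question of exporting node data to a text file: the thread does not answer it, but a rough sketch is possible using only the two root attributes that appear in the sample line. The function name Export-ItemSummary and the tab-separated layout are assumptions made for illustration. Data from nested nodes would need extra handling (for example SelectNodes with an XmlNamespaceManager), because the files declare a default namespace.

```powershell
# Hypothetical helper: writes one tab-separated line per item file
# (file name, identifier, title) to a text file.
function Export-ItemSummary {
    param(
        [string]$Folder,
        [string]$OutFile
    )

    Get-ChildItem -Path $Folder -Filter '*.xml' -File | ForEach-Object {
        $xml = New-Object System.Xml.XmlDocument
        $xml.Load($_.FullName)
        $node = $xml.DocumentElement
        # Skip files whose root element is not <assessmentItem>.
        if ($node.LocalName -eq 'assessmentItem') {
            "{0}`t{1}`t{2}" -f $_.Name, $node.GetAttribute('identifier'), $node.GetAttribute('title')
        }
    } | Set-Content -Path $OutFile
}
```

The resulting text file opens directly in a spreadsheet, which may make it easier to review identifiers across a whole package.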

I actually do have one followup -
There is one file in the folders that I would want to exclude, it's specifically named imsmanifest.xml. I've tried adding an -exclude option, I tried changing the filter to -include and adding -exclude, but the output is just the script, no action taken.

It isn't critical, as this file's structure doesn't have the same tags, so it more than likely won't match, but I just want to be doubly sure.
Thanks so much, this will be super helpful, and I'm excited about maybe poking around some PowerShell - double win!
You could do the following.  -Filter is faster than -Include / -Exclude, so often preferred for simple selections.

Get-ChildItem $folder -Include "*.xml" -Exclude "imsmanifest.xml" |


»bp
The Include/Exclude combo results in no files being processed, so I wanted to share how I resolved it in case anyone else finds this question helpful.

The end result that is working for line 6 is:
Get-ChildItem "$folder\\*" -file -Exclude "imsmanifest.xml" |

I wasn't clear with the entirety of the directory structure at the beginning of this question as I was just focused on the immediate need, but since the solution was great, it was easy to go back and tweak down to be a little more specific.

The folder structure is:
Package folder
--passages folder
--images folder
--audio folder
--stylesheets folder
items.xml (hundreds of files)
imsmanifest.xml (one file)

I wanted to alter only the items. Using just the -Exclude option to remove the imsmanifest file, the script still tried to process the folders and presented errors, which didn't hurt anything, but threw off my counts. I expect to see 300 at the end of each package, but was getting anything from 301-303, which made me pause to figure out what happened and review the console.

With the above solution, only the item files are being processed, so my counts should be 300 unless truly something unexpected happened, so I can streamline my process, run the script and move forward without double checking each package.

I also added another count check by creating the new folder/file list as a variable
$list = Get-ChildItem "$folder\\*" -file -Exclude "imsmanifest.xml"
And at the end in my little report section adding $list.count as Total Items in Package. This way, if a package has a different number of items, I'll know without thinking something went wrong.
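
As far as I understand the Get-ChildItem documentation, -Include only takes effect when the path ends in a wildcard such as \* (or when -Recurse is used), which is probably why the -Include/-Exclude combination returned nothing. An alternative sketch that keeps the faster -Filter and drops the manifest afterwards is shown below. The function name Get-ItemFiles is hypothetical, and -File assumes PowerShell 3.0 or later.

```powershell
# Hypothetical helper: returns only the item files in a package folder.
function Get-ItemFiles {
    param([string]$Folder)

    # -Filter keeps only *.xml; -File skips the passages, images, audio
    # and stylesheets subfolders; Where-Object drops the manifest by name.
    Get-ChildItem -Path $Folder -Filter '*.xml' -File |
        Where-Object { $_.Name -ne 'imsmanifest.xml' }
}

# Usage, with the example folder from earlier in the thread:
# $list = @(Get-ItemFiles -Folder 'Z:\LocalFiles\test')
# Write-Host "Total Items in Package:" $list.Count
```

Wrapping the call in @( ) keeps $list an array even when a package holds a single item, so .Count behaves the same in every case.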

Thanks again for a great solution and a little primer on PowerShell, Bill!
I think you could also do:

Get-ChildItem $folder\*.xml -Exclude "imsmanifest.xml" |


»bp
I'm revisiting this as it's "that time of year" again to do this, and the company has been through major changes, including the loss of my access to Windows. I'm working on that, but because of all the equipment change and such, just wanted to say thanks again!
Thanks!


»bp