@Wimpje
Last active February 8, 2023 12:46
Powershell, split large XML files on node name, with offset support
param(
    [string]$file = $(throw "file is required"),
    $matchesPerSplit = 50,
    $maxFiles = [Int32]::MaxValue,
    $splitOnNode = $(throw "splitOnNode is required"),
    $offset = 0
)
# with a little help of https://gist.github.com/awayken/5861923
$ErrorActionPreference = "Stop";
trap {
    $ErrorActionPreference = "Continue"
    write-error "Script failed: $_ `r`n $($_.ScriptStackTrace)"
    exit (1);
}
$file = (resolve-path $file).path
$fileNameExt = [IO.Path]::GetExtension($file)
$fileNameWithoutExt = [IO.Path]::GetFileNameWithoutExtension($file)
$fileNameDirectory = [IO.Path]::GetDirectoryName($file)
# Stream the input so that large files never have to fit in memory
$reader = [System.Xml.XmlReader]::Create($file)
$matchesCount = $idx = 0
try {
    "Splitting $file on node name='$splitOnNode', with a max of $matchesPerSplit matches per file. Max of $maxFiles files will be generated."
    # Move to the first element named $splitOnNode
    $result = $reader.ReadToFollowing($splitOnNode)
    $hasNextSibling = $true
    while (-not($reader.EOF) -and $result -and $hasNextSibling -and ($idx -lt $maxFiles + $offset)) {
        if ($matchesCount -lt $matchesPerSplit) {
            if($offset -gt $idx) {
                $idx++
                continue
            }
            # Output files are written next to the input file as <name>.<n><ext>
            $to = [IO.Path]::Combine($fileNameDirectory, "$fileNameWithoutExt.$($idx -$offset)$fileNameExt")
            "Writing to $to"
            $toXml = New-Object System.Xml.XmlTextWriter($to, $null)
            $toXml.Formatting = 'Indented'
            $toXml.Indentation = 2
            try {
                # Wrap the matches of this file in a <split cnt="..."> root element
                $toXml.WriteStartElement("split")
                $toXml.WriteAttributeString("cnt", $null, "$idx")
                do {
                    $toXml.WriteRaw($reader.ReadOuterXml())
                    $matchesCount++;
                    $hasNextSibling = $reader.ReadToNextSibling($splitOnNode)
                } while($hasNextSibling -and ($matchesCount -lt $matchesPerSplit))
                $toXml.WriteEndElement();
            }
            finally {
                $toXml.Flush()
                $toXml.Close()
            }
            $idx++
            $matchesCount = 0;
        }
    }
}
finally {
    $reader.Close()
}
@Wimpje
Author

Wimpje commented Jun 5, 2019

Hi! Thanks for the feedback, always those pesky off-by-one errors... I must say I used it for some sanity checking of large files, so I didn't run into the issue. I will fix it later this week, when I'm on my Windows machine :)

@Wimpje
Author

Wimpje commented Jun 11, 2019

Should be fixed now!

@domOrielton

domOrielton commented Oct 1, 2019

Thank you for the excellent code. Not sure if you've seen this issue before, but when I process a very large file (>500 MB) with no line breaks, the output only ever seems to total approx. 280 MB (~295,964,333 bytes), and the script then completes with no errors. Judging from the size, a lot of the entries must be missing, and I can't work out why. All I can think of is that it has something to do with all the text being on a single line. It doesn't seem to make any difference what I set matchesPerSplit to; the output always maxes out at around 280 MB.

Update: I can confirm this issue does occur because of the large file with no line breaks. If I split the file into smaller files, add line breaks, and then join the files again, the script works just fine on a >500 MB file.
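
A rough way to confirm that entries are missing, rather than inferring it from file sizes, is to count the elements in the source file and in each generated file and compare the totals. The sketch below does this with a streaming reader. `Get-XmlElementCount` is a hypothetical helper name, and the inline XML is only a stand-in for a real file.

```powershell
# Hypothetical helper: counts every element called $NodeName, at any depth.
function Get-XmlElementCount {
    param(
        [System.Xml.XmlReader]$Reader,
        [string]$NodeName
    )
    $count = 0
    # ReadToFollowing always advances before it tests a node,
    # so each matching element is counted exactly once.
    while ($Reader.ReadToFollowing($NodeName)) {
        $count++
    }
    return $count
}

# Stand-in input: two <item> elements on a single line
$xml = '<test><list_items><item><id>A</id></item><item><id>B</id></item></list_items></test>'
$reader = [System.Xml.XmlReader]::Create((New-Object System.IO.StringReader -ArgumentList $xml))
try {
    $itemCount = Get-XmlElementCount -Reader $reader -NodeName 'item'
    "Found $itemCount item elements"
}
finally {
    $reader.Close()
}
```

For a real file, `[System.Xml.XmlReader]::Create($path)` with a full path can replace the `StringReader`. If the split files together hold fewer elements than the source, elements are being skipped during the split.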

@vikjon0

vikjon0 commented Jul 28, 2022

The code does not work on all files.
According to the docs, ReadOuterXml will advance the reader to the next tag. What I don't understand is why it sometimes works.
https://docs.microsoft.com/en-us/dotnet/api/system.xml.xmlreader.readouterxml?view=net-6.0

This workaround seems to work in both situations, which I also cannot explain. I have not been able to find a better solution:

if ($reader.Name -eq $splitOnNode) {
    $hasNextSibling = $true
} else {
    $hasNextSibling = $reader.ReadToNextSibling($splitOnNode)
}
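
The workaround fits what the documentation describes: after `ReadOuterXml()` the reader is already positioned on the node that follows the element. When that node is the next matching element, `ReadToNextSibling` moves past it. A minimal sketch of a read loop built on the same idea follows. `Get-SiblingOuterXml` is a hypothetical name, and the extra `NodeType` check is an addition to the workaround above, not part of the original script.

```powershell
# Hypothetical helper: emits the outer XML of each sibling element named $NodeName.
function Get-SiblingOuterXml {
    param(
        [string]$Xml,
        [string]$NodeName
    )
    $reader = [System.Xml.XmlReader]::Create((New-Object System.IO.StringReader -ArgumentList $Xml))
    try {
        $found = $reader.ReadToFollowing($NodeName)
        while ($found) {
            # Emits the element's markup and leaves the reader on the node after it
            $reader.ReadOuterXml()
            if ($reader.NodeType -eq [System.Xml.XmlNodeType]::Element -and $reader.Name -eq $NodeName) {
                # Already on the next match: do not advance again
                $found = $true
            }
            else {
                # On whitespace, an end tag or another element: look for the next sibling
                $found = $reader.ReadToNextSibling($NodeName)
            }
        }
    }
    finally {
        $reader.Close()
    }
}

# Stand-in input without any whitespace between the elements
$compact = '<test><list_items><item><id>A</id></item><item><id>B</id></item></list_items></test>'
Get-SiblingOuterXml -Xml $compact -NodeName 'item'
```

The same loop body could replace the `ReadToNextSibling` call inside the script's `do { ... } while` block, with `$hasNextSibling` in place of `$found`.' -NodeName 'item')
if ($a.Count -ne 2) { throw "compact: expected 2, got $($a.Count)" }
$b = @(Get-SiblingOuterXml -Xml '<r><item>1</item><item>2</item><item>3</item></r>' -NodeName 'item')
if ($b.Count -ne 3) { throw "three items: expected 3, got $($b.Count)" }
$cr = [string][char]13
$c = @(Get-SiblingOuterXml -Xml ('<r>', '<item>1</item>', '<item>2</item>', '</r>' -join $cr) -NodeName 'item')
if ($c.Count -ne 2) { throw "with CR: expected 2, got $($c.Count)" }
if ($a[1] -ne '<item><id>B</id></item>') { throw "unexpected second item: $($a[1])" }
</test>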

@vikjon0

vikjon0 commented Jul 29, 2022

I think the script only works correctly on XML that has a "CR" (carriage return) between the elements?

Please compare output of sample1 & sample2

$global:ErrorActionPreference = "Stop"

$content1 = '<test><list_items><item><id>A</id></item><item><id>B</id></item></list_items></test>' | out-file -Force -filepath D:\test\sample1.xml
$content2 = '<test>' + [char]13 + '<list_items>' + [char]13 +'<item> '+ [char]13 +'<id>A</id>' + [char]13 + '</item>' + [char]13 + '<item>' + [char]13 + '<id>B</id>' + [char]13 + '</item>' + [char]13 + '</list_items>' + [char]13 + '</test>' | out-file -Force -filepath D:\test\sample2.xml

$file = (resolve-path D:\test\sample2.xml).path
#$file = (resolve-path D:\test\sample1.xml).path

$reader = [System.Xml.XmlReader]::Create($file) 

$matchesCount = $idx = 0

try {
  
    $result = $reader.ReadToFollowing("item")
    $hasNextSibling = $true
    while (-not($reader.EOF) -and $result -and $hasNextSibling) {  #JONVIK
          write-host $reader.ReadOuterXml()
          $hasNextSibling = $reader.ReadToNextSibling("item")
    }

}
finally {
    $reader.Close()
}
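
The difference between the two samples can be made visible by asking the reader where it stands right after `ReadOuterXml()`. The sketch below uses in-memory copies of the two samples, so no files are needed. `Get-NodeTypeAfterOuterXml` is a hypothetical helper, and `$sample2` is a slightly simplified version of the CR-separated sample.

```powershell
# Hypothetical helper: reads the first $NodeName element and reports
# the type of the node the reader is positioned on afterwards.
function Get-NodeTypeAfterOuterXml {
    param(
        [string]$Xml,
        [string]$NodeName
    )
    $reader = [System.Xml.XmlReader]::Create((New-Object System.IO.StringReader -ArgumentList $Xml))
    try {
        if (-not $reader.ReadToFollowing($NodeName)) { return $null }
        $null = $reader.ReadOuterXml()   # the markup itself is not needed here
        return $reader.NodeType
    }
    finally {
        $reader.Close()
    }
}

$cr = [string][char]13
$sample1 = '<test><list_items><item><id>A</id></item><item><id>B</id></item></list_items></test>'
$sample2 = '<test>', '<list_items>', '<item>', '<id>A</id>', '</item>', '<item>', '<id>B</id>', '</item>', '</list_items>', '</test>' -join $cr

"sample1: $(Get-NodeTypeAfterOuterXml -Xml $sample1 -NodeName 'item')"   # Element
"sample2: $(Get-NodeTypeAfterOuterXml -Xml $sample2 -NodeName 'item')"   # Whitespace
```

In sample1 the reader already sits on the second `<item>`, so a following `ReadToNextSibling("item")` skips it. In sample2 it sits on a whitespace node, so the same call finds the second `<item>`.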

@AndreiPosto

Hi, many thanks for this code, brilliant!
Could you help me with the line of code needed to extract the id of the node being written, so that I can use that id in the file name rather than the incremental $idx, please?
Many thanks!

@vikjon0

vikjon0 commented Feb 8, 2023

I don't have time to test right now, but the node should be in here:
$reader.ReadOuterXml()

Not sure if you can extract it directly or if you need to load the content into another object first. It needs to be done without moving the reader's position.
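
One possible approach, untested against the original script: keep the string returned by `ReadOuterXml()`, parse that string separately to find the id, then write the same string to the output. Parsing a copy does not touch the reader's position. The sketch assumes the id is either a child element or an attribute called `id`, as in the samples above. `Get-NodeId` and the file-name pattern are hypothetical.

```powershell
# Hypothetical helper: returns the id found in one node's outer XML, or $null.
function Get-NodeId {
    param(
        [string]$OuterXml,
        [string]$IdName = 'id'
    )
    $fragment = New-Object System.Xml.XmlDocument
    $fragment.LoadXml($OuterXml)
    $element = $fragment.DocumentElement

    # Assumption: look for a direct child element first (no namespace handling)
    $child = $element.SelectSingleNode($IdName)
    if ($null -ne $child) { return $child.InnerText }

    # Assumption: otherwise fall back to an attribute with the same name
    if ($element.HasAttribute($IdName)) { return $element.GetAttribute($IdName) }

    return $null
}

$outerXml = '<item><id>A</id></item>'   # stand-in for the result of $reader.ReadOuterXml()
$nodeId = Get-NodeId -OuterXml $outerXml
$fileName = "export.$nodeId.xml"        # hypothetical naming pattern
$fileName
```

In the script this would mean storing the result of `ReadOuterXml()` in a variable before calling `WriteRaw`, and opening the output file only once the id is known. With one file per node, `$matchesPerSplit` would be 1. Ids containing characters that are not valid in file names would need to be sanitized first.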
