Summary
My customer had a large document library on SharePoint Online with a deep folder structure. The library holds 2.7 million items (2,743,321 to be exact), and the customer wanted all of those documents downloaded to a local folder. Yes, this is a real scenario and a real issue.
The problem was the time the customer's own script took. The script worked, but it ran for a very long time; performance was a huge issue. Machine resources were also affected, and the downloads ran sequentially with no parallelism.
This article shows the approach I proposed and the working solution delivered to the customer. The post consists mainly of the scripts, so there is little to write by hand. You will need a working knowledge of PowerShell and some familiarity with the MS Graph API.
An approach
- Do not use the PnP PowerShell Get-PnPListItem call. It is a helpful command, but for a library this size it would take forever to complete.
- Instead, get the files' relative URLs through the MS Graph API:
- "https://graph.microsoft.com/v1.0/drives/{drive-id}/list/items?$select=Id%2cWebUrl"
- Get the drive-id value for the document library. Note that this is not the list ID.
- Retrieve only the Id and WebUrl properties from MS Graph, as shown in the URL above.
- Store the Id and WebUrl properties in CSV files, in batches of 100,000 rows.
- Once all the CSV files are generated, run a second script (Phase 2 below).
- That script imports a CSV file and iterates over every web URL.
- For each web URL, create the local folder and download the file if it is not already present. Use the Get-PnPFile command for the download.
- This is the important step: run the script in multiple PowerShell windows so the files download in parallel.
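The paging pattern at the heart of this approach can be sketched as follows. This is a minimal sketch, not the production script: it assumes $headers already carries a valid bearer token, and the drive ID and output path are placeholders.

```powershell
# Assumes $headers already contains "Authorization" = "Bearer <token>".
$driveID = "YOUR-DRIVE-ID"   # placeholder
$uri = "https://graph.microsoft.com/v1.0/drives/$driveID/list/items?`$select=Id,WebUrl"

$rows = @()
while ($null -ne $uri) {
    $response = Invoke-RestMethod -Uri $uri -Method Get -Headers $headers
    # Keep only the two properties requested in $select.
    $rows += $response.value | Select-Object Id, WebUrl
    # Graph returns @odata.nextLink until the last page has been served.
    $uri = $response.'@odata.nextLink'
}
$rows | Export-Csv -Path ".\ListOfIDs-1.csv" -NoTypeInformation
```

The Phase 1 script below applies this same loop, but writes the rows out in 100K-row batches and re-queues itself so a single Azure Function invocation never runs too long.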
Phase 1
In Phase 1, you need a way to produce CSV files containing the Id and WebUrl values from the document library. Each CSV file holds 100K rows, so for 2.7 million records the math works out to 28 CSV files. (Your numbers will differ based on your item count.)
To run the script below you will need the following:
An Azure Function running on a Queue Trigger.
A storage account with an Azure Queue and an Azure Blob container.
An Azure AD app registration with Sites read access to the MS Graph API. (I used full control.)
A drive ID. I used the Graph PowerShell SDK to get it.
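To look up the drive ID with the Graph PowerShell SDK, something like the following works; the site and list identifiers are placeholders you must replace.

```powershell
# Requires the Microsoft.Graph PowerShell SDK and a Graph connection, e.g.:
# Connect-MgGraph -Scopes "Sites.Read.All"
$siteId = "contoso.sharepoint.com,<site-guid>,<web-guid>"   # placeholder
$listId = "<your-list-guid>"                                # placeholder

# The drive that backs the document library; its Id is the {drive-id}
# used in the Graph URL above. It is not the list ID.
$drive = Get-MgSiteListDrive -SiteId $siteId -ListId $listId
$drive.Id
```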
# Input bindings are passed in via param block.
param($QueueItem, $TriggerMetadata)
# Write out the queue message and insertion time to the information log.
Write-Host "PowerShell queue trigger function processed work item: $QueueItem"
Write-Host "Queue item insertion time: $($TriggerMetadata.InsertionTime)"
# Populate with the App Registration details and Tenant ID
$ClientId = "TODO"
$ClientSecret = "TODO"
$queueName = "TODO"
$containerName = "TODO"
$tenantid = "TODO"
$env:AzureWebJobsStorage = "TODO"
$env:LOG_FILE_PATH = "C:\TEMP"
$GraphScopes = "https://graph.microsoft.com/.default"
$driveID = "TODO"
# To get drive id variable execute the following command
# $drives = Get-MgSiteListDrive -SiteId Your-SITE -ListId Your-LIST
# You will need to connect to Graph. Follow this article.
# Get the access token to execute MS Graph calls.
$headers = @{
"Content-Type" = "application/x-www-form-urlencoded"
}
# Formulate body with four parameters.
$body = "grant_type=client_credentials&client_id=$ClientId&client_secret=$ClientSecret&scope=https%3A%2F%2Fgraph.microsoft.com%2F.default"
# Create login URL for the tenant id
$authUri = "https://login.microsoftonline.com/$tenantid/oauth2/v2.0/token"
# Make a POST call to Azure AD login URL
$response = Invoke-RestMethod $authUri -Method 'POST' -Headers $headers -Body $body
# Using Token from the above call, create header with bearer token
$headers = @{
"Content-Type" = "application/x-www-form-urlencoded"
"Authorization" = $("Bearer {0}" -f $response.access_token)
}
#Function to move local file to blob storage
function MoveLogFilesToBlobContainer
{
$storageContainer = New-AzStorageContext -ConnectionString $env:AzureWebJobsStorage | Get-AzStorageContainer -Name $containerName
#Write-Output $storageContainer
Get-ChildItem $env:LOG_FILE_PATH -Filter ListOfIDs*.csv |
Foreach-Object {
$blobNameWithFolder = $("{0}" -f $_.Name)
Write-Output $("Move {0} to {1} Blob Container AS BlobName {2}." -f $_.FullName, $storageContainer.Name, $blobNameWithFolder)
Set-AzStorageBlobContent -File $_.FullName `
-Container $storageContainer.Name `
-Blob $blobNameWithFolder `
-Context $storageContainer.Context -Force
Remove-Item -Path $_.FullName -Force
}
}
#Function to put a message in a queue
function Put2MsgInQueue([Int]$aCounter,[String]$anUrl2Process)
{
$FormattedMessage = $("{0},{1}" -f $aCounter, $anUrl2Process )
Write-Host $FormattedMessage
$context = New-AzStorageContext -ConnectionString $env:AzureWebJobsStorage
$queue = Get-AzStorageQueue -Name $queueName -Context $context
# Create a new message using a constructor of the CloudQueueMessage class
$queueMessage = [Microsoft.Azure.Storage.Queue.CloudQueueMessage]::new($FormattedMessage)
# Add a new message to the queue and wait for the call to complete
$queue.CloudQueue.AddMessageAsync($queueMessage).GetAwaiter().GetResult()
}
function ScrapTheListItems([String]$aRestURI, [int]$batchNumber)
{
$StopWatch = [System.Diagnostics.Stopwatch]::StartNew()
$restURI = $aRestURI
Write-Output $("restURI {0}" -f $restURI);
# 200 rows per API call * 500 calls = 100,000 rows per CSV file.
$batchCountSize = 500 # drop to a small value (e.g. 2) while debugging
#initialize the index and array to start
$batchIndex = 1
$outArray = @()
# MAKE a call to MS GRAPH API using the bearer token header
$response = Invoke-RestMethod $restURI -Method 'GET' -Headers $headers
Write-Output $("response {0}" -f $response);
# Get the next link URL.
$restURI = $response."@odata.nextLink"
while ($null -ne $restURI)
{
# Convert an array with Name & Value pair to an object array.
# This is needed so the object array can be stored as CSV
foreach ( $i in $response.value)
{
$anObj = New-Object PSObject
Add-Member -InputObject $anObj -MemberType NoteProperty -Name 'id' -Value $i.Id
Add-Member -InputObject $anObj -MemberType NoteProperty -Name 'webUrl' -Value $i.webUrl
$outArray += $anObj
}
$totalRows = $batchIndex * 200
Write-Output $("batchIndex : {0}, call to graph API for 200 rows now total is {1}" -f $batchIndex, $totalRows );
if ( $batchIndex -eq $batchCountSize)
{
$exportCsvURLPath = $("{0}\ListOfIDs-{1}.csv" -f $env:LOG_FILE_PATH, $batchNumber )
Write-Output $("Create {0}" -f $exportCsvURLPath);
# Export the accumulated rows to the batch CSV file.
$outArray | Export-Csv -Path "$exportCsvURLPath" -NoTypeInformation -Force
## MOVE TO BLOB CONTAINER
MoveLogFilesToBlobContainer
#initialize the index and array to start
$batchIndex = 1
$outArray = @()
# add file batch number to next
$batchNumber++
###NOW EXIT FROM LOOP
break
}
else
{
$batchIndex++
}
# MAKE a call to MS GRAPH API using the bearer token header
$response = Invoke-RestMethod $restURI -Method 'GET' -Headers $headers
# Get the next link URL.
$restURI = $response."@odata.nextLink"
}
# The last remaining batch may be less than the batch count size
if (($batchIndex -gt 1) -or ($outArray.Count -gt 0))
{
$exportCsvURLPath = $("{0}\ListOfIDs-{1}.csv" -f $env:LOG_FILE_PATH, $batchNumber )
Write-Output $("Create {0}" -f $exportCsvURLPath);
# Export the remaining rows to the batch CSV file.
$outArray | Export-Csv -Path "$exportCsvURLPath" -NoTypeInformation -Force
## MOVE the CSV file TO the BLOB CONTAINER
MoveLogFilesToBlobContainer
}
if ($null -ne $restURI)
{
Put2MsgInQueue -aCounter $batchNumber -anUrl2Process $restURI
}
$StopWatch.Stop()
Write-Output $("Elapsed time in TotalMinutes: {0}" -f $StopWatch.Elapsed.TotalMinutes);
}
# To start this Function add a manual queue message as "START"
if ("START" -eq $QueueItem)
{
Write-Host "We are running for the first time"
$counter = 1
# For first time we need to make a call to MS Graph API
$firstURI = $("https://graph.microsoft.com/v1.0/drives/$driveID/list/items?{0}" -f '$Select=Id%2CWebUrl')
ScrapTheListItems $firstURI $counter
}
else
{
# Function will always fall here with the index#, URL to fetch
$splittedArray = $QueueItem.split(",")
$counter = [int]$splittedArray[0]
ScrapTheListItems $splittedArray[1] $counter
}
Phase 2
In Phase 2, the script downloads the files. It performs the following steps.
Read the CSV file for the passed-in batch number, e.g. 1, 2, 3, …, 28. The Phase 1 CSV files land in blob storage; the script below assumes they have been copied to a local folder ($CSVFilesPath), so change the code if you want to read them from blob storage directly.
Read all 100K records. For each web URL, check whether the local folder and the file already exist.
If not, create the folder and download the file using the Get-PnPFile command.
Run the script in multiple PowerShell prompts so the files download in parallel. Even if you run it twice with the same batch number, the script detects files that were downloaded before and skips them.
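One way to start several batches in parallel, each in its own window, is a small loop like this; the script path and the batch range are assumptions for illustration.

```powershell
# Launch batches 1 through 4, each in its own PowerShell window.
# Adjust the path and the range to your environment.
1..4 | ForEach-Object {
    Start-Process PowerShell -ArgumentList "C:\scripts\DownloadFiles.ps1 -batchNumber $_"
}
```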
# Input bindings are passed in via param block.
param($batchNumber)
### TODO REMOVE LATER ONLY FOR DEBUGGING
$MaxFiles2Get = 10000
$CurrentFileNumber = 0
#Initialize variables
$DownloadLocation = "V:\Verification Documents"
$SiteURL = "https://Contoso.sharepoint.com/sites/LegalDept"
$CSVFilesPath = "C:\LegalDept"
$OrechestratorCSVFileName = "0-Orchestrator.csv"
$env:LOG_FILE_PATH = "C:\LegalDept\Logs"
$global:TotalFilesAlreadyPresent = 0
$global:TotalFilesDownloaded = 0
$global:ConnectPnPDoneFlag = $false
Add-Type -AssemblyName System.Web
function Write-Log
{
[CmdletBinding()]
Param
(
[Parameter(Mandatory=$true,
ValueFromPipelineByPropertyName=$true)]
[ValidateNotNullOrEmpty()]
[Alias("LogContent")]
[string]$Message,
[Parameter(Mandatory=$false)]
[Alias('LogPath')]
[string]$Path='C:\Logs\PowerShellLog.log',
[Parameter(Mandatory=$false)]
[ValidateSet("Error","Warn","Info")]
[string]$Level="Info",
[Parameter(Mandatory=$false)]
[switch]$NoClobber
)
Begin
{
# Set VerbosePreference to Continue so that verbose messages are displayed.
$VerbosePreference = 'Continue'
}
Process
{
# If the file already exists and NoClobber was specified, do not write to the log.
if ((Test-Path $Path) -AND $NoClobber) {
Write-Error "Log file $Path already exists, and you specified NoClobber. Either delete the file or specify a different name."
Return
}
# If attempting to write to a log file in a folder/path that doesn't exist create the file including the path.
elseif (!(Test-Path $Path)) {
Write-Verbose "Creating $Path."
New-Item $Path -Force -ItemType File
}
else {
# Nothing to see here yet.
}
# Format Date for our Log File
$FormattedDate = Get-Date -Format "yyyy-MM-dd HH:mm:ss"
# Write message to error, warning, or verbose pipeline and specify $LevelText
switch ($Level) {
'Error' {
Write-Error $Message
$LevelText = 'ERROR:'
}
'Warn' {
Write-Warning $Message
$LevelText = 'WARNING:'
}
'Info' {
Write-Verbose $Message
$LevelText = 'INFO:'
}
}
# Write log entry to $Path
"$FormattedDate $LevelText $Message" | Out-File -FilePath $Path -Append
## also dump to console
#$savedColor = $host.UI.RawUI.ForegroundColor
#$host.UI.RawUI.ForegroundColor = "DarkGreen"
Write-Output $message
#$host.UI.RawUI.ForegroundColor = $savedColor
}
End
{
}
}
function WriteExceptionInformation($AnItem)
{
Write-Log -Path $LogFileName $AnItem.Exception.Message
Write-Log -Path $LogFileName $AnItem.Exception.StackTrace
<#
Write-Log -Path $LogFileName $AnItem.Exception.ScriptStackTrace
Write-Log -Path $LogFileName $AnItem.InvocationInfo | Format-List *
#>
}
function UpdateOrchestratorInouFile()
{
param (
[string]$Status2update
)
$csvfile = Import-CSV -Path $("{0}\{1}" -f $CSVFilesPath, $OrechestratorCSVFileName)
if ( $null -ne $csvfile)
{
$outArray = @()
foreach ( $aRowInFile in $csvfile)
{
$anObj = New-Object PSObject
Add-Member -InputObject $anObj -MemberType NoteProperty -Name 'BatchNumber' -Value $aRowInFile.BatchNumber
if ($aRowInFile.BatchNumber -eq $batchNumber)
{
# Change status to Status2update
Add-Member -InputObject $anObj -MemberType NoteProperty -Name 'Status' -Value $Status2update
}
else {
# Keep the status as is
Add-Member -InputObject $anObj -MemberType NoteProperty -Name 'Status' -Value $aRowInFile.Status
}
$outArray += $anObj
}
# Important step modify the file.
$outArray | Export-Csv -Path $("{0}\{1}" -f $CSVFilesPath, $OrechestratorCSVFileName) -NoTypeInformation -Force
}
}
function MainWorkerFunc
{
$didFailStatusHappen = $false
$importCsvURLPath = $("ListOfIDs-{0}.csv" -f $batchNumber )
$csvfile = Import-CSV -Path $("{0}\{1}" -f $CSVFilesPath, $importCsvURLPath)
if ( $null -ne $csvfile)
{
UpdateOrchestratorInouFile -Status2update "INPROGRESS"
try {
# sample https://Contoso.sharepoint.com/sites/LegalDept/Verification%20Documents/Documents/FirstnameLastname0877_115502.pdf
foreach ( $aRowInFile in $csvfile)
{
$webUrl2work = $aRowInFile.webUrl
if ( $null -ne $webUrl2work)
{
# URL-decode the web URL
$decodedWebUrl2work = [System.Web.HttpUtility]::UrlDecode($webUrl2work)
$fileName = Split-Path -Path $decodedWebUrl2work -Leaf
$splitArr = $decodedWebUrl2work.split('/')
$filePath = $DownloadLocation
# now build a local path
$idx = 0;
foreach ( $valInArr in $splitArr)
{
# skip the first six segments (indices 0-5): scheme, empty segment, host, 'sites', site name, and library name
if ( $idx -ge 6)
{
# skip the file name
if ( $fileName -ne $valInArr)
{
# append the path to the existing
$filePath = $("{0}\{1}" -f $filePath, $valInArr)
}
}
$idx++
}
#Ensure All Folders in the Local Path
$LocalFolder = $filePath
#Create Local Folder, if it doesn't exist
If (!(Test-Path -Path $LocalFolder))
{
New-Item -ItemType Directory -Path $LocalFolder | Out-Null
}
#Download file , if it doesn't exist
If (!(Test-Path -LiteralPath $("{0}\{1}" -f $filePath, $fileName)))
{
try
{
if ( $global:ConnectPnPDoneFlag -eq $false )
{
Write-Log -Path $LogFileName $("Connecting to {0}" -f $SiteURL);
Connect-PnPOnline $SiteURL -ClientId "TODO" -ClientSecret "*****"
Write-Log -Path $LogFileName $("Connected to {0}" -f $SiteURL);
# since we are connected make this flag true
$global:ConnectPnPDoneFlag = $true
}
# strip the host from the URL
# https://Contoso.sharepoint.com/sites/LegalDept/Verification%20Documents/Documents/FirstnameLastname0877_115502.pdf
# becomes /sites/LegalDept/Verification%20Documents/Documents/FirstnameLastname0877_115502.pdf
$relativeFileURL = ([uri]$webUrl2work).LocalPath
Write-Log -Path $LogFileName $("Download file from {0}." -f $relativeFileURL);
Get-PnPFile -Url $relativeFileURL -Path $filePath -FileName "$fileName" -AsFile
Write-Log -Path $LogFileName $("to {0}\{1}." -f $filePath,$fileName);
$global:TotalFilesDownloaded += 1
}
catch
{
WriteExceptionInformation ( $PSItem )
UpdateOrchestratorInouFile -Status2update "FAILED"
$didFailStatusHappen = $true
### STOP everything if an error occurred
break
}
}
else
{
$global:TotalFilesAlreadyPresent += 1
Write-Log -Path $LogFileName $("File {0}\{1} already downloaded." -f $filePath,$fileName);
}
$CurrentFileNumber += 1
Write-Log -Path $LogFileName $("CurrentFileNumber {0}" -f $CurrentFileNumber);
# TODO REMOVE LATER
if ( $CurrentFileNumber -eq $MaxFiles2Get)
{
break
}
}
}
}
catch {
WriteExceptionInformation ( $PSItem )
UpdateOrchestratorInouFile -Status2update "FAILED"
$didFailStatusHappen = $true
### Stop processing this batch if an error occurred
return
}
finally {
# Runs after the try block whether or not an exception occurred.
#Update complete only if fail did not happen before.
if ( $true -ne $didFailStatusHappen )
{
UpdateOrchestratorInouFile -Status2update "COMPLETE"
}
}
}
}
$StopWatch = [System.Diagnostics.Stopwatch]::StartNew()
$LogFileName = $("{0}\Batch-{1:d2}-Log-{2}.txt" -f $env:LOG_FILE_PATH , $batchNumber, (Get-Date -Format "yyyy-MM-dd-HH-mm-ss"))
Write-Log -Path $LogFileName " *************************************** Start *************************************** "
#Change Window Title
$Host.UI.RawUI.WindowTitle = $("Batch number {0}." -f $batchNumber);
MainWorkerFunc # CALL THE MAIN WORKER FUNCTION
$StopWatch.Stop()
Write-Log -Path $LogFileName " ------------------------------------------------------------------------------------- "
Write-Log -Path $LogFileName $("Batch number {0}." -f $batchNumber);
Write-Log -Path $LogFileName $("Total files already found present: {0}" -f $global:TotalFilesAlreadyPresent);
Write-Log -Path $LogFileName $("Total files downloaded: {0}" -f $global:TotalFilesDownloaded);
Write-Log -Path $LogFileName $("Elapsed time in TotalMinutes: {0}" -f $StopWatch.Elapsed.TotalMinutes);
Write-Log -Path $LogFileName " ------------------------------------------------------------------------------------- "
Write-Log -Path $LogFileName " *************************************** End *************************************** "
Orchestrator PowerShell Script
#Initialize variables
$CSVFilesPath = "C:\LegalDept"
$OrechestratorCSVFileName = "0-Orchestrator.csv"
$env:LOG_FILE_PATH = "C:\LegalDept\Logs"
function WriteExceptionInformation($AnItem)
{
Write-Log -Path $LogFileName $AnItem.Exception.Message
Write-Log -Path $LogFileName $AnItem.Exception.StackTrace
Write-Log -Path $LogFileName $AnItem.Exception.ScriptStackTrace
Write-Log -Path $LogFileName $AnItem.InvocationInfo | Format-List *
}
function Write-Log
{
[CmdletBinding()]
Param
(
[Parameter(Mandatory=$true,
ValueFromPipelineByPropertyName=$true)]
[ValidateNotNullOrEmpty()]
[Alias("LogContent")]
[string]$Message,
[Parameter(Mandatory=$false)]
[Alias('LogPath')]
[string]$Path='C:\Logs\PowerShellLog.log',
[Parameter(Mandatory=$false)]
[ValidateSet("Error","Warn","Info")]
[string]$Level="Info",
[Parameter(Mandatory=$false)]
[switch]$NoClobber
)
Begin
{
# Set VerbosePreference to Continue so that verbose messages are displayed.
$VerbosePreference = 'Continue'
}
Process
{
# If the file already exists and NoClobber was specified, do not write to the log.
if ((Test-Path $Path) -AND $NoClobber) {
Write-Error "Log file $Path already exists, and you specified NoClobber. Either delete the file or specify a different name."
Return
}
# If attempting to write to a log file in a folder/path that doesn't exist create the file including the path.
elseif (!(Test-Path $Path)) {
Write-Verbose "Creating $Path."
New-Item $Path -Force -ItemType File
}
else {
# Nothing to see here yet.
}
# Format Date for our Log File
$FormattedDate = Get-Date -Format "yyyy-MM-dd HH:mm:ss"
# Write message to error, warning, or verbose pipeline and specify $LevelText
switch ($Level) {
'Error' {
Write-Error $Message
$LevelText = 'ERROR:'
}
'Warn' {
Write-Warning $Message
$LevelText = 'WARNING:'
}
'Info' {
Write-Verbose $Message
$LevelText = 'INFO:'
}
}
# Write log entry to $Path
"$FormattedDate $LevelText $Message" | Out-File -FilePath $Path -Append
## also dump to console
#$savedColor = $host.UI.RawUI.ForegroundColor
#$host.UI.RawUI.ForegroundColor = "DarkGreen"
Write-Output $message
#$host.UI.RawUI.ForegroundColor = $savedColor
}
End
{
}
}
function MainOrchestratorFunc {
$csvfile = Import-CSV -Path $("{0}\{1}" -f $CSVFilesPath, $OrechestratorCSVFileName)
if ( $null -ne $csvfile)
{
foreach ( $aRowInFile in $csvfile)
{
Write-Log -Path $LogFileName $("Batch number {0:d2} has Status {1}" -f $aRowInFile.BatchNumber, $aRowInFile.Status )
switch ($aRowInFile.Status.ToUpper())
{
"NEW"
{
Write-Log -Path $LogFileName $("Batch number {0:d2} has '{1}' Status. Spawn this batch and change status to InProgress." -f $aRowInFile.BatchNumber, $aRowInFile.Status )
# spawn the download script with the batch number
SpawnThePowerShellProcess -batchnumber2Process $aRowInFile.BatchNumber
}
"INPROGRESS"
{
Write-Log -Path $LogFileName $("Batch number {0:d2} has '{1}' Status. Do nothing." -f $aRowInFile.BatchNumber, $aRowInFile.Status )
}
"FAILED"
{
Write-Log -Path $LogFileName $("Batch number {0:d2} has '{1}' Status. Spawn this batch and change status to InProgress." -f $aRowInFile.BatchNumber, $aRowInFile.Status )
SpawnThePowerShellProcess -batchnumber2Process $aRowInFile.BatchNumber
}
"COMPLETE"
{
Write-Log -Path $LogFileName $("Batch number {0:d2} has '{1}' Status. Do nothing." -f $aRowInFile.BatchNumber, $aRowInFile.Status )
}
default
{
Write-Log -Path $LogFileName $("Batch number {0:d2} has an INVALID Status {1}" -f $aRowInFile.BatchNumber, $aRowInFile.Status )
}
}
}
}
}
function SpawnThePowerShellProcess {
param (
[int]$batchnumber2Process
)
$processOptions = @{
FilePath = "PowerShell"
WorkingDirectory = "C:\scripts"
ArgumentList = "C:\scripts\DownloadFiles.ps1 -batchNumber $batchnumber2Process"
}
Start-Process @processOptions -Verb RunAs -WindowStyle Normal
}
$StopWatch = [System.Diagnostics.Stopwatch]::StartNew()
$LogFileName = $("{0}\Orchestrator-Log-{1}.txt" -f $env:LOG_FILE_PATH , (Get-Date -Format "yyyy-MM-dd-HH-mm-ss"))
Write-Log -Path $LogFileName " *************************************** Start *************************************** "
MainOrchestratorFunc # CALL THE MAIN ORCHESTRATOR FUNCTION
$StopWatch.Stop()
Write-Log -Path $LogFileName " ------------------------------------------------------------------------------------- "
Write-Log -Path $LogFileName $("Elapsed time in TotalMinutes: {0}" -f $StopWatch.Elapsed.TotalMinutes);
Write-Log -Path $LogFileName " ------------------------------------------------------------------------------------- "
Write-Log -Path $LogFileName " *************************************** End *************************************** "
The Orchestrator CSV file
The Status field is case-insensitive.
Status New means the batch is new and should be spawned
1,New
Status InProgress means the batch is currently running
1,InProgress
Status Failed means the batch failed and needs to run again
1,Failed
Status Complete means the batch is complete, DO NOT run it again
1,Complete
"BatchNumber","Status"
"1","COMPLETE"
"2","NEW"
"3","COMPLETE"
"4","INPROGRESS"
"5","COMPLETE"
"6","COMPLETE"
"7","INPROGRESS"
"8","COMPLETE"
"9","FAILED"
"10","COMPLETE"
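To seed the orchestrator CSV with every batch marked New, a one-off snippet like this can be used (the batch count of 28 and the path are assumptions based on the numbers above):

```powershell
# Create 0-Orchestrator.csv with one New row per batch.
1..28 | ForEach-Object {
    [PSCustomObject]@{ BatchNumber = $_; Status = "New" }
} | Export-Csv -Path "C:\LegalDept\0-Orchestrator.csv" -NoTypeInformation
```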
Conclusion
The script and instructions are rough, but they should be helpful if you are a developer.
The customer used the code from here. It worked, but for the large document library they ran into the issues mentioned in the summary.
This article shows how to download the files in parallel across multiple processes. The Phase 2 script can be run again after a failure and will pick up where it left off, skipping any files already downloaded to the local folder.