@fredeil
Last active June 20, 2020 10:50
Cleaning up orphaned chunks in GridFS
using MongoDB.Bson;
using MongoDB.Driver;
using Serilog;
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

namespace Mongo
{
    internal class Program
    {
        private static async Task<int> Main(string[] args)
        {
            Log.Logger = new LoggerConfiguration()
                .MinimumLevel.Debug()
                .WriteTo.Console()
                .WriteTo.File("mongo.txt")
                .CreateLogger();

            // Connection string left blank in the gist; fill in your own.
            var mongoSettings = MongoClientSettings.FromConnectionString("");
            var mongoClient = new MongoClient(mongoSettings);
            var mongoDatabase = mongoClient.GetDatabase("packagesDb");
            var chunksCollection = mongoDatabase.GetCollection<BsonDocument>("packages.chunks");
            var filesCollection = mongoDatabase.GetCollection<BsonDocument>("packages.files");

            // Let Ctrl+C cancel the scan gracefully instead of killing the process.
            using var cts = new CancellationTokenSource();
            Console.CancelKeyPress += (sender, e) =>
            {
                e.Cancel = true;
                cts.Cancel();
            };

            var sw = new Stopwatch();
            var numberOfChunksDeleted = 0UL;

            try
            {
                // Scan the chunks collection, excluding the binary payload
                // so the batches stay small.
                var chunkFilter = Builders<BsonDocument>.Filter.Empty;
                using var cursor = await chunksCollection.FindAsync(chunkFilter, new FindOptions<BsonDocument, BsonDocument>()
                {
                    BatchSize = 1000,
                    Skip = 0,
                    Limit = 100_000,
                    //Sort = Builders<BsonDocument>.Sort.Descending("_id"),
                    Projection = Builders<BsonDocument>.Projection.Exclude("data")
                }, cts.Token);

                sw.Start();

                // Collect the distinct files_id values referenced by the chunks.
                var uniqueFiles = new HashSet<string>();
                while (await cursor.MoveNextAsync(cts.Token))
                {
                    var batch = cursor.Current;
                    foreach (var gridFSChunk in batch)
                    {
                        uniqueFiles.Add(gridFSChunk.GetValue("files_id").AsObjectId.ToString());
                    }
                }

                // Delete the chunks of every files_id that has no matching
                // document in packages.files.
                foreach (var files_id in uniqueFiles)
                {
                    var filter = Builders<BsonDocument>.Filter.Eq("_id", new ObjectId(files_id));
                    var count = await filesCollection.CountDocumentsAsync(filter, null, cts.Token);
                    if (count > 0)
                    {
                        continue;
                    }

                    var deleteFilter = Builders<BsonDocument>.Filter.Eq("files_id", new ObjectId(files_id));
                    var deleteResult = await chunksCollection.DeleteManyAsync(deleteFilter, null, cts.Token);
                    numberOfChunksDeleted += (ulong)deleteResult.DeletedCount;
                    Log.Information("{0},{1}", deleteResult, files_id);
                }
            }
            catch (Exception ex)
            {
                Log.Error(ex, "Something went wrong.");
            }
            finally
            {
                sw.Stop();
                Log.Information("Number of chunks deleted: {0}.", numberOfChunksDeleted);
                Log.Information("Elapsed time: {0} minutes.", sw.Elapsed.TotalMinutes);
                Log.CloseAndFlush();
            }

            return 0;
        }
    }
}
@thegreatco

OK, I made some tweaks. Memory could become an issue with very large numbers of files, but this implementation should limit both the number of delete commands you have to issue and the number of queries against the files collection. Pay very close attention to the Projection and the Hint, as they let the whole scan run as a covered query, answered from the index alone (see the note on the index after the snippet). Files with a large number of chunks will take some time to delete, but this should still be faster than deleting one chunk at a time. I hope this helps.

// Requires the same usings as the gist above, plus System.Linq for Select/ToHashSet.
var client = new MongoClient();
var db = client.GetDatabase("gridfs");
var chunks = db.GetCollection<BsonDocument>("packages.chunks");
var files = db.GetCollection<BsonDocument>("packages.files");

var validFiles = new HashSet<ObjectId>();
var deletedFiles = new HashSet<ObjectId>();

// Project to files_id only and hint the default GridFS chunks index,
// so the scan runs as a covered query.
var findOptions = new FindOptions<BsonDocument>()
{
    Projection = Builders<BsonDocument>.Projection.Include("files_id").Exclude("_id"),
    Hint = "files_id_1_n_1"
};

using var chunksCursor = await chunks.FindAsync(Builders<BsonDocument>.Filter.Empty, findOptions);
while (await chunksCursor.MoveNextAsync())
{
    var uniqueFileIds = chunksCursor.Current.Select(x => x["files_id"].AsObjectId).ToHashSet();
    foreach (var fileId in uniqueFileIds)
    {
        // Don't check it if we already know it's valid or already deleted it.
        if (validFiles.Contains(fileId) || deletedFiles.Contains(fileId))
            continue;

        var fileCount = await files.CountDocumentsAsync(Builders<BsonDocument>.Filter.Eq("_id", fileId));

        // If the file exists, add it to our list and move on.
        if (fileCount > 0)
        {
            validFiles.Add(fileId);
            continue;
        }

        // If we know it isn't valid, delete all possible chunks with a single command.
        var deleteResult = await chunks.DeleteManyAsync(Builders<BsonDocument>.Filter.Eq("files_id", fileId));
        Console.WriteLine($"Deleted {deleteResult.DeletedCount} from orphaned file {fileId}");
        deletedFiles.Add(fileId);
    }
}
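
A note on the Hint: "files_id_1_n_1" is the default name of the unique {files_id: 1, n: 1} index that GridFS creates on the chunks collection. If that index is missing (for example, if the bucket was populated by hand), the hinted find will fail. A minimal sketch for creating it with the driver's index API, assuming the chunks collection from the snippet above:

// Create the standard GridFS chunks index {files_id: 1, n: 1}.
// Creating an index is a no-op if an identical one already exists.
await chunks.Indexes.CreateOneAsync(
    new CreateIndexModel<BsonDocument>(
        Builders<BsonDocument>.IndexKeys.Ascending("files_id").Ascending("n"),
        new CreateIndexOptions { Unique = true }));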

@fredeil commented Jun 18, 2020

Thank you for this @thegreatco

@fredeil commented Jun 18, 2020

After the program has been running for a while, this exception occurs: MongoDB.Driver.MongoCommandException: Command getMore failed: cursor id 93648701887 not found. However, on my first "dry run", which only counted the documents instead of deleting them, this did not happen.
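
For reference, a dry run here amounts to swapping the DeleteManyAsync call for a count; a minimal sketch, reusing fileId and chunks from thegreatco's snippet above:

// Count the orphaned chunks instead of removing them.
var orphanFilter = Builders<BsonDocument>.Filter.Eq("files_id", fileId);
var orphanedChunks = await chunks.CountDocumentsAsync(orphanFilter);
Console.WriteLine($"Would delete {orphanedChunks} chunks from orphaned file {fileId}");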

@thegreatco

That means more than 10 minutes are passing between calls to MoveNextAsync(), so the server is cleaning up the idle cursor. If you add NoCursorTimeout = true to the FindOptions<BsonDocument>, the mongod won't clean up the cursor, but then, unless the client closes it, it stays alive until a server restart. So make sure all error paths call Dispose on chunksCursor. The using declaration should handle that for you automatically, but an explicit try...finally wouldn't hurt either.
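
A minimal sketch of that change, applied to the FindOptions from the snippet above (NoCursorTimeout is an existing property on the driver's FindOptions):

// Same options as before, with the server-side idle timeout disabled.
// The cursor must now be disposed explicitly (or via "using"), or it
// lives on the server until a restart.
var findOptions = new FindOptions<BsonDocument>()
{
    Projection = Builders<BsonDocument>.Projection.Include("files_id").Exclude("_id"),
    Hint = "files_id_1_n_1",
    NoCursorTimeout = true
};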

@fredeil commented Jun 20, 2020