Azure Table storage is a way of storing structured NoSQL data in the cloud. As such, it’s geared more towards rapid read access than towards manipulating the data in the table.
If you want to perform operations such as updates or deletes on data in your table, the most efficient way to do so is as a batch operation, which groups up to 100 operations into a single request provided they all have the same partition key.
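As a quick illustration before the full walkthrough, a delete batch looks something like the sketch below (the myTable reference and the entitiesToDelete collection are hypothetical stand-ins for whatever you already have to hand):

    // Up to 100 operations can go into one TableBatchOperation and every
    // entity in the batch must share the same PartitionKey.
    TableBatchOperation batch = new TableBatchOperation();
    foreach (TableEntity entity in entitiesToDelete)
    {
        batch.Add(TableOperation.Delete(entity));
    }
    await myTable.ExecuteBatchAsync(batch); // myTable is a CloudTable reference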
In my case I wanted to download all rows from a history table that were older than the current point in time and then delete them from the table.
Initially I get all of the history messages (the row key in this table is a timestamp and the partition key is a customer ID).
public async Task<List<HistoryEntity>> GetHistoryMessagesAsync(string tableName, CloudStorageAccount storageAccount, string timestamp)
{
    List<HistoryEntity> historyEntities = new List<HistoryEntity>();
    CloudTableClient tableClient = storageAccount.CreateCloudTableClient();
    CloudTable historyTable = tableClient.GetTableReference(tableName);
    // Query for every row with a RowKey (timestamp) earlier than the cut-off
    TableQuery<HistoryEntity> downloadQuery = new TableQuery<HistoryEntity>()
        .Where(TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThan, timestamp));
    TableContinuationToken continuationToken = null;
    do
    {
        // Results come back in segments, so keep querying until the continuation token is null
        var tableQueryResult = await historyTable.ExecuteQuerySegmentedAsync(downloadQuery, continuationToken);
        continuationToken = tableQueryResult.ContinuationToken;
        foreach (HistoryEntity result in tableQueryResult)
        {
            historyEntities.Add(result);
        }
    }
    while (continuationToken != null);
    return historyEntities;
}
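The code above assumes a HistoryEntity class deriving from TableEntity; the exact properties depend on what the history table stores, but a minimal sketch (with a placeholder Message property) might look like this:

    // Hypothetical entity shape - PartitionKey (customer ID), RowKey (timestamp),
    // Timestamp and ETag are all inherited from TableEntity.
    public class HistoryEntity : TableEntity
    {
        public HistoryEntity() { }

        public string Message { get; set; } // placeholder for the real columns
    }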
I then get all of the distinct partition keys from my entities so that I can batch up all of my deletes into as few operations as possible.
List<HistoryEntity> historyEntities = await GetHistoryMessagesAsync("HistoryTable", storageAccount, "2019-03-07 22:10:53.8785130");
List<string> partitionKeys = historyEntities.Select(x => x.PartitionKey).Distinct().ToList();
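For completeness, the storageAccount passed in above can be created by parsing a connection string; the development storage emulator is used here purely as an example:

    // Any valid storage connection string will do - the emulator is just an example
    CloudStorageAccount storageAccount = CloudStorageAccount.Parse("UseDevelopmentStorage=true");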
To delete these rows as efficiently as possible I want to create batches of 100 or fewer entities. If a partition key has more than 100 records, those records are split across separate batches.
public async Task DeleteHistoryMessagesAsync(string partitionKey, CloudTable historyTable, string timestamp)
{
    // Only project PartitionKey and RowKey, as that's all a delete operation needs
    TableQuery<TableEntity> deleteQuery = new TableQuery<TableEntity>()
        .Where(TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, partitionKey),
            TableOperators.And,
            TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThan, timestamp)))
        .Select(new string[] { "PartitionKey", "RowKey" });
    TableContinuationToken continuationToken = null;
    do
    {
        var tableQueryResult = await historyTable.ExecuteQuerySegmentedAsync(deleteQuery, continuationToken);
        continuationToken = tableQueryResult.ContinuationToken;
        // Split into chunks of 100 for batching
        List<List<TableEntity>> rowsChunked = tableQueryResult.Select((x, index) => new { Index = index, Value = x })
            .Where(x => x.Value != null)
            .GroupBy(x => x.Index / 100)
            .Select(x => x.Select(v => v.Value).ToList())
            .ToList();
        // Delete each chunk of 100 in a batch
        foreach (List<TableEntity> rows in rowsChunked)
        {
            TableBatchOperation tableBatchOperation = new TableBatchOperation();
            rows.ForEach(x => tableBatchOperation.Add(TableOperation.Delete(x)));
            await historyTable.ExecuteBatchAsync(tableBatchOperation);
        }
    }
    while (continuationToken != null);
}
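Calling this sequentially for each partition key would look something like the following (historyTable is assumed to be the same CloudTable reference used earlier, and the timestamp is the same cut-off):

    // Sequential version - one partition key's deletes complete before the next starts
    foreach (string partitionKey in partitionKeys)
    {
        await DeleteHistoryMessagesAsync(partitionKey, historyTable, "2019-03-07 22:10:53.8785130");
    }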
We can also run the deletes for each partition key in parallel using a TPL Dataflow ActionBlock (from the System.Threading.Tasks.Dataflow package) to further speed up the processing.
// Create a block with an asynchronous action
var block = new ActionBlock<(string partitionKey, CloudTable historyTable)>(
    async x => await DeleteHistoryMessagesAsync(x.partitionKey, x.historyTable, timestamp),
    new ExecutionDataflowBlockOptions
    {
        BoundedCapacity = 100, // Cap the number of queued items
        MaxDegreeOfParallelism = 16 // Run up to 16 deletes concurrently
    });
// Post a delete job for each distinct partition key
foreach (string partitionKey in partitionKeys)
{
    await block.SendAsync((partitionKey, historyTable));
}
// Signal that no more items will arrive and wait for the remaining deletes to finish
block.Complete();
await block.Completion;
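Note that SendAsync is used rather than Post so that once the block’s BoundedCapacity of 100 queued items is reached the producer simply waits for space rather than having items rejected, and calling Complete followed by awaiting Completion ensures every queued delete has finished before moving on.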
2 Comments
Kanag · 23 August 2022 at 08:08
What was the time taken overall to delete 1000 records?
Shinigami · 24 August 2022 at 13:08
Sorry, I can’t remember exact times but I think 1000 records probably only took a couple of seconds, assuming they’re all on the same partition key. If they’re spread across multiple partition keys then that’s going to require more API calls and will take longer.