Searching Blob Documents with the Azure Search Service

 One of the core services in the Microsoft Azure cloud platform is the Storage Service, which includes Blobs, Queues, and Table storage. Blobs are great for anything you would use a file system for, such as avatars, data files, XML/JSON files, ...and documents. But until recently, documents in blob storage had one big shortcoming: they weren't searchable. That is no longer the case. In this post, we'll examine how to search documents in blob storage using the Azure Search Service.

Azure Blob Basics

Let's quickly cover the basics of Azure blobs (a short code sketch follows the list):

  • Storage Accounts. To work with storage, you need to allocate a storage account in the Azure Management Portal. A storage account can be used for any or all of the following: blob storage, queue storage, table storage. We're focusing on blob storage in this article. To access a storage account you need its name and a key.
  • Containers. In your storage account, you can create one or more named containers. A container is kind of like a file folder--but without subfolders. Fortunately, there's a way to mimic subfolders (you may include slashes in blob names). Containers can be publicly accessible over the Internet, or have restricted access that requires an access key.
  • Blobs. A blob is a piece of data with a name, content, and some properties. For all intents and purposes, you can think of a blob as a file. There are actually several kinds of blobs (block blobs, append blobs, and page blobs). For our purposes here, whenever we mention blob we mean block blob, which is the type that most resembles a sequential file.
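
To make those pieces concrete, here's a minimal sketch of connecting to a storage account and getting a container reference. It uses the classic WindowsAzure.Storage client library (Microsoft.WindowsAzure.Storage); the account name, key, and container name are placeholders for your own values.

using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

// Connect using the storage account name and key (placeholder values).
CloudStorageAccount account = CloudStorageAccount.Parse(
    "DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=<storage-key>");
CloudBlobClient blobClient = account.CreateCloudBlobClient();

// Get a reference to the container, creating it if it doesn't exist yet.
CloudBlobContainer container = blobClient.GetContainerReference("book-docs");
container.CreateIfNotExists();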

Uploading Documents to Azure Blob Storage

Let's say you have the chapters of a book you're writing in Microsoft Word, saved as PDF files--ch01.pdf, ch02.pdf, ... up to ch10.pdf, along with toc.pdf and preface.pdf--which you would like to store in blob storage and be able to search.

In your Azure storage account you can create a container (folder) for documents. In my case, I created a container named book-docs to hold my book chapter documents. In the book-docs container, you can upload your documents. If you upload the 12 pdf documents described above, you'll end up with 12 blobs (files) in your container.
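
If you prefer to upload from code rather than through a storage tool, a rough sketch (continuing with the WindowsAzure.Storage client library and the container reference from the earlier sketch; the local folder path is hypothetical) might look like this:

// File names from this example; the local folder is a placeholder.
string[] files = { "toc.pdf", "preface.pdf", "ch01.pdf", "ch02.pdf" /* ... through ch10.pdf */ };

foreach (string file in files)
{
    CloudBlockBlob blob = container.GetBlockBlobReference(file);
    blob.Properties.ContentType = "application/pdf";  // stored with the blob on upload
    blob.UploadFromFile(@"C:\book\" + file);          // uploads the local file as a block blob
}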

Setting Document Properties

It would be nice to search these documents not only on their content, but also on metadata. We can add metadata properties (name-value pairs) to each of these blobs. In the Microsoft Azure Storage Explorer, right-click a blob and select Properties. In the Properties dialog, click Add Metadata to add a property and enter a name and value. We'll later be able to search these properties. In this example, we've added a property named DocType and a property named Title to each document, with values like "pdf" and "Chapter 1: Cloud Computing Explained".
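
The same properties can also be set from code instead of Storage Explorer. Here's a minimal sketch, again with the WindowsAzure.Storage client library and the container reference from earlier (the title value is just this example's):

CloudBlockBlob blob = container.GetBlockBlobReference("ch01.pdf");

blob.FetchAttributes();   // load existing properties/metadata so we don't clobber them
blob.Metadata["DocType"] = "pdf";
blob.Metadata["Title"] = "Chapter 1: Cloud Computing Explained";
blob.SetMetadata();       // persist the metadata name-value pairs to the blob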

Creating a Data Source

To create a data source, we use the service client to add a new DataSource object to its DataSources collection. You'll need your storage account name and key (note this is a different credential from the search service name and key in the previous section). The following parameters are defined in the code below:

  • Name: name for the data source.
  • Type: the type of data source (AzureBlob).
  • Credentials: storage account connection string.
  • Container: identifies which container in blob storage to access.
  • DataDeletionDetectionPolicy: defines a deletion policy (soft delete), and identifies a property (Deleted) and value (1) which will be recognized as a deletion. Blobs with property Deleted:1 will be removed from the index. We'll explain more about this later.

String datasourceName = "book-docs";

if (!serviceClient.DataSources.Exists(datasourceName))
{
    serviceClient.DataSources.Create(new DataSource()
    {
        Name = datasourceName,
        Type = Microsoft.Azure.Search.Models.DataSourceType.AzureBlob,
        Credentials = new DataSourceCredentials("DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=GL3AAN0Xyy/8nvgBJcVr9lIMgCTtBeIcKuL46o/TTCpEGrReILC5z9k4m4Z/yZyYNfOeEYHEHdqxuQZmPsjoeQ=="),
        Container = new Microsoft.Azure.Search.Models.DataContainer(datasourceName),
        DataDeletionDetectionPolicy = new Microsoft.Azure.Search.Models.SoftDeleteColumnDeletionDetectionPolicy()
        {
            SoftDeleteColumnName = "Deleted",
            SoftDeleteMarkerValue = "1"
        }
    });
}

With our data source defined, we can move on to creating our index and indexer.

Creating an Index

Next, we need to create the index that Azure Search will maintain for searches. The code below creates an index named book. It populates the Fields collection with the fields we are interested in tracking for searches. This includes:

  • content: the blob's content.
  • native metadata fields that come from accessing blob storage (such as metadata_storage_name, metadata_storage_path, metadata_storage_last_modified, ...).
  • custom metadata properties we've decided to add: DocType, Title, and Deleted.

Once the object is set up, it is added to the service client's Indexes collection, which creates the index.

String indexName = "book";

Index index = new Index()
{
    Name = indexName,
    Fields = new List<Field>()
};

index.Fields.Add(new Field() { Name = "content", Type = Microsoft.Azure.Search.Models.DataType.String, IsSearchable = true });
index.Fields.Add(new Field() { Name = "metadata_storage_content_type", Type = Microsoft.Azure.Search.Models.DataType.String });
index.Fields.Add(new Field() { Name = "metadata_storage_size", Type = Microsoft.Azure.Search.Models.DataType.String });
index.Fields.Add(new Field() { Name = "metadata_storage_last_modified", Type = Microsoft.Azure.Search.Models.DataType.String });
index.Fields.Add(new Field() { Name = "metadata_storage_content_md5", Type = Microsoft.Azure.Search.Models.DataType.String });
index.Fields.Add(new Field() { Name = "metadata_storage_name", Type = Microsoft.Azure.Search.Models.DataType.String });
index.Fields.Add(new Field() { Name = "metadata_storage_path", Type = Microsoft.Azure.Search.Models.DataType.String, IsKey = true, IsRetrievable = true, IsSearchable = true });
index.Fields.Add(new Field() { Name = "metadata_author", Type = Microsoft.Azure.Search.Models.DataType.String });
index.Fields.Add(new Field() { Name = "metadata_language", Type = Microsoft.Azure.Search.Models.DataType.String });
index.Fields.Add(new Field() { Name = "metadata_title", Type = Microsoft.Azure.Search.Models.DataType.String });
index.Fields.Add(new Field() { Name = "DocType", Type = Microsoft.Azure.Search.Models.DataType.String, IsSearchable = true });
index.Fields.Add(new Field() { Name = "Title", Type = Microsoft.Azure.Search.Models.DataType.String, IsSearchable = true });

// Recreate the index if it already exists.
if (serviceClient.Indexes.Exists(indexName))
{
    serviceClient.Indexes.Delete(indexName);
}

serviceClient.Indexes.Create(index);

Let's take note of some things about the index we're creating:

  • Some of the fields are built-in from what Azure Search intrinsically knows about blobs. This includes content and all the properties beginning with "metadata_". Especially take note of metadata_storage_path, which is the full URL of the blob. This is marked as the key of the index. This will ensure we do not receive duplicate documents in our search results.
  • Some of the fields are custom properties we've chosen to add to our blobs. This includes DocType and Title.

Creating an Indexer

And now we can create an indexer (not to be confused with the index). The indexer is the entity that will regularly scan the data source and keep the index up to date. The Indexer object identifies the data source to be scanned and the index to be updated. It also contains a schedule; in this case, the indexer will run every 30 minutes. Once the Indexer object is set up, it is added to the service client's Indexers collection, which creates the indexer. In the background, the indexer will start running to scan the data source and populate the index. Its progress can be monitored using the Azure Management Portal.

String indexName = "book";

String indexerName = "book-docs";

Indexer indexer = new Indexer()
{
    Name = indexerName,
    DataSourceName = indexerName,   // our data source is also named "book-docs"
    TargetIndexName = indexName,
    Schedule = new IndexingSchedule()
    {
        Interval = System.TimeSpan.FromMinutes(30)
    }
};

// Base64-encode the key field, since a blob URL can contain characters
// that are not valid in an index key.
indexer.FieldMappings = new List<Microsoft.Azure.Search.Models.FieldMapping>();
indexer.FieldMappings.Add(new Microsoft.Azure.Search.Models.FieldMapping()
{
    SourceFieldName = "metadata_storage_path",
    MappingFunction = Microsoft.Azure.Search.Models.FieldMappingFunction.Base64Encode()
});

// Recreate the indexer if it already exists.
if (serviceClient.Indexers.Exists(indexerName))
{
    serviceClient.Indexers.Delete(indexerName);
}

serviceClient.Indexers.Create(indexer);

Let's point out some things about the indexer we're creating:

  • The indexer has a schedule, which determines how often it scans blob storage to update the index. The code above sets a schedule of every 30 minutes.
  • There is a field mapping function defined for the metadata_storage_path field, which is the document path and our unique key. Why do we need this? Well, it's possible this path value contains characters that are invalid in an index key; to avoid failures, the value is Base64-encoded. We'll have to decode this value whenever we retrieve search results, as sketched below.
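
To close the loop, here's a hedged sketch of querying the index and decoding that key, using the same Microsoft.Azure.Search service client as above. It assumes the Base64Encode mapping function produced unpadded, URL-safe Base64; the exact variant can differ between service API versions, so treat the decoder as illustrative rather than definitive.

// Query the "book" index for documents mentioning "cloud computing".
ISearchIndexClient indexClient = serviceClient.Indexes.GetClient("book");
DocumentSearchResult response = indexClient.Documents.Search("cloud computing");

foreach (SearchResult result in response.Results)
{
    string encodedPath = (string)result.Document["metadata_storage_path"];
    Console.WriteLine("{0} -> {1}", result.Document["Title"], DecodeKey(encodedPath));
}

// Illustrative decoder: reverses unpadded, URL-safe Base64.
static string DecodeKey(string encoded)
{
    string s = encoded.Replace('-', '+').Replace('_', '/');
    switch (s.Length % 4)   // restore the stripped '=' padding
    {
        case 2: s += "=="; break;
        case 3: s += "="; break;
    }
    return System.Text.Encoding.UTF8.GetString(Convert.FromBase64String(s));
}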


