@Eonasdan
Created November 8, 2022 15:28
Azure Cognitive Search Index schema

This is my attempt at defining a schema for an Azure Cognitive Search index. It's definitely not perfect or complete, but I wanted to provide something for myself and share it with others. If you find issues, please let me know so I can update it.

{
"$schema": "http://json-schema.org/draft-04/schema#",
"type": "object",
"properties": {
"name": {
"type": "string",
"description": "Required. The name of the index. An index name must only contain lowercase letters, digits or dashes, cannot start or end with dashes and is limited to 128 characters."
},
"description": {
"type": "string",
"description": "An optional description."
},
"fields": {
"$ref": "#/$defs/fields"
},
"similarity": {
"properties": {
"@odata.type": {
"type": "string",
"description": "Optional. For services created before July 15, 2020, set this property to use the BM25 ranking algorithm. Valid values include \"#Microsoft.Azure.Search.ClassicSimilarity\" or \"#Microsoft.Azure.Search.BM25Similarity\". API versions that support this property include 2020-06-30 and 2019-05-06-Preview. For more information, see Ranking algorithms in Azure Cognitive Search.",
"anyOf": [
{
"enum": [
"#Microsoft.Azure.Search.ClassicSimilarity",
"#Microsoft.Azure.Search.BM25Similarity"
]
}
]
},
"b": {
"type": "number",
"description": "Controls how the length of a document affects the relevance score. Values are between 0 and 1, with 0.75 as the default.\n\nA value of 0.0 means the length of the document will not influence the score, while a value of 1.0 means the impact of term frequency on relevance score will be normalized by the document's length.\n\nNormalizing the term frequency by the document's length is useful in cases where we want to penalize longer documents. In some cases, longer documents (such as a complete novel) are more likely to contain many irrelevant terms, compared to much shorter documents."
},
"k1": {
"type": "number",
"description": "Controls the scaling function between the term frequency of each matching term and the final relevance score of a document-query pair. Values are usually 0.0 to 3.0, with 1.2 as the default.\n\nA value of 0.0 represents a \"binary model\", where the contribution of a single matching term is the same for all matching documents, regardless of how many times that term appears in the text, while a larger k1 value allows the score to continue to increase as more instances of the same term are found in the document.\n\nUsing a higher k1 value can be important in cases where we expect multiple terms to be part of a search query. In those cases, we might want to favor documents that match many of the different query terms being searched over documents that only match a single one, multiple times. For example, when querying the index for documents containing the terms \"Apollo Spaceflight\", we might want to lower the score of an article about Greek Mythology that contains the term \"Apollo\" a few dozen times, without mentions of \"Spaceflight\", compared to another article that explicitly mentions both \"Apollo\" and \"Spaceflight\" only a handful of times."
}
},
"required": ["@odata.type"]
},
"suggesters": {
"type": "array",
"description": "Optional. Used for autocompleted queries or suggested search results, one per index. It is a data structure that stores prefixes for matching on partial queries like autocomplete and suggestions. Consists of a name and suggester-aware fields that provide content for autocompleted queries and suggested results. searchMode is required, and always set to analyzingInfixMatching. It specifies that matching will occur on any term in the query string.",
"items": [
{
"type": "object",
"properties": {
"name": {
"type": "string"
},
"searchMode": {
"type": "string"
},
"sourceFields": {
"type": "array",
"items": [
{
"type": "string"
}
]
}
},
"required": ["name", "searchMode", "sourceFields"]
}
]
},
"scoringProfiles": {
"type": "array",
"description": "Optional. Used for custom search score ranking. Set `defaultScoringProfile` to use a custom profile as the default, invoked whenever a custom profile is not specified on the query string.",
"items": [
{
"type": "object",
"properties": {
"name": {
"type": "string",
"description": "Required. The name of the scoring profile. It follows the same naming conventions as a field: it must start with a letter, can't contain dots, colons or @ symbols, and can't start with the phrase azureSearch (case-sensitive)."
},
"functions": {
"type": "array",
"description": "Optional. A scoring function can only be applied to fields that are filterable.",
"items": [
{
"type": "object",
"properties": {
"fieldName": {
"type": "string",
"description": "Required for scoring functions. A scoring function can only be applied to fields that are part of the field collection of the index, and that are filterable. In addition, each function type introduces additional restrictions (freshness is used with datetime fields, magnitude with integer or double fields, and distance with location fields). You can only specify a single field per function definition. For example, to use magnitude twice in the same profile, you would need to include two magnitude definitions, one for each field."
},
"freshness": {
"type": "object",
"properties": {
"boostingDuration": {
"type": "string"
}
},
"required": ["boostingDuration"]
},
"interpolation": {
"type": "string"
},
"magnitude": {
"type": ["object", "null"]
},
"distance": {
"type": ["object", "null"]
},
"tag": {
"type": ["object", "null"]
},
"type": {
"type": "string",
"description": "Required for scoring functions. Indicates the type of function to use. Valid values include magnitude, freshness, distance, and tag. You can include more than one function in each scoring profile. The function name must be lower case."
},
"boost": {
"type": "number",
"description": "Required for scoring functions. A positive number used as multiplier for raw score. It can't be equal to 1."
}
},
"required": ["fieldName", "type", "boost"]
}
]
},
"functionAggregation": {
"type": "string",
"anyOf": [
{
"enum": [
"sum",
"average",
"minimum",
"maximum",
"firstMatching"
]
}
]
},
"text": {
"type": "object",
"description": "Contains the weights property.",
"properties": {
"weights": {
"type": "object",
"description": "Optional. Name-value pairs that specify a searchable field and a positive integer or floating-point number by which to boost a field's score. The positive integer or number becomes a multiplier for the original field score generated by the ranking algorithm. For example, if a field score is 2 and the weight value is 3, the boosted score for the field becomes 6. Individual field scores are then aggregated to create a document field score, which is then used to rank the document in the result set.",
"additionalProperties": true
}
},
"required": ["weights"]
}
},
"required": ["name"]
}
]
},
"analyzers": {
"type": "array",
"items": [
{
"type": "object",
"properties": {
"@odata.type": {
"type": "string"
},
"name": {
"type": "string"
},
"charFilters": {
"type": "array",
"items": [
{
"type": "string"
}
]
},
"tokenizer": {
"type": "string"
},
"tokenFilters": {
"type": "array",
"items": [
{
"type": "string"
}
]
}
},
"required": ["@odata.type", "name"]
}
]
},
"charFilters": {
"type": "array",
"items": [
{
"type": "object",
"properties": {
"name": {
"type": "string"
},
"@odata.type": {
"type": "string"
}
},
"additionalProperties": true,
"required": ["name", "@odata.type"]
}
]
},
"tokenizers": {
"type": "array",
"items": [
{
"type": "object",
"properties": {
"name": {
"type": "string"
},
"@odata.type": {
"type": "string"
}
},
"additionalProperties": true,
"required": ["name", "@odata.type"]
}
]
},
"tokenFilters": {
"type": "array",
"items": [
{
"type": "object",
"properties": {
"name": {
"type": "string"
},
"@odata.type": {
"type": "string"
}
},
"additionalProperties": true,
"required": ["name", "@odata.type"]
}
]
},
"defaultScoringProfile": {
"type": "string"
},
"encryptionKey": {
"type": "object",
"properties": {
"keyVaultKeyName": {
"type": "string"
},
"keyVaultKeyVersion": {
"type": "string"
},
"keyVaultUri": {
"type": "string"
},
"accessCredentials": {
"type": "object",
"properties": {
"applicationId": {
"type": "string"
},
"applicationSecret": {
"type": "string"
}
},
"required": ["applicationId", "applicationSecret"]
}
},
"required": ["keyVaultKeyName", "keyVaultKeyVersion", "keyVaultUri"]
},
"corsOptions": {
"type": "object",
"properties": {
"allowedOrigins": {
"type": "array",
"description": "This is a list of origins that will be granted access to your index. This means that any JavaScript code served from those origins will be allowed to query your index (assuming it provides the correct api-key). Each origin is typically of the form protocol://<fully-qualified-domain-name>:<port> although <port> is often omitted. See Cross-origin resource sharing (Wikipedia) for more details.\n\nIf you want to allow access to all origins, include * as a single item in the allowedOrigins array. This is not a recommended practice for production search services but it is often useful for development and debugging.",
"items": [
{
"type": "string"
}
]
},
"maxAgeInSeconds": {
"type": "integer",
"description": "Browsers use this value to determine the duration (in seconds) to cache CORS preflight responses. This must be a non-negative integer. The larger this value is, the better performance will be, but the longer it will take for CORS policy changes to take effect. If it is not set, a default duration of 5 minutes will be used."
}
},
"required": ["allowedOrigins"]
}
},
"required": ["name", "fields"],
"$defs": {
"fields": {
"type": "array",
"items": [
{
"type": "object",
"properties": {
"name": {
"type": "string",
"description": "Required. Sets the name of the field, which must be unique within the fields collection of the index or parent field."
},
"type": {
"type": "string",
"description": "Required. Sets the data type for the field. Fields can be simple or complex. Simple fields are of primitive types, like `Edm.String` for text or `Edm.Int32` for integers. Complex fields can have sub-fields that are themselves either simple or complex. This allows you to model objects and arrays of objects, which in turn enables you to upload most JSON object structures to your index. See Supported data types (Azure Cognitive Search) for the complete list of supported types.",
"anyOf": [
{
"enum": [
"Edm.String",
"Edm.Int32",
"Edm.Int64",
"Edm.Double",
"Edm.Boolean",
"Edm.DateTimeOffset",
"Edm.GeographyPoint",
"Edm.ComplexType",
"Collection(Edm.String)",
"Collection(Edm.Int32)",
"Collection(Edm.Int64)",
"Collection(Edm.Double)",
"Collection(Edm.Boolean)",
"Collection(Edm.DateTimeOffset)",
"Collection(Edm.GeographyPoint)",
"Collection(Edm.ComplexType)"
]
}
]
},
"key": {
"type": ["boolean", "null"],
"description": "Required. Set this attribute to true to designate that a field's values uniquely identify documents in the index. The maximum length of values in a key field is 1024 characters. Exactly one top-level field in each index must be chosen as the key field and it must be of type `Edm.String`. Default is false for simple fields and null for complex fields.\n\nKey fields can be used to look up documents directly and update or delete specific documents. The values of key fields are handled in a case-sensitive manner when looking up or indexing documents. See Lookup Document (Azure Cognitive Search REST API) and Add, Update or Delete Documents (Azure Cognitive Search REST API) for details."
},
"retrievable": {
"type": ["boolean", "null"],
"description": "Indicates whether the field can be returned in a search result. Set this attribute to `false` if you want to use a field (for example, margin) as a filter, sorting, or scoring mechanism but do not want the field to be visible to the end user. This attribute must be `true` for key fields, and it must be `null` for complex fields. This attribute can be changed on existing fields. Setting retrievable to `true` does not cause any increase in index storage requirements. Default is `true` for simple fields and `null` for complex fields."
},
"searchable": {
"type": ["boolean", "null"],
"description": "Indicates whether the field is full-text searchable and can be referenced in search queries. This means it will undergo lexical analysis such as word-breaking during indexing. If you set a searchable field to a value like \"Sunny day\", internally it will be normalized and split into the individual tokens \"sunny\" and \"day\". This enables full-text searches for these terms. Fields of type `Edm.String` or `Collection(Edm.String)` are searchable by default. This attribute must be `false` for simple fields of other non-string data types, and it must be `null` for complex fields.\n\nA searchable field consumes extra space in your index since Azure Cognitive Search will process the contents of those fields and organize them in auxiliary data structures for performant searching. If you want to save space in your index and you don't need a field to be included in searches, set searchable to false. See How full-text search works in Azure Cognitive Search for details."
},
"filterable": {
"type": ["boolean", "null"],
"description": "Indicates whether to enable the field to be referenced in `$filter` queries. Filterable differs from searchable in how strings are handled. Fields of type `Edm.String` or `Collection(Edm.String)` that are filterable do not undergo lexical analysis, so comparisons are for exact matches only. For example, if you set such a field f to \"Sunny day\", `$filter=f eq 'sunny'` will find no matches, but `$filter=f eq 'Sunny day'` will. This attribute must be null for complex fields. Default is `true` for simple fields and `null` for complex fields. To reduce index size, set this attribute to false on fields that you won't be filtering on."
},
"sortable": {
"type": ["boolean", "null"],
"description": "Indicates whether to enable the field to be referenced in `$orderby` expressions. By default Azure Cognitive Search sorts results by score, but in many experiences users will want to sort by fields in the documents. A simple field can be sortable only if it is single-valued (it has a single value in the scope of the parent document).\n\nSimple collection fields cannot be sortable, since they are multi-valued. Simple sub-fields of complex collections are also multi-valued, and therefore cannot be sortable. This is true whether it's an immediate parent field, or an ancestor field, that's the complex collection. Complex fields cannot be sortable and the sortable attribute must be null for such fields. The default for sortable is `true` for single-valued simple fields, `false` for multi-valued simple fields, and `null` for complex fields."
},
"facetable": {
"type": ["boolean", "null"],
"description": "Indicates whether to enable the field to be referenced in facet queries. Typically used in a presentation of search results that includes hit count by category (for example, search for digital cameras and see hits by brand, by megapixels, by price, and so on). This attribute must be `null` for complex fields. Fields of type `Edm.GeographyPoint` or `Collection(Edm.GeographyPoint)` cannot be facetable. Default is `true` for all other simple fields. To reduce index size, set this attribute to `false` on fields that you won't be faceting on."
},
"analyzer": {
"type": ["string", "null"],
"description": "Sets the lexical analyzer for tokenizing strings during indexing and query operations. Valid values for this property include language analyzers, built-in analyzers, and custom analyzers. The default is `standard.lucene`. This attribute can only be used with searchable string fields, and it can't be set together with either `searchAnalyzer` or `indexAnalyzer`. Once the analyzer is chosen and the field is created in the index, it cannot be changed for the field. Must be `null` for complex fields."
},
"searchAnalyzer": {
"type": ["string", "null"],
"description": "Set this property in conjunction with `indexAnalyzer` to specify different lexical analyzers for indexing and queries. If you use this property, set analyzer to `null` and make sure `indexAnalyzer` is set to an allowed value. Valid values for this property include built-in analyzers and custom analyzers. This attribute can be used only with searchable fields. The search analyzer can be updated on an existing field since it is only used at query-time. Must be `null` for complex fields."
},
"indexAnalyzer": {
"type": ["string", "null"],
"description": "Set this property in conjunction with searchAnalyzer to specify different lexical analyzers for indexing and queries. If you use this property, set analyzer to `null` and make sure searchAnalyzer is set to an allowed value. Valid values for this property include built-in analyzers and custom analyzers. This attribute can be used only with searchable fields. Once the index analyzer is chosen, it cannot be changed for the field. Must be `null` for complex fields."
},
"synonymMaps": {
"type": ["array", "null"],
"items": [
{
"type": "string",
"description": "A list of the names of synonym maps to associate with this field. This attribute can be used only with searchable fields. Currently only one synonym map per field is supported. Assigning a synonym map to a field ensures that query terms targeting that field are expanded at query-time using the rules in the synonym map. This attribute can be changed on existing fields. Must be `null` or an empty collection for complex fields."
}
]
},
"fields": {
"$ref": "#/$defs/fields"
}
},
"required": ["name", "type", "key"]
}
]
}
}
}
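
To make the schema easier to try out, here is a small index definition that should validate against it. This is only a sketch: the index name, field names, scoring profile name, and the allowed origin are hypothetical placeholders, and the values for `b`, `k1`, and `maxAgeInSeconds` are simply the defaults mentioned in the descriptions above.

```json
{
  "name": "hotels-sample",
  "description": "Hypothetical sample index; every name here is illustrative only.",
  "fields": [
    { "name": "hotelId", "type": "Edm.String", "key": true, "filterable": true },
    { "name": "hotelName", "type": "Edm.String", "key": false, "searchable": true, "sortable": true },
    { "name": "tags", "type": "Collection(Edm.String)", "key": false, "searchable": true, "facetable": true },
    { "name": "lastRenovationDate", "type": "Edm.DateTimeOffset", "key": false, "filterable": true, "sortable": true }
  ],
  "similarity": {
    "@odata.type": "#Microsoft.Azure.Search.BM25Similarity",
    "b": 0.75,
    "k1": 1.2
  },
  "suggesters": [
    { "name": "sg", "searchMode": "analyzingInfixMatching", "sourceFields": ["hotelName"] }
  ],
  "scoringProfiles": [
    {
      "name": "boostRecent",
      "functionAggregation": "sum",
      "text": { "weights": { "hotelName": 3 } },
      "functions": [
        {
          "fieldName": "lastRenovationDate",
          "type": "freshness",
          "boost": 2,
          "interpolation": "linear",
          "freshness": { "boostingDuration": "P365D" }
        }
      ]
    }
  ],
  "defaultScoringProfile": "boostRecent",
  "corsOptions": {
    "allowedOrigins": ["https://www.example.com"],
    "maxAgeInSeconds": 300
  }
}
```

Exactly one field sets `key` to `true` and it is an `Edm.String`, the freshness function targets a filterable `Edm.DateTimeOffset` field, and the suggester draws on a searchable string field, matching the constraints spelled out in the descriptions.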