Created
September 22, 2025 13:52
Initial prompt for generating an evaluation dataset. It attempts to cover a broad set of realistic domains and data-modeling patterns to mimic enterprise data sprawl in Elasticsearch.
| You are an expert in Elasticsearch data modeling and an experienced data architect. Your task is to generate a dataset of 5 diverse and realistic Elasticsearch index mappings. These mappings will be used to develop and evaluate a Natural Language to Elasticsearch Query Language (NL2ESQL) agent. The agent focuses on translating natural language queries (NLQ) with search and exploratory intent into ESQL queries for single Elasticsearch indices. A key follow-up step will be generating ~10 diverse query examples per index. | |
| **Pre-Generation Reasoning Steps (Your Internal Thought Process):** | |
| Before you generate the JSONL output, please follow these internal reasoning steps to ensure the indices are realistic, diverse, and well-considered: | |
| 1. **Domain and Use Case Brainstorming & Selection:** | |
| * First, think broadly about common real-world domains and primary use cases where Elasticsearch is heavily utilized. Examples include, but are not limited to: website/intranet search, e-commerce product catalogs and recommendations, document management and retrieval systems, etc. | |
| * From your brainstormed list, **select 5 distinct domains/use cases**. These selections should be diverse and particularly well suited to the types of indices and queries needed to test the NL2ESQL agent (e.g., supporting search/exploratory intent, single-index operations, and a mix of lexical, filter, semantic, and hybrid search capabilities). **Actively avoid simply reusing the examples provided in this prompt unless they genuinely fit your independent reasoning and shortlisting process.** Your goal is to arrive at 5 choices that reflect common, yet varied, real-world Elasticsearch deployments. | |
| 2. **Naming Convention Analysis:** | |
| * Consider how software engineers and data architects typically name indices and fields in production Elasticsearch environments. Think about: | |
| * **Index Naming:** Common patterns often include indicators of the data source, content type, environment, version, or sometimes date patterns (e.g., `my_application-prod-v1`, `products_catalog_master`, `user_profiles_customer_service`, `user_details_YYYY.MM.DD`). Aim for names that are descriptive, clear, and follow a consistent (but not overly rigid) pattern that implies a professional setup. | |
| * **Field Naming:** `snake_case` is the most prevalent convention for field names in Elasticsearch JSON. Strive for consistency within each index. Consider common practices like using meaningful suffixes (e.g., `_at` or `_ts` for timestamps/dates, `_id` for unique identifiers, `_count` for numerical counts, `is_` or `has_` for booleans, `_embedding`, `_vector`, or `_semantic` for semantic fields, `_raw` for non-analyzed keyword versions of text fields) or prefixes where they enhance clarity. Field names should be intuitive and clearly reflect the data they hold, balancing descriptiveness with conciseness. For diversity, different indices can follow different conventions, as long as each convention is also common in real-world deployments. | |
| 3. **Index Design Based on Reasoning:** | |
| * Once you have your 5 selected domains/use cases and have considered appropriate naming conventions, proceed to design the detailed mappings for each of the 5 indices. Ensure these designs meet all the requirements specified in the subsequent sections of this prompt (Field Variety, Searchable Field Density, index-level _meta.description, field-level meta.description entries, Index Briefing, etc.). The primary objective is for the generated indices to feel authentic and representative of well-architected real-world scenarios. | |
| --- | |
| **Key Context for Index Design:** | |
| The NL2ESQL agent will support the following: | |
| 1. **User Intent:** 'Search/Exploration'. Queries will be self-contained, single-turn, and target a single index. | |
| 2. **Supported Field Types for Search Operations:** | |
| * `text`: For lexical search. | |
| * `semantic_text`: Primarily for semantic search. When defining a `semantic_text` field, include an `inference_id` as a direct property of the field (e.g., "inference_id": "elser-v2-endpoint"). Its `meta.description` should be a concise note about its purpose. | |
| 3. **Supported Field Types for Filter Conditions:** | |
| * `boolean` | |
| * `keyword` | |
| * `number` (integer, long, float, double) | |
| * `date` | |
| 4. **ESQL Query Focus:** | |
| * Lexical search (using `MATCH` on `text` fields). | |
| * Semantic search (using `MATCH` on `semantic_text` fields). | |
| * Hybrid search (primarily using `MATCH` on `text` and `semantic_text` fields). | |
| * Filtering (up to 5 conditions on up to 5 distinct fields). | |
| * Operators: `==`, `!=`, `<`, `<=`, `>`, `>=`, `AND`, `OR`, `NOT`, `IS NULL`, `IS NOT NULL`. | |
| * Search Functions: `MATCH`, `QSTR`. | |
| * Commands and keywords: `FROM`, `WHERE`, `LIMIT`, `KEEP`, `SORT`, `DESC`, `METADATA`. | |
| --- | |
| **Requirements for Generated Indices:** | |
| * **Quantity:** Generate exactly 5 distinct Elasticsearch indices based on your reasoned domain selections. | |
| * **Diversity:** The chosen domains/use cases should be genuinely different from each other. | |
| * **Realism:** The structure, fields, and chosen naming conventions should reflect common patterns in production. | |
| * **Sufficient Fields:** Each index should have 8-20 fields overall. | |
| * **Searchable Field Density and Variety:** Each index must support diverse search query examples. | |
| * Ensure a good number and variety of `text` and `semantic_text` fields (aim for at least 2-4 fields suitable for search per index, e.g., titles, descriptions, content bodies). | |
| * This must allow for lexical (single/multi-field), semantic, hybrid, and filtered search examples. | |
| * Every index must include at least one `semantic_text` field and at least one `text` field. | |
| * **Selective `copy_to` for Semantic Enhancement:** | |
| * Selectively use the `copy_to` mechanism where it adds clear value. For example, for a **few, highly relevant primary `text` fields per index** (like a main product description, an article abstract, or a key log message pattern), you can use `copy_to` to populate a corresponding `semantic_text` field. | |
| * This allows the source text to be used for both traditional keyword search and for semantic processing. The intent is **not to apply `copy_to` to all `text` fields indiscriminately**, but only to those primary content fields where a semantic representation is most beneficial. | |
| * **At least two indices should demonstrate this `copy_to` pattern for one or two key fields each.** | |
| * **Utilize Multi-Fields for Flexibility (e.g., text/keyword):** | |
| * In real-world scenarios, a single string input often needs to be indexed in multiple ways for different purposes. For example, a field might be needed as `text` for full-text search and as `keyword` for exact matching, sorting, or aggregations. | |
| * Use the `fields` keyword to define such multi-fields. Example: | |
| ```json | |
| "city_name": { | |
| "type": "text", | |
| "fields": { | |
| "raw": { // Common suffix for the keyword version | |
| "type": "keyword" | |
| } | |
| }, | |
| "meta": { "description": "Name of a city, e.g. Paris" } | |
| } | |
| ``` | |
| * **Incorporate this `text` with `keyword` sub-field pattern for relevant string fields in at least 2-3 of the generated indices.** This adds realism and supports a wider range of query types. | |
| * **Crucial `meta` Descriptions:** | |
| * **Index `mappings._meta.description`**: A very concise explanation of the index's purpose and domain (max 50 characters, approx. 5-8 words). E.g., "Product catalog for e-commerce site." | |
| * **Field `meta.description`**: A very clear and succinct description for every field (max 50 characters, approx. 5-8 words). E.g., "Product title, e.g. Toothbrush." | |
| * **Index Briefing (for each index object):** A narrative string elaborating on the index's purpose, typical data, and the types of search/exploratory questions it can answer (lexical, semantic, hybrid, filtering). This should align with the chosen domain and the index's specific field composition. | |
| --- | |
| **Output Format:** | |
| **Provide the output in JSONL format.** | |
| Each line must be a separate, self-contained JSON object representing one Elasticsearch index. | |
| **Do NOT wrap the output in a single JSON array or use commas between line-separated JSON objects.** | |
| Schema for each JSON object (each line): | |
| ```json | |
| { | |
| "index_id": "string_reasoned_and_well_named_index_id", | |
| "index_briefing": "string_Narrative brief for the index...", | |
| "mappings": { | |
| "_meta": { // Index-level uses _meta | |
| "description": "string_Succinct index purpose (max 50 chars)" | |
| }, | |
| "properties": { | |
| "main_content_field": { // Example of text field with copy_to for semantic | |
| "type": "text", | |
| "fields": { // Also showing multi-field for keyword version | |
| "keyword": { "type": "keyword" } | |
| }, | |
| "meta": { | |
| "description": "Primary content (e.g., article body, product description)." | |
| }, | |
| "copy_to": "main_content_semantic" | |
| }, | |
| "main_content_semantic": { // semantic_text field populated by copy_to | |
| "type": "semantic_text", | |
| "inference_id": "your_main_content_inference_id", | |
| "meta": { | |
| "description": "Primary content (e.g., article body, product description)" | |
| } | |
| }, | |
| "specialized_semantic_query_text": { // Example of semantic_text with original content | |
| "type": "semantic_text", | |
| "inference_id": "your_specialized_inference_id", | |
| "meta": { | |
| "description": "Content (e.g., a curated summary, or a user's natural language query intent if stored, usually a good choice of long texts)." | |
| } | |
| }, | |
| "title": { // Example of text field that might not need semantic copy | |
| "type": "text", | |
| "fields": { "raw": { "type": "keyword" } }, // text/keyword multi-field | |
| "meta": { | |
| "description": "Title of the item." | |
| } | |
| }, | |
| "category_code": { | |
| "type": "keyword", | |
| "meta": { | |
| "description": "Category code, e.g., electronics." | |
| } | |
| } | |
| // ... other fields, ensuring variety, adherence to naming conventions, and demonstration of required patterns. | |
| } | |
| }, | |
| "indexmetadata": { | |
| "domain": "string_Chosen_real_world_domain", | |
| "version": 1.0, | |
| "data_source_example": "string_Illustrative source for this domain" | |
| } | |
| } | |
| ``` |
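
The requirements above are mechanical enough that the generated JSONL can be spot-checked before it is used. The following is a minimal sketch of such a check, not part of the original prompt: the function names `check_index` and `check_jsonl` are hypothetical, and it assumes that "8-20 fields overall" counts top-level entries in `properties` (multi-field sub-fields are not counted) and that `copy_to` holds a single field name.

```python
import json

MAX_DESC = 50  # character limit the prompt sets for description strings


def check_index(obj):
    """Return a list of problems found in one parsed index object."""
    problems = []
    mappings = obj.get("mappings", {})
    props = mappings.get("properties", {})

    # Assumption: only top-level properties count toward the 8-20 range.
    if not 8 <= len(props) <= 20:
        problems.append(f"expected 8-20 fields, found {len(props)}")

    # Every index needs at least one text and one semantic_text field.
    types = [field.get("type") for field in props.values()]
    for required in ("text", "semantic_text"):
        if required not in types:
            problems.append(f"no {required} field")

    index_desc = mappings.get("_meta", {}).get("description", "")
    if not index_desc or len(index_desc) > MAX_DESC:
        problems.append("index _meta.description missing or over 50 chars")

    for name, field in props.items():
        desc = field.get("meta", {}).get("description", "")
        if not desc or len(desc) > MAX_DESC:
            problems.append(f"{name}: meta.description missing or over 50 chars")
        if field.get("type") == "semantic_text" and "inference_id" not in field:
            problems.append(f"{name}: semantic_text without inference_id")
        # Assumption: copy_to is a single string naming another property.
        target = field.get("copy_to")
        if isinstance(target, str) and target not in props:
            problems.append(f"{name}: copy_to target '{target}' not defined")
    return problems


def check_jsonl(text):
    """Map each index_id (or line label) to its list of problems."""
    report = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            report[f"line {line_no}"] = [f"invalid JSON: {exc.msg}"]
            continue
        report[obj.get("index_id", f"line {line_no}")] = check_index(obj)
    return report
```

Note that the `//` comments in the schema illustration are not valid JSON, so a line that copies them verbatim would be reported as invalid by a strict parser like this one. The 50-character check would also flag some of the placeholder descriptions in the schema example, which are longer than the stated limit.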