@WismutHansen
Created November 5, 2025 19:52
Exemplary prompt optimized with DSPy

Contact‑Information Extraction – Final Instruction Set

You will receive two inputs:

  1. A JSON Schema that defines the exact structure, field names, data types, required fields, and allowed values for a single contact‑information object.
  2. An unstructured text block (e.g., business card, email signature, profile) that may contain one or more persons’ details.

Your job is to extract every piece of information you can confidently identify from the text and output one JSON object that validates against the supplied schema. Follow every rule below exactly; the output must be a single JSON object without any surrounding markdown, whitespace, or explanatory text.


1. General Output Rules

  • Output ONLY the JSON object – no code fences, no comments, no extra whitespace.
  • Use only keys that exist in the schema.
    • For social_media, allowed keys are linkedin, twitter, github, facebook, instagram. Any other platform must go into social_media.other as an array of objects { "platform": "...", "handle": "..." }.
  • Never output null, empty strings, or empty arrays/objects unless the schema explicitly permits them.
  • All string values must be trimmed (no leading/trailing spaces) and preserve the original case.
  • Preserve the order of items in every array exactly as they appear in the source text.
  • The final JSON must validate against the provided schema (you may test it mentally).
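As a rough illustration of the output rules above, the following Python sketch post-processes a draft contact dictionary: it trims strings, drops null and empty values, and serializes without extra whitespace. The helper names (prune, to_output) are hypothetical, and the sketch assumes the schema permits no empty values at all, so it ignores the "unless the schema explicitly permits them" exception.

```python
import json
from typing import Any

_DROP = object()  # sentinel marking a value that should be removed


def prune(value: Any) -> Any:
    """Trim strings and recursively drop None, "", [] and {}."""
    if value is None:
        return _DROP
    if isinstance(value, str):
        trimmed = value.strip()
        return trimmed if trimmed else _DROP
    if isinstance(value, dict):
        cleaned = {}
        for key, item in value.items():
            pruned = prune(item)
            if pruned is not _DROP:
                cleaned[key] = pruned
        return cleaned if cleaned else _DROP
    if isinstance(value, list):
        # List order is preserved, as the rules require.
        cleaned = [p for p in (prune(item) for item in value) if p is not _DROP]
        return cleaned if cleaned else _DROP
    return value  # booleans and numbers pass through unchanged


def to_output(contact: dict) -> str:
    """Serialize the pruned contact as compact JSON with no code fence."""
    pruned = prune(contact)
    return json.dumps({} if pruned is _DROP else pruned,
                      ensure_ascii=False, separators=(",", ":"))
```

Note that primary: false survives pruning, because only None, empty strings, and empty containers are treated as removable.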

2. Selecting the Person to Extract

If the text mentions multiple distinct individuals, extract the one that:

  1. Is explicitly labeled as the primary contact (e.g., “Primary Email”, “Primary Phone”, “Contact:” etc.), or
  2. Has the most complete set of fields (most distinct schema properties filled).

If you cannot decide, choose the first person described in the text. Only one contact object is to be returned.
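One possible reading of this selection rule, sketched in Python. The candidate structure is an assumption made for illustration: each candidate is a dict with a hypothetical "fields" mapping (the schema properties found for that person) and a hypothetical "is_primary" flag, listed in source order.

```python
from typing import Dict, List


def select_contact(candidates: List[Dict]) -> Dict:
    """Pick one person: explicit primary first, else most fields, else first."""
    if not candidates:
        raise ValueError("no candidates to choose from")
    # Rule 1: an explicitly labeled primary contact wins.
    for candidate in candidates:
        if candidate.get("is_primary"):
            return candidate
    # Rule 2: most distinct schema properties filled.
    # max() returns the first maximal element, so ties fall back to the
    # first person described in the text.
    return max(candidates, key=lambda c: len(c["fields"]))
```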


3. Required Fields

  • name.full_name must be present if any name information is extracted.
  • All other top‑level fields are optional, but every object/array entry you include must satisfy its required sub‑fields (e.g., each email_addresses entry must contain email).

4. Extraction Mapping & Detailed Heuristics

Each entry lists the schema path, what to extract, and how to format or infer it.

  • name.full_name – Build the complete name from any available parts (prefix, first, middle, last, suffix). Prefix before the name, suffix after.
  • name.first_name, name.middle_name, name.last_name, name.prefix, name.suffix – Extract individual components when explicitly labeled (First Name:, Middle:, Last Name:, Title:, Prefix:, Suffix:) or infer from a full‑name line. Omit any component you cannot locate.
  • job_title – Professional title / position. Use a line labeled “Title”, “Position”, “Job Title”, or the line immediately after the name if it clearly looks like a title.
  • company.name – Company or organization name. Usually on its own line or after a label like “Company:”.
  • company.department – Department or division. Look for a label “Department:” or a line directly under the company name that appears to be a department.
  • company.industry – Industry / sector. Include only if explicitly labeled (e.g., “Industry: Healthcare”).
  • email_addresses[] – All email addresses. For each object:
    • email – exact address as in source (must match email format).
    • type – infer:
      • If the domain matches the company domain → “work”.
      • If the domain is a common personal provider (gmail.com, yahoo.com, outlook.com, hotmail.com, etc.) → “personal”.
      • Otherwise → “other”.
    • primary – true for the email explicitly marked “Primary Email”, “Primary”, or similar. If no explicit primary, set true for the first email encountered and false for all others. Every email entry must contain primary (true/false).
  • phone_numbers[] – All phone numbers. For each object:
    • number – keep formatting exactly as in source (including parentheses, dashes, spaces).
    • type – infer from preceding label: “Mobile”, “Cell” → “mobile”; “Work”, “Direct”, “Office” → “work”; “Home”, “Residence” → “home”; “Fax” → “fax”; otherwise “other”.
    • extension – if the line contains “ext”, “extension”, “x”, capture the following digits (as a string).
    • primary – only include this field and set to true if the phone is explicitly labeled “Primary Phone”, “Primary”, or similar. If no explicit primary, omit the primary field entirely (do not add primary: false).
  • addresses[] – All physical address blocks. For each object:
    • type – infer from preceding label (“Work Address”, “Home Address”, “Mailing Address”, etc.). If unclear, default to “other”.
    • full_address – concatenate all address lines exactly as they appear, separated by commas.
    • street, city, state, postal_code, country – include only if you can parse them confidently (e.g., US “Boston, MA 02115”). Omit any component you cannot determine.
    • If the source says “Same as work address”, duplicate the full_address (and any parsed components) for the new type.
  • websites[] – All URLs that are not email addresses or social‑media links. For each object:
    • url – raw URL text (keep “http://”/“https://” only if present).
    • type – infer: if the domain matches the company domain → “company”; if the URL appears under a personal label (“Personal Site”, “Portfolio”) → “personal”; otherwise “other”.
  • social_media – Social‑media handles. Include only the keys defined in the schema (linkedin, twitter, github, facebook, instagram). For each present key, output the exact string after the label (URL or handle). If additional platforms appear, place them in social_media.other as an array of objects { "platform": "...", "handle": "..." }.
  • instant_messaging[] – Instant‑messaging handles. Include an entry only if the source explicitly lists an IM service (e.g., “Skype: live:john”). Each entry: { "platform": "...", "handle": "..." }.
  • preferred_contact_method – Preferred method of contact. Output the exact value from the input only if it matches one of the allowed enum values (email, phone, text, social_media, other).
  • timezone – Time zone. Copy the string exactly as shown (e.g., “Eastern (ET)”).
  • language – Preferred language(s). Copy the string exactly as shown (e.g., “English, Spanish”).
  • assistant – Assistant’s contact info. Populate name, email, phone using the same inference rules as for the main contact.
  • notes – Free‑form notes. Include any text explicitly labeled “Notes:” (or similar).
  • tags – Tags array. Include any comma‑separated list explicitly labeled “Tags:” (or similar).
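The email and phone heuristics in this mapping can be sketched in Python as follows. This is an illustration under assumptions, not part of the prompt: the function names and input shapes are hypothetical, the personal-provider set contains only the four domains named above (the source's "etc." is left open), and the checks are applied in the order the mapping lists them.

```python
PERSONAL_PROVIDERS = {"gmail.com", "yahoo.com", "outlook.com", "hotmail.com"}

PHONE_LABELS = {
    "mobile": "mobile", "cell": "mobile",
    "work": "work", "direct": "work", "office": "work",
    "home": "home", "residence": "home",
    "fax": "fax",
}


def email_type(address, company_domain=None):
    """Classify an address as work, personal, or other by its domain."""
    domain = address.rsplit("@", 1)[-1].lower()
    if company_domain and domain == company_domain.lower():
        return "work"
    if domain in PERSONAL_PROVIDERS:
        return "personal"
    return "other"


def build_emails(found, company_domain=None):
    """found: list of (address, explicitly_primary) tuples in source order."""
    any_explicit = any(flag for _, flag in found)
    entries = []
    for index, (address, flag) in enumerate(found):
        # Without an explicit marker, the first email becomes primary.
        primary = flag if any_explicit else index == 0
        entries.append({
            "email": address,  # kept exactly as in the source
            "type": email_type(address, company_domain),
            "primary": primary,
        })
    return entries


def build_phone(number, label=None, explicitly_primary=False, extension=None):
    """Build one phone entry; primary appears only when explicitly marked."""
    key = (label or "").strip(" :").lower()
    entry = {"number": number, "type": PHONE_LABELS.get(key, "other")}
    if extension:
        entry["extension"] = extension
    if explicitly_primary:
        entry["primary"] = True
    return entry
```

The asymmetry is deliberate: every email entry carries a primary flag, while a phone entry carries one only when the source marks it as primary.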

5. Special Edge‑Case Rules

  1. Email Primary Flag – If the input contains “Primary Email”, “Primary”, or similar, that email gets primary: true. Otherwise, the first email encountered gets primary: true; all other emails must have primary: false.
  2. Phone Primary Flag – Only set primary: true when the source explicitly marks a phone as primary. If no explicit primary, do not include the primary field for any phone.
  3. Domain Matching – For email and website type inference, extract the domain part (after @ for an email; between // and the first / for a URL) and compare it to the company’s domain (if the company email domain is visible). If you cannot determine the company domain, default to “other”.
  4. Country Names – Use the exact wording from the source (e.g., “USA”, “United States”). Do not expand or change abbreviations.
  5. Missing Sub‑fields – If a parent object is required (e.g., name) but some sub‑fields are missing, include only the sub‑fields you have. Do not add empty strings or placeholders.
  6. Ambiguous Data – When you cannot confidently determine a value (e.g., you see a number but cannot tell if it’s a fax), omit that field rather than guessing.
  7. Array Uniqueness – Do not deduplicate entries unless they are exact duplicates; keep the order as found.
  8. Extension Extraction – Capture only the numeric part after “ext”, “extension”, or “x”. Do not include surrounding text.
  9. Address Parsing – If you cannot reliably split an address into its components, still provide full_address and omit the missing components.
  10. Phone Number Formatting – Preserve the exact characters (including leading +, parentheses, spaces, dashes) as they appear in the source.
  11. Social Media Handles – If only the platform name is given (e.g., “LinkedIn” with no URL/handle), omit that platform from the output.
  12. Website vs. Social Media – URLs that point to known social‑media domains (linkedin.com, twitter.com, github.com, facebook.com, instagram.com) belong in social_media, not in websites.
  13. Preferred Contact Method Enum – The schema restricts this field to ["email","phone","text","social_media","other"]. Output only if the source value matches exactly (case‑sensitive).
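Rules 3, 8, and 12 lend themselves to a short Python sketch using only the standard library. The helper names are hypothetical, and two details are assumptions rather than anything the rules state: a leading "www." is stripped before comparing domains, and subdomains of a known social-media domain are treated as that platform.

```python
import re
from urllib.parse import urlsplit

SOCIAL_DOMAINS = {
    "linkedin.com": "linkedin",
    "twitter.com": "twitter",
    "github.com": "github",
    "facebook.com": "facebook",
    "instagram.com": "instagram",
}

# "extension", "ext", "ext." or a bare "x" not preceded by a letter,
# followed by digits; only the digits are captured.
EXT_RE = re.compile(r"(?<![A-Za-z])(?:extension|ext\.?|x)\s*[:.]?\s*(\d+)",
                    re.IGNORECASE)


def url_domain(url):
    """Return the host part of a URL, lowercased, without a leading www."""
    # urlsplit only fills netloc when a scheme or a leading // is present.
    netloc = urlsplit(url if "//" in url else "//" + url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc


def social_key(url):
    """Return the social_media key for a URL, or None for a plain website."""
    domain = url_domain(url)
    for known, key in SOCIAL_DOMAINS.items():
        if domain == known or domain.endswith("." + known):
            return key
    return None


def extract_extension(line):
    """Return the extension digits as a string, or None if absent."""
    match = EXT_RE.search(line)
    return match.group(1) if match else None
```

A URL for which social_key returns None would go to websites; otherwise it belongs under the returned social_media key.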

6. Validation Checklist (before returning the JSON)

  • ✅ Single JSON object, no extra commas or trailing punctuation.
  • ✅ All keys exactly as defined in the schema (case‑sensitive).
  • ✅ No null, empty strings, or empty arrays/objects unless explicitly allowed.
  • ✅ name.full_name present if any name data was extracted.
  • ✅ Every email_addresses entry includes email, type, and primary (true/false).
  • ✅ Every phone_numbers entry includes number and type; extension only if present; primary only if explicitly marked.
  • ✅ Every addresses entry includes full_address; other components only if confidently parsed.
  • ✅ social_media contains only allowed keys; extra platforms go to social_media.other.
  • ✅ preferred_contact_method only if it matches an allowed enum value.
  • ✅ The object validates against the supplied JSON Schema.
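Several of the checklist items can be checked mechanically. The Python sketch below is a hypothetical helper (checklist_problems is not part of the prompt) that covers only the structural items stated in the checklist; it does not replace validation against the supplied JSON Schema, and it assumes the schema allows no empty values.

```python
ALLOWED_SOCIAL = {"linkedin", "twitter", "github", "facebook", "instagram", "other"}
ALLOWED_METHODS = {"email", "phone", "text", "social_media", "other"}


def has_empty(value):
    """True if value, or anything nested in it, is None or empty."""
    if value is None or value == "" or value == [] or value == {}:
        return True
    if isinstance(value, dict):
        return any(has_empty(item) for item in value.values())
    if isinstance(value, list):
        return any(has_empty(item) for item in value)
    return False


def checklist_problems(contact):
    """Return a list of human-readable checklist violations."""
    problems = []
    if has_empty(contact):
        problems.append("null or empty value present")
    name = contact.get("name")
    if name is not None and "full_name" not in name:
        problems.append("name present without full_name")
    for entry in contact.get("email_addresses", []):
        if not {"email", "type", "primary"} <= set(entry):
            problems.append("email entry missing email, type, or primary")
        elif not isinstance(entry["primary"], bool):
            problems.append("email primary must be true or false")
    for entry in contact.get("phone_numbers", []):
        if not {"number", "type"} <= set(entry):
            problems.append("phone entry missing number or type")
        if entry.get("primary", True) is not True:
            problems.append("phone primary must be omitted unless true")
    for entry in contact.get("addresses", []):
        if "full_address" not in entry:
            problems.append("address entry missing full_address")
    if set(contact.get("social_media", {})) - ALLOWED_SOCIAL:
        problems.append("social_media has keys outside the allowed set")
    method = contact.get("preferred_contact_method")
    if method is not None and method not in ALLOWED_METHODS:
        problems.append("preferred_contact_method not an allowed value")
    return problems
```

An empty result means none of these particular checks failed; full conformance still depends on the schema itself.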

Follow these rules meticulously to produce a correct, schema‑compliant JSON output. Good luck!
