@voberoi
Last active March 19, 2024 20:20
Prompts used for chapter extraction in citymeetings.nyc -- from my talk at NYC School of Data 2024

These are the prompts I use to extract chapters in citymeetings.nyc as of March 23rd, 2024 -- the date of my NYC School of Data talk.

To simplify things I've removed all the code that stitches these prompts together and consolidated all the common items from each step in my chapter extraction pipeline.

See the slides & talk for a description of how these work in concert and how I review and fix issues.

NOTE: these work reasonably well and save tons of time, but I haven't systematically evaluated or improved them yet in the same way I have my speaker identification prompt.

### From the citymeetings.nyc talk at NYC School of Data 2024
#
# CHAPTER EXTRACTION STEP 1: EXTRACT TRANSCRIPT MARKERS
# -----------------------------------------------------
#
# I use `instructor` instead of a plaintext prompt. There are two elements to this prompt:
#
# 1. SYSTEM_PROMPT, which is my system prompt.
# 2. These models:
#    - HearingTranscriptMarkerType
#    - Question
#    - Testimony
#    - OpeningStatement
#    - Procedure
#    - HearingTranscriptMarker
#
# `instructor` allows me to embed the schema for what the LLM will generate alongside the prompt.
#
# I don't yet do this for speaker identification -- I started using `instructor` later, which
# is why my speaker identification prompt (https://gist.github.com/voberoi/3d82f6b2a55e79b7cd014847853be8bf)
# is a plaintext prompt.
import os
from enum import Enum
from typing import Optional, Union
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field
SYSTEM_PROMPT = """
I am going to give you a transcript from a New York City city council meeting.
The transcript will be formatted like this:
```
[SPEAKER: 0]
[T354] How was your weekend?
[SPEAKER: 1]
[T355] It was great!
[T356] How was yours?
[SPEAKER: 5]
[T357] Sorry to interrupt, but I think we need to start the meeting.
```
Each sentence in the transcript is preceded by a time marker. Time markers grow chronologically.
Your job is to generate "transcript markers", which are markers that identify the start of a question, remarks, procedure, or testimony in the transcript.
"""


class HearingTranscriptMarkerType(Enum):
    QUESTION = "QUESTION"
    OPENING_STATEMENT = "OPENING_STATEMENT"
    PROCEDURE = "PROCEDURE"
    TESTIMONY = "TESTIMONY"


class Question(BaseModel):
    question: str = Field(
        description="State the question that is being asked in this transcript marker. If the question is not phrased as a question, rephrase it as one."
    )
    answer: Optional[str] = Field(
        description="The answer to the question, if you can infer it from the context. Don't make it up if you can't.",
        default=None,
    )
    who_is_asking: Optional[str] = Field(
        description="The person asking the question, if you can infer it from the context. Don't make it up if you can't.",
        default=None,
    )
    who_is_being_asked: Optional[str] = Field(
        description="The person being asked the question, if you can infer it from the context. Don't make it up if you can't.",
        default=None,
    )


class Testimony(BaseModel):
    speaker_name: Optional[str] = Field(
        description="The name of the person giving testimony. If you can't infer it, don't make it up.",
        default=None,
    )
    speaker_role: Optional[str] = Field(
        description="The role of the person giving testimony. If you can't infer it, don't make it up.",
        default=None,
    )
    speaker_organization: Optional[str] = Field(
        description="The organization the person giving testimony is representing. If you can't infer it, don't make it up.",
        default=None,
    )
    testimony_title: str = Field(
        description="""
Create a title for the testimony formatted as <speaker> on <topic>. It can be a long title.
<speaker> should include all details about the speaker that you have. Some examples:
- "Manuela Frisas, President of the Workers Unite Project"
- "Manuela Frisas from the Workers Unite Project", if you only have the name and the organization.
- "Manuela Frisas", if you only have the name.
- "Workers Unite Project", if you have the organization but not the name.
With the topic, get very specific! It is okay if the topic is long.
If you can't infer the speaker's name, role, or organization just use "Member of the Public" and make the title as specific as you can.
""".strip()
    )


class OpeningStatement(BaseModel):
    speaker_name: Optional[str] = Field(
        description="The name of the council member making the opening statement. If you can't infer it, don't make it up. It's okay if it's just a last name.",
        default=None,
    )
    opening_statement_title: str = Field(
        description="Create a title for the opening statement formatted as Council Member <speaker> Opens <topic>. It can be a long title."
    )


class Procedure(BaseModel):
    procedure_title: str = Field(
        description="""Create a descriptive title for the procedure.
Some examples:
- "Front Matter" for the start of the meeting where the clerk asks people to silence their phones, etc.
- "Roll Call" for the roll call.
- "Roll Call Vote" for roll call votes.
- "Administering Oath" for the administering oath before testimony.
- "Pledge of Allegiance" for the pledge of allegiance.
- "Oath of Office" for the oath of office.
- "Transition to Testimony" for the time before and between testimonies.
There might be others! Be descriptive and specific.
"""
    )


class HearingTranscriptMarker(BaseModel):
    marker_type: HearingTranscriptMarkerType = Field(
        description="""The type of transcript marker you have identified in the transcript. You must choose from one of:
- QUESTION: The start of an inquiry by a council member. Often these are phrased as questions, but not always. For example, 'Tell me more about the budget' is an inquiry about the budget.
- OPENING_STATEMENT: Opening statements by a council member. These are standalone statements or addresses by a council member opening a committee hearing.
- PROCEDURE: The start of a procedural portion of the meeting, for example the start of a meeting, a vote, a roll call, calling up testimony, etc. The start of the meeting, where council staff asks people to silence their phones, should always be called "Front Matter".
- TESTIMONY: The start of testimony by a meeting attendee who is not a council member. Usually there is an agency that testifies; many other testimonies occur at the end of the meeting.
"""
    )
    time_marker: str = Field(
        description="The time marker where the transcript marker occurs. Must be 'T' followed by an integer."
    )
    marker_information: Union[Question, Testimony, OpeningStatement, Procedure] = Field(
        description="Additional information about the transcript marker. Must use the type that matches `marker_type`."
    )


class HearingTranscriptMarkerList(BaseModel):
    transcript_markers: list[HearingTranscriptMarker]


def get_transcript_markers(
    transcript_portion,
    model="gpt-4-turbo-preview",
):
    # Patch the OpenAI client so `create` accepts `response_model` and returns a validated model.
    client = instructor.patch(OpenAI(api_key=os.getenv("OPENAI_API_KEY")))
    messages = [
        {
            "role": "system",
            "content": SYSTEM_PROMPT,
        },
        {"role": "user", "content": transcript_portion},
    ]
    transcript_marker_list = client.chat.completions.create(
        model=model,
        response_model=HearingTranscriptMarkerList,
        messages=messages,
        max_retries=3,  # re-ask if the response fails schema validation
    )
    return transcript_marker_list
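
The transcript layout that SYSTEM_PROMPT describes (a `[SPEAKER: n]` header on each speaker change, then one `[T<int>]` line per sentence) can be produced from diarized sentences with a small helper. The sketch below is not part of the original pipeline: `Sentence`, `format_transcript`, and the `first_marker` parameter are hypothetical names chosen for illustration, and it assumes time markers simply increase by one per sentence.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    """Hypothetical container for one diarized sentence."""

    speaker: int
    text: str


def format_transcript(sentences, first_marker=0):
    """Render sentences in the `[SPEAKER: n]` / `[T<int>]` layout.

    A speaker header is emitted only when the speaker changes. Time markers
    increase by one per sentence, starting at `first_marker` (an assumption).
    """
    lines = []
    previous_speaker = None
    for offset, sentence in enumerate(sentences):
        if sentence.speaker != previous_speaker:
            lines.append(f"[SPEAKER: {sentence.speaker}]")
            previous_speaker = sentence.speaker
        lines.append(f"[T{first_marker + offset}] {sentence.text}")
    return "\n".join(lines)


example = format_transcript(
    [
        Sentence(0, "How was your weekend?"),
        Sentence(1, "It was great!"),
        Sentence(1, "How was yours?"),
        Sentence(5, "Sorry to interrupt, but I think we need to start the meeting."),
    ],
    first_marker=354,
)
print(example)
```

With these inputs the helper reproduces the sample transcript shown in SYSTEM_PROMPT, which makes it convenient for building small test fixtures.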
### From the citymeetings.nyc talk at NYC School of Data 2024
#
# CHAPTER EXTRACTION STEP 2: EXTRACT CHAPTERS
# -------------------------------------------
#
# I use `instructor` instead of a plaintext prompt. There are two elements to this prompt:
#
# 1. SYSTEM_PROMPT, which is my system prompt.
# 2. These models:
#    - QuestionChapter
#    - TestimonyChapter
#    - ProcedureChapter
#    - RemarksChapter
#
# `instructor` allows me to embed the schema for what the LLM will generate alongside the prompt.
#
# I don't yet do this for speaker identification -- I started using `instructor` later, which
# is why my speaker identification prompt (https://gist.github.com/voberoi/3d82f6b2a55e79b7cd014847853be8bf)
# is a plaintext prompt.
import os
from typing import List, Union
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field, field_validator
from .step_1_transcript_markers import HearingTranscriptMarker
SYSTEM_PROMPT = """
I am going to give you a portion of a transcript from a New York City city council meeting along with
a transcript marker that identifies what the start of this portion is about.
The first message will be the transcript marker.
You will get one of these four transcript marker types:
- QUESTION: The start of an inquiry by a council member. Often these are phrased as questions, but not always.
For example, "Tell me more about the budget" is an inquiry about the budget.
- TESTIMONY: The start of testimony by a meeting attendee. Usually these occur at the end of the meeting.
- REMARKS: The start of remarks by a council member. These are standalone statements or addresses.
- PROCEDURE: The start of a procedural portion of the meeting, for example the start of a meeting, a vote, a roll call, calling up testimony, etc.
The second message will be the transcript portion that the transcript marker identifies the start of.
Your job is to determine details about the chapter that begins with this transcript marker, including the
last sentence in the chapter.
Note that this transcript portion may encompass more than the chapter you are identifying.
"""


class Chapter(BaseModel):
    # check_fields=False because the field is declared on the subclasses, not on this base class.
    @field_validator("chapter_last_sentence_time_marker", check_fields=False)
    @classmethod
    def time_marker_must_start_with_t(cls, v):
        if not v.startswith("T"):
            raise ValueError("Time marker must start with the character 'T'")
        return v


class QuestionChapter(Chapter):
    transcript_marker: HearingTranscriptMarker = Field(
        description="This is the transcript marker that marks the beginning of this question chapter."
    )
    chapter_description: str = Field(
        description="""
This is the description of the chapter, assuming that it encompasses the question being asked and the answer.
Note that the answer might even be a short exchange.
Get specific in your description of the question and the answer to the question."""
    )
    chapter_last_sentence_time_marker: str = Field(
        description="""The time marker for the last sentence in the question chapter, assuming that it encompasses
the question being asked and the answer. It must start with the character 'T' and be followed by an integer.
Do not include sentences in this chapter that do not pertain to this question and answer."""
    )


class TestimonyChapter(Chapter):
    transcript_marker: HearingTranscriptMarker = Field(
        description="This is the transcript marker that marks the beginning of this testimony chapter."
    )
    chapter_description: str = Field(
        description="""
This is the description of the chapter, assuming that it encompasses only the testimony being given.
Be sure to include who is testifying and what they are testifying about and get specific in your
description of the testimony given."""
    )
    chapter_last_sentence_time_marker: str = Field(
        description="""The time marker for the last sentence in the testimony, assuming that it encompasses
only the testimony being given. It must start with the character 'T' and be followed by an integer.
Do not include sentences in this chapter that do not pertain to this testimony."""
    )


class ProcedureChapter(Chapter):
    transcript_marker: HearingTranscriptMarker = Field(
        description="This is the transcript marker that marks the beginning of this procedure chapter."
    )
    chapter_description: str = Field(
        description="""
This is the description of the chapter, assuming that it encompasses only the procedure being followed.
Be sure to include what procedure is being followed and get specific in your
description of the procedure being followed."""
    )
    chapter_last_sentence_time_marker: str = Field(
        description="""The time marker for the last sentence in the procedure, assuming that it encompasses
only the procedure being followed. It must start with the character 'T' and be followed by an integer.
Do not include sentences in this chapter that do not pertain to the procedure being followed."""
    )


class RemarksChapter(Chapter):
    transcript_marker: HearingTranscriptMarker = Field(
        description="This is the transcript marker that marks the beginning of this remarks chapter."
    )
    chapter_description: str = Field(
        description="""
This is the description of the chapter, assuming that it encompasses only the standalone remarks being made
by a council member.
Be sure to include who is making the remarks and what they are making remarks about. Get specific in your
description of the remarks made."""
    )
    chapter_last_sentence_time_marker: str = Field(
        description="""The time marker for the last sentence in the remarks, assuming that it encompasses
only the remarks by the council member being made. It must start with the character 'T' and be followed by an integer.
Do not include sentences in this chapter that do not pertain to the remarks being made."""
    )


class ChapterList(BaseModel):
    chapters: List[
        Union[QuestionChapter, TestimonyChapter, ProcedureChapter, RemarksChapter]
    ]


def extract_chapter(
    # A HearingTranscriptMarker (not a str): it is serialized with model_dump_json below.
    transcript_marker: HearingTranscriptMarker,
    transcript_portion: str,
    # The chapter class itself is passed in, not an instance.
    chapter_model: type[
        Union[QuestionChapter, TestimonyChapter, ProcedureChapter, RemarksChapter]
    ],
):
    """
    This function takes a transcript marker and a transcript portion and returns a chapter.
    """
    client = instructor.patch(OpenAI(api_key=os.getenv("OPENAI_API_KEY")))
    message_stack = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": transcript_marker.model_dump_json(indent=4)},
        {"role": "user", "content": transcript_portion},
    ]
    chapter = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        response_model=chapter_model,
        messages=message_stack,
        max_retries=3,
    )
    return chapter
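
Step 2 is given a transcript portion that starts at a marker and may run past the end of the chapter, and the model answers with `chapter_last_sentence_time_marker`. The stitching code was removed from this gist, so the following is only a guess at how such portions could be cut from the formatted transcript. `marker_number` and `slice_transcript` are hypothetical helpers, and keeping the speaker header that is in effect at the start of the slice is an assumption.

```python
import re

TIME_MARKER_LINE = re.compile(r"^\[T(\d+)\]")
SPEAKER_LINE = re.compile(r"^\[SPEAKER: \d+\]$")


def marker_number(time_marker):
    """Parse 'T357' into 357; raise ValueError if the marker is malformed."""
    match = re.fullmatch(r"T(\d+)", time_marker)
    if match is None:
        raise ValueError(
            f"Time marker must be 'T' followed by an integer: {time_marker!r}"
        )
    return int(match.group(1))


def slice_transcript(transcript, start_marker, end_marker=None):
    """Return the lines from start_marker through end_marker (inclusive).

    If end_marker is None the slice runs to the end of the transcript.
    """
    start = marker_number(start_marker)
    end = marker_number(end_marker) if end_marker is not None else None
    selected = []
    pending_speaker = None  # latest speaker header not yet written to the slice
    for line in transcript.splitlines():
        if SPEAKER_LINE.match(line):
            pending_speaker = line
            continue
        match = TIME_MARKER_LINE.match(line)
        if match is None:
            continue  # ignore anything that is not a sentence line
        number = int(match.group(1))
        if number < start:
            continue
        if end is not None and number > end:
            break
        if pending_speaker is not None:
            selected.append(pending_speaker)
            pending_speaker = None
        selected.append(line)
    return "\n".join(selected)


TRANSCRIPT = """[SPEAKER: 0]
[T354] How was your weekend?
[SPEAKER: 1]
[T355] It was great!
[T356] How was yours?
[SPEAKER: 5]
[T357] Sorry to interrupt, but I think we need to start the meeting."""

# Portion handed to step 2: everything from the marker onward.
print(slice_transcript(TRANSCRIPT, "T356"))
# Final chapter text once the model has returned the last-sentence marker.
print(slice_transcript(TRANSCRIPT, "T355", "T356"))
```

`marker_number` also enforces the "'T' followed by an integer" rule from the field descriptions, which is stricter than the `startswith("T")` check in the `Chapter` validator.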
### From the citymeetings.nyc talk at NYC School of Data 2024
#
# CHAPTER EXTRACTION STEP 3: WRITE TITLES AND DESCRIPTIONS
# --------------------------------------------------------
#
# I use `instructor` instead of a plaintext prompt. There is no system prompt -- I rely on
# the schema of the models to guide the LLM here.
#
# The docstring comments and descriptions all end up in the LLM system prompt. `instructor`
# does this.
#
# I also use these prompts to edit titles and descriptions in my review UI (which you can
# see in the slides for my talk).
import os
from typing import Union
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field


class QuestionChapterTitleAndDescription(BaseModel):
    """
    I will provide you with a portion of a transcript of a NYC city council meeting,
    which encompasses a chapter of the meeting where a council member is asking a
    question and receiving an answer.
    I will also provide you the initial title and description of the chapter.
    Your job is to edit the title and description subject to the rules provided.
    """

    edited_title: str = Field(
        description="""
The title of the chapter MUST be phrased as a question, or multiple questions, even if
speakers in the transcript portion are not phrasing their speech as a question.
For example, the speaker may phrase a question as a statement, like "I would like to know
more about the budget for the new school." You would need to phrase this as a question,
like "What is the budget for the new school?"
The question/questions in the title must match what is being asked or covered in the
transcript portion provided.
Do not capitalize the entire question, only the first word and proper nouns.
If you use acronyms, you must always spell them out. For example:
- This is bad: "What is the budget for ACE?"
- This is good: "What is the budget for the Accelerated Career Education (ACE) program?"
"""
    )
    edited_description: str = Field(
        description="""
For the description, first provide a single-sentence summary of the answer given in the transcript.
Make these sentences concise and to the point. Do not use any filler words, unnecessary adjectives or
adverbs. The language should be plain.
Then provide 3-5 bullet points that give more details about the answer. These bullet points should also
be concise and to the point, with the same style as the summary.
Try to make the summary Axios-style.
Do not editorialize anything. Just state the facts as presented in the transcript. Do not
say things like "The council member gave a great answer." or "This exchange highlights...", etc.
Use present tense in describing what is happening in the chapter. For example, "The council member
explains that the budget for the new school is $10 million." instead of "The council member explained..."
Finally, format the description like so:
```
{SINGLE_SENTENCE_SUMMARY}
- {BULLET_POINT_1}
...
- {BULLET_POINT_N}
```
"""
    )


class ProcedureChapterTitleAndDescription(BaseModel):
    """
    I will provide you with a portion of a transcript of a NYC city council meeting,
    which encompasses a procedural segment of the meeting.
    I will also provide you the initial title and description of the chapter.
    Your job is to edit the title and description subject to the rules provided.
    """

    edited_title: str = Field(
        description="""
If the chapter is at the beginning of the transcript and involves council staff asking
for attendees to silence their phones and prepare for the meeting, the title must be "Front Matter".
Otherwise the title for the chapter must be a plain statement of the procedure that is happening.
"""
    )
    edited_description: str = Field(
        description="""
For the description, provide a single-sentence summary of the procedure, and nothing else.
"""
    )


class TestimonyChapterTitleAndDescription(BaseModel):
    """
    I will provide you with a portion of a transcript of a NYC city council meeting,
    which encompasses a testimony given by a meeting attendee.
    I will also provide you the initial title and description of the chapter.
    Your job is to edit the title and description subject to the rules provided.
    """

    edited_title: str = Field(
        description="""
The title for the testimony must be "<speaker> on <topic>".
<speaker> should include all details about the speaker that you have. Some examples:
- "Manuela Frisas, President of the Workers Unite Project"
- "Manuela Frisas from the Workers Unite Project", if you only have the name and the organization.
- "Manuela Frisas", if you only have the name.
- "Workers Unite Project", if you have the organization but not the name.
With the topic, get very specific! It is okay if the topic is long.
If you can't infer the speaker's name, role, or organization just use "Member of the Public" and make the title as specific as you can.
If you use acronyms, you must always spell them out. For example:
- This is bad: "Manuela Frisas on ACE"
- This is good: "Manuela Frisas on the Accelerated Career Education (ACE) program at the NYC Department of Education."
"""
    )
    edited_description: str = Field(
        description="""
For the description, first provide a single-sentence summary of the testimony.
Make these sentences concise and to the point. Do not use any filler words, unnecessary adjectives or
adverbs. The language should be plain.
Then provide 3-5 bullet points that give more details about the testimony. These bullet points should also
be concise and to the point, with the same style as the summary.
Try to make the summary Axios-style.
Do not editorialize anything. Just state the facts as presented in the transcript.
Use present tense in describing what is happening in the chapter. For example, "The council member
explains that the budget for the new school is $10 million." instead of "The council member explained..."
Finally, format the description like so:
```
{SINGLE_SENTENCE_SUMMARY}
- {BULLET_POINT_1}
...
- {BULLET_POINT_N}
```
"""
    )


class RemarksChapterTitleAndDescription(BaseModel):
    """
    I will provide you with a portion of a transcript of a NYC city council meeting,
    which encompasses standalone remarks by a council member.
    I will also provide you the initial title and description of the chapter.
    Your job is to edit the title and description subject to the rules provided.
    """

    edited_title: str = Field(
        description="""
The title of the chapter must be "Council Member {name} on {topic}".
If you use acronyms, you must always spell them out. For example:
- This is bad: "Council Member {name} on ACE."
- This is good: "Council Member {name} on the Accelerated Career Education (ACE) program."
"""
    )
    edited_description: str = Field(
        description="""
For the description, first provide a single-sentence summary of the council member's remarks.
Make these sentences concise and to the point. Do not use any filler words, unnecessary adjectives or
adverbs. The language should be plain.
Then provide 3-5 bullet points that give more details about the remarks. These bullet points should also
be concise and to the point, with the same style as the summary.
Try to make the summary Axios-style.
Do not editorialize anything. Just state the facts as presented in the transcript.
Use present tense in describing what is happening in the chapter. For example, "The council member
explains that the budget for the new school is $10 million." instead of "The council member explained..."
Finally, format the description like so:
```
{SINGLE_SENTENCE_SUMMARY}
- {BULLET_POINT_1}
...
- {BULLET_POINT_N}
```
"""
    )


def edit_title_and_description(
    current_title: str,
    current_description: str,
    transcript_portion: str,
    additional_context: str,
    # The model class itself is passed in, not an instance.
    chapter_model: type[
        Union[
            QuestionChapterTitleAndDescription,
            TestimonyChapterTitleAndDescription,
            ProcedureChapterTitleAndDescription,
            RemarksChapterTitleAndDescription,
        ]
    ],
):
    """
    This function takes the current title and description of a chapter, along with a transcript portion and
    additional context, and uses instructor to prompt OpenAI to edit the title and description.
    The returned object is an instance of `chapter_model`.
    """
    client = instructor.patch(OpenAI(api_key=os.getenv("OPENAI_API_KEY")))
    return client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {
                "role": "user",
                "content": transcript_portion,
            },
            {
                "role": "user",
                "content": f"""
TITLE: {current_title}
DESCRIPTION: {current_description}
ADDITIONAL CONTEXT: {additional_context if additional_context else "No additional context."}
""",
            },
        ],
        response_model=chapter_model,
        max_retries=3,
    )
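
The description format requested for question, testimony, and remarks chapters (one summary line followed by 3-5 `- ` bullets) is easy to check mechanically before a chapter reaches review. The checker below is a hypothetical helper, not part of the original code: `check_description_format`, its parameters, and the problem messages are illustrative assumptions. Procedure descriptions are a single sentence, so they would be checked with `min_bullets=0, max_bullets=0`.

```python
def check_description_format(description, min_bullets=3, max_bullets=5):
    """Return a list of formatting problems (empty if the description looks fine)."""
    lines = [line.strip() for line in description.strip().splitlines() if line.strip()]
    if not lines:
        return ["description is empty"]

    problems = []
    summary, rest = lines[0], lines[1:]
    if summary.startswith("- "):
        problems.append("first line should be the summary, not a bullet")

    # Every line after the summary is expected to be a "- " bullet.
    non_bullets = [line for line in rest if not line.startswith("- ")]
    if non_bullets:
        problems.append(f"{len(non_bullets)} line(s) after the summary are not bullets")

    bullet_count = len(rest) - len(non_bullets)
    if not min_bullets <= bullet_count <= max_bullets:
        problems.append(
            f"expected {min_bullets}-{max_bullets} bullets, found {bullet_count}"
        )
    return problems


good = """The agency says the new school costs $10 million.
- Construction starts in the fall.
- Funding comes from the capital budget.
- The school opens in two years."""

print(check_description_format(good))
print(check_description_format("Only a summary.\n- One bullet."))
```

A non-empty result could be used to flag the chapter for manual review or to trigger another editing pass; which of those fits best depends on the surrounding pipeline.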