To remove newline characters (\n) or punctuation from strings within your Pydantic models, you can enhance your validators to include these additional checks and transformations. Python's str methods and the string module can be very helpful for such tasks.
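Here's an example of how you might adjust a Pydantic model validator to remove newline characters and punctuation from each item in a list attribute. It uses the Pydantic v1 @validator decorator, which Pydantic v2 deprecates in favor of @field_validator: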
import string
from typing import List

from pydantic import BaseModel, validator


class CleanedTextModel(BaseModel):
    texts: List[str]

    # Validator to clean texts by removing newlines and punctuation
    @validator('texts', pre=True, each_item=True)
    def clean_text(cls, v):
        # pre=True runs before type checks, so leave non-string items
        # for Pydantic's own validation to handle
        if not isinstance(v, str):
            return v
        # Replace newline characters with spaces
        cleaned_text = v.replace('\n', ' ')
        # Remove punctuation
        cleaned_text = cleaned_text.translate(str.maketrans('', '', string.punctuation))
        return cleaned_text
In this example, str.maketrans('', '', string.punctuation) creates a translation table that maps each punctuation character to None, effectively removing them. The translate method then applies this table to the string. This approach ensures that every string in the texts list is processed to remove both newline characters and punctuation.
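To make the translation-table mechanics concrete, here is a small standard-library sketch, independent of Pydantic. The names table, sample, cleaned, and curly are illustrative only. It also shows one limitation worth knowing: string.punctuation covers ASCII punctuation only, so typographic characters such as curly quotes are left in place.

```python
import string

# Build the table once; the third argument lists the characters to delete.
table = str.maketrans("", "", string.punctuation)

# The table is a dict keyed by Unicode code point; deleted characters map to None.
print(table[ord("!")])  # None

sample = "Hello, world!\nIt's 5 o'clock."
# Replace newlines with spaces first, then drop ASCII punctuation.
cleaned = sample.replace("\n", " ").translate(table)
print(cleaned)  # Hello world Its 5 oclock

# string.punctuation is ASCII-only, so curly quotes survive the translation.
curly = "\u201cquoted\u201d"
print(curly.translate(table) == curly)  # True
```

If non-ASCII punctuation matters for your data, the table would need to be extended with those characters explicitly.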
If you want to apply similar cleaning to attributes in your existing models (e.g., MinioObject, MinioEvent, etc.), you can introduce a similar validator to those models. Here’s how you could apply it to the MinioFile model to clean the content attribute, as an example:
import string
from typing import Optional

from pydantic import BaseModel, Field, validator

# MinioBucket is assumed to be one of your existing models, defined elsewhere


class MinioFile(BaseModel):
    key: str
    size: int
    contentType: Optional[str] = "application/octet-stream"
    metadata: Optional[dict] = Field(default_factory=dict)
    content: Optional[str] = None  # Text content that might need cleaning
    bucket: Optional[MinioBucket] = None

    @validator('content', pre=True)
    def clean_content(cls, v):
        # Only clean strings; None and other types pass through unchanged
        if isinstance(v, str):
            # Replace newlines with spaces, then remove punctuation
            v = v.replace('\n', ' ').translate(str.maketrans('', '', string.punctuation))
        return v
This adjustment ensures that any text content within the MinioFile model is cleaned of both newline characters and punctuation before any further processing or validation, improving the consistency and usability of your data.
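Because several models may need the same cleaning, one option is to keep the logic in a single plain function and have each validator call it. The sketch below uses only the standard library; clean_optional_text and _PUNCTUATION_TABLE are hypothetical names introduced for illustration, not part of Pydantic or your existing code.

```python
import string
from typing import Any

# Built once at import time instead of on every validator call.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def clean_optional_text(value: Any) -> Any:
    """Hypothetical shared helper: clean strings, pass everything else through."""
    if not isinstance(value, str):
        # None and non-string input are returned unchanged so that
        # normal field validation can deal with them.
        return value
    return value.replace("\n", " ").translate(_PUNCTUATION_TABLE)


print(clean_optional_text("a,b\nc!"))  # ab c
print(clean_optional_text(None))       # None
```

With a helper like this, the body of each model's validator can reduce to a single line that returns clean_optional_text(v).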
To clean text conditionally, removing newline characters and punctuation everywhere except inside triple backticks (which mark a code block or similar formatting where those characters should be preserved), you'll need a more nuanced approach in your Pydantic validator. The validator has to detect the segments enclosed in triple backticks and apply the cleaning only to the text outside them.
Here's an advanced example using regular expressions to identify and preserve text within triple backticks while cleaning the rest:
import re
import string
from typing import Optional

from pydantic import BaseModel, validator


class ContentModel(BaseModel):
    content: Optional[str] = None

    @validator('content', pre=True)
    def clean_content_except_code_blocks(cls, v):
        # Leave None and non-string input for Pydantic's own validation
        if not isinstance(v, str):
            return v

        # Pattern to detect code blocks enclosed in triple backticks
        code_block_pattern = r'(```[\s\S]+?```)'

        # Function to clean text (replace newlines with spaces, remove punctuation)
        def clean_text(text):
            text = text.replace('\n', ' ')
            text = text.translate(str.maketrans('', '', string.punctuation))
            return text

        # List to hold processed segments
        processed_segments = []
        # End index of the previously matched code block
        last_end = 0

        # Iterate over matches of code blocks and clean text outside of them
        for match in re.finditer(code_block_pattern, v):
            # Clean text from last end to the start of the current match
            processed_segments.append(clean_text(v[last_end:match.start()]))
            # Add the code block without cleaning
            processed_segments.append(v[match.start():match.end()])
            # Update last end index
            last_end = match.end()

        # Clean and add any remaining text after the last code block
        processed_segments.append(clean_text(v[last_end:]))

        # Join all segments back together
        return ''.join(processed_segments)
# Example usage
content = (
    "This is a sample text. It contains a code block:\n"
    "```python\n"
    "def hello_world():\n"
    "    print(\"Hello, world!\")\n"
    "```\n"
    "And more text after the code block."
)

model = ContentModel(content=content)
print(model.content)
- Regular Expression: The pattern r'(```[\s\S]+?```)' is used to match text enclosed in triple backticks. It captures everything between the opening and closing triple backticks, including newlines ([\s\S]+? ensures lazy matching to handle multiple code blocks correctly).
- Text Cleaning: Text outside of code blocks is cleaned by removing newlines and punctuation. The text within code blocks is left untouched.
- Iteration and Reconstruction: The code iterates through each match of the regular expression, cleaning text outside of matches (code blocks) and reconstructing the full text with cleaned segments and untouched code blocks.
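The three steps above can also be sketched without Pydantic, which makes the regex behavior easy to check in isolation. This version swaps the finditer loop for re.split: because the pattern has a capturing group, the matched code blocks are kept in the result, alternating with the plain-text segments. The names FENCE, CODE_BLOCK, PUNCTUATION_TABLE, and clean_outside_code_blocks are illustrative assumptions, not part of the model above.

```python
import re
import string

# The fence is built programmatically so this snippet can itself be
# pasted inside a fenced code block without breaking it.
FENCE = "`" * 3
CODE_BLOCK = re.compile(rf"({FENCE}[\s\S]+?{FENCE})")
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def clean_outside_code_blocks(text: str) -> str:
    # With a capturing group, re.split keeps the matched code blocks:
    # even indexes hold plain text, odd indexes hold code blocks.
    parts = CODE_BLOCK.split(text)
    for i in range(0, len(parts), 2):
        parts[i] = parts[i].replace("\n", " ").translate(PUNCTUATION_TABLE)
    return "".join(parts)


sample = (
    f"Intro, text!\n{FENCE}x = [1, 2]{FENCE}\n"
    f"Middle...\n{FENCE}y = (3,){FENCE}\nEnd."
)
print(clean_outside_code_blocks(sample))
```

Both code blocks in the sample keep their brackets and commas, while the surrounding text loses its punctuation and newlines. The lazy quantifier is what keeps the two blocks from being merged into one match.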
This approach allows you to conditionally apply text cleaning to your content while preserving the integrity of sections that are meant to be left as-is, such as code blocks enclosed in triple backticks.