@Cdaprod
Last active February 12, 2024 22:23
Clean text with PYDANTIC! :3

To remove newline characters (\n) or punctuation from strings within your Pydantic models, you can enhance your validators to include these additional checks and transformations. Python's str methods and the string module can be very helpful for such tasks.

Here's an example of how you might adjust a Pydantic model validator to remove newline characters and punctuation from each item in a list attribute. (The examples below use the Pydantic v1 @validator API; in Pydantic v2 the equivalent decorator is @field_validator.)

Extending the Pydantic Model with Custom Validation

from pydantic import BaseModel, validator
from typing import List
import string

class CleanedTextModel(BaseModel):
    texts: List[str]

    # Validator to clean texts by removing newlines and punctuation
    @validator('texts', pre=True, each_item=True)
    def clean_text(cls, v):
        # Remove newline characters
        cleaned_text = v.replace('\n', ' ')
        # Remove punctuation
        cleaned_text = cleaned_text.translate(str.maketrans('', '', string.punctuation))
        return cleaned_text

In this example, str.maketrans('', '', string.punctuation) creates a translation table that maps each punctuation character to None, effectively removing them. The translate method then applies this table to the string. This approach ensures that every string in the texts list is processed to remove both newline characters and punctuation.
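The cleaning step itself can be tried in isolation; this is a minimal stdlib-only sketch of the validator body, independent of Pydantic:

```python
import string

def clean_text(text: str) -> str:
    # Replace newlines with spaces, then strip all ASCII punctuation
    text = text.replace('\n', ' ')
    return text.translate(str.maketrans('', '', string.punctuation))

print(clean_text("Hello,\nworld!"))  # Hello world
```

Note that string.punctuation covers only ASCII punctuation; Unicode punctuation (curly quotes, em dashes, etc.) would need a different table.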

Applying This Concept to Your Existing Models

If you want to apply similar cleaning to attributes in your existing models (e.g., MinioObject, MinioEvent, etc.), you can introduce a similar validator to those models. Here’s how you could apply it to the MinioFile model to clean the content attribute, as an example:

from typing import Optional
from pydantic import BaseModel, Field, validator
import string

class MinioFile(BaseModel):
    key: str
    size: int
    contentType: Optional[str] = "application/octet-stream"
    metadata: Optional[dict] = Field(default_factory=dict)
    content: Optional[str] = None  # Text content that might need cleaning
    bucket: Optional[MinioBucket] = None  # MinioBucket defined elsewhere in your models

    @validator('content', pre=True)
    def clean_content(cls, v):
        if v is not None:
            # Remove newline characters and punctuation
            v = v.replace('\n', ' ').translate(str.maketrans('', '', string.punctuation))
        return v

This adjustment ensures that any text content within the MinioFile model is cleaned of both newline characters and punctuation before any further processing or validation, improving the consistency and usability of your data.
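The None guard in clean_content can be exercised on its own; here is a plain-function sketch of the validator body, independent of Pydantic:

```python
import string
from typing import Optional

def clean_content(v: Optional[str]) -> Optional[str]:
    # Mirrors the validator body: pass None through untouched
    if v is not None:
        v = v.replace('\n', ' ').translate(
            str.maketrans('', '', string.punctuation))
    return v

print(clean_content("line one\nline two!"))  # line one line two
print(clean_content(None))  # None
```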

To conditionally clean text—removing newline characters and punctuation unless the text is enclosed in triple backticks (indicating a code block or similar formatting where such characters should be preserved)—you'll need a more nuanced approach in your Pydantic validator. This involves detecting segments of text enclosed in triple backticks and applying text cleaning outside those segments.

Here's an advanced example using regular expressions to identify and preserve text within triple backticks while cleaning the rest:

Advanced Pydantic Model Validator

import re
from pydantic import BaseModel, validator
from typing import Optional
import string

class ContentModel(BaseModel):
    content: Optional[str] = None

    @validator('content', pre=True, each_item=False)
    def clean_content_except_code_blocks(cls, v):
        if v is None:
            return v

        # Pattern to detect code blocks enclosed in triple backticks
        code_block_pattern = r'(```[\s\S]+?```)'
        
        # Function to clean text (remove newlines and punctuation)
        def clean_text(text):
            text = text.replace('\n', ' ')
            text = text.translate(str.maketrans('', '', string.punctuation))
            return text

        # List to hold processed segments
        processed_segments = []
        
        # Last end index of matched code block
        last_end = 0
        
        # Iterate over matches of code blocks and clean text outside of them
        for match in re.finditer(code_block_pattern, v):
            # Clean text from last end to the start of the current match
            processed_segments.append(clean_text(v[last_end:match.start()]))
            # Add the code block without cleaning
            processed_segments.append(v[match.start():match.end()])
            # Update last end index
            last_end = match.end()
        
        # Clean and add any remaining text after the last code block
        processed_segments.append(clean_text(v[last_end:]))
        
        # Join all segments back together
        return ''.join(processed_segments)

# Example Usage
content = """
This is a sample text.It contains a code block:\n```python\n
def hello_world():\n    print("Hello, world!")```And more text after the code block.
"""
model = ContentModel(content=content)
print(model.content)

Explanation

  • Regular Expression: The pattern r'(```[\s\S]+?```)' is used to match text enclosed in triple backticks. It captures everything between the opening and closing triple backticks, including newlines ([\s\S]+? ensures lazy matching to handle multiple code blocks correctly).
  • Text Cleaning: Text outside of code blocks is cleaned by removing newlines and punctuation. The text within code blocks is left untouched.
  • Iteration and Reconstruction: The code iterates through each match of the regular expression, cleaning text outside of matches (code blocks) and reconstructing the full text with cleaned segments and untouched code blocks.
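A more compact variant of the same reconstruction (not in the original gist, just an alternative sketch) uses re.split with a capturing group: splitting keeps the matched code blocks in the result list, at the odd indices, so the segments can be processed by position:

```python
import re
import string

FENCE = '`' * 3  # triple backtick, built up to avoid a literal fence here
CODE_BLOCK = re.compile('(' + FENCE + r'[\s\S]+?' + FENCE + ')')

def clean_outside_code_blocks(text: str) -> str:
    table = str.maketrans('', '', string.punctuation)
    # Splitting on a capturing group keeps the matches, so code blocks
    # land at the odd indices of `parts`
    parts = CODE_BLOCK.split(text)
    return ''.join(
        part if i % 2 == 1  # code block: leave untouched
        else part.replace('\n', ' ').translate(table)
        for i, part in enumerate(parts)
    )

print(clean_outside_code_blocks('Hi, there!\n' + FENCE + 'x = 1' + FENCE + '\nBye.'))
```

This avoids tracking the last_end index by hand, at the cost of being slightly less explicit about what each segment is.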

This approach allows you to conditionally apply text cleaning to your content while preserving the integrity of sections that are meant to be left as-is, such as code blocks enclosed in triple backticks.
