Types API Reference

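pyCTAKES represents clinical text and its annotations with a small set of types: a Document container, a base Annotation class with span-based subclasses (Token, Sentence, Entity, Section), and the supporting Assertion and UMLSConcept classes.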

Core Types

`class Document`

Main document class containing text and all annotations.

Attributes:

- text (str): The clinical text content
- metadata (dict): Document metadata
- sentences (List[Sentence]): Sentence annotations
- tokens (List[Token]): Token annotations
- entities (List[Entity]): Entity annotations
- sections (List[Section]): Section annotations
- annotations (List[Annotation]): All annotations

Methods:

- to_json(): Serialize to JSON
- from_json(data): Deserialize from JSON

`class Annotation`

Base class for all annotations.

Attributes:

- start (int): Start character position
- end (int): End character position
- text (str): Annotated text span
- label (str): Annotation label/type
- confidence (float): Confidence score (0.0-1.0)
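The offsets used in the examples in this reference (start=12, end=20 for "diabetes" in "Patient has diabetes.") suggest that `end` is exclusive, so `text` should equal the document text sliced as `[start:end]`. The sketch below checks that alignment; `span_is_aligned` is a hypothetical helper written for illustration and is not part of pyCTAKES.

```python
def span_is_aligned(doc_text: str, start: int, end: int, span_text: str) -> bool:
    """Return True if span_text equals doc_text[start:end] (end exclusive).

    Hypothetical helper, not a pyCTAKES function.
    """
    # Reject spans that run backwards or fall outside the document
    if not (0 <= start <= end <= len(doc_text)):
        return False
    return doc_text[start:end] == span_text


note = "Patient has diabetes."
print(span_is_aligned(note, 12, 20, "diabetes"))  # True
print(span_is_aligned(note, 12, 21, "diabetes"))  # False: slice is "diabetes."
```

Running a check like this before constructing an annotation catches off-by-one offsets early.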

`class Token(Annotation)`

Represents a single token.

Attributes:

- pos (str): Part-of-speech tag
- lemma (str): Lemmatized form
- is_alpha (bool): Contains alphabetic characters
- is_digit (bool): Contains only digits
- is_punct (bool): Is punctuation

`class Sentence(Annotation)`

Represents a sentence with tokens.

Attributes:

- tokens (List[Token]): Tokens in the sentence

`class Entity(Annotation)`

Represents a named entity.

Attributes:

- assertion (Assertion): Assertion information
- umls_concept (UMLSConcept): UMLS concept mapping

`class Section(Annotation)`

Represents a document section.

Attributes:

- section_type (str): Type of section

`class Assertion`

Assertion attributes for entities.

Attributes:

- polarity (str): POSITIVE, NEGATIVE
- uncertainty (str): CERTAIN, UNCERTAIN
- temporality (str): PRESENT, PAST, FUTURE
- experiencer (str): PATIENT, FAMILY, OTHER

`class UMLSConcept`

UMLS concept information.

Attributes:

- cui (str): Concept Unique Identifier
- preferred_term (str): Preferred term
- semantic_types (List[str]): Semantic type codes
- sources (List[str]): Source vocabularies
- confidence (float): Mapping confidence

Usage Examples

Creating Documents

from pyctakes.types import Document

# Create document from text
doc = Document(text="Patient has diabetes and hypertension.")

# Create with metadata
doc = Document(
    text="Clinical note text",
    metadata={
        "patient_id": "12345",
        "note_type": "progress_note",
        "date": "2025-01-15"
    }
)

Working with Annotations

from pyctakes.types import Annotation, Document

doc = Document(text="Patient has diabetes.")

# Create annotation
annotation = Annotation(
    start=12,
    end=20,
    text="diabetes",
    label="CONDITION",
    confidence=0.95
)

# Add to document
doc.annotations.append(annotation)

# Access annotations
for ann in doc.annotations:
    print(f"{ann.text}: {ann.label} ({ann.confidence})")

Working with Entities

from pyctakes.types import Assertion, Document, Entity

doc = Document(text="Patient has diabetes.")

# Create entity with assertion
entity = Entity(
    start=12,
    end=20,
    text="diabetes",
    label="CONDITION",
    assertion=Assertion(
        polarity="POSITIVE",
        uncertainty="CERTAIN",
        temporality="PRESENT"
    )
)

# Add to document
doc.entities.append(entity)

Working with Tokens

from pyctakes.types import Document, Sentence, Token

doc = Document(text="Patient has diabetes.")

# Create tokens
tokens = [
    Token(start=0, end=7, text="Patient", pos="NOUN"),
    Token(start=8, end=11, text="has", pos="VERB"),
    Token(start=12, end=20, text="diabetes", pos="NOUN")
]

# Create sentence
sentence = Sentence(
    start=0,
    end=21,
    text="Patient has diabetes.",
    tokens=tokens
)

# Add to document
doc.sentences.append(sentence)
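Counting character offsets by hand is error-prone, so it can help to derive them from the text. The sketch below uses a simple regular expression to produce `(start, end, text)` triples that could be passed to `Token`. The pattern and the helper name `token_spans` are assumptions made for illustration; this is not the tokenizer pyCTAKES itself uses.

```python
import re

# Simplified pattern: runs of word characters, or single punctuation marks
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def token_spans(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, token_text) triples with an exclusive end offset."""
    return [(m.start(), m.end(), m.group()) for m in TOKEN_PATTERN.finditer(text)]


for start, end, tok in token_spans("Patient has diabetes."):
    print(start, end, tok)
# 0 7 Patient
# 8 11 has
# 12 20 diabetes
# 20 21 .
```

The first three triples match the offsets hard-coded in the example above.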

Working with Sections

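from pyctakes.types import Document, Section

section_text = "Chief Complaint: Chest pain for 2 days."
doc = Document(text=section_text)

# Create section; the span covers exactly the section text
section = Section(
    start=0,
    end=len(section_text),
    section_type="CHIEF_COMPLAINT",
    text=section_text
)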

# Add to document
doc.sections.append(section)

Working with UMLS Concepts

from pyctakes.types import UMLSConcept

# Create UMLS concept
concept = UMLSConcept(
    cui="C0011849",
    preferred_term="Diabetes Mellitus",
    semantic_types=["T047"],  # Disease or Syndrome
    sources=["SNOMEDCT_US", "ICD10CM"],
    confidence=0.89
)

# Attach to an existing Entity, such as the one created under "Working with Entities"
entity.umls_concept = concept
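Semantic type codes such as `T047` are hard to read on their own. A small lookup table can translate them for display. The sketch below is illustrative only: the table covers three UMLS semantic types, and `describe_semantic_types` is a hypothetical helper, not a pyCTAKES function.

```python
# Partial lookup of UMLS semantic type codes (TUIs) to their names
SEMANTIC_TYPE_NAMES = {
    "T047": "Disease or Syndrome",
    "T121": "Pharmacologic Substance",
    "T184": "Sign or Symptom",
}


def describe_semantic_types(codes: list[str]) -> list[str]:
    """Map each code to its name, leaving codes missing from the table unchanged."""
    return [SEMANTIC_TYPE_NAMES.get(code, code) for code in codes]


print(describe_semantic_types(["T047"]))          # ['Disease or Syndrome']
print(describe_semantic_types(["T047", "T999"]))  # 'T999' is a placeholder for an unmapped code
```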

Type Hierarchies

Annotation Hierarchy

Annotation (base)
├── Token
├── Sentence
├── Entity
└── Section

Entity Types

Common entity labels used in pyCTAKES:

MEDICATION: Drugs and medications
DOSAGE: Medication dosages
FREQUENCY: Dosing frequency
CONDITION: Medical conditions and diseases
SYMPTOM: Signs and symptoms
ANATOMY: Anatomical structures
PROCEDURE: Medical procedures
TEST_RESULT: Lab results and measurements
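Because every entity carries one of these labels, a quick tally by label is a convenient sanity check on extraction output. The sketch below works on plain dictionaries shaped like the entries of the serialized `entities` list; the sample entities are made up for illustration.

```python
from collections import Counter

# Illustrative entities, shaped like items of the serialized "entities" list
entities = [
    {"text": "metformin", "label": "MEDICATION"},
    {"text": "500 mg", "label": "DOSAGE"},
    {"text": "diabetes", "label": "CONDITION"},
    {"text": "hypertension", "label": "CONDITION"},
]

# Count how many entities carry each label
label_counts = Counter(entity["label"] for entity in entities)

for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```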

Section Types

Standard clinical section types:

CHIEF_COMPLAINT: Primary reason for visit
HISTORY_OF_PRESENT_ILLNESS: Current problem details
PAST_MEDICAL_HISTORY: Previous medical history
MEDICATIONS: Current medications
ALLERGIES: Known allergies
SOCIAL_HISTORY: Social and lifestyle factors
FAMILY_HISTORY: Family medical history
REVIEW_OF_SYSTEMS: Systematic review
PHYSICAL_EXAMINATION: Physical exam findings
ASSESSMENT_AND_PLAN: Clinical assessment and treatment plan
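The section type names follow an upper-case, underscore-separated convention, so a header line from a note can often be normalized into a candidate `section_type`. The sketch below shows one way to do that. `header_to_section_type` is a hypothetical helper; whether pyCTAKES normalizes headers this way is not stated in this reference.

```python
import re


def header_to_section_type(header: str) -> str:
    """Normalize a header such as 'Chief Complaint:' to 'CHIEF_COMPLAINT'."""
    # Replace anything that is not a letter or space (colons, digits, etc.)
    cleaned = re.sub(r"[^A-Za-z ]+", " ", header)
    # Upper-case and join the remaining words with underscores
    return "_".join(cleaned.upper().split())


print(header_to_section_type("Chief Complaint:"))            # CHIEF_COMPLAINT
print(header_to_section_type("History of Present Illness"))  # HISTORY_OF_PRESENT_ILLNESS
```

A result should still be compared against the list of standard types above before it is used, since real headers vary widely.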

Assertion Values

Polarity

POSITIVE: Entity is present/affirmed
NEGATIVE: Entity is negated/denied

Uncertainty

CERTAIN: Definite statement
UNCERTAIN: Possible/probable/likely

Temporality

PRESENT: Current condition
PAST: Historical condition
FUTURE: Future/planned condition

Experiencer

PATIENT: Refers to the patient
FAMILY: Refers to family member
OTHER: Refers to someone else
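Downstream code usually combines these four values, for example to keep only findings that are affirmed, certain, current, and about the patient. The sketch below uses a stand-in dataclass, `AssertionSketch`, so that it runs without pyCTAKES installed. The class, its defaults, and the helper `is_active_patient_finding` are assumptions for illustration, not the library's definitions.

```python
from dataclasses import dataclass


@dataclass
class AssertionSketch:
    """Stand-in mirroring the documented Assertion attributes (defaults assumed)."""
    polarity: str = "POSITIVE"
    uncertainty: str = "CERTAIN"
    temporality: str = "PRESENT"
    experiencer: str = "PATIENT"


def is_active_patient_finding(assertion: AssertionSketch) -> bool:
    """True only for affirmed, certain, present findings about the patient."""
    return (
        assertion.polarity == "POSITIVE"
        and assertion.uncertainty == "CERTAIN"
        and assertion.temporality == "PRESENT"
        and assertion.experiencer == "PATIENT"
    )


affirmed = AssertionSketch()
denied = AssertionSketch(polarity="NEGATIVE")          # e.g. "denies chest pain"
family = AssertionSketch(experiencer="FAMILY")         # e.g. "mother had diabetes"

print(is_active_patient_finding(affirmed))  # True
print(is_active_patient_finding(denied))    # False
print(is_active_patient_finding(family))    # False
```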

JSON Serialization

A Document, together with its nested annotations, can be serialized to JSON and restored:

import json
from pyctakes.types import Document

# Create document with annotations
doc = Document(text="Patient has diabetes.")
# ... add annotations ...

# Serialize to JSON
json_data = doc.to_json()
print(json.dumps(json_data, indent=2))

# Deserialize from JSON
doc_restored = Document.from_json(json_data)

Example JSON output:

{
  "text": "Patient has diabetes.",
  "metadata": {},
  "sentences": [
    {
      "start": 0,
      "end": 21,
      "text": "Patient has diabetes.",
      "tokens": [
        {
          "start": 0,
          "end": 7,
          "text": "Patient",
          "pos": "NOUN",
          "lemma": "patient"
        }
      ]
    }
  ],
  "entities": [
    {
      "start": 12,
      "end": 20,
      "text": "diabetes",
      "label": "CONDITION",
      "confidence": 0.95,
      "assertion": {
        "polarity": "POSITIVE",
        "uncertainty": "CERTAIN",
        "temporality": "PRESENT",
        "experiencer": "PATIENT"
      }
    }
  ],
  "sections": [
    {
      "start": 0,
      "end": 21,
      "section_type": "ASSESSMENT_AND_PLAN"
    }
  ]
}
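Serialized output in this shape can also be consumed without pyCTAKES, using only the standard `json` module. The sketch below parses a trimmed copy of the example output and collects the affirmed entities; the variable names are illustrative.

```python
import json

# Trimmed copy of the example output, keeping only the fields used below
payload = json.loads("""
{
  "text": "Patient has diabetes.",
  "entities": [
    {
      "start": 12, "end": 20, "text": "diabetes",
      "label": "CONDITION", "confidence": 0.95,
      "assertion": {
        "polarity": "POSITIVE", "uncertainty": "CERTAIN",
        "temporality": "PRESENT", "experiencer": "PATIENT"
      }
    }
  ]
}
""")

# Keep the text of every entity whose assertion polarity is POSITIVE
affirmed_texts = [
    entity["text"]
    for entity in payload["entities"]
    if entity["assertion"]["polarity"] == "POSITIVE"
]
print(affirmed_texts)  # ['diabetes']
```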

Type Validation

pyCTAKES includes validation for type safety:

from pyctakes.types import Annotation, ValidationError

try:
    # Invalid annotation (end before start)
    annotation = Annotation(start=10, end=5, text="invalid")
except ValidationError as e:
    print(f"Validation error: {e}")
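This reference does not describe how the validation is implemented. As a rough illustration of the idea, a dataclass can reject bad values at construction time in `__post_init__`. Everything in the sketch below (`SpanError`, `CheckedSpan`, the specific rules, and the default confidence) is a hypothetical stand-in rather than pyCTAKES code.

```python
from dataclasses import dataclass


class SpanError(ValueError):
    """Hypothetical stand-in for a validation error type."""


@dataclass
class CheckedSpan:
    """Stand-in annotation that validates its fields when constructed."""
    start: int
    end: int
    text: str
    confidence: float = 1.0  # assumed default, for illustration only

    def __post_init__(self) -> None:
        # Spans must be non-negative and must not run backwards
        if self.start < 0 or self.end < self.start:
            raise SpanError(f"invalid span: start={self.start}, end={self.end}")
        # Confidence follows the documented 0.0-1.0 range
        if not 0.0 <= self.confidence <= 1.0:
            raise SpanError(f"confidence out of range: {self.confidence}")


try:
    CheckedSpan(start=10, end=5, text="invalid")
except SpanError as exc:
    print(f"Validation error: {exc}")
```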

Extending Types

You can extend the base types for custom use cases:

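from dataclasses import dataclass

from pyctakes.types import Entity


@dataclass
class CustomEntity(Entity):
    # Give new fields defaults so they can follow inherited fields that have defaults
    custom_field: str = ""
    custom_score: float = 0.0

    def custom_method(self):
        # Example of behavior built on an inherited attribute
        return f"Custom: {self.text}"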

Best Practices

Use appropriate types: Choose the most specific type for your annotations
Set confidence scores: Always provide confidence when available
Validate inputs: Check spans and text alignment
Use standard labels: Stick to established entity and section types
Include metadata: Add relevant document metadata for tracking