Building Your First Data Source

Welcome! In this tutorial, you'll build your first Terraform data source using pyvider. Data sources are read-only queries that fetch information without creating infrastructure.

🤖 AI-Generated Content

This documentation was generated with AI assistance and is still being audited. Some, or potentially a lot, of this information may be inaccurate.

What You'll Learn:

  • How data sources differ from resources
  • Creating a data source class with pyvider
  • Defining input/output schemas
  • Implementing read operations
  • Using data sources in Terraform

Time to Complete: 10-15 minutes

Prerequisites:

  • Python 3.11+ installed
  • pyvider installed (installation guide)
  • Basic Python knowledge
  • Basic Terraform knowledge

What is a Data Source?

A data source is a read-only query that fetches information from external systems. Unlike resources (which manage infrastructure), data sources just read data.

Examples:

  • Query file information
  • Look up cloud resources
  • Fetch API data
  • Read database records

Key Differences from Resources:

Data Source                  Resource
Read-only                    Read-write
No lifecycle (just query)    Full CRUD lifecycle
No state management          Terraform tracks state
Quick queries                Manages infrastructure

Step 1: Create Your Package Structure

mkdir -p my_provider/data_sources
touch my_provider/__init__.py
touch my_provider/data_sources/__init__.py
touch my_provider/data_sources/file_info.py

Your structure:

my_provider/
├── __init__.py
└── data_sources/
    ├── __init__.py
    └── file_info.py    # We'll work here


Step 2: Define Runtime Types

Data sources have two types:

  • Config - Input from user (what to query)
  • Data - Output to user (query results)

Open my_provider/data_sources/file_info.py:

import attrs

# Configuration: User inputs
@attrs.define
class FileInfoConfig:
    """What the user wants to query."""
    path: str  # Which file to query

# Data: Query results
@attrs.define
class FileInfoData:
    """Information we return about the file."""
    id: str          # Unique identifier
    path: str        # File path
    size: int        # File size in bytes
    exists: bool     # Whether file exists
    content: str     # File content

Why two types?

Config holds what the user sets; Data holds what the provider computes. Keeping them separate makes it obvious which attributes are inputs and which are results.


Step 3: Create the Data Source Class

Now let's create the data source:

from pyvider.data_sources import register_data_source, BaseDataSource
from pyvider.schema import s_data_source, a_str, a_num, a_bool, PvsSchema

@register_data_source("file_info")
class FileInfo(BaseDataSource):
    """Reads information about a local file."""

    # Link our runtime types
    config_class = FileInfoConfig
    state_class = FileInfoData

    @classmethod
    def get_schema(cls) -> PvsSchema:
        """Define what Terraform users see."""
        return s_data_source({
            # Input (from user)
            "path": a_str(required=True, description="File path to query"),

            # Outputs (we compute all of these)
            "id": a_str(computed=True, description="File path as ID"),
            "size": a_num(computed=True, description="File size in bytes"),
            "exists": a_bool(computed=True, description="Whether file exists"),
            "content": a_str(computed=True, description="File content"),
        })

What's happening?

  • @register_data_source("file_info") - Registers the class as a Terraform data source
  • config_class / state_class - Link our attrs classes (FileInfoData is attached through state_class)
  • All outputs are computed=True - The provider calculates them; the user never sets them

Step 4: Implement the Read Method

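Data sources implement a single operation: read(). It takes a ResourceContext and returns the data object. Add the method to the FileInfo class:

from pathlib import Path
from pyvider.resources.context import ResourceContext

# Inside class FileInfo:
async def read(self, ctx: ResourceContext) -> FileInfoData | None:
    """Read file information."""
    if not ctx.config:
        return None

    file_path = Path(ctx.config.path)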

    # Check if file exists
    if file_path.exists():
        # File exists - read information
        content = file_path.read_text()
        size = file_path.stat().st_size

        return FileInfoData(
            id=str(file_path.absolute()),
            path=str(file_path),
            size=size,
            exists=True,
            content=content,
        )
    else:
        # File doesn't exist - return empty data
        return FileInfoData(
            id=str(file_path.absolute()),
            path=str(file_path),
            size=0,
            exists=False,
            content="",
        )

Key points:

  • Takes ctx: ResourceContext parameter (same as resources)
  • Access configuration via ctx.config
  • Return None only if config is unavailable
  • Otherwise always return data (even if the file doesn't exist)
  • Generate a stable, deterministic ID
  • Handle missing data gracefully
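The branching in read() can be exercised without Terraform or pyvider. The sketch below is a hypothetical helper, probe_file, that is not part of pyvider. It mirrors the logic above using only the standard library and returns a plain dict instead of FileInfoData. It also makes two assumptions the tutorial code does not: a directory counts as "not a file", and content that is not valid UTF-8 is reported as empty text, because Path.read_text() raises in both of those cases.

```python
from pathlib import Path


def probe_file(path: str) -> dict:
    """Hypothetical helper mirroring read(); returns a plain dict."""
    file_path = Path(path)
    info = {
        "id": str(file_path.absolute()),  # same input path -> same ID
        "path": str(file_path),
        "size": 0,
        "exists": False,
        "content": "",
    }
    # is_file() is False for missing paths and for directories.
    if not file_path.is_file():
        return info

    info["exists"] = True
    info["size"] = file_path.stat().st_size
    try:
        info["content"] = file_path.read_text(encoding="utf-8")
    except UnicodeDecodeError:
        # Assumption: binary files keep their size but expose no text.
        info["content"] = ""
    return info
```

Whether a directory or a binary file should be reported this way is a design choice for your provider; the point is that read() decides explicitly instead of letting an exception escape.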

Complete Code

Here's your complete file_info.py:

import attrs
from pyvider.data_sources import register_data_source, BaseDataSource
from pyvider.resources.context import ResourceContext
from pyvider.schema import s_data_source, a_str, a_num, a_bool, PvsSchema
from pathlib import Path

# Configuration (input)
@attrs.define
class FileInfoConfig:
    path: str

# Data (output)
@attrs.define
class FileInfoData:
    id: str
    path: str
    size: int
    exists: bool
    content: str

@register_data_source("file_info")
class FileInfo(BaseDataSource):
    """Reads information about a local file."""

    config_class = FileInfoConfig
    state_class = FileInfoData

    @classmethod
    def get_schema(cls) -> PvsSchema:
        """Define Terraform schema."""
        return s_data_source({
            # Input
            "path": a_str(required=True, description="File path to query"),

            # Outputs (all computed)
            "id": a_str(computed=True, description="File path as ID"),
            "size": a_num(computed=True, description="File size in bytes"),
            "exists": a_bool(computed=True, description="Whether file exists"),
            "content": a_str(computed=True, description="File content"),
        })

    async def read(self, ctx: ResourceContext) -> FileInfoData | None:
        """Read file information."""
        if not ctx.config:
            return None

        file_path = Path(ctx.config.path)

        if file_path.exists():
            content = file_path.read_text()
            size = file_path.stat().st_size
            return FileInfoData(
                id=str(file_path.absolute()),
                path=str(file_path),
                size=size,
                exists=True,
                content=content,
            )
        else:
            return FileInfoData(
                id=str(file_path.absolute()),
                path=str(file_path),
                size=0,
                exists=False,
                content="",
            )

Step 5: Test with Terraform

Create a Terraform configuration test.tf:

terraform {
  required_providers {
    local = {
      source = "mycompany/local"
    }
  }
}

# Query file information
data "local_file_info" "readme" {
  path = "../README.md"
}

# Use the data in outputs
output "readme_exists" {
  value = data.local_file_info.readme.exists
}

output "readme_size" {
  value = data.local_file_info.readme.size
}

# Use data in a resource
resource "local_file" "summary" {
  path    = "summary.txt"
  content = <<EOT
README Information:
- Exists: ${data.local_file_info.readme.exists}
- Size: ${data.local_file_info.readme.size} bytes
EOT
}

Run it:

terraform init
terraform plan
terraform apply

You should see:

  • Data source queries the file
  • Outputs show file existence and size
  • Resource uses the data

Advanced Example: API Data Source

Here's a more realistic example that queries an API:

import attrs
from pyvider.data_sources import register_data_source, BaseDataSource
from pyvider.resources.context import ResourceContext
from pyvider.schema import s_data_source, a_str, a_num, a_list, PvsSchema
import httpx

@attrs.define
class APIQueryConfig:
    endpoint: str
    limit: int = 10

@attrs.define
class APIQueryData:
    id: str
    endpoint: str
    results: list[str]
    count: int

@register_data_source("api_query")
class APIQuery(BaseDataSource):
    """Queries an external API."""

    config_class = APIQueryConfig
    state_class = APIQueryData

    @classmethod
    def get_schema(cls) -> PvsSchema:
        return s_data_source({
            # Inputs
            "endpoint": a_str(required=True, description="API endpoint"),
            "limit": a_num(default=10, description="Max results"),

            # Outputs
            "id": a_str(computed=True, description="Query ID"),
            "results": a_list(a_str(), computed=True, description="Results"),
            "count": a_num(computed=True, description="Result count"),
        })

    async def read(self, ctx: ResourceContext) -> APIQueryData | None:
        """Execute API query."""
        if not ctx.config:
            return None

        async with httpx.AsyncClient() as client:
            response = await client.get(
                f"https://api.example.com{ctx.config.endpoint}",
                params={"limit": ctx.config.limit}
            )
            data = response.json()
            items = data.get("items", [])

            return APIQueryData(
                id=f"{ctx.config.endpoint}:{ctx.config.limit}",
                endpoint=ctx.config.endpoint,
                results=items,
                count=len(items),
            )
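The f-string in read() only produces a valid URL when endpoint starts with a slash. One way to make that explicit is to normalize the path before the request. The sketch below uses only the standard library; build_query_url and BASE_URL are hypothetical names, and api.example.com is the placeholder host from the example above.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.example.com"  # placeholder host from the tutorial


def build_query_url(endpoint: str, limit: int = 10) -> str:
    """Hypothetical helper: join base URL, endpoint, and query string."""
    path = "/" + endpoint.lstrip("/")  # tolerate "items" and "/items"
    return f"{BASE_URL}{path}?{urlencode({'limit': limit})}"


print(build_query_url("items"))       # https://api.example.com/items?limit=10
print(build_query_url("/items", 25))  # https://api.example.com/items?limit=25
```

With httpx you could apply the same normalization to the path alone and keep passing limit through params.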

Best Practices

  1. Generate Stable IDs - Use deterministic ID generation so repeated queries return the same ID

    id = f"{config.param1}:{config.param2}"
    

  2. Handle Missing Data - Return empty values instead of raising errors

    if not found:
        return Data(id=id, results=[], count=0)
    

  3. Make Reads Idempotent - Multiple reads should return the same result

    # Good: the same query returns the same result while the remote data is unchanged
    async def read(self, ctx):
        return query_api(ctx.config.endpoint)  # query_api: placeholder for your API call
    

  4. Use Computed Outputs - All outputs should be computed=True

    "result": a_str(computed=True, description="Query result")
    

  5. Add Error Handling - Handle API failures gracefully

    try:
        result = await api.query()
    except APIError:
        return Data(id=id, results=[], error="API unavailable")
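Practices 1 and 3 can be combined in a small helper. This is a sketch, not a pyvider API: stable_id is a hypothetical function that hashes the query parameters in a fixed key order, so the same inputs give the same ID regardless of the order in which they were supplied.

```python
import hashlib
import json


def stable_id(**params) -> str:
    """Hypothetical helper: derive a deterministic ID from query inputs."""
    # sort_keys fixes the key order; separators removes whitespace variance.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


first = stable_id(endpoint="/items", limit=10)
second = stable_id(limit=10, endpoint="/items")
print(first == second)  # True: argument order does not matter
```

A hash is useful when the inputs contain characters that would make a joined string ambiguous. For two or three simple values, the f-string form shown in practice 1 is easier to read.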
    


What You've Learned

Congratulations! You've built your first pyvider data source. You now understand:

✅ Data Sources vs Resources - Read-only queries vs managed infrastructure
✅ Simple Read Pattern - One method that returns data
✅ Input/Output Separation - Config for inputs, Data for outputs
✅ Deterministic IDs - Stable identification for query results
✅ Error Handling - Graceful handling of missing data


Next Steps

Now that you understand data sources, explore:


Troubleshooting

Q: My data source isn't being registered

Make sure you're using register_data_source() as a decorator and importing the module.
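Decorator-based registration usually happens as a side effect of importing the module that contains the class. The sketch below is a generic illustration of that pattern, not pyvider's implementation; REGISTRY and register are hypothetical names. It shows why a class whose module is never imported would not appear in a registry.

```python
REGISTRY: dict[str, type] = {}


def register(name: str):
    """Hypothetical decorator: record the class under `name`."""
    def decorator(cls: type) -> type:
        REGISTRY[name] = cls  # runs when the class statement executes
        return cls
    return decorator


print("file_info" in REGISTRY)  # False: nothing has been defined yet


@register("file_info")
class FileInfo:
    pass


print("file_info" in REGISTRY)  # True: the decorator ran at definition time
```

If pyvider follows this pattern, something on the provider's startup path has to import my_provider.data_sources.file_info. Check the pyvider documentation for how it discovers components.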

Q: Terraform says "computed values can't be configured"

All outputs in data sources must be computed=True. Inputs should not be computed.

Q: Data isn't refreshing

Data sources are refreshed on every terraform plan. Make sure your read() method is actually querying fresh data.

Q: How do I handle errors?

Return data with error fields instead of raising exceptions:

@attrs.define
class QueryData:
    id: str
    results: list[str]
    error: str | None = None  # Add error field

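async def read(self, ctx):
    # Any stable value derived from the inputs works; endpoint is illustrative
    query_id = ctx.config.endpoint
    try:
        results = await query()  # query: placeholder for your API call
        return QueryData(id=query_id, results=results, error=None)
    except Exception as e:
        return QueryData(id=query_id, results=[], error=str(e))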
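To try the pattern in isolation, the sketch below replaces the API call with a hypothetical coroutine, flaky_query, and uses the standard library's dataclasses in place of attrs so it runs without extra dependencies. None of these names come from pyvider, and read is a free function here, not a method.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class QueryData:
    id: str
    results: list = field(default_factory=list)
    error: Optional[str] = None


async def flaky_query(fail: bool) -> list:
    """Hypothetical stand-in for a remote call."""
    if fail:
        raise ConnectionError("API unavailable")
    return ["a", "b"]


async def read(query_id: str, fail: bool) -> QueryData:
    try:
        return QueryData(id=query_id, results=await flaky_query(fail))
    except ConnectionError as exc:
        # Report the failure in the data instead of raising.
        return QueryData(id=query_id, error=str(exc))


ok = asyncio.run(read("q1", fail=False))
failed = asyncio.run(read("q1", fail=True))
print(ok.results, ok.error)          # ['a', 'b'] None
print(failed.results, failed.error)  # [] API unavailable
```

The sketch catches ConnectionError, which is narrower than the Exception used above. With this pattern Terraform continues with empty results, so any configuration that consumes the data source has to check the error attribute itself.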

For more help, see the Troubleshooting Guide.