==============================
Framework Overview
==============================

This framework provides a modular and reproducible environment for managing experiments.
It introduces three core concepts, **Component**, **Workflow**, and **Pipeline**, each representing a different layer of abstraction and flexibility in the experimental process.

These concepts allow users to design, execute, and reproduce complex computational or machine learning experiments with a clear separation of concerns between logic, configuration, and execution history.

.. contents::
   :local:
   :depth: 2

------------------------------
1. Component
------------------------------

A **Component** is the smallest reusable unit in the framework.
It represents a **functional block** (such as a data loader, model, optimizer, or evaluator) that performs a single, well-defined task.

A component can be:

- A **Python function** or **class** encapsulated in a file.
- A **reference** to a shared resource, library, or custom code base.
- A **configurable unit**, whose behavior depends on arguments specified at runtime.

**Key Ideas**

- **Reusability:** Components can be reused across different workflows and pipelines.
- **Configurability:** Each component can have a flexible configuration (``args``) dictionary.
- **Isolation:** Components do not depend directly on other components; they only define *what* they do, not *when* they run.

**Example**

.. code-block:: json

   {
     "loc": "components.data.load_dataset",
     "args": {
       "path": "./data/train.csv",
       "batch_size": 32
     }
   }

In this example, the component loads a dataset and exposes it to downstream steps.
The ``"loc"`` field points to the function to execute, and ``"args"`` defines how it behaves.

This separation of *location* and *arguments* allows the same logical component to be reused in many workflows with different parameters, which increases modularity.

------------------------------
2. Workflow
------------------------------

A **Workflow** defines **the structure and order** in which components are executed.
While a component is a single building block, a workflow describes *how components connect*.

Think of a workflow as a **template or blueprint** for a process.
It defines *what happens* and *in what sequence* but does not fix the data or model parameters.

**Key Ideas**

- **Declarative Composition:** A workflow lists components and their dependencies.
- **Dynamic Construction:** Workflows can be loaded from JSON, YAML, or Python definitions.
- **Reproducibility:** The same workflow can be reused across multiple runs, ensuring consistency.

**Example**

.. code-block:: json

   {
     "workflow": {
       "loc": "components.training.supervised_training",
       "template": ["load_data", "build_model", "train", "evaluate"]
     },
     "args": {
       "load_data": {"loc": "components.data.load_dataset", "args": {"path": "data/train.csv"}},
       "build_model": {"loc": "components.model.create_cnn", "args": {"num_layers": 5}},
       "train": {"loc": "components.training.train_epoch", "args": {"epochs": 10}},
       "evaluate": {"loc": "components.eval.compute_accuracy", "args": {}}
     }
   }

Here, the workflow specifies **the topology** of execution (the "template") and how each step is realized via components.
Each step can be substituted or reconfigured without breaking the structure.

This makes workflows **flexible, composable, and shareable** across projects.

------------------------------
3. Pipeline
------------------------------

A **Pipeline** is a **runtime instantiation** of a workflow with concrete settings, logs, and status tracking.
While a workflow defines *what should happen*, a pipeline defines *what actually happened*.

It binds together:

- The workflow definition.
- The specific component configurations.
- Metadata about the environment, logs, and execution history.

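
As a minimal sketch of how these three pieces could be bound together at run time (not the framework's actual implementation), the example below runs the template steps from the workflow example in order and returns a record holding the workflow definition, the component configurations, and run metadata.
The ``run_pipeline`` function, the ``REGISTRY`` dictionary used in place of real ``loc`` resolution, the shared ``ctx`` dictionary, and the stand-in component functions are all assumptions made for illustration.

.. code-block:: python

   from datetime import datetime, timezone


   # Hypothetical stand-ins for the functions the "loc" strings point to.
   def load_dataset(ctx, path):
       ctx["data"] = f"rows from {path}"


   def create_cnn(ctx, num_layers):
       ctx["model"] = f"cnn-{num_layers}"


   def train_epoch(ctx, epochs):
       ctx["trained_epochs"] = epochs


   def compute_accuracy(ctx):
       ctx["evaluated"] = True


   # Assumption: a plain dict replaces whatever mechanism resolves "loc".
   REGISTRY = {
       "components.data.load_dataset": load_dataset,
       "components.model.create_cnn": create_cnn,
       "components.training.train_epoch": train_epoch,
       "components.eval.compute_accuracy": compute_accuracy,
   }


   def run_pipeline(pplid, config, registry=REGISTRY):
       """Run each template step in order and return (record, ctx)."""
       ctx = {}  # assumed shared state passed from step to step
       record = {
           "pplid": pplid,
           "workflow": config["workflow"],
           "args": config["args"],
           "metadata": {
               "started_at": datetime.now(timezone.utc).isoformat(),
               "status": "running",
               "log": [],
           },
       }
       for step in config["workflow"]["template"]:
           spec = config["args"][step]
           registry[spec["loc"]](ctx, **spec["args"])
           record["metadata"]["log"].append(step)
       record["metadata"]["status"] = "finished"
       return record, ctx


   config = {
       "workflow": {
           "loc": "components.training.supervised_training",
           "template": ["load_data", "build_model", "train", "evaluate"],
       },
       "args": {
           "load_data": {"loc": "components.data.load_dataset", "args": {"path": "data/train.csv"}},
           "build_model": {"loc": "components.model.create_cnn", "args": {"num_layers": 5}},
           "train": {"loc": "components.training.train_epoch", "args": {"epochs": 10}},
           "evaluate": {"loc": "components.eval.compute_accuracy", "args": {}},
       },
   }

   record, ctx = run_pipeline("demo-001", config)

Changing only the ``args`` entries and calling ``run_pipeline`` again with a new identifier would yield a second record for the same workflow, which is the sense in which one workflow can back many pipelines.
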

Each pipeline has a unique identifier (``pplid``) and is tracked in a database (``ppls.db``), which stores:

- Pipeline metadata (hashes, creation time, status).
- Relationships between pipelines (via the ``edges`` table).
- Active runs (``runnings`` table).

**Pipeline Lifecycle**

1. **Creation:** A pipeline is created using ``PipeLine(pplid=...)``, loading its configuration and workflow.
2. **Preparation:** It sets up its directories, loads components, and initializes resources.
3. **Execution:** The pipeline runs through its workflow components in order (or dynamically).
4. **Status Tracking:** Each run is recorded in the database, making pipelines reproducible and auditable.
5. **Archival / Transfer:** Finished pipelines can be archived, deleted, or transferred between environments, preserving their full state.

**Flexibility**

- You can create many pipelines from a single workflow with different component arguments.
- You can rerun or resume a pipeline at any stage.
- Pipelines can be programmatically filtered, grouped, and compared using utilities like:

  - :func:`experiment.get_ppl_status`
  - :func:`experiment.filter_ppls`
  - :func:`experiment.group_by_common_columns`

------------------------------
4. Hierarchical View
------------------------------

+-------------+-----------------------------------+-------------------------------------------+
| Level       | Represents                        | Purpose                                   |
+=============+===================================+===========================================+
| Component   | A single reusable operation       | Define atomic behavior (e.g., load,       |
|             |                                   | preprocess, train, evaluate).             |
+-------------+-----------------------------------+-------------------------------------------+
| Workflow    | A structured composition of       | Define *how* components connect.          |
|             | components                        | Manage process logic and dependencies.    |
+-------------+-----------------------------------+-------------------------------------------+
| Pipeline    | A concrete, executable instance   | Execute and track a specific run.         |
|             | of a workflow                     | Store results and ensure reproducibility. |
+-------------+-----------------------------------+-------------------------------------------+

Together, these layers create a **flexible, declarative, and traceable experimental system** that supports both **research iteration** and **production reproducibility**.

------------------------------
5. Design Philosophy
------------------------------

- **Modularity:** Each layer is independent and composable.
- **Transparency:** All configurations and runs are logged and queryable.
- **Reproducibility:** Every experiment can be reloaded, re-executed, or audited.
- **Portability:** Pipelines and workflows can be transferred between machines or environments.
- **Scalability:** Supports many experiments with shared or divergent configurations.

This design makes it easy to iterate on ideas quickly while preserving the integrity and traceability of experimental data, which is crucial for scientific and ML workflows alike.
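
The transparency and reproducibility goals can be illustrated with a small, hedged sketch: configurations are hashed from a canonical JSON form so that identical settings always map to the same hash, and pipeline rows are stored in SQLite so that they can be queried by status.
The ``config_hash`` helper and the ``pipelines`` table with its columns are hypothetical; the actual ``ppls.db`` schema and hashing scheme are not described here and may differ.

.. code-block:: python

   import hashlib
   import json
   import sqlite3


   def config_hash(config):
       """Hash a configuration via canonical JSON so key order is irrelevant."""
       canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
       return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


   base = {"loc": "components.training.train_epoch", "args": {"epochs": 10}}
   variant = {"loc": "components.training.train_epoch", "args": {"epochs": 20}}
   reordered = {"args": {"epochs": 10}, "loc": "components.training.train_epoch"}

   # Hypothetical table layout, kept in memory for the example.
   conn = sqlite3.connect(":memory:")
   conn.execute(
       "CREATE TABLE pipelines (pplid TEXT PRIMARY KEY, config_hash TEXT, status TEXT)"
   )
   conn.executemany(
       "INSERT INTO pipelines VALUES (?, ?, ?)",
       [
           ("ppl-a", config_hash(base), "finished"),
           ("ppl-b", config_hash(variant), "running"),
       ],
   )

   # Query pipelines by status, as a stand-in for a filtering utility.
   finished = [
       row[0]
       for row in conn.execute(
           "SELECT pplid FROM pipelines WHERE status = ? ORDER BY pplid",
           ("finished",),
       )
   ]
   same_config = config_hash(base) == config_hash(reordered)

Sorting keys before hashing is one simple way to make the hash depend only on the configuration's content, so two pipelines with the same settings can be recognized as equivalent regardless of how their JSON was written.
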