Writing Models to and from YAML Files#

In all the other tutorials, we write the models from scratch in Python. This is useful when interactively testing new models. However, there comes a point where we want to test multiple models quickly and writing separate Python files for all of them becomes cumbersome. For those use cases, simpple allows you to specify your entire model in a YAML file.

Simulated Data#

We will use a simple linear regression dataset, the same as in Fitting a Line to Data. We will not perform any fit but the data will be required as an argument to the log-likelihood.

import numpy as np

rng = np.random.default_rng(123)

x = np.sort(10 * rng.random(100))
m_true = 1.338
b_true = -0.45
truths = {"m": m_true, "b": b_true, "sigma": None}
y_true = m_true * x + b_true
yerr = 0.1 + 0.5 * rng.random(x.size)
y = y_true + 2 * yerr * rng.normal(size=x.size)

YAML specification#

YAML files have a data structure very close to Python dictionaries. They are mostly used to map strings to other strings or to numbers. The Model and Distribution classes from simpple, as well as its load module, were written to interface efficiently with YAML files.
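
As a small illustration of that dictionary-like structure, the snippet below parses a single distribution entry, written inline as a string, with PyYAML's `yaml.safe_load`. The entry mirrors the format used later in this tutorial, and the variable names are only for illustration. It also shows a PyYAML detail worth knowing: `1.0e-5` is read as a float, whereas `1e-5` (no decimal point) is read as a string.

```python
import yaml

# Inline YAML text mirroring one entry of the files used in this tutorial
text = """
sigma:
  dist: LogUniform
  args:
    low: 1.0e-5
    high: 100
"""

spec = yaml.safe_load(text)
print(spec)

# PyYAML needs the decimal point to recognize scientific notation as a float
low_type = type(spec["sigma"]["args"]["low"])
no_dot_type = type(yaml.safe_load("1e-5"))
print(low_type, no_dot_type)
```

The result is a plain nested dictionary, which is what makes YAML a convenient storage format for parameter specifications.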

In this tutorial, we will use two different YAML files. They can both be found on the simpple GitHub page.

The simplest of the two is line.yaml. Let us see what it contains:

import yaml

line_path = "./examples/line.yaml"
with open(line_path) as f:
    print(f.read())

class: ForwardModel
kwargs:
  log_likelihood: log_likelihood
  forward: linear_model
parameters:
  m:
    dist: Uniform
    args: [-10, 10]
  b:
    dist: ScipyDistribution
    args:
      - uniform
      - -10
      - 20
  sigma:
    dist: LogUniform
    args:
      low: 1.0e-5
      high: 100

This shows the main components of the YAML spec:

class: the Model class.
kwargs: keyword arguments for model initialization. This will only work for models that accept keyword arguments. log_likelihood and forward are treated as special keyword arguments that refer to functions and we try to resolve them with simpple.load.resolve. This means that if it can be accessed from your Python code, simpple should be able to find it.
parameters: a mapping from parameter names to YAML distribution specs.
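
To make the idea of a "distribution spec" more concrete, here is a minimal sketch of how a `dist`/`args` mapping could be turned into an object. It is only an assumption about the general idea, not simpple's actual parsing code: the `Uniform` dataclass is a stand-in, and `REGISTRY` and `build_distribution` are hypothetical names introduced for this example. It handles `args` given either as a list (positional) or as a mapping (keywords), the two forms that appear in line.yaml.

```python
from dataclasses import dataclass


@dataclass
class Uniform:
    """Stand-in distribution class for illustration (not simpple's Uniform)."""

    low: float
    high: float


# Hypothetical registry mapping the `dist` string to a class
REGISTRY = {"Uniform": Uniform}


def build_distribution(spec: dict):
    """Build an object from a mapping with 'dist', 'args' and optional 'kwargs'."""
    cls = REGISTRY[spec["dist"]]
    args = spec.get("args", [])
    kwargs = dict(spec.get("kwargs", {}))
    if isinstance(args, dict):
        # A mapping under `args` is passed as keyword arguments
        kwargs.update(args)
        args = []
    return cls(*args, **kwargs)


parameters_spec = {
    "m": {"dist": "Uniform", "args": [-10, 10]},
    "b": {"dist": "Uniform", "args": {"low": -10, "high": 10}},
}
parameters = {name: build_distribution(s) for name, s in parameters_spec.items()}
print(parameters)
```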

Reading and writing parameters#

In many cases, what we actually want to store is the prior distribution on parameters. For example, maybe the data was already imported with Numpy, our forward model function is already implemented, and we want to test various prior distributions. In this instance, we can simply load the parameters dictionary and pass it to a forward model as we normally would. The load_parameters function from simpple.load enables us to do just that. As you can see below, it converts the content of the YAML file to a regular parameter dictionary.

from simpple.load import load_parameters, write_parameters

parameters_from_yaml = load_parameters("./examples/line.yaml")
parameters_from_yaml

{'m': Uniform(low=-10, high=10),
 'b': ScipyDistribution(uniform(-10, 20)),
 'sigma': LogUniform(low=1e-05, high=100)}

Conversely, we can write the parameters back to a YAML file with write_parameters().

from pathlib import Path

results_dir = Path("./results")
results_dir.mkdir(exist_ok=True)
write_parameters(
    results_dir / "line_parameters_write.yaml", parameters_from_yaml, overwrite=True
)

# Open the parameters we just wrote to see the output file
with open(results_dir / "line_parameters_write.yaml") as f:
    print(f.read())

b:
  args:
  - uniform
  - -10
  - 20
  dist: ScipyDistribution
  kwargs: {}
m:
  args:
  - -10
  - 10
  dist: Uniform
  kwargs: {}
sigma:
  args:
  - 1.0e-05
  - 100
  dist: LogUniform
  kwargs: {}
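
The written file lists the parameters alphabetically rather than in their original order. This is consistent with the default behaviour of PyYAML's `yaml.dump`, which sorts mapping keys unless `sort_keys=False` is passed. Whether `write_parameters()` exposes such an option is not covered here; the sketch below only demonstrates the PyYAML behaviour on a plain dictionary with made-up entries.

```python
import yaml

# Plain dictionary with "m" deliberately placed before "b"
spec = {
    "m": {"dist": "Uniform", "args": [-10, 10]},
    "b": {"dist": "Uniform", "args": [0, 1]},
}

sorted_text = yaml.dump(spec)  # default: keys sorted alphabetically
ordered_text = yaml.dump(spec, sort_keys=False)  # keeps insertion order

print(sorted_text)
print(ordered_text)
```

Either way, loading the text back gives the same dictionary, so the ordering only affects how the file reads.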

Reading and writing a `ForwardModel`#

The line.yaml file discussed above contains an entire model specification, not only the parameters, so we can build the whole model from it. Since the log_likelihood and forward arguments are function names, simpple expects the corresponding functions to be available when creating the model. We will define them directly here, but they could also have been imported from a module. If a name in the YAML file contains a dot, simpple will try to import it; otherwise, it will search for it in the current global namespace. This means that as long as the function can be used or imported in your script, simpple should be able to find it.

from simpple.model import ForwardModel

def linear_model(p, x):
    return p["m"] * x + p["b"]

def log_likelihood(p, x, y, yerr):
    ymod = linear_model(p, x)
    var = yerr**2 + p["sigma"] ** 2
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (y - ymod) ** 2 / var)

model = ForwardModel.from_yaml("./examples/line.yaml")

print("Model:", model)
print("Log-likelihood:", model.log_likelihood([1, 2, 3], x, y, yerr))

Model: ForwardModel(parameters={'m': Uniform(low=-10, high=10), 'b': ScipyDistribution(uniform(-10, 20)), 'sigma': LogUniform(low=1e-05, high=100)}, log_likelihood=log_likelihood, forward=linear_model)
Log-likelihood: -214.43765489106804

And that’s it! We now have a model that is completely equivalent to the one built in the line-fitting tutorial.
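
The name lookup described above (import the name when it contains a dot, otherwise search the current namespace) can be sketched with the standard library alone. The helper `resolve_name` below is hypothetical and written only for illustration; it is not the code of `simpple.load.resolve`, and it takes the namespace as an explicit dictionary to keep the example self-contained.

```python
import importlib


def resolve_name(name: str, namespace: dict):
    """Hypothetical helper: return the object a name refers to."""
    if "." in name:
        # Dotted name: import the module part, then fetch the attribute
        module_name, attr = name.rsplit(".", 1)
        module = importlib.import_module(module_name)
        return getattr(module, attr)
    # Plain name: look it up in the namespace supplied by the caller
    return namespace[name]


def linear_model(p, x):
    return p["m"] * x + p["b"]


# In a script, globals() would play the role of this dictionary
namespace = {"linear_model": linear_model}

sqrt = resolve_name("math.sqrt", namespace)
forward = resolve_name("linear_model", namespace)
print(sqrt(9.0), forward({"m": 2.0, "b": 1.0}, 3.0))
```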

Another option is to specify the arguments and keyword arguments by passing them to the from_yaml() method. This is required if they are not in the YAML file, but expected by the model’s initialization. If they are both in the YAML file and passed to from_yaml(), the ones passed to from_yaml() take precedence.

For example, to override the log-likelihood from the YAML file with one that always returns 0.0, we could do this:

model_lambda = ForwardModel.from_yaml(
    "./examples/line.yaml", forward=linear_model, log_likelihood=lambda p: 0.0
)
model_lambda.log_likelihood([1, 2, 3])

0.0
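
The precedence rule (values passed to from_yaml() win over values in the YAML file) behaves like merging two dictionaries where the later one overrides the earlier one. The sketch below shows that rule on plain dictionaries; `merge_kwargs` is a hypothetical helper for illustration and does not claim to be how simpple implements it.

```python
def merge_kwargs(yaml_kwargs: dict, passed_kwargs: dict) -> dict:
    """Hypothetical helper: passed values override values from the YAML file."""
    # Later entries win when the same key appears in both dictionaries
    return {**yaml_kwargs, **passed_kwargs}


yaml_kwargs = {"log_likelihood": "log_likelihood", "forward": "linear_model"}
passed_kwargs = {"log_likelihood": "zero_log_likelihood"}

merged = merge_kwargs(yaml_kwargs, passed_kwargs)
print(merged)
```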

To save the model back to YAML, simply call its to_yaml() method.

model.to_yaml(results_dir / "save_line.yaml", overwrite=True)

# Open saved model to view the output
with open(results_dir / "save_line.yaml") as f:
    print(f.read())

class: ForwardModel
kwargs:
  forward: linear_model
  log_likelihood: log_likelihood
parameters:
  b:
    args:
    - uniform
    - -10
    - 20
    dist: ScipyDistribution
    kwargs: {}
  m:
    args:
    - -10
    - 10
    dist: Uniform
    kwargs: {}
  sigma:
    args:
    - 1.0e-05
    - 100
    dist: LogUniform
    kwargs: {}

Working with custom models#

In the previous section, we were working with a built-in simpple class, the ForwardModel. Any custom model class (see Writing Model Classes for more on this) can be read from YAML.

Using the default YAML functions#

Since these custom models subclass ForwardModel or Model, they inherit their from_yaml() and to_yaml() methods. We can use our implementation below with a YAML specification.

import simpple.distributions as sdist


class PolyModel(ForwardModel):
    def __init__(self, parameters: dict[str, sdist.Distribution], order: int):
        super().__init__(parameters)
        self.order = order
        for i in range(self.order + 1):
            k = "a" + str(i)
            if k not in self.parameters:
                raise KeyError(
                    f"Parameters should have keys from a0 to a{self.order} for polynomial of order {self.order}. Key {k} not found."
                )

    def _forward(self, p, x):
        parr = np.array([p[f"a{i}"] for i in range(self.order + 1)])
        return np.vander(x, self.order + 1, increasing=True) @ parr

    def _log_likelihood(self, p, x, y, yerr):
        ymod = self.forward(p, x)
        var = yerr**2 + p["sigma"] ** 2
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (y - ymod) ** 2 / var)
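
The `_forward` method above relies on `np.vander` with `increasing=True`, which builds a matrix whose columns are successive powers of x starting from x**0. The short check below, using a small made-up x array and the true line coefficients from the simulated data, confirms that the matrix product matches the explicit expression a0 + a1 * x for a polynomial of order 1.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])
order = 1
coeffs = np.array([-0.45, 1.338])  # [a0, a1]

# Columns are x**0, x**1, ..., x**order when increasing=True
design = np.vander(x, order + 1, increasing=True)
y_matrix = design @ coeffs
y_direct = coeffs[0] + coeffs[1] * x

print(design)
print(np.allclose(y_matrix, y_direct))
```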

Here PolyModel already defines its forward and log-likelihood functions. Therefore the only thing needed as an argument in the YAML file is the order of the polynomial.

with open("./examples/line_poly.yaml") as f:
    print(f.read())

class: PolyModel
kwargs:
  order: 1
parameters:
  a1:
    dist: Uniform
    args: [-10, 10]
  a0:
    dist: ScipyDistribution
    args:
      - uniform
      - -10
      - 20
  sigma:
    dist: LogUniform
    args:
      low: 1.0e-5
      high: 100

Let’s load the model…

poly_model = PolyModel.from_yaml("./examples/line_poly.yaml")

And save it back to YAML.

poly_model.to_yaml(results_dir / "save_line_poly.yaml", overwrite=True)
with open(results_dir / "save_line_poly.yaml") as f:
    print(f.read())

args:
- 1
class: PolyModel
parameters:
  a0:
    args:
    - uniform
    - -10
    - 20
    dist: ScipyDistribution
    kwargs: {}
  a1:
    args:
    - -10
    - 10
    dist: Uniform
    kwargs: {}
  sigma:
    args:
    - 1.0e-05
    - 100
    dist: LogUniform
    kwargs: {}

Using custom YAML functions#

There are many cases where it can be useful to add custom functionality when reading from or writing to YAML files. We can easily do so by implementing our own from_yaml() and to_yaml() methods.

For example, let us say our polynomial model also stores the data as attributes. It is then convenient to specify the data directly as arrays when creating our model in a notebook or a script. However, when storing the models in a YAML file, we will need to read and write the data in some other way. One simple option is to store the arrays in a text file and load them from that text file.

Note that this is a simple implementation for demonstration purposes, but you could customize the two methods in any way you see fit for your own use case!

import simpple.distributions as sdist
from simpple.load import unparse_parameters, parse_parameters


class PolyModelData(ForwardModel):
    def __init__(
        self,
        parameters: dict[str, sdist.Distribution],
        order: int,
        x: np.ndarray,
        y: np.ndarray,
        yerr: np.ndarray,
    ):
        super().__init__(parameters)
        # Assign the data as attributes
        self.x = x
        self.y = y
        self.yerr = yerr
        # Set the order and the parameters as we did before
        self.order = order
        for i in range(self.order + 1):
            k = "a" + str(i)
            if k not in self.parameters:
                raise KeyError(
                    f"Parameters should have keys from a0 to a{self.order} for polynomial of order {self.order}. Key {k} not found."
                )

    @classmethod
    def from_yaml(cls, path: Path | str, data_file: Path | None = None):
        with open(path) as f:
            mdict = yaml.safe_load(f)
        parameters = parse_parameters(mdict["parameters"])
        # Find a pointer to the data file in the YAML spec
        data_file = data_file or mdict["data_file"]
        # Read the data
        x, y, yerr = np.loadtxt(data_file, delimiter=",").T
        # Initialize the class as usual
        model = cls(parameters, mdict["order"], x, y, yerr)
        # Track the data file for future reference and for to_yaml()
        model.data_file = data_file
        return model

    def to_yaml(self, path: Path | str, overwrite: bool = False):
        path = Path(path)

        model_dict = {}
        model_dict["class"] = self.__class__.__name__
        model_dict["parameters"] = unparse_parameters(self.parameters)
        model_dict["order"] = self.order

        if hasattr(self, "data_file"):
            # If there is a data file argument, don't re-save the data, just point to the file from which it was loaded
            model_dict["data_file"] = str(self.data_file)
        else:
            # If no data file is tracked yet, save the data in a new CSV file
            data_file = path.parent / (path.stem + "_data.csv")
            model_dict["data_file"] = str(data_file)
            if data_file.exists() and not overwrite:
                raise FileExistsError(
                    f"The data file {data_file} already exists. Use overwrite=True to overwrite it."
                )
            # Save the data stored on the model, not variables from the global scope
            np.savetxt(
                data_file, np.array([self.x, self.y, self.yerr]).T, delimiter=","
            )

        if path.exists() and not overwrite:
            raise FileExistsError(
                f"The file {path} already exists. Use overwrite=True to overwrite it."
            )
        with open(path, mode="w") as f:
            yaml.dump(model_dict, f)

    def _forward(self, p, x):
        parr = np.array([p[f"a{i}"] for i in range(self.order + 1)])
        return np.vander(x, self.order + 1, increasing=True) @ parr

    def _log_likelihood(self, p, x, y, yerr):
        ymod = self.forward(p, x)
        var = yerr**2 + p["sigma"] ** 2
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (y - ymod) ** 2 / var)
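
The data handling in from_yaml() and to_yaml() above can be checked in isolation. The sketch below uses small made-up arrays and a temporary directory: it derives the CSV name from the YAML path with the same rule as to_yaml(), writes the three arrays as columns, and reads them back with the same `np.loadtxt(...).T` unpacking as from_yaml().

```python
import tempfile
from pathlib import Path

import numpy as np

x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 3.0, 5.0])
yerr = np.array([0.1, 0.2, 0.3])

with tempfile.TemporaryDirectory() as tmp:
    yaml_path = Path(tmp) / "save_poly_data.yaml"
    # Same naming rule as in to_yaml(): YAML stem plus "_data.csv"
    data_file = yaml_path.parent / (yaml_path.stem + "_data.csv")
    # One row per data point; the columns are x, y, yerr
    np.savetxt(data_file, np.array([x, y, yerr]).T, delimiter=",")
    # Transposing on load gives back one array per column
    x2, y2, yerr2 = np.loadtxt(data_file, delimiter=",").T
    data_name = data_file.name

print(data_name)
print(np.allclose(x, x2), np.allclose(y, y2), np.allclose(yerr, yerr2))
```

The naming rule also explains the doubled suffix in the file name: the YAML stem already ends in "_data", and "_data.csv" is appended to it.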

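This new model class cannot load directly from our old line_poly.yaml: its from_yaml() expects top-level order and data_file entries, whereas that file stores the order under kwargs and has no data file. Instead, we load only the parameters and initialize the model manually.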

test_p = load_parameters("./examples/line_poly.yaml")
poly_model_data = PolyModelData(test_p, 1, x, y, yerr)

Now, let us save the model and look at the output directory and the YAML file.

poly_model_data.to_yaml(results_dir / "save_poly_data.yaml", overwrite=True)

print("Output directory contents")
print([f.name for f in results_dir.iterdir()])
with open(results_dir / "save_poly_data.yaml") as f:
    print(f.read())

Output directory contents
['save_poly_data_data.csv', 'save_line_poly.yaml', 'line_parameters_write.yaml', 'save_line.yaml', 'save_poly_data.yaml']
class: PolyModelData
data_file: results/save_poly_data_data.csv
order: 1
parameters:
  a0:
    args:
    - uniform
    - -10
    - 20
    dist: ScipyDistribution
    kwargs: {}
  a1:
    args:
    - -10
    - 10
    dist: Uniform
    kwargs: {}
  sigma:
    args:
    - 1.0e-05
    - 100
    dist: LogUniform
    kwargs: {}

The CSV file was saved and is included in the YAML spec as well. This means we can load our custom model from this new file!

poly_model_data_yaml = PolyModelData.from_yaml(results_dir / "save_poly_data.yaml")
print(poly_model_data_yaml)

PolyModelData(parameters={'a0': ScipyDistribution(uniform(-10, 20)), 'a1': Uniform(low=-10, high=10), 'sigma': LogUniform(low=1e-05, high=100)}, log_likelihood=_log_likelihood, forward=_forward)
