Analysis Example¶

This page shows how pycomex simplifies analysing the results of an experiment that has already been completed and archived.

Motivation¶

Ask yourself the following question: When conducting a computational experiment, where do you put the code for the analysis of the results?

We conduct computational experiments mainly to gather data, which we then process into visualizations or more advanced metrics. There are two main choices for where to put the code for this post-experiment analysis:

At the end of the experiment. We can put the code for the analysis at the end of the file which executes the experiment itself. This is a neat and direct solution, but just consider how often you had to fix a typo in the plot title, tweak the value range or compute just one more metric after looking at the results. Obviously, we can't just repeat the whole experiment every time.
In a separate file. Thus, we might find it more comfortable to put the analysis into a separate file from the beginning. But these kinds of files can become real time sinks, as they contain a large amount of boilerplate code. Just think of all the standard imports… And then there is the code to load all the data from the JSON files into which you previously saved it. You have to use different variable names and possibly even different data structures, since you don't have the same environment as in the experiment file.

pycomex combines both methods and gets rid of the disadvantages of either.

Templating analysis files into the results folder¶

The core idea is this: Whenever an experiment is completed, an analysis.py file is automatically generated and placed into the results folder as well. This file will already contain most of the boilerplate code to get started with analysing the results of that specific experiment run.

At the end of the page there is an example of such a generated file. Note how it imports from a module called snapshot. This is a copy of the experiment script as it was at the exact moment of execution for this run. It can be imported without triggering the main content of the experiment to be executed! Moreover, it automatically detects that it is just a snapshot which is being imported and populates the internal data storage from the saved JSON file! This means you can use the experiment object from the experiment file as if you were still at the end of the experiment.
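
The reload idea can be illustrated with a small toy class. This is only a sketch of the concept, not pycomex's implementation: `ToyExperiment`, its `is_snapshot` flag and its `save` method are hypothetical names made up for this illustration. Only the file name `experiment_data.json` and the slash-separated keys are taken from the example at the end of the page.

```python
import json
import os
import tempfile


class ToyExperiment:
    """Toy stand-in for an experiment object (NOT pycomex's real class)."""

    # File name as it appears in the generated analysis.py shown on this page.
    DATA_FILE = "experiment_data.json"

    def __init__(self, path: str, is_snapshot: bool = False):
        self.path = path
        self.data = {}
        if is_snapshot:
            # Snapshot mode: do not run anything, reload what a past run saved.
            with open(os.path.join(path, self.DATA_FILE)) as file:
                self.data = json.load(file)

    def __setitem__(self, key: str, value):
        # "metrics/length/0" is stored as data["metrics"]["length"]["0"].
        *parents, last = key.split("/")
        node = self.data
        for name in parents:
            node = node.setdefault(name, {})
        node[last] = value

    def __getitem__(self, key: str):
        node = self.data
        for name in key.split("/"):
            node = node[name]
        return node

    def save(self):
        with open(os.path.join(self.path, self.DATA_FILE), "w") as file:
            json.dump(self.data, file)


# "Experiment run": record two values and write them to disk.
folder = tempfile.mkdtemp()
run = ToyExperiment(folder)
run["metrics/length/0"] = 7692
run["metrics/length/1"] = 7506
run.save()

# "Snapshot import": same folder, but the data now comes from the JSON file.
restored = ToyExperiment(folder, is_snapshot=True)
print(restored["metrics/length"])  # {'0': 7692, '1': 7506}
```

Because the restored object answers the same key lookups as the original one, analysis code written against it does not need to know whether it runs during the experiment or afterwards.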

Defining analysis code at the end of the experiment¶

It is also possible to define some analysis code at the end of the experiment itself. If you do that, you might want to wrap it with the Experiment.analysis context manager. The code within will be executed just fine.

Moreover, all of the code defined within this context manager will automatically be copied to the analysis.py file as well. Under some light restrictions (see below) that code will just work. So, if you need to fix some typo in the header of a plot, you literally just have to change that line in the analysis file and run it again to recreate all of your analysis results!
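
How the copying is done internally is not described on this page. One conceivable approach, sketched below under that caveat, is to parse the experiment script with Python's standard `ast` module, find the `with` statement whose context expression ends in `.analysis`, and cut the corresponding source lines out of the file. The helper name `extract_analysis_body` is hypothetical.

```python
import ast
import textwrap


def extract_analysis_body(source: str, attr_name: str = "analysis") -> str:
    """Return the dedented body of the first `with ... <obj>.<attr_name>:` block.

    Comments placed before the first statement of the block are not included,
    because the AST only records the positions of statements.
    """
    lines = source.splitlines()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.With):
            continue
        for item in node.items:
            ctx = item.context_expr
            if isinstance(ctx, ast.Attribute) and ctx.attr == attr_name:
                first = node.body[0].lineno       # 1-based, inclusive
                last = node.body[-1].end_lineno   # 1-based, inclusive
                return textwrap.dedent("\n".join(lines[first - 1:last]))
    return ""


demo_source = '''
with Skippable(), e.analysis:
    total = 1 + 2
    # comments between statements survive
    print(total)
'''

# The source is only parsed, never executed, so the undefined names are fine.
print(extract_analysis_body(demo_source))
```

The extracted text could then be pasted under the `if __name__ == '__main__':` guard of a template file, which matches the layout of the generated file shown at the end of this page.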

Example¶

The following example file shows how this works. Specifically note how you can define analysis code within the actual experiment script by wrapping it in the Experiment.analysis context manager. That code is then automatically copied into the analysis.py file and will work as is!

(As long as it only uses the experiment’s internally stored data or the upper-case experiment variables defined above the experiment context.)
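
To make the restriction more concrete, here is a rough sketch of how upper-case variables could be told apart from ordinary ones in a namespace dictionary such as the one passed via `glob=globals()` in the example. The exact rule pycomex applies is an assumption here, and `collect_parameters` is a hypothetical helper, not part of the library.

```python
def collect_parameters(namespace: dict) -> dict:
    """Return the upper-case, non-private entries of a namespace dictionary."""
    return {
        name: value
        for name, value in namespace.items()
        if name.isupper() and not name.startswith("_")
    }


# In a real script this dictionary would be globals(); a literal keeps the
# sketch independent of whatever else happens to be defined in the module.
namespace = {
    "NUM_WORDS": 1000,
    "REPETITIONS": 10,
    "sampled_words": ["not", "a", "parameter"],  # lower case: ignored
    "_CACHE": {},                                  # private: ignored
}

parameters = collect_parameters(namespace)
print(parameters)  # {'NUM_WORDS': 1000, 'REPETITIONS': 10}
```

Under this reading, a loop variable such as `sampled_words` would not be available in the analysis file, while `NUM_WORDS` and `REPETITIONS` would be.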

"""
This experiment will repeatedly create a text made of randomly sampled words.
The words are assembled into a text file, which is supposed to be saved as an
artifact of the computational experiment. Additionally, information such as the
total text length / run time of the calculations are to be saved as experiment
metadata.

This is the same experiment content, which is also featured in the "basic.py"
example.
"""
import tempfile
import random
import textwrap
import urllib.request

from pycomex.experiment import Experiment
from pycomex.util import Skippable


NUM_WORDS = 1000
REPETITIONS = 10

with Skippable(), (e := Experiment(base_path=tempfile.gettempdir(),
                                   namespace="example/analysis", glob=globals())):
    e.work = REPETITIONS

    response = urllib.request.urlopen("https://www.mit.edu/~ecprice/wordlist.10000")
    WORDS = response.read().decode("utf-8").splitlines()

    for i in range(e.parameters["REPETITIONS"]):
        sampled_words = random.sample(WORDS, k=NUM_WORDS)
        text = "\n".join(textwrap.wrap(" ".join(sampled_words), 80))
        e.commit_raw(f"{i:02d}_random.txt", text)

        text_length = len(text)
        e[f"metrics/length/{i}"] = text_length
        e.info(f"saved text file with {text_length} characters")

        e.update()

    # ~ post-experiment analysis
    # Suppose we want to conduct some sort of analysis on the results of the completed
    # experiment. In this case we want to find the texts with the min and max number
    # of characters. We also want to find out the average value for the
    # character count. We then store this information as an additional JSON artifact.


# All of the code defined within this "Experiment.analysis" context manager will be
# copied to the analysis.py template of the record folder of this experiment run and
# it will work as it is.
# NOTE: As long as the analysis code is only using experiment data or experiment
#       variables
with Skippable(), e.analysis:
    # (1) Note how the experiment path will be dynamically determined to be a *new*
    #     folder when actually executing the experiment, but it will refer to the
    #     already existing experiment record folder when imported from
    #     "snapshot.py"
    print(e.path)
    e.info('Starting analysis of experiment results')

    index_min, count_min = min(e['metrics/length'].items(),
                               key=lambda item: item[1])
    index_max, count_max = max(e['metrics/length'].items(),
                               key=lambda item: item[1])
    count_mean = sum(e['metrics/length'].values()) / len(e['metrics/length'])

    analysis_results = {
        'index_min': index_min,
        'count_min': count_min,
        'index_max': index_max,
        'count_max': count_max,
        'count_mean': count_mean
    }
    # (2) Committing new files to the already existing experiment record folder will
    #     also work as usual, whether executed here directly or later in "analysis.py"
    e.commit_json('analysis_results.json', analysis_results)

When executed, the above file produces the following output:

$ python ../pycomex/examples/analysis.py
2022-11-28 10:11:23,859 - ================================================================================
2022-11-28 10:11:23,860 - EXPERIMENT STARTED
2022-11-28 10:11:23,860 -     namespace:          example/analysis
2022-11-28 10:11:23,860 -     start time:         Monday, 28 Nov 2022  at 10:11
2022-11-28 10:11:23,860 -     archive path:       /tmp/example/analysis/001
2022-11-28 10:11:23,860 -     debug mode?         False
2022-11-28 10:11:23,860 - ================================================================================
2022-11-28 10:11:24,029 - saved text file with 7692 characters
2022-11-28 10:11:24,029 - (1/10) DONE - ETA: 2022-11-28 10:11:25.571942 (remaining time: 0.000h)
2022-11-28 10:11:24,032 - saved text file with 7506 characters
2022-11-28 10:11:24,032 - (2/10) DONE - ETA: 2022-11-28 10:11:24.729462 (remaining time: 0.000h)
2022-11-28 10:11:24,035 - saved text file with 7653 characters
2022-11-28 10:11:24,035 - (3/10) DONE - ETA: 2022-11-28 10:11:24.448535 (remaining time: 0.000h)
2022-11-28 10:11:24,038 - saved text file with 7676 characters
2022-11-28 10:11:24,038 - (4/10) DONE - ETA: 2022-11-28 10:11:24.308084 (remaining time: 0.000h)
2022-11-28 10:11:24,041 - saved text file with 7666 characters
2022-11-28 10:11:24,041 - (5/10) DONE - ETA: 2022-11-28 10:11:24.223737 (remaining time: 0.000h)
2022-11-28 10:11:24,043 - saved text file with 7560 characters
2022-11-28 10:11:24,043 - (6/10) DONE - ETA: 2022-11-28 10:11:24.167511 (remaining time: 0.000h)
2022-11-28 10:11:24,046 - saved text file with 7525 characters
2022-11-28 10:11:24,046 - (7/10) DONE - ETA: 2022-11-28 10:11:24.127358 (remaining time: 0.000h)
2022-11-28 10:11:24,049 - saved text file with 7580 characters
2022-11-28 10:11:24,049 - (8/10) DONE - ETA: 2022-11-28 10:11:24.097243 (remaining time: 0.000h)
2022-11-28 10:11:24,052 - saved text file with 7506 characters
2022-11-28 10:11:24,052 - (9/10) DONE - ETA: 2022-11-28 10:11:24.073829 (remaining time: 0.000h)
2022-11-28 10:11:24,054 - saved text file with 7534 characters
2022-11-28 10:11:24,055 - (10/10) DONE - ETA: 2022-11-28 10:11:24.055077 (remaining time: 0.000h)
2022-11-28 10:11:24,058 - ================================================================================
2022-11-28 10:11:24,058 - EXPERIMENT ENDED
2022-11-28 10:11:24,058 -     start time:         Monday, 28 Nov 2022  at 10:11
2022-11-28 10:11:24,058 -     end time:           Monday, 28 Nov 2022  at 10:11
2022-11-28 10:11:24,058 -     duration:           0.000 hrs
2022-11-28 10:11:24,058 -     error?              None
2022-11-28 10:11:24,058 - ================================================================================
/tmp/example/analysis/001
2022-11-28 10:11:24,059 - Starting analysis of experiment results
None None
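
As a cross-check, the analysis step can be reproduced by hand from the character counts in the log above. The sketch below assumes that the values are keyed by the loop index as a string, as the `f"metrics/length/{i}"` keys in the experiment script suggest. The printed values follow from the logged numbers and are not copied from an actual analysis_results.json file.

```python
# Character counts copied from the ten "saved text file" log lines above.
counts = [7692, 7506, 7653, 7676, 7666, 7560, 7525, 7580, 7506, 7534]

# Assumed layout of e["metrics/length"]: index (as string) -> character count.
lengths = {str(i): n for i, n in enumerate(counts)}

# Same expressions as in the analysis block of the experiment script.
index_min, count_min = min(lengths.items(), key=lambda item: item[1])
index_max, count_max = max(lengths.items(), key=lambda item: item[1])
count_mean = sum(lengths.values()) / len(lengths)

print(index_min, count_min)  # 1 7506 (first of the two runs tied at 7506)
print(index_max, count_max)  # 0 7692
print(count_mean)            # 7589.8
```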

The following file contents belong to the analysis.py file which was created by the above run of the experiment:

generated analysis.py¶

#! /usr/bin/env python3
import os
import json
import pathlib
from pprint import pprint
from typing import Dict, Any

# Useful imports for conducting analysis
import numpy as np
import matplotlib.pyplot as plt

# Importing the experiment
from snapshot import *

# List of experiment parameters
# - NUM_WORDS
# - REPETITIONS

PATH = pathlib.Path(__file__).parent.absolute()
DATA_PATH = os.path.join(PATH, 'experiment_data.json')
# Load the all raw data of the experiment
with open(DATA_PATH, mode='r') as json_file:
    DATA: Dict[str, Any] = json.load(json_file)


if __name__ == '__main__':
    print('RAW DATA KEYS:')
    pprint(list(DATA.keys()))

    # The analysis template from the experiment file
    # (1) Note how the experiment path will be dynamically determined to be a *new*
    #     folder when actually executing the experiment, but it will refer to the
    #     already existing experiment record folder when imported from
    #     "snapshot.py"
    print(e.path)
    e.info('Starting analysis of experiment results')
    
    index_min, count_min = min(e['metrics/length'].items(),
                               key=lambda item: item[1])
    index_max, count_max = max(e['metrics/length'].items(),
                               key=lambda item: item[1])
    count_mean = sum(e['metrics/length'].values()) / len(e['metrics/length'])
    
    analysis_results = {
        'index_min': index_min,
        'count_min': count_min,
        'index_max': index_max,
        'count_max': count_max,
        'count_mean': count_mean
    }
    # (2) Committing new files to the already existing experiment record folder will
    #     also work as usual, whether executed here directly or later in "analysis.py"
    e.commit_json('analysis_results.json', analysis_results)