Healthcare and Life Sciences Blog

Continuous Analysis: Enhancing Reproducibility in Scientific Research

Venkat_Malladi
Nov 06, 2024

In the ever-evolving field of computational analysis, ensuring reproducibility is a critical challenge. We introduce Continuous Analysis (CA), which offers a promising solution by extending the principles of Continuous Integration (CI) and Continuous Deployment (CD) to scientific research. This approach integrates DevOps, DataOps, and MLOps principles to manage data and analysis lifecycles, ensuring that results can be consistently replicated using shared data, code, and methods.

The Importance of Reproducibility 

Reproducibility is essential for validating scientific results and fostering collaboration. In computational science, complex workflows, software dependencies, and data sharing practices often hinder reproducibility. Studies across multiple fields have shown that a significant number of scientific publications lack reproducibility due to insufficient code and data sharing. 

Introducing Continuous Analysis 

Continuous Analysis addresses these challenges by incorporating version control, analysis orchestration, and feedback mechanisms (Figure 1). This ensures that every modification is recorded, allowing researchers to revert to previous versions if necessary. Automated workflows guarantee that analysis tasks are performed consistently, reducing human error and enhancing reliability. 
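To make this concrete, here is a minimal sketch of what recording a versioned analysis run might look like in practice. This is an illustration rather than the paper's implementation; the function names, manifest fields, and storage layout are assumptions.

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path

def current_commit() -> str:
    # Record the exact code version the analysis ran against.
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

def data_fingerprint(path: Path) -> str:
    # Hash the input data so results are tied to a specific data version.
    return hashlib.sha256(path.read_bytes()).hexdigest()

def record_run(data_path: Path, results: dict, manifest_dir: Path) -> Path:
    # Persist a manifest linking code version, data version, and results,
    # so any result can be traced back (or reverted) to the exact state
    # that produced it.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    manifest = {
        "timestamp": stamp,
        "code_version": current_commit(),
        "data_sha256": data_fingerprint(data_path),
        "results": results,
    }
    manifest_dir.mkdir(parents=True, exist_ok=True)
    out_path = manifest_dir / f"run-{stamp}.json"
    out_path.write_text(json.dumps(manifest, indent=2))
    return out_path
```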

Figure 1. Overview: Code and Data are the parts of the development that need to be versioned. Dependencies are tracked in code and represent external entities and relations. Continuous Integration, Continuous Deployment, and Continuous Monitoring together form the CI/CD/CM pipeline, which encapsulates the process of release creation. Automation is crucial in the CI/CD/CM pipeline, as it ensures extensibility and reproducibility and supports agile development. Feedback is a collection of artifacts and signals, including logs, input parameters and commands, outputs, telemetry, and captured system states, gathered at each stage of the CI/CD/CM pipeline. The feedback is analyzed and used to generate recommendations to be implemented in code (or data). Data is implicitly related to code (data schema, storage location, type of data, etc., need to be accounted for in design, implementation, and tests), but most of its impact is deferred until the code runs on real data rather than a test dataset. When real data is used and processed, monitoring for data-related signals and issues is essential, in addition to other functional metrics.
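As one hypothetical way to structure the feedback described above, each pipeline stage could emit a record capturing the listed signals. The field names below are illustrative, not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class FeedbackRecord:
    # One record per CI/CD/CM stage, mirroring the signals listed in Figure 1.
    stage: str                                       # e.g. "integration", "deployment", "monitoring"
    logs: list[str] = field(default_factory=list)    # stage log lines
    input_parameters: dict = field(default_factory=dict)
    commands: list[str] = field(default_factory=list)
    outputs: dict = field(default_factory=dict)
    telemetry: dict = field(default_factory=dict)
    system_state: dict = field(default_factory=dict)  # captured system state
```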

Continuous Analysis Workflow 

In managing scientific development, six main types of artifacts are essential: data, code, code dependencies, tests, deployment artifacts, and results. Each artifact type has a dedicated storage system linked to a larger graph, with code dependencies referencing external development artifacts and data artifacts managed via data management systems. Automation, triggered by changes in each repository, is key to capturing meaningful connections between artifacts. For example, analyzing a flow graph can help identify critical artifacts and their impact on outcomes, ensuring that changes are captured and covered by the automation system (Figure 2).
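The flow-graph analysis mentioned above can be sketched as a simple dependency graph: given a changed artifact, a traversal yields everything downstream that the automation system should re-run. The artifact wiring below is a hypothetical example, not the paper's actual graph.

```python
from collections import defaultdict, deque

class ArtifactGraph:
    """Directed graph over artifact types: an edge A -> B means B is
    derived from (and must be revalidated after) a change to A."""

    def __init__(self):
        self.downstream = defaultdict(set)

    def add_dependency(self, upstream: str, dependent: str) -> None:
        self.downstream[upstream].add(dependent)

    def impacted_by(self, changed: str) -> set[str]:
        # Breadth-first traversal: everything reachable from the changed
        # artifact is a candidate for re-running.
        seen, queue = set(), deque([changed])
        while queue:
            node = queue.popleft()
            for dep in self.downstream[node]:
                if dep not in seen:
                    seen.add(dep)
                    queue.append(dep)
        return seen

# Hypothetical wiring of the six artifact types described above.
g = ArtifactGraph()
g.add_dependency("code_dependencies", "code")
g.add_dependency("code", "tests")
g.add_dependency("data", "tests")
g.add_dependency("tests", "deployment_artifacts")
g.add_dependency("deployment_artifacts", "results")
print(g.impacted_by("data"))  # everything downstream of a data change
```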

Figure 2. An example of an experiment run graph: The artifact boxes and the output datasets for each step are shaded. The outcome dataset created by Step 1 feeds into the input of Step 4, while Step 4 is a partitioned process that includes merging logic for its outcomes.

Continuous Analysis involves several stages, including updating data repositories, running validation pipelines, and storing results in the results repository, where they can be accessed and visualized by developers or stakeholders (Figure 3). If new data is added to the data repository, or if a data dependency is updated, an event is triggered to run a validation pipeline against the current version of the other repositories. Validated and approved results trigger updates in the main branch of the code repository, initiating the release pipeline. The same procedure applies when the code or its dependencies are modified: the test pipeline performs comprehensive validation, and the test results are added to the results storage together with branch and change information. This process, including the automated execution of tasks and their dependencies, ensures seamless integration of data and code changes and promotes consistent, reproducible workflows. Release-pipeline experiment runs, whether triggered automatically or performed manually, maintain the same level of artifact granularity gathered during the test pipeline.
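A hedged sketch of that event flow follows, with assumed names and event shape: a change to any repository triggers validation, results are stored with change metadata, and only approved results promote the change to the release pipeline.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ChangeEvent:
    repository: str   # "data", "code", or "dependencies"
    ref: str          # branch or dataset version that changed

def handle_change(
    event: ChangeEvent,
    run_validation: Callable[[ChangeEvent], dict],
    store_results: Callable[[dict], None],
    is_approved: Callable[[dict], bool],
    run_release: Callable[[ChangeEvent], None],
) -> None:
    # Any repository change follows the same path: validate against the
    # current versions of the other repositories, persist the results with
    # branch/change metadata, and only then promote to the release pipeline.
    results = run_validation(event)
    results["source"] = {"repository": event.repository, "ref": event.ref}
    store_results(results)
    if is_approved(results):
        run_release(event)  # keeps the same artifact granularity as the test pipeline
```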

Figure 3. Continuous Analysis workflow overview: Code* represents modified code in the working code branch. The blue-shaded boxes indicate processes in the working branch, while the green-shaded boxes represent processes in the release/main branch. The white box outlined with a blue border encapsulates the additional feedback artifact storage and Feedback Gateway logic added for Continuous Analysis.

Benefits of Continuous Analysis 

By adopting Continuous Analysis, researchers can benefit from automated workflows that facilitate documentation, sharing, testing, and deployment of their code and data. This approach promotes transparency and collaboration, enabling more efficient and reproducible research. 

For more details about CA, please see the paper.
