Usage Notes

After completing the software deployment you should have at least two servers or VMs (technically a single one can suffice): a data server and a processing server. This section is a guide to using the infrastructure for its intended purpose: neuroimaging data management and accelerated analysis.

Data Flow

This section gives an overview of the data flow, from the moment a study participant is scanned to the moment the data are available for a release.

        %%{ init:
        { "theme": "forest",
        "sequence": { "showSequenceNumbers": true } }
     }%%
sequenceDiagram
box Gray Dicom Acquisition
participant MRI Scanner
participant Mercure as Mercure<br/>Storescp
end
box Orange Local Gitlab
participant DicomStudy
participant BIDS
participant MRIQC
participant BIDS-derivatives
end

box Green Centralized Gitlab
participant BIDS-fed
participant BIDS-derivatives-fed
end

MRI Scanner ->> Mercure: send dicom
Mercure ->> DicomStudy: init
Mercure ->> BIDS: init

loop Every new session
Mercure ->> DicomStudy: add session as submodule
DicomStudy ->> BIDS: trigger heudiconv
BIDS ->>+BIDS: open MR
activate BIDS
BIDS ->> BIDS: test: bids-validator
BIDS ->> BIDS: test: protocol compliance
BIDS ->> BIDS: run: defacing
BIDS ->> MRIQC: trigger: mriqc
MRIQC ->> BIDS: include/add link to reports in MR
create actor A as Dataset Admin
BIDS ->>-A: Notify
A ->> BIDS: Review + (Fix) + merge MR
loop For each configured preproc pipelines
BIDS ->>+BIDS-derivatives: trigger preproc newly merged session
BIDS-derivatives ->> BIDS-derivatives: open MR (preproc reports as artifacts)
BIDS-derivatives ->>-A: Notify
A ->> BIDS-derivatives: Review reports + merge MR
end
end
A ->> BIDS: Create release
BIDS ->> BIDS-fed: Push git + "green" data (minio)
A ->> BIDS-derivatives: Create release
BIDS-derivatives ->> BIDS-derivatives-fed: Push git + "green" data (minio)

As the diagram shows, the data flow is divided into three parts:

Data acquisition: data are acquired in DICOM format and sent to the Mercure instance, which then automatically pushes them to the local GitLab instance.
Local GitLab git flow: the workflow running from the push of a new DICOM session through data conversion and processing.
Data integration and sharing: handled through a centralized GitLab instance.

The Dataset Admin actor in the diagram represents the individual responsible for QC review, dataset merging, and pipeline monitoring. These tasks can be split among several individuals for more efficient and robust management.

We use the DICOM networking protocol to transfer images from the scanner to the Mercure instance on the data server, where they are archived and automatically pushed to the GitLab instance based on the following DICOM tags:

ReferringPhysicianName: Determines the Principal Investigator name and corresponds to the root GitLab group in the dataset hierarchy.
StudyDescription: Determines the study name and corresponds to the sub-group directly below the root group in the dataset hierarchy.
StudyInstanceUID: The unique DICOM study ID, used to track the DICOM data in GitLab.
PatientID: Determines the BIDS participant and session IDs and is used to link the DICOM data to the BIDS dataset.
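
As a rough illustration of how these tags could map onto the GitLab hierarchy, the sketch below builds a group path from a plain dictionary standing in for a DICOM header. The `slugify` helper, its character rules, and the example values (including the PatientID format) are assumptions made for illustration; the actual routing rules may normalize names differently.

```python
import re

REQUIRED_TAGS = (
    "ReferringPhysicianName",
    "StudyDescription",
    "StudyInstanceUID",
    "PatientID",
)


def slugify(value):
    """Hypothetical normalization: lowercase, other character runs become '-'."""
    return re.sub(r"[^a-z0-9]+", "-", value.strip().lower()).strip("-")


def gitlab_group_path(tags):
    """Build '<principal investigator>/<study>' from DICOM tag values."""
    missing = [name for name in REQUIRED_TAGS if not tags.get(name)]
    if missing:
        raise ValueError("missing DICOM tags: " + ", ".join(missing))
    pi_group = slugify(tags["ReferringPhysicianName"])
    study_group = slugify(tags["StudyDescription"])
    return f"{pi_group}/{study_group}"


# Invented example values; only the study name C-PIP appears in this guide.
header = {
    "ReferringPhysicianName": "Doe^Jane",
    "StudyDescription": "C-PIP",
    "StudyInstanceUID": "1.2.3.4",
    "PatientID": "sub-01_ses-01",
}
print(gitlab_group_path(header))
```

Checking for all four tags up front reflects the advice that the header must carry enough information to rebuild the hierarchy, even though only two of them appear in the path itself.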

In principle, Mercure can receive data from any MRI vendor; so far, however, it has only been configured to work with Siemens and GE 3T MRI scanners. It can be adapted to other vendors with some additional work and testing.

Note

The selected DICOM tags can be changed to accommodate the restrictions of the acquisition site. Whichever tags are used, they should reliably carry enough information to build an equivalent structure.

Git Flow

Using DataLad for data version control enables tracking the provenance of datasets from their creation to their sharing. This is achieved through a Git flow approach, where changes to the dataset are stored in separate branches and merged when ready.

The previous sequence diagram illustrates the workflow, starting from the push of a new DICOM session to executing data processing and release mechanisms on the federated instance (details to be designed).

Each GitLab actor/repository (e.g., DicomStudy, BIDS, Derivatives) is specific to a study, as defined by the DICOM tag StudyDescription (e.g., C-PIP). Each study follows this structure within a GitLab group, organized under groups corresponding to the local Principal Investigator or local consortia.

All operations on GitLab are automated through GitLab pipelines, which the GitLab runners execute as CI jobs. These operations can be divided into different phases.

Pilot Phase

During the pilot phase, an experimenter acquires one or more sessions to test individual sequences and/or the full protocol. Sessions labelled dev or pilot in the PatientID are treated as pilot sessions. Pilot sessions are converted to BIDS like regular sessions, but their Merge Request (MR) targets the pilot branch. That MR triggers the same workflow as sessions in the production phase, including BIDS validation, defacing, and MRIQC, all of which help to examine the compliance and quality of the data.
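
The routing rule described above can be sketched as follows. The exact matching logic is an assumption: here any case-insensitive occurrence of dev or pilot in the PatientID marks a pilot session, and the example IDs are invented.

```python
PILOT_MARKERS = ("dev", "pilot")


def is_pilot_session(patient_id):
    """Assumed rule: a 'dev' or 'pilot' substring marks a pilot session."""
    lowered = patient_id.lower()
    return any(marker in lowered for marker in PILOT_MARKERS)


def target_branch(patient_id):
    """Pilot sessions target the pilot branch, all other sessions target dev."""
    return "pilot" if is_pilot_session(patient_id) else "dev"


for patient_id in ("pilot01_ses-01", "DEV-phantom", "sub-01_ses-01"):
    print(patient_id, "->", target_branch(patient_id))
```

A plain substring test would also match participant labels that merely contain "dev", so a stricter convention (for example, a fixed prefix) may be preferable in practice.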

Once merged to the pilot branch they also trigger:

A configuration of the forbids tool that will enforce the protocol in future sessions.

A configuration of standard pre-processing pipelines based on the acquired data.

Standard pre-processing pipelines are then triggered to check if the pilot data are compatible and produce sensible results.

Merging new sessions that iterate on the protocol reconfigures the protocol and pipelines, and also opens a Merge Request from the cherry-picked configs on the config branch to the base branch. When the protocol is finalized and all checks pass, that MR with the latest config is reviewed, manually edited if necessary, and merged, effectively setting up the repo for tests and derivatives generation during the production phase.

        %%{ init: { "theme": "forest" } }%%
gitGraph:
    commit "start"
    branch config
    branch base
    checkout base
    commit id:"zzzzzzzzzzz"

    branch pilot
    checkout base

    branch convert/pilot1
    checkout convert/pilot1
    commit id:"heudiconv"
    commit id:"post-heudiconv-fixes"
    commit id:"fill-intendedfor/b0field"
    commit id:"deface"

    checkout pilot
    merge convert/pilot1
    commit id:"configs"

    checkout config
    cherry-pick id:"configs"

    checkout base
    commit id:"to better align"

    branch convert/pilot2
    checkout convert/pilot2
    commit id:"heudiconv-2"
    commit id:"post-heudiconv-fixes-2"
    commit id:"fill-intendedfor/b0field-2"
    commit id:"deface-2"

    checkout pilot
    merge convert/pilot2
    commit id:"reconfigs"

    checkout config
    cherry-pick id:"reconfigs"

    checkout base
    merge config

    checkout main
    merge base

Production Phase

During the production phase, each new session is converted in its own convert/{session_name} branch, which opens a new Merge Request with tests and QC reports to be reviewed, and edited if necessary, before merging into the dev branch.
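
A small sketch of the branch naming used in this phase is given below. It assumes that the session name has the form sub-<label>_ses-<label>, as in the convert/sub-1_ses-1 example used elsewhere in this guide; the function names and the dictionary layout are hypothetical.

```python
import re

# BIDS labels consist of letters and digits only.
LABEL = re.compile(r"[A-Za-z0-9]+")


def convert_branch(subject, session):
    """Return a branch name such as 'convert/sub-1_ses-1'."""
    for label in (subject, session):
        if not LABEL.fullmatch(label):
            raise ValueError(f"not a valid BIDS label: {label!r}")
    return f"convert/sub-{subject}_ses-{session}"


def merge_request(subject, session, target="dev"):
    """Describe the MR that a newly converted session would open."""
    return {
        "source_branch": convert_branch(subject, session),
        "target_branch": target,
    }


print(merge_request("1", "1"))
```

Validating the labels before building the branch name keeps malformed identifiers from producing branches that later fail BIDS validation.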

        %%{ init: { "theme": "forest" } }%%
gitGraph:
    commit "start"

    branch base
    checkout base
    commit id:"zzzzzzzzzzz"

    branch dev
    checkout base

    branch convert/session_name1
    checkout convert/session_name1
    commit id:"heudiconv"
    commit id:"post-heudiconv-fixes"
    commit id:"fill-intendedfor/b0field"
    commit id:"deface"

    checkout dev
    merge convert/session_name1

    checkout base
    branch convert/session_name2
    checkout convert/session_name2
    commit id:"heudiconv-2"
    commit id:"post-heudiconv-fixes-2"
    commit id:"fill-intendedfor/b0field-2"
    commit id:"deface-2"

    checkout dev
    merge convert/session_name2

Release Phase

When working on a data release, a new release branch can be created from dev, iterated upon (e.g., editing the README or docs) through branches and MRs, and finally merged into the main branch and tagged with a release version. Meanwhile, new sessions continue to be added to the dev branch.

        %%{ init: { "theme": "forest" } }%%
gitGraph:
    branch dev
    checkout main
    commit
    commit id:"previous_release" tag:"rel/www"

    checkout dev
    commit id:"long history"
    commit id:"bunch_of_sessions_now"

    branch rel/xxx
    checkout rel/xxx

    branch fix/xyz
    checkout dev
    commit
    commit

    checkout fix/xyz
    commit id:"random-fix"
    checkout rel/xxx
    merge fix/xyz

    checkout dev
    commit
    commit

    checkout rel/xxx
    branch fix/zyx
    checkout fix/zyx
    commit id:"edit README"
    checkout rel/xxx
    merge fix/zyx

    checkout main
    merge rel/xxx tag:"rel/xxx"

    checkout dev
    commit

Pipeline Management

Automated

After proper configurations have been made, the data ingestion process is fully automated. The data is pushed to the Mercure instance and automatically pushed to the local GitLab instance. The data is then converted to BIDS format and processed using the configured pipelines.

Heudiconv Conversion to BIDS

The Heudiconv tool is used to convert DICOM files to BIDS format following a set of heuristics that define how the data should be organized.

The heuristics file is a Python script that can be found in the ci-pipelines BIDS-flux repository.

In general, the heuristics file defines several functions:

def custom_seqinfo(wrapper, series_files): Extracts the relevant DICOM tags from the DICOM files; these are later used to determine the BIDS sequence information.
def infotoids(seqinfos, outdir): Uses the extracted DICOM tags to determine the BIDS subject and session IDs.
def infotodict(seqinfo): Heuristic evaluator that determines which runs belong where. The allowed template fields follow the conventions of Python's string module.
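
To make the role of infotodict more concrete, here is a minimal sketch in the style of a HeuDiConv heuristic. It is not the heuristic shipped in the ci-pipelines repository: the protocol-name matching rules and the example series are invented, and the `Series` tuple is only a stand-in for the sequence information that HeuDiConv passes in.

```python
from collections import namedtuple


def create_key(template, outtype=("nii.gz",), annotation_classes=None):
    """Conventional HeuDiConv helper that describes one BIDS output."""
    if not template:
        raise ValueError("Template must be a valid format string")
    return template, outtype, annotation_classes


T1W = create_key("sub-{subject}/{session}/anat/sub-{subject}_{session}_T1w")
BOLD = create_key(
    "sub-{subject}/{session}/func/sub-{subject}_{session}_task-rest_bold"
)


def infotodict(seqinfo):
    """Assign each series ID to the BIDS key it should be converted to."""
    info = {T1W: [], BOLD: []}
    for series in seqinfo:
        # Assumed, site-specific naming rules for the scanner protocols.
        if "T1w" in series.protocol_name:
            info[T1W].append(series.series_id)
        elif "bold" in series.protocol_name and series.dim4 > 1:
            info[BOLD].append(series.series_id)
    return info


# Stand-in for HeuDiConv's sequence information records.
Series = namedtuple("Series", "series_id protocol_name dim4")
example = [
    Series("2-anat_T1w", "anat_T1w", 1),
    Series("3-rest_bold", "func_task-rest_bold", 200),
    Series("4-rest_sbref", "func_task-rest_bold", 1),
]
print(infotodict(example))
```

The `dim4 > 1` check keeps single-volume series, such as the reference image in the example, out of the functional runs.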

Deface of BIDS images

The defacing of BIDS images is performed using a simple custom tool that affinely registers the T1w image to MNI space and applies a mask to the image.

BIDS-validation

The BIDS-validation process is performed using the dockerized version of the BIDS-validator tool, which checks the newly created BIDS dataset for compliance with the BIDS standard. This step is repeated for every change made to the BIDS datalad dataset in GitLab.
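
For reference, a validation run with the dockerized validator could be assembled as in the sketch below. It assumes the public bids/validator image and a read-only mount at /data; the image and options actually used by the CI jobs may differ, and the dataset path is a placeholder.

```python
from pathlib import Path


def validator_command(dataset_dir, image="bids/validator"):
    """Build the docker argv that validates a dataset mounted read-only."""
    data = Path(dataset_dir).resolve()
    return [
        "docker", "run", "--rm",
        "-v", f"{data}:/data:ro",
        image, "/data",
    ]


# Placeholder path; pass the list to subprocess.run() to actually execute it.
print(" ".join(validator_command("/path/to/bids")))
```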

MRIQC

The MRIQC tool is used to assess the quality of the BIDS images. The MRIQC reports are generated and stored in the qc/mriqc DataLad dataset in GitLab. The reports are linked to the BIDS images, allowing for easy access and review through the merge request.

Manual Input

The dataset administrator is responsible for reviewing the BIDS-converted data and associated MRIQC reports. They may also manually edit the BIDS dataset when necessary. Additionally, the administrator oversees the approval process for merge requests, ensuring that any required modifications are made prior to granting approval.

Retriggering of Heudiconv

If the Heudiconv conversion process fails or requires reconfiguration of the heuristics, the dataset administrator can manually trigger the process again using the GitLab interface. This allows for flexibility in managing the conversion process and ensuring that the data is properly formatted.

Note

If the DICOM data were only partially converted, causing the pipeline to fail BIDS validation, a convert/sub-1_ses-1 branch will already have been created. You will need to either rename that branch (for example, to convert/sub-1_ses-1_originalconv) or delete it, because the retrigger process tries to recreate the same branch and fails if it already exists.

The partially converted data remain in the S3-compatible storage (MinIO) unless you delete them manually. You can delete them using a combination of git, git-annex, and DataLad with the following commands:

git checkout convert/sub-1_ses-1
git annex drop --from=<remote name> /path/to/data --force
datalad save --message "deleted partial conversion data"

The reason we need to save the changes after the fact is that git-annex needs to be notified that you dropped the binary data from the remote. Otherwise when reconverting the data, datalad might think the data already exists in the remote and not upload the complete data.
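
When several failed conversions have to be cleaned up, the same three commands can be generated per branch. The sketch below only prints them as a dry run; the remote name and data path are placeholders that must be replaced with the values of your dataset.

```python
import shlex


def cleanup_commands(branch, remote, data_path):
    """Return the cleanup commands for one partially converted session."""
    return [
        ["git", "checkout", branch],
        ["git", "annex", "drop", f"--from={remote}", data_path, "--force"],
        ["datalad", "save", "--message", "deleted partial conversion data"],
    ]


# Dry run with placeholder values: print the commands instead of running them.
for argv in cleanup_commands("convert/sub-1_ses-1", "my-s3-remote", "sub-1/ses-1"):
    print(shlex.join(argv))
```

Printing the commands first makes it easy to review them before anything is dropped from the remote, since the drop is forced and cannot be undone from the local clone.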

Manual Editing of BIDS Dataset

The dataset administrator can manually edit the BIDS dataset using git and DataLad commands. This allows for flexibility in managing the dataset and ensuring that it meets the BIDS standard.

git mv /path/to/file /new/path/to/file
datalad save --message "Renamed files"
git rm /path/to/file
datalad save --message "Deleted files"
datalad push --to=origin
datalad push --to=<bids s3 remote>

MRIQC Report & Merge Request Review

The dataset administrator needs to review the MRIQC reports. Depending on the project's needs, the administrator then either approves or rejects the merge request from the new convert/sub-1_ses-1 branch into the dev branch.

Data Access

Access to the data is managed through GitLab groups and S3 bucket policies. This access can be as granular as the project requires. The dataset administrator is responsible for managing access to the data, including granting and revoking permissions as needed. Access to the data is typically restricted to authorized personnel only, ensuring that sensitive information is protected. When data is ready to be shared openly or with specific collaboration groups or individuals, the dataset administrator can create a release branch and tag it with a version number.

The different tiers of access available in GitLab are described in the official GitLab documentation.

S3 bucket policies can be used to restrict access to the data stored in MinIO. The dataset administrator can create policies that allow or deny access to specific users or groups based on their roles and responsibilities. This ensures that only authorized personnel have access to sensitive data, while still allowing for collaboration and sharing of non-sensitive data.
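
As an example of such a policy, the sketch below generates a read-only policy document in the AWS-style JSON format that MinIO accepts, which could then be attached to a user or group. The bucket name is a placeholder, and the set of actions is one reasonable choice for read-only access rather than the policy used by any particular deployment.

```python
import json


def read_only_policy(bucket):
    """Build a policy document granting list and download access to a bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetBucketLocation", "s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


# Placeholder bucket name.
print(json.dumps(read_only_policy("example-study-bids"), indent=2))
```

Listing applies to the bucket itself while downloads apply to the objects inside it, which is why the two statements use different resource patterns.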

Locally

GitLab serves as a catalogue for the BIDS-flux data.

To access data from the BIDS-flux infrastructure you will need to work with two of the software applications deployed for BIDS-flux, GitLab and MinIO.

GitLab tracks the structure and history of the repositories, or in our case, the study directory hierarchy. The hierarchy inside GitLab is defined in this order: Principal Investigator / Study Name / (bids, sourcedata, qc, derivatives). Principal Investigator is the investigator heading the study. Study Name is the name of a study led by that principal investigator. Under each study, you will find four repositories containing study-specific data. The sourcedata repository keeps track of all the DICOM files of the study. The bids repository tracks the BIDS-formatted images for the study. The qc repository tracks the quality control checks for the study data, and the derivatives repository tracks processing steps applied to the BIDS-formatted data.
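
The hierarchy can be summarized in a few lines of code. The helper below simply enumerates the four repository paths for one study; the group names in the example are placeholders.

```python
REPOSITORIES = ("bids", "sourcedata", "qc", "derivatives")


def study_repositories(principal_investigator, study):
    """List the GitLab repository paths that belong to one study."""
    return [
        f"{principal_investigator}/{study}/{name}" for name in REPOSITORIES
    ]


# Placeholder group names.
for path in study_repositories("doe-jane", "c-pip"):
    print(path)
```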

MinIO serves as the object storage for all the data in the GitLab repositories. GitLab tracks each file's history and the dataset structure, while MinIO stores all the images and binary objects (all non-text files).