Ridge regression example#

This notebook implements a cross-validated voxelwise encoding model for a single subject using regularized ridge regression.

The goal is to demonstrate how to obtain Neuroscout data to fit models using custom pipelines. For a comprehensive tutorial, check out the excellent voxelwise modeling tutorials from the Gallant Lab.

Note: By implementing a custom pipeline, your analysis will not be centrally registered on neuroscout.org, and a reproducible record will not be made. For analyses supported by the Neuroscout-CLI (e.g. group voxelwise mass univariate GLM models), it is recommended to use the neuroscout.org web interface, or follow the guide for programmatically creating analyses using pyNS.

Citing Neuroscout#

If you publish any results using Neuroscout data, be sure to cite the Neuroscout paper, as well as the corresponding datasets:

Alejandro de la Vega, Roberta Rocca, Ross W Blair, Christopher J Markiewicz, Jeff Mentch, James D Kent, Peer Herholz, Satrajit S Ghosh, Russell A Poldrack, Tal Yarkoni (2022). *Neuroscout, a unified platform for generalizable and reproducible fMRI research*. eLife 11:e79277. https://doi.org/10.7554/eLife.79277

Visconti di Oleggio Castello, M., Chauhan, V., Jiahui, G., & Gobbini, M. I. (2020). *An fMRI dataset in response to “The Grand Budapest Hotel”, a socially-rich, naturalistic movie*. Sci Data 7, 383. https://doi.org/10.1038/s41597-020-00735-4

If running this notebook on Google Colab, be sure to switch the runtime to GPU to expedite model fitting (Runtime -> Change runtime type).

%%capture --no-display --no-stderr
#@title 1) Run once to install dependencies on Colab (pyNS, DataLad, and machine learning libraries) (~1 minute)
from IPython.display import display
display("Installing NeuroDebian...")

## Set up DataLad
!wget -O- http://neuro.debian.net/lists/focal.us-ca.libre | sudo tee /etc/apt/sources.list.d/neurodebian.sources.list && sudo apt-key adv --recv-keys --keyserver hkps://keyserver.ubuntu.com 0xA5D32F012649A5A9 && sudo apt-get update
display("Installing DataLad...")

!sudo apt-get install datalad -y
%pip install -U datalad
!git config --global user.email "you@example.com" && git config --global user.name "Your Name"

display("Installing PyNS...")
%pip install pyns
%pip install git+https://github.com/bids-standard/pybids.git#egg=pybids

display("Installing analysis dependencies (nilearn, and himalaya)...")
%pip install nilearn himalaya

display("Done.")

Fetching Neuroscout Data#

We can easily retrieve data from Neuroscout using pyNS — the Python Neuroscout API client.

Be sure to refer to the official pyNS documentation for further usage information, with particular focus on the section on fetching predictors and images.

What datasets are available?#

If you’re not sure what is available, you can browse Neuroscout’s datasets and predictors online.
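You can also list the available datasets programmatically through the API. For example (a quick sketch; it assumes the datasets endpoint returns dictionaries with a name key):

from pyns import Neuroscout

# List the names of all available Neuroscout datasets
api = Neuroscout()
print([d['name'] for d in api.datasets.get()])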

Here, we’re going to focus on the ‘Budapest’ dataset, choosing the first subject (sid000005) as an example.

# Find first subject for Budapest
from pyns import Neuroscout

api = Neuroscout()

dataset_name = 'Budapest'

# First run for Budapest dataset
first_run = api.runs.get(dataset_name=dataset_name)[0]
subject = first_run['subject'] # Save subject name

print(subject)
sid000005

Fetching predictors#

We will fetch two sets of predictors: the Mel spectrogram, and Mel-frequency cepstral coefficients (MFCCs). Both feature sets are extracted from the audio track of the movie stimulus (‘The Grand Budapest Hotel’). Later in the tutorial we’ll fit an encoding model to each set of features separately.

First, we define the names of the predictors we will fetch.

To learn more about basic Neuroscout API querying, see this guide.

mfccs = [f'mfcc_{i}' for i in range(20)]
mel = [f'mel_{i}' for i in range(64)]

all_vars = mfccs + mel

Next, we can use the high-level utility fetch_predictors to retrieve these predictors for the target subject, rescaled to unit variance, and resampled to the imaging data’s Repetition Time (TR).

Since downloading and resampling can take a minute, for this example we’ll only use the first three (out of five) runs.

from pyns.fetch_utils import fetch_predictors

predictors = fetch_predictors(all_vars, dataset_name='Budapest', subject='sid000005', run=[1, 2, 3], rescale=True, resample=True)

This results in a DataFrame of predictor values, plus metadata such as file entities (e.g. subject, run).
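For instance, we can take a quick look at its shape and columns (purely illustrative; the exact column layout may vary across pyNS versions):

# Inspect the fetched predictors DataFrame
print(predictors.shape)
print(predictors.columns.tolist()[:10])
predictors.head()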

Fetching fMRI data and loading images#

To retrieve Neuroscout data, we use datalad to fetch the preprocessed images from the remote. pyNS includes a helper function, fetch_images, that facilitates installing the dataset and fetching files using datalad.

Provide the name of the dataset (Budapest), a directory where your datasets are stored, and (optionally) filters to restrict which files are downloaded (e.g. subject).

from pyns.fetch_utils import fetch_images

# Note: This command can hang after fetching images.
# If so, interrupt the cell and re-run it; it will complete immediately
preproc_dir, img_objs = fetch_images('Budapest', '/tmp/', subject=subject, run=[1, 2, 3], fetch_brain_mask=True)
action summary:
  get (notneeded: 6)

The fetch_images function returns images as pybids BIDSImageFile objects, which carry metadata (such as BIDS entities) on the object:

img_objs[2].get_entities()
{'datatype': 'func',
 'desc': 'preproc',
 'extension': '.nii.gz',
 'run': 3,
 'space': 'MNI152NLin2009cAsym',
 'subject': 'sid000005',
 'suffix': 'bold',
 'task': 'movie'}

Using the entities, we can separate the list of images into two lists: preprocessed functional images and brain masks:

all_funcs = [f for f in img_objs if f.entities['suffix'] == 'bold']
all_masks = [f for f in img_objs if f.entities['suffix'] == 'mask']
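As a quick sanity check, we can confirm that we fetched one functional image and one brain mask per run:

# Expect one preprocessed BOLD image and one brain mask per run
print(len(all_funcs), len(all_masks))
assert len(all_funcs) == len(all_masks)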

Image preprocessing#

The following computes a joint mask across all runs. Alternatively, we could provide an a priori ROI mask here.

import nibabel as nib
import numpy as np

def _compute_mask_intersection(masks): 
    """ Compute joint mask (i.e. voxels where all masks have 1s) """
    _masks = [nib.load(m) for m in masks]
    _data = np.stack([m.get_fdata() for m in _masks])
    
    intersection = _data.min(axis=0)
    
    mask = nib.Nifti1Image(
        intersection, affine = _masks[0].affine, header=_masks[0].header)
    
    return mask

inter_mask = _compute_mask_intersection(all_masks)

Now we can apply the corresponding brain mask to each run using NiftiMasker from nilearn, and stack them into a single array for subsequent analysis:

import pandas as pd
import numpy as np
from nilearn.maskers import NiftiMasker

def _mask_and_stack_images(image_objects, mask):
    """ Stack images into single array, and collect metadata entities into DataFrame """
    masker = NiftiMasker(mask_img=mask)

    arrays = []
    entities = []
    image_objects = sorted(image_objects, key=lambda x: x.entities['run'])
    for img in image_objects:
        run_y = masker.fit_transform(img)
        arrays.append(run_y)
        entities += [dict(img.entities)] * run_y.shape[0]
    entities = pd.DataFrame(entities)
    return np.vstack(arrays), entities, masker

y, img_entities, masker = _mask_and_stack_images(all_funcs, inter_mask)

The stacked runs have shape: (n_volumes, n_voxels)

y.shape
(1631, 124614)

We also keep the entities associated with each volume in a DataFrame:

img_entities
datatype desc extension run space subject suffix task
0 func preproc .nii.gz 1 MNI152NLin2009cAsym sid000005 bold movie
1 func preproc .nii.gz 1 MNI152NLin2009cAsym sid000005 bold movie
2 func preproc .nii.gz 1 MNI152NLin2009cAsym sid000005 bold movie
3 func preproc .nii.gz 1 MNI152NLin2009cAsym sid000005 bold movie
4 func preproc .nii.gz 1 MNI152NLin2009cAsym sid000005 bold movie
... ... ... ... ... ... ... ... ...
1626 func preproc .nii.gz 3 MNI152NLin2009cAsym sid000005 bold movie
1627 func preproc .nii.gz 3 MNI152NLin2009cAsym sid000005 bold movie
1628 func preproc .nii.gz 3 MNI152NLin2009cAsym sid000005 bold movie
1629 func preproc .nii.gz 3 MNI152NLin2009cAsym sid000005 bold movie
1630 func preproc .nii.gz 3 MNI152NLin2009cAsym sid000005 bold movie

1631 rows × 8 columns
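Since the run labels will later be used to group volumes for cross-validation, it can be useful to check how many volumes each run contributes (an illustrative check):

# Count the number of volumes per run
print(img_entities['run'].value_counts().sort_index())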

Fitting ridge regression models using Mel spectrogram features#

First, we’ll fit a basic voxelwise ridge regression model using a subset of the features (for demonstration purposes).

We’ll define two cross-validators: an outer and an inner CV. The outer cross-validation loop will be used to estimate the performance of the model on unseen data, and the inner loop will be used to select the alpha hyperparameter for ridge regression within each training fold of the outer cross-validator.

In both cases, we’ll use a leave-one-run-out strategy, testing the model on a held-out run. Since there are multiple observations per run, we use the run column of the predictors DataFrame to group observations and ensure that each run’s observations appear together within a single fold.

Finally, we also define a scoring function, in this case, correlation_score.
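Intuitively, correlation_score computes the Pearson correlation between the predicted and observed time course of each voxel. For intuition only, a minimal numpy equivalent might look like this (a hypothetical helper, not part of himalaya):

import numpy as np

def _pearson_per_voxel(y_true, y_pred):
    """ Column-wise Pearson r between observed and predicted time courses """
    yt = (y_true - y_true.mean(axis=0)) / y_true.std(axis=0)
    yp = (y_pred - y_pred.mean(axis=0)) / y_pred.std(axis=0)
    return (yt * yp).mean(axis=0)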

Note that we are using the himalaya library’s RidgeCV estimator, as it is optimized for multi-target (i.e. voxelwise) estimation and hyperparameter optimization, with an API inspired by scikit-learn.

from himalaya.ridge import RidgeCV
from sklearn.model_selection import GroupKFold
from himalaya.scoring import correlation_score
from himalaya.backend import set_backend

backend = set_backend("torch_cuda", on_error="warn")

groups = predictors['run'].tolist()

# Set up estimator and CV objects
estimator = RidgeCV()
n_runs = len(set(groups))
cv = GroupKFold(n_splits=n_runs)
inner_cv = GroupKFold(n_splits=n_runs - 1)

The following is a generic pipeline for applying the estimator to the data:

from collections import defaultdict

def _model_cv(estimator, cv, X_vars, y, groups=None,
              scoring=correlation_score, inner_cv=None):
    """ Cross-validate a model on a set of variables.

    Parameters
    ----------
    estimator : Scikit-learn like estimator
        Estimator to use for fitting the model. Must have a fit and predict
        method.
    cv : cross-validation generator or an iterable
        Outer cross-validation generator to use for cross-validation.
    X_vars : pandas.DataFrame
        Dataframe containing the variables to use for prediction.
    y : array-like, shape (n_samples, n_targets)
        Target values.
    groups : array-like, with shape (n_samples,), optional
        Group labels for the samples used while splitting the dataset into
        train/test set.
    scoring : callable, optional  (default: correlation_score)
        Scoring function to use for evaluating the model.
    inner_cv : cross-validation generator or an iterable, optional
        Inner cross-validation generator to use for cross-validation.
    """
    
    # Container for results
    results = defaultdict(list)
    
    # Extract data from dataframe
    X = X_vars.values

    # Extract number of samples for convenience
    n_samples = y.shape[0]
    
    # Loop through outer cross-validation folds
    for train, test in cv.split(np.arange(n_samples), groups=groups):
        
        # Get training model for list of model bands
        X_train = [x[train] for x in X] if isinstance(X, list) else X[train]
        X_test = [x[test] for x in X] if isinstance(X, list) else X[test]
        
        # Create inner cross-validation loop if specified
        if inner_cv:
            # Split inner cross-validation with groups if supplied
            inner_groups = np.array(groups)[train] if groups else groups
            inner_splits = inner_cv.split(np.arange(n_samples)[train],
                                          groups=inner_groups)
            
            # Update estimator with inner cross-validator
            estimator.set_params(cv=inner_splits)
        
        # Fit the regression model on training data
        estimator.fit(X_train, y[train])
        
        # Compute predictions
        test_prediction = estimator.predict(X_test)
        
        # Test scores
        test_score = scoring(y[test], test_prediction)

        # If the output is not an array, assume it's a torch Tensor and convert to array
        if not isinstance(test_prediction, np.ndarray):
            test_prediction = test_prediction.cpu().numpy()
            test_score = test_score.cpu().numpy()
        
        # Populate results dictionary
        results['test_predictions'].append(test_prediction)
        results['test_scores'].append(test_score)
        
        
    # Combine into single array
    results['test_scores'] = np.stack(results['test_scores'])
    
    return results

Mel-spectrogram features only#

First, we’ll fit this model using only the Mel spectrogram features:

X = predictors[mel] # Take a subset of X for this model
results = _model_cv(estimator, cv, X, y, inner_cv=inner_cv, groups=groups)

The results dictionary contains:

print(results.keys())
dict_keys(['test_predictions', 'test_scores'])

The test_scores array has shape: (n_folds, n_voxels).

results['test_scores'].shape
(4, 124614)

Then, we average across folds to get the mean score for this subject:

mean_scores = results['test_scores'].mean(axis=0)

For interpretation, we’ll only look at positive scores:

mean_scores[mean_scores < 0] = 0
mean_scores.max()
0.21056749957050686
mean_scores.mean()
0.027309387725607995

We can inverse-transform the score map to a 3D volume and visualize it:

from nilearn.plotting import plot_stat_map

plot_stat_map(masker.inverse_transform(mean_scores), display_mode='z')
<nilearn.plotting.displays._slicers.ZSlicer at 0x7f44051f3b50>
../_images/2be63bd7cc0358faf20cb0ab26c638527ba8b4f0771054c3168073a01d926d06.png
# Clean up (for memory consumption)
del results['test_scores']
del results

Ridge regression using MFCC features#

We can now run a similar model, but using the MFCC features instead of the Mel spectrogram:

X = predictors[mfccs]
results_mfcc = _model_cv(estimator, cv, X, y, inner_cv=inner_cv, groups=groups)
mean_scores_mfcc = results_mfcc['test_scores'].mean(axis=0)
mean_scores_mfcc[mean_scores_mfcc < 0] = 0

Looking at the whole-brain scores, the MFCC model may outperform the Mel spectrogram model:

mean_scores_mfcc.max()
0.36887338343916876
mean_scores_mfcc.mean()
0.02817772388516969
from nilearn.plotting import plot_stat_map, plot_glass_brain

plot_stat_map(masker.inverse_transform(mean_scores_mfcc), display_mode='z')
<nilearn.plotting.displays._slicers.ZSlicer at 0x7f4407c4e020>
../_images/160d4130d8762b0cab36d473d34e72fd70f68f0308098bf7a7a304f85177b24b.png
# Clean up (for memory consumption)
del results_mfcc['test_scores']
del results_mfcc

Finite impulse response model#

Due to hemodynamic lag, the model may perform better if the predictors are delayed relative to stimulus onset.

We can do so using a Finite Impulse Response (FIR) model.

Using fetch_predictors, we can ask for the predictors to be returned as a BIDSVariableCollection, which enables us to apply any of the transformations implemented in pybids.

Alternatively, you can apply arbitrary transformations yourself by requesting a pandas DataFrame instead.
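For intuition, FIR-delaying predictors in a plain DataFrame amounts to appending shifted copies of each column. A hypothetical sketch (the column names are illustrative; below we instead use pybids’ Convolve, which handles this properly):

import pandas as pd

def _make_delays(df, columns, delays=(1, 2, 3, 4)):
    """ Append copies of each column shifted by the given numbers of TRs.
    Note: in practice, shift within each run separately to avoid
    leaking signal across run boundaries. """
    delayed = {f'{col}_delay_{d}': df[col].shift(d, fill_value=0)
               for col in columns for d in delays}
    return pd.concat([df, pd.DataFrame(delayed)], axis=1)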

collection = fetch_predictors(all_vars, dataset_name='Budapest', subject='sid000005', 
                              rescale=True, resample=True, return_type='collection')

Next, we use Convolve to apply an FIR model, and convert the result to a pandas DataFrame:

from bids.modeling.transformations import Convolve

def _convolve_df(collection, predictor_names):    
    Convolve(collection, predictor_names, model='fir', fir_delays=[1, 2, 3, 4])

    # To df, and sort rows by keys
    collection_df = collection.to_df().sort_values(['subject', 'run', 'onset'])

    # Reorder columns
    sort_columns = ['onset', 'duration'] + predictor_names
    sort_columns += collection_df.columns.difference(sort_columns).tolist()
    collection_df = collection_df[sort_columns]
    
    return collection_df

collection_df = _convolve_df(collection, all_vars)
collection_df.shape
(3052, 425)

The FIR convolution results in 5x the number of predictors, including the delayed versions of each feature.
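For example, we can count how many of the convolved columns derive from the Mel features:

# Count convolved columns derived from the Mel features
print(collection_df.columns.str.startswith('mel').sum())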

We can now fit a model with all of the FIR-delayed Mel spectrogram features, to see if it improves prediction. (Note that the collection was fetched for all runs; because its rows are sorted by run, the cross-validation indices, which are derived from y, select only the rows for the first three runs.)

X = collection_df.iloc[:, collection_df.columns.str.startswith('mel')]
results_mel_fir = _model_cv(estimator, cv, X, y, inner_cv=inner_cv, groups=groups)
mean_scores_mel_fir = results_mel_fir['test_scores'].mean(axis=0)
mean_scores_mel_fir[mean_scores_mel_fir < 0] = 0

The max score increases, while the mean score remains similar (in fact marginally lower):

mean_scores_mel_fir.max()
0.28626928535573065
mean_scores_mel_fir.mean()
0.026594240316020888
plot_stat_map(masker.inverse_transform(mean_scores_mel_fir), display_mode='z')
<nilearn.plotting.displays._slicers.ZSlicer at 0x7f4407769570>
../_images/fd5f3a346dda6be77364423736f67e360cb7856ac6d941d8e92c136385bc6b7d.png

Where to go from here?#

The goal of this tutorial is to familiarize you with how to access Neuroscout’s data to fit custom models.

This tutorial does not exhaustively cover all of the models that can be fit using these data, but at this point you should have the tools needed to access the relevant data through Neuroscout’s API.

By being able to fetch both predictor and brain imaging time courses using a simple, uniform API, it should be easy to extend these examples to a wide variety of machine learning workflows.
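For example, himalaya also implements banded ridge regression, which fits multiple feature spaces jointly, with a separate regularization strength per space. A rough sketch, assuming your himalaya version exposes GroupRidgeCV with the groups="input" option (check the himalaya documentation for the exact API):

from himalaya.ridge import GroupRidgeCV

# Banded ridge: pass X as a list of feature spaces, each of which
# receives its own regularization strength
banded = GroupRidgeCV(groups="input")
X_list = [predictors[mel].values, predictors[mfccs].values]
banded.fit(X_list, y)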

Citing Neuroscout#

Please be sure to cite Neuroscout if you publish any work using these data. Also be aware that there are ongoing efforts to standardize the most common variations of voxelwise encoding models, in order to enable users to fully specify and register their analyses in Neuroscout (as you currently can for summary-statistics multi-level GLM models).