Banded ridge regression example#

In this example, we model fMRI responses in a Neuroscout dataset using banded ridge regression.

Banded ridge regression allows you to fit and optimize a distinct regularization hyperparameter for each group or “band” of feature spaces. This is useful if you want to jointly fit two or more feature spaces. We can then also estimate the relative contribution of each feature space to the prediction for each voxel.
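In other words, with one design matrix $X_i$ and weight vector $b_i$ per band, banded ridge regression minimizes a least-squares objective with a separate ridge penalty $\alpha_i$ for each band (a sketch of the standard objective):

$$\min_{b_1, \dots, b_m} \left\lVert y - \sum_{i=1}^{m} X_i b_i \right\rVert_2^2 + \sum_{i=1}^{m} \alpha_i \left\lVert b_i \right\rVert_2^2$$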

This example assumes that you have already worked through the basic ridge regression example, which demonstrates how to use pyns.fetch_utils to retrieve Neuroscout predictor and neuroimaging data for model fitting. For a comprehensive tutorial, check out the excellent voxelwise modeling tutorials from the Gallant Lab.

Note: By implementing a custom pipeline, your analysis will not be centrally registered on neuroscout.org, and a reproducible record will not be made. For analyses supported by the Neuroscout-CLI (e.g., group voxelwise mass-univariate GLM models), it is recommended to use the neuroscout.org web interface, or to follow the guide for programmatically creating analyses using pyNS.

Citing Neuroscout#

If you publish any results using the Neuroscout data, be sure to cite the Neuroscout paper, and corresponding datasets:

Alejandro de la Vega, Roberta Rocca, Ross W Blair, Christopher J Markiewicz, Jeff Mentch, James D Kent, Peer Herholz, Satrajit S Ghosh, Russell A Poldrack, Tal Yarkoni (2022). *Neuroscout, a unified platform for generalizable and reproducible fMRI research*. eLife 11:e79277. https://doi.org/10.7554/eLife.79277

Visconti di Oleggio Castello, M., Chauhan, V., Jiahui, G., & Gobbini, M. I. An fMRI dataset in response to “The Grand Budapest Hotel”, a socially-rich, naturalistic movie. Sci Data 7, 383 (2020). https://doi.org/10.1038/s41597-020-00735-4

If running this notebook on Google Colab, be sure to switch the runtime to GPU to expedite model fitting (Runtime -> Change runtime type)

#@title 1) Run once to install dependencies on Colab (pyNS, Datalad, and Machine Learning libraries) (~1 minute)
%%capture --no-display --no-stderr
from IPython.display import display
display("Installing NeuroDebian...")

## Set up DataLad
!wget -O- http://neuro.debian.net/lists/focal.us-ca.libre | sudo tee /etc/apt/sources.list.d/neurodebian.sources.list && sudo apt-key adv --recv-keys --keyserver hkps://keyserver.ubuntu.com 0xA5D32F012649A5A9 && sudo apt-get update
display("Installing DataLad...")

!sudo apt-get install datalad -y
%pip install -U datalad
!git config --global user.email "you@example.com" && git config --global user.name "Your Name"

display("Installing PyNS...")
%pip install pyns
%pip install git+https://github.com/bids-standard/pybids.git#egg=pybids

display("Installing analysis dependencies (nilearn, and himalaya)...")
%pip install nilearn himalaya

display("Done.")

Fetching Neuroscout Data#

For this example, we’ll build on our previous example, and focus on a single subject from the Budapest dataset, in which subjects watched the movie “The Grand Budapest Hotel”.

See the previous example to learn how to browse which datasets and features are available.

First, we define the dataset name, subject, and runs we want to analyze:

dataset_name = 'Budapest'
subject = 'sid000005'
runs = [1, 2, 3]

Next we define the names of the features we want to analyze.

We will fetch two sets of predictors: the Mel spectrogram, and Mel-Frequency Cepstral Coefficients (MFCCs).

In addition, we will also fetch six nuisance regressors (motion parameters): trans_x, trans_y, trans_z, rot_x, rot_y, and rot_z.

mfccs = [f'mfcc_{i}' for i in range(20)]
mel = [f'mel_{i}' for i in range(64)]
confounds = ['rot_x', 'rot_y', 'rot_z', 'trans_x', 'trans_y', 'trans_z']

all_vars = mfccs + mel + confounds

Next, we’ll use pyns.fetch_predictors to retrieve these predictors for the target subject, rescale them to unit variance, and resample them to the imaging data’s repetition time (TR).

Since downloading and resampling can take a minute, for this example we’ll only use the first three (out of 8) runs.

from pyns.fetch_utils import fetch_predictors, fetch_images

predictors = fetch_predictors(all_vars, dataset_name=dataset_name, subject=subject, run=runs, rescale=True, resample=True)
/home/alejandro/miniconda3/envs/nsencoding/lib/python3.10/site-packages/pyns/endpoints/base.py:135: UserWarning: No API endpoint for stimulus_id, could not convert
  warnings.warn(f"No API endpoint for {col}, could not convert")

Finally, we load the corresponding images:

# Note: This command can hang after fetching the images.
# If so, interrupt this cell and re-run it; it will complete immediately.
preproc_dir, img_objs = fetch_images(dataset_name, '/tmp/', subject=subject, run=runs)
/home/alejandro/miniconda3/envs/nsencoding/lib/python3.10/site-packages/bids/layout/validation.py:48: UserWarning: The ability to pass arguments to BIDSLayout that control indexing is likely to be removed in future; possibly as early as PyBIDS 0.14. This includes the `config_filename`, `ignore`, `force_index`, and `index_metadata` arguments. The recommended usage pattern is to initialize a new BIDSLayoutIndexer with these arguments, and pass it to the BIDSLayout via the `indexer` argument.
  warnings.warn("The ability to pass arguments to BIDSLayout that control "
/home/alejandro/miniconda3/envs/nsencoding/lib/python3.10/site-packages/bids/layout/validation.py:122: UserWarning: The PipelineDescription field was superseded by GeneratedBy in BIDS 1.4.0. You can use ``pybids upgrade`` to update your derivative dataset.
  warnings.warn("The PipelineDescription field was superseded "
action summary:
  get (notneeded: 6)

Preprocessing Images#

To preprocess the images, we must:

  1. Define a mask to apply

  2. Stack masked images into a single numpy array

This time, we will use an a priori ROI to restrict the analysis to a subset of voxels and reduce computation time.

In this case, we will use a mask of the Superior Temporal Gyrus based on the Harvard-Oxford atlas, as we hypothesize that auditory processing relevant to our predictors should occur in this area.

# Use nilearn to compute an STG mask combining the posterior and anterior divisions
import numpy as np
from nilearn import plotting, datasets, image

dataset_ho = datasets.fetch_atlas_harvard_oxford("cort-maxprob-thr0-2mm")
atlas = image.load_img(dataset_ho.maps)
labels = np.array(dataset_ho['labels'])

# Extract the label indices of both ROIs and combine them into a single binary mask
stg_p_ix = np.where(labels == 'Superior Temporal Gyrus, posterior division')[0][0]
stg_a_ix = np.where(labels == 'Superior Temporal Gyrus, anterior division')[0][0]
atlas_data = atlas.get_fdata()
stg = image.new_img_like(atlas, (atlas_data == stg_p_ix) | (atlas_data == stg_a_ix))

# Resample the STG mask to functional space, using nearest-neighbor
# interpolation to keep the mask binary
func_img = [f for f in img_objs if f.entities['suffix'] == 'bold'][0].get_image()
resampled_stg = image.resample_img(stg, func_img.affine, (78, 93, 78), interpolation='nearest')

plotting.plot_roi(resampled_stg)
<nilearn.plotting.displays._slicers.OrthoSlicer at 0x7fb70ff554b0>
../_images/688f39e774c8efde10821754038e45a578b74afc9065f3f417e7c078d0bf057e.png

Now, using this STG mask, we can mask our functional BOLD time courses and stack them into a single array:

# Only grab bold images
all_funcs = [f for f in img_objs if f.entities['suffix'] == 'bold']

# Mask images and stack into a single array
import numpy as np
import pandas as pd
from nilearn.maskers import NiftiMasker

def _mask_and_stack_images(image_objects, mask):
    """ Stack images into single array, and collect metadata entities into DataFrame """
    masker = NiftiMasker(mask_img=mask)

    arrays = []
    entities = []
    image_objects = sorted(image_objects, key=lambda x: x.entities['run'])
    for img in image_objects:
        run_y = masker.fit_transform(img)
        arrays.append(run_y)
        entities += [dict(img.entities)] * run_y.shape[0]
    entities = pd.DataFrame(entities)
    return np.vstack(arrays), entities, masker

y, img_entities, masker = _mask_and_stack_images(all_funcs, resampled_stg)

The stacked runs have shape: (n_volumes, n_voxels)

y.shape
(1631, 943)
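Before fitting, it's worth a quick sanity check that the predictors and the BOLD array are aligned in time. Assuming fetch_predictors returned one row per acquired volume for the same runs, the row counts should match:

# The (resampled) predictors should have one row per stacked BOLD volume
assert len(predictors) == y.shape[0], (len(predictors), y.shape[0])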

Fitting a banded (or grouped) ridge regression model#

To fit both the Mel spectrogram and MFCC features jointly in the same model, we must extend our previous workflow with the concept of “bands”, or groups of features.

The following workflow assumes a single pandas DataFrame that includes all of the features (X_vars). For a banded model, you additionally need to specify bands: a list of lists, where each sublist contains the names of the features in that band.

In this example, we will use the Mel spectrogram features as the first band, and the MFCC features as the second band.

In addition, we can specify a list of confounds to include in the model but exclude from scoring. This is useful if you want nuisance regressors to absorb variance during fitting without contributing to the voxelwise prediction scores.

Finally, the split parameter allows us to adjust model scores so as to disentangle the contributions of the two feature spaces. Split scoring computes the prediction for each feature space separately, and corrects the resulting scores so that the scores of the distinct feature spaces sum to the score of the full prediction. This can be helpful for understanding which feature space is better at predicting activity in a given voxel.
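To make split scoring concrete, here is a toy numpy sketch (not himalaya's exact implementation) of one such decomposition for an uncentered R² score, where each band's score is defined so that the per-band scores sum exactly to the full-model score (himalaya provides analogous split scorers, such as correlation_score_split):

import numpy as np

rng = np.random.default_rng(0)
y_true = rng.standard_normal(500)

# Toy per-band predictions; the full prediction is their sum
pred_a = 0.5 * y_true + 0.2 * rng.standard_normal(500)
pred_b = 0.2 * y_true + 0.2 * rng.standard_normal(500)
pred_full = pred_a + pred_b

# Full-model (uncentered) R^2: equals (2 y.yhat - ||yhat||^2) / ||y||^2
r2_full = 1 - np.sum((y_true - pred_full) ** 2) / np.sum(y_true ** 2)

# Split R^2 per band: yhat_i . (2 y - yhat_full) / ||y||^2
r2_a = pred_a @ (2 * y_true - pred_full) / np.sum(y_true ** 2)
r2_b = pred_b @ (2 * y_true - pred_full) / np.sum(y_true ** 2)

print(np.isclose(r2_a + r2_b, r2_full))  # True: band scores sum to the full score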

To learn more, please see the banded ridge regression example in the Gallant Lab voxelwise modeling tutorials.

The following function, _model_cv, expands on the previous example with the addition of the bands, split, and confounds parameters.

from collections import defaultdict
import numpy as np
from himalaya.scoring import correlation_score

def _model_cv(estimator, cv, X_vars, y, bands=None, confounds=None,
              groups=None, scoring=correlation_score, split=True, inner_cv=None):
    """ Cross-validate a model on a set of variables.

    Parameters
    ----------
    estimator : Scikit-learn like estimator
        Estimator to use for fitting the model. Must have a fit and predict
        method.
    cv : cross-validation generator or an iterable
        Outer cross-validation generator to use for cross-validation.
    X_vars : pandas.DataFrame
        Dataframe containing the variables to use for prediction.
    y : array-like, shape (n_samples, n_targets)
        Target values.
    bands : list of list of str, optional
        List of feature-name groups ("bands") to use for prediction. If None,
        all of X_vars will be used as a single band.
    confounds : list of str, optional
        Names of confound columns in X_vars. Confounds are fit as an
        additional band, but excluded from split predictions prior to scoring.
    groups : array-like, with shape (n_samples,), optional
        Group labels for the samples used while splitting the dataset into
        train/test set.
    scoring : callable, optional  (default: correlation_score)
        Scoring function to use for evaluating the model.
    split : bool, optional (default: True)
        If True, the prediction will be split by band.
    inner_cv : cross-validation generator or an iterable, optional
        Inner cross-validation generator to use for cross-validation.
    """

    # Container for results
    results = defaultdict(list)

    if bands is not None:
        # Build one design matrix per band
        X = [X_vars[band].values for band in bands]

        # If confounds are provided, stack them as a final band
        if confounds is not None:
            X.append(X_vars[confounds].values)
    else:
        # Single design matrix, with any confound columns stacked at the end
        if confounds is not None:
            feature_cols = [c for c in X_vars.columns if c not in confounds]
            X = np.hstack([X_vars[feature_cols].values, X_vars[confounds].values])
        else:
            X = X_vars.values

    # Extract number of samples for convenience
    n_samples = y.shape[0]

    # Loop through outer cross-validation folds
    for train, test in cv.split(np.arange(n_samples), groups=groups):
        
        # Get training model for list of model bands
        
        X_train = [x[train] for x in X] if isinstance(X, list) else X[train]
        X_test = [x[test] for x in X] if isinstance(X, list) else X[test]
        
        # Create inner cross-validation loop if specified
        if inner_cv:
            # Split inner cross-validation with groups if supplied
            inner_groups = np.array(groups)[train] if groups is not None else None
            inner_splits = inner_cv.split(np.arange(n_samples)[train],
                                            groups=inner_groups)
            
            # Update estimator with inner cross-validator
            estimator.set_params(cv=inner_splits)

        # Fit the regression model on training data
        estimator.fit(X_train, y[train])

        # Compute predictions, optionally split by band
        kwargs = {}
        if split is not None:
            kwargs['split'] = split

        test_prediction = estimator.predict(X_test, **kwargs)

        # Drop the confound band (stacked last) from the split predictions,
        # so that confounds are fit jointly but excluded from scoring
        if split and confounds is not None:
            test_prediction = test_prediction[:-1]

        # Test scores
        test_score = scoring(y[test], test_prediction)

        # If output is not an array, assume its a torch Tensor and convert to array
        if not isinstance(test_prediction, np.ndarray):
            test_prediction = test_prediction.cpu().numpy()
            test_score = test_score.cpu().numpy()
            
        # Populate results dictionary
        results['test_predictions'].append(test_prediction)
        results['test_scores'].append(test_score)
        
        
    # Combine scores into a single array
    results['test_scores'] = np.stack(results['test_scores'])

    return results

Mel spectrogram features and MFCC features are jointly fit in the same model#

Here, we will fit a banded ridge regression model, where the Mel spectrogram features and MFCC features are jointly fit in the same model. In addition, we will include a set of nuisance regressors in the model (6 motion parameters), but not use them to predict voxelwise activity.

First, we will instantiate the MultipleKernelRidgeCV estimator, cross-validation strategy (leave one run out), and extract the group labels for each observation (i.e. the run number).

from himalaya.kernel_ridge import MultipleKernelRidgeCV
from sklearn.model_selection import GroupKFold
from himalaya.scoring import correlation_score_split
from himalaya.backend import set_backend

backend = set_backend("torch_cuda", on_error="warn")

groups = predictors['run'].tolist()

# Set up estimator and CV objects
estimator = MultipleKernelRidgeCV()
n_runs = len(set(groups))
cv = GroupKFold(n_splits=n_runs)
inner_cv = GroupKFold(n_splits=n_runs - 1)
/home/alejandro/miniconda3/envs/nsencoding/lib/python3.10/site-packages/himalaya/backend/_utils.py:56: UserWarning: Setting backend to torch_cuda failed: PyTorch with CUDA is not available..Falling back to numpy backend.
  warnings.warn(f"Setting backend to {backend} failed: {str(error)}."

Next, we will fit the model:

# Set up X variables (i.e. only predictor columns, no meta-data)
X = predictors.drop(['run', 'run_id', 'subject', 'onset', 'duration'], axis=1)

# Use the split-aware scorer, since predictions are split by band
results_combined = _model_cv(
    estimator, cv, X, y, bands=[mfccs, mel], confounds=confounds,
    scoring=correlation_score_split, inner_cv=inner_cv, groups=groups
    )
[........................................] 100% | 21.88 sec | 100 random sampling with cv | 
[........................................] 100% | 21.48 sec | 100 random sampling with cv | 
[........................................] 100% | 22.67 sec | 100 random sampling with cv | 
[........................................] 100% | 50.97 sec | 100 random sampling with cv | 

The test_scores are of shape: (n_folds, n_bands, n_voxels).

results_combined['test_scores'].shape
(4, 2, 943)

We can now average across folds to get the mean scores for each band, separately.

scores_combined = results_combined['test_scores'].mean(axis=0)
scores_combined.shape
(2, 943)
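Because the band axis follows the order passed to bands ([mfccs, mel]), a quick way to ask which feature space better predicts each voxel is an argmax across bands (a hypothetical follow-up, not part of the original analysis):

# For each voxel, index of the band with the higher mean score (0 = MFCC, 1 = Mel)
best_band = np.argmax(scores_combined, axis=0)

# Fraction of STG voxels better predicted by the MFCC band
print((best_band == 0).mean())

One could likewise visualize this map with masker.inverse_transform(best_band.astype(float)).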

Next we can look at mean scores for each band separately:

# MFCC scores
scores_combined[0].max()
0.3640245096813226
from nilearn.plotting import plot_stat_map

plot_stat_map(masker.inverse_transform(scores_combined[0]), cut_coords=[-5, 0, 5], display_mode='z')
<nilearn.plotting.displays._slicers.ZSlicer at 0x7fb710403fa0>
../_images/ab59edc33348c4ac54e6b9631b9b1a6af97556b68a729f8b004f58afddd694f5.png
# Mel-spectrogram scores
scores_combined[1].max()
0.341532606934612
plot_stat_map(masker.inverse_transform(scores_combined[1]), cut_coords=[-5, 0, 5], display_mode='z')
<nilearn.plotting.displays._slicers.ZSlicer at 0x7fb710391db0>
../_images/edd2d2629fd88c6e6008b33b3f566b91cf8d4033763a674bdf28def4a31338b8.png

In this case, the two feature spaces reach comparable performance in the STG, with the MFCC band attaining a slightly higher maximum score (0.364) than the Mel spectrogram band (0.342).

What’s next?#

The goal of this tutorial is to demonstrate how a banded ridge regression model can be fit using pyns.fetch_utils and himalaya.

For a more comprehensive tutorial on the methods presented here, check out the excellent voxelwise modeling tutorials from the Gallant Lab.

Citing Neuroscout#

Please be sure to cite Neuroscout if publishing any work using these data. Also note that there are ongoing efforts to standardize the most common variations of voxelwise encoding models, in order to enable users to fully specify and register their analyses in Neuroscout (as you currently can for summary-statistic multi-level GLM models).