
Random Shuffle of Dataset

Write a Python function to perform a random shuffle of the samples in two numpy arrays, X and y, while maintaining the corresponding order between them. The function should have an optional seed parameter for reproducibility.

Example

    X = np.array([[1, 2], 
                  [3, 4], 
                  [5, 6], 
                  [7, 8]])
    y = np.array([1, 2, 3, 4])
    output (one possible result; the order depends on the seed):
        (array([[5, 6],
                [1, 2],
                [7, 8],
                [3, 4]]),
         array([3, 1, 4, 2]))
    

Understanding Dataset Shuffling

Randomly shuffling a dataset is a common preprocessing step in machine learning. It removes any ordering present in the raw data (for example, samples sorted by class label), which helps avoid biases that can arise from the order in which data is presented to the model.

Here's a step-by-step method to shuffle a dataset:

  1. Generate an Index Array: Create an array of indices from 0 to n - 1, one for each of the n samples in the dataset.
  2. Shuffle the Indices: Use a random number generator to shuffle the array of indices.
  3. Reorder the Dataset: Use the shuffled indices to reorder the samples in both X and y.

This method ensures that the correspondence between X and y is maintained after shuffling.
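
To see why indexing both arrays with the same index array keeps samples and labels paired, the sketch below applies a hand-picked permutation to the example data. The permutation is chosen for illustration rather than drawn from a random generator, and it happens to reproduce the example output shown earlier.

```python
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])

# Hand-picked permutation of the indices 0..3 (illustrative, not random)
idx = np.array([2, 0, 3, 1])

# Indexing both arrays with the same idx reorders them identically
X_shuffled = X[idx]
y_shuffled = y[idx]

print(X_shuffled.tolist())  # [[5, 6], [1, 2], [7, 8], [3, 4]]
print(y_shuffled.tolist())  # [3, 1, 4, 2]
```

Row [5, 6] still sits next to label 3, row [1, 2] next to label 1, and so on, because position k in both outputs comes from the same original index idx[k].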

import numpy as np

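def shuffle_data(X, y, seed=None):
    # Compare against None so that seed=0 is honored as a valid seed
    if seed is not None:
        np.random.seed(seed)
    # Shuffle the sample indices in place, then apply the same order to X and y
    idx = np.arange(X.shape[0])
    np.random.shuffle(idx)
    return X[idx], y[idx]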
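
As an alternative sketch, NumPy's Generator API can be used so that seeding does not alter the global random state. The name shuffle_data_rng is hypothetical and not part of the original solution. A given seed will generally produce a different order here than it does with np.random.seed.

```python
import numpy as np

def shuffle_data_rng(X, y, seed=None):
    # Hypothetical variant: a local Generator leaves np.random's global state untouched
    rng = np.random.default_rng(seed)
    # permutation(n) returns a shuffled copy of arange(n)
    idx = rng.permutation(X.shape[0])
    return X[idx], y[idx]

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])

Xa, ya = shuffle_data_rng(X, y, seed=0)
Xb, yb = shuffle_data_rng(X, y, seed=0)

# Same seed, same order
print(np.array_equal(Xa, Xb) and np.array_equal(ya, yb))  # True
# Each row still sits next to its original label (here, label = first column // 2 + 1)
print(np.array_equal(Xa[:, 0] // 2 + 1, ya))  # True
```

Passing seed=None to default_rng draws fresh entropy from the operating system, so unseeded calls are expected to give different orders from run to run.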
    
