## Generate Random Subsets of a Dataset

Write a Python function to generate random subsets of a given dataset. The function should take a 2D NumPy array X, a 1D NumPy array y, an integer n_subsets, and a boolean replacements. It should return a list of n_subsets random subsets of the dataset, where each subset is a tuple (X_subset, y_subset). If replacements is True, each subset is sampled with replacement; otherwise, it is sampled without replacement.

#### Example

```python
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
y = np.array([1, 2, 3, 4, 5])
n_subsets = 3
replacements = False
get_random_subsets(X, y, n_subsets, replacements)
```

Output:

```
[array([[7, 8], [1, 2]]), array([4, 1])]
[array([[9, 10], [5, 6]]), array([5, 3])]
[array([[3, 4], [5, 6]]), array([2, 3])]
```

Reasoning: The function generates three random subsets of the dataset without replacement. Because replacements=False, each subset contains floor(5 / 2) = 2 samples, and no sample appears twice within a subset. Each row of X_subset stays paired with its label in y_subset.

## Understanding Random Subsets of a Dataset

Generating random subsets of a dataset is a useful technique in machine learning, particularly in ensemble methods like bagging and random forests. By creating random subsets, models can be trained on different parts of the data, which helps in reducing overfitting and improving generalization.

In this problem, you will write a function to generate random subsets of a given dataset. Given a 2D numpy array X, a 1D numpy array y, an integer n_subsets, and a boolean replacements, the function will create a list of n_subsets random subsets. Each subset will be a tuple of (X_subset, y_subset).

If replacements is True, each subset is sampled with replacement, so the same sample can appear more than once in a subset. In this case the subset has the same size as the original dataset. If replacements is False, each subset is sampled without replacement, so no sample repeats within a subset. In this case the subset size is the floor of half the original dataset size (n // 2 for a dataset with n samples).
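The size rule and the two sampling modes can be sketched on indices alone. The helper name `sample_indices`, the seed, and the use of NumPy's `default_rng` generator are assumptions made for this illustration, not part of the problem statement.

```python
import numpy as np

# Hypothetical helper for illustration; not part of the problem statement.
def sample_indices(n_samples, replacements, rng):
    """Return row indices for one subset under the size rule described above."""
    subset_size = n_samples if replacements else n_samples // 2
    return rng.choice(n_samples, size=subset_size, replace=replacements)

rng = np.random.default_rng(0)  # the seed value is an arbitrary choice

with_repl = sample_indices(5, True, rng)
without_repl = sample_indices(5, False, rng)

print(len(with_repl))     # 5: same size as the dataset, duplicates possible
print(len(without_repl))  # 2: floor(5 / 2), all indices distinct
```

Sampling indices first, rather than rows, is what lets the same selection be applied to both X and y so that features and labels stay aligned.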

By understanding and implementing this technique, you can enhance the performance of your models through techniques like bootstrapping and ensemble learning.
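As one concrete use of with-replacement subsets, the sketch below bootstraps the standard error of a sample mean. The synthetic data, the number of resamples, and the seed are all made up for illustration; only the resampling idea comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
data = rng.normal(loc=10.0, scale=2.0, size=200)  # synthetic sample, made up for this sketch

n_subsets = 1000
n = data.shape[0]

# Each row of idx is one bootstrap subset: n indices drawn with replacement.
idx = rng.integers(0, n, size=(n_subsets, n))
boot_means = data[idx].mean(axis=1)

# The spread of the resampled means approximates the standard error of the mean.
boot_se = boot_means.std(ddof=1)
analytic_se = data.std(ddof=1) / np.sqrt(n)
print(boot_se, analytic_se)
```

The two printed values should be close, which is the usual sanity check that the resampling behaves as expected.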

```python
import numpy as np

def get_random_subsets(X, y, n_subsets, replacements=True, seed=42):
    # Seed the global NumPy RNG so the subsets are reproducible.
    np.random.seed(seed)
    n = X.shape[0]
    # Full-size subsets with replacement; half-size (floored) without.
    subset_size = n if replacements else n // 2
    # One row of sampled indices per subset.
    idx = np.array([np.random.choice(n, subset_size, replace=replacements)
                    for _ in range(n_subsets)])
    # Convert all ndarrays to lists.
    return [(X[idx[i]].tolist(), y[idx[i]].tolist()) for i in range(n_subsets)]
```
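The solution above seeds NumPy's global generator and returns nested lists. A possible variant, sketched here under the assumption that the caller prefers NumPy arrays and a local generator, uses `np.random.default_rng`. The name `get_random_subsets_rng` is hypothetical and chosen only to keep it distinct from the solution.

```python
import numpy as np

# Hypothetical variant for illustration; not the reference solution.
def get_random_subsets_rng(X, y, n_subsets, replacements=True, seed=None):
    """Return a list of (X_subset, y_subset) array tuples without global seeding."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    subset_size = n_samples if replacements else n_samples // 2
    subsets = []
    for _ in range(n_subsets):
        idx = rng.choice(n_samples, size=subset_size, replace=replacements)
        # The same indices select rows of X and entries of y, so pairs stay aligned.
        subsets.append((X[idx], y[idx]))
    return subsets

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
y = np.array([1, 2, 3, 4, 5])
subsets = get_random_subsets_rng(X, y, n_subsets=3, replacements=False, seed=0)
for X_sub, y_sub in subsets:
    print(X_sub.shape, y_sub.shape)  # (2, 2) (2,) for each subset
```

Using a local generator means calling the function does not change the random state seen by the rest of the program, which can matter when several components draw random numbers.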
