Missing Data Imputation and Regression Pipeline
You are given a pandas DataFrame with numerical features and a target column. Some values in the feature columns are missing. Your task is to build a complete modeling pipeline.
**(a)** Implement a function `linear_interpolate(df, col)` that fills missing values in column `col` using linear interpolation, then backward-fills any leading NaNs that interpolation cannot reach because no earlier valid value exists.
**(b)** Implement a function `eda(df)` that prints summary statistics, missing value counts, and the correlation matrix.
**(c)** Fit a linear regression model on the cleaned data using mean squared error (MSE) as the loss. Justify the choice of MSE with a statistical argument, not merely by convention.
**(d) (Bonus)** Implement greedy forward feature selection: starting from an empty feature set, at each step add the feature that most reduces cross-validated MSE, until all features have been ranked.
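One common line of reasoning for part (c), sketched here under an assumption the problem itself does not state: suppose the targets follow a linear model with independent Gaussian noise, $y_i = w^\top x_i + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. The symbols $w$, $x_i$, $\sigma$, and $n$ are notation introduced for this sketch.

```latex
-\log L(w)
  = \frac{n}{2}\log\!\left(2\pi\sigma^{2}\right)
  + \frac{1}{2\sigma^{2}} \sum_{i=1}^{n} \left(y_i - w^{\top} x_i\right)^{2}
\quad\Longrightarrow\quad
\arg\max_{w} L(w)
  = \arg\min_{w} \frac{1}{n}\sum_{i=1}^{n} \left(y_i - w^{\top} x_i\right)^{2}
```

The first term does not depend on $w$, so under this assumption maximizing the likelihood is the same as minimizing MSE. If the noise is heavy-tailed or heteroscedastic, the argument no longer applies as written.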
**Constraints:**
- Use `pandas` for data handling and `scikit-learn` for modeling
- Use 5-fold cross-validation
- All functions must handle edge cases: all-NaN columns, single-feature DataFrames, and columns where interpolation leaves leading NaNs
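As a rough illustration of how part (d) and the 5-fold constraint fit together, here is a minimal sketch. The name `forward_select` and its return format are hypothetical choices for this example, not part of the problem statement. It assumes `X` contains no NaNs and has at least five rows.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def forward_select(X: pd.DataFrame, y: pd.Series, cv: int = 5) -> list:
    """Greedily rank features by cross-validated MSE.

    Returns a list of (feature, cv_mse_after_adding_it) in selection order.
    Hypothetical helper; assumes X has no missing values.
    """
    remaining = list(X.columns)
    selected = []
    ranking = []
    while remaining:
        scores = {}
        for feature in remaining:
            # scikit-learn reports negated MSE so that higher is better.
            neg_mse = cross_val_score(
                LinearRegression(),
                X[selected + [feature]],
                y,
                cv=cv,
                scoring="neg_mean_squared_error",
            )
            scores[feature] = -neg_mse.mean()
        # Pick the candidate with the lowest mean CV MSE at this step.
        best = min(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
        ranking.append((best, scores[best]))
    return ranking
```

A single-feature DataFrame needs no special casing here: the loop runs once and returns a one-element ranking.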
**Example:**
Input DataFrame:
```
    x1   x2    y
0  1.0  NaN  2.1
1  NaN  3.0  4.0
2  3.0  5.0  6.2
3  4.0  6.0  7.9
```
After `linear_interpolate(df, 'x1')`: the NaN at index 1 is replaced by `2.0` (linear interpolation between 1.0 and 3.0).
After `linear_interpolate(df, 'x2')`: the NaN at index 0 is replaced by `3.0` (backward fill, since there is no prior value).
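The behavior in this example can be reproduced with a short sketch built on pandas' own `Series.interpolate` and `Series.bfill`. Two choices here are assumptions rather than requirements from the problem: the function returns a modified copy instead of mutating `df`, and an all-NaN column is returned unchanged because there is nothing to interpolate from.

```python
import numpy as np
import pandas as pd


def linear_interpolate(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Return a copy of df with NaNs in `col` filled.

    Linear interpolation first, then backward fill for leading NaNs.
    """
    out = df.copy()
    if out[col].isna().all():
        # Assumed policy: an all-NaN column has no values to fill from.
        return out
    # With limit_direction="forward", leading NaNs are left untouched.
    filled = out[col].interpolate(method="linear", limit_direction="forward")
    # Backward fill covers the leading NaNs that interpolation skipped.
    out[col] = filled.bfill()
    return out


df = pd.DataFrame(
    {
        "x1": [1.0, np.nan, 3.0, 4.0],
        "x2": [np.nan, 3.0, 5.0, 6.0],
        "y": [2.1, 4.0, 6.2, 7.9],
    }
)
print(linear_interpolate(df, "x1")["x1"].tolist())  # [1.0, 2.0, 3.0, 4.0]
print(linear_interpolate(df, "x2")["x2"].tolist())  # [3.0, 3.0, 5.0, 6.0]
```

Note that `method="linear"` treats rows as equally spaced and ignores the index values, which matches the example above.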