Missing Data Imputation and Regression Pipeline

Machine Learning · Medium · Free problem

You are given a pandas DataFrame with numerical features and a target column. Some values in the feature columns are missing. Your task is to build a complete modeling pipeline.

**(a)** Implement a function `linear_interpolate(df, col)` that fills missing values in column `col` using linear interpolation, with backward fill as a fallback for any remaining NaNs at the edges.

**(b)** Implement a function `eda(df)` that prints summary statistics, missing value counts, and the correlation matrix.

**(c)** Fit a linear regression model on the cleaned data using MSE as the loss. Justify why MSE is the right loss function here, not just as a convention but as a statistical argument.

**(d) (Bonus)** Implement greedy forward feature selection: starting from an empty feature set, at each step add the feature that most improves cross-validated MSE, until all features have been ranked.

**Constraints:**

- Use `pandas` for data handling and `scikit-learn` for modeling
- Cross-validation should use 5-fold CV
- All functions must handle edge cases: all-NaN columns, single-feature DataFrames, and columns where interpolation leaves leading NaNs

**Example:**

Input DataFrame:

```
    x1   x2    y
0  1.0  NaN  2.1
1  NaN  3.0  4.0
2  3.0  5.0  6.2
3  4.0  6.0  7.9
```

After `linear_interpolate(df, 'x1')`: the NaN at index 1 is replaced by `2.0` (linear interpolation between 1.0 and 3.0).

After `linear_interpolate(df, 'x2')`: the NaN at index 0 is replaced by `3.0` (backward fill, since there is no prior value).
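As an illustration of part (a), here is a minimal sketch of one way `linear_interpolate` might look; it is not the reference solution. Several choices are assumptions the problem statement does not make: the function returns a copy rather than modifying `df` in place, an all-NaN column is returned unchanged, and trailing NaNs (which backward fill cannot reach) are forward-filled.

```python
import numpy as np
import pandas as pd


def linear_interpolate(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Return a copy of `df` with NaNs in `col` filled.

    Interior gaps are filled by linear interpolation; leading NaNs fall
    back to backward fill, as the problem statement requires.
    """
    out = df.copy()
    s = out[col]

    # Edge case: nothing to interpolate from.
    # Assumption: leave the column as-is and let the caller decide.
    if s.isna().all():
        return out

    # Fill only NaNs that have valid values on both sides.
    s = s.interpolate(method="linear", limit_area="inside")

    # Leading NaNs have no earlier value, so use the next valid one.
    s = s.bfill()

    # Assumption: trailing NaNs are not covered by the statement;
    # here they take the last valid value.
    s = s.ffill()

    out[col] = s
    return out


# The example from the problem statement.
df = pd.DataFrame(
    {
        "x1": [1.0, np.nan, 3.0, 4.0],
        "x2": [np.nan, 3.0, 5.0, 6.0],
        "y": [2.1, 4.0, 6.2, 7.9],
    }
)
df = linear_interpolate(df, "x1")  # index 1 becomes 2.0
df = linear_interpolate(df, "x2")  # index 0 becomes 3.0
print(df)
```

Note that `method="linear"` treats rows as equally spaced and ignores the index values, which matches the example above but may not suit irregularly sampled data.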
