Data science

pipeline.data_science.pipeline.py

Automatic Doc creation

In this example, the Inputs and Outputs tables still have to be created by hand, which is pretty tedious. It is worth investigating whether this can be automated, similar to standard docstrings.
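
One possible direction, sketched under assumptions: Kedro `Pipeline` objects expose `inputs()` and `outputs()` methods that return sets of dataset names, so the Name column could be generated rather than typed. The types and descriptions would still need a source of their own. The helper `render_dataset_table` and its `metadata` mapping below are hypothetical and not part of Kedro or mkdocstrings.

```python
from typing import Dict, Iterable, Tuple


def render_dataset_table(
    names: Iterable[str], metadata: Dict[str, Tuple[str, str]]
) -> str:
    """Render a Markdown table of datasets (hypothetical helper).

    `names` could come from `Pipeline.inputs()` or `Pipeline.outputs()`;
    `metadata` maps a dataset name to a (type, description) pair.
    """
    lines = ["| Name | Type | Description |", "| --- | --- | --- |"]
    for name in sorted(names):
        # Fall back to placeholders when no metadata is recorded.
        type_name, description = metadata.get(name, ("unknown", ""))
        lines.append(f"| `{name}` | `{type_name}` | {description} |")
    return "\n".join(lines)


inputs_table = render_dataset_table(
    {"model_input_table"},
    {
        "model_input_table": (
            "pandas.DataFrame",
            "Tidied up and combined list of all shuttles with companies and reviews",
        )
    },
)
print(inputs_table)
```

The resulting string could be appended to the docstring or written into the Markdown page at build time; which of the two fits better depends on how the docs are assembled.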

`create_pipeline(**kwargs)`

Overview

The `data_science` pipeline takes the `model_input_table`, splits it into a train and a test set, and uses `LinearRegression` to build a model that predicts flight prices. It then evaluates the model and prints the result to the log. It creates two instances of the pipeline with independent parameters: `active_modelling_pipeline` and `candidate_modelling_pipeline`.

Inputs:

| Name | Type | Description |
| --- | --- | --- |
| `model_input_table` | `pandas.DataFrame` | Tidied up and combined list of all shuttles with companies and reviews |

Outputs:

| Name | Type | Description |
| --- | --- | --- |
| `active_modelling_pipeline.regressor` | `pickle.PickleDataSet` | Active model in production |
| `candidate_modelling_pipeline.regressor` | `pickle.PickleDataSet` | Candidate model in development |

Source code in src/spaceflights/pipelines/data_science/pipeline.py

def create_pipeline(**kwargs) -> Pipeline:
    """ ## Overview

    The `data_science` pipeline takes the `model_input_table`, splits it into
    a train and a test set, and uses `LinearRegression` to build a model that
    predicts flight prices. It then evaluates the model and prints the result to the log.
    It creates two instances of the pipeline with independent parameters:
    `active_modelling_pipeline` and `candidate_modelling_pipeline`.

    ## Inputs:

    | Name                | Type               | Description                             |
    | ------------------- | ------------------ | --------------------------------------- |
    | `model_input_table` | `pandas.DataFrame` | Tidied up and combined list of all <br>shuttles with companies and reviews |



    ## Outputs:

    | Name                                     | Type                   | Description                    |
    | ---------------------------------------- | ---------------------- | ------------------------------ |
    | `active_modelling_pipeline.regressor`    | `pickle.PickleDataSet` | Active model in production     |
    | `candidate_modelling_pipeline.regressor` | `pickle.PickleDataSet` | Candidate model in development |
    """

    pipeline_instance = pipeline(
        [
            node(
                func=split_data,
                inputs=["model_input_table", "params:model_options"],
                outputs=["X_train", "X_test", "y_train", "y_test"],
                name="split_data_node",
            ),
            node(
                func=train_model,
                inputs=["X_train", "y_train"],
                outputs="regressor",
                name="train_model_node",
            ),
            node(
                func=evaluate_model,
                inputs=["regressor", "X_test", "y_test"],
                outputs=None,
                name="evaluate_model_node",
            ),
        ]
    )
    ds_pipeline_1 = pipeline(
        pipe=pipeline_instance,
        inputs="model_input_table",
        namespace="active_modelling_pipeline",
    )
    ds_pipeline_2 = pipeline(
        pipe=pipeline_instance,
        inputs="model_input_table",
        namespace="candidate_modelling_pipeline",
    )
    return pipeline(
        pipe=ds_pipeline_1 + ds_pipeline_2,
        inputs="model_input_table",
        namespace="data_science",
    )
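
The effect of wrapping one pipeline in two namespaces can be illustrated without Kedro. The sketch below mimics the naming convention as it is understood here: dataset names gain a `namespace.` prefix, parameters keep their `params:` marker in front of the prefix, and datasets listed under `inputs=` stay shared. The function `namespaced` is hypothetical and is not Kedro's implementation; it also shows only one level of namespacing.

```python
from typing import Iterable, List


def namespaced(name: str, namespace: str, keep: Iterable[str] = ()) -> str:
    """Prefix a dataset name the way a namespaced pipeline is assumed to."""
    if name in set(keep):
        # Datasets listed under `inputs=` are shared and keep their name.
        return name
    if name.startswith("params:"):
        # Parameters keep the `params:` marker in front of the namespace.
        return f"params:{namespace}.{name[len('params:'):]}"
    return f"{namespace}.{name}"


node_datasets: List[str] = [
    "model_input_table",
    "params:model_options",
    "X_train",
    "regressor",
]

for ns in ("active_modelling_pipeline", "candidate_modelling_pipeline"):
    print([namespaced(n, ns, keep=["model_input_table"]) for n in node_datasets])
```

Under this convention both instances read the same `model_input_table` but write to separate `regressor` datasets and read separate `model_options` parameters, which is what lets the two models be configured independently.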

pipeline.data_science.nodes.py

Automatic Doc creation

Writing standard docstrings is sufficient for nodes: they are parsed by mkdocstrings and inserted into the main markdown files.
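
To make the idea of parsing a docstring concrete, here is a hand-rolled sketch that pulls the `Args:` entries out of a Google-style docstring. mkdocstrings relies on its own parser for this; the function `parse_args_section` below is a hypothetical, much simplified stand-in that handles only single-line descriptions.

```python
import inspect
from typing import Dict


def parse_args_section(func) -> Dict[str, str]:
    """Collect `name: description` pairs from a Google-style `Args:` section."""
    doc = inspect.getdoc(func) or ""  # getdoc also normalises indentation
    args: Dict[str, str] = {}
    in_args = False
    for line in doc.splitlines():
        stripped = line.strip()
        if stripped == "Args:":
            in_args = True
        elif in_args and line.startswith(" ") and ":" in stripped:
            name, description = stripped.split(":", 1)
            args[name.strip()] = description.strip()
        elif in_args:
            # A blank line or a new section header ends the Args block.
            in_args = False
    return args


def train_model(X_train, y_train):
    """Trains the linear regression model.

    Args:
        X_train: Training data of independent features.
        y_train: Training data for price.

    Returns:
        Trained model.
    """


print(parse_args_section(train_model))
```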

`evaluate_model(regressor, X_test, y_test)`

Calculates and logs the coefficient of determination.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `regressor` | `LinearRegression` | Trained model. | required |
| `X_test` | `pd.DataFrame` | Testing data of independent features. | required |
| `y_test` | `pd.Series` | Testing data for price. | required |

Source code in src/spaceflights/pipelines/data_science/nodes.py

def evaluate_model(
    regressor: LinearRegression, X_test: pd.DataFrame, y_test: pd.Series
):
    """Calculates and logs the coefficient of determination.

    Args:
        regressor: Trained model.
        X_test: Testing data of independent features.
        y_test: Testing data for price.
    """
    y_pred = regressor.predict(X_test)
    score = r2_score(y_test, y_pred)
    logger = logging.getLogger(__name__)
    logger.info("Model has a coefficient R^2 of %.3f on test data.", score)
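
For readers unfamiliar with the coefficient of determination, the quantity logged above is R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. The plain-Python function `r2` below is an illustrative sketch of that formula, not a replacement for `r2_score`.

```python
from typing import Sequence


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Sketch only: a constant `y_true` makes SS_tot zero and is not handled here.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


print(r2([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect predictions
print(r2([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # always predicting the mean
```

A score of 1 means the predictions match the test prices exactly, while a score of 0 means the model does no better than always predicting the mean price.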

`split_data(data, parameters)`

Splits data into features and targets training and test sets.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data` | `pd.DataFrame` | Data containing features and target. | required |
| `parameters` | `Dict` | Parameters defined in parameters/data_science.yml. | required |

Returns:

| Type | Description |
| --- | --- |
| `Tuple` | Split data. |

Source code in src/spaceflights/pipelines/data_science/nodes.py

def split_data(data: pd.DataFrame, parameters: Dict) -> Tuple:
    """Splits data into features and targets training and test sets.

    Args:
        data: Data containing features and target.
        parameters: Parameters defined in parameters/data_science.yml.
    Returns:
        Split data.
    """
    X = data[parameters["features"]]
    y = data["price"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=parameters["test_size"], random_state=parameters["random_state"]
    )
    return X_train, X_test, y_train, y_test
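
The roles of `test_size` and `random_state` can be shown with a simplified stand-in for the split. The function `simple_split` below is hypothetical: it shuffles row indices with a seeded generator and cuts off a test fraction, which conveys the idea but is not how `train_test_split` is implemented, and its rounding may differ. The values 0.2 and 3 are example settings, not taken from the project's parameters file.

```python
import random
from typing import List, Sequence, Tuple


def simple_split(
    rows: Sequence, test_size: float, random_state: int
) -> Tuple[List, List]:
    """Shuffle row indices with a fixed seed and cut off a test fraction."""
    indices = list(range(len(rows)))
    # A dedicated, seeded generator makes the shuffle reproducible.
    random.Random(random_state).shuffle(indices)
    n_test = round(len(rows) * test_size)
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


train, test = simple_split(list(range(10)), test_size=0.2, random_state=3)
print(len(train), len(test))
```

Fixing the seed means repeated runs produce the same partition, so differences in the evaluation score come from the model or the parameters rather than from the split.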

`train_model(X_train, y_train)`

Trains the linear regression model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `X_train` | `pd.DataFrame` | Training data of independent features. | required |
| `y_train` | `pd.Series` | Training data for price. | required |

Returns:

| Type | Description |
| --- | --- |
| `LinearRegression` | Trained model. |

Source code in src/spaceflights/pipelines/data_science/nodes.py

def train_model(X_train: pd.DataFrame, y_train: pd.Series) -> LinearRegression:
    """Trains the linear regression model.

    Args:
        X_train: Training data of independent features.
        y_train: Training data for price.

    Returns:
        Trained model.
    """
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)
    return regressor
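
As a rough illustration of what fitting a linear regression involves, the sketch below computes an ordinary least squares fit for a single feature in closed form. The function `fit_line` is hypothetical and deliberately minimal; the node above works with several features at once through `LinearRegression`.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (slope, intercept) minimising the squared error for one feature."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Covariance of x and y, and variance of x, both left unnormalised.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Points that lie exactly on the line y = 2x + 1.
slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(slope, intercept)
```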