pipeline.data_science.pipeline.py

Automatic Doc creation

In this example, the Inputs and Outputs tables still have to be created by hand, which is pretty tedious. It is worth investigating whether this can be automated in a way similar to standard docstrings.
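
One possible direction, sketched here under assumptions: if the dataset names, types, and descriptions are available as plain data (for example, names taken from the pipeline object and types from the data catalog), a small helper could render the Markdown tables. `render_io_table` and its row format are hypothetical and are not part of Kedro or mkdocstrings.

```python
from typing import Iterable, Tuple

Row = Tuple[str, str, str]  # (name, type, description)


def render_io_table(rows: Iterable[Row]) -> str:
    """Render (name, type, description) rows as a Markdown table."""
    lines = [
        "| Name | Type | Description |",
        "| ---- | ---- | ----------- |",
    ]
    for name, type_name, description in rows:
        lines.append(f"| `{name}` | `{type_name}` | {description} |")
    return "\n".join(lines)


# Hypothetical row; in a real project this data would have to be collected first.
inputs_table = render_io_table(
    [("model_input_table", "pandas.DataFrame", "Combined table of shuttles, companies and reviews")]
)
print(inputs_table)
```

The descriptions would still need a home (for instance, metadata next to each catalog entry), so this only removes the table formatting work, not the writing.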

create_pipeline(**kwargs)

Overview

The data_science pipeline takes the model_input_table, splits it into a train and a test set, and uses LinearRegression to build a model that predicts flight prices. It then evaluates the model and writes the result to the log. Two instances of the pipeline are created with independent parameters: active_modelling_pipeline and candidate_modelling_pipeline.

Inputs:

| Name | Type | Description |
| ---- | ---- | ----------- |
| model_input_table | pandas.DataFrame | Tidied up and combined list of all shuttles with companies and reviews |

Outputs:

| Name | Type | Description |
| ---- | ---- | ----------- |
| active_modelling_pipeline.regressor | pickle.PickleDataSet | Active model in production |
| candidate_modelling_pipeline.regressor | pickle.PickleDataSet | Candidate model in development |

Source code in src/spaceflights/pipelines/data_science/pipeline.py
def create_pipeline(**kwargs) -> Pipeline:
    """ ## Overview

    The `data_science` pipeline takes the `model_input_table`, splits it into
    a train and a test set, and uses `LinearRegression` to build a model that
    predicts flight prices. It then evaluates the model and writes the result to the log.
    Two instances of the pipeline are created with independent parameters:
    `active_modelling_pipeline` and `candidate_modelling_pipeline`.

    ## Inputs:

    | Name                | Type               | Description                             |
    | ------------------- | ------------------ | --------------------------------------- |
    | `model_input_table` | `pandas.DataFrame` | Tidied up and combined list of all <br>shuttles with companies and reviews |



    **Outputs:**

    | Name                                   | Type                 | Description                             |
    | -------------------------------------- | -------------------- | --------------------------------------- |
    | `active_modelling_pipeline.regressor`    | `pickle.PickleDataSet` |  Active model in production             |
    | `candidate_modelling_pipeline.regressor` | `pickle.PickleDataSet` |  Candidate model in development         |
    """

    pipeline_instance = pipeline(
        [
            node(
                func=split_data,
                inputs=["model_input_table", "params:model_options"],
                outputs=["X_train", "X_test", "y_train", "y_test"],
                name="split_data_node",
            ),
            node(
                func=train_model,
                inputs=["X_train", "y_train"],
                outputs="regressor",
                name="train_model_node",
            ),
            node(
                func=evaluate_model,
                inputs=["regressor", "X_test", "y_test"],
                outputs=None,
                name="evaluate_model_node",
            ),
        ]
    )
    ds_pipeline_1 = pipeline(
        pipe=pipeline_instance,
        inputs="model_input_table",
        namespace="active_modelling_pipeline",
    )
    ds_pipeline_2 = pipeline(
        pipe=pipeline_instance,
        inputs="model_input_table",
        namespace="candidate_modelling_pipeline",
    )
    return pipeline(
        pipe=ds_pipeline_1 + ds_pipeline_2,
        inputs="model_input_table",
        namespace="data_science",
    )
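
The effect of `namespace` in the code above can be illustrated with a simplified model. This is a sketch under assumptions, not Kedro's implementation: dataset names are prefixed with the namespace, names passed via `inputs` stay shared, and parameters keep their `params:` prefix. `apply_namespace` is a hypothetical helper.

```python
from typing import Iterable, List


def apply_namespace(names: Iterable[str], namespace: str, keep: Iterable[str] = ()) -> List[str]:
    """Prefix dataset names with a namespace, leaving names in `keep` untouched.

    Simplified model of namespacing; not Kedro's implementation.
    """
    kept = set(keep)
    result = []
    for name in names:
        if name in kept:
            # Shared inputs are not renamed.
            result.append(name)
        elif name.startswith("params:"):
            # Parameters keep the "params:" prefix; the namespace goes after it.
            result.append(f"params:{namespace}.{name[len('params:'):]}")
        else:
            result.append(f"{namespace}.{name}")
    return result


names = ["model_input_table", "params:model_options", "X_train", "regressor"]
active = apply_namespace(names, "active_modelling_pipeline", keep=["model_input_table"])
candidate = apply_namespace(names, "candidate_modelling_pipeline", keep=["model_input_table"])
print(active)
print(candidate)
```

Under this model each instance gets its own `regressor` and its own `model_options`, while both read the same `model_input_table`. Wrapping the result in a further namespace would prepend another prefix in the same way.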

pipeline.data_science.nodes.py

Automatic Doc creation

Writing standard docstrings is enough for nodes: mkdocstrings parses them and inserts the result into the main Markdown files.
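
To make the idea concrete, here is a minimal sketch of how a Google-style `Args:` section can be turned into structured data. It is only an illustration of the principle; mkdocstrings uses its own, far more complete parser. `parse_google_args` and `example_node` are hypothetical names.

```python
import inspect
from typing import Dict


def parse_google_args(func) -> Dict[str, str]:
    """Extract {argument: description} from the Args section of a Google-style docstring."""
    doc = inspect.getdoc(func) or ""  # getdoc removes the common indentation
    args: Dict[str, str] = {}
    in_args = False
    for line in doc.splitlines():
        stripped = line.strip()
        if stripped == "Args:":
            in_args = True
            continue
        if in_args:
            # A blank line or a non-indented line ends the Args section.
            if not stripped or not line.startswith(" "):
                break
            name, _, description = stripped.partition(":")
            args[name.strip()] = description.strip()
    return args


def example_node(data, parameters):
    """Splits data into features and targets training and test sets.

    Args:
        data: Data containing features and target.
        parameters: Parameters defined in parameters/data_science.yml.
    Returns:
        Split data.
    """


print(parse_google_args(example_node))
```

This sketch handles single-line descriptions only; multi-line descriptions and type annotations are left out on purpose.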

evaluate_model(regressor, X_test, y_test)

Calculates and logs the coefficient of determination.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| regressor | LinearRegression | Trained model. | required |
| X_test | pd.DataFrame | Testing data of independent features. | required |
| y_test | pd.Series | Testing data for price. | required |

Source code in src/spaceflights/pipelines/data_science/nodes.py
def evaluate_model(
    regressor: LinearRegression, X_test: pd.DataFrame, y_test: pd.Series
):
    """Calculates and logs the coefficient of determination.

    Args:
        regressor: Trained model.
        X_test: Testing data of independent features.
        y_test: Testing data for price.
    """
    y_pred = regressor.predict(X_test)
    score = r2_score(y_test, y_pred)
    logger = logging.getLogger(__name__)
    logger.info("Model has a coefficient R^2 of %.3f on test data.", score)
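
For reference, the coefficient of determination that `evaluate_model` logs can be computed by hand as 1 - SS_res / SS_tot. The sketch below uses plain Python with made-up numbers; `r2` is a hypothetical helper and not the scikit-learn implementation, which also handles edge cases such as a constant target.

```python
from typing import Sequence


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes y_true is not constant (otherwise SS_tot is zero).
    """
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean) ** 2 for t in y_true)  # total sum of squares
    return 1.0 - ss_res / ss_tot


# Made-up values for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
score = r2(y_true, y_pred)
print(f"R^2 = {score:.3f}")  # R^2 = 0.800
```

A perfect prediction gives 1.0, and always predicting the mean of `y_true` gives 0.0.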

split_data(data, parameters)

Splits data into features and targets training and test sets.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| data | pd.DataFrame | Data containing features and target. | required |
| parameters | Dict | Parameters defined in parameters/data_science.yml. | required |

Returns:

| Type | Description |
| ---- | ----------- |
| Tuple | Split data. |

Source code in src/spaceflights/pipelines/data_science/nodes.py
def split_data(data: pd.DataFrame, parameters: Dict) -> Tuple:
    """Splits data into features and targets training and test sets.

    Args:
        data: Data containing features and target.
        parameters: Parameters defined in parameters/data_science.yml.
    Returns:
        Split data.
    """
    X = data[parameters["features"]]
    y = data["price"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=parameters["test_size"], random_state=parameters["random_state"]
    )
    return X_train, X_test, y_train, y_test
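
The roles of `test_size` and `random_state` can be shown with a plain-Python stand-in. This is a swapped-in, simplified technique (seeded shuffle, then slicing), not scikit-learn's `train_test_split` algorithm, and the parameter values are hypothetical rather than taken from parameters/data_science.yml.

```python
import random
from typing import List, Sequence, Tuple


def split_rows(rows: Sequence, test_size: float, random_state: int) -> Tuple[List, List]:
    """Shuffle rows with a fixed seed, then hold out a fraction as the test set."""
    shuffled = list(rows)
    random.Random(random_state).shuffle(shuffled)  # same seed, same order
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical values for illustration only.
parameters = {"test_size": 0.2, "random_state": 3}
train, test = split_rows(range(10), parameters["test_size"], parameters["random_state"])
print(len(train), len(test))  # 8 2
```

Fixing `random_state` makes the split reproducible, so repeated runs evaluate the model on the same held-out rows.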

train_model(X_train, y_train)

Trains the linear regression model.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| X_train | pd.DataFrame | Training data of independent features. | required |
| y_train | pd.Series | Training data for price. | required |

Returns:

| Type | Description |
| ---- | ----------- |
| LinearRegression | Trained model. |

Source code in src/spaceflights/pipelines/data_science/nodes.py
def train_model(X_train: pd.DataFrame, y_train: pd.Series) -> LinearRegression:
    """Trains the linear regression model.

    Args:
        X_train: Training data of independent features.
        y_train: Training data for price.

    Returns:
        Trained model.
    """
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)
    return regressor
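
As a rough picture of what fitting a linear regression means, the single-feature case has a closed-form least-squares solution. The sketch below is a plain-Python illustration with made-up data; `fit_line` is a hypothetical helper, and the real node fits a multi-feature `LinearRegression` from scikit-learn.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of y = slope * x + intercept for a single feature."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data lying exactly on y = 2x + 1.
slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(slope, intercept)
```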