pipeline.data_processing.pipeline.py

Automatic Doc creation

In this example the Inputs and Outputs tables still have to be written by hand, which is tedious. It is worth investigating whether they can be generated automatically, in the same way as standard docstrings.
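One possible direction, shown here only as a minimal sketch: build the table from a list of dataset names plus a small metadata mapping. The helper `markdown_io_table` and the `DATASETS` mapping are hypothetical names introduced for illustration; in a Kedro project the names could presumably be taken from the pipeline object's `inputs()` and `outputs()` methods, while the types and descriptions would still need a source of their own.

```python
def markdown_io_table(names, dataset_info):
    """Render a Markdown table with Name, Type and Description columns.

    `dataset_info` is a hypothetical mapping: dataset name -> (type, description).
    Names missing from the mapping get a placeholder type and empty description.
    """
    header = "| Name | Type | Description |"
    divider = "| --- | --- | --- |"
    rows = []
    for name in names:
        type_name, description = dataset_info.get(name, ("unknown", ""))
        rows.append(f"| {name} | {type_name} | {description} |")
    return "\n".join([header, divider, *rows])


# Assumed metadata, copied from the hand-written tables on this page.
DATASETS = {
    "shuttles": ("pandas.DataFrame", "List of all shuttles"),
    "companies": ("pandas.DataFrame", "List of companies"),
    "reviews": ("pandas.DataFrame", "List of reviews"),
}

inputs_table = markdown_io_table(["shuttles", "companies", "reviews"], DATASETS)
print(inputs_table)
```

The resulting string could then be appended to the `create_pipeline` docstring before the documentation is built, although how best to hook that into the build is left open here.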

create_pipeline(**kwargs)

Overview

The data_processing pipeline takes in the raw input data and carries out preprocessing to clean up the data and merge the 3 input tables into a single model_input_table to be used in model creation.

Inputs:

| Name      | Type             | Description          |
| --------- | ---------------- | -------------------- |
| shuttles  | pandas.DataFrame | List of all shuttles |
| companies | pandas.DataFrame | List of companies    |
| reviews   | pandas.DataFrame | List of reviews      |

Outputs:

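| Name              | Type             | Description                                                            |
| ----------------- | ---------------- | ---------------------------------------------------------------------- |
| model_input_table | pandas.DataFrame | Tidied up and combined list of all shuttles with companies and reviews |
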
Source code in src/spaceflights/pipelines/data_processing/pipeline.py
def create_pipeline(**kwargs) -> Pipeline:
    """## Overview

    The `data_processing` pipeline takes in the raw input data and carries
    out preprocessing to clean up the data and merge the 3 input tables into a
    single `model_input_table` to be used in model creation.

    ## Inputs:

    | Name      | Type             | Description          |
    | --------- | ---------------- | -------------------- |
    | shuttles  | pandas.DataFrame | List of all shuttles |
    | companies | pandas.DataFrame | List of companies    |
    | reviews   | pandas.DataFrame | List of reviews      |

    ## Outputs:

    | Name              | Type             | Description                                                                |
    | ----------------- | ---------------- | -------------------------------------------------------------------------- |
    | model_input_table | pandas.DataFrame | Tidied up and combined list of all <br>shuttles with companies and reviews |
    """
    return pipeline(
        [
            node(
                func=preprocess_companies,
                inputs="companies",
                outputs="preprocessed_companies",
                name="preprocess_companies_node",
            ),
            node(
                func=preprocess_shuttles,
                inputs="shuttles",
                outputs="preprocessed_shuttles",
                name="preprocess_shuttles_node",
            ),
            node(
                func=create_model_input_table,
                inputs=["preprocessed_shuttles", "preprocessed_companies", "reviews"],
                outputs="model_input_table",
                name="create_model_input_table_node",
            ),
        ],
        namespace="data_processing",
        inputs=["companies", "shuttles", "reviews"],
        outputs="model_input_table",
    )

pipeline.data_processing.nodes.py

Automatic Doc creation

Writing standard docstrings is sufficient for nodes: they are parsed by mkdocstrings and inserted into the main Markdown files.
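To give a rough idea of what "parsed" means here, the sketch below pulls the `Args:` entries out of a Google-style docstring. It is an illustration only: mkdocstrings uses its own docstring parser, and the helper `parse_args_section` is a hypothetical name, not part of any library. The function at the bottom is a stub that reuses the docstring of one of the nodes documented on this page.

```python
import inspect


def parse_args_section(func):
    """Return {parameter: description} from a Google-style ``Args:`` section.

    Illustration only; this is not how mkdocstrings is implemented.
    """
    doc = inspect.getdoc(func) or ""  # getdoc() removes the common indentation
    params = {}
    in_args = False
    entry_indent = None
    current = None
    for line in doc.splitlines():
        if line.strip() == "Args:":
            in_args = True
            continue
        if not in_args or not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        if indent == 0:
            break  # an unindented line such as "Returns:" ends the section
        if entry_indent is None:
            entry_indent = indent
        if indent == entry_indent and ":" in line:
            # A new "name: description" entry.
            name, _, description = line.strip().partition(":")
            current = name.strip()
            params[current] = description.strip()
        elif current is not None:
            # A more deeply indented line continues the previous description.
            params[current] += " " + line.strip()
    return params


def create_model_input_table(shuttles, companies, reviews):
    """Combines all data to create a model input table.

    Args:
        shuttles: Preprocessed data for shuttles.
        companies: Preprocessed data for companies.
        reviews: Raw data for reviews.
    Returns:
        Model input table.
    """


parsed = parse_args_section(create_model_input_table)
print(parsed)
```

Each key of the resulting dictionary corresponds to one row of a rendered Parameters table.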

create_model_input_table(shuttles, companies, reviews)

Combines all data to create a model input table.

Parameters:

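| Name      | Type         | Description                      | Default  |
| --------- | ------------ | -------------------------------- | -------- |
| shuttles  | pd.DataFrame | Preprocessed data for shuttles.  | required |
| companies | pd.DataFrame | Preprocessed data for companies. | required |
| reviews   | pd.DataFrame | Raw data for reviews.            | required |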

Returns:

| Type         | Description        |
| ------------ | ------------------ |
| pd.DataFrame | Model input table. |

Source code in src/spaceflights/pipelines/data_processing/nodes.py
def create_model_input_table(
    shuttles: pd.DataFrame, companies: pd.DataFrame, reviews: pd.DataFrame
) -> pd.DataFrame:
    """Combines all data to create a model input table.

    Args:
        shuttles: Preprocessed data for shuttles.
        companies: Preprocessed data for companies.
        reviews: Raw data for reviews.
    Returns:
        Model input table.

    """
    rated_shuttles = shuttles.merge(reviews, left_on="id", right_on="shuttle_id")
    model_input_table = rated_shuttles.merge(
        companies, left_on="company_id", right_on="id"
    )
    model_input_table = model_input_table.dropna()
    return model_input_table

preprocess_companies(companies)

Preprocesses the data for companies.

Parameters:

| Name      | Type         | Description | Default  |
| --------- | ------------ | ----------- | -------- |
| companies | pd.DataFrame | Raw data.   | required |

Returns:

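| Type         | Description                                                                                            |
| ------------ | ------------------------------------------------------------------------------------------------------ |
| pd.DataFrame | Preprocessed data, with `company_rating` converted to a float and `iata_approved` converted to boolean. |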

Source code in src/spaceflights/pipelines/data_processing/nodes.py
def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    """Preprocesses the data for companies.

    Args:
        companies: Raw data.
    Returns:
        Preprocessed data, with `company_rating` converted to a float and
            `iata_approved` converted to boolean.
    """
    companies["iata_approved"] = _is_true(companies["iata_approved"])
    companies["company_rating"] = _parse_percentage(companies["company_rating"])
    return companies

preprocess_shuttles(shuttles)

Preprocesses the data for shuttles.

Parameters:

| Name     | Type         | Description | Default  |
| -------- | ------------ | ----------- | -------- |
| shuttles | pd.DataFrame | Raw data.   | required |

Returns:

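| Type         | Description                                                                                                                  |
| ------------ | ---------------------------------------------------------------------------------------------------------------------------- |
| pd.DataFrame | Preprocessed data, with `price` converted to a float and `d_check_complete` and `moon_clearance_complete` converted to boolean. |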

Source code in src/spaceflights/pipelines/data_processing/nodes.py
def preprocess_shuttles(shuttles: pd.DataFrame) -> pd.DataFrame:
    """Preprocesses the data for shuttles.

    Args:
        shuttles: Raw data.
    Returns:
        Preprocessed data, with `price` converted to a float and
            `d_check_complete` and `moon_clearance_complete` converted to boolean.
    """
    shuttles["d_check_complete"] = _is_true(shuttles["d_check_complete"])
    shuttles["moon_clearance_complete"] = _is_true(shuttles["moon_clearance_complete"])
    shuttles["price"] = _parse_money(shuttles["price"])
    return shuttles