Data processing

pipeline.data_processing.pipeline.py

Automatic Doc creation

In this example, the Inputs and Outputs tables still have to be written by hand, which is tedious. It is worth investigating whether this can be automated in the same way as standard docstrings.
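
One possible direction is to keep the dataset metadata in a small structure and render the Markdown table from it. The sketch below is only an illustration: `DatasetDoc` and `render_table` are hypothetical names and are not part of Kedro or mkdocstrings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetDoc:
    """Hypothetical record describing one pipeline dataset."""

    name: str
    type: str
    description: str


def render_table(rows: list[DatasetDoc]) -> str:
    """Render dataset records as a Markdown table (illustrative helper)."""
    lines = [
        "| Name | Type | Description |",
        "| --- | --- | --- |",
    ]
    lines += [f"| {r.name} | {r.type} | {r.description} |" for r in rows]
    return "\n".join(lines)


# Rows copied from the hand-written Inputs table.
inputs = [
    DatasetDoc("shuttles", "pandas.DataFrame", "List of all shuttles"),
    DatasetDoc("companies", "pandas.DataFrame", "List of companies"),
    DatasetDoc("reviews", "pandas.DataFrame", "List of reviews"),
]

inputs_table = render_table(inputs)
print(inputs_table)
```

The rendered string could then be appended to the pipeline docstring, so the table is generated from one source of truth.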

`create_pipeline(**kwargs)`

Overview

The data_processing pipeline takes in the raw input data and carries out preprocessing to clean up the data and merge the three input tables into a single model_input_table to be used in model creation.

Inputs:

Name	Type	Description
shuttles	pandas.DataFrame	List of all shuttles
companies	pandas.DataFrame	List of companies
reviews	pandas.DataFrame	List of reviews

Outputs:

Name	Type	Description
model_input_table	pandas.DataFrame	Tidied up and combined list of all shuttles with companies and reviews

Source code in src/spaceflights/pipelines/data_processing/pipeline.py

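from kedro.pipeline import Pipeline, node, pipeline

from .nodes import create_model_input_table, preprocess_companies, preprocess_shuttles


def create_pipeline(**kwargs) -> Pipeline: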
    """ ## Overview

    The `data_processing` pipeline takes in the raw input data and carries
    out preprocessing to clean up the data and merge the three input tables
    into a single `model_input_table` to be used in model creation.

    ## Inputs:

    | Name      | Type             | Description          |
    | --------- | ---------------- | -------------------- |
    | shuttles  | pandas.DataFrame | List of all shuttles |
    | companies | pandas.DataFrame | List of companies    |
    | reviews   | pandas.DataFrame | List of reviews      |



    ## Outputs:

    | Name              | Type             | Description                                                                |
    | ----------------- | ---------------- | -------------------------------------------------------------------------- |
    | model_input_table | pandas.DataFrame | Tidied up and combined list of all <br>shuttles with companies and reviews |
    """


    return pipeline(
        [
            node(
                func=preprocess_companies,
                inputs="companies",
                outputs="preprocessed_companies",
                name="preprocess_companies_node",
            ),
            node(
                func=preprocess_shuttles,
                inputs="shuttles",
                outputs="preprocessed_shuttles",
                name="preprocess_shuttles_node",
            ),
            node(
                func=create_model_input_table,
                inputs=["preprocessed_shuttles", "preprocessed_companies", "reviews"],
                outputs="model_input_table",
                name="create_model_input_table_node",
            ),
        ],
        namespace="data_processing",
        inputs=["companies", "shuttles", "reviews"],
        outputs="model_input_table",
    )
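
The Inputs and Outputs tables can be cross-checked against the node wiring: a pipeline input is any dataset that a node consumes but no node produces, and a pipeline output is any dataset that is produced but never consumed. The following is a minimal plain-Python sketch of that rule; `NODES`, `free_inputs` and `free_outputs` are hypothetical names used only for illustration. Kedro's own `Pipeline` object exposes `inputs()` and `outputs()` methods that report comparable sets.

```python
# Each tuple is (node inputs, node outputs), copied from the node() calls.
NODES = [
    (["companies"], ["preprocessed_companies"]),
    (["shuttles"], ["preprocessed_shuttles"]),
    (
        ["preprocessed_shuttles", "preprocessed_companies", "reviews"],
        ["model_input_table"],
    ),
]


def free_inputs(nodes) -> set[str]:
    """Datasets that are consumed by some node but produced by none."""
    consumed = {name for ins, _ in nodes for name in ins}
    produced = {name for _, outs in nodes for name in outs}
    return consumed - produced


def free_outputs(nodes) -> set[str]:
    """Datasets that are produced by some node but consumed by none."""
    consumed = {name for ins, _ in nodes for name in ins}
    produced = {name for _, outs in nodes for name in outs}
    return produced - consumed


print(sorted(free_inputs(NODES)))   # ['companies', 'reviews', 'shuttles']
print(sorted(free_outputs(NODES)))  # ['model_input_table']
```

The intermediate datasets `preprocessed_companies` and `preprocessed_shuttles` appear in neither set, which matches the hand-written tables.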

pipeline.data_processing.nodes.py

Automatic Doc creation

Standard docstrings are sufficient for nodes: mkdocstrings parses them and inserts the rendered documentation into the main Markdown files.
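
To give a feel for where the Parameters table columns come from, the sketch below derives name, type and default for each parameter from a function signature with the standard `inspect` module. This is an illustration only and is not how mkdocstrings is implemented; `parameter_rows` and `example_node` are hypothetical names.

```python
import inspect


def parameter_rows(func) -> list[tuple[str, str, str]]:
    """Return (name, type, default) rows for a function's parameters."""
    rows = []
    for name, param in inspect.signature(func).parameters.items():
        annotation = param.annotation
        if annotation is inspect.Parameter.empty:
            type_name = ""
        elif isinstance(annotation, str):
            type_name = annotation
        else:
            type_name = getattr(annotation, "__name__", str(annotation))
        # Parameters without a default are shown as "required".
        if param.default is inspect.Parameter.empty:
            default = "required"
        else:
            default = repr(param.default)
        rows.append((name, type_name, default))
    return rows


def example_node(
    shuttles: "pd.DataFrame", companies: "pd.DataFrame", reviews: "pd.DataFrame"
) -> "pd.DataFrame":
    """Combines all data to create a model input table."""


for row in parameter_rows(example_node):
    print(row)
```

The annotations are written as strings so the example runs without pandas installed; the descriptions in the rendered table come from the `Args:` section of the docstring rather than from the signature.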

`create_model_input_table(shuttles, companies, reviews)`

Combines all data to create a model input table.

Parameters:

Name	Type	Description	Default
`shuttles`	`pd.DataFrame`	Preprocessed data for shuttles.	required
`companies`	`pd.DataFrame`	Preprocessed data for companies.	required
`reviews`	`pd.DataFrame`	Raw data for reviews.	required

Returns:

Type	Description
`pd.DataFrame`	Model input table.

Source code in src/spaceflights/pipelines/data_processing/nodes.py

import pandas as pd


def create_model_input_table(
    shuttles: pd.DataFrame, companies: pd.DataFrame, reviews: pd.DataFrame
) -> pd.DataFrame:
    """Combines all data to create a model input table.

    Args:
        shuttles: Preprocessed data for shuttles.
        companies: Preprocessed data for companies.
        reviews: Raw data for reviews.
    Returns:
        Model input table.

    """
    rated_shuttles = shuttles.merge(reviews, left_on="id", right_on="shuttle_id")
    model_input_table = rated_shuttles.merge(
        companies, left_on="company_id", right_on="id"
    )
    model_input_table = model_input_table.dropna()
    return model_input_table
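
The effect of the two merges and the final `dropna()` is easiest to see on a few rows. The sketch below repeats the node logic so that it runs on its own; the frames hold made-up values, and `review_scores_rating` and `company_rating` are assumed column names used only for illustration.

```python
import pandas as pd


def create_model_input_table(
    shuttles: pd.DataFrame, companies: pd.DataFrame, reviews: pd.DataFrame
) -> pd.DataFrame:
    """Same logic as the node, repeated so the example is self-contained."""
    rated_shuttles = shuttles.merge(reviews, left_on="id", right_on="shuttle_id")
    model_input_table = rated_shuttles.merge(
        companies, left_on="company_id", right_on="id"
    )
    return model_input_table.dropna()


# Toy data: shuttle 3 has no review, review 4 has no shuttle,
# and the review for shuttle 2 has a missing rating.
shuttles = pd.DataFrame(
    {"id": [1, 2, 3], "company_id": [10, 20, 10], "price": [100.0, 200.0, 300.0]}
)
reviews = pd.DataFrame(
    {"shuttle_id": [1, 2, 4], "review_scores_rating": [90.0, None, 70.0]}
)
companies = pd.DataFrame({"id": [10, 20], "company_rating": [0.9, 0.8]})

table = create_model_input_table(shuttles, companies, reviews)
print(table)
```

Both merges use the default inner join, so unmatched shuttles and reviews are dropped, and `dropna()` then removes the row with the missing rating; only shuttle 1 remains. Because `shuttles` and `companies` both have an `id` column, pandas renames them to `id_x` and `id_y` in the result.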

`preprocess_companies(companies)`

Preprocesses the data for companies.

Parameters:

Name	Type	Description	Default
`companies`	`pd.DataFrame`	Raw data.	required

Returns:

Type	Description
`pd.DataFrame`	Preprocessed data, with `company_rating` converted to a float and `iata_approved` converted to boolean.

Source code in src/spaceflights/pipelines/data_processing/nodes.py

def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    """Preprocesses the data for companies.

    Args:
        companies: Raw data.
    Returns:
        Preprocessed data, with `company_rating` converted to a float and
            `iata_approved` converted to boolean.
    """
    companies["iata_approved"] = _is_true(companies["iata_approved"])
    companies["company_rating"] = _parse_percentage(companies["company_rating"])
    return companies
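
The helpers `_is_true` and `_parse_percentage` are not shown on this page. The sketch below assumes that true values are stored as the string `"t"` and that ratings are strings such as `"50%"`; both helper bodies are assumptions made for illustration, not the project's confirmed implementation.

```python
import pandas as pd


def _is_true(x: pd.Series) -> pd.Series:
    # Assumption: the raw data marks true values with the string "t".
    return x == "t"


def _parse_percentage(x: pd.Series) -> pd.Series:
    # Assumption: values look like "50%"; the result is a fraction such as 0.5.
    return x.str.replace("%", "", regex=False).astype(float) / 100


def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    """Same logic as the node, repeated so the example is self-contained."""
    companies["iata_approved"] = _is_true(companies["iata_approved"])
    companies["company_rating"] = _parse_percentage(companies["company_rating"])
    return companies


raw = pd.DataFrame(
    {"iata_approved": ["t", "f"], "company_rating": ["100%", "50%"]}
)
# The node assigns columns on the frame it receives, so pass a copy
# when the raw frame must stay unchanged.
companies = preprocess_companies(raw.copy())
print(companies)
```

Note that the node modifies the frame it is given in place, which is why the example passes `raw.copy()`.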

`preprocess_shuttles(shuttles)`

Preprocesses the data for shuttles.

Parameters:

Name	Type	Description	Default
`shuttles`	`pd.DataFrame`	Raw data.	required

Returns:

Type	Description
`pd.DataFrame`	Preprocessed data, with `price` converted to a float, and `d_check_complete` and `moon_clearance_complete` converted to boolean.

Source code in src/spaceflights/pipelines/data_processing/nodes.py

def preprocess_shuttles(shuttles: pd.DataFrame) -> pd.DataFrame:
    """Preprocesses the data for shuttles.

    Args:
        shuttles: Raw data.
    Returns:
        Preprocessed data, with `price` converted to a float, and
            `d_check_complete` and `moon_clearance_complete` converted to boolean.
    """
    shuttles["d_check_complete"] = _is_true(shuttles["d_check_complete"])
    shuttles["moon_clearance_complete"] = _is_true(shuttles["moon_clearance_complete"])
    shuttles["price"] = _parse_money(shuttles["price"])
    return shuttles
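
The helper `_parse_money` is not shown on this page. A minimal sketch, assuming prices are stored as strings such as `"$1,325.0"` (the format and the helper body are assumptions for illustration):

```python
import pandas as pd


def _parse_money(x: pd.Series) -> pd.Series:
    # Assumption: values look like "$1,325.0"; strip the currency symbol
    # and the thousands separator, then convert to float.
    cleaned = x.str.replace("$", "", regex=False).str.replace(",", "", regex=False)
    return cleaned.astype(float)


prices = _parse_money(pd.Series(["$1,325.0", "$4,500.5"]))
print(prices.tolist())  # [1325.0, 4500.5]
```

Passing `regex=False` makes the replacement literal, which matters here because `$` is a special character in regular expressions.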