pipeline.data_processing.pipeline.py
Automatic Doc creation
In this example, the Inputs and Outputs tables still have to be created by hand, which is tedious. We should investigate whether this can be automated in the same way as standard docstrings.
create_pipeline(**kwargs)
Overview
The `data_processing` pipeline takes in the raw input data, preprocesses it to clean it up, and merges the three input tables into a single `model_input_table` to be used in model creation.
Inputs:
Name | Type | Description |
---|---|---|
shuttles | pandas.DataFrame | List of all shuttles |
companies | pandas.DataFrame | List of companies |
reviews | pandas.DataFrame | List of reviews |
Outputs:
Name | Type | Description |
---|---|---|
model_input_table | pandas.DataFrame | Tidied up and combined list of all shuttles with companies and reviews |
Source code in src/spaceflights/pipelines/data_processing/pipeline.py
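One possible direction for automating the hand-written Inputs and Outputs tables is to generate them from a small list of dataset descriptions. The following is a minimal sketch, not part of the project: the helper `markdown_table` and the `INPUTS` list are hypothetical names, and the rows simply repeat the Inputs table above. How the generated text would be injected into the docs build is left open.

```python
from typing import Iterable, Sequence


def markdown_table(headers: Sequence[str], rows: Iterable[Sequence[str]]) -> str:
    """Render headers and rows as a Markdown table in the style used on this page."""
    lines = [
        " | ".join(headers) + " |",
        "|".join("---" for _ in headers) + "|",
    ]
    for row in rows:
        # Guard against rows that would produce a malformed table.
        if len(row) != len(headers):
            raise ValueError(f"expected {len(headers)} cells, got {len(row)}: {row!r}")
        lines.append(" | ".join(row) + " |")
    return "\n".join(lines)


# Hypothetical dataset metadata; in practice this might be derived from
# the project's data catalog instead of being typed in here.
INPUTS = [
    ("shuttles", "pandas.DataFrame", "List of all shuttles"),
    ("companies", "pandas.DataFrame", "List of companies"),
    ("reviews", "pandas.DataFrame", "List of reviews"),
]

inputs_table = markdown_table(("Name", "Type", "Description"), INPUTS)
print(inputs_table)
```

The descriptions still have to be written once, but the table layout no longer has to be maintained by hand.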
pipeline.data_processing.nodes.py
Automatic Doc creation
Writing standard docstrings is sufficient for nodes: they are parsed by `mkdocstrings` and inserted into the main Markdown files.
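As an illustration of what mkdocstrings works from, here is a sketch of a node with a Google-style docstring (`Args:` and `Returns:` sections). The function `preprocess_reviews` and its placeholder body are hypothetical and not part of the project, and the sketch assumes the docs build is configured for Google-style docstrings.

```python
import inspect
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only needed for static analysis; the sketch itself runs without pandas.
    import pandas as pd


def preprocess_reviews(reviews: "pd.DataFrame") -> "pd.DataFrame":
    """Preprocesses the data for reviews.

    Args:
        reviews: Raw data for reviews.

    Returns:
        Preprocessed data for reviews.
    """
    # Hypothetical placeholder: real cleaning logic would go here.
    return reviews


# The docstring is plain text that a documentation tool can split into sections.
docstring = inspect.getdoc(preprocess_reviews)
print(docstring)
```

With that style, each `Args:` entry would be expected to appear as a row of a Parameters table and the `Returns:` entry as a row of a Returns table, like the ones on this page.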
create_model_input_table(shuttles, companies, reviews)
Combines all data to create a model input table.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
shuttles | pd.DataFrame | Preprocessed data for shuttles. | required |
companies | pd.DataFrame | Preprocessed data for companies. | required |
reviews | pd.DataFrame | Raw data for reviews. | required |
Returns:
Type | Description |
---|---|
pd.DataFrame | Model input table. |
Source code in src/spaceflights/pipelines/data_processing/nodes.py
preprocess_companies(companies)
Preprocesses the data for companies.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
companies | pd.DataFrame | Raw data. | required |
Returns:
Type | Description |
---|---|
pd.DataFrame | Preprocessed data, with |
Source code in src/spaceflights/pipelines/data_processing/nodes.py
preprocess_shuttles(shuttles)
Preprocesses the data for shuttles.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
shuttles | pd.DataFrame | Raw data. | required |
Returns:
Type | Description |
---|---|
pd.DataFrame | Preprocessed data, with |
Source code in src/spaceflights/pipelines/data_processing/nodes.py