pipeline.data_science.pipeline.py
Automatic Doc creation
In this example, the Inputs and Outputs tables still have to be created by hand, which is pretty tedious. It is worth investigating whether this can become a more automated process, similar to standard docstrings.
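One possible direction, sketched below, is to keep the dataset descriptions in a small data structure and render the table from it. This is only an illustration in plain Python: `DatasetDoc` and `render_table` are hypothetical names, not part of Kedro or mkdocstrings, and where the descriptions would come from (for example the data catalog) is left open.

```python
from typing import List, NamedTuple


class DatasetDoc(NamedTuple):
    """One row of an Inputs or Outputs table (hypothetical helper type)."""

    name: str
    type: str
    description: str


def render_table(rows: List[DatasetDoc]) -> str:
    """Render dataset rows as a Markdown table with a fixed header."""
    lines = ["| Name | Type | Description |", "| --- | --- | --- |"]
    for row in rows:
        lines.append(f"| `{row.name}` | `{row.type}` | {row.description} |")
    return "\n".join(lines)


# The single input documented for the data_science pipeline.
inputs = [
    DatasetDoc(
        "model_input_table",
        "pandas.DataFrame",
        "Tidied up and combined list of all shuttles with companies and reviews",
    ),
]

inputs_table = render_table(inputs)
print(inputs_table)
```

The rendered string could then be pasted into, or injected into, the pipeline docstring instead of being typed by hand.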
create_pipeline(**kwargs)
Overview
The `data_science` pipeline takes the `model_input_table`, splits the dataset into a train and a test set, and uses `LinearRegression` to build a model that predicts flight prices. It then evaluates the model and prints the result to the log. Two instances of the pipeline are created with independent parameters: `active_modelling_pipeline` and `candidate_modelling_pipeline`.
Inputs:
Name | Type | Description |
---|---|---|
`model_input_table` | `pandas.DataFrame` | Tidied up and combined list of all shuttles with companies and reviews |
Outputs:
Name | Type | Description |
---|---|---|
`active_modelling_pipeline.regressor` | `pickle.PickleDataSet` | Active model in production |
`candidate_modelling_pipeline.regressor` | `pickle.PickleDataSet` | Candidate model in development |
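The output names follow a namespace-prefix pattern: each pipeline instance gets its own `regressor`, while `model_input_table` is shared. The sketch below is a simplified stand-in in plain Python for that naming rule, not Kedro's API; `namespaced`, `shared_inputs`, and `template_datasets` are hypothetical names used only for illustration.

```python
from typing import Dict, List, Set


def namespaced(name: str, namespace: str, shared: Set[str]) -> str:
    """Prefix a dataset name with the namespace unless it is shared."""
    return name if name in shared else f"{namespace}.{name}"


# Assumption: model_input_table is shared, regressor is per instance.
shared_inputs = {"model_input_table"}
template_datasets = ["model_input_table", "regressor"]

instances: Dict[str, List[str]] = {
    ns: [namespaced(d, ns, shared_inputs) for d in template_datasets]
    for ns in ("active_modelling_pipeline", "candidate_modelling_pipeline")
}

for ns, datasets in instances.items():
    print(ns, "->", datasets)
```

Both instances read the same input table but write their trained model to a distinct dataset name, which is why two `regressor` rows appear in the Outputs table.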
Source code in src/spaceflights/pipelines/data_science/pipeline.py
pipeline.data_science.nodes.py
Automatic Doc creation
Writing standard docstrings is enough for nodes: they are parsed by mkdocstrings and inserted into the main Markdown files.
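As an illustration of the kind of docstring meant here, the sketch below uses the Google docstring style, which the mkdocstrings Python handler can parse. The function `scale_price` is hypothetical and not part of this project; it only shows the `Args:` and `Returns:` sections that end up as the Parameters and Returns tables.

```python
def scale_price(price: float, factor: float = 1.0) -> float:
    """Scales a price by a constant factor.

    Args:
        price: Original price.
        factor: Multiplier applied to the price.

    Returns:
        Scaled price.
    """
    return price * factor


# The docstring is plain text attached to the function object.
print(scale_price.__doc__)
```

With type annotations in the signature, the types do not need to be repeated in the docstring; a parameter without a default value is listed as required.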
evaluate_model(regressor, X_test, y_test)
Calculates and logs the coefficient of determination.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`regressor` | `LinearRegression` | Trained model. | required |
`X_test` | `pd.DataFrame` | Testing data of independent features. | required |
`y_test` | `pd.Series` | Testing data for price. | required |
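For reference, the coefficient of determination is R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations from the mean of the true values. The sketch below computes it in plain Python; it is not the project's implementation (which is not shown on this page), and `r_squared` is a hypothetical helper name.

```python
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Raises ZeroDivisionError if all true values are identical.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Perfect predictions give 1.0; always predicting the mean gives 0.0.
perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
baseline = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
print(perfect, baseline)
```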
Source code in src/spaceflights/pipelines/data_science/nodes.py
split_data(data, parameters)
Splits data into features and targets training and test sets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`data` | `pd.DataFrame` | Data containing features and target. | required |
`parameters` | `Dict` | Parameters defined in parameters/data_science.yml. | required |
Returns:
Type | Description |
---|---|
`Tuple` | Split data. |
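The row-level part of such a split can be sketched in plain Python as a seeded shuffle followed by a cut. This is an assumption-laden illustration, not the node's actual code: `split_rows` is a hypothetical helper, and `test_size` and `random_state` are assumed parameter names that may differ from the keys in parameters/data_science.yml. Separating features from the target is omitted.

```python
import random
from typing import List, Sequence, Tuple


def split_rows(
    rows: Sequence[int], test_size: float, random_state: int
) -> Tuple[List[int], List[int]]:
    """Shuffle row indices with a fixed seed and cut off a test share."""
    indices = list(range(len(rows)))
    random.Random(random_state).shuffle(indices)
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(10))
train, test = split_rows(rows, test_size=0.2, random_state=3)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible between runs, which matters when two pipeline instances are compared against each other.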
Source code in src/spaceflights/pipelines/data_science/nodes.py
train_model(X_train, y_train)
Trains the linear regression model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`X_train` | `pd.DataFrame` | Training data of independent features. | required |
`y_train` | `pd.Series` | Training data for price. | required |
Returns:
Type | Description |
---|---|
`LinearRegression` | Trained model. |
Source code in src/spaceflights/pipelines/data_science/nodes.py