Data Manipulation Tutorial
This notebook provides examples of how to carry out data manipulation and aggregation using the post_processing Python library. Be sure to go through the Quick Start section of the documentation for instructions on how to access and import the library and its packages.
If you would like to open an editable, runnable version of the tutorial, click here to be directed to a Binder platform.
The library is still under active development, and empty sections will be completed in due time.
All files are available in the GitHub repository here.
Requirements
The conda environment contains all the libraries required by the post-processing library. After setting up the conda environment, you only have to import the data manipulation module from postprocessinglib.evaluation.
In this example, though, I will also import other modules to help generate the data that I will be analysing.
[2]:
import pandas as pd
from postprocessinglib.evaluation import data
GENERATE DATAFRAMES
This is the main overarching function that returns the required data in the formats used by the other modules and functions in the library. In its simplest form, it requires a single csv file containing both the measured (observed) and predicted (simulated) data, referred to as the merged data and formatted as shown below:
merged data
| Some datetime | station1_obs | station1_sim | station2_obs | station2_sim |
|---|---|---|---|---|
or two csv files, one for the observed and one for the simulated data, similarly formatted as shown below:
obs data
| Some datetime | station1_obs | station2_obs |
|---|---|---|
sim data
| Some datetime | station1_sim | station2_sim |
|---|---|---|
We then pass these into the generate_dataframes function as shown below:
[3]:
# passing a small controlled csv file with only two stations for testing
path = "MESH_output_streamflow_1.csv"
# csv_fpaths is used to pass the merged csv file
DATAFRAMES = data.generate_dataframes(csv_fpaths=path)
The start date for the Data is 1980-01-01
By default, the function returns a dictionary that contains three dataframes: the merged dataframe, the observed dataframe and the simulated dataframe, stored under the keys DF, DF_OBSERVED and DF_SIMULATED respectively, as demonstrated below:
[4]:
print("The Merged dataframe:")
print(DATAFRAMES["DF"].head(5))
print("\nThe Observed dataframe:")
print(DATAFRAMES["DF_OBSERVED"].head(5))
print("\nThe Simulated dataframe:")
print(DATAFRAMES["DF_SIMULATED"].head(5))
The Merged dataframe:
QOMEAS_05BB001 QOSIM_05BB001 QOMEAS_05BA001 QOSIM_05BA001
1980-01-01 15.0 940.934800 NaN 695.375700
1980-01-02 15.0 1000.471000 NaN 233.646000
1980-01-03 13.3 303.156100 NaN 5.282405
1980-01-04 11.8 13.658420 NaN 0.481292
1980-01-05 12.9 2.095571 NaN 0.298271
The Observed dataframe:
QOMEAS_05BB001 QOMEAS_05BA001
1980-01-01 15.0 NaN
1980-01-02 15.0 NaN
1980-01-03 13.3 NaN
1980-01-04 11.8 NaN
1980-01-05 12.9 NaN
The Simulated dataframe:
QOSIM_05BB001 QOSIM_05BA001
1980-01-01 940.934800 695.375700
1980-01-02 1000.471000 233.646000
1980-01-03 303.156100 5.282405
1980-01-04 13.658420 0.481292
1980-01-05 2.095571 0.298271
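To make the relationship between the three dataframes concrete, the split can be approximated with plain pandas. This is a minimal sketch and not the library's implementation: it assumes a merged file whose columns alternate observed, simulated for each station, and the column names and the three rows of values (copied from the output above) are for illustration only.

```python
import io

import pandas as pd

# A tiny merged file in the layout described earlier: one datetime column,
# then alternating observed/simulated columns for each station.
csv_text = """date,station1_obs,station1_sim,station2_obs,station2_sim
1980-01-01,15.0,940.9348,,695.3757
1980-01-02,15.0,1000.471,,233.646
1980-01-03,13.3,303.1561,,5.282405
"""

merged = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

# Assumption: observed columns sit at even positions, simulated at odd ones.
observed = merged.iloc[:, 0::2]
simulated = merged.iloc[:, 1::2]

print(observed)
print(simulated)
```

Empty cells in the csv are read as NaN, which is why the second station's observed column is entirely NaN here, just as in the tutorial data.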
You can also tell the function to skip the first few rows by passing a value to the warm_up parameter. You can likewise specify a start and end date using the start_date and end_date parameters. These are useful when you want only a fixed period, such as a particular year, or everything before or after a particular date. A few examples are shown below:
[5]:
# assuming the simulation model needs 366 days (the first year) to warm up and account for errors during the learning phase.
DATAFRAMES = data.generate_dataframes(csv_fpaths=path, warm_up=366)
The start date for the Data is 1981-01-01
Observe that the data now skips the entire first year (1980 is a leap year, so it has 366 days) and starts from 1981-01-01 instead of 1980-01-01.
[6]:
DATAFRAMES_till2009 = data.generate_dataframes(csv_fpaths=path, warm_up=366, end_date='2009-12-31')
print("\nThe End of the Merged dataframe:")
print(DATAFRAMES_till2009["DF"].tail(5))
The start date for the Data is 1981-01-01
The End of the Merged dataframe:
QOMEAS_05BB001 QOSIM_05BB001 QOMEAS_05BA001 QOSIM_05BA001
2009-12-27 NaN 4.114114 NaN 0.815359
2009-12-28 NaN 4.091105 NaN 0.810912
2009-12-29 NaN 4.068261 NaN 0.806497
2009-12-30 NaN 4.045577 NaN 0.802113
2009-12-31 NaN 4.023057 NaN 0.797758
Notice how it ends on 2009-12-31, as specified.
[7]:
DATAFRAMES_from1995 = data.generate_dataframes(csv_fpaths=path, warm_up=366, start_date='1995-01-01')
print("\nThe Start of the Observed dataframe:")
print(DATAFRAMES_from1995["DF_OBSERVED"].head(5))
The start date for the Data is 1995-01-01
The Start of the Observed dataframe:
QOMEAS_05BB001 QOMEAS_05BA001
1995-01-01 8.37 NaN
1995-01-02 10.10 NaN
1995-01-03 12.20 NaN
1995-01-04 13.00 NaN
1995-01-05 13.20 NaN
Observe that the data now starts in 1995.
[8]:
DATAFRAMES_January2010 = data.generate_dataframes(csv_fpaths=path, warm_up=366, start_date='2010-01-01' , end_date='2010-1-31')
print("\nThe Start of the Merged dataframe:")
print(DATAFRAMES_January2010["DF"].head(5))
print("\nThe End of the Merged dataframe:")
print(DATAFRAMES_January2010["DF"].tail(5))
The start date for the Data is 2010-01-01
The Start of the Merged dataframe:
QOMEAS_05BB001 QOSIM_05BB001 QOMEAS_05BA001 QOSIM_05BA001
2010-01-01 NaN 4.000698 NaN 0.793435
2010-01-02 NaN 3.978494 NaN 0.789141
2010-01-03 NaN 3.956450 NaN 0.784877
2010-01-04 NaN 3.934558 NaN 0.780643
2010-01-05 NaN 3.912824 NaN 0.776437
The End of the Merged dataframe:
QOMEAS_05BB001 QOSIM_05BB001 QOMEAS_05BA001 QOSIM_05BA001
2010-01-27 NaN 3.471175 NaN 0.690854
2010-01-28 NaN 3.452648 NaN 0.687258
2010-01-29 NaN 3.434252 NaN 0.683687
2010-01-30 NaN 3.415977 NaN 0.680139
2010-01-31 NaN 3.397829 NaN 0.676615
Observe that the data now starts on January 1st 2010 and ends on January 31st 2010, as specified.
Note that when both a warm-up period and a start date are specified and the start date falls within the warm-up period, the start date is pushed forward to the end of the warm-up period, because the warm_up parameter takes precedence. For instance:
[9]:
# Here we set our start date to be somewhere within the 366-day warm-up period. It will get overridden and the data
# will start from the end of the warm-up period
DATAFRAMES_from_June_1980 = data.generate_dataframes(csv_fpaths=path, warm_up=366, start_date='1980-06-01')
print("\nThe Start of the Predicted Data: ")
print(DATAFRAMES_from_June_1980["DF_SIMULATED"].head(5))
The start date for the Data is 1981-01-01
The Start of the Predicted Data:
QOSIM_05BB001 QOSIM_05BA001
1981-01-01 2.518999 1.001954
1981-01-02 2.507289 0.997078
1981-01-03 2.495637 0.992233
1981-01-04 2.484073 0.987417
1981-01-05 2.472571 0.982631
As you can observe, it starts from 1981 despite specifying a start date of June 1st 1980!
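The precedence of warm_up over start_date can be reproduced with a few lines of pandas. The helper below, `trim_period`, is a hypothetical name used only for this sketch; it assumes the warm-up is applied as a count of rows before any date filtering, which is consistent with the outputs above but is not taken from the library's source.

```python
import numpy as np
import pandas as pd


def trim_period(df, warm_up=0, start_date=None, end_date=None):
    """Drop the first `warm_up` rows, then keep start_date..end_date (inclusive)."""
    after_warm_up = df.iloc[warm_up:]
    # Label slicing on a sorted DatetimeIndex is inclusive at both ends and
    # cannot bring back rows already removed by the warm-up.
    return after_warm_up.loc[start_date:end_date]


index = pd.date_range("1980-01-01", "1982-12-31", freq="D")
df = pd.DataFrame({"sim": np.arange(len(index), dtype=float)}, index=index)

# A start date inside the warm-up period is effectively ignored.
inside_warm_up = trim_period(df, warm_up=366, start_date="1980-06-01")
print(inside_warm_up.index[0])

# A start date after the warm-up period is honoured.
after_warm_up = trim_period(df, warm_up=366, start_date="1982-01-01")
print(after_warm_up.index[0])
```

Because the rows are dropped first, no extra comparison between the start date and the warm-up end is needed; the later of the two wins automatically.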
The three dataframes (merged, observed and simulated) form the backbone of the library. Every other function in the library uses at least one of these three dataframes to perform its analysis, whether visual, descriptive or diagnostic.
DAILY AGGREGATION
This function returns the daily aggregate of the data passed into it. It aggregates using the method passed or, if none is given, defaults to the mean. Most of the data already comes with daily time stamps, so this is one of the less frequently used functions. Its functionality is shown below:
[10]:
data.daily_aggregate(df=DATAFRAMES["DF_MERGED"])
[10]:
| | Station1 QOMEAS | Station1 QOSIM1 | Station2 QOMEAS | Station2 QOSIM1 |
|---|---|---|---|---|
| 1981/001 | 9.85 | 2.518999 | NaN | 1.001954 |
| 1981/002 | 10.20 | 2.507289 | NaN | 0.997078 |
| 1981/003 | 10.00 | 2.495637 | NaN | 0.992233 |
| 1981/004 | 10.10 | 2.484073 | NaN | 0.987417 |
| 1981/005 | 9.99 | 2.472571 | NaN | 0.982631 |
| ... | ... | ... | ... | ... |
| 2017/361 | NaN | 4.418050 | NaN | 1.380227 |
| 2017/362 | NaN | 4.393084 | NaN | 1.372171 |
| 2017/363 | NaN | 4.368303 | NaN | 1.364174 |
| 2017/364 | NaN | 4.343699 | NaN | 1.356237 |
| 2017/365 | NaN | 4.319275 | NaN | 1.348359 |
13514 rows × 4 columns
It returns a dataframe with one row per day, labelled by year and day of the year (for example 1981/001), where the day of the year runs from 1 to 365 or 366.
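Daily aggregation only changes the values when the input is sub-daily. As a rough illustration, the sketch below groups six-hourly readings by a year/day-of-year label using plain pandas. It is an assumption about how such an aggregation can be written, not the library's code, and the `flow` column is made up.

```python
import numpy as np
import pandas as pd

# Eight readings taken every six hours, i.e. two full days of data.
index = pd.Timestamp("1981-01-01") + pd.to_timedelta(np.arange(8) * 6, unit="h")
df = pd.DataFrame({"flow": np.arange(8, dtype=float)}, index=index)

# Group by a "year/day-of-year" label and aggregate each group.
labels = df.index.strftime("%Y/%j")
daily_mean = df.groupby(labels).mean()
daily_max = df.groupby(labels).agg("max")

print(daily_mean)
print(daily_max)
```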
WEEKLY AGGREGATION
This function returns the weekly aggregate of the data passed into it. It aggregates using the method passed or, if none is given, defaults to the mean. Its functionality is shown below:
[11]:
data.weekly_aggregate(df=DATAFRAMES["DF_MERGED"]) # default method of aggregation is mean
[11]:
| | Station1 QOMEAS | Station1 QOSIM1 | Station2 QOMEAS | Station2 QOSIM1 |
|---|---|---|---|---|
| 1980-12-29 | 10.037500 | 2.501499 | NaN | 0.994671 |
| 1981-01-05 | 9.244286 | 2.438589 | NaN | 0.968506 |
| 1981-01-12 | 8.461429 | 2.361289 | NaN | 0.936405 |
| 1981-01-19 | 8.345714 | 2.287077 | NaN | 0.905643 |
| 1981-01-26 | 8.461429 | 2.215803 | NaN | 0.876150 |
| ... | ... | ... | ... | ... |
| 2017-11-27 | NaN | 5.165260 | NaN | 1.621172 |
| 2017-12-04 | NaN | 4.958076 | NaN | 1.554878 |
| 2017-12-11 | NaN | 4.760181 | NaN | 1.490809 |
| 2017-12-18 | NaN | 4.572103 | NaN | 1.429983 |
| 2017-12-25 | NaN | 4.393448 | NaN | 1.372291 |
1931 rows × 4 columns
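It returns a dataframe with one row per week. In the output above, each week is labelled with the date of its Monday, and the first row (1980-12-29) covers only the four days from 1981-01-01 to 1981-01-04 because the data starts on 1981-01-01.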
[12]:
data.weekly_aggregate(df=DATAFRAMES["DF_MERGED"], method='sum') # here we aggregate by summing up all the values of the week.
[12]:
| | Station1 QOMEAS | Station1 QOSIM1 | Station2 QOMEAS | Station2 QOSIM1 |
|---|---|---|---|---|
| 1980-12-29 | 40.15 | 10.005998 | 0.0 | 3.978683 |
| 1981-01-05 | 64.71 | 17.070122 | 0.0 | 6.779542 |
| 1981-01-12 | 59.23 | 16.529020 | 0.0 | 6.554838 |
| 1981-01-19 | 58.42 | 16.009540 | 0.0 | 6.339498 |
| 1981-01-26 | 59.23 | 15.510619 | 0.0 | 6.133052 |
| ... | ... | ... | ... | ... |
| 2017-11-27 | 0.00 | 36.156822 | 0.0 | 11.348205 |
| 2017-12-04 | 0.00 | 34.706534 | 0.0 | 10.884148 |
| 2017-12-11 | 0.00 | 33.321266 | 0.0 | 10.435665 |
| 2017-12-18 | 0.00 | 32.004719 | 0.0 | 10.009884 |
| 2017-12-25 | 0.00 | 30.754137 | 0.0 | 9.606034 |
1931 rows × 4 columns
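Two details of the weekly outputs above are worth reproducing by hand: the Monday labels, and the fact that summing a week with no observations gives 0.0 rather than NaN. The pandas sketch below shows both under stated assumptions. It uses the first five observed values of 1981 from the outputs above plus a made-up all-NaN column, and it is not the library's implementation.

```python
import numpy as np
import pandas as pd

index = pd.date_range("1981-01-01", periods=5, freq="D")
df = pd.DataFrame(
    {
        "obs": [9.85, 10.20, 10.00, 10.10, 9.99],  # observed values shown earlier
        "all_nan": [np.nan] * 5,  # a station with no observations
    },
    index=index,
)

# Label every row with the Monday that starts its Monday-to-Sunday week.
week_start = df.index.to_period("W").start_time

weekly_mean = df.groupby(week_start).mean()
# pandas skips NaN when summing, so a week with no data sums to 0.0 ...
weekly_sum = df.groupby(week_start).sum()
# ... unless at least one valid value is required per group.
weekly_sum_strict = df.groupby(week_start).sum(min_count=1)

print(weekly_mean)
print(weekly_sum)
print(weekly_sum_strict)
```

The first weekly mean of `obs` reproduces the 10.0375 shown in the tutorial output. If zeros for missing weeks would be misleading in your analysis, treat them with care when aggregating by sum.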
YEARLY AGGREGATION
This function returns the yearly aggregate of the data passed into it. It aggregates using the method passed or, if none is given, defaults to the mean. Its functionality is shown below:
[13]:
data.yearly_aggregate(df=DATAFRAMES["DF_MERGED"]) # default method of aggregation is mean
[13]:
| | Station1 QOMEAS | Station1 QOSIM1 | Station2 QOMEAS | Station2 QOSIM1 |
|---|---|---|---|---|
| 1981-01 | 8.815806 | 2.352880 | NaN | 0.932961 |
| 1981-02 | 7.780000 | 2.060198 | NaN | 0.811975 |
| 1981-03 | 6.896129 | 1.812920 | NaN | 0.710498 |
| 1981-04 | 7.981333 | 3.132911 | NaN | 0.821663 |
| 1981-05 | 43.538710 | 50.736276 | 10.111333 | 15.202072 |
| ... | ... | ... | ... | ... |
| 2017-08 | NaN | 32.222317 | NaN | 17.704763 |
| 2017-09 | NaN | 28.141430 | NaN | 14.315134 |
| 2017-10 | NaN | 7.698483 | NaN | 2.615914 |
| 2017-11 | NaN | 5.625516 | NaN | 1.770196 |
| 2017-12 | NaN | 4.712923 | NaN | 1.475527 |
444 rows × 4 columns
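It is intended to return a dataframe indexed by year, from the first year in your data to the last. Note, however, that the outputs captured in this section are indexed by month (444 rows, from 1981-01 to 2017-12), and the mean values match the monthly aggregation shown in the next section.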
[14]:
data.yearly_aggregate(df=DATAFRAMES["DF_MERGED"], method='median') # here we aggregate by finding the median of the values each year.
[14]:
| | Station1 QOMEAS | Station1 QOSIM1 | Station2 QOMEAS | Station2 QOSIM1 |
|---|---|---|---|---|
| 1981-01 | 8.590 | 2.350375 | NaN | 0.931876 |
| 1981-02 | 7.825 | 2.058540 | NaN | 0.811263 |
| 1981-03 | 6.860 | 1.811243 | NaN | 0.709789 |
| 1981-04 | 7.400 | 1.720303 | NaN | 0.634707 |
| 1981-05 | 15.800 | 5.115736 | 5.165 | 1.589602 |
| ... | ... | ... | ... | ... |
| 2017-08 | NaN | 31.761380 | NaN | 17.606670 |
| 2017-09 | NaN | 25.176370 | NaN | 12.297040 |
| 2017-10 | NaN | 6.744768 | NaN | 2.161909 |
| 2017-11 | NaN | 5.622982 | NaN | 1.767848 |
| 2017-12 | NaN | 4.705044 | NaN | 1.472965 |
444 rows × 4 columns
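For comparison, an aggregation to one value per calendar year can be sketched in plain pandas as follows. This only illustrates the idea of yearly grouping on a toy series and makes no claim about how the library labels or computes its yearly output.

```python
import pandas as pd

index = pd.date_range("1981-01-01", "1983-12-31", freq="D")
# Toy series: every value in a given year equals that year's number.
df = pd.DataFrame({"sim": index.year.to_numpy(dtype=float)}, index=index)

# One group per calendar year, taken from the datetime index.
yearly_mean = df.groupby(df.index.year).mean()
yearly_median = df.groupby(df.index.year).median()
yearly_count = df.groupby(df.index.year).count()

print(yearly_mean)
print(yearly_count)
```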
MONTHLY AGGREGATION
This function returns the monthly aggregate of the data passed into it. It aggregates using the method passed or, if none is given, defaults to the mean. Its functionality is shown below:
[15]:
data.monthly_aggregate(df=DATAFRAMES["DF_MERGED"]) # default method of aggregation is mean
[15]:
| | Station1 QOMEAS | Station1 QOSIM1 | Station2 QOMEAS | Station2 QOSIM1 |
|---|---|---|---|---|
| 1981-01 | 8.815806 | 2.352880 | NaN | 0.932961 |
| 1981-02 | 7.780000 | 2.060198 | NaN | 0.811975 |
| 1981-03 | 6.896129 | 1.812920 | NaN | 0.710498 |
| 1981-04 | 7.981333 | 3.132911 | NaN | 0.821663 |
| 1981-05 | 43.538710 | 50.736276 | 10.111333 | 15.202072 |
| ... | ... | ... | ... | ... |
| 2017-08 | NaN | 32.222317 | NaN | 17.704763 |
| 2017-09 | NaN | 28.141430 | NaN | 14.315134 |
| 2017-10 | NaN | 7.698483 | NaN | 2.615914 |
| 2017-11 | NaN | 5.625516 | NaN | 1.770196 |
| 2017-12 | NaN | 4.712923 | NaN | 1.475527 |
444 rows × 4 columns
It returns a dataframe with one row per month, labelled by year and month (for example 1981-01), where the month runs from 01 to 12.
[16]:
data.monthly_aggregate(df=DATAFRAMES["DF_MERGED"], method='max') # here we aggregate by returning the maximum value that month
[16]:
| | Station1 QOMEAS | Station1 QOSIM1 | Station2 QOMEAS | Station2 QOSIM1 |
|---|---|---|---|---|
| 1981-01 | 10.20 | 2.518999 | NaN | 1.001954 |
| 1981-02 | 8.51 | 2.186006 | NaN | 0.863836 |
| 1981-03 | 7.50 | 1.931955 | NaN | 0.759228 |
| 1981-04 | 15.30 | 17.241560 | NaN | 3.692734 |
| 1981-05 | 165.00 | 220.485800 | 36.6 | 97.054240 |
| ... | ... | ... | ... | ... |
| 2017-08 | NaN | 41.657390 | NaN | 21.751910 |
| 2017-09 | NaN | 116.866500 | NaN | 53.925540 |
| 2017-10 | NaN | 19.746100 | NaN | 8.157081 |
| 2017-11 | NaN | 6.091304 | NaN | 1.927400 |
| 2017-12 | NaN | 5.134585 | NaN | 1.611400 |
444 rows × 4 columns
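The way a method name selects the aggregation can be mimicked with a pandas groupby over year-month periods. The `monthly` helper below is hypothetical and only approximates the behaviour shown above. It assumes the method is a name that pandas itself understands, such as "mean", "max" or "sum".

```python
import numpy as np
import pandas as pd

index = pd.date_range("1981-01-01", "1981-03-31", freq="D")
df = pd.DataFrame({"sim": np.arange(len(index), dtype=float)}, index=index)


def monthly(df, method="mean"):
    """Aggregate each calendar month with the named pandas method."""
    months = df.index.to_period("M")  # e.g. 1981-01, 1981-02, ...
    return df.groupby(months).agg(method)


print(monthly(df))         # default: mean of each month
print(monthly(df, "max"))  # maximum value of each month
```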
STATISTICS AGGREGATION
This allows us to calculate an aggregate across all the simulations in the dataframe for every datetime index. It aggregates using the method passed or, if none is given, defaults to the mean. Unlike the aggregation functions above, it can also perform quantile calculations. Its functionality is shown below:
[17]:
data.stat_aggregate(df=DATAFRAMES["DF_MERGED"], method='max')
[17]:
| | Station1 MAX | Station2 MAX |
|---|---|---|
| 1981-01-01 | 2.518999 | 1.001954 |
| 1981-01-02 | 2.507289 | 0.997078 |
| 1981-01-03 | 2.495637 | 0.992233 |
| 1981-01-04 | 2.484073 | 0.987417 |
| 1981-01-05 | 2.472571 | 0.982631 |
| ... | ... | ... |
| 2017-12-27 | 4.418050 | 1.380227 |
| 2017-12-28 | 4.393084 | 1.372171 |
| 2017-12-29 | 4.368303 | 1.364174 |
| 2017-12-30 | 4.343699 | 1.356237 |
| 2017-12-31 | 4.319275 | 1.348359 |
13514 rows × 2 columns
[18]:
data.stat_aggregate(df=DATAFRAMES["DF_MERGED"], method='q75')
[18]:
| | Station1 Q75 | Station2 Q75 |
|---|---|---|
| 1981-01-01 | 2.518999 | 1.001954 |
| 1981-01-02 | 2.507289 | 0.997078 |
| 1981-01-03 | 2.495637 | 0.992233 |
| 1981-01-04 | 2.484073 | 0.987417 |
| 1981-01-05 | 2.472571 | 0.982631 |
| ... | ... | ... |
| 2017-12-27 | 4.418050 | 1.380227 |
| 2017-12-28 | 4.393084 | 1.372171 |
| 2017-12-29 | 4.368303 | 1.364174 |
| 2017-12-30 | 4.343699 | 1.356237 |
| 2017-12-31 | 4.319275 | 1.348359 |
13514 rows × 2 columns
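In the outputs above, MAX and Q75 are identical because the example file holds a single simulation (QOSIM1) per station, so every statistic across simulations equals that one series. The sketch below uses two made-up simulations per station to show the statistics diverging. The helper `stat_across_sims` and the QOSIM2 columns are assumptions for illustration and are not part of the library.

```python
import pandas as pd

index = pd.date_range("1981-01-01", periods=3, freq="D")
columns = pd.MultiIndex.from_tuples(
    [
        ("Station1", "QOMEAS"), ("Station1", "QOSIM1"), ("Station1", "QOSIM2"),
        ("Station2", "QOMEAS"), ("Station2", "QOSIM1"), ("Station2", "QOSIM2"),
    ]
)
df = pd.DataFrame(
    [
        [9.0, 1.0, 3.0, 5.0, 2.0, 4.0],
        [9.0, 2.0, 6.0, 5.0, 3.0, 3.0],
        [9.0, 4.0, 8.0, 5.0, 1.0, 5.0],
    ],
    index=index,
    columns=columns,
)


def stat_across_sims(df, method="mean"):
    """Aggregate the simulated columns of each station, row by row."""
    out = {}
    for station in df.columns.get_level_values(0).unique():
        block = df[station]
        # Keep only the simulated series; observed data is left out.
        sims = block.loc[:, block.columns.str.startswith("QOSIM")]
        if method.lower().startswith("q"):
            # "q75" is read as the 0.75 quantile.
            fraction = float(method[1:]) / 100
            out[(station, method.upper())] = sims.quantile(fraction, axis=1)
        else:
            out[(station, method.upper())] = sims.agg(method, axis=1)
    return pd.DataFrame(out)


print(stat_across_sims(df, "max"))
print(stat_across_sims(df, "q75"))
```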
PERIODIC/SEASONAL AGGREGATION
This allows us to return a specific period of time for every year, or for a select few years, within a data set. It essentially lets you analyse a season or period of each year without having to look through every day. For example, say we want to isolate what the streamflow was like on the first two days of January every year; we would write:
[19]:
data.seasonal_period(df=DATAFRAMES["DF"], daily_period=('01-01', '01-02'))
[19]:
| | QOMEAS_05BB001 | QOSIM_05BB001 | QOMEAS_05BA001 | QOSIM_05BA001 |
|---|---|---|---|---|
| 1981-01-01 | 9.85 | 2.518999 | NaN | 1.001954 |
| 1981-01-02 | 10.20 | 2.507289 | NaN | 0.997078 |
| 1982-01-01 | 7.17 | 5.465301 | NaN | 2.429704 |
| 1982-01-02 | 7.02 | 5.433753 | NaN | 2.414755 |
| 1983-01-01 | 8.98 | 5.371416 | NaN | 2.441398 |
| ... | ... | ... | ... | ... |
| 2015-01-02 | NaN | 6.944578 | NaN | 1.615503 |
| 2016-01-01 | NaN | 3.686424 | NaN | 1.005240 |
| 2016-01-02 | NaN | 3.666536 | NaN | 0.999701 |
| 2017-01-01 | NaN | 2.700768 | NaN | 0.872387 |
| 2017-01-02 | NaN | 2.687161 | NaN | 0.867826 |
74 rows × 4 columns
Observe that for every year, we only get the first two days of the year. We can also request specific years instead of returning every year. For example, we can get the predicted values for the start of summer (June 21st to June 28th) for the years 1999, 2001 and 2005 as shown below:
[20]:
data.seasonal_period(df=DATAFRAMES["DF_SIMULATED"], daily_period=('06-21', '06-28'), years=[1999, 2001, 2005])
[20]:
| | QOSIM_05BB001 | QOSIM_05BA001 |
|---|---|---|
| 1999-06-21 | 155.64250 | 127.26980 |
| 1999-06-22 | 147.19820 | 74.31406 |
| 1999-06-23 | 80.06104 | 34.51432 |
| 1999-06-24 | 45.07025 | 38.66659 |
| 1999-06-25 | 129.59160 | 104.75280 |
| 1999-06-26 | 156.33070 | 51.86972 |
| 1999-06-27 | 166.94590 | 62.61042 |
| 1999-06-28 | 85.17047 | 27.95974 |
| 2001-06-21 | 42.08820 | 32.47492 |
| 2001-06-22 | 44.51981 | 29.40924 |
| 2001-06-23 | 41.09985 | 29.90357 |
| 2001-06-24 | 42.13543 | 39.85459 |
| 2001-06-25 | 94.21543 | 135.35250 |
| 2001-06-26 | 127.62240 | 35.24305 |
| 2001-06-27 | 39.27969 | 39.95149 |
| 2001-06-28 | 77.72211 | 67.53999 |
| 2005-06-21 | 51.65529 | 24.16283 |
| 2005-06-22 | 51.92735 | 28.55426 |
| 2005-06-23 | 94.62647 | 35.51184 |
| 2005-06-24 | 40.85018 | 16.88546 |
| 2005-06-25 | 90.53637 | 109.68720 |
| 2005-06-26 | 189.87700 | 46.47357 |
| 2005-06-27 | 67.25498 | 43.57297 |
| 2005-06-28 | 157.66590 | 80.75896 |
As you can see, we get the June 21st to June 28th period for just those three years.
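The selection logic can be approximated with a month-day mask in pandas. The function `select_period` below is a hypothetical stand-in written for this sketch. It assumes the period is given as inclusive "MM-DD" strings and does not wrap around the end of the year.

```python
import numpy as np
import pandas as pd


def select_period(df, daily_period, years=None):
    """Keep rows whose month-day lies within daily_period (inclusive)."""
    start, end = daily_period
    month_day = df.index.strftime("%m-%d")
    # Zero-padded "MM-DD" strings sort in calendar order, so plain string
    # comparison works as long as the period does not wrap past December 31st.
    mask = np.asarray((month_day >= start) & (month_day <= end), dtype=bool)
    if years is not None:
        mask = mask & np.asarray(df.index.year.isin(years), dtype=bool)
    return df[mask]


index = pd.date_range("1981-01-01", "2017-12-31", freq="D")
df = pd.DataFrame({"sim": np.arange(len(index), dtype=float)}, index=index)

first_two_days = select_period(df, ("01-01", "01-02"))
early_summer = select_period(df, ("06-21", "06-28"), years=[1999, 2001, 2005])

print(len(first_two_days))
print(len(early_summer))
```

The row counts agree with the tutorial outputs: two days for each of the 37 years, and eight days for each of the three selected years.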
LONG TERM AGGREGATION
This allows us to compute the long-term seasonal aggregate values of a given DataFrame by applying the specified aggregation method to each day of the year across all years in the provided time period. The resulting data is aggregated into a single year (days 1 to 365/366). This way we are able to see how the models perform year in, year out compared to the actual recorded data, with both aggregated in the same way. An example is shown below:
[21]:
data.long_term_seasonal(df=DATAFRAMES["DF"]) # As usual the default aggregation method is mean/average
[21]:
| jday | QOMEAS_05BB001 | QOSIM_05BB001 | QOMEAS_05BA001 | QOSIM_05BA001 |
|---|---|---|---|---|
| 1 | 9.446471 | 4.037666 | NaN | 1.130686 |
| 2 | 9.428125 | 4.014474 | NaN | 1.123915 |
| 3 | 9.660625 | 3.991451 | NaN | 1.117196 |
| 4 | 9.804375 | 3.968602 | NaN | 1.110529 |
| 5 | 9.787500 | 3.945921 | NaN | 1.103913 |
| ... | ... | ... | ... | ... |
| 362 | 9.942500 | 4.188140 | NaN | 1.169614 |
| 363 | 9.695000 | 4.163847 | NaN | 1.162533 |
| 364 | 9.633125 | 4.139735 | NaN | 1.155507 |
| 365 | 9.516875 | 4.115805 | NaN | 1.148535 |
| 366 | 9.870000 | 4.433329 | NaN | 1.191653 |
366 rows × 4 columns
We are also able to calculate quantiles. For example, the 75th percentile for all years aggregated into a single year looks like:
[22]:
data.long_term_seasonal(df=DATAFRAMES["DF_SIMULATED"], method = 'Q75')
[22]:
| jday | QOSIM_05BB001 | QOSIM_05BA001 |
|---|---|---|
| 1 | 4.830453 | 1.315370 |
| 2 | 4.801530 | 1.306986 |
| 3 | 4.772831 | 1.298670 |
| 4 | 4.744344 | 1.290422 |
| 5 | 4.716085 | 1.282241 |
| ... | ... | ... |
| 362 | 4.978491 | 1.372171 |
| 363 | 4.948421 | 1.364174 |
| 364 | 4.918590 | 1.356237 |
| 365 | 4.888982 | 1.348359 |
| 366 | 4.859608 | 1.323824 |
366 rows × 2 columns
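The long-term aggregation described in this section amounts to grouping by day of the year across all years. A minimal pandas sketch follows, using a toy series in which each value equals its year. It assumes the `jday` index is the ordinal day of the year and is not the library's own code.

```python
import pandas as pd

index = pd.date_range("1981-01-01", "1984-12-31", freq="D")  # 1984 is a leap year
df = pd.DataFrame({"sim": index.year.to_numpy(dtype=float)}, index=index)

# Ordinal day of the year (1..365, or 366 in leap years) for every row.
jday = pd.Index(df.index.dayofyear, name="jday")

long_term_mean = df.groupby(jday).mean()
long_term_q75 = df.groupby(jday).quantile(0.75)

print(long_term_mean)
print(long_term_q75)
```

Day 366 only exists in leap years, so its aggregate is built from fewer values than the other days. With an ordinal day, dates after February 28th also fall on a different `jday` in leap years than in other years.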
Naturally, when dealing with multi-model evaluations, we are able to perform statistics on the output of the long-term seasonal aggregations. This way we are able to extract the mean, median, max, etc. of the long-term seasonal aggregations, leaving us with just the statistics. An example of this is shown below:
[23]:
data.stat_aggregate(df=data.long_term_seasonal(df=DATAFRAMES["DF_MERGED"], method = 'median'), method='median')
[23]:
| jday | Station1 MEDIAN | Station2 MEDIAN |
|---|---|---|
| 1 | 3.636044 | 1.001954 |
| 2 | 3.616069 | 0.997078 |
| 3 | 3.596241 | 0.992233 |
| 4 | 3.576548 | 0.987417 |
| 5 | 3.556993 | 0.982631 |
| ... | ... | ... |
| 362 | 3.767376 | 1.027792 |
| 363 | 3.746928 | 1.022093 |
| 364 | 3.726621 | 1.016435 |
| 365 | 3.706450 | 1.010818 |
| 366 | 4.164878 | 1.192406 |
366 rows × 2 columns
Note
All of these functions, with their various means of aggregation, are available as individual functions, but they can also be generated right from the generate_dataframes() function if you know exactly what you'll need from the beginning. It just requires specifying a few more parameters. These parameters are shown below:
[24]:
## Let's use a time period of 1981 to 1990 to demonstrate this
DATAFRAMES = data.generate_dataframes(csv_fpaths=path, warm_up=365, start_date = "1981-01-01", end_date = "1990-12-31",
    # optional arguments
    # you specify that you want an aggregated dataframe by passing 'True' into
    # the respective parameter and then you pass in your preferred method of aggregation
    # If you want daily aggregation
    daily_agg = True, da_method = 'min',
    # let's see a weekly aggregation
    weekly_agg = True, wa_method = 'min', # we want the minimum value each week
    # let's also see monthly aggregation
    # 'inst' takes an instantaneous value rather than the maximum; in the output further below,
    # the 1990-12 row equals the 1990-12-31 values
    monthly_agg = True, ma_method = 'inst',
    # let's also see yearly aggregation
    yearly_agg = True, ya_method = 'sum', # we want the sum of all values each year
    # let's see the stats aggregation
    stat_agg = True, stat_method = 'q75'
    # note that without inputting the respective methods,
    # the functions will still default to mean as the method of aggregation
)
The start date for the Data is 1981-01-01
[25]:
DATAFRAMES = data.generate_dataframes(csv_fpaths=path, warm_up=365, start_date = "1981-01-01", end_date = "1990-12-31",
    # seasonal aggregation
    # obtaining May 1st to August 30th of every year from 1981 to 1985
    seasonal_p = True, sp_dperiod = ('05-01', '08-30'),
    sp_subset = ('1981-01-01', '1985-12-31'),
    # instead of sp_subset, we can also use years = [1981, 1982, 1983, 1984, 1985].
    # long term seasonal aggregation
    long_term = True, lt_method = ["q33.33", "median" ,'q75' ,'Q25' ,'q33' ],
    # when using long term in the generate_dataframes function, we are able to pass
    # in a list of the aggregation methods we want generated. By default, though, it will
    # always generate maximum, minimum and median value dataframes
)
The start date for the Data is 1981-01-01
Putting it all together, we have:
[26]:
DATAFRAMES = data.generate_dataframes(csv_fpaths=path, warm_up=365, start_date = "1981-01-01", end_date = "1990-12-31",
daily_agg = True, da_method = 'min',
weekly_agg = True, wa_method = 'min',
monthly_agg = True, ma_method = 'inst',
yearly_agg = True, ya_method = 'sum',
stat_agg = True, stat_method = 'q75',
seasonal_p = True, sp_dperiod = ('05-01', '08-30'), sp_subset = ('1981-01-01', '1985-12-31'),
long_term = True, lt_method = ["q33.33", "median" ,'q75' ,'Q25' ,'q33' ],
)
for key, value in DATAFRAMES.items():
    print(f"{key}:\n{value}")
The start date for the Data is 1981-01-01
DF:
QOMEAS_05BB001 QOSIM_05BB001 QOMEAS_05BA001 QOSIM_05BA001
1981-01-01 9.85 2.518999 NaN 1.001954
1981-01-02 10.20 2.507289 NaN 0.997078
1981-01-03 10.00 2.495637 NaN 0.992233
1981-01-04 10.10 2.484073 NaN 0.987417
1981-01-05 9.99 2.472571 NaN 0.982631
... ... ... ... ...
1990-12-27 10.10 6.615961 NaN 1.737144
1990-12-28 9.50 6.573054 NaN 1.725025
1990-12-29 8.60 6.530500 NaN 1.713013
1990-12-30 8.20 6.488300 NaN 1.701107
1990-12-31 8.25 6.446449 NaN 1.689308
[3652 rows x 4 columns]
DF_OBSERVED:
QOMEAS_05BB001 QOMEAS_05BA001
1981-01-01 9.85 NaN
1981-01-02 10.20 NaN
1981-01-03 10.00 NaN
1981-01-04 10.10 NaN
1981-01-05 9.99 NaN
... ... ...
1990-12-27 10.10 NaN
1990-12-28 9.50 NaN
1990-12-29 8.60 NaN
1990-12-30 8.20 NaN
1990-12-31 8.25 NaN
[3652 rows x 2 columns]
DF_SIMULATED:
QOSIM_05BB001 QOSIM_05BA001
1981-01-01 2.518999 1.001954
1981-01-02 2.507289 0.997078
1981-01-03 2.495637 0.992233
1981-01-04 2.484073 0.987417
1981-01-05 2.472571 0.982631
... ... ...
1990-12-27 6.615961 1.737144
1990-12-28 6.573054 1.725025
1990-12-29 6.530500 1.713013
1990-12-30 6.488300 1.701107
1990-12-31 6.446449 1.689308
[3652 rows x 2 columns]
DF_MERGED:
Station1 Station2
QOMEAS QOSIM1 QOMEAS QOSIM1
1981-01-01 9.85 2.518999 NaN 1.001954
1981-01-02 10.20 2.507289 NaN 0.997078
1981-01-03 10.00 2.495637 NaN 0.992233
1981-01-04 10.10 2.484073 NaN 0.987417
1981-01-05 9.99 2.472571 NaN 0.982631
... ... ... ... ...
1990-12-27 10.10 6.615961 NaN 1.737144
1990-12-28 9.50 6.573054 NaN 1.725025
1990-12-29 8.60 6.530500 NaN 1.713013
1990-12-30 8.20 6.488300 NaN 1.701107
1990-12-31 8.25 6.446449 NaN 1.689308
[3652 rows x 4 columns]
DF_DAILY:
Station1 Station2
QOMEAS QOSIM1 QOMEAS QOSIM1
1981/001 9.85 2.518999 NaN 1.001954
1981/002 10.20 2.507289 NaN 0.997078
1981/003 10.00 2.495637 NaN 0.992233
1981/004 10.10 2.484073 NaN 0.987417
1981/005 9.99 2.472571 NaN 0.982631
... ... ... ... ...
1990/361 10.10 6.615961 NaN 1.737144
1990/362 9.50 6.573054 NaN 1.725025
1990/363 8.60 6.530500 NaN 1.713013
1990/364 8.20 6.488300 NaN 1.701107
1990/365 8.25 6.446449 NaN 1.689308
[3652 rows x 4 columns]
DF_WEEKLY:
Station1 Station2
QOMEAS QOSIM1 QOMEAS QOSIM1
1980-12-29 9.85 2.484073 NaN 0.987417
1981-01-05 8.70 2.404939 NaN 0.954524
1981-01-12 8.24 2.328990 NaN 0.923008
1981-01-19 7.86 2.256059 NaN 0.892801
1981-01-26 8.10 2.186006 NaN 0.863836
... ... ... ... ...
1990-12-03 10.80 7.453629 NaN 1.975199
1990-12-10 8.70 7.112537 NaN 1.877937
1990-12-17 8.10 6.791226 NaN 1.786726
1990-12-24 8.20 6.488300 NaN 1.701107
1990-12-31 8.25 6.446449 NaN 1.689308
[523 rows x 4 columns]
DF_MONTHLY:
Station1 Station2
QOMEAS QOSIM1 QOMEAS QOSIM1
1981-01 8.62 2.195846 NaN 0.867900
1981-02 7.20 1.940355 NaN 0.762678
1981-03 7.25 1.699932 NaN 0.664341
1981-04 15.30 3.859564 NaN 0.584523
1981-05 113.00 220.485800 28.20 96.363520
... ... ... ... ...
1990-08 33.60 40.431200 10.10 23.856810
1990-09 90.90 19.438340 30.50 6.175078
1990-10 21.10 9.648046 4.01 2.642092
1990-11 12.00 7.920140 3.90 2.109938
1990-12 8.25 6.446449 NaN 1.689308
[120 rows x 4 columns]
DF_YEARLY:
Station1 Station2
QOMEAS QOSIM1 QOMEAS QOSIM1
1981-01 273.29 72.939293 0.00 28.921777
1981-02 217.84 57.685555 0.00 22.735300
1981-03 213.78 56.200513 0.00 22.025424
1981-04 239.44 93.987320 0.00 24.649901
1981-05 1349.70 1572.824547 303.34 471.264224
... ... ... ... ...
1990-08 1541.80 1520.330940 550.40 758.817000
1990-09 1007.10 855.418340 269.57 408.246703
1990-10 1035.10 361.582897 277.26 95.083195
1990-11 460.12 261.980893 3.90 70.673194
1990-12 324.60 220.982680 0.00 58.369321
[120 rows x 4 columns]
DF_CUSTOM:
Station1 Station2
QOMEAS QOSIM1 QOMEAS QOSIM1
1981-05-01 13.8 4.924126 NaN 0.587910
1981-05-02 12.0 4.199440 3.01 0.592167
1981-05-03 10.9 2.605448 2.85 0.580767
1981-05-04 10.3 2.898053 2.72 0.582108
1981-05-05 10.4 3.357134 2.80 0.578170
... ... ... ... ...
1985-08-26 45.9 22.102120 13.00 13.373290
1985-08-27 43.6 21.293030 12.50 12.981310
1985-08-28 42.5 21.262790 12.40 13.329020
1985-08-29 41.6 21.753170 12.80 13.696490
1985-08-30 42.0 22.593580 13.20 16.211630
[610 rows x 4 columns]
LONG_TERM_MIN:
Station1 Station2
QOMEAS QOSIM1 QOMEAS QOSIM1
jday
1 7.17 1.802230 NaN 0.485006
2 7.02 1.794152 NaN 0.482805
3 7.10 1.786115 NaN 0.480617
4 7.31 1.778133 NaN 0.478442
5 7.64 1.770190 NaN 0.476279
... ... ... ... ...
362 7.25 1.835014 NaN 0.493944
363 7.29 1.826749 NaN 0.491689
364 7.27 1.818528 NaN 0.489450
365 7.21 1.810352 NaN 0.487221
366 10.30 2.939887 NaN 0.782759
[366 rows x 4 columns]
LONG_TERM_MAX:
Station1 Station2
QOMEAS QOSIM1 QOMEAS QOSIM1
jday
1 12.7 5.465301 NaN 2.441398
2 12.7 5.433753 NaN 2.426411
3 12.7 5.402424 NaN 2.411540
4 12.7 5.371336 NaN 2.396784
5 12.8 5.340471 NaN 2.382142
... ... ... ... ...
362 13.3 6.573054 NaN 2.502534
363 13.0 6.530500 NaN 2.487069
364 12.8 6.488300 NaN 2.471726
365 12.7 6.446449 NaN 2.456503
366 11.3 3.132862 NaN 0.925092
[366 rows x 4 columns]
LONG_TERM_MEDIAN:
Station1 Station2
QOMEAS QOSIM1 QOMEAS QOSIM1
jday
1 9.795 3.397196 NaN 0.961191
2 9.815 3.379185 NaN 0.956437
3 9.855 3.361293 NaN 0.951713
4 9.875 3.343524 NaN 0.947018
5 9.695 3.325880 NaN 0.942354
... ... ... ... ...
362 10.350 3.698825 NaN 1.020910
363 9.490 3.678319 NaN 1.015117
364 9.070 3.657953 NaN 1.009368
365 9.350 3.637736 NaN 1.003660
366 10.800 3.036374 NaN 0.853926
[366 rows x 4 columns]
LONG_TERM_Q33.33:
Station1 Station2
QOMEAS QOSIM1 QOMEAS QOSIM1
jday
1 9.519838 3.116782 NaN 0.883651
2 9.419919 3.100879 NaN 0.878571
3 9.599898 3.085059 NaN 0.873528
4 9.709991 3.069359 NaN 0.868522
5 9.400000 3.053748 NaN 0.863553
... ... ... ... ...
362 9.499817 3.373855 NaN 0.904356
363 8.849961 3.356529 NaN 0.899122
364 8.949970 3.339308 NaN 0.893926
365 8.919976 3.322208 NaN 0.888770
366 10.633300 3.004206 NaN 0.830199
[366 rows x 4 columns]
LONG_TERM_Q75:
Station1 Station2
QOMEAS QOSIM1 QOMEAS QOSIM1
jday
1 10.0375 4.156184 NaN 1.248870
2 10.2000 4.132273 NaN 1.242024
3 10.6500 4.108537 NaN 1.235226
4 10.3250 4.084971 NaN 1.228475
5 10.5225 4.061583 NaN 1.221772
... ... ... ... ...
362 11.2500 5.222542 NaN 1.627869
363 11.0000 5.192005 NaN 1.617079
364 10.8000 5.161703 NaN 1.606380
365 10.4950 5.131630 NaN 1.595775
366 11.0500 3.084618 NaN 0.889509
[366 rows x 4 columns]
LONG_TERM_Q25:
Station1 Station2
QOMEAS QOSIM1 QOMEAS QOSIM1
jday
1 9.1150 2.972452 NaN 0.881223
2 9.2175 2.956880 NaN 0.876179
3 9.3450 2.941408 NaN 0.871171
4 9.6875 2.926045 NaN 0.866200
5 9.4000 2.910776 NaN 0.861267
... ... ... ... ...
362 9.0425 3.241984 NaN 0.901784
363 8.7525 3.225315 NaN 0.896586
364 8.8750 3.208758 NaN 0.891427
365 8.8600 3.192306 NaN 0.886306
366 10.5500 2.988131 NaN 0.818343
[366 rows x 4 columns]
LONG_TERM_Q33:
Station1 Station2
QOMEAS QOSIM1 QOMEAS QOSIM1
jday
1 9.5038 3.111064 NaN 0.883555
2 9.4119 3.095175 NaN 0.878476
3 9.5898 3.079368 NaN 0.873434
4 9.7091 3.063681 NaN 0.868430
5 9.4000 3.048084 NaN 0.863463
... ... ... ... ...
362 9.4817 3.368631 NaN 0.904254
363 8.8461 3.351331 NaN 0.899021
364 8.9470 3.334136 NaN 0.893827
365 8.9176 3.317062 NaN 0.888672
366 10.6300 3.003569 NaN 0.829729
[366 rows x 4 columns]
DF_STATS:
Station1 Station2 \
MIN MAX MEDIAN Q75 MIN MAX
1981-01-01 2.518999 2.518999 2.518999 2.518999 1.001954 1.001954
1981-01-02 2.507289 2.507289 2.507289 2.507289 0.997078 0.997078
1981-01-03 2.495637 2.495637 2.495637 2.495637 0.992233 0.992233
1981-01-04 2.484073 2.484073 2.484073 2.484073 0.987417 0.987417
1981-01-05 2.472571 2.472571 2.472571 2.472571 0.982631 0.982631
... ... ... ... ... ... ...
1990-12-27 6.615961 6.615961 6.615961 6.615961 1.737144 1.737144
1990-12-28 6.573054 6.573054 6.573054 6.573054 1.725025 1.725025
1990-12-29 6.530500 6.530500 6.530500 6.530500 1.713013 1.713013
1990-12-30 6.488300 6.488300 6.488300 6.488300 1.701107 1.701107
1990-12-31 6.446449 6.446449 6.446449 6.446449 1.689308 1.689308
MEDIAN Q75
1981-01-01 1.001954 1.001954
1981-01-02 0.997078 0.997078
1981-01-03 0.992233 0.992233
1981-01-04 0.987417 0.987417
1981-01-05 0.982631 0.982631
... ... ...
1990-12-27 1.737144 1.737144
1990-12-28 1.725025 1.725025
1990-12-29 1.713013 1.713013
1990-12-30 1.701107 1.701107
1990-12-31 1.689308 1.689308
[3652 rows x 8 columns]