TL;DR This post shows you how to calculate CAGR for a metric in Pandas, while handling edge cases that may appear in real-life data. The final code can be found here.
Sharon just called. She has a meeting with the client tomorrow and it would be just great if she could get a simple analysis of the profit growth between 2018 and 2020 from the data set of companies that she just shared. Never mind that it's 17:00 (5PM) and you have promised your family to get home early today.
But hey, I'm heading out now, just give me a call if there is anything you need. Cya!
Frustrated, you take a deep breath. If Sharon needs the numbers, you better deliver them - after all, you really want that promotion.
Using your deep data analysis expertise, you realize that you can use CAGR (compound annual growth rate)! CAGR represents the smooth growth of a metric over a number of years and should be good enough for the meeting tomorrow.
Encouraged, you pull out Pandas version 1.2.3 and Jupyter Lab to get to work.
OK, so the dataframe is not too complicated. It contains profit numbers from a set of companies.
df = pd.DataFrame(
[
['Normal Ltd', 2018, 100],
['Normal Ltd', 2019, 200],
['Normal Ltd', 2020, 190],
['Founded Later Ltd', 2019, 190],
['Founded Later Ltd', 2020, 190],
['Closed Down Early Ltd', 2018, 100],
['Closed Down Early Ltd', 2019, 50],
['Negative Profits First Year Ltd', 2018, -100],
['Negative Profits First Year Ltd', 2019, 50],
['Negative Profits First Year Ltd', 2020, 150],
['Negative Profits Last Year Ltd', 2018, 50],
['Negative Profits Last Year Ltd', 2019, 150],
['Negative Profits Last Year Ltd', 2020, -100],
['Too Many Years Ltd', 2016, 25],
['Too Many Years Ltd', 2017, 25],
['Too Many Years Ltd', 2018, 50],
['Too Many Years Ltd', 2019, 40],
['Too Many Years Ltd', 2020, 30],
['Only One Year Ltd', 2020, 150],
['Started From Zero Ltd', 2018, 0],
['Started From Zero Ltd', 2019, 10],
['Started From Zero Ltd', 2020, 20],
],
columns=['company_name', 'year', 'profit']
)
However, you can already foresee at least one problem: Some did not exist in 2018! Meaning that you cannot compare all companies' 2018 numbers with their 2020 numbers. This might not be as easy as you first thought, so you send a text home to tell your family you'll be late.
Hm, you better start at the beginning and remove all data that is not relevant for the given time period. For instance, data about profits during 2016 is useless for this task.
relevant_years = df[
(df['year'] >= 2018) &
(df['year'] <= 2020)
]
A lot of unnecessary data is now gone, but your headache is not. You want to apply the CAGR formula to all of companies individually, so one row of data per company would be easy to work with, rather than one row per year as the dataframe is structured right now.
To accomplish this, you would need to extract each company's first and last year from the dataset. After some googling, you realize that there are aggregation functions called idxmin
and idxmax
that can help you with this. These aggregation functions take a group (a specific company in this this case) and return the indexes for the smallest (idxmin
) and largest (idxmax
) values. These indexes can then be used to retrieve the relevant rows from the smallest and largest years of the company.
idx_for_relevant_years = relevant_years.groupby(
# Create a group.
'company_name'
)['year'].agg(
# For each company, get the index to the earliest (idxmin)
# and latest (idxmax) year
['idxmin', 'idxmax']
)
That gives you something like this for each company, where the numbers you see (5 and 6) are indexes for the position in the original dataframe relevant_years
.
company_name | idxmin | idxmax |
---|---|---|
Closed Down Early Ltd | 5 | 6 |
... | ... | ... |
You can now use these indexes to retrieve the profit data from the first and last years.
# `.loc` is used to get values based on index
first_years = relevant_years.loc[idx_for_relevant_years['idxmin']]
last_years = relevant_years.loc[idx_for_relevant_years['idxmax']]
You then merge the dataframes together so that each row represents one company, with profit data from the first and last year.
merged_years = first_years.merge(
last_years,
# Merge the dataframes on the key "company_name"
on='company_name',
# When columns have the same name, apply these suffixes
# to the column names so that we can keep them both
suffixes=('_first', '_last')
)
company_name | year_first | profit_first | year_last | profit_last | |
---|---|---|---|---|---|
0 | Closed Down Early Ltd | 2018 | 100 | 2019 | 50 |
n | ... | ... | ... | ... | ... |
Inspecting the new dataframe, you find that one company went bankrupt before numbers for 2020 could be released. Should this company be included in the set? You give Sharon a call to make sure.
Oh you mean Closed Down Early Ltd? Yeah, they went bankrupt during 2019 and we don't really care about companies that don't exist anymore, so you can exclude them.
With your new dataframe, excluding them takes no more than one line of code.
# All companies must have 2020 data.
# Use `.copy()` to copy the slice of the datframe and avoid warnings.
active_companies = merged_years.loc[merged_years['year_last'] == 2020].copy()
Now that you have your data in a reasonable format, you feel ready to make the actual CAGR calculation. Before applying the formula, you create a new column called year_diff
that represents the difference between the last and first year in the dataframe. This data is used by the CAGR formula.
active_companies['year_diff'] = active_companies['year_last'] - active_companies['year_first']
Then after some swift googling, you find that Numpy has a handy power
function that can be used to calculate CAGR, given your dataframe format.
import numpy as np
active_companies['cagr'] = np.power(
# Profit of the last year divided by the first year profits
active_companies['profit_last'] / active_companies['profit_first'],
# Taken to the power of 1 divided by the time difference in years
(1 / active_companies['year_diff'])
) - 1
Excellent, the code appears to have worked! But wait a minute, some numbers produced are nonsensical...
company_name | ... | profit_first | profit_last | cagr |
---|---|---|---|---|
Negative Profits First Year Ltd | ... | -100 | 150 | nan |
Negative Profits Last Year Ltd | ... | 50 | -100 | nan |
What's going on here? It turns out that the CAGR formula mathematically cannot calculate CAGR when the first or last year of profits were negative.
company_name | ... | profit_first | profit_last | cagr |
---|---|---|---|---|
Started From Zero Ltd | ... | 0 | 20 | inf |
Furthermore, the CAGR formula is not able to calculate the growth when the first year of profits is 0, because a shift in any direction will represent infinity in Python.
company_name | ... | year_diff | cagr |
---|---|---|---|
Only One Year Ltd | ... | 0 | 0 |
Lastly, there is one company in the dataset that only has data for one year, where CAGR was calculated to 0
. It would make more sense if the CAGR value was NaN
to represent that growth cannot be calculated for the company in question.
As you bang your head against the desk, you realize that you can represent increases and declines in non-quantitative terms. What if you were to return I
and D
to represent Increase and Decline, when the CAGR formula cannot be applied?
But then again, you would need a way to apply custom logic to the dataframe. You google further and find that there is a dataframe function called apply
that you can use to execute a custom function for each row. That should do the trick.
With a sudden burst of energy, you start creating the function.
def calculate_cagr(row):
# If there is only one year, it is not possible to calculate CAGR.
# As such, we return NaN.
if row['year_diff'] == 0:
return np.nan
# Check if the first year profits is less or equal to 0 OR if
# the last year profits is less than 0.
# If so, handle the CAGR formula manually.
if row['profit_first'] <= 0 or row['profit_last'] < 0:
# If there was a growth in profits, return 'I' to represent 'Increase'
if row['profit_first'] < row['profit_last']:
return 'I'
elif row['profit_first'] > row['profit_last']:
# If there was a decline in profits, return 'D' to represent 'Decline'
return 'D'
# No difference in profits, so we return 0.
else:
return 0
# If this row does not represent a special case, apply the CAGR formula
return np.power(
row['profit_last'] / row['profit_first'],
(1 / row['year_diff'])
) - 1
Excellent, that looks like it will work. Let's try to apply
that function to each row.
active_companies['cagr_custom'] = active_companies.apply(
# The function to apply, the function must take a
# row from the dataframe as an argument
calculate_cagr,
# Whether to apply the function for each column or for each row.
# `axis=1` tells it to apply the function for each row.
axis=1
)
company_name | ... | cagr | cagr_custom |
---|---|---|---|
Founded Later Ltd | ... | 0 | 0.0 |
Negative Profits First Year Ltd | ... | nan | I |
Negative Profits Last Year Ltd | ... | nan | D |
Normal Ltd | ... | 0.38 | 0.38 |
Only One Year Ltd | ... | 0 | nan |
Started From Zero Ltd | ... | inf | I |
Too Many Years Ltd | ... | -0.23 | -0.23 |
Lo and behold! You managed to make a simple profit analysis using CAGR. It might have kept you away from your family, but at least you managed to create value for a client. Impressed by your own ingenuity, you give Sharon a call to tell her the good news.
Oh, the profit analysis? Yeah, I just spoke to the client, they don't want it anymore. But good for you that you managed to finish it!
With a strange feeling of self-hatred mixed with pride, you hang up the phone. Suddenly confronted with the value of time, you start packing up your things. Not only to leave the office, but to leave a toxic company culture for good. You turn off the lamp on your desk, say goodbye to all of your colleagues as you would any night, and then leave the building; with a slight smile on your face and with hope for tomorrow.
You can find all code here.