%%shell
jupyter nbconvert --to html FinalProject.ipynb
In my project, my objective is to examine the trends in cancer incidence rates across the United States, focusing on how these rates have changed over the years. A key aspect of my research involves exploring the potential relationship between cancer research funding and the incidence of different types of cancer. For instance, I'm interested in investigating if cancers with higher incidence counts tend to receive more funding. This inquiry is important for two reasons: firstly, it addresses public welfare concerns by focusing on the most prevalent cancers; secondly, from a pharmaceutical perspective, it's profitable to target cancers with a larger patient base.
My goal is to create predictive models for research funding allocation for various cancer types, using the data on their characteristics and trends. To achieve this, I plan to utilize time series analysis and regression models to forecast the funding patterns for cancer research. This approach aims to provide insights into how funding might be distributed in the future based on historical trends and current data.
I employ data from the Centers for Disease Control and Prevention (CDC) spanning 1999 to 2020. This dataset contains comprehensive information on cancer types and patients' demographic characteristics, such as age, gender, region, and cancer incidence rates. The funding dataset comes from National Cancer Institute, which provides funding data for specific cancer type from 2007 to 2018. Then I use the cancer incidences and funding data from 2007 to 2018 to predict funding in 2019.
#Load libraries needed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#I've placed the data I obtained from the Centers for Disease Control and Prevention (CDC) and National Cancer Institute into the Data folder of my github
!git clone https://github.com/LeleZW/Data_CMPS3160.git
!ls
Cloning into 'Data_CMPS3160'... remote: Enumerating objects: 31, done. remote: Counting objects: 100% (31/31), done. remote: Compressing objects: 100% (31/31), done. remote: Total 31 (delta 12), reused 0 (delta 0), pack-reused 0 Receiving objects: 100% (31/31), 261.46 KiB | 3.11 MiB/s, done. Resolving deltas: 100% (12/12), done. Data_CMPS3160 sample_data
#Now, read the .txt data using pd.read_csv.
df = pd.read_csv('Data_CMPS3160/United States and Puerto Rico Cancer Statistics, 1999-2020 Incidence.txt', sep='\t')
display(df.head())
Notes | Leading Cancer Sites | Leading Cancer Sites Code | Year | Year Code | Count | Population | Crude Rate | |
---|---|---|---|---|---|---|---|---|
0 | NaN | Brain and Other Nervous System | 31010-31040 | 2007.0 | 2007.0 | 21347.0 | 301231207.0 | 7.1 |
1 | NaN | Brain and Other Nervous System | 31010-31040 | 2008.0 | 2008.0 | 21753.0 | 304093966.0 | 7.2 |
2 | NaN | Brain and Other Nervous System | 31010-31040 | 2009.0 | 2009.0 | 21981.0 | 306771529.0 | 7.2 |
3 | NaN | Brain and Other Nervous System | 31010-31040 | 2010.0 | 2010.0 | 21904.0 | 309327143.0 | 7.1 |
4 | NaN | Brain and Other Nervous System | 31010-31040 | 2011.0 | 2011.0 | 22026.0 | 311583481.0 | 7.1 |
Several redundant columns exist in the data frame, and I will remove these columns in the subsequent code.
#Display the column names
df.columns.values
array(['Notes', 'Leading Cancer Sites', 'Leading Cancer Sites Code', 'Year', 'Year Code', 'Count', 'Population', 'Crude Rate'], dtype=object)
#Removing unwanted columns
df2 = df[['Leading Cancer Sites', 'Year', 'Crude Rate', 'Count']]
display(df2.head())
Leading Cancer Sites | Year | Crude Rate | Count | |
---|---|---|---|---|
0 | Brain and Other Nervous System | 2007.0 | 7.1 | 21347.0 |
1 | Brain and Other Nervous System | 2008.0 | 7.2 | 21753.0 |
2 | Brain and Other Nervous System | 2009.0 | 7.2 | 21981.0 |
3 | Brain and Other Nervous System | 2010.0 | 7.1 | 21904.0 |
4 | Brain and Other Nervous System | 2011.0 | 7.1 | 22026.0 |
Since there are spaces in the variable name, it's advisable to eliminate them.
#Renaming columns
df2 = df2.rename(columns={'Leading Cancer Sites':'LeadingCancerSites', 'Crude Rate': 'CrudeRate'})
display(df2.head())
LeadingCancerSites | Year | CrudeRate | Count | |
---|---|---|---|---|
0 | Brain and Other Nervous System | 2007.0 | 7.1 | 21347.0 |
1 | Brain and Other Nervous System | 2008.0 | 7.2 | 21753.0 |
2 | Brain and Other Nervous System | 2009.0 | 7.2 | 21981.0 |
3 | Brain and Other Nervous System | 2010.0 | 7.1 | 21904.0 |
4 | Brain and Other Nervous System | 2011.0 | 7.1 | 22026.0 |
#Change the float to int for Year and Count
df2.dtypes
df2.Year = np.int_(df2.Year)
df2.Count = np.int_(df2.Count)
display(df2.head())
LeadingCancerSites | Year | CrudeRate | Count | |
---|---|---|---|---|
0 | Brain and Other Nervous System | 2007 | 7.1 | 21347 |
1 | Brain and Other Nervous System | 2008 | 7.2 | 21753 |
2 | Brain and Other Nervous System | 2009 | 7.2 | 21981 |
3 | Brain and Other Nervous System | 2010 | 7.1 | 21904 |
4 | Brain and Other Nervous System | 2011 | 7.1 | 22026 |
Crude Rates are expressed as the number of cases reported each calendar year per 100,000 population.
Crude Rate = Count / Population * 100,000
df2.LeadingCancerSites
0 Brain and Other Nervous System 1 Brain and Other Nervous System 2 Brain and Other Nervous System 3 Brain and Other Nervous System 4 Brain and Other Nervous System ... 342 NaN 343 NaN 344 NaN 345 NaN 346 NaN Name: LeadingCancerSites, Length: 347, dtype: object
As observed, there are NaNs in the "Cancersites" variable. Our initial step is to remove these NaN values.
df3 = df2.dropna(subset=['LeadingCancerSites'])
df3.LeadingCancerSites
0 Brain and Other Nervous System 1 Brain and Other Nervous System 2 Brain and Other Nervous System 3 Brain and Other Nervous System 4 Brain and Other Nervous System ... 303 Urinary Bladder, invasive and in situ 304 Urinary Bladder, invasive and in situ 305 Urinary Bladder, invasive and in situ 306 Urinary Bladder, invasive and in situ 307 Urinary Bladder, invasive and in situ Name: LeadingCancerSites, Length: 308, dtype: object
#Now, import the funding data
funding = pd.read_excel('Data_CMPS3160/CancerFunding_ByType.xls')
display(funding.head())
Category Name | Total Funded Amount | Year | |
---|---|---|---|
0 | Anus | 6706020 | 2018 |
1 | Bladder | 40443790 | 2018 |
2 | Brain | 218336880 | 2018 |
3 | Breast | 574902164 | 2018 |
4 | Buccal Cavity | 13977656 | 2018 |
funding.dtypes
Category Name object Total Funded Amount int64 Year int64 dtype: object
#Renaming columns
funding2 = funding.rename(columns={'Category Name':'CategoryName', 'Total Funded Amount': 'TotalFundedAmount'})
display(funding2.head())
CategoryName | TotalFundedAmount | Year | |
---|---|---|---|
0 | Anus | 6706020 | 2018 |
1 | Bladder | 40443790 | 2018 |
2 | Brain | 218336880 | 2018 |
3 | Breast | 574902164 | 2018 |
4 | Buccal Cavity | 13977656 | 2018 |
First, we can look at all cancer incidences. I would like to investigate the most frequent cancer in terms of count and rate.
df3.groupby(['LeadingCancerSites'])['Count'].sum().sort_values(ascending=False)
LeadingCancerSites Breast 3361313 Lung and Bronchus 3067414 Prostate 2983809 Colon and Rectum 1995314 Melanoma of the Skin 1065496 Urinary Bladder, invasive and in situ 1027927 Non-Hodgkin Lymphoma 962989 Kidney and Renal Pelvis 846074 Corpus Uteri 705717 Leukemias 700454 Pancreas 656342 Thyroid 632882 Oral Cavity and Pharynx 605032 Liver 368664 Myeloma 348298 Stomach 327908 Brain and Other Nervous System 315747 Ovary 298379 Esophagus 241433 Cervix Uteri 178650 Larynx 175335 Gallbladder 56128 Name: Count, dtype: int64
df3.groupby(['LeadingCancerSites'])['Count'].sum().sort_values(ascending=False).plot.bar()
<Axes: xlabel='LeadingCancerSites'>
Regarding the cumulative incidence of cancer, breast cancer emerged as the most prevalent type between 2008 and 2020, affecting a total of 3,361,313 individuals.
df3.groupby(['LeadingCancerSites'])['CrudeRate'].mean().sort_values(ascending=False)
LeadingCancerSites Prostate 137.207143 Breast 75.878571 Lung and Bronchus 69.335714 Colon and Rectum 45.128571 Corpus Uteri 31.335714 Melanoma of the Skin 24.035714 Urinary Bladder, invasive and in situ 23.207143 Non-Hodgkin Lymphoma 21.742857 Kidney and Renal Pelvis 19.085714 Leukemias 15.807143 Pancreas 14.785714 Thyroid 14.292857 Oral Cavity and Pharynx 13.642857 Ovary 13.292857 Liver 8.300000 Cervix Uteri 7.957143 Myeloma 7.842857 Stomach 7.407143 Brain and Other Nervous System 7.150000 Esophagus 5.442857 Larynx 3.985714 Gallbladder 1.271429 Name: CrudeRate, dtype: float64
df3.groupby(['LeadingCancerSites'])['CrudeRate'].mean().sort_values(ascending=False).plot.bar()
<Axes: xlabel='LeadingCancerSites'>
On average, prostate cancer registers as the most commonly diagnosed type, with an incidence rate of 137 individuals per 100,000 being affected by prostate cancer annually over the period from 2008 to 2020.
Secondly, I aim to examine the variations in incidence rates for various types of cancer throughout these years.
import seaborn as sns
plt.figure(figsize=(18, 6))
sns.lineplot(data=df3, x='Year', y='CrudeRate', hue='LeadingCancerSites', ci=None, marker='o')
<ipython-input-17-2c7630684b89>:4: FutureWarning: The `ci` parameter is deprecated. Use `errorbar=None` for the same effect. sns.lineplot(data=df3, x='Year', y='CrudeRate', hue='LeadingCancerSites', ci=None, marker='o')
<Axes: xlabel='Year', ylabel='CrudeRate'>
Prostate cancer exhibits the highest incidence rate.
It looks each cancer is relatively stable from 2008 to 2019. However, there appears to be a drop in 2020. Hence, I will only use data from 2008 to 2019 for predictions.
As for the funding, we can also look at the change in total fundings related to cancer.
funding2.groupby(['CategoryName'])['TotalFundedAmount'].mean().sort_values(ascending=False).plot.bar()
<Axes: xlabel='CategoryName'>
Breast cancer receives the largest share of funding, accounting for nearly double the funding allocated to lung and prostate cancers combined. However, it's worth noting that despite the higher funding for breast cancer, prostate cancer has the highest incidence rate, as indicated in the previous figure.
funding2.groupby('Year').TotalFundedAmount.sum().plot.line()
<Axes: xlabel='Year'>
After reviewing the data, it became apparent that the lower funding in 2007 is attributed to numerous missing values for that year. Consequently, I have decided to exclude the 2007 values from my analysis.
funding3 = funding2[funding2['Year'] != 2007]
funding3.groupby('Year').TotalFundedAmount.sum().plot.line()
<Axes: xlabel='Year'>
Additionally, funding for cancer research is closely linked to the Gross Domestic Product (GDP). To provide a comprehensive analysis, I have included annual GDP data from 2008 to 2020, sourced from the U.S. Bureau of Economic Analysis. This inclusion allows for a better understanding of how economic fluctuations might influence cancer research funding over these years.
# GDP data
gdp_data = {
"Year": [2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020],
"GDP (in Billion USD)": [
14712.8, 14448.9, 14992.1, 15542.6, 16197.0, 16784.9, 17527.3, 18238.3,
18745.1, 19543.0, 20611.9, 21433.2, 21137.6
]
}
# Creating a DataFrame
gdp_df = pd.DataFrame(gdp_data)
#merge cancer incidence data and cpi data
df4 = pd.merge(df3, gdp_df, on='Year')
display(df4.head())
LeadingCancerSites | Year | CrudeRate | Count | GDP (in Billion USD) | |
---|---|---|---|---|---|
0 | Brain and Other Nervous System | 2008 | 7.2 | 21753 | 14712.8 |
1 | Breast | 2008 | 71.8 | 218323 | 14712.8 |
2 | Cervix Uteri | 2008 | 8.3 | 12756 | 14712.8 |
3 | Colon and Rectum | 2008 | 48.6 | 147709 | 14712.8 |
4 | Corpus Uteri | 2008 | 27.6 | 42626 | 14712.8 |
I am interested in how funding are allocated in different kinds of cancer. To visually emphasize the top five cancers receiving the most funding, I have employed distinct colors to highlight these categories in the chart.
grouped_data = funding3.groupby('CategoryName')['TotalFundedAmount'].sum().sort_values(ascending=False)
colors = ['red', 'green', 'blue', 'cyan', 'magenta']
rest_color = 'grey'
bar_colors = [colors[i] if i < len(colors) else rest_color for i in range(len(grouped_data))]
grouped_data.plot.bar(color=bar_colors)
plt.show()
Breast cancer ranks as the number one cancer in terms of receiving the highest funding, followed by lung cancer.
Next, I plan to merge the funding data with the cancer incidence data. However, before proceeding, it's important to ensure that both datasets use consistent naming conventions for a smooth merging process.
unique_categories_per_year = funding3.groupby('Year')['CategoryName'].unique()
for year, categories in unique_categories_per_year.items():
print(f"Year: {year}, Categories: {categories}")
Year: 2008, Categories: ['Anus' 'Bladder' 'Brain' 'Breast' 'Buccal Cavity' 'Central Nervous System - Not Including Brain' 'Cervical Cancer' 'Childhood Leukemia' 'Colon/Rectum' 'Ear' 'Esophagus' 'Eye' 'Gallbladder' 'Head and Neck' 'Heart' 'Hodgkins disease' 'Kaposi Sarcoma' 'Kidney Cancer' 'Kidney Disease' 'Larynx' 'Leukemia' 'Liver Cancer' 'Lung' 'Melanoma' 'Multiple Myeloma' 'Nervous System' 'Neuroblastoma' 'Non Hodgkins Lymphoma' 'Ovarian Cancer' 'Pancreas' 'Parathyroid' 'Penis' 'Pharynx' 'Pituitary' 'Prostate' 'Salivary Glands' 'Sarcoma' 'Stomach' 'Thyroid' 'Uterine' 'Vascular Disease' "Wilm's Tumor"] Year: 2009, Categories: ['Anus' 'Bladder' 'Brain' 'Breast' 'Buccal Cavity' 'Central Nervous System - Not Including Brain' 'Cervical Cancer' 'Childhood Leukemia' 'Colon/Rectum' 'Esophagus' 'Eye' 'Gallbladder' 'Head and Neck' 'Heart' 'Hodgkins disease' 'Kaposi Sarcoma' 'Kidney Cancer' 'Kidney Disease' 'Larynx' 'Leukemia' 'Liver Cancer' 'Lung' 'Melanoma' 'Multiple Myeloma' 'Nervous System' 'Neuroblastoma' 'Non Hodgkins Lymphoma' 'Ovarian Cancer' 'Pancreas' 'Parathyroid' 'Penis' 'Pharynx' 'Pituitary' 'Prostate' 'Salivary Glands' 'Sarcoma' 'Stomach' 'Testes' 'Thyroid' 'Urinary System' 'Uterine' 'Vaginal' 'Vascular Disease' "Wilm's Tumor"] Year: 2010, Categories: ['Anus' 'Bladder' 'Brain' 'Breast' 'Buccal Cavity' 'Central Nervous System - Not Including Brain' 'Cervical Cancer' 'Childhood Leukemia' 'Colon/Rectum' 'Esophagus' 'Eye' 'Gallbladder' 'Heart' 'Hodgkins disease' 'Kaposi Sarcoma' 'Kidney Cancer' 'Kidney Disease' 'Larynx' 'Leukemia' 'Liver Cancer' 'Lung' 'Melanoma' 'Multiple Myeloma' 'Nervous System' 'Neuroblastoma' 'Non Hodgkins Lymphoma' 'Ovarian Cancer' 'Pancreas' 'Parathyroid' 'Penis' 'Pharynx' 'Pituitary' 'Prostate' 'Salivary Glands' 'Sarcoma' 'Stomach' 'Testes' 'Thyroid' 'Urinary System' 'Uterine' 'Vaginal' 'Vascular Disease' "Wilm's Tumor"] Year: 2011, Categories: ['Anus' 'Bladder' 'Brain' 'Breast' 'Buccal Cavity' 'Central Nervous System - Not Including Brain' 'Cervical Cancer' 'Childhood Leukemia' 'Colon/Rectum' 'Esophagus' 'Eye' 'Gallbladder' 'Heart' 'Hodgkins disease' 'Kaposi Sarcoma' 'Kidney Cancer' 'Kidney Disease' 'Larynx' 'Leukemia' 'Liver Cancer' 'Lung' 'Melanoma' 'Multiple Myeloma' 'Nervous System' 'Neuroblastoma' 'Non Hodgkins Lymphoma' 'Ovarian Cancer' 'Pancreas' 'Parathyroid' 'Penis' 'Pharynx' 'Pituitary' 'Prostate' 'Salivary Glands' 'Sarcoma' 'Stomach' 'Testes' 'Thyroid' 'Urinary System' 'Uterine' 'Vaginal' 'Vascular Disease' "Wilm's Tumor"] Year: 2012, Categories: ['Anus' 'Bladder' 'Brain' 'Breast' 'Buccal Cavity' 'Central Nervous System - Not Including Brain' 'Cervical Cancer' 'Childhood Leukemia' 'Colon/Rectum' 'Esophagus' 'Eye' 'Gallbladder' 'Heart' 'Hodgkins disease' 'Kaposi Sarcoma' 'Kidney Cancer' 'Kidney Disease' 'Larynx' 'Leukemia' 'Liver Cancer' 'Lung' 'Melanoma' 'Multiple Myeloma' 'Nervous System' 'Neuroblastoma' 'Non Hodgkins Lymphoma' 'Ovarian Cancer' 'Pancreas' 'Parathyroid' 'Penis' 'Pharynx' 'Pituitary' 'Prostate' 'Salivary Glands' 'Sarcoma' 'Stomach' 'Testes' 'Thyroid' 'Urinary System' 'Uterine' 'Vaginal' 'Vascular Disease' "Wilm's Tumor"] Year: 2013, Categories: ['Anus' 'Bladder' 'Brain' 'Breast' 'Buccal Cavity' 'Central Nervous System - Not Including Brain' 'Cervical Cancer' 'Childhood Leukemia' 'Colon/Rectum' 'Esophagus' 'Eye' 'Gallbladder' 'Heart' 'Hodgkins disease' 'Kaposi Sarcoma' 'Kidney Cancer' 'Kidney Disease' 'Larynx' 'Leukemia' 'Liver Cancer' 'Lung' 'Melanoma' 'Multiple Myeloma' 'Nervous System' 'Neuroblastoma' 'Non Hodgkins Lymphoma' 'Ovarian Cancer' 'Pancreas' 'Parathyroid' 'Penis' 'Pharynx' 'Pituitary' 'Prostate' 'Salivary Glands' 'Sarcoma' 'Stomach' 'Testes' 'Thyroid' 'Urinary System' 'Uterine' 'Vaginal' 'Vascular Disease' "Wilm's Tumor"] Year: 2014, Categories: ['Anus' 'Bladder' 'Brain' 'Breast' 'Buccal Cavity' 'Central Nervous System - Not Including Brain' 'Cervical Cancer' 'Childhood Leukemia' 'Colon/Rectum' 'Esophagus' 'Eye' 'Gallbladder' 'Head and Neck' 'Heart' 'Hodgkins disease' 'Kaposi Sarcoma' 'Kidney Cancer' 'Kidney Disease' 'Larynx' 'Leukemia' 'Liver Cancer' 'Lung' 'Melanoma' 'Multiple Myeloma' 'Nervous System' 'Neuroblastoma' 'Non Hodgkins Lymphoma' 'Ovarian Cancer' 'Pancreas' 'Parathyroid' 'Penis' 'Pharynx' 'Pituitary' 'Prostate' 'Salivary Glands' 'Sarcoma' 'Stomach' 'Testes' 'Thyroid' 'Urinary System' 'Uterine' 'Vaginal' 'Vascular Disease' "Wilm's Tumor"] Year: 2015, Categories: ['Anus' 'Bladder' 'Brain' 'Breast' 'Buccal Cavity' 'Central Nervous System - Not Including Brain' 'Cervical Cancer' 'Childhood Leukemia' 'Colon/Rectum' 'Esophagus' 'Eye' 'Gallbladder' 'Head and Neck' 'Heart' 'Hodgkins disease' 'Kaposi Sarcoma' 'Kidney Cancer' 'Kidney Disease' 'Larynx' 'Leukemia' 'Liver Cancer' 'Lung' 'Melanoma' 'Multiple Myeloma' 'Nervous System' 'Neuroblastoma' 'Non Hodgkins Lymphoma' 'Ovarian Cancer' 'Pancreas' 'Parathyroid' 'Penis' 'Pharynx' 'Pituitary' 'Prostate' 'Salivary Glands' 'Sarcoma' 'Stomach' 'Testes' 'Thyroid' 'Urinary System' 'Uterine' 'Vaginal' 'Vascular Disease' "Wilm's Tumor"] Year: 2016, Categories: ['Anus' 'Bladder' 'Brain' 'Breast' 'Buccal Cavity' 'Central Nervous System - Not Including Brain' 'Cervical Cancer' 'Childhood Leukemia' 'Colon/Rectum' 'Esophagus' 'Eye' 'Gallbladder' 'Head and Neck' 'Heart' 'Hodgkins disease' 'Kaposi Sarcoma' 'Kidney Cancer' 'Kidney Disease' 'Larynx' 'Leukemia' 'Liver Cancer' 'Lung' 'Melanoma' 'Multiple Myeloma' 'Nervous System' 'Neuroblastoma' 'Non Hodgkins Lymphoma' 'Ovarian Cancer' 'Pancreas' 'Parathyroid' 'Penis' 'Pharynx' 'Pituitary' 'Prostate' 'Salivary Glands' 'Sarcoma' 'Stomach' 'Testes' 'Thyroid' 'Urinary System' 'Uterine' 'Vaginal' 'Vascular Disease' "Wilm's Tumor"] Year: 2017, Categories: ['Anus' 'Bladder' 'Brain' 'Breast' 'Buccal Cavity' 'Central Nervous System - Not Including Brain' 'Cervical Cancer' 'Childhood Leukemia' 'Colon/Rectum' 'Esophagus' 'Eye' 'Gallbladder' 'Head and Neck' 'Heart' 'Hodgkins disease' 'Kaposi Sarcoma' 'Kidney Cancer' 'Kidney Disease' 'Larynx' 'Leukemia' 'Liver Cancer' 'Lung' 'Melanoma' 'Multiple Myeloma' 'Nervous System' 'Neuroblastoma' 'Non Hodgkins Lymphoma' 'Ovarian Cancer' 'Pancreas' 'Parathyroid' 'Penis' 'Pharynx' 'Pituitary' 'Prostate' 'Salivary Glands' 'Sarcoma' 'Stomach' 'Testes' 'Thyroid' 'Urinary System' 'Uterine' 'Vaginal' 'Vascular Disease' "Wilm's Tumor"] Year: 2018, Categories: ['Anus' 'Bladder' 'Brain' 'Breast' 'Buccal Cavity' 'Central Nervous System - Not Including Brain' 'Cervical Cancer' 'Childhood Leukemia' 'Colon/Rectum' 'Esophagus' 'Eye' 'Gallbladder' 'Head and Neck' 'Heart' 'Hodgkins disease' 'Kaposi Sarcoma' 'Kidney Cancer' 'Kidney Disease' 'Larynx' 'Leukemia' 'Liver Cancer' 'Lung' 'Melanoma' 'Multiple Myeloma' 'Nervous System' 'Neuroblastoma' 'Non Hodgkins Lymphoma' 'Ovarian Cancer' 'Pancreas' 'Parathyroid' 'Penis' 'Pharynx' 'Pituitary' 'Prostate' 'Salivary Glands' 'Sarcoma' 'Stomach' 'Testes' 'Thyroid' 'Urinary System' 'Uterine' 'Vaginal' 'Vascular Disease' "Wilm's Tumor"]
df4['LeadingCancerSites'].unique()
array(['Brain and Other Nervous System', 'Breast', 'Cervix Uteri', 'Colon and Rectum', 'Corpus Uteri', 'Esophagus', 'Gallbladder', 'Kidney and Renal Pelvis', 'Larynx', 'Leukemias', 'Liver', 'Lung and Bronchus', 'Melanoma of the Skin', 'Myeloma', 'Non-Hodgkin Lymphoma', 'Oral Cavity and Pharynx', 'Ovary', 'Pancreas', 'Prostate', 'Stomach', 'Thyroid', 'Urinary Bladder, invasive and in situ'], dtype=object)
The incidence data contains fewer categories compared to the funding data. To address this, I will create new categories based on the types present in the incidence data.
#suming brain and Central Nervous System to Brain and Other Nervous System
funding4 = funding3.copy()
funding4['CategoryName'] = funding4['CategoryName'].replace(['Brain', 'Central Nervous System - Not Including Brain'], 'Brain and Other Nervous System')
#chaning name for Cervical Cancer
funding4['CategoryName'] = funding4['CategoryName'].replace('Cervical Cancer', 'Cervix Uteri')
#chaning name for Colon/Rectum
funding4['CategoryName'] = funding4['CategoryName'].replace('Colon/Rectum', 'Colon and Rectum')
#suming 'Kidney Cancer' and 'Kidney Disease' to Kidney and Renal Pelvis
funding4['CategoryName'] = funding4['CategoryName'].replace([ 'Kidney Cancer', 'Kidney Disease'], 'Kidney and Renal Pelvis')
#suming 'Leukemia' and 'Childhood Leukemia' to Leukemias
funding4['CategoryName'] = funding4['CategoryName'].replace([ 'Kidney Cancer', 'Kidney Disease'], 'Leukemias')
#chaning name for Liver Cancer
funding4['CategoryName'] = funding4['CategoryName'].replace('Liver Cancer', 'Liver')
#chaning name for Lung and Bronchus
funding4['CategoryName'] = funding4['CategoryName'].replace('Lung', 'Lung and Bronchus')
#chaning name for Melanoma of the Skin
funding4['CategoryName'] = funding4['CategoryName'].replace('Melanoma', 'Melanoma of the Skin')
#chaning name for Myeloma
funding4['CategoryName'] = funding4['CategoryName'].replace('Multiple Myeloma', 'Myeloma')
#chaning name for Non-Hodgkin Lymphoma
funding4['CategoryName'] = funding4['CategoryName'].replace('Non Hodgkin Lymphoma', 'Non-Hodgkin Lymphoma')
#chaning name for Oral Cavity and Pharynx
funding4['CategoryName'] = funding4['CategoryName'].replace('Pharynx', 'Oral Cavity and Pharynx')
#chaning name for Ovarian Cancer
funding4['CategoryName'] = funding4['CategoryName'].replace('Ovarian Cancer', 'Ovarian')
#suming 'Bladder', 'Urinary System' to Urinary Bladder
funding4['CategoryName'] = funding4['CategoryName'].replace([ 'Bladder', 'Urinary System'], 'Urinary Bladder')
funding5 = funding4.groupby(['Year', 'CategoryName'])['TotalFundedAmount'].sum().reset_index()
display(funding5.head(10))
Year | CategoryName | TotalFundedAmount | |
---|---|---|---|
0 | 2008 | Anus | 3120001 |
1 | 2008 | Brain and Other Nervous System | 153731417 |
2 | 2008 | Breast | 572597443 |
3 | 2008 | Buccal Cavity | 7357884 |
4 | 2008 | Cervix Uteri | 76788322 |
5 | 2008 | Childhood Leukemia | 48267255 |
6 | 2008 | Colon and Rectum | 273721430 |
7 | 2008 | Ear | 165314 |
8 | 2008 | Esophagus | 22441827 |
9 | 2008 | Eye | 3083451 |
#merge cancer incidence data and funding data
funding5 = funding5.rename(columns={'CategoryName':'LeadingCancerSites'})
df5 = pd.merge(funding5, df4, on=['LeadingCancerSites','Year'])
display(df5.head(10))
Year | LeadingCancerSites | TotalFundedAmount | CrudeRate | Count | GDP (in Billion USD) | |
---|---|---|---|---|---|---|
0 | 2008 | Brain and Other Nervous System | 153731417 | 7.2 | 21753 | 14712.8 |
1 | 2008 | Breast | 572597443 | 71.8 | 218323 | 14712.8 |
2 | 2008 | Cervix Uteri | 76788322 | 8.3 | 12756 | 14712.8 |
3 | 2008 | Colon and Rectum | 273721430 | 48.6 | 147709 | 14712.8 |
4 | 2008 | Esophagus | 22441827 | 5.4 | 16373 | 14712.8 |
5 | 2008 | Gallbladder | 975977 | 1.2 | 3567 | 14712.8 |
6 | 2008 | Kidney and Renal Pelvis | 86863366 | 17.5 | 53103 | 14712.8 |
7 | 2008 | Larynx | 244021 | 4.2 | 12652 | 14712.8 |
8 | 2008 | Liver | 74234589 | 6.8 | 20822 | 14712.8 |
9 | 2008 | Lung and Bronchus | 247569163 | 71.2 | 216423 | 14712.8 |
In this section, I aim to forecast the research funding for each cancer type in 2019 using time series modeling approaches based on data involving funding. Then I will use the data relating to count to test the relationship between funding and cancer count.
# let's make it a real date.
df5['Year'] = pd.to_datetime(df5['Year'].astype(str) + '-01-01')
df5.Year
0 2008-01-01 1 2008-01-01 2 2008-01-01 3 2008-01-01 4 2008-01-01 ... 182 2018-01-01 183 2018-01-01 184 2018-01-01 185 2018-01-01 186 2018-01-01 Name: Year, Length: 187, dtype: datetime64[ns]
# by setting the date to be the index, we can start slicing by date more easily.
df6 = df5.set_index('Year')
# let's also sort by date
df6 = df6.sort_index()
df6
LeadingCancerSites | TotalFundedAmount | CrudeRate | Count | GDP (in Billion USD) | |
---|---|---|---|---|---|
Year | |||||
2008-01-01 | Brain and Other Nervous System | 153731417 | 7.2 | 21753 | 14712.8 |
2008-01-01 | Breast | 572597443 | 71.8 | 218323 | 14712.8 |
2008-01-01 | Cervix Uteri | 76788322 | 8.3 | 12756 | 14712.8 |
2008-01-01 | Colon and Rectum | 273721430 | 48.6 | 147709 | 14712.8 |
2008-01-01 | Esophagus | 22441827 | 5.4 | 16373 | 14712.8 |
... | ... | ... | ... | ... | ... |
2018-01-01 | Oral Cavity and Pharynx | 2874430 | 14.8 | 48422 | 20611.9 |
2018-01-01 | Pancreas | 182103222 | 16.6 | 54380 | 20611.9 |
2018-01-01 | Prostate | 239139126 | 140.2 | 225721 | 20611.9 |
2018-01-01 | Stomach | 14206953 | 7.6 | 24858 | 20611.9 |
2018-01-01 | Thyroid | 13492177 | 14.3 | 46728 | 20611.9 |
187 rows × 5 columns
#calculating the rolling means
rolling_means = df6.groupby('LeadingCancerSites')['TotalFundedAmount'].rolling(window=3).mean()
rolling_means = rolling_means.reset_index()
print(rolling_means)
LeadingCancerSites Year TotalFundedAmount 0 Brain and Other Nervous System 2008-01-01 NaN 1 Brain and Other Nervous System 2009-01-01 NaN 2 Brain and Other Nervous System 2010-01-01 1.540251e+08 3 Brain and Other Nervous System 2011-01-01 1.603088e+08 4 Brain and Other Nervous System 2012-01-01 1.689569e+08 .. ... ... ... 182 Thyroid 2014-01-01 1.973038e+07 183 Thyroid 2015-01-01 2.262130e+07 184 Thyroid 2016-01-01 2.337942e+07 185 Thyroid 2017-01-01 2.327830e+07 186 Thyroid 2018-01-01 1.939083e+07 [187 rows x 3 columns]
#using the last three values to predict the 2019 funding for different cancer.
last_3_periods = rolling_means.groupby('LeadingCancerSites').tail(3)
average_last_3_periods = last_3_periods.groupby('LeadingCancerSites').mean()
print(average_last_3_periods)
TotalFundedAmount LeadingCancerSites Brain and Other Nervous System 2.043956e+08 Breast 5.379225e+08 Cervix Uteri 6.548022e+07 Colon and Rectum 2.167728e+08 Esophagus 2.938585e+07 Gallbladder 1.753896e+06 Kidney and Renal Pelvis 8.767646e+07 Larynx 8.442467e+05 Liver 7.418885e+07 Lung and Bronchus 2.898038e+08 Melanoma of the Skin 1.428203e+08 Myeloma 5.372338e+07 Oral Cavity and Pharynx 3.389576e+06 Pancreas 1.521435e+08 Prostate 2.337465e+08 Stomach 1.323088e+07 Thyroid 2.201618e+07
<ipython-input-31-47f1b1b6e5a1>:3: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function. average_last_3_periods = last_3_periods.groupby('LeadingCancerSites').mean()
Breast cancer is projected to receive the highest funding in 2019, amounting to over half a billion, followed by lung and prostate cancer in funding allocation.
Before selecting a regression model, I would like to examine the correlation between the number of cancer cases and the funding received.
df5.plot.scatter(x="Count", y="TotalFundedAmount",
cmap="plasma", alpha=.5);
/usr/local/lib/python3.10/dist-packages/pandas/plotting/_matplotlib/core.py:1259: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored scatter = ax.scatter(
Initially, there is a positive correlation between cancer counts and funding for each specific cancer. However, it becomes evident that cancer counts can be categorized into three groups: counts below 100,000, counts between 100,000 to approximately 225,000, and counts exceeding 225,000. Since the dataset contains only around 200 observations, I have decided to utilize a linear regression model.
In this model, the outcome variable is the funding for each type of cancer, and the independent variables include cancer count and annual GDP as a control variable. The choice of using cancer count instead of incidence rate is based on the fact that funding is also influenced by profit motives. Higher cancer counts indicate greater demand and potentially higher profits, making count a relevant predictor in this context.
df_lm = pd.merge(funding5, df4, on=['LeadingCancerSites','Year'])
display(df_lm.head(10))
Year | LeadingCancerSites | TotalFundedAmount | CrudeRate | Count | GDP (in Billion USD) | |
---|---|---|---|---|---|---|
0 | 2008 | Brain and Other Nervous System | 153731417 | 7.2 | 21753 | 14712.8 |
1 | 2008 | Breast | 572597443 | 71.8 | 218323 | 14712.8 |
2 | 2008 | Cervix Uteri | 76788322 | 8.3 | 12756 | 14712.8 |
3 | 2008 | Colon and Rectum | 273721430 | 48.6 | 147709 | 14712.8 |
4 | 2008 | Esophagus | 22441827 | 5.4 | 16373 | 14712.8 |
5 | 2008 | Gallbladder | 975977 | 1.2 | 3567 | 14712.8 |
6 | 2008 | Kidney and Renal Pelvis | 86863366 | 17.5 | 53103 | 14712.8 |
7 | 2008 | Larynx | 244021 | 4.2 | 12652 | 14712.8 |
8 | 2008 | Liver | 74234589 | 6.8 | 20822 | 14712.8 |
9 | 2008 | Lung and Bronchus | 247569163 | 71.2 | 216423 | 14712.8 |
import statsmodels.api as sm
df_lm_ind = df_lm.dropna()[['Count', 'Year', 'GDP (in Billion USD)']][df5['Year'].dt.year < 2019]
df_lm_target = df_lm.dropna()['TotalFundedAmount'][df5['Year'].dt.year < 2019]
df_lm_ind
Count | Year | GDP (in Billion USD) | |
---|---|---|---|
0 | 21753 | 2008 | 14712.8 |
1 | 218323 | 2008 | 14712.8 |
2 | 12756 | 2008 | 14712.8 |
3 | 147709 | 2008 | 14712.8 |
4 | 16373 | 2008 | 14712.8 |
... | ... | ... | ... |
182 | 48422 | 2018 | 20611.9 |
183 | 54380 | 2018 | 20611.9 |
184 | 225721 | 2018 | 20611.9 |
185 | 24858 | 2018 | 20611.9 |
186 | 46728 | 2018 | 20611.9 |
187 rows × 3 columns
X = df_lm_ind
y = df_lm_target
model = sm.OLS(y, X).fit()
predictions = model.predict(X) # make the predictions by the model
model.summary()
Dep. Variable: | TotalFundedAmount | R-squared (uncentered): | 0.862 |
---|---|---|---|
Model: | OLS | Adj. R-squared (uncentered): | 0.859 |
Method: | Least Squares | F-statistic: | 381.8 |
Date: | Sat, 18 Nov 2023 | Prob (F-statistic): | 9.87e-79 |
Time: | 05:23:00 | Log-Likelihood: | -3647.4 |
No. Observations: | 187 | AIC: | 7301. |
Df Residuals: | 184 | BIC: | 7310. |
Df Model: | 3 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Count | 1642.0067 | 68.518 | 23.964 | 0.000 | 1506.824 | 1777.189 |
Year | 1.478e+04 | 2.31e+04 | 0.640 | 0.523 | -3.08e+04 | 6.04e+04 |
GDP (in Billion USD) | -1320.3357 | 2706.871 | -0.488 | 0.626 | -6660.831 | 4020.160 |
Omnibus: | 41.373 | Durbin-Watson: | 1.546 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 66.974 |
Skew: | 1.166 | Prob(JB): | 2.86e-15 |
Kurtosis: | 4.777 | Cond. No. | 469. |
Using the coefficients(1642.0067***) derived from the Ordinary Least Squares (OLS) regression, I conducted an analysis to evaluate the association between cancer incidence and funding, which indicated a strong relationship.
For this project, I sourced cancer incidence data spanning from 1999 to 2020 from the Centers for Medicare & Medicaid Services (CMS), and cancer funding data covering 2008 to 2018 from the National Cancer Institute (NCI). I explored two primary inquiries. The initial query involved forecasting the funding allocation for various cancer types using a time series model, which predicted that breast cancer would continue to receive the most substantial funding in 2019. The second inquiry examined the potential correlation between cancer incidence and funding levels, where my analysis revealed a strong positive correlation.
A significant challenge in my research is the limited number of data points available for cancer funding. Having access to funding data over a more extended period would enhance my ability to detect and analyze variations in cancer funding trends. One potential solution is shifting the focus from a year-cancer framework to a physician-cancer one. This means examining the funding received by different specialists or scientists within the same year, offering a more granular view of how cancer funding is distributed. This approach could reveal more nuanced changes in funding patterns.
Additionally, incorporating more detailed characteristics of cancer, such as the distribution of cases across different genders and races, as well as survival rates for each cancer type, could provide further insights. This expanded dataset would allow me to explore whether cancer funding allocation shows any disparities based on gender or race. Such an analysis could be crucial in understanding and addressing potential biases in funding distribution.