In [ ]:
%%shell
jupyter nbconvert --to html FinalProject.ipynb

Cancer Incidences in the U.S.¶

Lele Zhao¶

Final Project for CMPS 3160¶

Link to my Github Page

Introduction¶

In my project, my objective is to examine the trends in cancer incidence rates across the United States, focusing on how these rates have changed over the years. A key aspect of my research involves exploring the potential relationship between cancer research funding and the incidence of different types of cancer. For instance, I'm interested in investigating if cancers with higher incidence counts tend to receive more funding. This inquiry is important for two reasons: firstly, it addresses public welfare concerns by focusing on the most prevalent cancers; secondly, from a pharmaceutical perspective, it's profitable to target cancers with a larger patient base.

My goal is to create predictive models for research funding allocation for various cancer types, using the data on their characteristics and trends. To achieve this, I plan to utilize time series analysis and regression models to forecast the funding patterns for cancer research. This approach aims to provide insights into how funding might be distributed in the future based on historical trends and current data.

I employ data from the Centers for Disease Control and Prevention (CDC) spanning 1999 to 2020. This dataset contains comprehensive information on cancer types and patients' demographic characteristics, such as age, gender, region, and cancer incidence rates. The funding dataset comes from National Cancer Institute, which provides funding data for specific cancer type from 2007 to 2018. Then I use the cancer incidences and funding data from 2007 to 2018 to predict funding in 2019.

Data Extraction, Transform, and Load (ETL)¶

In [4]:
#Load libraries needed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
In [5]:
#I've placed the data I obtained from the Centers for Disease Control and Prevention (CDC) and National Cancer Institute into the Data folder of my github
!git clone https://github.com/LeleZW/Data_CMPS3160.git
!ls
Cloning into 'Data_CMPS3160'...
remote: Enumerating objects: 31, done.
remote: Counting objects: 100% (31/31), done.
remote: Compressing objects: 100% (31/31), done.
remote: Total 31 (delta 12), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (31/31), 261.46 KiB | 3.11 MiB/s, done.
Resolving deltas: 100% (12/12), done.
Data_CMPS3160  sample_data
In [ ]:
#Now, read the .txt data using pd.read_csv.
df = pd.read_csv('Data_CMPS3160/United States and Puerto Rico Cancer Statistics, 1999-2020 Incidence.txt', sep='\t')
display(df.head())
Notes Leading Cancer Sites Leading Cancer Sites Code Year Year Code Count Population Crude Rate
0 NaN Brain and Other Nervous System 31010-31040 2007.0 2007.0 21347.0 301231207.0 7.1
1 NaN Brain and Other Nervous System 31010-31040 2008.0 2008.0 21753.0 304093966.0 7.2
2 NaN Brain and Other Nervous System 31010-31040 2009.0 2009.0 21981.0 306771529.0 7.2
3 NaN Brain and Other Nervous System 31010-31040 2010.0 2010.0 21904.0 309327143.0 7.1
4 NaN Brain and Other Nervous System 31010-31040 2011.0 2011.0 22026.0 311583481.0 7.1

Several redundant columns exist in the data frame, and I will remove these columns in the subsequent code.

In [ ]:
#Display the column names
df.columns.values
Out[ ]:
array(['Notes', 'Leading Cancer Sites', 'Leading Cancer Sites Code',
       'Year', 'Year Code', 'Count', 'Population', 'Crude Rate'],
      dtype=object)
In [ ]:
#Removing unwanted columns
df2 = df[['Leading Cancer Sites', 'Year', 'Crude Rate', 'Count']]
display(df2.head())
Leading Cancer Sites Year Crude Rate Count
0 Brain and Other Nervous System 2007.0 7.1 21347.0
1 Brain and Other Nervous System 2008.0 7.2 21753.0
2 Brain and Other Nervous System 2009.0 7.2 21981.0
3 Brain and Other Nervous System 2010.0 7.1 21904.0
4 Brain and Other Nervous System 2011.0 7.1 22026.0

Since there are spaces in the variable name, it's advisable to eliminate them.

In [ ]:
#Renaming columns
df2 = df2.rename(columns={'Leading Cancer Sites':'LeadingCancerSites', 'Crude Rate': 'CrudeRate'})
display(df2.head())
LeadingCancerSites Year CrudeRate Count
0 Brain and Other Nervous System 2007.0 7.1 21347.0
1 Brain and Other Nervous System 2008.0 7.2 21753.0
2 Brain and Other Nervous System 2009.0 7.2 21981.0
3 Brain and Other Nervous System 2010.0 7.1 21904.0
4 Brain and Other Nervous System 2011.0 7.1 22026.0
In [ ]:
#Change the float to int for Year and Count
df2.dtypes
df2.Year = np.int_(df2.Year)
df2.Count = np.int_(df2.Count)
display(df2.head())
LeadingCancerSites Year CrudeRate Count
0 Brain and Other Nervous System 2007 7.1 21347
1 Brain and Other Nervous System 2008 7.2 21753
2 Brain and Other Nervous System 2009 7.2 21981
3 Brain and Other Nervous System 2010 7.1 21904
4 Brain and Other Nervous System 2011 7.1 22026

Crude Rates are expressed as the number of cases reported each calendar year per 100,000 population.

Crude Rate = Count / Population * 100,000

In [ ]:
df2.LeadingCancerSites
Out[ ]:
0      Brain and Other Nervous System
1      Brain and Other Nervous System
2      Brain and Other Nervous System
3      Brain and Other Nervous System
4      Brain and Other Nervous System
                    ...              
342                               NaN
343                               NaN
344                               NaN
345                               NaN
346                               NaN
Name: LeadingCancerSites, Length: 347, dtype: object

As observed, there are NaNs in the "Cancersites" variable. Our initial step is to remove these NaN values.

In [ ]:
df3 = df2.dropna(subset=['LeadingCancerSites'])
df3.LeadingCancerSites
Out[ ]:
0             Brain and Other Nervous System
1             Brain and Other Nervous System
2             Brain and Other Nervous System
3             Brain and Other Nervous System
4             Brain and Other Nervous System
                       ...                  
303    Urinary Bladder, invasive and in situ
304    Urinary Bladder, invasive and in situ
305    Urinary Bladder, invasive and in situ
306    Urinary Bladder, invasive and in situ
307    Urinary Bladder, invasive and in situ
Name: LeadingCancerSites, Length: 308, dtype: object
In [ ]:
#Now, import the funding data
funding = pd.read_excel('Data_CMPS3160/CancerFunding_ByType.xls')
display(funding.head())
Category Name Total Funded Amount Year
0 Anus 6706020 2018
1 Bladder 40443790 2018
2 Brain 218336880 2018
3 Breast 574902164 2018
4 Buccal Cavity 13977656 2018
In [ ]:
funding.dtypes
Out[ ]:
Category Name          object
Total Funded Amount     int64
Year                    int64
dtype: object
In [ ]:
#Renaming columns
funding2 = funding.rename(columns={'Category Name':'CategoryName', 'Total Funded Amount': 'TotalFundedAmount'})
display(funding2.head())
CategoryName TotalFundedAmount Year
0 Anus 6706020 2018
1 Bladder 40443790 2018
2 Brain 218336880 2018
3 Breast 574902164 2018
4 Buccal Cavity 13977656 2018

Summary Statistics¶

First, we can look at all cancer incidences. I would like to investigate the most frequent cancer in terms of count and rate.

In [ ]:
df3.groupby(['LeadingCancerSites'])['Count'].sum().sort_values(ascending=False)
Out[ ]:
LeadingCancerSites
Breast                                   3361313
Lung and Bronchus                        3067414
Prostate                                 2983809
Colon and Rectum                         1995314
Melanoma of the Skin                     1065496
Urinary Bladder, invasive and in situ    1027927
Non-Hodgkin Lymphoma                      962989
Kidney and Renal Pelvis                   846074
Corpus Uteri                              705717
Leukemias                                 700454
Pancreas                                  656342
Thyroid                                   632882
Oral Cavity and Pharynx                   605032
Liver                                     368664
Myeloma                                   348298
Stomach                                   327908
Brain and Other Nervous System            315747
Ovary                                     298379
Esophagus                                 241433
Cervix Uteri                              178650
Larynx                                    175335
Gallbladder                                56128
Name: Count, dtype: int64
In [ ]:
df3.groupby(['LeadingCancerSites'])['Count'].sum().sort_values(ascending=False).plot.bar()
Out[ ]:
<Axes: xlabel='LeadingCancerSites'>

Regarding the cumulative incidence of cancer, breast cancer emerged as the most prevalent type between 2008 and 2020, affecting a total of 3,361,313 individuals.

In [ ]:
df3.groupby(['LeadingCancerSites'])['CrudeRate'].mean().sort_values(ascending=False)
Out[ ]:
LeadingCancerSites
Prostate                                 137.207143
Breast                                    75.878571
Lung and Bronchus                         69.335714
Colon and Rectum                          45.128571
Corpus Uteri                              31.335714
Melanoma of the Skin                      24.035714
Urinary Bladder, invasive and in situ     23.207143
Non-Hodgkin Lymphoma                      21.742857
Kidney and Renal Pelvis                   19.085714
Leukemias                                 15.807143
Pancreas                                  14.785714
Thyroid                                   14.292857
Oral Cavity and Pharynx                   13.642857
Ovary                                     13.292857
Liver                                      8.300000
Cervix Uteri                               7.957143
Myeloma                                    7.842857
Stomach                                    7.407143
Brain and Other Nervous System             7.150000
Esophagus                                  5.442857
Larynx                                     3.985714
Gallbladder                                1.271429
Name: CrudeRate, dtype: float64
In [ ]:
df3.groupby(['LeadingCancerSites'])['CrudeRate'].mean().sort_values(ascending=False).plot.bar()
Out[ ]:
<Axes: xlabel='LeadingCancerSites'>

On average, prostate cancer registers as the most commonly diagnosed type, with an incidence rate of 137 individuals per 100,000 being affected by prostate cancer annually over the period from 2008 to 2020.

Secondly, I aim to examine the variations in incidence rates for various types of cancer throughout these years.

In [ ]:
import seaborn as sns

plt.figure(figsize=(18, 6))
sns.lineplot(data=df3, x='Year', y='CrudeRate', hue='LeadingCancerSites', ci=None, marker='o')
<ipython-input-17-2c7630684b89>:4: FutureWarning: 

The `ci` parameter is deprecated. Use `errorbar=None` for the same effect.

  sns.lineplot(data=df3, x='Year', y='CrudeRate', hue='LeadingCancerSites', ci=None, marker='o')
Out[ ]:
<Axes: xlabel='Year', ylabel='CrudeRate'>

Prostate cancer exhibits the highest incidence rate.

It looks each cancer is relatively stable from 2008 to 2019. However, there appears to be a drop in 2020. Hence, I will only use data from 2008 to 2019 for predictions.

As for the funding, we can also look at the change in total fundings related to cancer.

In [ ]:
funding2.groupby(['CategoryName'])['TotalFundedAmount'].mean().sort_values(ascending=False).plot.bar()
Out[ ]:
<Axes: xlabel='CategoryName'>

Breast cancer receives the largest share of funding, accounting for nearly double the funding allocated to lung and prostate cancers combined. However, it's worth noting that despite the higher funding for breast cancer, prostate cancer has the highest incidence rate, as indicated in the previous figure.

In [ ]:
funding2.groupby('Year').TotalFundedAmount.sum().plot.line()
Out[ ]:
<Axes: xlabel='Year'>

After reviewing the data, it became apparent that the lower funding in 2007 is attributed to numerous missing values for that year. Consequently, I have decided to exclude the 2007 values from my analysis.

In [ ]:
funding3 =  funding2[funding2['Year'] != 2007]
funding3.groupby('Year').TotalFundedAmount.sum().plot.line()
Out[ ]:
<Axes: xlabel='Year'>

Additionally, funding for cancer research is closely linked to the Gross Domestic Product (GDP). To provide a comprehensive analysis, I have included annual GDP data from 2008 to 2020, sourced from the U.S. Bureau of Economic Analysis. This inclusion allows for a better understanding of how economic fluctuations might influence cancer research funding over these years.

In [ ]:
# GDP data
gdp_data = {
    "Year": [2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020],
    "GDP (in Billion USD)": [
        14712.8, 14448.9, 14992.1, 15542.6, 16197.0, 16784.9, 17527.3, 18238.3,
        18745.1, 19543.0, 20611.9, 21433.2, 21137.6
    ]
}

# Creating a DataFrame
gdp_df = pd.DataFrame(gdp_data)
In [ ]:
#merge cancer incidence data and cpi data
df4 = pd.merge(df3, gdp_df, on='Year')
display(df4.head())
LeadingCancerSites Year CrudeRate Count GDP (in Billion USD)
0 Brain and Other Nervous System 2008 7.2 21753 14712.8
1 Breast 2008 71.8 218323 14712.8
2 Cervix Uteri 2008 8.3 12756 14712.8
3 Colon and Rectum 2008 48.6 147709 14712.8
4 Corpus Uteri 2008 27.6 42626 14712.8

I am interested in how funding are allocated in different kinds of cancer. To visually emphasize the top five cancers receiving the most funding, I have employed distinct colors to highlight these categories in the chart.

In [ ]:
grouped_data = funding3.groupby('CategoryName')['TotalFundedAmount'].sum().sort_values(ascending=False)
colors = ['red', 'green', 'blue', 'cyan', 'magenta']
rest_color = 'grey'
bar_colors = [colors[i] if i < len(colors) else rest_color for i in range(len(grouped_data))]
grouped_data.plot.bar(color=bar_colors)
plt.show()

Breast cancer ranks as the number one cancer in terms of receiving the highest funding, followed by lung cancer.

Next, I plan to merge the funding data with the cancer incidence data. However, before proceeding, it's important to ensure that both datasets use consistent naming conventions for a smooth merging process.

In [ ]:
unique_categories_per_year = funding3.groupby('Year')['CategoryName'].unique()
for year, categories in unique_categories_per_year.items():
    print(f"Year: {year}, Categories: {categories}")
Year: 2008, Categories: ['Anus' 'Bladder' 'Brain' 'Breast' 'Buccal Cavity'
 'Central Nervous System - Not Including Brain' 'Cervical Cancer'
 'Childhood Leukemia' 'Colon/Rectum' 'Ear' 'Esophagus' 'Eye' 'Gallbladder'
 'Head and Neck' 'Heart' 'Hodgkins disease' 'Kaposi Sarcoma'
 'Kidney Cancer' 'Kidney Disease' 'Larynx' 'Leukemia' 'Liver Cancer'
 'Lung' 'Melanoma' 'Multiple Myeloma' 'Nervous System' 'Neuroblastoma'
 'Non Hodgkins Lymphoma' 'Ovarian Cancer' 'Pancreas' 'Parathyroid' 'Penis'
 'Pharynx' 'Pituitary' 'Prostate' 'Salivary Glands' 'Sarcoma' 'Stomach'
 'Thyroid' 'Uterine' 'Vascular Disease' "Wilm's Tumor"]
Year: 2009, Categories: ['Anus' 'Bladder' 'Brain' 'Breast' 'Buccal Cavity'
 'Central Nervous System - Not Including Brain' 'Cervical Cancer'
 'Childhood Leukemia' 'Colon/Rectum' 'Esophagus' 'Eye' 'Gallbladder'
 'Head and Neck' 'Heart' 'Hodgkins disease' 'Kaposi Sarcoma'
 'Kidney Cancer' 'Kidney Disease' 'Larynx' 'Leukemia' 'Liver Cancer'
 'Lung' 'Melanoma' 'Multiple Myeloma' 'Nervous System' 'Neuroblastoma'
 'Non Hodgkins Lymphoma' 'Ovarian Cancer' 'Pancreas' 'Parathyroid' 'Penis'
 'Pharynx' 'Pituitary' 'Prostate' 'Salivary Glands' 'Sarcoma' 'Stomach'
 'Testes' 'Thyroid' 'Urinary System' 'Uterine' 'Vaginal'
 'Vascular Disease' "Wilm's Tumor"]
Year: 2010, Categories: ['Anus' 'Bladder' 'Brain' 'Breast' 'Buccal Cavity'
 'Central Nervous System - Not Including Brain' 'Cervical Cancer'
 'Childhood Leukemia' 'Colon/Rectum' 'Esophagus' 'Eye' 'Gallbladder'
 'Heart' 'Hodgkins disease' 'Kaposi Sarcoma' 'Kidney Cancer'
 'Kidney Disease' 'Larynx' 'Leukemia' 'Liver Cancer' 'Lung' 'Melanoma'
 'Multiple Myeloma' 'Nervous System' 'Neuroblastoma'
 'Non Hodgkins Lymphoma' 'Ovarian Cancer' 'Pancreas' 'Parathyroid' 'Penis'
 'Pharynx' 'Pituitary' 'Prostate' 'Salivary Glands' 'Sarcoma' 'Stomach'
 'Testes' 'Thyroid' 'Urinary System' 'Uterine' 'Vaginal'
 'Vascular Disease' "Wilm's Tumor"]
Year: 2011, Categories: ['Anus' 'Bladder' 'Brain' 'Breast' 'Buccal Cavity'
 'Central Nervous System - Not Including Brain' 'Cervical Cancer'
 'Childhood Leukemia' 'Colon/Rectum' 'Esophagus' 'Eye' 'Gallbladder'
 'Heart' 'Hodgkins disease' 'Kaposi Sarcoma' 'Kidney Cancer'
 'Kidney Disease' 'Larynx' 'Leukemia' 'Liver Cancer' 'Lung' 'Melanoma'
 'Multiple Myeloma' 'Nervous System' 'Neuroblastoma'
 'Non Hodgkins Lymphoma' 'Ovarian Cancer' 'Pancreas' 'Parathyroid' 'Penis'
 'Pharynx' 'Pituitary' 'Prostate' 'Salivary Glands' 'Sarcoma' 'Stomach'
 'Testes' 'Thyroid' 'Urinary System' 'Uterine' 'Vaginal'
 'Vascular Disease' "Wilm's Tumor"]
Year: 2012, Categories: ['Anus' 'Bladder' 'Brain' 'Breast' 'Buccal Cavity'
 'Central Nervous System - Not Including Brain' 'Cervical Cancer'
 'Childhood Leukemia' 'Colon/Rectum' 'Esophagus' 'Eye' 'Gallbladder'
 'Heart' 'Hodgkins disease' 'Kaposi Sarcoma' 'Kidney Cancer'
 'Kidney Disease' 'Larynx' 'Leukemia' 'Liver Cancer' 'Lung' 'Melanoma'
 'Multiple Myeloma' 'Nervous System' 'Neuroblastoma'
 'Non Hodgkins Lymphoma' 'Ovarian Cancer' 'Pancreas' 'Parathyroid' 'Penis'
 'Pharynx' 'Pituitary' 'Prostate' 'Salivary Glands' 'Sarcoma' 'Stomach'
 'Testes' 'Thyroid' 'Urinary System' 'Uterine' 'Vaginal'
 'Vascular Disease' "Wilm's Tumor"]
Year: 2013, Categories: ['Anus' 'Bladder' 'Brain' 'Breast' 'Buccal Cavity'
 'Central Nervous System - Not Including Brain' 'Cervical Cancer'
 'Childhood Leukemia' 'Colon/Rectum' 'Esophagus' 'Eye' 'Gallbladder'
 'Heart' 'Hodgkins disease' 'Kaposi Sarcoma' 'Kidney Cancer'
 'Kidney Disease' 'Larynx' 'Leukemia' 'Liver Cancer' 'Lung' 'Melanoma'
 'Multiple Myeloma' 'Nervous System' 'Neuroblastoma'
 'Non Hodgkins Lymphoma' 'Ovarian Cancer' 'Pancreas' 'Parathyroid' 'Penis'
 'Pharynx' 'Pituitary' 'Prostate' 'Salivary Glands' 'Sarcoma' 'Stomach'
 'Testes' 'Thyroid' 'Urinary System' 'Uterine' 'Vaginal'
 'Vascular Disease' "Wilm's Tumor"]
Year: 2014, Categories: ['Anus' 'Bladder' 'Brain' 'Breast' 'Buccal Cavity'
 'Central Nervous System - Not Including Brain' 'Cervical Cancer'
 'Childhood Leukemia' 'Colon/Rectum' 'Esophagus' 'Eye' 'Gallbladder'
 'Head and Neck' 'Heart' 'Hodgkins disease' 'Kaposi Sarcoma'
 'Kidney Cancer' 'Kidney Disease' 'Larynx' 'Leukemia' 'Liver Cancer'
 'Lung' 'Melanoma' 'Multiple Myeloma' 'Nervous System' 'Neuroblastoma'
 'Non Hodgkins Lymphoma' 'Ovarian Cancer' 'Pancreas' 'Parathyroid' 'Penis'
 'Pharynx' 'Pituitary' 'Prostate' 'Salivary Glands' 'Sarcoma' 'Stomach'
 'Testes' 'Thyroid' 'Urinary System' 'Uterine' 'Vaginal'
 'Vascular Disease' "Wilm's Tumor"]
Year: 2015, Categories: ['Anus' 'Bladder' 'Brain' 'Breast' 'Buccal Cavity'
 'Central Nervous System - Not Including Brain' 'Cervical Cancer'
 'Childhood Leukemia' 'Colon/Rectum' 'Esophagus' 'Eye' 'Gallbladder'
 'Head and Neck' 'Heart' 'Hodgkins disease' 'Kaposi Sarcoma'
 'Kidney Cancer' 'Kidney Disease' 'Larynx' 'Leukemia' 'Liver Cancer'
 'Lung' 'Melanoma' 'Multiple Myeloma' 'Nervous System' 'Neuroblastoma'
 'Non Hodgkins Lymphoma' 'Ovarian Cancer' 'Pancreas' 'Parathyroid' 'Penis'
 'Pharynx' 'Pituitary' 'Prostate' 'Salivary Glands' 'Sarcoma' 'Stomach'
 'Testes' 'Thyroid' 'Urinary System' 'Uterine' 'Vaginal'
 'Vascular Disease' "Wilm's Tumor"]
Year: 2016, Categories: ['Anus' 'Bladder' 'Brain' 'Breast' 'Buccal Cavity'
 'Central Nervous System - Not Including Brain' 'Cervical Cancer'
 'Childhood Leukemia' 'Colon/Rectum' 'Esophagus' 'Eye' 'Gallbladder'
 'Head and Neck' 'Heart' 'Hodgkins disease' 'Kaposi Sarcoma'
 'Kidney Cancer' 'Kidney Disease' 'Larynx' 'Leukemia' 'Liver Cancer'
 'Lung' 'Melanoma' 'Multiple Myeloma' 'Nervous System' 'Neuroblastoma'
 'Non Hodgkins Lymphoma' 'Ovarian Cancer' 'Pancreas' 'Parathyroid' 'Penis'
 'Pharynx' 'Pituitary' 'Prostate' 'Salivary Glands' 'Sarcoma' 'Stomach'
 'Testes' 'Thyroid' 'Urinary System' 'Uterine' 'Vaginal'
 'Vascular Disease' "Wilm's Tumor"]
Year: 2017, Categories: ['Anus' 'Bladder' 'Brain' 'Breast' 'Buccal Cavity'
 'Central Nervous System - Not Including Brain' 'Cervical Cancer'
 'Childhood Leukemia' 'Colon/Rectum' 'Esophagus' 'Eye' 'Gallbladder'
 'Head and Neck' 'Heart' 'Hodgkins disease' 'Kaposi Sarcoma'
 'Kidney Cancer' 'Kidney Disease' 'Larynx' 'Leukemia' 'Liver Cancer'
 'Lung' 'Melanoma' 'Multiple Myeloma' 'Nervous System' 'Neuroblastoma'
 'Non Hodgkins Lymphoma' 'Ovarian Cancer' 'Pancreas' 'Parathyroid' 'Penis'
 'Pharynx' 'Pituitary' 'Prostate' 'Salivary Glands' 'Sarcoma' 'Stomach'
 'Testes' 'Thyroid' 'Urinary System' 'Uterine' 'Vaginal'
 'Vascular Disease' "Wilm's Tumor"]
Year: 2018, Categories: ['Anus' 'Bladder' 'Brain' 'Breast' 'Buccal Cavity'
 'Central Nervous System - Not Including Brain' 'Cervical Cancer'
 'Childhood Leukemia' 'Colon/Rectum' 'Esophagus' 'Eye' 'Gallbladder'
 'Head and Neck' 'Heart' 'Hodgkins disease' 'Kaposi Sarcoma'
 'Kidney Cancer' 'Kidney Disease' 'Larynx' 'Leukemia' 'Liver Cancer'
 'Lung' 'Melanoma' 'Multiple Myeloma' 'Nervous System' 'Neuroblastoma'
 'Non Hodgkins Lymphoma' 'Ovarian Cancer' 'Pancreas' 'Parathyroid' 'Penis'
 'Pharynx' 'Pituitary' 'Prostate' 'Salivary Glands' 'Sarcoma' 'Stomach'
 'Testes' 'Thyroid' 'Urinary System' 'Uterine' 'Vaginal'
 'Vascular Disease' "Wilm's Tumor"]
In [ ]:
df4['LeadingCancerSites'].unique()
Out[ ]:
array(['Brain and Other Nervous System', 'Breast', 'Cervix Uteri',
       'Colon and Rectum', 'Corpus Uteri', 'Esophagus', 'Gallbladder',
       'Kidney and Renal Pelvis', 'Larynx', 'Leukemias', 'Liver',
       'Lung and Bronchus', 'Melanoma of the Skin', 'Myeloma',
       'Non-Hodgkin Lymphoma', 'Oral Cavity and Pharynx', 'Ovary',
       'Pancreas', 'Prostate', 'Stomach', 'Thyroid',
       'Urinary Bladder, invasive and in situ'], dtype=object)

The incidence data contains fewer categories compared to the funding data. To address this, I will create new categories based on the types present in the incidence data.

In [ ]:
#suming brain and Central Nervous System to Brain and Other Nervous System
funding4 = funding3.copy()
funding4['CategoryName'] = funding4['CategoryName'].replace(['Brain', 'Central Nervous System - Not Including Brain'], 'Brain and Other Nervous System')
#chaning name for Cervical Cancer
funding4['CategoryName'] = funding4['CategoryName'].replace('Cervical Cancer', 'Cervix Uteri')
#chaning name for Colon/Rectum
funding4['CategoryName'] = funding4['CategoryName'].replace('Colon/Rectum', 'Colon and Rectum')
#suming  'Kidney Cancer' and 'Kidney Disease' to Kidney and Renal Pelvis
funding4['CategoryName'] = funding4['CategoryName'].replace([ 'Kidney Cancer', 'Kidney Disease'], 'Kidney and Renal Pelvis')
#suming  'Leukemia' and 'Childhood Leukemia' to Leukemias
funding4['CategoryName'] = funding4['CategoryName'].replace([ 'Kidney Cancer', 'Kidney Disease'], 'Leukemias')
#chaning name for Liver Cancer
funding4['CategoryName'] = funding4['CategoryName'].replace('Liver Cancer', 'Liver')
#chaning name for Lung and Bronchus
funding4['CategoryName'] = funding4['CategoryName'].replace('Lung', 'Lung and Bronchus')
#chaning name for Melanoma of the Skin
funding4['CategoryName'] = funding4['CategoryName'].replace('Melanoma', 'Melanoma of the Skin')
#chaning name for Myeloma
funding4['CategoryName'] = funding4['CategoryName'].replace('Multiple Myeloma', 'Myeloma')
#chaning name for Non-Hodgkin Lymphoma
funding4['CategoryName'] = funding4['CategoryName'].replace('Non Hodgkin Lymphoma', 'Non-Hodgkin Lymphoma')
#chaning name for Oral Cavity and Pharynx
funding4['CategoryName'] = funding4['CategoryName'].replace('Pharynx', 'Oral Cavity and Pharynx')
#chaning name for Ovarian Cancer
funding4['CategoryName'] = funding4['CategoryName'].replace('Ovarian Cancer', 'Ovarian')
#suming  'Bladder', 'Urinary System' to Urinary Bladder
funding4['CategoryName'] = funding4['CategoryName'].replace([ 'Bladder', 'Urinary System'], 'Urinary Bladder')



funding5 = funding4.groupby(['Year', 'CategoryName'])['TotalFundedAmount'].sum().reset_index()
display(funding5.head(10))
Year CategoryName TotalFundedAmount
0 2008 Anus 3120001
1 2008 Brain and Other Nervous System 153731417
2 2008 Breast 572597443
3 2008 Buccal Cavity 7357884
4 2008 Cervix Uteri 76788322
5 2008 Childhood Leukemia 48267255
6 2008 Colon and Rectum 273721430
7 2008 Ear 165314
8 2008 Esophagus 22441827
9 2008 Eye 3083451
In [ ]:
#merge cancer incidence data and funding data
funding5 = funding5.rename(columns={'CategoryName':'LeadingCancerSites'})
df5 = pd.merge(funding5, df4, on=['LeadingCancerSites','Year'])
display(df5.head(10))
Year LeadingCancerSites TotalFundedAmount CrudeRate Count GDP (in Billion USD)
0 2008 Brain and Other Nervous System 153731417 7.2 21753 14712.8
1 2008 Breast 572597443 71.8 218323 14712.8
2 2008 Cervix Uteri 76788322 8.3 12756 14712.8
3 2008 Colon and Rectum 273721430 48.6 147709 14712.8
4 2008 Esophagus 22441827 5.4 16373 14712.8
5 2008 Gallbladder 975977 1.2 3567 14712.8
6 2008 Kidney and Renal Pelvis 86863366 17.5 53103 14712.8
7 2008 Larynx 244021 4.2 12652 14712.8
8 2008 Liver 74234589 6.8 20822 14712.8
9 2008 Lung and Bronchus 247569163 71.2 216423 14712.8

Predictions & Test¶

In this section, I aim to forecast the research funding for each cancer type in 2019 using time series modeling approaches based on data involving funding. Then I will use the data relating to count to test the relationship between funding and cancer count.

  1. Time Series Model
In [ ]:
# let's make it a real date.
df5['Year'] = pd.to_datetime(df5['Year'].astype(str) + '-01-01')
df5.Year
Out[ ]:
0     2008-01-01
1     2008-01-01
2     2008-01-01
3     2008-01-01
4     2008-01-01
         ...    
182   2018-01-01
183   2018-01-01
184   2018-01-01
185   2018-01-01
186   2018-01-01
Name: Year, Length: 187, dtype: datetime64[ns]
In [ ]:
# by setting the date to be the index, we can start slicing by date more easily.
df6 = df5.set_index('Year')
# let's also sort by date
df6 = df6.sort_index()
df6
Out[ ]:
LeadingCancerSites TotalFundedAmount CrudeRate Count GDP (in Billion USD)
Year
2008-01-01 Brain and Other Nervous System 153731417 7.2 21753 14712.8
2008-01-01 Breast 572597443 71.8 218323 14712.8
2008-01-01 Cervix Uteri 76788322 8.3 12756 14712.8
2008-01-01 Colon and Rectum 273721430 48.6 147709 14712.8
2008-01-01 Esophagus 22441827 5.4 16373 14712.8
... ... ... ... ... ...
2018-01-01 Oral Cavity and Pharynx 2874430 14.8 48422 20611.9
2018-01-01 Pancreas 182103222 16.6 54380 20611.9
2018-01-01 Prostate 239139126 140.2 225721 20611.9
2018-01-01 Stomach 14206953 7.6 24858 20611.9
2018-01-01 Thyroid 13492177 14.3 46728 20611.9

187 rows × 5 columns

In [ ]:
#calculating the rolling means
rolling_means = df6.groupby('LeadingCancerSites')['TotalFundedAmount'].rolling(window=3).mean()
rolling_means = rolling_means.reset_index()
print(rolling_means)
                 LeadingCancerSites       Year  TotalFundedAmount
0    Brain and Other Nervous System 2008-01-01                NaN
1    Brain and Other Nervous System 2009-01-01                NaN
2    Brain and Other Nervous System 2010-01-01       1.540251e+08
3    Brain and Other Nervous System 2011-01-01       1.603088e+08
4    Brain and Other Nervous System 2012-01-01       1.689569e+08
..                              ...        ...                ...
182                         Thyroid 2014-01-01       1.973038e+07
183                         Thyroid 2015-01-01       2.262130e+07
184                         Thyroid 2016-01-01       2.337942e+07
185                         Thyroid 2017-01-01       2.327830e+07
186                         Thyroid 2018-01-01       1.939083e+07

[187 rows x 3 columns]
In [ ]:
#using the last three values to predict the 2019 funding for different cancer.
last_3_periods = rolling_means.groupby('LeadingCancerSites').tail(3)
average_last_3_periods = last_3_periods.groupby('LeadingCancerSites').mean()
print(average_last_3_periods)
                                TotalFundedAmount
LeadingCancerSites                               
Brain and Other Nervous System       2.043956e+08
Breast                               5.379225e+08
Cervix Uteri                         6.548022e+07
Colon and Rectum                     2.167728e+08
Esophagus                            2.938585e+07
Gallbladder                          1.753896e+06
Kidney and Renal Pelvis              8.767646e+07
Larynx                               8.442467e+05
Liver                                7.418885e+07
Lung and Bronchus                    2.898038e+08
Melanoma of the Skin                 1.428203e+08
Myeloma                              5.372338e+07
Oral Cavity and Pharynx              3.389576e+06
Pancreas                             1.521435e+08
Prostate                             2.337465e+08
Stomach                              1.323088e+07
Thyroid                              2.201618e+07
<ipython-input-31-47f1b1b6e5a1>:3: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  average_last_3_periods = last_3_periods.groupby('LeadingCancerSites').mean()

Breast cancer is projected to receive the highest funding in 2019, amounting to over half a billion, followed by lung and prostate cancer in funding allocation.

  1. Regression Model

Before selecting a regression model, I would like to examine the correlation between the number of cancer cases and the funding received.

In [ ]:
df5.plot.scatter(x="Count", y="TotalFundedAmount",
                      cmap="plasma", alpha=.5);
/usr/local/lib/python3.10/dist-packages/pandas/plotting/_matplotlib/core.py:1259: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  scatter = ax.scatter(

Initially, there is a positive correlation between cancer counts and funding for each specific cancer. However, it becomes evident that cancer counts can be categorized into three groups: counts below 100,000, counts between 100,000 to approximately 225,000, and counts exceeding 225,000. Since the dataset contains only around 200 observations, I have decided to utilize a linear regression model.

In this model, the outcome variable is the funding for each type of cancer, and the independent variables include cancer count and annual GDP as a control variable. The choice of using cancer count instead of incidence rate is based on the fact that funding is also influenced by profit motives. Higher cancer counts indicate greater demand and potentially higher profits, making count a relevant predictor in this context.

In [ ]:
df_lm = pd.merge(funding5, df4, on=['LeadingCancerSites','Year'])
display(df_lm.head(10))
Year LeadingCancerSites TotalFundedAmount CrudeRate Count GDP (in Billion USD)
0 2008 Brain and Other Nervous System 153731417 7.2 21753 14712.8
1 2008 Breast 572597443 71.8 218323 14712.8
2 2008 Cervix Uteri 76788322 8.3 12756 14712.8
3 2008 Colon and Rectum 273721430 48.6 147709 14712.8
4 2008 Esophagus 22441827 5.4 16373 14712.8
5 2008 Gallbladder 975977 1.2 3567 14712.8
6 2008 Kidney and Renal Pelvis 86863366 17.5 53103 14712.8
7 2008 Larynx 244021 4.2 12652 14712.8
8 2008 Liver 74234589 6.8 20822 14712.8
9 2008 Lung and Bronchus 247569163 71.2 216423 14712.8
In [ ]:
import statsmodels.api as sm
df_lm_ind = df_lm.dropna()[['Count', 'Year', 'GDP (in Billion USD)']][df5['Year'].dt.year < 2019]
df_lm_target = df_lm.dropna()['TotalFundedAmount'][df5['Year'].dt.year < 2019]
df_lm_ind
Out[ ]:
Count Year GDP (in Billion USD)
0 21753 2008 14712.8
1 218323 2008 14712.8
2 12756 2008 14712.8
3 147709 2008 14712.8
4 16373 2008 14712.8
... ... ... ...
182 48422 2018 20611.9
183 54380 2018 20611.9
184 225721 2018 20611.9
185 24858 2018 20611.9
186 46728 2018 20611.9

187 rows × 3 columns

In [ ]:
X = df_lm_ind
y = df_lm_target

model = sm.OLS(y, X).fit()
predictions = model.predict(X) # make the predictions by the model
model.summary()
Out[ ]:
OLS Regression Results
Dep. Variable: TotalFundedAmount R-squared (uncentered): 0.862
Model: OLS Adj. R-squared (uncentered): 0.859
Method: Least Squares F-statistic: 381.8
Date: Sat, 18 Nov 2023 Prob (F-statistic): 9.87e-79
Time: 05:23:00 Log-Likelihood: -3647.4
No. Observations: 187 AIC: 7301.
Df Residuals: 184 BIC: 7310.
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Count 1642.0067 68.518 23.964 0.000 1506.824 1777.189
Year 1.478e+04 2.31e+04 0.640 0.523 -3.08e+04 6.04e+04
GDP (in Billion USD) -1320.3357 2706.871 -0.488 0.626 -6660.831 4020.160
Omnibus: 41.373 Durbin-Watson: 1.546
Prob(Omnibus): 0.000 Jarque-Bera (JB): 66.974
Skew: 1.166 Prob(JB): 2.86e-15
Kurtosis: 4.777 Cond. No. 469.


Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Using the coefficients(1642.0067***) derived from the Ordinary Least Squares (OLS) regression, I conducted an analysis to evaluate the association between cancer incidence and funding, which indicated a strong relationship.

Conclusions¶

For this project, I sourced cancer incidence data spanning from 1999 to 2020 from the Centers for Medicare & Medicaid Services (CMS), and cancer funding data covering 2008 to 2018 from the National Cancer Institute (NCI). I explored two primary inquiries. The initial query involved forecasting the funding allocation for various cancer types using a time series model, which predicted that breast cancer would continue to receive the most substantial funding in 2019. The second inquiry examined the potential correlation between cancer incidence and funding levels, where my analysis revealed a strong positive correlation.

Limitations¶

A significant challenge in my research is the limited number of data points available for cancer funding. Having access to funding data over a more extended period would enhance my ability to detect and analyze variations in cancer funding trends. One potential solution is shifting the focus from a year-cancer framework to a physician-cancer one. This means examining the funding received by different specialists or scientists within the same year, offering a more granular view of how cancer funding is distributed. This approach could reveal more nuanced changes in funding patterns.

Additionally, incorporating more detailed characteristics of cancer, such as the distribution of cases across different genders and races, as well as survival rates for each cancer type, could provide further insights. This expanded dataset would allow me to explore whether cancer funding allocation shows any disparities based on gender or race. Such an analysis could be crucial in understanding and addressing potential biases in funding distribution.