Problems with Data Analysis: Cognitive Bias
When trying to research anything, fact-checking is critical, and so is an honest assessment of what conclusions are drawn from those facts.
For instance, when measuring the density of aluminum at standard temperature and pressure, the results will be consistent regardless of where the measurement is taken, because the physical properties of aluminum at standard temperature and pressure do not vary with location. People, however, cannot be compared to static blocks of aluminum. They are diverse and are influenced by many complex variables that affect their drives, behaviors, and beliefs, and their environment, access to resources, and interactions with others can further modify or amplify those variables. Drawing conclusions from measurements of human behavior is therefore different from drawing conclusions from measurements of a block of aluminum. One or two variables may correlate strongly with a conclusion about human behavior, just as mass and volume together determine aluminum's density, but such data should still be analyzed further. The question is whether different conclusions would follow once the conditions under which the data was collected, or other factors that were not previously evaluated, are taken into account.
Two posts from Avi Chawla's Substack, "Daily Dose of Data Science," discuss the problem of looking at data in a vacuum more succinctly. Each is a one-minute read.
1. https://avichawla.substack.com/p/be-cautious-before-drawing-any-conclusions
2. https://avichawla.substack.com/p/why-correlation-and-other-statistics
Chat-GPT
I used the highly advertised Chat-GPT to investigate and analyze data and information on gun-related crime and death statistics in the United States. I do not, however, consider Chat-GPT to be the ultimate authority on truth. Chat-GPT 3.5 was trained on information up to September 2021 (as of the date this essay was released), so it may not be aware of significant events that occurred afterward.
As someone who remembers the days of painstakingly sifting through hardbound periodicals and journals in the basement of a college library to gather and cite information on a research topic, I appreciate that Google and the Internet have put all this investigative work at my fingertips now.
Chat-GPT might be the next evolution beyond search engines and public database access. However, it does not replace an individual's responsibility to critically analyze and think about the conclusions drawn from such data. It is essential to evaluate how trustworthy the source is and to exercise discretion when using the information Chat-GPT provides.
That being said, below is a quick conversation I used to try and parse out these child death statistics as well as acquire the source and code to re-analyze the data without simply taking Chat-GPT's word for it. I will state this for the record here...
"DO NOT TAKE THE FOLLOWING DATA AS GOSPEL FROM WHICH YOU SHOULD DRAW YOUR CURRENT BELIEFS ON THIS SUBJECT. RESEARCH AND THINK FOR YOURSELF"
Further, the point here was not to prove that the data or conclusions are correct. The point was to show how useful Chat-GPT can be for developing leads and hypotheses for further research, and for help with the tools needed to do that research (Python code).
The Conversation
ME:
What is the greatest cause of death of children in the United States?
Chat-GPT:
According to the Centers for Disease Control and Prevention (CDC), the leading cause of death for children in the United States between the ages of 1 and 19 is unintentional injuries, also known as accidents. This includes things like car accidents, drowning, burns, falls, and poisoning. Other causes of death in children include cancer, homicide, and congenital anomalies. It's important for parents and caregivers to take steps to prevent accidents, such as using car seats and seat belts, installing safety gates and window guards, and supervising children around water and other potential hazards.
ME:
What percentage of child deaths are the result of guns?
Chat-GPT:
According to data from the CDC, firearms are a leading cause of death for children in the United States. In 2019, there were a total of 3,672 firearm deaths among children and adolescents aged 0-19 years. Of those deaths, 2,702 were classified as firearm homicides, and 463 were classified as firearm suicides. In addition, 194 deaths were classified as unintentional firearm injuries, and 313 were of undetermined intent.
The percentage of child deaths that are the result of guns varies by age group. For children aged 0-14 years, firearms accounted for 14% of all deaths. For adolescents aged 15-19 years, firearms accounted for 33% of all deaths. It's important to note that these deaths are preventable, and measures such as safe storage of firearms and responsible gun ownership can help to reduce the number of deaths and injuries caused by guns among children and adolescents.
ME:
Yes, but what percentage of those deaths are the result of homicide
Chat-GPT:
According to the CDC data for 2019, of the 3,672 firearm deaths among children and adolescents aged 0-19 years, 2,702 were classified as firearm homicides. This means that approximately 74% of firearm deaths in this age group were the result of homicides. Homicide is the second leading cause of firearm deaths among children and adolescents aged 0-19 years, following a suicide.
It's important to note that firearm homicides disproportionately affect certain populations, particularly Black children and adolescents. In 2019, Black children and adolescents aged 0-19 years accounted for 42% of all firearm homicides in this age group, despite representing only 14% of the total U.S. population in this age range. Addressing the root causes of gun violence, such as systemic racism, poverty, and social inequality, is essential to reducing the number of firearm homicides and improving the safety and well-being of all children and adolescents.
ME:
If you consider children to be defined as ages between 0 - 16, what percentage of those deaths are the result of homicide?
Chat-GPT:
According to CDC data for 2019, if we consider children to be defined as ages between 0-16, then there were a total of 2,626 firearm deaths in this age group. Of those deaths, 1,846 were classified as firearm homicides. This means that approximately 70% of firearm deaths in this age group were the result of homicides.
It's important to note that the rate of firearm homicides among children and adolescents has been steadily increasing over the past decade, particularly among Black children and adolescents. Addressing the root causes of gun violence and implementing evidence-based interventions to prevent and reduce firearm-related injuries and deaths among children and adolescents should be a priority for public health and policy efforts.
ME:
Okay then, so this 1,846 number is what percentage of the 0 - 19 total child deaths?
Chat-GPT:
According to the CDC data for 2019, there were a total of 3,672 firearm deaths among children and adolescents aged 0-19 years. Of those deaths, 2,702 were classified as firearm homicides.
If we consider children to be defined as ages between 0-16 years, as you specified earlier, then the number of firearm homicides in this age group was 1,846, as I mentioned in my previous response.
To calculate the percentage of all child deaths that were the result of firearm homicides among children aged 0-16, we would divide 1,846 by the total number of deaths in this age group, which was 38,771 in 2019, according to the CDC. This gives us a percentage of approximately 4.8%.
However, it's important to note that this calculation only applies to the specific age group and period mentioned and that the number and proportion of firearm homicides among children can vary widely depending on factors such as geographic location, socioeconomic status, and other demographic factors.
ME:
Give me the top 15 major causes of death for all people
Chat-GPT:
Sure, here are the top 15 leading causes of death for all people in the United States based on the CDC's data for 2019:
1. Heart disease
2. Cancer
3. COVID-19
4. Unintentional injuries (e.g., accidents)
5. Stroke
6. Chronic lower respiratory disease
7. Alzheimer's disease
8. Diabetes
9. Influenza and pneumonia
10. Kidney disease
11. Suicide
12. Septicemia (blood infection)
13. Chronic liver disease and cirrhosis
14. Hypertension (high blood pressure)
15. Parkinson's disease
It's important to note that the ranking of causes of death can vary from year to year and can also be influenced by factors such as age, gender, and race/ethnicity. Additionally, many of these causes of death are preventable or can be managed through lifestyle changes and appropriate medical care.
ME:
Okay from that list, I want you to consider that I want to know what percentage of deaths among children age 0 -16 die from each of these causes. Can you provide me with a data frame or table that shows me the cause of death, the number of deaths, and then the percentage
Chat-GPT:
I'm sorry, but I need to clarify that the list I provided earlier includes the top 15 causes of death for all ages, not just for children aged 0-16. However, I can provide you with a table that shows the leading causes of death for children aged 0-16 in the United States based on the CDC's data for 2019:
It's important to note that some causes of death may be related to underlying health conditions or lifestyle factors that develop over time, such as heart disease and diabetes, while others, such as unintentional injuries and homicide, may be more preventable through safety measures and early intervention.
ME:
Re-generate this last table. However when it comes to homicide, please specify if the homicide involved a firearm, knife, or other.
Chat-GPT:
I apologize for the delay. Here is the updated table that includes the method of homicide for children aged 0-16 in the United States based on the CDC's data for 2019:
Please note that for homicides, the method specified by the CDC is based on the underlying cause of death code assigned to the death certificate. The category "Other" includes methods such as blunt force, suffocation, and drowning, among others.
ME:
Okay, you misunderstood me. I want the table to show homicide firearm, homicide other, and homicide knife. I want those percentages to be a percentage of the total deaths and not the percentage of homicide deaths.
Chat-GPT:
I apologize for the confusion. Here is the updated table that shows the percentages of all deaths among children aged 0-16 in the United States for each cause of death, including separate columns for homicides by firearm, homicides by knife, and homicides by other means:
I hope this table better meets your needs.
ME:
Can you provide me with another table like the one above that only breaks down the 'unintentional injuries' into its sub-categories?
Chat-GPT:
Sure, here is a table that breaks down the "unintentional injuries" category into its subcategories for children aged 0-16 in the United States:
I hope this table is helpful to you.
ME:
Can you provide Python code that will construct these two data frames without downloading the files from the CDC?
Chat-GPT:
Unfortunately, I cannot provide you with code that will construct these data frames without downloading the files from the CDC. The data is not readily available in a format that can be easily parsed and analyzed without downloading the files first.
However, I can provide you with the steps to download and preprocess the necessary files to create these data frames:
1. Download the National Vital Statistics System (NVSS) Mortality data for the relevant year(s) from the CDC website. The data can be found here: https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm#Mortality_Multiple
2. Load the data into a pandas DataFrame using the appropriate delimiter (i.e., '\t' for tab-separated values).
3. Filter the DataFrame to only include rows where the age group is 0-16 years old.
4. Filter the DataFrame to only include deaths that occurred in the United States.
5. Aggregate the data by cause of death and calculate the total number of deaths for each cause.
6. Calculate the percentage of deaths for each cause by dividing the number of deaths by the total number of deaths for the age group.
7. Repeat steps 1-6 for each table that you want to create.
I hope this helps you get started with creating these data frames.
ME:
Assuming I have the data downloaded to my desktop. Can you write the Python code to re-construct these data frames?
Chat-GPT:
Certainly! Here's an example of Python code to read in the NVSS Mortality data for a single year and create the first table:
Here's an example Python code to read in the NVSS Mortality data for a single year and create the second table:
Note that in both examples, you will need to replace 'path/to/mortality/data/VS21MORT.DUSMCPUB' with the actual file path to the NVSS Mortality data file that you downloaded. Additionally, you will need to modify the code to loop over multiple years of data if you want to create these tables for multiple years.
ME:
Would you agree, that based on the data you provided, the following conclusions would be approximately correct?
- If we are focused on homicide, particularly with guns, for kids sixteen years old and under, then we find that this accounts for only between 3-5% of child deaths.
- It would also indicate that the number of homicides in this group of children that do not include a gun outnumbers the gun homicides by 2:1.
- It would also indicate cancer kills 6 times as many children in this age group as gun homicides.
- It would also indicate as many children die from accidental suffocation as a gun homicide.
- It would indicate that twice as many children die from drowning as a gun homicide.
- It would mean that traffic fatalities and automobile accidents kill four times as many children as gun homicides.
Chat-GPT:
Yes, based on the data provided, those conclusions would be approximately correct. It is important to note that the data only covers a specific time frame and may not represent the current state of child mortality in the United States. Additionally, it is important to consider that every child's life lost is a tragedy regardless of the cause.
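The percentages quoted in the conversation can be re-derived from the counts Chat-GPT itself stated. The sketch below only checks that arithmetic. It assumes the counts are exactly as quoted above and does not verify them against the CDC; the variable names are chosen here for illustration.

```python
# Counts exactly as quoted by Chat-GPT in the conversation (NOT verified against the CDC).
firearm_deaths_0_19 = 3672
by_intent_0_19 = {
    "homicide": 2702,
    "suicide": 463,
    "unintentional": 194,
    "undetermined": 313,
}
firearm_deaths_0_16 = 2626
firearm_homicides_0_16 = 1846
all_deaths_0_16 = 38771


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# The four intent categories should add up to the stated total.
components_match = sum(by_intent_0_19.values()) == firearm_deaths_0_19

# Homicides as a share of firearm deaths, for each age definition.
homicide_share_0_19 = pct(by_intent_0_19["homicide"], firearm_deaths_0_19)
homicide_share_0_16 = pct(firearm_homicides_0_16, firearm_deaths_0_16)

# Firearm homicides as a share of ALL deaths among ages 0-16.
homicide_share_of_all_deaths_0_16 = pct(firearm_homicides_0_16, all_deaths_0_16)

print(components_match)
print(homicide_share_0_19, homicide_share_0_16, homicide_share_of_all_deaths_0_16)
```

The quoted counts are internally consistent and reproduce the roughly 74%, 70%, and 4.8% figures. The same counts do not support the remark in one response that homicide follows suicide as a cause of firearm deaths, since the quoted homicide count (2,702) is far larger than the quoted suicide count (463).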
Re-analyzing the data; hypothesis formation
When I check my emotions, the statement "The leading cause of child deaths in America is gun violence" brings to mind ten-year-old kids being killed in drive-by murders and school shootings. That emotion runs deep. This post is not meant to relieve me or anyone else of the horror and shock that seeing children murdered or assaulted should cause, nor of the will to act that those emotions impose. However, when we see only what is necessary to craft or reinforce the story we tell ourselves about what we see, a pseudo 'quality' can cement our conclusions in our minds and blind us to a better assessment of the facts.
Checking my Prejudice
Ok, let’s look at it another way. All I did was adjust the definition of children from 0-19 years old to 0-16. People would rightly say that I’m cherry-picking the data. Fair enough. But it could be argued that someone is also cherry-picking the data if they include 18-19-year-olds (legal adults) in the original data set that Chat-GPT referenced. This analysis also has not broken out the locations where these deaths occurred. Poverty-saturated areas versus others may change the picture considerably, as may differences in state laws, religious saturation among demographics, etc. The point here is that the data should be evaluated from all sides to get the best overall picture of what is actually there, instead of crafting a story that fills the gaps in my knowledge to coincide with my emotions.
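How much the age cutoff matters can be estimated from the two sets of counts quoted in the conversation. The sketch below subtracts the 0-16 figures from the 0-19 figures to isolate ages 17-19. It assumes Chat-GPT's counts are accurate, which has not been verified, and the dictionary layout is just an illustrative choice.

```python
# Firearm-death counts as quoted by Chat-GPT (unverified), keyed by age definition.
quoted = {
    "0-16": {"firearm_deaths": 2626, "firearm_homicides": 1846},
    "0-19": {"firearm_deaths": 3672, "firearm_homicides": 2702},
}

# Whatever falls in 0-19 but not in 0-16 belongs to ages 17-19.
band_17_19 = {
    key: quoted["0-19"][key] - quoted["0-16"][key]
    for key in ("firearm_deaths", "firearm_homicides")
}

# Share of each 0-19 total contributed by the three-year band 17-19.
share_17_19 = {
    key: round(100 * band_17_19[key] / quoted["0-19"][key], 1)
    for key in band_17_19
}

print(band_17_19)
print(share_17_19)
```

On these figures, ages 17-19 (three of the twenty single-year ages in the 0-19 range) account for more than a quarter of the firearm deaths, so the choice of cutoff moves the totals substantially in whichever direction the analyst prefers.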
The Main Points
To reiterate: data scientists and data analysts can suffer from similar problems when they over-fit a model so that it accurately back-tests a set of data. They can also over-simplify a model by correlating only the data that is consistent with their thesis, even if the conclusion is wrong. ("Newsflash: 99% of all people who died of cancer in 2019 drank water." This data might show a perfectly consistent correlation, but it would be false to say that drinking water, in general, was the actual cause.)
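One way to see why the water statistic proves nothing is to compare the people who died with everyone else. The numbers below are entirely made up for illustration (a hypothetical population of 100,000 in which 99% drink water); the point is only the comparison, not the values.

```python
# Hypothetical, made-up counts for illustration only.
population = 100_000
water_drinkers = 99_000                 # 99% of everyone drinks water
cancer_deaths = 1_000
cancer_deaths_water_drinkers = 990      # "99% of those who died drank water"

non_drinkers = population - water_drinkers
cancer_deaths_non_drinkers = cancer_deaths - cancer_deaths_water_drinkers

# Risk of dying of cancer within each group.
risk_drinkers = cancer_deaths_water_drinkers / water_drinkers
risk_non_drinkers = cancer_deaths_non_drinkers / non_drinkers

# A relative risk of 1.0 means the exposure makes no difference.
relative_risk = risk_drinkers / risk_non_drinkers

print(risk_drinkers, risk_non_drinkers, relative_risk)
```

In this made-up population the headline is true, yet the risk is the same in both groups: water drinkers dominate the deaths only because they dominate the population. A claim about a cause needs the comparison group, not just the share among the cases.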
Chat-GPT is not always correct. It did, however, quickly provide me with direction and insight on where to begin researching.
Chat-GPT was a powerful source for quickly parsing this information. More importantly, it provided me with the code and the source data that I could use to evaluate the data myself. This was important because it meant that I was not simply asking a mystical oracle to give me the answer to "life, the universe, and everything".
Further, it showed me that if I have already fact-checked sources of information, I can give that data back to Chat-GPT and ask it to analyze the information further for insights I may have overlooked. This is where Chat-GPT intrigues me the most.
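The re-analysis steps Chat-GPT listed in the conversation (filter by age, aggregate by cause, convert counts to percentages) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the code from the conversation: it uses plain Python rather than pandas so that it runs without dependencies, the records are made up, and the field names `age_years` and `cause` and the function `cause_table` are hypothetical and do not reflect the actual layout of the NVSS file.

```python
from collections import Counter

# Hypothetical, made-up records standing in for rows of a mortality file.
# The field names "age_years" and "cause" are assumptions for illustration,
# not the actual NVSS column layout.
records = [
    {"age_years": 3, "cause": "Unintentional injury"},
    {"age_years": 15, "cause": "Homicide (firearm)"},
    {"age_years": 18, "cause": "Homicide (firearm)"},
    {"age_years": 9, "cause": "Cancer"},
    {"age_years": 12, "cause": "Unintentional injury"},
    {"age_years": 45, "cause": "Heart disease"},
]


def cause_table(rows, min_age=0, max_age=16):
    """Count deaths by cause within an inclusive age range and add percentages."""
    in_range = [r for r in rows if min_age <= r["age_years"] <= max_age]
    counts = Counter(r["cause"] for r in in_range)
    total = sum(counts.values())
    # most_common() orders causes by count, largest first.
    return [
        {"cause": cause, "deaths": n, "percent": round(100 * n / total, 1)}
        for cause, n in counts.most_common()
    ]


for row in cause_table(records):
    print(row)
```

Calling `cause_table(records, max_age=19)` on the same rows gives a different ranking, which is the cutoff sensitivity discussed earlier. With real data, the percentages would only be as trustworthy as the age and cause fields parsed from the source file.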