This dataset, obtained from California Health and Human Services (CHHS), consists of 114,635 records and 22 columns detailing product names along with their reported hazardous chemicals. Key columns include the brand name, chemical name, and report date. According to the CHHS website, the dataset covers records from 2007 to the present, with the latest entry dated October 2023. My analysis applied three statistical methods: manual normalization of frequencies, kernel density estimation (KDE), and bivariate KDE.
For my analysis, I aimed to answer four questions:
First, how can the data be normalized to determine “common” chemicals among different product categories?
Second, what does the KDE of initial report dates versus most recent report dates show?
Third, what is the bivariate KDE relationship between the number of chemicals in a product and the report date?
Fourth, what is the KDE of the top 6 reported chemicals over time?
1. Stacked bar plot of normalized data to determine common chemicals:
The methodology consisted of grouping products by primary category and chemical, counting how often each chemical appears within its primary category, converting each count to a percentage by dividing it by the total chemical count for that category, and plotting these percentages to identify trends. This manual normalization is one way to define a “common” chemical quantitatively.
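The normalization steps can be sketched with pandas as follows. This is a minimal illustration on toy data, not the original analysis code: the column names PrimaryCategory and ChemicalName are assumptions about the dataset's schema, and the rows are invented.

```python
import pandas as pd

# Toy stand-in for the dataset; column names and rows are assumptions.
df = pd.DataFrame({
    "PrimaryCategory": ["Makeup", "Makeup", "Makeup", "Makeup", "Nail", "Nail"],
    "ChemicalName": ["Titanium dioxide", "Titanium dioxide", "Mica",
                     "Talc", "Titanium dioxide", "Toluene"],
})

# Frequency of each chemical within each primary category.
counts = (
    df.groupby(["PrimaryCategory", "ChemicalName"])
      .size()
      .reset_index(name="count")
)

# Divide each count by its category's total chemical count to get a percentage.
category_totals = counts.groupby("PrimaryCategory")["count"].transform("sum")
counts["percent"] = 100 * counts["count"] / category_totals

print(counts)
```

Because each percentage is relative to its own category's total, the values within a category sum to 100, which makes categories of very different sizes comparable.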
To prepare for visualization, the data was narrowed to the top 5 categories and top 3 chemicals by frequency percentage, and chemical names were shortened for clarity. A custom color map was created for the selected chemicals, and a pivot table made the stacked bar plot easier to build. The visualization aims to highlight the most common hazardous chemicals across product categories. Because each chemical keeps the same color in all 5 categories, the plot successfully conveys the normalized commonalities and differences.
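The pivot-and-plot step might look roughly like the sketch below. It assumes a long-format table of percentages already exists; the name percent_df, the column names, the values, and the color choices are all hypothetical, and the top-5/top-3 filtering is omitted for brevity.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import pandas as pd

# Hypothetical long-format percentages: one row per (category, chemical).
percent_df = pd.DataFrame({
    "PrimaryCategory": ["Makeup", "Makeup", "Nail", "Nail", "Skin", "Skin"],
    "ChemicalName": ["Titanium dioxide", "Silica", "Titanium dioxide",
                     "Silica", "Titanium dioxide", "Silica"],
    "percent": [80.0, 5.0, 60.0, 10.0, 70.0, 8.0],
})

# Pivot to one row per category and one column per chemical.
pivot = percent_df.pivot_table(
    index="PrimaryCategory", columns="ChemicalName",
    values="percent", aggfunc="sum", fill_value=0,
)

# Fixed chemical-to-color mapping so a chemical looks the same in every bar.
color_map = {"Titanium dioxide": "tab:blue", "Silica": "tab:orange"}
ax = pivot.plot(
    kind="bar", stacked=True,
    color=[color_map[name] for name in pivot.columns],
)
ax.set_ylabel("Share of category's chemical reports (%)")
ax.figure.savefig("stacked_bar.png")
```

Pivoting first means each column becomes one stacked segment, so the plotting call itself stays a single line.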
Key observations:
• Notably, titanium dioxide emerged as a prevalent chemical; it is classified as a possible carcinogen when inhaled in powder form.
• The second most common was crystalline silica, which can be a Group 1 carcinogen depending on its form.
2. KDE comparison of initial report dates and most recent report dates:
Using the seaborn KDE plot option, the two date columns were overlaid to compare trends. A KDE estimates the probability density function (PDF), which indicates the relative likelihood of a continuous variable taking on a given value. Unlike the manually normalized distribution above, which is a percentage, the KDE's unit is a density: probability per unit of the x-axis variable. The total area under each variable's curve therefore equals 1, or 100%. KDE smooths the data by placing a kernel on each observation, and the width of that kernel is set by the bandwidth, which can be scaled manually with bw_adjust. Applying this smoothing well depends not only on choosing a bandwidth appropriate for the visualization but also on the variables being continuous, and therefore infinitely divisible.
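To make the density and bandwidth ideas concrete, here is a hand-rolled Gaussian KDE in NumPy. It is a sketch for intuition only and is not how seaborn is called: the function gaussian_kde_1d, the sample dates, and the two bandwidths are invented for illustration. In seaborn, bw_adjust multiplies an automatically chosen bandwidth, whereas here the bandwidth is set directly.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian kernels of width `bandwidth`, one per sample."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(samples) * bandwidth)

# Made-up report dates expressed as fractional years.
samples = np.array([2009.5, 2010.0, 2010.2, 2017.8, 2018.0, 2018.3, 2018.4])
grid = np.linspace(2000, 2028, 2801)
dx = grid[1] - grid[0]

narrow = gaussian_kde_1d(samples, grid, bandwidth=0.5)  # less smoothing
wide = gaussian_kde_1d(samples, grid, bandwidth=2.0)    # more smoothing

# Riemann sums approximating the area under each curve.
area_narrow = narrow.sum() * dx
area_wide = wide.sum() * dx
print(round(area_narrow, 3), round(area_wide, 3))
```

Both areas come out at approximately 1 regardless of bandwidth. What changes is the shape: the narrow bandwidth gives taller, sharper peaks, while the wide one spreads the same total probability over a flatter curve.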
Key observations:
• There have been fewer chemical reports since the 2018 peak, as indicated by the right tail.
• The two date columns overlap heavily, which is logical: initial report dates and follow-up report dates spike and drop around similar times.
• Around 2016 both report dates begin a gradual buildup, coinciding with the rise of makeup and beauty influencers on social media.
• The website notes that not all products containing toxicants may be included, due to companies failing to report. The tail and drops after 2018 therefore may not indicate a true decline in hazardous chemical usage.
3. Bivariate KDE of chemical count by product and initial report date:
A bivariate KDE plot was used to determine whether products have reported fewer, more, or the same number of chemicals over time. Like a univariate KDE, the bivariate KDE is a density plot, but each axis holds its own variable, treated as continuous, so the plotted value is a joint density showing where combinations of the two variables concentrate. The x-axis is the chemical creation date as a year, and the y-axis is the number of chemicals counted in a product. A gradient bar was added on the right as a legend.
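As a rough illustration of what a joint density is, the sketch below evaluates a two-dimensional Gaussian KDE on a grid using NumPy. It is a simplification under stated assumptions: the function gaussian_kde_2d, the data points, and the per-axis bandwidths are invented, and seaborn selects its bandwidths differently than this fixed product kernel does.

```python
import numpy as np

def gaussian_kde_2d(x, y, grid_x, grid_y, bw_x, bw_y):
    """Joint density on a grid from a product of two 1-D Gaussian kernels."""
    zx = (grid_x[:, None] - x[None, :]) / bw_x  # shape (len(grid_x), n)
    zy = (grid_y[:, None] - y[None, :]) / bw_y  # shape (len(grid_y), n)
    kx = np.exp(-0.5 * zx**2) / (np.sqrt(2 * np.pi) * bw_x)
    ky = np.exp(-0.5 * zy**2) / (np.sqrt(2 * np.pi) * bw_y)
    # density[i, j] is the mean over samples of kx[i, s] * ky[j, s].
    return kx @ ky.T / len(x)

# Made-up (year, chemicals per product) pairs.
years = np.array([2009, 2009, 2010, 2012, 2017, 2018, 2018, 2019], dtype=float)
chem_counts = np.array([1, 2, 1, 1, 2, 1, 1, 3], dtype=float)

grid_x = np.linspace(2000, 2028, 281)
grid_y = np.linspace(-5, 10, 301)
density = gaussian_kde_2d(years, chem_counts, grid_x, grid_y, bw_x=1.0, bw_y=0.5)

# The joint density integrates to roughly 1 over the whole plane.
dx = grid_x[1] - grid_x[0]
dy = grid_y[1] - grid_y[0]
volume = density.sum() * dx * dy
print(round(volume, 3))
```

The color gradient in a bivariate KDE plot corresponds to the values in an array like density: darker regions are (year, count) combinations where observations cluster.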
Key observations:
• Overall, products have primarily reported 1-2 hazardous chemicals each from 2006 to 2022, as shown by the long KDE bands.
• Over time, the two concentrated bands in the number of chemicals reported per product parallel the two peaks in the KDE of initial and follow-up report dates from the previous plot.
4. KDE of the top 6 reported chemicals over time:
To understand the temporal trends of the 6 most frequently reported chemicals, a KDE subplot was created for each. The dataset was queried to determine the top 6 by number of occurrences and then filtered to those chemicals. Next, the year was extracted from the initial report date column and added as a new column to use as the x-axis. The plot generation relied on two key pieces: a for loop that draws the KDE plot for each chemical, and flatten(), which turns the grid of subplot axes into a one-dimensional array so it is easier to step through. I kept the long original version of each chemical name because the subplot layout could handle them.
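The filtering and subplot loop could be sketched as below. This is an illustration on randomly generated data, not the original code: the column names ChemicalName and InitialDateReported are assumed, the chemical labels are placeholders, and a small fixed-bandwidth NumPy KDE stands in for the seaborn plotting call.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in data; names and dates are placeholders.
rng = np.random.default_rng(0)
names = [f"Chemical {c}" for c in "ABCDEFGH"]
df = pd.DataFrame({
    "ChemicalName": rng.choice(names, size=400),
    "InitialDateReported": pd.to_datetime("2009-01-01")
    + pd.to_timedelta(rng.integers(0, 5000, size=400), unit="D"),
})

# Top 6 chemicals by number of occurrences, then filter to those rows.
top6 = df["ChemicalName"].value_counts().head(6).index
subset = df[df["ChemicalName"].isin(top6)].copy()

# Extract the year as a new column for the x-axis.
subset["Year"] = subset["InitialDateReported"].dt.year

# 2x3 grid of axes, flattened to 1-D so it can be paired with the chemicals.
fig, axes = plt.subplots(2, 3, figsize=(12, 6), sharex=True)
grid = np.linspace(2005, 2026, 200)
for ax, chem in zip(axes.flatten(), top6):
    years = subset.loc[subset["ChemicalName"] == chem, "Year"].to_numpy(dtype=float)
    # Gaussian KDE with a fixed 1-year bandwidth (an arbitrary choice).
    z = grid[:, None] - years[None, :]
    density = np.exp(-0.5 * z**2).sum(axis=1) / (len(years) * np.sqrt(2 * np.pi))
    ax.plot(grid, density)
    ax.set_title(chem)
fig.tight_layout()
fig.savefig("top6_kde.png")
```

Without flatten(), axes would be a 2x3 array that needs row and column indices; flattening lets a single loop pair each chemical with one subplot.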
Key observations:
• Unlike the previous KDE plots, the per-chemical plots align less closely with the common peaks and concentration bands around 2009 and 2018.
• Chemicals like titanium dioxide and mica show more consistent reporting over time, which suggests relatively steady usage in products.
• The most peculiar plot is that of butylated hydroxyanisole, whose pronounced peaks may indicate changes in industry standards or regulations.
The Code
Determining the frequency as a percentage for each primary category
Visualization 1, stacked bar plot of normalized data
Visualization 2, KDE of two date variables
Visualization 3, bivariate KDE of chemical count and creation date
Visualization 4, KDE of top 6 chemicals
Data Sources:
https://data.chhs.ca.gov/dataset/chemicals-in-cosmetics
https://www.cancer.org/cancer/risk-prevention/understanding-cancer-risk/known-and-probable-human-carcinogens.html