Matplotlib - The Power of Plots

"Visual storytelling of one kind or another has been around since cavemen were drawing on the walls." Frank Darabont

Background

This repository applies Python's Matplotlib library to visualize real-world pharmaceutical data. The data is sourced from Pymaceuticals Inc., a burgeoning pharmaceutical company based out of San Diego. Pymaceuticals specializes in anti-cancer pharmaceuticals. In its most recent efforts, it began screening for potential treatments for squamous cell carcinoma (SCC), a commonly occurring form of skin cancer.

The analysis uses the complete data from the company's most recent animal study, supplied as two datasets in CSV format. The first, Mouse_metadata.csv, describes the 249 mice with SCC tumor growth that were treated with a variety of drug regimens, and records each mouse's Sex, Age_months, and Weight (g). The second, Study_results.csv, holds the results of the study in the columns Mouse ID, Timepoint, Tumor Volume (mm3), and Metastatic Sites.

The purpose of this study was to compare the performance of Pymaceuticals' drug of interest, Capomulin, against the other treatment regimens. The analysis also generated all of the tables and figures needed for the technical and top-level summary reports of the study. Both datasets were imported, merged, and cleaned; the aggregated data was displayed in Pandas DataFrames and visualized with Matplotlib, and other libraries were used for the statistical analysis. The project was carried out in a Jupyter notebook, and the following link was created to showcase and communicate the analysis report: Jupyter Notebook Viewer

Observable Trends

The bar graph showed that the Capomulin drug regimen has the largest number of mice (230), while Zoniferol has a smaller number (182). After removing duplicates, the total number of mice is 248. The count of mice by gender showed 124 female mice and 125 male mice.
The correlation between mouse weight and average tumor volume is 0.84. This is a strong positive correlation: as mouse weight increases, the average tumor volume also increases.
The regression analysis shows how much the average tumor volume (the dependent variable) changes when the weight of the mice (the independent variable) changes. The R-squared value is 0.70, which means the model explains about 70% of the variation in average tumor volume around its mean, a fairly good fit for predicting from the model. Higher R-squared values represent smaller differences between the observed data and the fitted values.
Of the selected treatments, Capomulin and Ramicane reduce the size of tumors best.
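
The link between the correlation coefficient and R-squared mentioned above can be checked by hand. The following is a minimal sketch in plain Python, not the notebook's code: the helper names `pearson_r` and `r_squared` and the sample numbers are made up for illustration. It shows that, for a straight-line fit with one predictor, R-squared equals the square of the Pearson correlation.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def r_squared(xs, ys):
    """R-squared of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical weights (g) and average tumor volumes (mm3), illustration only.
weights = [15, 17, 19, 21, 23, 25]
volumes = [36.2, 37.0, 38.5, 40.1, 41.0, 44.8]

r = pearson_r(weights, volumes)
# The two quantities agree up to floating-point rounding.
print(abs(r ** 2 - r_squared(weights, volumes)) < 1e-9)
# Squaring the reported correlation of 0.84 gives roughly 0.71.
print(round(0.84 ** 2, 2))
```

This is why a correlation of 0.84 and an R-squared of about 0.70 describe the same fit from two angles.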

Solutions

Data Cleaning

The data was loaded, read, and combined, duplicates were removed, and the head (the top 5 rows) of the cleaned data looks as follows:

	Mouse ID	Drug Regimen	Sex	Age_months	Weight (g)	Timepoint	Tumor Volume (mm3)	Metastatic Sites
0	k403	Ramicane	Male	21	16	0	45.000000	0
1	k403	Ramicane	Male	21	16	5	38.825898	0
2	k403	Ramicane	Male	21	16	10	35.014271	1
3	k403	Ramicane	Male	21	16	15	34.223992	1
4	k403	Ramicane	Male	21	16	20	32.997729	1
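
The cleaning steps described above might look roughly like the sketch below. It is an illustration under assumptions, not the notebook's code: the two small DataFrames stand in for Mouse_metadata.csv and Study_results.csv with made-up values, and it assumes that a "duplicate" means a mouse with more than one row for the same Mouse ID and Timepoint, and that every row for such a mouse is dropped.

```python
import pandas as pd

# Tiny stand-ins for Mouse_metadata.csv and Study_results.csv (values are made up).
mouse_metadata = pd.DataFrame({
    "Mouse ID": ["a1", "b2", "c3"],
    "Drug Regimen": ["Capomulin", "Ramicane", "Placebo"],
    "Sex": ["Female", "Male", "Male"],
})
study_results = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "b2", "b2", "c3", "c3", "c3"],
    "Timepoint": [0, 5, 0, 5, 0, 0, 5],
    "Tumor Volume (mm3)": [45.0, 43.1, 45.0, 42.0, 45.0, 45.0, 47.2],
})

# Attach the per-mouse metadata to every measurement.
combined = pd.merge(mouse_metadata, study_results, on="Mouse ID", how="left")

# A mouse is suspect if any (Mouse ID, Timepoint) pair appears more than once.
dup_mask = combined.duplicated(subset=["Mouse ID", "Timepoint"], keep=False)
dup_ids = combined.loc[dup_mask, "Mouse ID"].unique()

# Drop every row belonging to a suspect mouse.
cleaned = combined[~combined["Mouse ID"].isin(dup_ids)].reset_index(drop=True)

print(list(dup_ids))
print(cleaned["Mouse ID"].nunique())
```

Dropping the whole mouse rather than only the repeated rows is one possible reading; it is consistent with the mouse count falling from 249 to 248.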

Summary statistics

A summary statistics table was generated using two techniques: the first creates multiple Series and puts them all together at the end, and the second produces everything with a single groupby function. The summary statistics table consists of the mean, median, variance, standard deviation, and SEM of the tumor volume for each drug regimen. The summary statistics table looks as follows:

	Mean	Median	Variance	Standard Deviation	SEM
Drug Regimen
Capomulin	40.675741	41.557809	24.947764	4.994774	0.329346
Ceftamin	52.591172	51.776157	39.290177	6.268188	0.469821
Infubinol	52.884795	51.820584	43.128684	6.567243	0.492236
Ketapril	55.235638	53.698743	68.553577	8.279709	0.603860
Naftisol	54.331565	52.509285	66.173479	8.134708	0.596466
Placebo	54.033581	52.288934	61.168083	7.821003	0.581331
Propriva	52.320930	50.446266	43.852013	6.622085	0.544332
Ramicane	40.216745	40.673236	23.486704	4.846308	0.320955
Stelasyn	54.233149	52.431737	59.450562	7.710419	0.573111
Zoniferol	53.236507	51.818479	48.533355	6.966589	0.516398
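
The two techniques for building this table can be sketched as follows. This is a hedged illustration on a made-up six-row dataset; the variable names (`tumor_by_regimen`, `summary_series`, `summary_agg`) are hypothetical and not taken from the notebook.

```python
import pandas as pd

# Made-up measurements for two regimens, illustration only.
data = pd.DataFrame({
    "Drug Regimen": ["Capomulin"] * 3 + ["Placebo"] * 3,
    "Tumor Volume (mm3)": [45.0, 43.0, 41.0, 45.0, 48.0, 51.0],
})
tumor_by_regimen = data.groupby("Drug Regimen")["Tumor Volume (mm3)"]

# Technique 1: one Series per statistic, assembled into a table at the end.
summary_series = pd.DataFrame({
    "Mean": tumor_by_regimen.mean(),
    "Median": tumor_by_regimen.median(),
    "Variance": tumor_by_regimen.var(),
    "Standard Deviation": tumor_by_regimen.std(),
    "SEM": tumor_by_regimen.sem(),
})

# Technique 2: a single aggregation call, then friendlier column names.
summary_agg = tumor_by_regimen.agg(["mean", "median", "var", "std", "sem"])
summary_agg.columns = ["Mean", "Median", "Variance", "Standard Deviation", "SEM"]

print(summary_agg)
```

Both tables hold the same numbers; the single `agg` call is shorter, while the Series-by-Series version makes each statistic easy to inspect on its own.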

Bar and Pie Charts

Two identical bar charts were generated, one with Pandas's DataFrame.plot() and one with Matplotlib's pyplot, showing the total number of mice for each treatment regimen throughout the course of the study.

The bar charts look as follows:

Bar Chart on the Number of Mice per Treatment (Pandas's `DataFrame.plot()`)

Bar Chart on the Number of Mice per Treatment (Matplotlib's `pyplot`)

Two identical pie plots were generated, one with Pandas's DataFrame.plot() and one with Matplotlib's pyplot, showing the distribution of female versus male mice in the study.

Pie Chart on the distribution of female or male mice in the study (Pandas's `DataFrame.plot()`)

Pie Chart on the distribution of female or male mice in the study (Matplotlib's `pyplot`)
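
To make the difference between the two plotting routes concrete, here is a small sketch that draws the same bar and pie charts both ways on one figure. It is an assumption-laden illustration: only the counts quoted in Observable Trends are used, the titles are invented, and the Matplotlib side calls methods on the Axes objects (the object-oriented counterpart of `plt.bar` and `plt.pie`) so that all four charts can share a figure.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Counts quoted in Observable Trends; the other regimens are omitted here.
regimen_counts = pd.Series({"Capomulin": 230, "Zoniferol": 182})
sex_counts = pd.Series({"Male": 125, "Female": 124})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Pandas route: the Series draws itself onto the given Axes.
regimen_counts.plot(kind="bar", ax=axes[0, 0], title="Bar (pandas)")
sex_counts.plot(kind="pie", ax=axes[1, 0], autopct="%1.1f%%", title="Pie (pandas)")

# Matplotlib route: labels and values are passed explicitly.
axes[0, 1].bar(regimen_counts.index, regimen_counts.values)
axes[0, 1].set_title("Bar (Matplotlib)")
axes[1, 1].pie(sex_counts.values, labels=sex_counts.index, autopct="%1.1f%%")
axes[1, 1].set_title("Pie (Matplotlib)")

fig.tight_layout()
```

The pandas route is more compact when the data already lives in a Series, while the Matplotlib route gives direct control over every argument.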

Quartiles, Outliers and Boxplots

The final tumor volume of each mouse was calculated across four of the most promising treatment regimens: Capomulin, Ramicane, Infubinol, and Ceftamin. The quartiles, IQR, and potential outliers were then determined quantitatively for all four treatment regimens.
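
The steps described above are the same for every regimen, so they can be sketched once as two small helpers. This is an illustration, not the notebook's code: `final_tumor_volumes` and `iqr_bounds` are hypothetical names, the data is made up, and the last row per mouse is picked with `idxmax` rather than by merging on the maximum timepoint.

```python
import pandas as pd

def final_tumor_volumes(data, regimen):
    """Tumor volume at each mouse's last recorded timepoint for one regimen."""
    subset = data[data["Drug Regimen"] == regimen]
    last_rows = subset.loc[subset.groupby("Mouse ID")["Timepoint"].idxmax()]
    return last_rows["Tumor Volume (mm3)"]

def iqr_bounds(values):
    """Return (lower quartile, upper quartile, IQR, lower bound, upper bound)."""
    lowerq, upperq = values.quantile(0.25), values.quantile(0.75)
    iqr = upperq - lowerq
    return lowerq, upperq, iqr, lowerq - 1.5 * iqr, upperq + 1.5 * iqr

# Made-up measurements: five mice, two timepoints each.
data = pd.DataFrame({
    "Mouse ID": ["m1", "m1", "m2", "m2", "m3", "m3", "m4", "m4", "m5", "m5"],
    "Drug Regimen": ["Capomulin"] * 10,
    "Timepoint": [0, 5] * 5,
    "Tumor Volume (mm3)": [45.0, 38.0, 45.0, 39.0, 45.0, 40.0, 45.0, 41.0, 45.0, 60.0],
})

for regimen in ["Capomulin"]:
    finals = final_tumor_volumes(data, regimen)
    lowerq, upperq, iqr, lower_bound, upper_bound = iqr_bounds(finals)
    # Anything outside 1.5 * IQR from the quartiles is flagged as a potential outlier.
    outliers = finals[(finals < lower_bound) | (finals > upper_bound)]
    print(regimen, lowerq, upperq, iqr, list(outliers))
```

With the real data, the loop would run over all four regimen names instead of the single one used here.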

Capomulin Final Tumor Volume

	Mouse ID	Timepoint	Drug Regimen	Sex	Age_months	Weight (g)	Tumor Volume (mm3)	Metastatic Sites
0	b128	45	Capomulin	Female	9	22	38.982878	2
1	b742	45	Capomulin	Male	7	21	38.939633	0
2	f966	20	Capomulin	Male	16	17	30.485985	0
3	g288	45	Capomulin	Male	3	19	37.074024	1
4	g316	45	Capomulin	Female	22	22	40.159220	2

Capomulin Quartiles and IQR

# Final tumor volumes for the Capomulin regimen
Capomulin_tumors = Capomulin_merge["Tumor Volume (mm3)"]

# Quartiles and interquartile range
quartiles = Capomulin_tumors.quantile([.25, .5, .75])
lowerq = quartiles[0.25]
upperq = quartiles[0.75]
iqr = upperq - lowerq

print(f"The lower quartile of Capomulin tumors: {lowerq}")
print(f"The upper quartile of Capomulin tumors: {upperq}")
print(f"The interquartile range of Capomulin tumors: {iqr}")
print(f"The median of Capomulin tumors: {quartiles[0.5]}")

The output looks as follows:

Capomulin Outliers using upper and lower bounds

lower_bound = lowerq - (1.5*iqr)
upper_bound = upperq + (1.5*iqr)

print(f"Values below {lower_bound} could be outliers.")
print(f"Values above {upper_bound} could be outliers.")

The output looks as follows:

Ramicane Final Tumor Volume

	Mouse ID	Timepoint	Drug Regimen	Sex	Age_months	Weight (g)	Tumor Volume (mm3)	Metastatic Sites
0	a411	45	Ramicane	Male	3	22	38.407618	1
1	a444	45	Ramicane	Female	10	25	43.047543	0
2	a520	45	Ramicane	Male	13	21	38.810366	1
3	a644	45	Ramicane	Female	7	17	32.978522	1
4	c458	30	Ramicane	Female	23	20	38.342008	2

Ramicane Quartiles and IQR

# Final tumor volumes for the Ramicane regimen
Ramicane_tumors = Ramicane_merge["Tumor Volume (mm3)"]

# Quartiles and interquartile range
quartiles = Ramicane_tumors.quantile([.25, .5, .75])
lowerq = quartiles[0.25]
upperq = quartiles[0.75]
iqr = upperq - lowerq

print(f"The lower quartile of Ramicane tumors is: {lowerq}")
print(f"The upper quartile of Ramicane tumors is: {upperq}")
print(f"The interquartile range of Ramicane tumors is: {iqr}")
print(f"The median of Ramicane tumors is: {quartiles[0.5]}")

The output looks as follows:

Ramicane Outliers using upper and lower bounds

lower_bound = lowerq - (1.5*iqr)
upper_bound = upperq + (1.5*iqr)

print(f"Values below {lower_bound} could be outliers.")
print(f"Values above {upper_bound} could be outliers.")

The output looks as follows:

Infubinol Final Tumor Volume

	Mouse ID	Timepoint	Drug Regimen	Sex	Age_months	Weight (g)	Tumor Volume (mm3)	Metastatic Sites
0	a203	45	Infubinol	Female	20	23	67.973419	2
1	a251	45	Infubinol	Female	21	25	65.525743	1
2	a577	30	Infubinol	Female	6	25	57.031862	2
3	a685	45	Infubinol	Male	8	30	66.083066	3
4	c139	45	Infubinol	Male	11	28	72.226731	2

Infubinol Quartiles and IQR

# Last recorded timepoint for each Infubinol mouse
Infubinol_last = Infubinol_df.groupby('Mouse ID')['Timepoint'].max()
Infubinol_vol = pd.DataFrame(Infubinol_last)
# Look up the measurements taken at that final timepoint
Infubinol_merge = pd.merge(Infubinol_vol, Combined_data, on=["Mouse ID", "Timepoint"], how="left")
Infubinol_merge.head()

The output looks as follows:

Infubinol Outliers using upper and lower bounds

lower_bound = lowerq - (1.5*iqr)
upper_bound = upperq + (1.5*iqr)

print(f"Values below {lower_bound} could be outliers.")
print(f"Values above {upper_bound} could be outliers.")

The output looks as follows:

Ceftamin Final Tumor Volume

	Mouse ID	Timepoint	Drug Regimen	Sex	Age_months	Weight (g)	Tumor Volume (mm3)	Metastatic Sites
0	a275	45	Ceftamin	Female	20	28	62.999356	3
1	b447	0	Ceftamin	Male	2	30	45.000000	0
2	b487	25	Ceftamin	Female	6	28	56.057749	1
3	b759	30	Ceftamin	Female	12	25	55.742829	1
4	f436	15	Ceftamin	Female	3	25	48.722078	2

Ceftamin Quartiles and IQR

# Final tumor volumes for the Ceftamin regimen
Ceftamin_tumors = Ceftamin_merge["Tumor Volume (mm3)"]

# Quartiles and interquartile range
quartiles = Ceftamin_tumors.quantile([.25, .5, .75])
lowerq = quartiles[0.25]
upperq = quartiles[0.75]
iqr = upperq - lowerq

print(f"The lower quartile of Ceftamin tumors is: {lowerq}")
print(f"The upper quartile of Ceftamin tumors is: {upperq}")
print(f"The interquartile range of Ceftamin tumors is: {iqr}")
print(f"The median of Ceftamin tumors is: {quartiles[0.5]}")

The output looks as follows:

Ceftamin Outliers using upper and lower bounds

lower_bound = lowerq - (1.5*iqr)
upper_bound = upperq + (1.5*iqr)

print(f"Values below {lower_bound} could be outliers.")
print(f"Values above {upper_bound} could be outliers.")

The output looks as follows:

Box and Whisker Plot

A box and whisker plot of the final tumor volume for all four treatment regimens was generated, with potential outliers highlighted using color and style.
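
One way to highlight outliers by color and style is Matplotlib's `flierprops` argument. The sketch below is an illustration with made-up final tumor volumes, not the notebook's plot; the marker choices are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up final tumor volumes (mm3) per regimen, illustration only.
final_volumes = {
    "Capomulin": [38.0, 39.0, 40.0, 41.0, 42.0],
    "Ramicane": [36.0, 37.5, 38.0, 39.0, 40.5],
    "Infubinol": [36.3, 60.0, 62.0, 64.0, 66.0, 68.0],
    "Ceftamin": [55.0, 58.0, 60.0, 62.0, 64.0],
}

# Style applied to points beyond 1.5 * IQR from the box (the "fliers").
flier_style = dict(marker="o", markerfacecolor="red", markersize=10,
                   markeredgecolor="black")

fig, ax = plt.subplots(figsize=(8, 5))
result = ax.boxplot(list(final_volumes.values()), flierprops=flier_style)

# Label each box with its regimen name.
ax.set_xticks(range(1, len(final_volumes) + 1))
ax.set_xticklabels(list(final_volumes))
ax.set_ylabel("Final Tumor Volume (mm3)")

# Number of flagged points per regimen, in plotting order.
print([len(flier.get_ydata()) for flier in result["fliers"]])
```

In this made-up data only the low Infubinol value falls outside the whiskers, so it is the only point drawn in red.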

The box and whisker plot looks as follows:

Line and Scatter Plots

Line Plot

A mouse treated with Capomulin (b742) was selected, and a line plot of time point versus tumor volume was generated for that mouse.
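
A minimal sketch of such a plot is shown below. The measurements are invented, the second mouse ID (`x000`) is hypothetical, and the variable names are not taken from the notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up measurements for mouse b742 and one other mouse.
data = pd.DataFrame({
    "Mouse ID": ["b742"] * 4 + ["x000"] * 2,
    "Timepoint": [0, 5, 10, 15, 0, 5],
    "Tumor Volume (mm3)": [45.0, 41.5, 40.2, 39.0, 45.0, 46.1],
})

# Keep only the selected mouse, ordered by time.
mouse = data[data["Mouse ID"] == "b742"].sort_values("Timepoint")

fig, ax = plt.subplots()
ax.plot(mouse["Timepoint"], mouse["Tumor Volume (mm3)"], marker="o")
ax.set_xlabel("Timepoint")
ax.set_ylabel("Tumor Volume (mm3)")
ax.set_title("Capomulin treatment of mouse b742")
```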

The line plot looks as follows:

Scatter Plot

A scatter plot of mouse weight versus average tumor volume for the Capomulin treatment regimen was created.
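
The scatter plot needs one point per mouse, so the repeated measurements have to be averaged first. A sketch of that step, assuming made-up Capomulin rows and hypothetical variable names:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up Capomulin rows: each mouse has a fixed weight and several measurements.
capomulin = pd.DataFrame({
    "Mouse ID": ["m1", "m1", "m2", "m2", "m3", "m3"],
    "Weight (g)": [17, 17, 21, 21, 25, 25],
    "Tumor Volume (mm3)": [45.0, 39.0, 45.0, 41.0, 45.0, 44.0],
})

# One row per mouse: its weight and its average tumor volume.
avg_per_mouse = capomulin.groupby("Mouse ID")[["Weight (g)", "Tumor Volume (mm3)"]].mean()

fig, ax = plt.subplots()
ax.scatter(avg_per_mouse["Weight (g)"], avg_per_mouse["Tumor Volume (mm3)"])
ax.set_xlabel("Weight (g)")
ax.set_ylabel("Average Tumor Volume (mm3)")

print(avg_per_mouse)
```

Because a mouse's weight is constant across its rows, averaging the weight column simply returns that weight.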

The scatter plot looks as follows:

Correlation and Regression

A correlation coefficient and a linear regression model were calculated for mouse weight and average tumor volume under the Capomulin treatment. The linear regression model was then plotted on top of the previous scatter plot.

Correlation

import scipy.stats as st

# Pearson correlation between weight and average tumor volume, rounded to 2 decimals
corr = round(st.pearsonr(avg_capm_vol['Weight (g)'], avg_capm_vol['Tumor Volume (mm3)'])[0], 2)
print(f"The correlation between mouse weight and average tumor volume is {corr}")

The correlation output looks as follows: ![Correlation Coefficient Output](Images/correlation coefficient.png)

Regression

from scipy.stats import linregress

x_values = avg_capm_vol['Weight (g)']
y_values = avg_capm_vol['Tumor Volume (mm3)']

# Fit the least-squares line and compute the fitted values
(slope, intercept, rvalue, pvalue, stderr) = linregress(x_values, y_values)
regress_values = x_values * slope + intercept

print(f"slope:{slope}")
print(f"intercept:{intercept}")
print(f"rvalue (Correlation coefficient):{rvalue}")
print(f"pearsonr (Correlation coefficient):{corr}")
print(f"stderr:{stderr}")

# Equation of the fitted line, used to annotate the plot
line_eq = "y = " + str(round(slope,2)) + "x + " + str(round(intercept,2))

print(line_eq)

The linear regression output looks as follows:

Adding a linear regression line to the scatter plot

fig1, ax1 = plt.subplots(figsize=(15, 10))
# Scatter of the per-mouse averages with the fitted line drawn over it
ax1.scatter(x_values, y_values, s=175, color="blue")
ax1.plot(x_values, regress_values, "r-")
ax1.set_xlabel('Weight (g)', fontsize=14)
ax1.set_ylabel('Average Tumor Volume (mm3)', fontsize=14)
# Place the regression equation near the top right of the axes
ax1.annotate(line_eq, xy=(20, 40), xycoords='data', xytext=(0.8, 0.95), textcoords='axes fraction',
             horizontalalignment='right', verticalalignment='top', fontsize=30, color="red")
fig1.savefig("../Images/linear_regression.png", bbox_inches="tight")
plt.show()

The linear regression plot looks as follows:

ermiasgelaye/Matplotlib-Challenge
