Project-Insurance
Project Background
Health insurance deserves careful consideration because it is part of planning for future needs. Health insurance users pay a regular amount of money (the premium) to the insurance company, which uses the premiums to pay the health bills of insured users. Determining the premium is a challenge for the insurance company because many factors can affect and increase a user's risk profile. In this project, you will be asked to help analyze the variables that are related to the health bills received by each user. You will be given data containing each user's personal details, such as age, gender, region of residence, number of children insured, BMI value, and whether or not the user smokes.
Analysis Instructions
Using basic probability and statistics, you are expected to analyze the data rigorously to find the variables related to health bills.
Step #1 - Descriptive Statistic Analysis
We start the analysis with the basics: summarizing the characteristics of the data, such as averages and distributions. You can choose 5 of the questions below to explore the data. Some of the things you can answer are
- What is the average age in the data?
- What is the average BMI value of those who smoke?
- Is the variance of charges the same for smokers and non-smokers?
- Is the average age of female and male smokers the same?
- Which is higher, the average health bill of smokers or non-smokers?
- Which is higher, the average health bill of smokers whose BMI is above 25 or that of non-smokers whose BMI is above 25?
- Which BMI is higher, that of males or that of females?
- Which BMI is higher, a smoker or a non-smoker?
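Several of the questions above reduce to a grouped mean or variance. The following is a minimal sketch using only the Python standard library. The rows are invented for illustration and the keys (`age`, `sex`, `bmi`, `smoker`, `charges`) are assumed column names, so the printed numbers say nothing about the real data.

```python
from statistics import mean, variance

# Toy rows invented for illustration; they are NOT taken from the real dataset.
# The keys (age, sex, bmi, smoker, charges) are assumed column names.
rows = [
    {"age": 20, "sex": "female", "bmi": 28.0, "smoker": "yes", "charges": 17000.0},
    {"age": 30, "sex": "male", "bmi": 32.0, "smoker": "no", "charges": 4000.0},
    {"age": 40, "sex": "male", "bmi": 24.0, "smoker": "yes", "charges": 20000.0},
    {"age": 50, "sex": "female", "bmi": 30.0, "smoker": "no", "charges": 9000.0},
    {"age": 60, "sex": "male", "bmi": 26.0, "smoker": "yes", "charges": 35000.0},
    {"age": 40, "sex": "female", "bmi": 22.0, "smoker": "no", "charges": 5000.0},
]


def select(rows, column, **conditions):
    """Return the values of `column` for rows matching every key=value condition."""
    return [
        r[column]
        for r in rows
        if all(r[key] == wanted for key, wanted in conditions.items())
    ]


mean_age = mean(select(rows, "age"))
mean_bmi_smokers = mean(select(rows, "bmi", smoker="yes"))
mean_charges_smokers = mean(select(rows, "charges", smoker="yes"))
mean_charges_nonsmokers = mean(select(rows, "charges", smoker="no"))

# Sample variance (n - 1 in the denominator) of charges within each group.
var_charges_smokers = variance(select(rows, "charges", smoker="yes"))
var_charges_nonsmokers = variance(select(rows, "charges", smoker="no"))

print("mean age:", mean_age)
print("mean BMI of smokers:", mean_bmi_smokers)
print("mean charges, smokers vs non-smokers:",
      mean_charges_smokers, mean_charges_nonsmokers)
print("variance of charges, smokers vs non-smokers:",
      var_charges_smokers, var_charges_nonsmokers)
```

Comparing the two variances by eye only describes the sample; whether the difference is statistically meaningful is a question for the hypothesis-testing step.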
Step #2 - Categorical Variable Analysis (PMF)
Next, to deepen the analysis, you can estimate the probability that certain conditions are associated with a given health bill amount. You can choose 5 of the questions below to check the conditions in the data. Some of the things you can answer are
- Which gender has the highest bill?
- The probability distribution of charges in each region
- Does each region have the same proportion of people?
- Which proportion is higher, smokers or non-smokers?
- What is the probability that a person is female, given that the person is a smoker?
- What is the probability that a person is male, given that the person is a smoker?
- What is the shape of the distribution of bills from each region?
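The proportion and conditional-probability questions above can be sketched as an empirical PMF plus a ratio of counts. This is only an illustration: the tuples are invented, and the field order (sex, smoker, region) is an assumption made for the example.

```python
from collections import Counter

# Toy observations invented for illustration: (sex, smoker, region).
people = [
    ("female", "yes", "southeast"),
    ("male", "yes", "southeast"),
    ("male", "yes", "northwest"),
    ("female", "no", "northeast"),
    ("male", "no", "southwest"),
    ("female", "no", "southeast"),
    ("male", "no", "northeast"),
    ("female", "no", "northwest"),
]


def pmf(values):
    """Empirical probability mass function: relative frequency of each value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def conditional_probability(rows, event, given):
    """Estimate P(event | given); both arguments are predicates on a row."""
    given_rows = [r for r in rows if given(r)]
    if not given_rows:
        raise ValueError("the conditioning event never occurs in the data")
    return sum(1 for r in given_rows if event(r)) / len(given_rows)


region_pmf = pmf(region for _, _, region in people)
smoker_pmf = pmf(smoker for _, smoker, _ in people)

p_female_given_smoker = conditional_probability(
    people,
    event=lambda r: r[0] == "female",
    given=lambda r: r[1] == "yes",
)
p_male_given_smoker = conditional_probability(
    people,
    event=lambda r: r[0] == "male",
    given=lambda r: r[1] == "yes",
)

print(region_pmf)
print(smoker_pmf)
print(p_female_given_smoker, p_male_given_smoker)
```

Note that P(female | smoker) and P(smoker | female) are different quantities; swapping the `event` and `given` predicates gives the other one.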
Step #3 - Continuous Variable Analysis (CDF)
Not all variables in our data are categorical. To understand how continuous variables relate to health bills, we can perform CDF analysis on the data. Some of the things you can answer are
- Find the probability of a large bill based on BMI
- Find the probability that a smoker with a BMI above 25 will get a health bill above 16,700.
- What is the probability that a random person with a health bill above 16.7k is a smoker?
- Which is more likely to happen: (a) a person with a BMI above 25 gets a health bill above 16.7k, or (b) a person with a BMI below 25 gets a health bill above 16.7k?
- Which is more likely to happen: (a) a smoker with a BMI above 25 gets a health bill above 16.7k, or (b) a non-smoker with a BMI above 25 gets a health bill above 16.7k?
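The threshold questions above can be answered with an empirical CDF: the probability of a bill above a threshold is one minus the CDF at that threshold, computed within the subgroup of interest. Below is a minimal sketch under stated assumptions: the rows are invented, and the field order (BMI, smoker, charges) is chosen only for this example.

```python
THRESHOLD = 16700  # the bill threshold used in the questions above

# Toy observations invented for illustration: (bmi, smoker, charges).
records = [
    (28.0, "yes", 17000),
    (32.0, "no", 4000),
    (24.0, "yes", 20000),
    (30.0, "no", 9000),
    (26.0, "yes", 35000),
    (22.0, "no", 5000),
    (31.0, "no", 18000),
    (27.0, "yes", 12000),
]


def ecdf(values, x):
    """Empirical CDF: fraction of observations less than or equal to x."""
    values = list(values)
    return sum(1 for v in values if v <= x) / len(values)


def prob_above(values, threshold):
    """P(X > threshold), estimated as 1 - ECDF(threshold)."""
    return 1.0 - ecdf(values, threshold)


# P(charges > threshold | smoker, BMI > 25) versus the non-smoker counterpart.
p_smoker_high_bmi = prob_above(
    (c for bmi, s, c in records if s == "yes" and bmi > 25), THRESHOLD
)
p_nonsmoker_high_bmi = prob_above(
    (c for bmi, s, c in records if s == "no" and bmi > 25), THRESHOLD
)

# P(smoker | charges > threshold): condition on the bill first, then count smokers.
high_bills = [s for _, s, c in records if c > THRESHOLD]
p_smoker_given_high_bill = high_bills.count("yes") / len(high_bills)

print(p_smoker_high_bmi, p_nonsmoker_high_bmi, p_smoker_given_high_bill)
```

On real data, a subgroup may be empty (which would raise a division error here), so it is worth checking group sizes before comparing probabilities.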
Step #4 - Variable Correlation Analysis
After identifying the conditions that are more likely to lead to high health bills in the previous step, we can also look for correlations between these conditions and health bills. Correlation analysis is required here.
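One common choice for this step is the Pearson correlation coefficient; the project does not prescribe a specific measure, so treat the following as a sketch rather than the required method. The numbers are invented, and encoding smoker status as 1/0 is an assumption that lets a binary variable be correlated with charges.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(ss_x * ss_y)


# Toy values invented for illustration; not taken from the real dataset.
ages = [20, 30, 40, 50, 60, 40]
smoker_flags = [1, 0, 1, 0, 1, 0]  # assumed encoding: 1 = smoker, 0 = non-smoker
charges = [17000, 4000, 20000, 9000, 35000, 5000]

r_age_charges = pearson(ages, charges)
r_smoker_charges = pearson(smoker_flags, charges)

print(round(r_age_charges, 3), round(r_smoker_charges, 3))
```

A coefficient near 1 or -1 indicates a strong linear association, but it does not by itself show that one variable causes the other.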
Step #5 - Hypothesis Testing
In the last step, we check whether there is sufficient statistical evidence for claims or hypotheses about health bills. You must test 3 hypotheses about the population characteristics of the data. The hypotheses that must be tested are
- Smokers' health bills are higher than non-smokers' health bills.
- Health bills of users with a BMI above 25 are higher than those of users with a BMI below 25.
For the third hypothesis, you can choose one of the hypotheses below or formulate your own:
- The BMI of men and women is the same
- Men's health bills are higher than women's
- Proportion of smokers differs by region
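As an illustration of the first hypothesis, the sketch below runs a one-sided permutation test on the difference in mean charges. This technique is swapped in here because it needs only the standard library; a two-sample t-test such as `scipy.stats.ttest_ind` would be a common alternative. The samples, the function name `permutation_test_greater`, and its parameters are all hypothetical.

```python
import random


def permutation_test_greater(a, b, n_resamples=5000, seed=0):
    """One-sided permutation test.

    H0: the two groups have the same mean.
    H1: the mean of `a` is greater than the mean of `b`.
    Returns (observed difference in means, estimated p-value).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_a = len(a)
    observed = sum(a) / n_a - sum(b) / len(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        # Under H0 the group labels are exchangeable, so reshuffle them.
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a)
        if diff >= observed:
            hits += 1
    # The add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_resamples + 1)


# Toy samples invented for illustration; not taken from the real dataset.
smoker_charges = [17000, 20000, 35000, 12000, 28000]
nonsmoker_charges = [4000, 9000, 5000, 18000, 7000]

observed_diff, p_value = permutation_test_greater(smoker_charges, nonsmoker_charges)
alpha = 0.05  # assumed significance level
print("observed difference:", observed_diff)
print("p-value:", p_value)
print("reject H0" if p_value < alpha else "fail to reject H0")
```

The same function could be reused for the BMI hypothesis by passing the charges of users with a BMI above 25 as `a` and those below 25 as `b`.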
Project Outcome
After you have completed all the steps, we want you to analyze and summarize the results in a short report and a presentation. Save the short report along with the work files in a GitHub repository and record your presentation on YouTube. Provide the project repository link and the YouTube presentation link in the submission form.
Dataset & Tools
Dataset
The dataset provided contains personal health bills. It has 7 variables, with the variable charges indicating the amount of the health bill. The description of each column of the dataset is as follows:
- Age: age of the primary policyholder.
- Sex: gender of the insurance contractor (female or male).
- BMI: body mass index, an objective index of body weight relative to height, computed as weight divided by height squared (kg/m^2). It indicates whether weight is relatively high or low for a given height; the ideal range is 18.5 to 24.9.
- Smoker status: Whether the policyholder is a smoker or a non-smoker.
- Children: number of children covered by health insurance.
- Region of residence: the beneficiary's residential area in the U.S. - northeast, southeast, southwest, northwest.
- Charges: individual medical costs billed by health insurance.
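The column list above maps naturally onto a list of typed records. The following sketch parses CSV text with the standard `csv` module. The lowercase header names (`age`, `sex`, `bmi`, `children`, `smoker`, `region`, `charges`) and the helper `load_rows` are assumptions for illustration, since the actual file layout is not shown here; the sample rows are invented.

```python
import csv
import io


def load_rows(file_obj):
    """Parse CSV text into a list of dicts, converting numeric columns."""
    rows = []
    for raw in csv.DictReader(file_obj):
        rows.append({
            "age": int(raw["age"]),
            "sex": raw["sex"],
            "bmi": float(raw["bmi"]),
            "children": int(raw["children"]),
            "smoker": raw["smoker"],
            "region": raw["region"],
            "charges": float(raw["charges"]),
        })
    return rows


# Made-up sample in the assumed layout; these are not real records.
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "20,female,28.0,0,yes,southeast,17000.0\n"
    "30,male,32.0,2,no,northwest,4000.0\n"
)

rows = load_rows(sample)
print(len(rows), rows[0]["charges"])
```

With a real file, the same function could be called on an opened file object, for example `open("insurance.csv", newline="")`, where the filename is hypothetical.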
Tools
This project may use any of the following tools to calculate, analyze, and plot data:
- Tableau
- Python
- GitHub
Framework to Describe These Projects
Goal: The main objective of this project was to analyze and identify the variables that are associated with the health bills received by individuals. The project aimed to uncover patterns and relationships within the dataset, focusing on personal attributes such as age, BMI value, smoking status, gender, and region.
Impact: Through the analysis of various factors and variables, the project provided valuable insights into the determinants of medical costs. By examining the relationships between personal characteristics and health bills, the project contributed to a better understanding of the factors driving healthcare expenses.
Challenges: The project encountered challenges in handling and processing the dataset to derive meaningful insights. Ensuring accurate data representation, dealing with potential outliers, and interpreting the implications of variable relationships were some of the challenges faced during the analysis.
Interesting Findings: The project yielded several interesting findings:
- A strong correlation between smoking behavior and higher medical costs, particularly in combination with a high BMI.
- Older individuals tended to have higher medical costs than younger ones.
- Geographic region, particularly the Southeast region, exhibited unique patterns in the distribution of medical costs within certain percentiles.
- The gender distribution among the dataset participants was nearly balanced.
Conclusion: The project successfully achieved its goal of analyzing variables associated with health bills, shedding light on significant factors influencing medical costs. The insights gained can be used for informed decision-making in healthcare policy and individual financial planning. The use of tools such as Tableau Public and Python programming language facilitated efficient data analysis and visualization, aiding in the exploration of the dataset and extraction of meaningful conclusions.