Customer-complaint-analysis

Radhin krishna
Aug 2, 2024
Edureka Data Science & Machine Learning

For code solutions and dataset:

Introduction:

This project was completed during the Edureka Data Science and Machine Learning internship. It covers the fundamentals of machine learning: a Python program that first cleans the data, then analyzes and pre-processes it, followed by feature engineering and a program that evaluates the performance of several machine-learning models, such as ensemble models, regression models, and decision trees.

Scenario:

Product reviews are a basic input for resolving customer issues and growing the sales of any product. They let us understand customers' attitudes toward our service without asking each customer individually. When consumers are unhappy with some aspect of a business, they reach out to customer service and may raise a complaint. Companies try their best to resolve the complaints they receive, but it is not always possible to appease every customer. Here, we analyze the complaint data and compare different algorithms to find the best classifier of customer category, so that we can predict the outcome for our test data.

Objective:

Use Python libraries such as Pandas for data operations, Seaborn and Matplotlib for data visualization and EDA tasks, and Sklearn for model building and performance visualization. Based on the best model, make predictions for the test file and save the output. The main objective is to predict, from the given data, whether a customer will raise a dispute or not.

Dataset Description:

Customers faced some issues and tried to report their problems to customer care. The data contains the following fields:

Dispute: Our target variable. Based on the train data, there are two groups: customers who have a dispute with the bank and customers with no issue.
Date received: The day the complaint was received
Product: Different products offered by the bank (credit cards, debit cards, different types of transaction methods, accounts, locker services, and money-related services)
Sub-product: Loan, insurance, other mortgage options
Issue: Complaint of the customer
Company public response: Company's response to the consumer complaint
Company: Company name
State: State where the customer lives (different states of the USA)
ZIP code: Where the customer lives
Submitted via: Platform through which the complaint was registered (online web, phone, referral, fax, postal mail)
Date sent to company: The day the complaint was registered
Timely response?: Yes/no
Consumer disputed?: Yes/no (target variable)
Complaint ID: Unique to each consumer

Tasks to be performed:

The following tasks are to be performed:

Data Cleaning & Pre-processing

Read the data from the Excel file.
Check the data types of both files (test file and train file).
Do a missing-value analysis and drop columns where more than 25% of the data is missing.
Extract the day, month, and year from the "Date received" column and create new fields for month, year, and day.
Calculate the number of days the complaint was with the company and create a new field, "Days held".
Drop the "Date received", "Date sent to company", "ZIP code", and "Complaint ID" fields.
Impute null values in "State" with the mode.
Using the day values calculated above, create a new field, 'Week_Received', that holds the week based on the day of receiving.
Store the data of customers who disputed in the "disputed_cons" variable for future tasks.
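
The cleaning steps above can be sketched roughly as follows. This is only an illustration under several assumptions, not the project's actual solution: the function name clean_complaints, the new column names Year_Received, Month_Received, and Day_Received, the tiny in-memory sample (including its hypothetical "Tags" column), and the reading of 'Week_Received' as the week of the month are all made up for the example. In the real project the frame would come from the Excel file rather than from a literal.

```python
import pandas as pd


def clean_complaints(df: pd.DataFrame, missing_threshold: float = 0.25) -> pd.DataFrame:
    """Apply the cleaning steps listed above to a complaints frame (hypothetical helper)."""
    df = df.copy()

    # Drop columns where more than 25% of the values are missing.
    missing_share = df.isna().mean()
    df = df.drop(columns=missing_share[missing_share > missing_threshold].index)

    # Parse both date columns so that date arithmetic works.
    received = pd.to_datetime(df["Date received"])
    sent = pd.to_datetime(df["Date sent to company"])

    # New fields for year, month, and day (column names are assumptions).
    df["Year_Received"] = received.dt.year
    df["Month_Received"] = received.dt.month
    df["Day_Received"] = received.dt.day

    # Number of days the complaint was with the company.
    df["Days held"] = (sent - received).dt.days

    # Assumption: "week" means week of the month (days 1-7 -> 1, 8-14 -> 2, ...).
    df["Week_Received"] = (df["Day_Received"] - 1) // 7 + 1

    # Impute missing states with the most frequent state.
    df["State"] = df["State"].fillna(df["State"].mode()[0])

    # Drop the fields that are no longer needed.
    df = df.drop(
        columns=["Date received", "Date sent to company", "ZIP code", "Complaint ID"],
        errors="ignore",
    )
    return df


# Tiny made-up sample standing in for the train file.
sample = pd.DataFrame({
    "Date received": ["2015-01-05", "2015-02-17", "2015-03-30", "2015-04-09"],
    "Date sent to company": ["2015-01-07", "2015-02-17", "2015-04-02", "2015-04-08"],
    "State": ["CA", None, "CA", "TX"],
    "ZIP code": ["90001", "10001", "94105", "73301"],
    "Complaint ID": [1, 2, 3, 4],
    "Tags": [None, None, "Older American", None],  # hypothetical sparse column
    "Consumer disputed?": ["Yes", "No", "No", "Yes"],
})

cleaned = clean_complaints(sample)

# Keep the disputed customers for the later analysis tasks.
disputed_cons = cleaned[cleaned["Consumer disputed?"] == "Yes"]
print(cleaned)
```

Note that the fourth sample row has a negative "Days held" value; the sketch leaves it untouched because the task list handles negative values in a later step.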

Code solution

Data Analysis and Visualization

Plot a bar graph of the total number of disputes by consumers

Plot a bar graph of the total number of disputes by product

Plot a bar graph of the top issues with the highest number of disputes

Plot a bar graph of the states with the maximum number of disputes

Plot a bar graph of the total number of disputes by submission channel ("Submitted via")

Plot a bar graph of the company's responses to the complaints

Plot a bar graph of the company responses that lead to disputes

Plot a bar graph showing whether disputes occur despite a timely response

Plot a bar graph of complaints by year

Plot a bar graph of the top companies with the highest number of complaints

Convert all negative "Days held" values to zero (this is the time taken by the authority, which cannot be negative)
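
As a rough illustration of two of the items above, the sketch below counts disputes per product, draws them as a bar graph, and then clips negative "Days held" values to zero. The sample rows and product names are invented for the example, and plain Matplotlib through pandas is used here as an assumption; the same count could equally be drawn with Seaborn.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows standing in for the cleaned train data.
df = pd.DataFrame({
    "Product": ["Mortgage", "Credit card", "Mortgage",
                "Debt collection", "Credit card", "Mortgage"],
    "Consumer disputed?": ["Yes", "No", "Yes", "Yes", "No", "No"],
    "Days held": [2, -1, 0, 5, -3, 1],
})

# Keep only disputed complaints and count them per product.
disputed_cons = df[df["Consumer disputed?"] == "Yes"]
disputes_by_product = disputed_cons["Product"].value_counts()

# Bar graph of the number of disputes per product.
fig, ax = plt.subplots(figsize=(6, 4))
disputes_by_product.plot(kind="bar", ax=ax)
ax.set_xlabel("Product")
ax.set_ylabel("Number of disputes")
ax.set_title("Disputes by product")
fig.tight_layout()

# Negative holding times are not meaningful, so clip them to zero.
df["Days held"] = df["Days held"].clip(lower=0)
```

The other bar graphs in the list follow the same pattern with a different column passed to value_counts, for example "State" or "Submitted via". In an interactive session, plt.show() would display the figure.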

Data pre-processing and Machine learning model selection

Split the data sets into X and Y, the independent and dependent variables (data selected by PCA).
Build the given models and measure their test and validation accuracy:

o LogisticRegression

o DecisionTreeClassifier

o RandomForestClassifier

o AdaBoostClassifier

o GradientBoostingClassifier

o KNeighborsClassifier

o XGBClassifier

• Select the model that gives the most accurate result and use it to predict the outcome
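
A minimal sketch of such a comparison is shown below. It is not the project's actual code: synthetic data from make_classification stands in for the encoded complaint features, the scaling step, the 95% explained-variance setting for PCA, and the 75/25 split are assumptions, and XGBClassifier is only added when the xgboost package happens to be installed.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded complaint features (X) and dispute label (y).
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=5, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
    "KNeighborsClassifier": KNeighborsClassifier(),
}

# XGBoost is a separate package, so include it only if it is available.
try:
    from xgboost import XGBClassifier

    models["XGBClassifier"] = XGBClassifier(random_state=0)
except ImportError:
    pass

scores = {}
for name, model in models.items():
    # Scale, keep the principal components explaining 95% of the variance, then fit.
    pipeline = make_pipeline(StandardScaler(), PCA(n_components=0.95), model)
    pipeline.fit(X_train, y_train)
    scores[name] = {
        "train": pipeline.score(X_train, y_train),
        "validation": pipeline.score(X_val, y_val),
    }

# Pick the model with the highest accuracy on the held-out split.
best_name = max(scores, key=lambda n: scores[n]["validation"])

for name, result in scores.items():
    print(f"{name}: train={result['train']:.3f} validation={result['validation']:.3f}")
print("Best model on the held-out split:", best_name)
```

Wrapping each model in a pipeline keeps the scaling and PCA fitted on the training split only, so the held-out accuracy is not inflated by information from the validation rows. The selected pipeline could then be refit and applied to the pre-processed test file with predict, and the predictions saved.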

Conclusion

Based on the accuracy and validation scores, the Logistic Regression model demonstrates the highest accuracy, while AdaBoost, XGBoost, and Gradient Boosting also exhibit strong performance on the test data. Therefore, selecting one of these algorithms would be an ideal solution.