Cyclistic bike-share analysis case study!
- Radhin Krishna
- Jun 29, 2024
- 6 min read
Google Data Analytics certification capstone project
Introduction
This case study concerns a fictional company, Cyclistic. Along the way, we meet different characters and team members and work to answer the company's key business questions.
This report delves into a critical question: How do casual riders and annual members utilize Cyclistic's services differently? Understanding these usage patterns holds the key to unlocking a successful marketing strategy. By leveraging data insights, we can identify trends and tailor compelling arguments that convert casual riders into loyal annual members.
Through clear and insightful data visualizations, this report will equip the Cyclistic marketing team with actionable recommendations and provide executives with compelling evidence to support those recommendations. By the end of this analysis, we'll have a roadmap for attracting new annual members and ensuring Cyclistic's continued success in Chicago's dynamic transportation landscape.
Prepare
Where and how is data stored?
The data has been made available by Motivate International Inc. under this license.
This is public data that you can use to explore how different customer types are using Cyclistic bikes. But note that data privacy issues prohibit you from using riders’ personally identifiable information.
Are there issues with bias or credibility in this data? Does your data ROCCC?
Here, ROCCC stands for:
Reliable: The data comes directly from Cyclistic, presumably tracking actual bike trips.
Original: This is first-party data collected by Cyclistic itself.
Comprehensive: The data likely includes most, if not all, bike trips within the chosen timeframe.
Current: Depending on the chosen timeframe (an entire year or a single quarter), the data may be recent or somewhat outdated.
Cited: The case study mentions Motivate International Inc. provides the data under a license.
The case study provides the data in .csv (comma-separated values) format. Each row represents a single bike trip, with columns containing details like trip duration, start/end times, and user type (casual or annual member).
There could still be potential bias:
Sampling bias: If the data doesn't capture all Cyclistic users equally (e.g., missing data from a specific area), it might be skewed.
Data entry errors: Human error during data collection could introduce inconsistencies.
How did you verify the data’s integrity?
While I can't directly verify the data, some methods could be used:
Data cleaning: Checking for missing values, inconsistencies, and outliers during analysis.
Cross-referencing: If possible, compare the data with other sources (e.g., ridership reports).
How does it help you answer your question?
The data on trip duration, start/end times, and user type allows us to compare how casual riders and annual members use Cyclistic bikes. We can identify patterns such as:
Do casual riders take shorter trips on average?
Are there peak usage times for each user type?
Do annual members use bikes in specific areas more often?
These insights can inform marketing strategies to target casual riders based on their usage habits.
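To make these comparisons concrete, here is a minimal pandas sketch (the file name is a placeholder; started_at, ended_at, and member_casual are column names from the Divvy trip data):

import pandas as pd

# Load one batch of trip data (placeholder file name)
trips = pd.read_csv("divvy_trips.csv", parse_dates=["started_at", "ended_at"])

# Trip duration in minutes
trips["ride_length"] = (trips["ended_at"] - trips["started_at"]).dt.total_seconds() / 60

# Do casual riders take shorter trips on average?
print(trips.groupby("member_casual")["ride_length"].mean())

# Are there peak usage times for each user type?
trips["start_hour"] = trips["started_at"].dt.hour
print(trips.groupby(["member_casual", "start_hour"]).size().unstack(level=0))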
Are there any problems with the data?
Missing data: Incomplete trip information could skew results.
Outdated data: Using older data might not reflect current user behavior.
Data accuracy: Errors during data collection could impact reliability.
Process
What tools are you choosing and why?
Spreadsheets (Excel): suited to basic cleaning tasks (sorting, filtering, removing duplicates) and initial exploration. However, they become cumbersome with large datasets.
Split the time and date into different attributes and create a custom column to calculate the trip duration in minutes.
Spreadsheet Calculations:
Ride Length:
Create a new column "ride_length".
Use the formula =D2-C2 (assuming started_at is in C2 and ended_at in D2) to calculate the difference.
Format the column as Time (HH:MM:SS).
Day of Week:
Create a new column "day_of_week".
Use the formula =WEEKDAY(C2,1) (assuming started_at is in C2) to determine the weekday.
Format the column as General or number with no decimals (1=Sunday, 7=Saturday).
Programming Languages (Python, R): Offer greater flexibility and power for manipulating, cleaning, and analyzing large datasets. Libraries like pandas (Python) or dplyr (R) streamline data wrangling tasks (a short pandas sketch follows this list).
Data Analysis Platforms (Tableau): Provide visual interfaces for data cleaning, transformation, and analysis. They can be user-friendly but may require additional scripting for complex tasks.
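As referenced above, here is a short pandas sketch of the same wrangling steps performed in the spreadsheet (a minimal sketch; the file name is a placeholder, and the weekday mapping mirrors the Excel formula):

import pandas as pd

# Placeholder file name; timestamps are parsed on load
trips = pd.read_csv("divvy_trips.csv", parse_dates=["started_at", "ended_at"])

# ride_length, the pandas counterpart of the =D2-C2 spreadsheet formula
trips["ride_length"] = trips["ended_at"] - trips["started_at"]

# day_of_week with 1 = Sunday ... 7 = Saturday, mirroring =WEEKDAY(C2,1);
# pandas dayofweek uses 0 = Monday, so shift and wrap
trips["day_of_week"] = (trips["started_at"].dt.dayofweek + 1) % 7 + 1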
Have you ensured your data’s integrity? What steps have you taken to ensure that your data is clean?
Identifying Issues:
Using descriptive statistics and visualizations to identify outliers, missing values, and inconsistencies.
Using the pandas library in Python, the dataset was thoroughly examined, and outliers and missing values were either imputed or removed (see the accompanying Colab file). A minimal pandas sketch of these steps follows the list below.
Cleaning Techniques:
Handle missing values: Remove rows with too many missing entries, impute missing values with statistical methods, or categorize them as a separate category.
Correct inconsistencies: Fix typos, standardize formats (e.g., date formats), and address any data entry errors.
Address outliers: Depending on the context, outliers could be genuine or indicate errors. Analyze them and decide if removal or correction is necessary.
Remove duplicates: Ensure each trip record is unique.
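As mentioned above, here is a minimal pandas sketch of these cleaning steps (ride_id follows recent Divvy exports; the outlier cutoffs are illustrative assumptions, not the values used in the actual cleaning):

import pandas as pd

trips = pd.read_csv("divvy_trips.csv", parse_dates=["started_at", "ended_at"])

# Remove duplicates: each ride_id should be unique
trips = trips.drop_duplicates(subset="ride_id")

# Handle missing values: drop rows missing core trip fields
trips = trips.dropna(subset=["started_at", "ended_at", "member_casual"])

# Correct inconsistencies: standardize the rider-type labels
trips["member_casual"] = trips["member_casual"].str.strip().str.lower()

# Address outliers: drop negative durations and rides over 24 hours (illustrative cutoffs)
ride_minutes = (trips["ended_at"] - trips["started_at"]).dt.total_seconds() / 60
trips = trips[(ride_minutes > 0) & (ride_minutes <= 24 * 60)]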
How can you verify that your data is clean and ready to analyze?
Data Profiling: Generate summary statistics to understand the distribution of values in each column.
Cross-referencing: If possible, compare the data with other sources (e.g., internal reports) to identify any discrepancies.
Visualizations: Create charts and graphs to explore data relationships and identify any remaining anomalies.
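For the profiling step, a few pandas one-liners go a long way (a sketch, continuing with the cleaned trips dataframe from the previous snippet):

# Data profiling: summary statistics for every column
print(trips.describe(include="all"))

# Remaining missing values per column (should be zero after cleaning)
print(trips.isna().sum())

# Remaining duplicate trip records (should be zero)
print(trips.duplicated(subset="ride_id").sum())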
Analysis
1. Excel
1. Open Excel and Clean Up Your Data:
Launch Microsoft Excel.
Combine and Clean: In your data file, identify and combine the relevant columns into a single worksheet for easier analysis. This might involve merging duplicate entries or splitting a single column into multiple columns based on its content.
Clean Data: Look for inconsistencies, typos, and missing values. Correct errors, remove irrelevant data, and ensure consistent formatting (e.g., dates in the same format).
2. Explore the Data:
Descriptive Analysis: Get a high-level understanding of your data. This might involve calculating basic statistics like:
Mean: Average ride length (use the AVERAGE function, or SUM of ride_length divided by the number of rides)
Max: Longest ride length (use the MAX function)
Mode: Most frequent day of the week for rides (use the MODE function)
3. Leverage Pivot Tables:
Pivot tables are powerful tools for summarizing and visualizing data. Create a pivot table to analyze your data in different ways. Here are a few examples (a pandas equivalent is sketched after these examples):
Average Ride Length by Rider Type:
Rows: Drag "member_casual" (separating members and casual riders)
Values: Drag "ride_length" and choose "Average" as the calculation
Average Ride Length by Day:
Columns: Drag "day_of_week" to see results by day
Rows: Drag "member_casual" (maintain rider type separation)
Values: Drag "ride_length" and choose "Average" as the calculation
Number of Rides by Day:
Similar to average ride length by day, but replace "ride_length" with "trip_id" and choose "Count" as the calculation (shows the number of rides)
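These summaries can also be reproduced outside Excel as a cross-check; here is a rough pandas equivalent (pivot_table stands in for the drag-and-drop pivot; ride_id is the unique trip ID column, called trip_id above in some exports):

import pandas as pd

trips = pd.read_csv("divvy_trips.csv", parse_dates=["started_at", "ended_at"])
trips["ride_length"] = (trips["ended_at"] - trips["started_at"]).dt.total_seconds() / 60
trips["day_of_week"] = trips["started_at"].dt.day_name()

# Average ride length by rider type
print(trips.pivot_table(index="member_casual", values="ride_length", aggfunc="mean"))

# Average ride length by rider type and day of week
print(trips.pivot_table(index="member_casual", columns="day_of_week", values="ride_length", aggfunc="mean"))

# Number of rides by rider type and day of week
print(trips.pivot_table(index="member_casual", columns="day_of_week", values="ride_id", aggfunc="count"))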
4. Analyze Another File (Optional):
Open another data file (e.g., different season) and repeat steps 2 and 3. This helps observe potential seasonal trends.
5. Merge and Export Data:
Once you've analyzed individual files, use your chosen tool (Excel, a database, RStudio) to merge them into a full-year view for further analysis.
Export Summary: After merging, export a summary file containing the data you'll use for further in-depth analysis in your chosen tool.
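If the merge is scripted in Python, it amounts to a concat followed by a summary export (a sketch; the glob pattern and file names are placeholders):

import glob
import pandas as pd

# Stack the monthly files into one full-year dataframe (placeholder pattern)
files = glob.glob("divvy_trips_2023_*.csv")
year = pd.concat((pd.read_csv(f, parse_dates=["started_at", "ended_at"]) for f in files), ignore_index=True)

# Export a summary for further in-depth analysis
year["ride_length"] = (year["ended_at"] - year["started_at"]).dt.total_seconds() / 60
summary = year.groupby("member_casual")["ride_length"].agg(["mean", "count"])
summary.to_csv("summary_full_year.csv")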
2. SQL
Using MS SQL Server for Data Analysis:
1. Import Your Data:
MS SQL doesn't directly open spreadsheets. You'll need to import your data file (e.g., CSV, Excel) into a SQL table. This can be done using several methods:
SQL Server Import and Export Wizard: A graphical tool within SQL Server Management Studio (SSMS) to import data.
BULK INSERT: A T-SQL statement for efficient data import from various file formats.
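The import can also be scripted; here is a minimal Python sketch using pandas and SQLAlchemy (the connection string, table name, and file name are placeholders, not part of the case study):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string for a local SQL Server instance with Windows authentication
engine = create_engine("mssql+pyodbc://localhost/cyclistic?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes")

trips = pd.read_csv("divvy_trips.csv", parse_dates=["started_at", "ended_at"])
trips.to_sql("your_table_name", engine, if_exists="replace", index=False)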
2. Explore Your Data:
Use queries to get an overview of your data. Here are some examples:
Total Number of Rows: SELECT COUNT(*) FROM your_table_name;
Distinct Values: SELECT DISTINCT column_name FROM your_table_name;
Maximum Value: SELECT MAX(column_name) FROM your_table_name;
Minimum Value: SELECT MIN(column_name) FROM your_table_name;
Mean Value: SELECT AVG(column_name) FROM your_table_name;
3. Combine Data with JOINs (if necessary):
If your data is spread across multiple tables, use JOIN statements to combine them based on a shared relationship (e.g., common ID). Here's a basic JOIN syntax:
SELECT *
FROM table1
INNER JOIN table2 ON table1.column_name = table2.column_name;
4. Create Summary Statistics:
Use aggregation functions like SUM, AVG, COUNT, etc., to calculate summary statistics. Here's an example:
SELECT
member_casual,
AVG(ride_length) AS average_ride_length,
COUNT(*) AS total_rides
FROM your_table_name
GROUP BY member_casual;
This query calculates average ride length and total rides grouped by member type (member/casual).
5. Investigate Trends and Save Results:
Write queries to identify interesting trends in your data. For example, analyzing average ride length by day:
SELECT
day_of_week,
AVG(ride_length) AS average_ride_length
FROM your_table_name
GROUP BY day_of_week;
Once you discover trends, create a new table to store these insights using INSERT statements:
INSERT INTO trend_analysis (trend_name, trend_details)
SELECT
'Average Ride Length by Day',
CONCAT('Day ', day_of_week, ': ', AVG(ride_length))
FROM your_table_name
GROUP BY day_of_week;
This saves the "Average Ride Length by Day" trend, one row per day, with the details retrieved from the previous query.
3. R Programming
Using R for Data Analysis:
1. Open RStudio and Import Your Data:
Launch RStudio.
There are several ways to import data in R. Here are two common methods:
Using the read.csv() function: This function reads data from comma-separated value (CSV) files. For example:
data <- read.csv("your_data.csv", header = TRUE)
2. Clean and Prepare Your Data:
Explore the data structure using str(data).
Identify and address inconsistencies in column names or data types (e.g., convert character columns to numeric if appropriate).
Handle missing values using techniques like removal or imputation (filling in missing values).
You can use R functions and packages for data manipulation. Here are some examples:
Check for missing values: sum(is.na(data$column_name))
Remove rows with missing values: data <- na.omit(data)
Change data type: data$column_name <- as.numeric(data$column_name)
3. Merge Dataframes (if necessary):
If your data is spread across multiple dataframes, use rbind() to stack dataframes that share the same columns, or merge() to join them on a shared variable.
4. Conduct Descriptive Analysis:
Use R functions to summarize your data. Here are some examples:
Summary statistics: summary(data$column_name) provides basic statistics like mean, median, minimum, and maximum.
Frequency tables: table(data$column_name) shows the frequency of each unique value in a column.
Visualization: R offers various packages (ggplot2, base R graphics) for creating informative visualizations like histograms, scatter plots, and boxplots to explore data distribution and relationships.
5. Export a Summary File:
Once you've cleaned and analyzed your data, you can export a summary file for further analysis in other tools. Here are common options:
CSV: write.csv(summary_data, "summary_data.csv")
Excel: Packages like openxlsx allow you to write data to Excel files.
Note: This is a basic overview. R offers a vast ecosystem of packages and functionalities for in-depth data cleaning, transformation, analysis, and visualization.


