Cyclistic Case Study Analysis 2020
Using Spreadsheets, Google Bigquery (SQL), and Tableau
By Cristian Arevalo
Last Updated November 14, 2023
Table of Contents
- Cyclistic Case Study Analysis 2020
- Table of Contents
- About This Project
- Introduction
- Defining The Problem
- About Data Sources
- Prep Work
- Step 1: Collect Data
- Step 2: Wrangle Data and Combine into a Single Table
- Step 3: Clean Up and Add Data to Prepare for Analysis
- Step 4: Perform Analysis
- Data Summary
- Data Story: Cyclistic Bikeshare Analysis of 2020
- Recommendations
- Final Thoughts
About This Project
This case study is my capstone project for the Google Data Analatics Professional Certificated (via Coursera).
Here, I assume the role of a junior data analyst working in a marketing analyst team at Cylistic, a fictional bike-share company stationed in Chicago based on Divvy bikes. The objective is to understand how casual riders and annual members use Cyclistic bikes differently by identifying trends and more. These insights will help the marketing team to develop new marketting strategies to convert casual riders into annual members. But first, Cyclistic executives must be compelled with data insights and professional data visualizations.
The main tools I use are spreadsheets/excel, SQL and Tableau. Here are the highlights:
A more in-depth breakdown of the case study scenario is included below, followed by my full report.
The analysis is based on the Divvy case study “‘Sophisticated, Clear, and Polished’: Divvy and Data Visualization” written by Kevin Hartman (found here)
Additionally, all data cleaning and analysis is accomplish through the use of SQL in Google CLoud BigQuery (Sandbox Edition). Python is used to upload large datasets (over 10MBs), the rest are manually uploaded through the site from csv files. Spreadsheets (Excel) is used to inspect data closely after being broken down through SQL queries. Finally, Tableau is used to provided compelling visuals.
Introduction
Cyclistic is a bike-share company based in Chicago, USA. In 2016, Cyclistic Launched a successful bike-share program offering bike rentals throughout the city. Since then, the program has expanded to a fleet over 5,800 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
Based on Divvy Bikes Stationed in Chicago
Customers are divided into two classes: casual riders (those who purchase either the single-ride or full-day passes) and Cyclistic members (who purchase annual memberships).
Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. The flexible pricing plans (single-ride passes, full-day passes, and annual memberships) aided in bringing in new customers.
Although pricing flexibility helps in attracking new customers, Cyclistic’s finance analyst have concluded that annual members are much more profitable than casual riders. The director of marketting (Lily Moreno) believes that increasing the number of annual members will be the key to future growth. Since casual riders have chosen Cyclistic for their mobility needs and are already aware of the annual membership program, a marketing strategy aimed at converting casual riders into members is to be held.
Defining The Problem
Lily, the director of marketing, and the marketing analytics team are looking for ways to convert Cyclistic’s casual riders into annual members. The main problem is designing marketing strategies to accomplishing this task by identifying and understading the difference between casual riders & member riders in their usage of Cyclistic Bikeshare program. There are three questions that will guide the marketing team in their future marketing program.
The marketing team would like to know:
-
How do annual members and casual riders use Cyclistic bikes differently?
-
Why would casual riders buy Cyclistic annual memberships?
-
How can Cyclistic use digital media to influence casual riders to become members?
Specifically, my focus will revolve around on how do annual members and casual riders use Cyclistic bikes differently.
By looking at the data, we will be able to first get a broad sense of certain patterns that are occurring in the two different groups. Understanding the differences will provide more accurate customer profiles for each group. These insights will help the marketing analytics team design high quality targeted marketing for converting casual riders into members. For the Cyclistic executive team, these insights will help Cyclistic maximize the number of annual members and will fuel future growth for the company.
Business Task
Analyze historical bike trip data to identify trends in how annual members and casual riders use Cyclistic bikes differently.
About Data Sources
-
Historical trip data is publicly available here (Note: The datasets have different names because Cyclistic is a fictional company).
-
The data has been made available by Motivate International Inc. under this liscense.
- Data is reliable, original, comprehensive, and cited. Since only trips occuring in 2020 are studied, the data is not current. It mostly ROCCCs! However, it is not adviseable to make full business decisions based on this data.
- For business decisions, it is recommended to use current data (from 2022/23).
- The data collected contains ride ids, rideable type, start/end timestamps, station names & ids, latitude & longitude, and usertype. Overall, 13 parameters.
Prep Work
The data is downloaded, skimmed, prepared, cleaned, and manipulated. This section deals with all the dirty work prior to the actual analysis and visualization. The prep-work lays the foundation of our data problem solving, and takes more than twice the time and effort as the analysis.
If you are more interested with the analysis, and results, you can jump ahead to Data Summary or if you want a brief peak at how the analysis is performed please check out Performing Analysis.
Step 1: Collect Data
-
Download Divvy datasets containing all trip data occuring in 2020 (Jan-Dec).
-
Uploaded Divvy datasets (csv files) individuall through browser in Google Cloud BigQuery (Sandbox).
-
Files that were too large were uploaded via this python script (shown here).
Step 2: Wrangle Data and Combine into a Single Table
Preliminary Inspection
Once data is uploaded, it is important to compare schemas (e.g. column names and data type) for each of the tables and inspect the tables through preview to look for incongruencies. Here are the results:
-
Though column names matched, column types differed accross tables.
- The reason being that I did not format the dataframe in python prior to uploading (useful for the future).
-
From Jan 2020 to Nov 2020, station ids were purely numeric. On Dec 2020, alphanumeric station ids were added; however, on several occasions their previous numeric ids were also used.
-
It is necessary to change data type for station ids from INT64 to STRING.
-
There is also a need for new and unique ids for the stations.
-
Fixing Column Data Type
The following query is ran for multiple tables to correct column types.
CREATE OR REPLACE TABLE `project.dataset.table` AS (
SELECT * REPLACE (
CAST(started_at AS TIMESTAMP) AS started_at,
CAST(ended_at AS TIMESTAMP) AS ended_at,
CAST(start_station_id AS STRING) AS start_station_id,
CAST(end_station_id AS STRING) AS end_station_id
)
FROM `project.dataset.table`);
Combine Trip Data
The bike trip data is divided into 10 tables, however we interested in looking at all rides occuring during 2020.
The following query is used to create a new table, called divvy_trips_2020 for reference, containing all trip data occuring in 2020 is created.
- Additionally, this query makes use of naming conventions (specifically, similar names) to combine all data from 2020 into a single table.
CREATE TABLE IF NOT EXISTS `case-study1-bike-share.divvy_trips_2020_analysis.divvy_trips_2020`
SELECT * FROM `case-study1-bike-share.divvy_trips_2020_data.divvy_trips_2020_*`;
Inspecting Combined Trip Data
Briefly skiming the new table shows that:
-
Column names and type look great (they match).
-
Station names and ids need to be checked for duplicates (to remove them).
-
There is no trip duration among the columns to compare casual riders with members (there is a need to create the field and scrutinized it for abnormal ride lengths).
-
Check for distinct values across columns (all good here: 2 usertypes, 3 ridable types; station information needs to be verified seperatly, latitude and longitude values are not unique per station).
There is still a need to check for nulls or missing values.
Step 3: Clean Up and Add Data to Prepare for Analysis
Key Problems and Solutions
The first part of tidying data is look for issues/concerns regarding the data. Here are some of them:
- Some station names can have more than one id.
-
Create new unique station ids.
-
Prior to December ids were unique intergers, afterward alphanumeric ids were added but old ids were still being used.
-
The result, the station id is sometimes in accordance with the previous data, at times missing, and at others a combination of alphanumeric characters.
-
-
Some station names have duplicates that end with a word or symbol in parathesis.
-
Establish a single unique name per station.
-
There are two sources of duplicates: those that end with “(*)” and those that end with “(Temp)”.
-
- There are some station names that corresponds to quality checks or other.
-
Filter these stations when interest on insights only about customers
-
Relevant NEW IDS: 310, 311, 312, 631, 455, and 45
- HQ QR (id=310)
- Hubbard st bike checking (Lbs-wh-test) (id=311)
- Hubbard_test_lws (id=312)
- Watson Testing-divvy (id=633)
- Base-2132 W Hubbard Warehouse (id=45)
- Mt1-Eco5.1-01 (id=455)
-
- The data can only be aggregated at the ride-level, which is too granular.
- Add additional columns of data, such as the weekday & month when trips begin. This provide additional opportunities to aggregate the data.
- There is no field measuring trip duration.
-
Create a new column indicating ride_length and Scrutinize ride duration.
-
This presents more opportunities to compare user behavior between casual and member riders.
-
- Additionally, there are some rides were trip durations are negative.
-
Remove bad data.
-
This includes several hundred rides where Divvy took bikes out of circulation for Quality Control reasons.
-
This may also correspond to early cancellation times of rides by users.
-
- Some crucial data is missing in columns with Null values.
- These instances are represented by Null values and it is important to be aware of them for data cleaning.
Checking For Missing Data
First order of business is finding out if anything important is missing from the data, so we check which columns contain NULLS using the following query.
- The source for the following query comes from stackoverflow
SELECT column_name, COUNT(1) AS nulls_count
FROM `case-study1-bike-share.divvy_trips_2020_analysis.divvy_trips_2020` as table1,
UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(table1), r'"(\w+)":null')) column_name
GROUP BY column_name
ORDER BY nulls_count DESC
Results
After checking for nulls or missing data, we discover that some important information is missing from the trips: start/end station names & ids.
- Without this information we can’t track the trips, which means we need to remove these instances. Before proceeding to do delete rows, we need to identify if there is anything else we need look out for when cleaning the data.
The table below showcases which columns have missing information and how many rows are missing.
column_name | nulls_count |
---|---|
end_station_id | 111342 |
end_station_name | 110881 |
start_station_id | 95282 |
start_station_name | 94656 |
end_lat | 4255 |
end_lng | 4255 |
Checking For Duplicate Data
A common problem often encounter while cleaning data is duplicate names, which we’ll tackle in this section. Specifically, we will be looking for duplicate station names or multiple ids for one station using the following query.
WITH all_stations AS (
SELECT start_station_name, start_station_id, end_station_name, end_station_id
FROM `case-study1-bike-share.divvy_trips_2020_analysis.divvy_trips_2020`
)
SELECT DISTINCT station_name, station_id
FROM ( (SELECT DISTINCT start_station_name AS station_name, start_station_id AS station_id FROM all_stations)
UNION ALL (SELECT DISTINCT end_station_name AS station_name, end_station_id AS station_id FROM all_stations)
)
GROUP BY station_name, station_id
ORDER BY station_name
Results
After querying a list of stations with their respective ids, the outcome is downloaded as a csv file to view in EXCEL (unfortunately, some station names were cut off when viewing the table in BigQuery).
-
A quick and top-close inspection confirms the source of duplicate data: “(*)” and “(Temp)”.
-
Additionally, multiple ids are often found for the same station.
Below are two primary examples showcasing duplicates by either having the different ids or names.
station_name | station_id |
---|---|
Damen Ave & Walnut (Lake) St | 656 |
Damen Ave & Walnut (Lake) St | KA17018054 |
Damen Ave & Walnut (Lake) St (*) | 656 |
Wentworth Ave & Cermak Rd | 120 |
Wentworth Ave & Cermak Rd | 13075 |
Wentworth Ave & Cermak Rd (Temp) | 120 |
Creating Table About Stations
Due to the present duplicate data occuring in start & end stations, there is a need to fix the problem to not have excess information and be concise. A way to address these stations is by creating a new table as a list of station information. The desired parameters will be station_name, station_id, lat, lng, and later down the line we will add the number of users per member_casual as member_riders and casual_riders (to display overall station activity for marketing purposes).
A summary of the steps taken to create a new table showcasing the station info is presented below (meanwhile, the full query will be available here):
-
The new table will be called divvy_stations_2020.
-
We begin by identifying unique station names from start & end station names, fixing duplicates to only present unique station names.
-
Then, we create unique station ids that correspond to only one station.
-
Finally, we average the latitude and longitude values of all ride trips corresponding to the stations to determine the lat and lng geo-location of each station.
-
Now that we have the most basic station information, we need consider the trip data from 2020 where we notice that the geo-location can be different for the same station.
-
Latitude / Longitude values for the stations are included in the same row as each ride’s information. Since each bike has its own GPS device, there is slight variance in the lat/long values of every station per ride. However, each station can only have one unique geo-location, so we take mean value for all respective lat/long values of a station.
-
These values are then used to replaced the start and end station lat/long values for each ride.
-
Results
The full table is available here: cleaned and uncleaned. Below are the first 5 results.
station_name | station_id | lat | lng |
---|---|---|---|
2112 W Peterson Ave | 1 | 41.991161729691882 | -87.683592172455633 |
63rd St Beach | 2 | 41.781052497804261 | -87.57612809187053 |
900 W Harrison St | 3 | 41.8747676018918 | -87.649804773961 |
Aberdeen St & Jackson Blvd | 4 | 41.877720779404484 | -87.654792138860145 |
Aberdeen St & Monroe St | 5 | 41.880414761538468 | -87.655516799346827 |
Aberdeen St & Randolph St | 6 | 41.884114663376479 | -87.654270784691533 |
Cleaning Operations
Now that we have established a solid foundation of information about each station, we will use this data (station_name, station_id, lat, lng) to replace the trips corresponding information about start & end stations. The full query is available here and the verification of the results is available here, however below I will discussion a brief overview of the steps taken.
-
The goal of this new query is to create a new version of divvy_trip_data (v2), where certain cases are excluded, to make analysis of the trip data easier.
- Rides are removed when:
- Station names are missing (meaning they are NULL) from start or end station names because these rides cannot be tracked.
- The new station ids will be assigned based on their station names.
- Trip durations that are less than 60 seconds because these ride lengths are similar to a false start.
- Moreover, the first minute of all rides are charged a fix fee (even if rides are less than a minute).
- Trip durations are greater than 24 hours because Cyclistic bikes are expected to be returned to an appropriate station once they are no longer being used.
- Keeping a bike longer than a day is synonymous to stealing it and such cases result in an additional fee.
- Station names are missing (meaning they are NULL) from start or end station names because these rides cannot be tracked.
- Station names are altered to prevent duplicate names.
-
Duplicates end with a word or symbol in parathesis.
-
This makes implementing new station ids smoother.
-
- Using divvy_stations_2020, station information is updated for both start & end stations.
-
Specifically, the following parameters are updated (8 parameters total):
- Station name
- Station id
- Starting latitude
- Startting longitude
-
- Additional data is added to aggregate trip info more easily during analysis
-
The following fields of data are implemented:
- Starting month number (1-12) as month_num
- Starting month name as starting_month
- Starting day of the week (number) as weekday_num [1-7]
- Starting day of the week as weekday [Sun-Sat]
- Trip duration as ride_length measured in seconds
-
Results
ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casuaL | month_num | starting_month | weekday_num | weekday | ride_length |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
99C72397983C1DF5 | docked_bike | 2020-08-10 23:49:04.000000 UTC | 2020-08-11 12:49:16.000000 UTC | Lake Park Ave & 35th St | 351 | Lake Park Ave & 35th St | 351 | 41.831275652901688 | -87.6087894057451 | 41.831275652901688 | -87.6087894057451 | casual | 8 | Aug | 2 | Mon | 46812 |
08306A0B4F8B9C44 | docked_bike | 2020-08-10 23:49:04.000000 UTC | 2020-08-11 00:56:14.000000 UTC | Lake Park Ave & 35th St | 351 | Lake Park Ave & 35th St | 351 | 41.831275652901688 | -87.6087894057451 | 41.831275652901688 | -87.6087894057451 | casual | 8 | Aug | 2 | Mon | 4030 |
7A3E9EC138E2D0AD | docked_bike | 2020-08-10 23:49:17.000000 UTC | 2020-08-11 00:18:37.000000 UTC | Stockton Dr & Wrightwood Ave | 586 | Kedzie Ave & Milwaukee Ave | 333 | 41.931312260664328 | -87.638733542147918 | 41.929576232244784 | -87.707870112443771 | casual | 8 | Aug | 2 | Mon | 1760 |
5858B7C1FC4A89A2 | docked_bike | 2020-08-10 23:49:28.000000 UTC | 2020-08-11 00:57:04.000000 UTC | Lake Park Ave & 35th St | 351 | Lake Park Ave & 35th St | 351 | 41.831275652901688 | -87.6087894057451 | 41.831275652901688 | -87.6087894057451 | casual | 8 | Aug | 2 | Mon | 4056 |
CDF2EA7B3F5ED5FF | docked_bike | 2020-08-10 23:49:48.000000 UTC | 2020-08-11 00:39:13.000000 UTC | Halsted St & 59th St | 286 | Halsted St & 59th St | 286 | 41.787533838414639 | -87.644880773373984 | 41.787533838414639 | -87.644880773373984 | casual | 8 | Aug | 2 | Mon | 2965 |
Verified
Number of quality check uses in cleaned trip data
member_casual | quality_check_users |
---|---|
casual | 71 |
member | 197 |
From the following stations:
station_name | station_id |
---|---|
Base - 2132 W Hubbard Warehouse | 45 |
HQ QR | 310 |
Hubbard St Bike Checking (Lbs-Wh-Test) | 311 |
Hubbard_Test_Lws | 312 |
Mtl-Eco5.1-01 | 455 |
Watson Testing - Divvy | 633 |
Manipulating Data
After the cleaning operation, our data is nearly ready for analysis. But before that, we manipulate it to ease our analysis.
- The total number of members and casual riders visiting any station holds potential in identifying key differences among Cyclistic users. We can figure out the total rides from starting and ending stations by grouping data and counting the rows like this query.
Results: Top 5 Starting Stations
station_name | station_id | lat | lng | total_rides_from | total_rides_to | members_from_st | casuals_from_st | members_to_st | casuals_to_st |
---|---|---|---|---|---|---|---|---|---|
Clark St & Elm St | 135 | 41.902963895113132 | -87.631312305029311 | 31012 | 31180 | 19886 | 11126 | 20565 | 10615 |
Kingsbury St & Kinzie St | 345 | 41.889192992430722 | -87.638501411531365 | 21882 | 21337 | 16260 | 5622 | 16454 | 4883 |
St. Clair St & Erie St | 564 | 41.894421578032492 | -87.622700546409931 | 21188 | 23853 | 15559 | 5629 | 17410 | 6443 |
Dearborn St & Erie St | 208 | 41.894004025551475 | -87.629318830077878 | 23559 | 24134 | 15453 | 8106 | 16039 | 8095 |
Broadway & Barry Ave | 56 | 41.937610986894825 | -87.644098303068489 | 23167 | 24057 | 15372 | 7795 | 15787 | 8270 |
Step 4: Perform Analysis
Now that the preliminary work is done we can begin conducting a detailed analysis on the data. But first we need to recall the goal: to discover how do annual members and casual riders use Cyclistic bikes differently. The full query used for performing analysis is provided here.
Specifically, we are interested in analyzing user behavior in the following ways:
-
Weekly
-
Monthly
-
Time of day
-
Type of rideable
-
And geographically
Cyclistic Weekly Analysis
Riders behave very different based on weekly trips. During the weekly analysis, we scrutinize the following parameters:
-
Usertype (member / casual)
-
Weekday (number [1-7] & day [Sun-Sat])
-
Average ride length (in secs & time duration)
-
Max ride length
-
Minimum ride length
-
Number of rides
-
And part of total rides
Resulting Table
This table is available for download here.
member_casual | weekday_num | weekday | avg_ride_len_secs | avg_ride_len | max_ride_len | min_ride_len | num_rides | part_of_tot_rides |
---|---|---|---|---|---|---|---|---|
casual | 1 | Sun | 2520 | 00:42:00 | 23:55:54 | 00:01:01 | 242713 | 0.072880308537139046 |
casual | 2 | Mon | 2239 | 00:37:19 | 23:59:49 | 00:01:01 | 131039 | 0.039347553490740761 |
casual | 3 | Tue | 2101 | 00:35:01 | 23:54:26 | 00:01:01 | 127485 | 0.038280381083243054 |
casual | 4 | Wed | 2061 | 00:34:21 | 23:57:59 | 00:01:01 | 141395 | 0.04245718698878418 |
casual | 5 | Thu | 2073 | 00:34:33 | 23:48:04 | 00:01:01 | 150913 | 0.045315191202223465 |
casual | 6 | Fri | 2183 | 00:36:23 | 23:58:33 | 00:01:01 | 188455 | 0.056588063043044826 |
casual | 7 | Sat | 2427 | 00:40:27 | 23:59:54 | 00:01:01 | 296754 | 0.089107394658012387 |
member | 1 | Sun | 1044 | 00:17:24 | 23:29:04 | 00:01:01 | 254494 | 0.076417831928453209 |
member | 2 | Mon | 857 | 00:14:17 | 23:17:37 | 00:01:01 | 276207 | 0.082937672807462157 |
member | 3 | Tue | 855 | 00:14:15 | 23:56:39 | 00:01:01 | 298331 | 0.089580926139898681 |
member | 4 | Wed | 861 | 00:14:21 | 23:39:52 | 00:01:01 | 311855 | 0.093641826432245057 |
member | 5 | Thu | 870 | 00:14:30 | 23:46:15 | 00:01:01 | 310516 | 0.093239760069375219 |
member | 6 | Fri | 901 | 00:15:01 | 23:52:08 | 00:01:01 | 307400 | 0.092304107502756513 |
member | 7 | Sat | 1036 | 00:17:16 | 23:59:43 | 00:01:01 | 292739 | 0.087901796116621467 |
Cyclistic Monthly Analysis
Riders also behave very differently based on monthly trips. During the monthly analysis, we scrutinize the following parameters:
-
Usertype (member / casual)
-
Month (number [1-12] & day [Jan-Dec])
-
Average ride length (in secs & time duration)
-
Max ride length
-
Minimum ride length
-
Number of rides
-
And part of total rides
Resulting Table
This table is available for download here.
member_casual | month_num | starting_month | avg_ride_len_secs | avg_ride_len | max_ride_len | min_ride_len | num_rides | part_of_tot_rides |
---|---|---|---|---|---|---|---|---|
casual | 1 | Jan | 2273 | 00:37:53 | 23:48:04 | 00:01:05 | 7678 | 0.002305500772303723 |
casual | 2 | Feb | 2364 | 00:39:24 | 23:55:55 | 00:01:06 | 12206 | 0.0036651396752721077 |
casual | 3 | Mar | 2474 | 00:41:14 | 23:53:10 | 00:01:01 | 24418 | 0.0073320809922000924 |
casual | 4 | Apr | 2372 | 00:39:32 | 23:58:33 | 00:01:01 | 23331 | 0.0070056835788770725 |
casual | 5 | May | 2678 | 00:44:38 | 23:53:20 | 00:01:01 | 85914 | 0.0257977068705004 |
casual | 6 | Jun | 2554 | 00:42:34 | 23:55:42 | 00:01:01 | 152794 | 0.045880005861340854 |
casual | 7 | Jul | 2686 | 00:44:46 | 23:58:54 | 00:01:01 | 265559 | 0.079740359415499409 |
casual | 8 | Aug | 2273 | 00:37:53 | 23:59:49 | 00:01:01 | 277794 | 0.083414207025441583 |
casual | 9 | Sep | 1961 | 00:32:41 | 23:59:33 | 00:01:01 | 212041 | 0.063670316392296658 |
casual | 10 | Oct | 1692 | 00:28:12 | 23:59:54 | 00:01:01 | 120624 | 0.03622020384974789 |
casual | 11 | Nov | 1763 | 00:29:23 | 23:55:03 | 00:01:01 | 72106 | 0.021651528873109178 |
casual | 12 | Dec | 1436 | 00:23:56 | 23:41:27 | 00:01:01 | 24289 | 0.0072933456965987411 |
member | 1 | Jan | 652 | 00:10:52 | 23:20:02 | 00:01:01 | 134865 | 0.040496400319971558 |
member | 2 | Feb | 666 | 00:11:06 | 23:52:08 | 00:01:01 | 125383 | 0.037649205956467535 |
member | 3 | Mar | 784 | 00:13:04 | 23:41:42 | 00:01:01 | 114351 | 0.034336587498528658 |
member | 4 | Apr | 1094 | 00:18:14 | 23:56:39 | 00:01:01 | 60215 | 0.018080975384770603 |
member | 5 | May | 1172 | 00:19:32 | 23:29:04 | 00:01:01 | 111415 | 0.03345498418158626 |
member | 6 | Jun | 1107 | 00:18:27 | 23:46:15 | 00:01:01 | 184961 | 0.055538907052105881 |
member | 7 | Jul | 1050 | 00:17:30 | 23:52:12 | 00:01:01 | 276113 | 0.082909447088186761 |
member | 8 | Aug | 980 | 00:16:20 | 22:05:05 | 00:01:01 | 318269 | 0.0955677813623774 |
member | 9 | Sep | 901 | 00:15:01 | 22:59:48 | 00:01:01 | 279042 | 0.083788948489863971 |
member | 10 | Oct | 830 | 00:13:50 | 23:59:43 | 00:01:01 | 211750 | 0.063582936771986634 |
member | 11 | Nov | 811 | 00:13:31 | 23:36:24 | 00:01:01 | 147124 | 0.044177454496537247 |
member | 12 | Dec | 748 | 00:12:28 | 22:49:28 | 00:01:01 | 88054 | 0.026440292394429805 |
Cyclistic Time of Day Analysis
Riders appear to follow notable trends based on the start of their trips. During the monthly analysis, we scrutinize the following parameters:
-
Usertype (member / casual)
-
Weekday (number [1-7] & day [Sun-Sat])
-
Time of day (in hours)
-
Number of rides
-
And part of total rides
Resulting Table
This table is available for download here.
member_casual | weekday_num | weekday | time_of_day | num_rides | part_of_tot_rides |
---|---|---|---|---|---|
casual | 1 | Sun | 0 | 5504 | 0.0016527059456576832 |
casual | 1 | Sun | 1 | 3812 | 0.0011446429987004157 |
casual | 1 | Sun | 2 | 2264 | 0.00067981945148419239 |
casual | 1 | Sun | 3 | 1176 | 0.00035312176455185967 |
casual | 1 | Sun | 4 | 714 | 0.00021439535704934336 |
casual | 1 | Sun | 5 | 591 | 0.00017746170310386825 |
casual | 1 | Sun | 6 | 1128 | 0.000338708631304845 |
casual | 1 | Sun | 7 | 1912 | 0.0005741231410060848 |
casual | 1 | Sun | 8 | 3546 | 0.0010647702186232094 |
casual | 1 | Sun | 9 | 6924 | 0.0020790944708818674 |
casual | 1 | Sun | 10 | 11461 | 0.0034414358363340676 |
casual | 1 | Sun | 11 | 17117 | 0.0051397833706072977 |
casual | 1 | Sun | 12 | 20244 | 0.0060787389469284412 |
casual | 1 | Sun | 13 | 23164 | 0.0069555378861218342 |
casual | 1 | Sun | 14 | 24292 | 0.0072942465174266795 |
casual | 1 | Sun | 15 | 24548 | 0.0073711165614107578 |
casual | 1 | Sun | 16 | 22629 | 0.0067948915051394831 |
casual | 1 | Sun | 17 | 19534 | 0.0058655446843163487 |
casual | 1 | Sun | 18 | 16201 | 0.0048647327444767676 |
casual | 1 | Sun | 19 | 12597 | 0.0037825466565134152 |
casual | 1 | Sun | 20 | 9089 | 0.0027291868350440922 |
casual | 1 | Sun | 21 | 5993 | 0.0017995397406116453 |
casual | 1 | Sun | 22 | 4722 | 0.001417891983175069 |
casual | 1 | Sun | 23 | 3551 | 0.0010662715866697735 |
casual | 2 | Mon | 0 | 2210 | 0.0006636046765813009 |
… | … | … | … | … | … |
Cyclistic Rideable Analysis
Riders have a higher tendency to use certain rideable types than others. This analysis aims to ascertain which rideables are more favored. During the analysis we look out for the following parameters:
-
Rideable Type (classic / docked / electric bikes)
-
Usertype (member / casual)
-
Number of rides
-
And part of total rides
Resulting Table
This table is available for download here
rideable_type | member_casual | num_rides | part_of_tot_rides |
---|---|---|---|
classic_bike | casual | 11158 | 0.0033504529327122875 |
classic_bike | member | 58496 | 0.017564805050361891 |
docked_bike | casual | 1125692 | 0.33801559981455104 |
docked_bike | member | 1784073 | 0.53571003898752545 |
electric_bike | casual | 141904 | 0.042610026255924395 |
electric_bike | member | 208973 | 0.062749076958924968 |
Cyclistic Geographic Analysis
Members and casual riders concentrate their trips on different destinations. Members spreadout throughout Chicago, meanwhile casual users are more focus near the pier. Moreover, when looking at the starting stations, riders follow a notable trend based on their start
Riders appear to follow notable trends based on the start of their trips. During the monthly analysis, we scrutinize the following parameters:
-
Station Name
-
Station Id
-
Latitude & longitude
-
Total Trips From Station
-
Trips By Annual Members
-
Trips By Casual Users
-
Fraction of Members From Daily Trips At Station (normalized to 1)
-
Fraction of Casual Riders From Daily Trips At Station (normalized to 1)
Resulting Table
This table is available for download here.
Data Summary
The data can be summarized as below:
-
Total number of stations is 683.
- Out 3.54 million rows of trip data, 3.3 million rides remain.
-
Trips without a start or end station name are removed because without station names we can’t track the trips. Additionally, this information is used to assign new station ids.
-
Trips less than a min are also removed because they can be considered a false start of sorts. Moreover, the price is set for all rides within a minute.
-
Trips more than a day are removed because all rideables must be returned after they are used. Rides longer than a day can be considered stolen and users are even charged an additional fee for such cases.
-
-
Out of the 3.54 million rides, 61.6% were taken by subscribers and 38.4% were taken by casual customers
- Average ride duration for a casual biker is 46 minutes long, meanwhile the average ride length for a member is 12 minutes long.
Table: Summary of Trips
summary_of | started_with | ended_with | eliminated | retained (%) |
---|---|---|---|---|
rides | 3,541,683 | 3,330,296 | 211,387 | 94.031 |
stations | 696 | 683 | 13 | 98.132 |
Table: Summary of Riders
member_casual | num_riders | percent | avg_ride |
---|---|---|---|
casual | 1,278,754 | 38.40 | 37:55 |
member | 2,051,542 | 61.60 | 15:15 |
Data Story: Cyclistic Bikeshare Analysis of 2020
For the full interactive Tableau Story please visit here!.
Below you will find key insights pertaining to each dashboard within the story, you are more than welcome to have a look.
Customer Usage
Seasonal trends played a major role in the number of rides Cyclistic experienced. During the year 2020, casual riders accounted for 38.4% of total rides while member riders made up a whopping 61.6% of total trips. Summer time brought a huge increase in the number of casual riders, in addition to overall ridership. Meanwhile, during winter time, by far the least popular season, there is a major drop-off in rides. This makes sense because bike riding is more suited for warm weather, however it important to plan out events, activities, or promotional content during Cyclistic’s most popular time: summer.
By taking advantage of the seasonal trend, and the fact that casual riders are already familiar with the Cyclistic program and using Cyclistic bikes, this can be a good opportunity to convert them to annual members.
It is also important to note user preference when it comes to which rideable popularity. Docked bikes are by far the most popular but electric bikes also receive their fair share of attention. Moving forward, focusing on these bikes will continue to yield more customers.
Monthly Analysis
Taking a deeper dive into how customers use Cyclistic bikes throughout the year, we consider the number of trips taken & their average ride length on a month-to-month basis.
Seasonal Trend
We confirm once more the favorite season for riders is summer. Moreover, a rising trend is seen during early spring, when the number of rides begin increasing around May. Peak usage occurs during August, then we see a falling trend up during fall, ending with the least favorite cycling period: winter.
Best Time For Marketting Campaign
Since we know that riders prefer biking during summer time and we see a large influx of trips taken during late Spring, this means a sufficient number of users are aware and have experience Cyclistics bikeshare services. Thus, early May is the optimal time to begin advertising.
Weekly Analysis
Number of Rides
When looking at the number of rides, casual riders show a dramatic predisposition to favor weekend for biking. Meanwhile, annual members are overall consitent in how many rides they take throughout the week; their highest usage day occurs midweek.
Ride Length
Saturday and Sunday are favored by both groups for longer rides, revealing a U-shape pattern when ride duration decreases during the week. Moreover, just like during the monthly analysis, casual riders have nearly double the ride time of members.
Insight
Despite their high inclination towards riding during week, casual riders still take plentiful bike trips during week. This shows promise when attempting to convert to members. Of course, usage time plays a role in their conversion rate. Shorter trips indicate a purposeful usage (of knowing where you are going), while longer trips can be associete with exploring or touring. But with so many riders, Cyclistic is bound gain notably more members by targetting those key individuals.
Targetting Group
Casual Riders with ride times whose average ride duration nears that of members (~15mins) and uses the Cyclistic bikeshare program consistently.
Potential Secondary Marketting Campaign
Due to casual riders being more prominent during the weekend, a potential way to increase the number Cyclistic users is by inviting even more casual riders. Specific bonus packages can be tailure for leasure riders, such a Museum pass, beach activities, coastal group rides, discounts using Cyclistic for certain activities, and more!
Daily Analysis
Knowing when riders take trips is important in determining Cyclistic use cases, so we take a deeper dive into how often riders take trips throughout the week, except we consider their experience on an hour-to-hour basis.
Popular Times
During the week, Cyclistic experiences peak usage times at 8AM & 5PM, hours associated with commuting to-and-from work. During the weekend, popular times occur from 11AM to 5PM suggesting that rides are taken for leasure.
Insights
Members most likely use Cyclistic bikes to comute, and while casual riders favor weekends, there are plenty of users who take trips during similar periods as members. These casual riders are already using Cyclistic services similar to members, and thus have a higher chance of being converted.
Geographical Analysis
Majority of trips are taken downtown Chicago, regardless of user type. However, stations with more activy from casual users are near the pier or spreadout across stations. Popular stations not near the pier and near the downtown area are dominated by annual members.
This means that near the pier, there are a lot of riders leasurely spending their time. Still, plenty of casual riders exist along the cost line and within stations with numerious trips (downtown) which holds promise in conversion.
Insights The prior two targetting groups still hold promise in increasing the number users. Moreover, if Cyclistic invents a leasure member group we can convert casual riders who enjoy weekend rides into members. This is of couse in addition to casual riders who share the use cases with current annual members (of 2020).
Conclusion
Recommendations
By focusing on seasonality, usage time, and usage case, we can formulate successful marketting strategies to convert casual riders to members.
Seasonality
In Spring the demand by casual riders increases, peaking during the summer. This is an opportune time to host a targeted advertising campaign and promotional activities.
Launch the marketing campaign in early Spring to draw-in incoming casual riders and run through the summer to convert them into annual members.
Schedule different promotions and offers to highlight the value of becoming an annual member instead of remaining a casual rider. Discounts for signing up now, raise awareness that summer will soon arrive and be ready.
Usage Time
Daily usage revealed that both groups peak during the evening (~5PM), however in the morning only members showed high activity. This could mean that most casual riders are not aware of the potential benefits of using Cyclistic rides for commuting early in the morning.
Use digital media to raise awareness about the benefits of using Cyclistic bikeshare for morning and evening comutes.
The goal is to get casual riders to spend more time on Cyclistic bikes, increasing the odds of upgrading their membership status.
Usage Case
Many casual riders share the pattern of enjoying leasure rides during the weekends. Proof is their drastic change in number of rides when compared to weekdays. So why not tap into that potential?
Create a new package deal for annual members, catoring to customers who want to spend leasure time riding Cyclistic Bikes.
Adding these new benefits, for members and incoming casual riders, specifically tailored to entertainment and joyful activites will improve conversion rate and member retention in the longer.
Use digital channels to market and promote these new benefits. Possibly hosting summer events to guide casual riders into how they can enjoy leasure rides with others.
Final Thoughts
From the Cyclistic case study, the analysis results showed that casual riders could be categorized into two groups: those that shared similarities with members and those that used Cyclistic bikes for leasure or entertainment.
The marketting campaign holds promise in converting casual users to annual members. Though not easy, with targetted advertising via email and social media, Cyclistic can improve the conversion rate by raising awareness of their benefits and tailuring new benefits to bring more riders
Thank you for taking the time to read this case study
Working on the capstone project was a good opportunity to learn about the real-world data analysis process. I learned a ton about data preparation and cleaning by actually working on those processes. Checking for data integrity and reliability is crucial before anything.
It’s been fun finding out how data analysis can inform business decisions and lead to effect actions. Until next time, have fun analyzing!