Google Data Analytics Capstone: Cyclistic Bike-Share Analysis using BigQuery SQL and Tableau
How Does a Bike-Share Navigate Speedy Success?
The Cyclistic Case Study is one of the capstone projects for completion of Google Data Analytics, a professional certificate program offered by Google and Coursera. I started the course last July 2022 and just finished recently. I started this course to further my career as a data analyst and enhance my programming and project management skills. Google recommended students to create a porfolio showcasing the techniques and skills we learned throughout the course. They provided case studies for students to analyze and include in the their portfolio.
In this case study, I am a junior data analyst working as part of the marketing analyst team at Cyclistic, a bike-share company in Chicago. I am tasked to gain data insights, identify trends, and provide professional data visualizations using historical data provided by the company on how casual riders and annual members use Cyclistic bikes differently. From these insights, our team will design a new marketing strategy to convert casual riders into annual members.
About the Company
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.
- How doa annual members and casual riders use Cyclistic bikes differently?
- Why would casual riders buy Cyclistic annual memberships?
- How can Cyclistic use difital media to influence casual riders to become members?
Business Use Case:
The main objective is to design marketing strategies aimed at converting casual riders to annual members by understanding how they differ
- The most used ride type
- Aggregated 7-day usage behavior of each rider type
- Monthly, yearly and quarterly usage behavior of each rider type
- Most popular stations
- Bike ride length
- Difference between casual riders and annual members
- Lily Moreno, Marketing Director
- Cyclistic Executive Team
- Cyclistic Marketing Analytics Team
About the data
The data is owned and provided by Cyclistic (first-party) and is stored securely in an Amazon Web Server. The dataset list is being updated regularly and contains historical data ranging from as far as 2013 up to 2022. Data is originally in zip files and is kept in csv files separated by year and month. To better fit the business task and as the normal annual business calendar is from January to December, I will use the data with dates ranging from January 2021 to December 2021.
The data list is publicly available here.
After downloading and unzipping the files, the datasets are stored in my local drive. I renamed the files following a common naming convention. There are 12 CSV files in total.
Initial Data Dictionary
All the data files contain structured data, organized in rows(records) and columns(fields). The rows represents each trip and the columns are labeled according to the data each column represents. Each record includes the following fields:
- Data size: large
- # of fields: 13
Data Cleaning and Manipulation
For large datasets such as these, it is more appropriate and efficient to use a tool suited for handling large datasets. In this case, I used SQL instead of Google Sheets/MS Excel. I chose Google BigQuery for this. I can use Google Sheets or Excel but it will take a lot of time processing each dataset, more so on an aggregated dataset.
To start, I uploaded the 12 csv files from a bucket I created through Google Cloud Storage to the BigQuery project I made. I opted to make temporary tables through the WITH clause than to make a new permanent table and constantly update its values. Here are the steps I did:
- Merged the 12 csv files into one temporary dataset
- There is inconsistency with the station IDs with the station names so I did not include the station IDs in merging. It’s tedious to re-assign station IDs.
- Calculated ‘ride_length_minute’
- Calculated through TIMESTAMP_DIFF function. This function substracts timestamps and returns the unit of time indicated. In this case, I used MINUTE.
- I did this instead of the HH:MM:SS format to avoid unnecessary steps.
- Calculated the ‘day_of_week’ that each ride started
- Used EXTRACT to extract the the day of the week from ‘started_at’ date. This returns a number so I used CASE expression to assign values for each number. Example: 1 is ‘Sunday’
- I did this because DAYOFWEEK will not work without the EXTRACT funciton.
- Calculated the ‘month’ that each ride started
- Used EXTRACT to extract the the month from ‘started_at’ date. This returns a number so I used CASE expression to assign values for each number. Example: 1 is ‘January’
- I did this because MONTH will not work without the EXTRACT function.
- Cleaned station names with TRIM and REPLACE
- Removed trailing spaces and unnecessary characters
- Cleaned latitude and longitude with AVG and ROUND
- Originally, the latitude and longitude columns have inconsistent decimal points. The solution is to round up to 6 decimal points and average them.
- Joined appropriate tables created with WITH clause
- Joined the aggregated data to the station data. Aggregated data contains data cleaning from days of week and month while station data contains data cleaning from station information such as lat, long, and names.
- Save as a permanent table/dataset
- In BigQuery, you can save queries and tables into CSV files, JSON, Google Sheet or a BigQuery table. As the data is too large to save as CSV locally, I chose to save inside BigQuery as a new permanent table.
- Named the new table ‘cyclistic_trips_year_2021’
The full data cleaning query can be found here.
Data Analysis — Exploratory through BigQuery
Total Recorded Trips for Year 2021
How many trips were recorded for the year 2021?
From the above query, the total recorded trips from January to December of year 2021 is 5,595,063. That is a lot!
Total Trips (Member VS Casual)
What is the percentage of member and casual riders in terms of trips?
Of the 5,595,063 total trips, 54.8% were from annual members while 45.2% were from casual riders.
Average Ride Lengths (Member VS Casual)
How do these groups differ in terms of ride length?
Casual riders average ride takes around 18 minutes more than member riders. Why is it? Let’s find out.
Max Ride Lengths
Why is it that the difference between the average ride lengths of casual and member riders are so high?
As we can see from the query result, the longest ride duration for 1 casual rider was 55,944 minutes or 932 hours or 38 days. This high number significantly impacted the average ride length for casual riders, thus the high difference of average trip duration between the two rider types. On the other hand, the longest trip for annual riders was 1,559 minutes or about 26 hours.
I also tried to investigate further if there were more than one outliers. It turned out there were more than one outliers. Since this is the case, I calculated the median ride
Median Ride Lengths
A more accurate way of finding the difference between ride duration of casual and annual riders.
I queried the median ride duration ignoring the null values for a more accurate result. Casual riders takes 6 minutes more ride duration than annual riders do.
Busiest Day for Rides
What day of the week has the most rides?
The busiest day for member riders is on Wednesday while Saturday for casual riders.
Median Ride Length per Day
What is the median ride duration per day of each rider group?
It can be seen that the highest median ride length for both casual riders and members is on Saturday and Sunday. This could mean that both rider group have more time to ride or go for a trip during weekends than on weekdays. Casual riders also has longer ride duration compared to members.
Total Rides per Day
How many rides per day does each rider group take?
From the result, it can be seen that most trips occurred during weekends(Saturday and Sunday) for both members and casual riders. But if we look at it further, there are higher trips taken by members than the casual riders during weekends. This could mean members most often use the service during weekdays and casual riders most often use the service during weekends.
Most Popular Start Stations
What station is the most popular for each rider type?
Voila! It seems like the trips made by the riders start from different regions. We’ll look into this better through a visualization. From the top 10 start station for members and casual riders, there is only one station where both riders’ top 10 start station is the same. It is the Wells St & Concord Ln.
Most Used Bike Type
What is the most popular bike riders use?
There are three bike types riders may use. Riders may use either a classic bike, an electric bike or a docked bike.
The query showed that riders use classic bikes more than the two bike types. Next, I investigated the most popular bike type by members and casual riders.
The above remains true where classic bikes is the most used bike for each rider group. Another interesting thing here is that only one annual rider used a docked bike. Casual riders used docked bikes more than members.