Friday, February 10, 2023

Google Data Analytics Capstone: Cyclistic Case Study Using PostgreSQL pgAdmin4, Tableau & Anaconda

INTRODUCTION

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

Cyclistic categorized the riders as Casual who purchased single-ride or full-day passes and Members who purchased annual memberships.

Cyclistic’s marketing strategy relied heavily on building general awareness and appealing to broad consumer segments. Marketing team created flexibility on pricing plans like single-ride passes, full-day passes, and annual memberships. But there is no effort for an upselling plan to convert the casual riders into members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders.

Ms. Moreno, Director of Marketing Team, has a clear goal: Design marketing strategies aimed at converting casual riders into annual members. As a Junior Analyst in the Marketing Team, I was assigned to answer: How do annual members and casual riders use Cyclistic bikes differently?

BUSINESS TASK

Identify, analyze, and assess the bike usage trends and patterns of casual riders/members for the Cyclistic future marketing program.

DATA SOURCES

Internal data of Cyclistics was used in this case study under Motivate International Inc, under license. It is the original data from the source. A 12-month (November 2021 to October 2022) historical dataset was downloaded and stored in Google Drive and organized in a YYYYMM_CompanyName manner. To avoid sampling bias we will use the dataset that will represent the population and anonymization was used to implement data privacy.

Reliable — it is first-party data. It is owned by Cyclistic.

Original — Dataset is straight from the source. Although already anonymized the other data is still intact.

Comprehensive — There is some nulls but the importance of it can be checked in “Process” stage.

Current — Data is up to date and has the current information of the cyclers’ usage.

Cited — The City of Chicago owns all right, title, and interest in the Data. Thus, this is a reliable data source.

Since the finance analysts already concluded the annual memberships are much more profitable, anonymized data provided will help to focus on the main task which is to define the different usage of the bike renters.

CLEANING OR MANIPULATION OF DATA

Since the data exceeds the allowable complimentary BigQuery sandbox, I loaded the dataset in pgAdmin4 to use SQL. Tableau Public was used to create an interactive visualization or a dashboard and for an easier way to check the relations between columns, I used Anaconda.

STEP 1: Creating a table named bike.

Since all of our downloaded files have similar features, it will be logical to create only one table. Using this kind command, you can easily manipulate the features into correct data types.

CREATE TABLE IF NOT EXISTS public.bike
(
ride_id character varying COLLATE pg_catalog."default" NOT NULL,
rideable_type character varying COLLATE pg_catalog."default",
started_at timestamp without time zone,
ended_at timestamp without time zone,
start_station_name character varying COLLATE pg_catalog."default",
start_station_id character varying COLLATE pg_catalog."default",
end_station_name character varying COLLATE pg_catalog."default",
end_station_id character varying COLLATE pg_catalog."default",
start_lat numeric COLLATE pg_catalog."default",
start_lng numeric COLLATE pg_catalog."default",
end_lat numeric COLLATE pg_catalog."default",
end_lng numeric COLLATE pg_catalog."default",
member_casual character varying COLLATE pg_catalog."default",
CONSTRAINT ride_pkey PRIMARY KEY (ride_id)
)

Ensure to use the correct data type for future data aggregation/manipulation and Tableau visualization.

STEP 2: Copying and merging data.

We will not use any JOIN or UNION command because we created only one table. Stacking the 12-month in the ride table created.

COPY bike
--FROM '/Users/wilmalapuz/Downloads/202111_Cyclistic.csv'
--FROM '/Users/wilmalapuz/Downloads/202112_Cyclistic.csv'
--FROM '/Users/wilmalapuz/Downloads/202201_Cyclistic.csv'
--FROM '/Users/wilmalapuz/Downloads/202202_Cyclistic.csv'
--FROM '/Users/wilmalapuz/Downloads/202203_Cyclistic.csv'
--FROM '/Users/wilmalapuz/Downloads/202204_Cyclistic.csv'
--FROM '/Users/wilmalapuz/Downloads/202205_Cyclistic.csv'
--FROM '/Users/wilmalapuz/Downloads/202206_Cyclistic.csv'
--FROM '/Users/wilmalapuz/Downloads/202207_Cyclistic.csv'
--FROM '/Users/wilmalapuz/Downloads/202208_Cyclistic.csv'
--FROM '/Users/wilmalapuz/Downloads/202209_Cyclistic.csv'
FROM '/Users/wilmalapuz/Downloads/202210_Cyclistic.csv'
DELIMITER ','
CSV Header;

STEP 3: Data Familiarization.

Schema of the table created.

SELECT *
FROM information_schema.columns
WHERE table_name = 'bike';

Checking the total rows of the table

SELECT COUNT(*)
FROM bike

Getting the head of the table

SELECT *
FROM bike
LIMIT 10

We need to know how many distinct observation per column

SELECT 
COUNT(DISTINCT ride_id) AS rideid
,COUNT(DISTINCT rideable_type) AS ridetype
,COUNT(DISTINCT started_at) AS startat
,COUNT(DISTINCT ended_at) AS endedat
,COUNT(DISTINCT start_station_name) AS startstation
,COUNT(DISTINCT start_station_id) AS startid
,COUNT(DISTINCT end_station_name) AS endtstation
,COUNT(DISTINCT end_station_id) AS endid
,COUNT(DISTINCT start_lat) AS startlat
,COUNT(DISTINCT start_lng) AS startlng
,COUNT(DISTINCT end_lat) AS endlat
,COUNT(DISTINCT end_lng) AS endlng
,COUNT(DISTINCT member_casual) AS membercasual
FROM bike

Since the count and count distinct of ride_id is the same, we will not use any delete duplicate function for this dataset.

How many nulls per column

select 
sum(case when ride_id is null then 1 else 0 end) rideid
,sum(case when rideable_type is null then 1 else 0 end) ridetype
,sum(case when started_at is null then 1 else 0 end) startat
,sum(case when ended_at is null then 1 else 0 end) endat
,sum(case when start_station_name is null then 1 else 0 end) startstation
,sum(case when start_station_id is null then 1 else 0 end) startid
,sum(case when end_station_name is null then 1 else 0 end) endtstation
,sum(case when end_station_id is null then 1 else 0 end) endid
,sum(case when start_lat is null then 1 else 0 end) startlat
,sum(case when start_lng is null then 1 else 0 end) startlng
,sum(case when end_lat is null then 1 else 0 end) endlat
,sum(case when end_lng is null then 1 else 0 end) endlng
,sum(case when member_casual is null then 1 else 0 end) membercasual
from bike

We will go on to the next dataset feature which is the rideable_type.

SELECT rideable_type, COUNT(rideable_type)
FROM bike
GROUP BY rideable_type
ORDER BY COUNT(rideable_type) ASC

There are 3 classes of types in this data; docked, classic and electric.

SELECT started_at, ended_at
FROM bike
ORDER BY started_at ASC

STEP 4: Adding/Dropping columns for data analysis.

Going back to our business question, which is to analyze the bike usage of members and casual riders, we will now list all the underlying queries that we can formulate to answer the business task.

-What is the popular start time of the cyclers?

-What is the average cycling time?

-What is the busiest and slowest day/month?

-What is the average distance per cycle?

-Where is the most active station/dormant station?

By those questions above, we will need to drop the end_station_names and station_id column which consists of almost 1 million NULLS per column. We will add columns like starttime (extracted from started_at), cycletime (elapsed time between ended_at and started_at), month/day/year (convert the timestamp to their designated month/day/year), and distance traveled (using point command).

SELECT ride_id, rideable_type, started_at, ended_at, start_station_name AS station_name, start_lat, start_lng, end_lat, end_lng, member_casual
,started_at::time AS starttime
,extract
(
epoch from ((ended_at - started_at)/60)::interval
)
as cycletime
,TO_CHAR(started_at, 'fmDay') AS "day"
,TO_CHAR(started_at, 'fmMonth') AS "month"
,TO_CHAR(started_at, 'YYYY') AS "year"
,((point(start_lng, start_lat) <@> point(end_lng, end_lat)) * 1.609)as distance
FROM bike
WHERE start_lat != end_lat AND start_lng != end_lng

We will not stop there, we want to clean it further by using the ORDER BY command on our newly added column. We found out that there are negative cycle times and less than 5 meters cycling distance in some recorded rides. We will exclude those using the command below:

SELECT *
from (SELECT ride_id, rideable_type, started_at, ended_at, start_station_name AS station_name, start_lat, start_lng, end_lat, end_lng, member_casual
,started_at::time AS starttime
,extract
(
epoch from ((ended_at - started_at)/60)::interval
)
as cycletime
,TO_CHAR(started_at, 'fmDay') AS "day"
,TO_CHAR(started_at, 'fmMonth') AS "month"
,TO_CHAR(started_at, 'YYYY') AS "year"
,((point(start_lng, start_lat) <@> point(end_lng, end_lat)) * 1.609)as distance
FROM bike
WHERE start_lat != end_lat AND start_lng != end_lng) as Q
WHERE Q.cycletime >2 AND Q.distance > 0.005

With that, we only now have 5,091,767 rows for us to upload in Tableau.

ANALYSIS SUMMARY

We will use Tableau to answer our questions that will help us to derive our recommendations.


Google Capstone Case Study by Wilma Lapuz
Google Capstone Case Study by Wilma Lapuz


What is the median Travel Time?

Casual Riders strolled more than the Members
Casual Riders strolled more than the Members

What is the median distance per bike rented?

What is the median distance per bike rented?
Casual Riders had covers more distance than the Members

What is the busiest and slowest day/month?

What is the busiest and slowest day/month?
Casual Riders: Busiest July | Slowest January | Busiest Saturday | Slowest Tuesday WHILE Member Busiest August | Slowest January | Busiest Thursday | Slowest Sunday

What is the popular start time of the cyclers?


What is the popular start time of the cyclers?
Casual - Saturdays around 1500Hrs
Members - Tuesdays by 1700Hrs


Where is the most active station/dormant station?

Where is the most active station/dormant station?
Most Active for Casual Riders: Streeter Dr & Grand Ave


Streeter Dr & Grand Ave is located near the most popular tourist attractions/spots in Chicago, The Navy Pier.
Most Active for Members: Wells St & Concord Ln

Wells St & Concord Ln station is surrounded by communities, schools and a church.

Is there a trend or patterns in renters usage?

Google Data Analytics Capstone: Cyclistic Case Study
Winter

Google Data Analytics Capstone: Cyclistic Case Study
Spring

Google Data Analytics Capstone: Cyclistic Case Study
Summer

Google Data Analytics Capstone: Cyclistic Case Study
Autumn


The Spring Season figure shows that there is a positive upward trend of renters at the end of season. Then there is a steady trend of bike rentals n the Summer season and negative down slope when the winter is nearing.

Classic bikes are opted to be rented in the Summer and Winter season too while electric bikes are the popular choice for Spring and Autumn season.

Is there a correlation?

To prove if there is correlation between number of rented bikes and season, I downloaded the average temperature history and loaded it in Anaconda.

import pandas as pd
tempcalc = pd.read_csv('temp_calc.csv')
tempcalc.corr()

Google Data Analytics Capstone: Cyclistic Case Study Correlation
Correlation


A correlation coefficient of is too far from 1, meaning there is no linear relationship.

KEY FINDINGS

  • Members rented 60% of the total rides.
  • 53% of Casual Riders prefer electric bikes while 54% of Members chose classic bikes.
  • Casual Riders travel longer and farther than the Member.
  • The usual starting station of Members are from the community area while Casual Riders are from Chicago’s Bay Area.
  • Season affects the bike rentals.


Google Data Analytics Capstone: Cyclistic Case Study Correlation

Google Data Analytics Capstone: Cyclistic Case Study Correlation


RECOMMENDATIONS

Since demographics of bike users were anonymized, the recommendations will only evolve in the three variables: kind of bike rented, stations and time.

Create an Exclusive Membership Reward Points System:

  • Friday, Saturday, and Sunday by 3PM is the peak period for casual riders, thus an exclusive membership reward program will offer their leisure rides extra reward points while using bikes on the said peak periods from those popular stations.
  • Casual riders also be offered a discounted rate per minute for the electric bike for the first three months.
  • For members, they will receive reward points for every rider that will purchase an annual membership using their invite code.

Voucher Codes

  • Casual riders may increase their annual membership discount based on their past bike rentals to and from the popular stations.
  • Voucher codes for annual membership can be offered at the start of Spring Season in preparation for the upcoming peak rental season.

Exclusive for Health Enthusiast Members

  • Since casual riders were fond of renting an electric bike, Cyclistic may promote the use of classic bike and entice them by challenge badges that can be shared in social media.
  • Casual riders may have the option to save their cycled routes and to be featured on next popular/new discovered exploration routes in Cyclistic website.
  • Promote an extended free cycling time for members and accompanying health benefits if they cycle on weekends’ off-peak times from 5AM — 9AM.

FOR FURTHER EXPLORATION

  • Check the age brackets and financial capacity of casual riders and then create a focus group to discuss their takes on current membership strategy of Cyclistic.

REFERENCES

Wednesday, February 8, 2023

SPARTA’s SPCapstone001: Data Analyst Capstone Course: CONCEPT PAPER for Covid-19 Business Impact

 TRUSTING IN DIGITAL PIVOT FOR BUSINESS RECOVERY

INTRODUCTION

Background

Doctors are more concerned with COVID-19. Unlike Spanish Flu, there are asymptomatic carriers of the disease. That one detail makes it harder to mitigate the spread of the virus by simply taking temperatures, (Amenabar, 2020). On that note, it is hard to foretell when we can live without the fear of exposing ourselves to COVID-19 virus and its variants. This concept paper builds upon insights from various firms who commenced surveys in the businesses and to strategically focus on the economic problem caused by this pandemic.

Need for this Study

Time is a luxury that all country leaders don’t have. Making the right decisions not only to health factors but also the economic risks will be vital in the recovery of the Philippines and its people. The concept will identify the relationship between demand and supply factor in job security, effective rollout of the vaccine and digital solutions across businesses.

Problem Statement

Does pivoting digitally alone can help to recover business during pandemic?

Objective

The research paper seeks to verify the relationship of economic shock and consumer’s behavior in spending. The second objective is to determine the influence of upskilling and digitalization of firms in gaining competitiveness in this time of pandemic.

LITERATURE REVIEW

COVID-19 and Its Unprecedented Crisis

From 1918’s Spanish Flu pandemic, the American Public Health Association (APHA) stated that mixing of bodies and sharing of breath in crowded rooms, was dangerous (Billings, 1997). The lockdown aims to reduce the transmission of the virus by limiting contact between people. Thus, officials all over the world strictly implement different kinds of confinement and mitigation measures like social distancing. However, this measure implicated a profound impact not only socially but also created an unprecedented economic impact.

GDP plunged to negative 9.5 which is the lowest since 1980 and unemployment rate doubled from 5.1 of 2019 to 10.4%.


Figure 1 — Source: International Monetray Fund, 2021
Figure 1 — Source: International Monetray Fund, 2021

The main contributors to the decline were: Construction: -24.2 percent
Other Services: -38.0 percent
Real Estate and Ownership of Dwellings: -13.2 percent


Figure 2 — Source: Philippine Statistics Authority, 2021
Figure 2 — Source: Philippine Statistics Authority, 2021


Economic Shock of Lockdown

Economy works when money moves and during lockdown people stopped moving and so is the money. Philippine Government borrows, allotted and spend trillions of pesos to mitigate the supply- demand shocks of COVID-19. A supply shock is when public health authorities and employers prevent service workers from doing their jobs and a demand shock, on the other hand, is something that reduces consumers’ ability or willingness to purchases goods and services at given prices (Brinca, Duarte, & e Castro, 2020).

More than Php200-billion were allotted by the government to cover, sustain and fund the declines as seen above in figure 2, procurement, trainings and projects in response of the COVID-19 pandemic under Bayanihan II project.


Figure 3 — Source: Department of Budget and Management, 2021
Figure 3 — Source: Department of Budget and Management, 2021

Vaccine. Job. New Reality.

Vaccine hesitancy is not only an arising problem in Philippines but also declared one of the ten threats to global health by WHO (World Health Organization, 2019). Spreading correct information such as efficacy, side effects and the benefit of the vaccine to face the new normal helped the Philippine Government in doubling the people vaccinated on May 2021.


Figure 4 — Source: World Health Organization, 2021
Figure 4 — Source: World Health Organization, 2021

Successfully educating the citizens that vaccine is not the 100% solution in fight against actively mutating COVID virus and due to uncertainty when this pandemic will end, Philippine consumers became very conservative with their buying habits despite government injected funds for citizens to spend (Figure 2).

Cautious spending resulted a demand and supply shock which causes job loss and insecurity. Based on Inter-Agency Task Force for the Management of Emerging Infectious Diseases, 2020 report, 50% of private companies’ workers experienced decline in income and 80% of which is due to job loss.


Figure 5 — Source: Inter-Agency Task Force for the Management of Emerging Infectious Disease, 2020
Figure 5 — Source: Inter-Agency Task Force for the Management of Emerging Infectious Disease, 2020

Due to left and right retrenchment, Filipinos are looking for ways to acquire extra income and/or serve their clients contactless. And the outcome is digitalization. This is the latest competition now in the upcoming new reality of the world. Consumers are already aware of food delivery services and shopping online since pre-pandemic but how about the telecommuting or working from home?

ASIAN DEVELOPMENT BANK SURVEY, 2020 resulted that 57.3% of micro, 12% of small and 11.1% of medium firms’ workers are not possible to work from home. The same survey shows that financial assistance and tax incentives may help their businesses to adopt the new normal.


Figure 6 — Source: Asian Development Bank, 2020
Figure 6 — Source: Asian Development Bank, 2020


Digitalisation to Solve the Problem

Consumer confidence is expected to remain low even after the ECQ is lifted. Expectation of a worse family income situation is especially pronounced among the low-income group (Inter-Agency Task Force for the Management of Emerging Infectious Diseases, 2020). And that is a significant economic problem.

Digitalisation for businesses will not only attract consumers to buy their goods or services but it will generate a well of data that can help them dive and perform the analytics. Analytics will solve in managing of inventories and will also help to know more about erratic decision changes of Filipinos due to this never-ending battle cause of pandemic.

METHODOLOGY

Quantitative

In cross-sectional design, we will use descriptive and categorical analysis and prescriptive analysis will be utilized to determine the next policies to be developed by the government.

Qualitative

For qualitative data it will undergo transcriptions and reported on themes.

REFERENCES

1. Amenabar, T. (4 September, 2020). The Washington Post. Retrieved from The Washington Post:

https://www.washingtonpost.com/history/2020/09/01/1918-flu-pandemic-end/

2. Billings, M. (June, 1997). The Influenza Pandemic of 1918. Retrieved from Stanford University: https://virus.stanford.edu/uda/fluresponse.html

3. Brinca, B., Duarte, J. B., & e Castro, M. (17 June, 2020). Decomposing demand and supply shocks during COVID-19. Retrieved from VoxEU.org : https://voxeu.org/article/decomposing-demand-and-supply-shocks- during-covid-19

4. Department of Budget and Management. (31 May, 2021). COVID-19 Budget Utilization Reports as of May 31, 2021. Retrieved from Department of Budget and Management: https://www.dbm.gov.ph/index.php/programs-projects/status-of-covid-19-releases#bayanihan-2

5. International Monetary Fund. (April, 2021). International Monetary Fund. Retrieved from World Economic Outlook (April 2021): https://www.imf.org/external/datamapper/datasets

6. Inter-Agency Task Force for the Management of Emerging Infectious Diseases. (2020). WE RECOVER AS ONE. Manila: National Economic and Development Authority (NEDA).

7. World Health Organization. (2019). Top Ten Threats to Global Health in 2019. Retrieved from World Health Organization: https://www.who.int/news-room/spotlight/ten-threats-to-global-health-in-2019

8. World Health Organization. (2021). COVID-19 Explorer. Retrieved from World Health Organization: https://worldhealthorg.shinyapps.io/covid/

9. ASIAN DEVELOPMENT BANK. (17 July, 2020). ADB Philippine Enterprise Survey on COVID-19 Impact. Retrieved from ADB Data Library: https://data.adb.org/dataset/adb-philippine-enterprise-survey-covid-19-impact

10. Philippine Statistics Authority. (11 May, 2021). National Accounts. Retrieved from Republic of the Philippines — Philippine Statistics Authority: https://psa.gov.ph/national-accounts/base-2018/estimates



Get this gadget at facebook popup like box