Code Source that I looked at and learned from: link
As usual, import the packages we want
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets
import bqplot
import matplotlib.colors as mpl_colors
import seaborn as sns
import datetime
The name of the dataset: Yellow Taxi trip data 2021-01 in NYC
We can obtain the dataset at the TLC official site. https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
The link towards the dataset: https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv
Preferably, download the dataset first, as it is fairly large (126 MB).
However, as I had exceeded the Git LFS quota on my plan, I swapped the local import for the URL.
No license is identified, so I am assuming we can play around with the data. I did find a user guide for the dataset, but it does not say much about what we may do with the data. Link to the user guide: https://www1.nyc.gov/assets/tlc/downloads/pdf/trip_record_user_guide.pdf
The data dictionary of the dataset: https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf
The following columns are the ones I will use in final project part 3. The explanations for the rest of the columns can be found in the data dictionary noted above.
tpep_pickup_datetime - The date and time when the meter was engaged.
Passenger_count - The number of passengers in the vehicle. This is a driver-entered value.
Trip_distance - The elapsed trip distance in miles reported by the taximeter.
Total_amount - The total amount charged to passengers. Does not include cash tips.
There was a warning indicating dtype conflicts in some of the columns of the dataset. I will deal with the warning at a later stage; for now I silence it by setting low_memory=False.
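As an aside, another way to avoid the warning is to pin the suspect columns to explicit dtypes at read time. The column choices below are an assumption on my part, not taken from the actual warning, and the read call is commented out so the file is not downloaded twice.
# Hedged sketch: pass explicit dtypes for the columns suspected of mixed types.
dtype_overrides = {'store_and_fwd_flag': str,   # assumed mixed-type column
                   'RatecodeID': float}         # assumed mixed-type column
# data_2021 = pd.read_csv("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv",
#                         dtype=dtype_overrides)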
We will first downsample the dataset, otherwise MyBinder cannot handle it. I take the number of samples to be 1% of the original dataset (using more makes MyBinder crash). Feel free to adjust the sample size to see different results.
data_2021 = pd.read_csv("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv",
low_memory=False)
np.random.seed(2022)
nsamples = len(data_2021) // 100
downSampleMask = np.random.choice(range(len(data_2021)-1),
nsamples, replace=False)
data_2021 = data_2021.loc[downSampleMask]
data_2021.head()
VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
411971 | 1.0 | 2021-01-12 06:23:14 | 2021-01-12 06:34:16 | 1.0 | 2.30 | 1.0 | N | 226 | 145 | 2.0 | 10.0 | 0.0 | 0.5 | 0.00 | 0.0 | 0.3 | 10.80 | 0.0 |
406278 | 1.0 | 2021-01-11 19:42:05 | 2021-01-11 19:48:14 | 1.0 | 1.90 | 1.0 | N | 237 | 137 | 1.0 | 7.5 | 3.5 | 0.5 | 2.35 | 0.0 | 0.3 | 14.15 | 2.5 |
989763 | 2.0 | 2021-01-25 14:59:49 | 2021-01-25 15:05:53 | 1.0 | 1.20 | 1.0 | N | 79 | 137 | 1.0 | 6.5 | 0.0 | 0.5 | 1.96 | 0.0 | 0.3 | 11.76 | 2.5 |
580385 | 2.0 | 2021-01-15 15:41:45 | 2021-01-15 15:44:18 | 1.0 | 0.43 | 1.0 | N | 237 | 236 | 2.0 | 4.0 | 0.0 | 0.5 | 0.00 | 0.0 | 0.3 | 7.30 | 2.5 |
542365 | 2.0 | 2021-01-14 18:47:27 | 2021-01-14 18:54:50 | 1.0 | 1.27 | 1.0 | N | 234 | 211 | 1.0 | 7.0 | 1.0 | 0.5 | 1.00 | 0.0 | 0.3 | 12.30 | 2.5 |
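As an aside, the same 1% downsample could also have been written with pandas' built-in sampler on the freshly loaded frame; a sketch, kept commented out so it does not resample the already-reduced data (the rows drawn would differ from the np.random.choice draw above):
# Alternative to the downsampling cell above, applied to the freshly loaded frame:
# data_2021 = data_2021.sample(frac=0.01, random_state=2022)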
The original dataset has 1,369,765 rows in total, indeed a large one; after downsampling we are left with 13,697.
len(data_2021)
13697
How many columns:
len(data_2021.columns)
18
Take a look at the columns we care about and their datatypes:
Create a set of the columns to inspect (a set gives faster membership checks):
columnsCared = {'tpep_pickup_datetime', 'passenger_count',
'trip_distance','total_amount'}
for c in columnsCared:
    print(c, data_2021[c].dtype)
tpep_pickup_datetime object
trip_distance float64
total_amount float64
passenger_count float64
From the above, we can tell that the datetime column needs some cleaning. We will create a new column called "pickup_date" that stores the values of "tpep_pickup_datetime" converted into date objects.
data_2021['pickup_date'] = pd.to_datetime(data_2021['tpep_pickup_datetime']).dt.date
data_2021['pickup_date'] = pd.to_datetime(data_2021['pickup_date'])
data_2021['pickup_date'].head()
411971   2021-01-12
406278   2021-01-11
989763   2021-01-25
580385   2021-01-15
542365   2021-01-14
Name: pickup_date, dtype: datetime64[ns]
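Equivalently, the two-step conversion above collapses into one call with .dt.normalize(), which floors each timestamp to midnight while keeping the datetime64[ns] dtype; a small sketch:
# Equivalent one-liner for building the pickup_date column.
data_2021['pickup_date'] = pd.to_datetime(data_2021['tpep_pickup_datetime']).dt.normalize()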
Good, now we have changed the dates to the correct datatype. Let's see whether there are any NaN or otherwise strange entries in the column.
pd.to_datetime(data_2021['pickup_date']).dt.date.unique()
array([datetime.date(2021, 1, 12), datetime.date(2021, 1, 11), datetime.date(2021, 1, 25), datetime.date(2021, 1, 15), datetime.date(2021, 1, 14), datetime.date(2021, 1, 31), datetime.date(2021, 1, 27), datetime.date(2021, 1, 19), datetime.date(2021, 1, 7), datetime.date(2021, 1, 6), datetime.date(2021, 1, 17), datetime.date(2021, 1, 21), datetime.date(2021, 1, 10), datetime.date(2021, 1, 16), datetime.date(2021, 1, 9), datetime.date(2021, 1, 2), datetime.date(2021, 1, 29), datetime.date(2021, 1, 18), datetime.date(2021, 1, 26), datetime.date(2021, 1, 4), datetime.date(2021, 1, 23), datetime.date(2021, 1, 24), datetime.date(2021, 1, 13), datetime.date(2021, 1, 22), datetime.date(2021, 1, 8), datetime.date(2021, 1, 20), datetime.date(2021, 1, 5), datetime.date(2021, 1, 30), datetime.date(2021, 1, 28), datetime.date(2021, 1, 1), datetime.date(2021, 1, 3), datetime.date(2020, 12, 31)], dtype=object)
Notice that there are erroneous entries in the dataset: in the full data some pickups are dated 2008-12-31 or 2009-12-31, and in this sample one is dated 2020-12-31. We should remove any entry that falls outside January 2021.
data_2021[(data_2021['pickup_date'] <= "2020-12-31")
| (data_2021['pickup_date'] > "2021-01-31")].head()
VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | pickup_date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3783 | 2.0 | 2020-12-31 18:11:53 | 2020-12-31 18:17:04 | 1.0 | 1.74 | 1.0 | N | 48 | 239 | 1.0 | 7.0 | 0.5 | 0.5 | 3.24 | 0.0 | 0.3 | 14.04 | 2.5 | 2020-12-31 |
Create a dropIndex list to remove all the erroneous data:
dropIndex = list(data_2021[(data_2021['pickup_date'] <= "2020-12-31")
| (data_2021['pickup_date'] > "2021-01-31")].index)
Take a look inside the dropIndex
dropIndex[:5], len(dropIndex)
([3783], 1)
Drop the error data
data_2021.drop(dropIndex, inplace=True)
Now the data should look good
pd.to_datetime(data_2021['tpep_pickup_datetime']).dt.date.unique()
array([datetime.date(2021, 1, 12), datetime.date(2021, 1, 11), datetime.date(2021, 1, 25), datetime.date(2021, 1, 15), datetime.date(2021, 1, 14), datetime.date(2021, 1, 31), datetime.date(2021, 1, 27), datetime.date(2021, 1, 19), datetime.date(2021, 1, 7), datetime.date(2021, 1, 6), datetime.date(2021, 1, 17), datetime.date(2021, 1, 21), datetime.date(2021, 1, 10), datetime.date(2021, 1, 16), datetime.date(2021, 1, 9), datetime.date(2021, 1, 2), datetime.date(2021, 1, 29), datetime.date(2021, 1, 18), datetime.date(2021, 1, 26), datetime.date(2021, 1, 4), datetime.date(2021, 1, 23), datetime.date(2021, 1, 24), datetime.date(2021, 1, 13), datetime.date(2021, 1, 22), datetime.date(2021, 1, 8), datetime.date(2021, 1, 20), datetime.date(2021, 1, 5), datetime.date(2021, 1, 30), datetime.date(2021, 1, 28), datetime.date(2021, 1, 1), datetime.date(2021, 1, 3)], dtype=object)
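An equivalent way to do this cleanup is to keep only the January 2021 rows with a boolean mask instead of collecting indices and dropping them; a sketch:
# Keep-only-January mask; equivalent to the dropIndex approach above.
jan_mask = ((data_2021['pickup_date'] >= "2021-01-01") &
            (data_2021['pickup_date'] <= "2021-01-31"))
data_2021 = data_2021[jan_mask]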
Create a chart showing the trip count in NYC in January 2021:
fig = plt.figure(figsize=(15, 10))
record_2021_df = data_2021.value_counts('pickup_date').sort_index()
bar_chart = plt.bar(x=record_2021_df.index,
                    height=record_2021_df.values,
                    color="blue")
plt.xlabel("Date")
plt.ylabel("Trip Count")
plt.title("Trip Count in Jan. 2021")
plt.show()
record_2021_df.min(), record_2021_df.max()
(252, 621)
The range of daily counts is fairly wide (252 to 621), so I take the base-10 log of the values. This way the chart reads better while the overall scale is preserved.
record_2021_df = np.log10(record_2021_df)
record_2021_df.head()
pickup_date
2021-01-01    2.413300
2021-01-02    2.513218
2021-01-03    2.401401
2021-01-04    2.630428
2021-01-05    2.700704
dtype: float64
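An alternative to transforming the counts is to leave them as they are and put the y-axis on a log scale, which keeps the tick labels in the original units; a sketch:
# Alternative: plot raw counts on a log-scaled axis instead of log-transforming.
counts = data_2021.value_counts('pickup_date').sort_index()
fig, ax = plt.subplots(figsize=(15, 10))
ax.bar(counts.index, counts.values, color="blue")
ax.set_yscale('log')                 # log axis; tick labels stay in trips per day
ax.set_xlabel("Date")
ax.set_ylabel("Trip Count (log scale)")
plt.show()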
Create a new data frame that contains only the day of the month and the log count:
data_grouped = pd.DataFrame()
data_grouped['date'] = record_2021_df.index.day
# data_grouped['record_2019'] = record_2019_df.values
# data_grouped['record_2020'] = record_2020_df.values
data_grouped['record_2021'] = record_2021_df.values
data_grouped.head()
date | record_2021 | |
---|---|---|
0 | 1 | 2.413300 |
1 | 2 | 2.513218 |
2 | 3 | 2.401401 |
3 | 4 | 2.630428 |
4 | 5 | 2.700704 |
Create an average record count variable for the year.
The number 13,587 comes from the source here, which gives the number of licensed yellow cabs in NYC. (There were 13,587 licensed yellow cabs, compared to 35,000 licensed High Volume vehicles, at that time.)
avg_record_2021_df = data_2021.value_counts('pickup_date').sort_index() / 13587
avg_record_2021_df.head()
pickup_date
2021-01-01    0.019062
2021-01-02    0.023994
2021-01-03    0.018547
2021-01-04    0.031427
2021-01-05    0.036947
dtype: float64
avg_record_2021 = avg_record_2021_df.sum() / len(avg_record_2021_df)
avg_record_2021
0.03251685078478717
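Keep in mind that this average is computed on the 1% sample, so it understates the full-data figure by roughly a factor of 100; a rough correction, using the original row count of 1,369,765 quoted earlier:
# Back-of-the-envelope correction for the downsampling.
sample_fraction = len(data_2021) / 1369765   # we kept roughly 1% of the original rows
avg_record_2021 / sample_fraction            # roughly 3 trips per cab per day on the full data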
Create a daily total payment variable for the year:
total_amount_2021 = data_2021.groupby('pickup_date')['total_amount'].sum()
total_amount_2021 = np.log10(total_amount_2021)
total_amount_2021.head()
pickup_date
2021-01-01    3.685942
2021-01-02    3.791477
2021-01-03    3.692340
2021-01-04    3.899154
2021-01-05    3.968957
Name: total_amount, dtype: float64
Create a passenger count variable for the year.
First, we have to remove some faulty data in the passenger_count column: it contains NaN values and 0 values, and these entries need to be removed.
Check unique values to find the possible errors
data_2021['passenger_count'].unique()
array([ 1., nan, 2., 5., 0., 3., 6., 4.])
Show some of the rows with a passenger count of 0:
data_2021[data_2021['passenger_count'] == 0].head()
VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | pickup_date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
931595 | 1.0 | 2021-01-23 20:01:34 | 2021-01-23 20:16:35 | 0.0 | 4.1 | 1.0 | N | 263 | 48 | 2.0 | 14.50 | 2.5 | 0.5 | 0.00 | 0.0 | 0.3 | 17.80 | 2.5 | 2021-01-23 |
1001934 | 1.0 | 2021-01-25 18:15:04 | 2021-01-25 18:15:46 | 0.0 | 0.2 | 1.0 | N | 264 | 264 | 1.0 | 2.50 | 1.0 | 0.5 | 5.00 | 0.0 | 0.3 | 9.30 | 0.0 | 2021-01-25 |
19503 | 2.0 | 2021-01-01 19:23:41 | 2021-01-01 19:23:48 | 0.0 | 0.0 | 5.0 | N | 116 | 116 | 1.0 | 31.62 | 0.0 | 0.5 | 0.01 | 0.0 | 0.3 | 32.43 | 0.0 | 2021-01-01 |
361608 | 1.0 | 2021-01-10 18:13:12 | 2021-01-10 18:21:14 | 0.0 | 1.2 | 1.0 | N | 151 | 41 | 1.0 | 7.50 | 0.0 | 0.5 | 1.00 | 0.0 | 0.3 | 9.30 | 0.0 | 2021-01-10 |
635146 | 1.0 | 2021-01-16 21:39:30 | 2021-01-16 22:04:07 | 0.0 | 5.1 | 1.0 | N | 230 | 42 | 1.0 | 20.00 | 3.0 | 0.5 | 4.75 | 0.0 | 0.3 | 28.55 | 2.5 | 2021-01-16 |
Make a dropIndex list to drop all the erroneous data:
dropIndex = list(data_2021[data_2021['passenger_count'] == 0].index)
dropIndex[:5], len(dropIndex)
([931595, 1001934, 19503, 361608, 635146], 285)
Drop the indices
data_2021.drop(dropIndex, inplace=True)
Drop the NaN values
data_2021 = data_2021[~pd.isnull(data_2021['passenger_count'])]
Check the unique data inside to make sure we cleaned everything
data_2021['passenger_count'].unique()
array([1., 2., 5., 3., 6., 4.])
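The two cleanup steps above (dropping the zero counts, then the NaNs) could also be written as a single mask, since NaN fails any comparison; a sketch:
# One-mask equivalent: NaN > 0 evaluates to False, so this single filter
# drops both the zero-passenger rows and the missing values at once.
data_2021 = data_2021[data_2021['passenger_count'] > 0]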
pass_count_2021 = data_2021.groupby('pickup_date')['passenger_count'].sum()
pass_count_2021.head()
pickup_date
2021-01-01    379.0
2021-01-02    414.0
2021-01-03    361.0
2021-01-04    507.0
2021-01-05    680.0
Name: passenger_count, dtype: float64
myIndividuaLSelectedLabel = ipywidgets.Label()
ntrip_distance = 20
ntotal_payment = 20
Itrip_bins = np.linspace(1.26125, 48.80875, ntrip_distance+1)
Ipay_bins = np.linspace(6.9375, 259.1625, ntotal_payment+1)
Ihist2d, Itrip_edges, Ipay_edges = np.histogram2d(data_2021['trip_distance'],
data_2021['total_amount'],
weights=data_2021['passenger_count'],
bins = [Itrip_bins, Ipay_bins])
Itrip_centers = (Itrip_edges[:-1] + Itrip_edges[1:]) / 2
Ipay_centers = (Ipay_edges[:-1] + Ipay_edges[1:]) / 2
Itripmin = Itrip_centers.min()
Itripmax = Itrip_centers.max()
Ipaymin = Ipay_centers.min()
Ipaymax = Ipay_centers.max()
def individual_generate_histogram_from_trip_pay(data, ntrip=20, npay=20,
                                                tripmin=Itripmin, tripmax=Itripmax,
                                                paymin=Ipaymin, paymax=Ipaymax,
                                                takeLog=True):
    trip_bins = np.linspace(tripmin, tripmax, ntrip+1)
    pay_bins = np.linspace(paymin, paymax, npay+1)
    hist2d, trip_edges, pay_edges = np.histogram2d(data['trip_distance'],
                                                   data['total_amount'],
                                                   weights=data['passenger_count'],
                                                   bins=[trip_bins, pay_bins])
    hist2d = hist2d.T
    if takeLog:
        hist2d[hist2d <= 0] = np.nan  # set zeros to NaNs
        # then take log
        hist2d = np.log10(hist2d)
    trip_centers = (trip_edges[:-1] + trip_edges[1:]) / 2
    pay_centers = (pay_edges[:-1] + pay_edges[1:]) / 2
    return hist2d, trip_centers, pay_centers, trip_edges, pay_edges
Ihist2d, Itrip_centers, Ipay_centers, Itrip_edges, Ipay_edges = individual_generate_histogram_from_trip_pay(data_2021)
# Scale
col_sc = bqplot.ColorScale(scheme="RdPu",
min=np.nanmin(Ihist2d),
max=np.nanmax(Ihist2d))
x_sc = bqplot.LinearScale()
y_sc = bqplot.LinearScale()
# Axis
c_ax = bqplot.ColorAxis(scale = col_sc,
orientation='vertical',
side='right')
x_ax = bqplot.Axis(scale = x_sc, label='Trip Distance')
y_ax = bqplot.Axis(scale = y_sc, label='Total Payment',
orientation='vertical', label_offset="45px")
# Marks
Iheat_map = bqplot.GridHeatMap(color = Ihist2d,
row = Ipay_centers,
column = Itrip_centers,
scales = {'color':col_sc,
'row': y_sc,
'column':x_sc},
interactions = {'click':'select'},
anchor_style = {'fill':'blue'},
selected_style = {'opacity':1.00},
unselected_style = {'opacity':1.00})
# Scale
x_scl = bqplot.DateScale()
y_scl = bqplot.LogScale()
# Axis
ax_xcl = bqplot.Axis(label='Date', scale = x_scl)
ax_ycl = bqplot.Axis(label = 'Passenger Count', scale = y_scl,
orientation = 'vertical', side = 'left')
# Marks
i,j = 19, 0
Itrips = [Itrip_edges[j], Itrip_edges[j+1]]
Ipays = [Ipay_edges[i], Ipay_edges[i+1]]
# region mask
region_mask = ((data_2021['total_amount'] >= Ipays[0]) & (data_2021['total_amount']<=Ipays[1]) &\
(data_2021['trip_distance'] >= Itrips[0]) & (data_2021['trip_distance']<=Itrips[1]))
# Fig
Ipass_scatt = bqplot.Scatter(x=data_2021['pickup_date'][region_mask],
y=data_2021['passenger_count'][region_mask],
scales = {'x':x_scl, 'y':y_scl})
# data_2021['pickup_date'][region_mask]
# data_2021['passenger_count'][region_mask]
def get_individual_data_value(change):
    if len(change['owner'].selected) == 1:
        i, j = change['owner'].selected[0]
        v = Ihist2d[i, j]
        myIndividuaLSelectedLabel.value = "Passenger Count in Log = " + str(v)
        Itrips = [Itrip_edges[j], Itrip_edges[j+1]]
        Ipays = [Ipay_edges[i], Ipay_edges[i+1]]
        region_mask = ((data_2021['total_amount'] >= Ipays[0]) & (data_2021['total_amount'] <= Ipays[1]) &
                       (data_2021['trip_distance'] >= Itrips[0]) & (data_2021['trip_distance'] <= Itrips[1]))
        Ipass_scatt.x = data_2021['pickup_date'][region_mask]
        Ipass_scatt.y = data_2021['passenger_count'][region_mask]
Iheat_map.observe(get_individual_data_value, 'selected')
def get_test(change):
    if len(change['owner'].selected) == 1:
        i, j = change['owner'].selected[0]
        print(i, j)
        v = Ihist2d[i, j]
        print(v, v.dtype)
# Iheat_map.observe(get_test, 'selected')
fig_Iheatmap = bqplot.Figure(marks = [Iheat_map], axes = [c_ax, y_ax, x_ax])
fig_Ipass = bqplot.Figure(marks = [Ipass_scatt], axes = [ax_xcl, ax_ycl])
fig_Iheatmap.layout.min_width='500px'
fig_Ipass.layout.min_width='500px'
myDashboard = ipywidgets.VBox([myIndividuaLSelectedLabel, ipywidgets.HBox([fig_Iheatmap,fig_Ipass])])
# myDashboard
x_column = ['payment_type']
y_column = ['total_amount', 'tolls_amount']
The code for this write-up was either taken from the web or from this file. Citations are included wherever I used something from the web.
This post takes a closer look at the yellow taxi industry in New York City by examining total trips, total payments, and passenger counts in January 2021.
As we know, since the COVID-19 outbreak in late 2019, travel, both domestic and international, has taken a huge hit. As you can see below, one clear sign of this is the record count for January in 2019, 2020, and 2021. In January 2019 there were 7,667,255 recorded taxi trips; in January 2020 there were 6,404,796; in January 2021 there were only 1,369,741. That is a significant drop in recorded trips.
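To put the drop in perspective, a quick calculation of the year-over-year changes using the record counts quoted above:
# Year-over-year change in recorded January trips (counts quoted above).
records = {2019: 7667255, 2020: 6404796, 2021: 1369741}
drop_2020 = 1 - records[2020] / records[2019]   # ~16% fewer trips than Jan 2019
drop_2021 = 1 - records[2021] / records[2019]   # ~82% fewer trips than Jan 2019
print(f"2020 vs 2019: {drop_2020:.0%} drop, 2021 vs 2019: {drop_2021:.0%} drop")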
# I looked at the tutorial for reference and hints of this graph
# https://www.tutorialspoint.com/python-matplotlib-multiple-bars
plt.rcParams["figure.figsize"] = [15, 10]
plt.rcParams["figure.autolayout"] = True
plt.rc('axes', titlesize=30)
plt.rc('axes', labelsize=25)
plt.rc('xtick', labelsize=18)
plt.rc('ytick', labelsize=18)
labels = data_grouped['date']
# record_2019 = record_2019_df.values
# record_2020 = record_2020_df.values
record_2021 = record_2021_df.values
x = np.arange(len(labels))
width = 0.7
fig, ax = plt.subplots()
# rects_2019 = ax.bar(x - width, record_2019, width, label = "2019")
# rects_2020 = ax.bar(x, record_2020, width, label = "2020")
rects_2021 = ax.bar(x, record_2021, width, label = "2021")
ax.set_xlabel("Date in January")
ax.set_ylabel('Taxi Drives in Log')
ax.set_title('Taxi Drives in January, 2021')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()
def autolabel(rects):
    for rect in rects:
        height = round(rect.get_height(), 2)
        ax.annotate('{}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')
autolabel(rects_2021)
# trip_plt = plt
plt.show()
From the graph we can tell that the taxi trip count changes quite a bit over the month. The pattern looks almost like a sine wave: some days have high trip counts and some have lower counts, and the month splits into roughly 8 such sections.
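One way to check whether this up-and-down pattern really follows the week is to group the daily counts by weekday; a small sketch (0 is Monday, 6 is Sunday):
# Average daily trip count by day of week in the sample.
daily_counts = data_2021.value_counts('pickup_date').sort_index()
daily_counts.groupby(daily_counts.index.dayofweek).mean()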
The TLC regulates medallion, street hail livery, commuter van, paratransit, and for-hire vehicles in New York City. Yellow taxis are the traditional cabs we picture. Interestingly enough, under New York State law the number of yellow taxicabs on the road is capped at 13,587. Given this large number of taxi trips per day, it is natural to ask how many trips the average cab makes per day. We can do a quick calculation for 2021.
avg_record_2021
0.03251685078478717
The average number of trips per cab per day in 2021 was fairly low (recall that this figure is computed on the 1% sample).
Now that we have explored trips, payments, passengers, and the change in the total number of recorded trips for 2021, we can also analyze the total revenue of yellow taxi trips in 2021 by building a bar graph.
plt.rcParams["figure.figsize"] = [15, 10]
plt.rcParams["figure.autolayout"] = True
plt.rc('axes', titlesize=30)
plt.rc('axes', labelsize=25)
plt.rc('xtick', labelsize=18)
plt.rc('ytick', labelsize=18)
labels = data_grouped['date']
# record_2019 = total_amount_2019.values
# record_2020 = total_amount_2020.values
record_2021 = total_amount_2021.values
x = np.arange(len(labels))
width = 0.7
fig, ax = plt.subplots()
# rects_2019 = ax.bar(x - width, record_2019, width, label = "2019")
# rects_2020 = ax.bar(x, record_2020, width, label = "2020")
rects_2021 = ax.bar(x, record_2021, width, label = "2021")
ax.set_xlabel("Date in January")
ax.set_ylabel('Total Payment')
ax.set_title('Total Payment in January, 2021')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()
def autolabel(rects):
    for rect in rects:
        height = round(rect.get_height(), 2)
        ax.annotate('{}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')
autolabel(rects_2021)
# pay_plt = plt
plt.show()
The trip records across the month show a significant change in the total payment and the average number of trips in January. However, we have not yet inspected the total size of the market in January or the passenger count for each trip. A bar graph lets us look at this; I will use the 2021 data as the example.
# I looked at the tutorial for reference and hints of this graph
# https://www.tutorialspoint.com/python-matplotlib-multiple-bars
plt.rcParams["figure.figsize"] = [15, 10]
plt.rcParams["figure.autolayout"] = True
plt.rc('axes', titlesize=30)
plt.rc('axes', labelsize=25)
plt.rc('xtick', labelsize=18)
plt.rc('ytick', labelsize=18)
labels = data_grouped['date']
# record_2019 = record_2019_df.values
# record_2020 = record_2020_df.values
record_2021 = np.log10(pass_count_2021)
x = np.arange(len(labels))
width = 0.7
fig, ax = plt.subplots()
# rects_2019 = ax.bar(x - width, record_2019, width, label = "2019")
# rects_2020 = ax.bar(x, record_2020, width, label = "2020")
rects_2021 = ax.bar(x, record_2021, width, label = "2021")
ax.set_xlabel("Date in January")
ax.set_ylabel('Passenger Count in Log')
ax.set_title('Passenger Count in January, 2021')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()
def autolabel(rects):
    for rect in rects:
        height = round(rect.get_height(), 2)
        ax.annotate('{}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')
autolabel(rects_2021)
# pass_plt = plt
plt.show()
From the three bar graphs above, we can tell that the relationship between trip distance and total payment is roughly linear. However, the passenger count may vary for certain distances and payments.
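To back up the linearity claim, we could compute the correlation between trip distance and total payment on this sample; a value close to 1 would support it. A quick sketch:
# Pearson correlation between trip distance and total payment.
data_2021['trip_distance'].corr(data_2021['total_amount'])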
Below is an interactive heatmap that shows the relationship between the variables that we indicated above.
The trip records across the three years show a significant change in the total payment and the average number of trips in January. However, we have not yet inspected how trip distance, total payment, and passenger count relate for individual trips. An interactive dashboard lets us explore this; again, I use the 2021 data as the example.
Interestingly, the dashboard does not render on MyBinder; however, it does work when you download the notebook and run it locally.
myDashboard
VBox(children=(Label(value=''), HBox(children=(Figure(axes=[ColorAxis(orientation='vertical', scale=ColorScale…
I would like to make an additional graph showing the relationship between payment_type and the total amount, with the color scale showing the tip for each record. The interactive plot is simple: we make ipywidgets dropdowns to choose from.
@ipywidgets.interact(style=plt.style.available, colormap_name=plt.colormaps(), x=x_column, y=y_column)
def payment_scatter(style, colormap_name, x, y):
    with plt.style.context(style):
        colorScale = data_2021['tip_amount']
        plt.scatter(data_2021[x], data_2021[y], c=colorScale, cmap=colormap_name)
        plt.xlabel(x)
        plt.ylabel(y)
        plt.title(x + " vs " + y)
        plt.show()
interactive(children=(Dropdown(description='style', options=('Solarize_Light2', '_classic_test_patch', 'bmh', …