Data Exploration with Python and Jupyter - part 3¶
Basic usage of the Pandas library to download a dataset, explore its contents, clean up missing or invalid data, filter the data according to different criteria, and plot visualizations of the data.
- Part 1: Python and Jupyter
- Part 2: Pandas with toy data
- Part 3: Pandas with real data
Let's download some real data¶
For some reason, the London Fire Brigade provides a public spreadsheet of all animal rescue incidents since 2009:
https://data.london.gov.uk/dataset/animal-rescue-incidents-attended-by-lfb
They provide a link to the dataset in Excel format
In [1]:
# import the Pandas library & matplotlib for plotting
import pandas as pd
import matplotlib.pyplot as plt
In [2]:
# download an excel spreadsheet with some data and convert it to a DataFrame
url = "https://data.london.gov.uk/download/animal-rescue-incidents-attended-by-lfb/01007433-55c2-4b8a-b799-626d9e3bc284/Animal%20Rescue%20incidents%20attended%20by%20LFB%20from%20Jan%202009.csv.xlsx"
df = pd.read_excel(url)
Display the DataFrame¶
In [3]:
df
Out[3]:
IncidentNumber | DateTimeOfCall | CalYear | FinYear | TypeOfIncident | PumpCount | PumpHoursTotal | HourlyNotionalCost(£) | IncidentNotionalCost(£) | FinalDescription | ... | UPRN | Street | USRN | PostcodeDistrict | Easting_m | Northing_m | Easting_rounded | Northing_rounded | Latitude | Longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 139091 | 2009-01-01 03:01:00 | 2009 | 2008/09 | Special Service | 1.0 | 2.0 | 255 | 510.0 | Redacted | ... | NaN | Waddington Way | 20500146.0 | SE19 | NaN | NaN | 532350 | 170050 | NaN | NaN |
1 | 275091 | 2009-01-01 08:51:00 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | Redacted | ... | NaN | Grasmere Road | NaN | SE25 | 534785.0 | 167546.0 | 534750 | 167550 | 51.390954 | -0.064167 |
2 | 2075091 | 2009-01-04 10:07:00 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | Redacted | ... | NaN | Mill Lane | NaN | SM5 | 528041.0 | 164923.0 | 528050 | 164950 | 51.368941 | -0.161985 |
3 | 2872091 | 2009-01-05 12:27:00 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | Redacted | ... | 1.000215e+11 | Park Lane | 21401484.0 | UB9 | 504689.0 | 190685.0 | 504650 | 190650 | 51.605283 | -0.489684 |
4 | 3553091 | 2009-01-06 15:23:00 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | Redacted | ... | NaN | Swindon Lane | 21300122.0 | RM3 | NaN | NaN | 554650 | 192350 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
10698 | 066639-27042024 | 2024-04-27 18:03:00 | 2024 | 2024/25 | Special Service | 1.0 | 1.0 | 430 | 430.0 | Redacted | ... | 1.009334e+10 | CATHEDRAL PASSAGE | 22503387.0 | SE1 | 532713.0 | 180259.0 | 532750 | 180250 | 51.505695 | -0.089152 |
10699 | 066731-27042024 | 2024-04-27 20:41:00 | 2024 | 2024/25 | Special Service | 1.0 | 1.0 | 430 | 430.0 | CAT STUCK ON ROOF | ... | NaN | SANDRINGHAM AVENUE | 22105748.0 | SW20 | NaN | NaN | 524350 | 169550 | NaN | NaN |
10700 | 066825-27042024 | 2024-04-27 23:27:00 | 2024 | 2024/25 | Special Service | 1.0 | 1.0 | 430 | 430.0 | Redacted | ... | NaN | RAINHILL WAY | 22700989.0 | E3 | NaN | NaN | 537550 | 182550 | NaN | NaN |
10701 | 067089-28042024 | 2024-04-28 14:21:00 | 2024 | 2024/25 | Special Service | 1.0 | 1.0 | 430 | 430.0 | KITTEN TRAPPED INSIDE ENGINE COMPARTMENT OF CA... | ... | 1.200377e+07 | WESTERN AVENUE | 20602931.0 | UB6 | 516701.0 | 182992.0 | 516750 | 182950 | 51.533783 | -0.318859 |
10702 | 067673-29042024 | 2024-04-29 16:20:00 | 2024 | 2024/25 | Special Service | 1.0 | 1.0 | 430 | 430.0 | CAT FALLEN INTO HOLE | ... | NaN | GREENHILL PARK | 20201414.0 | NW10 | NaN | NaN | 521250 | 183650 | NaN | NaN |
10703 rows × 31 columns
Column data types¶
In [4]:
df.dtypes
Out[4]:
IncidentNumber                      object
DateTimeOfCall              datetime64[ns]
CalYear                              int64
FinYear                             object
TypeOfIncident                      object
PumpCount                          float64
PumpHoursTotal                     float64
HourlyNotionalCost(£)                int64
IncidentNotionalCost(£)            float64
FinalDescription                    object
AnimalGroupParent                   object
OriginofCall                        object
PropertyType                        object
PropertyCategory                    object
SpecialServiceTypeCategory          object
SpecialServiceType                  object
WardCode                            object
Ward                                object
BoroughCode                         object
Borough                             object
StnGroundName                       object
UPRN                               float64
Street                              object
USRN                               float64
PostcodeDistrict                    object
Easting_m                          float64
Northing_m                         float64
Easting_rounded                      int64
Northing_rounded                     int64
Latitude                           float64
Longitude                          float64
dtype: object
DateTimeOfCall¶
In [5]:
df["DateTimeOfCall"].head()
Out[5]:
0   2009-01-01 03:01:00
1   2009-01-01 08:51:00
2   2009-01-04 10:07:00
3   2009-01-05 12:27:00
4   2009-01-06 15:23:00
Name: DateTimeOfCall, dtype: datetime64[ns]
In [6]:
# this is already a datetime object, which is great
# a quick sanity check to see if it looks correct:
pd.to_datetime(df["DateTimeOfCall"]).plot()
# should be a single monotonically increasing line: looks good!
Out[6]:
<Axes: >
Use datetime as the index¶
In [7]:
df.set_index("DateTimeOfCall", inplace=True)
In [8]:
df
Out[8]:
IncidentNumber | CalYear | FinYear | TypeOfIncident | PumpCount | PumpHoursTotal | HourlyNotionalCost(£) | IncidentNotionalCost(£) | FinalDescription | AnimalGroupParent | ... | UPRN | Street | USRN | PostcodeDistrict | Easting_m | Northing_m | Easting_rounded | Northing_rounded | Latitude | Longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DateTimeOfCall | |||||||||||||||||||||
2009-01-01 03:01:00 | 139091 | 2009 | 2008/09 | Special Service | 1.0 | 2.0 | 255 | 510.0 | Redacted | Dog | ... | NaN | Waddington Way | 20500146.0 | SE19 | NaN | NaN | 532350 | 170050 | NaN | NaN |
2009-01-01 08:51:00 | 275091 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | Redacted | Fox | ... | NaN | Grasmere Road | NaN | SE25 | 534785.0 | 167546.0 | 534750 | 167550 | 51.390954 | -0.064167 |
2009-01-04 10:07:00 | 2075091 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | Redacted | Dog | ... | NaN | Mill Lane | NaN | SM5 | 528041.0 | 164923.0 | 528050 | 164950 | 51.368941 | -0.161985 |
2009-01-05 12:27:00 | 2872091 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | Redacted | Horse | ... | 1.000215e+11 | Park Lane | 21401484.0 | UB9 | 504689.0 | 190685.0 | 504650 | 190650 | 51.605283 | -0.489684 |
2009-01-06 15:23:00 | 3553091 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | Redacted | Rabbit | ... | NaN | Swindon Lane | 21300122.0 | RM3 | NaN | NaN | 554650 | 192350 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2024-04-27 18:03:00 | 066639-27042024 | 2024 | 2024/25 | Special Service | 1.0 | 1.0 | 430 | 430.0 | Redacted | Bird | ... | 1.009334e+10 | CATHEDRAL PASSAGE | 22503387.0 | SE1 | 532713.0 | 180259.0 | 532750 | 180250 | 51.505695 | -0.089152 |
2024-04-27 20:41:00 | 066731-27042024 | 2024 | 2024/25 | Special Service | 1.0 | 1.0 | 430 | 430.0 | CAT STUCK ON ROOF | Cat | ... | NaN | SANDRINGHAM AVENUE | 22105748.0 | SW20 | NaN | NaN | 524350 | 169550 | NaN | NaN |
2024-04-27 23:27:00 | 066825-27042024 | 2024 | 2024/25 | Special Service | 1.0 | 1.0 | 430 | 430.0 | Redacted | Cat | ... | NaN | RAINHILL WAY | 22700989.0 | E3 | NaN | NaN | 537550 | 182550 | NaN | NaN |
2024-04-28 14:21:00 | 067089-28042024 | 2024 | 2024/25 | Special Service | 1.0 | 1.0 | 430 | 430.0 | KITTEN TRAPPED INSIDE ENGINE COMPARTMENT OF CA... | Cat | ... | 1.200377e+07 | WESTERN AVENUE | 20602931.0 | UB6 | 516701.0 | 182992.0 | 516750 | 182950 | 51.533783 | -0.318859 |
2024-04-29 16:20:00 | 067673-29042024 | 2024 | 2024/25 | Special Service | 1.0 | 1.0 | 430 | 430.0 | CAT FALLEN INTO HOLE | Cat | ... | NaN | GREENHILL PARK | 20201414.0 | NW10 | NaN | NaN | 521250 | 183650 | NaN | NaN |
10703 rows × 30 columns
In [9]:
# we can now use the datetime index to select rows: here is January 2021
df.loc["2021-01-01":"2021-01-31", "FinalDescription"]
Out[9]:
DateTimeOfCall
2021-01-01 12:09:00    KITTEN STUCK UP TREE AL REQUESTED FROM SCENE
2021-01-01 14:06:00    Redacted
2021-01-03 18:40:00    CAT WITH LEG TRAPPED IN BATH PLUGHOLE
2021-01-04 13:39:00    Redacted
2021-01-06 10:22:00    Redacted
2021-01-06 13:09:00    CAT IN DISTRESS ON ROOF - ADDITIONAL APPLIANCE...
2021-01-06 20:35:00    DOG TRAPPED IN FOX HOLE - MEET AT CLUB HOUSE
2021-01-07 23:50:00    KITTEN STUCK BETWEEN WALL AND ROOF
2021-01-09 08:01:00    DOG STUCK IN TRENCH
2021-01-10 19:27:00    Redacted
2021-01-12 11:39:00    Redacted
2021-01-12 22:38:00    CAT TRAPPED IN DITCH
2021-01-16 18:05:00    DOG TRAPPED IN PORTER CABIN
2021-01-17 16:09:00    DOG TRAPPED IN WAREHOUSE AREA - CALLER BELIEVE...
2021-01-17 17:09:00    BIRD TRAPPED IN NETTING CALLER WILL MEET YOU
2021-01-18 15:17:00    CAT STUCK IN TREE BEING ATTACKED BY CROWS
2021-01-18 17:06:00    ASSIST RSPCA - SMALL ANIMAL RESUE - BIRD ENTAN...
2021-01-19 18:28:00    CAT TRAPPED BEHIND CUPBOARD
2021-01-19 20:24:00    Redacted
2021-01-19 20:36:00    RUNNING CALL AT ON ROOF
2021-01-20 09:35:00    CAT STUCK BETWEEN TREE BRANCHES
2021-01-21 13:15:00    SWAN TRAPPED IN NETTING
2021-01-21 18:23:00    CAT TRAPPED IN CHIMNEY
2021-01-22 14:22:00    CAT TRAPPED BETWEEN WALL AND FENCE
2021-01-23 10:18:00    CAT TRAPPED IN CHIMNEY
2021-01-23 15:43:00    CAT TRAPPED BETWEEN WALLS
2021-01-23 17:16:00    Redacted
2021-01-25 12:02:00    ASSIST RSPCA WITH FOX STUCK DOWN CULVERT
2021-01-26 13:42:00    DOG STUCK IN RAILINGS - CALLER WILL MEET YOU
2021-01-26 18:21:00    Redacted
2021-01-26 22:44:00    BIRDS TRAPPED IN BASKETBALL COURT CALLER IS ON...
2021-01-26 23:35:00    FOX TRAPPED IN FENCE IN ALLEYWAY NEXT TO
2021-01-27 09:18:00    CAT STUCK IN TREE - ATTENDED YESTERDAY AND ADV...
2021-01-27 10:12:00    BIRD TRAPPED BY LEG IN A TREE - RSPCA IN ATTEN...
2021-01-27 15:22:00    CAT UP TREE ASSIST RSPCA
2021-01-29 10:47:00    TRAPPED FOX IN FENCE IN REAR GARDEN
2021-01-30 14:53:00    CAT STUCK UNDER SHED
2021-01-30 15:28:00    BIRD CAUGHT IN NETTING - RSPCA ON SCENE
2021-01-30 17:54:00    DOG TRAPPED UNDER CAR
2021-01-31 12:53:00    CAT STUCK UP TREE - RSPCA ON SCENE
2021-01-31 13:48:00    INJURED CAT STUCK IN GREEN AREA AT REAR OF
Name: FinalDescription, dtype: object
In [10]:
# resample the timeseries by month and count incidents
df.resample("ME")["IncidentNumber"].count().plot(title="Monthly Calls")
# see https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-offset-aliases
plt.show()
In [11]:
# resample by year, sum total costs, average hourly costs
fig, axs = plt.subplots(figsize=(16, 4), ncols=2)
df.resample("YE")["IncidentNotionalCost(£)"].sum().plot(
title="Year total cost", ax=axs[0]
)
df.resample("YE")["HourlyNotionalCost(£)"].mean().plot(
title="Average hourly cost", ax=axs[1]
)
plt.show()
Missing data¶
Different strategies for dealing with missing data:
- Ignore the issue
  - some things may break / not work as expected
- Remove rows/columns with missing data
  - remove all rows with missing data:
df.dropna(axis=0)
  - remove all columns with missing data:
df.dropna(axis=1)
- Guess (impute) missing data
  - replace all missing entries with a value:
df.fillna(1)
  - replace missing entries with the mean for that column:
df.fillna(df.mean())
  - replace each missing entry with the previous valid entry:
df.ffill()
  - replace missing entries by interpolating between valid entries:
df.interpolate()
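As a quick illustration of these options, here is a minimal sketch on a small made-up DataFrame (the values are hypothetical, not from the LFB dataset):

```python
import numpy as np
import pandas as pd

# a tiny made-up DataFrame with some missing entries
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

dropped_rows = toy.dropna(axis=0)     # only row 0 has no missing values
dropped_cols = toy.dropna(axis=1)     # both columns contain a NaN, so none survive
filled_const = toy.fillna(1)          # every NaN becomes 1
filled_mean = toy.fillna(toy.mean())  # NaNs become the mean of their column
padded = toy.ffill()                  # each NaN takes the previous valid entry
interpolated = toy.interpolate()      # NaNs interpolated between valid neighbours
```

Which strategy is appropriate depends on the data: dropping rows loses valid entries from other columns, while imputing invents values, so it is worth checking the distribution first (as done below for PumpCount).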
In [12]:
# count missing entries for each column
df.isna().sum()
Out[12]:
IncidentNumber                   0
CalYear                          0
FinYear                          0
TypeOfIncident                   0
PumpCount                       72
PumpHoursTotal                  73
HourlyNotionalCost(£)            0
IncidentNotionalCost(£)         73
FinalDescription                 5
AnimalGroupParent                0
OriginofCall                     0
PropertyType                     0
PropertyCategory                 0
SpecialServiceTypeCategory       0
SpecialServiceType               0
WardCode                        15
Ward                            15
BoroughCode                     14
Borough                         14
StnGroundName                    0
UPRN                          6712
Street                           0
USRN                          1156
PostcodeDistrict                 0
Easting_m                     5693
Northing_m                    5693
Easting_rounded                  0
Northing_rounded                 0
Latitude                      5693
Longitude                     5693
dtype: int64
In [13]:
# If PumpCount is missing, typically so is PumpHoursTotal
# 73 rows are missing at least one of these
pump_missing = df["PumpCount"].isna() | df["PumpHoursTotal"].isna()
print(pump_missing.sum())
73
In [14]:
# so we could choose to drop these rows
df1 = df.drop(df.loc[pump_missing].index)
# here we made a new dataset df1 with these rows dropped
# to drop the rows from the original dataset df, we could do:
#
# df = df.drop(df.loc[pump_missing].index)
#
# or:
#
# df.drop(df.loc[pump_missing].index, inplace=True)
#
print(len(df1))
10630
In [15]:
# another equivalent way to do this
df2 = df.dropna(subset=["PumpCount", "PumpHoursTotal"])
print(len(df2))
10630
In [16]:
# but if we drop them, we lose valid data from other columns
# let's look at the distribution of values:
fig, axs = plt.subplots(1, 2, figsize=(14, 6))
df.plot.hist(y="PumpCount", ax=axs[0])
df.plot.hist(y="PumpHoursTotal", ax=axs[1])
plt.show()
In [17]:
# looks like it would be better to replace missing PumpCount and PumpHoursTotal fields with 1
df.fillna({"PumpCount": 1, "PumpHoursTotal": 1}, inplace=True)
In [18]:
df.isna().sum()
Out[18]:
IncidentNumber                   0
CalYear                          0
FinYear                          0
TypeOfIncident                   0
PumpCount                        0
PumpHoursTotal                   0
HourlyNotionalCost(£)            0
IncidentNotionalCost(£)         73
FinalDescription                 5
AnimalGroupParent                0
OriginofCall                     0
PropertyType                     0
PropertyCategory                 0
SpecialServiceTypeCategory       0
SpecialServiceType               0
WardCode                        15
Ward                            15
BoroughCode                     14
Borough                         14
StnGroundName                    0
UPRN                          6712
Street                           0
USRN                          1156
PostcodeDistrict                 0
Easting_m                     5693
Northing_m                    5693
Easting_rounded                  0
Northing_rounded                 0
Latitude                      5693
Longitude                     5693
dtype: int64
Count the unique entries in each column¶
In [19]:
df.nunique().sort_values()
Out[19]:
TypeOfIncident                    1
PumpCount                         4
SpecialServiceTypeCategory        4
PropertyCategory                  7
OriginofCall                      8
PumpHoursTotal                   12
HourlyNotionalCost(£)            14
CalYear                          16
FinYear                          17
SpecialServiceType               24
AnimalGroupParent                28
BoroughCode                      37
Borough                          70
IncidentNotionalCost(£)          89
StnGroundName                   109
PropertyType                    192
PostcodeDistrict                280
Northing_rounded                428
Easting_rounded                 532
WardCode                        760
Ward                           1345
UPRN                           3802
Northing_m                     4514
Easting_m                      4592
Longitude                      4939
Latitude                       4939
FinalDescription               6502
USRN                           7086
Street                         7703
IncidentNumber                10703
dtype: int64
In [20]:
# "cat" and "Cat" are treated as different animals here:
df["AnimalGroupParent"].unique()
Out[20]:
array(['Dog', 'Fox', 'Horse', 'Rabbit', 'Unknown - Heavy Livestock Animal', 'Squirrel', 'Cat', 'Bird', 'Unknown - Domestic Animal Or Pet', 'Sheep', 'Deer', 'Unknown - Wild Animal', 'Snake', 'Lizard', 'Hedgehog', 'cat', 'Hamster', 'Lamb', 'Fish', 'Bull', 'Cow', 'Ferret', 'Budgie', 'Unknown - Animal rescue from water - Farm animal', 'Pigeon', 'Goat', 'Tortoise', 'Unknown - Animal rescue from below ground - Farm animal'], dtype=object)
In [21]:
# select rows where AnimalGroupParent is "cat", replace with "Cat"
df.loc[df["AnimalGroupParent"] == "cat", "AnimalGroupParent"] = "Cat"
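If there were many inconsistent spellings, fixing them one at a time like this would get tedious; an alternative (a sketch, not part of the original notebook) is to map all known variants to their canonical form in one call with Series.replace:

```python
import pandas as pd

# hypothetical column with inconsistent capitalisation
animals = pd.Series(["Dog", "cat", "Cat", "Fox"])

# map each known variant to its canonical spelling
cleaned = animals.replace({"cat": "Cat"})
```

The same dictionary can hold as many variant-to-canonical mappings as needed.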
In [22]:
df["AnimalGroupParent"].unique()
Out[22]:
array(['Dog', 'Fox', 'Horse', 'Rabbit', 'Unknown - Heavy Livestock Animal', 'Squirrel', 'Cat', 'Bird', 'Unknown - Domestic Animal Or Pet', 'Sheep', 'Deer', 'Unknown - Wild Animal', 'Snake', 'Lizard', 'Hedgehog', 'Hamster', 'Lamb', 'Fish', 'Bull', 'Cow', 'Ferret', 'Budgie', 'Unknown - Animal rescue from water - Farm animal', 'Pigeon', 'Goat', 'Tortoise', 'Unknown - Animal rescue from below ground - Farm animal'], dtype=object)
In [23]:
df.groupby("AnimalGroupParent")["IncidentNumber"].count().sort_values().plot.barh(
logx=True
)
plt.show()
In [24]:
# there are several different hourly costs
# do they depend on the type of incident? or do they just increase over time?
df["HourlyNotionalCost(£)"].unique()
Out[24]:
array([255, 260, 290, 295, 298, 326, 328, 333, 339, 346, 352, 364, 388, 430])
In [25]:
# just goes up over time
df["HourlyNotionalCost(£)"].plot.line()
Out[25]:
<Axes: xlabel='DateTimeOfCall'>
In [26]:
# Group incidents by fire station & count them
df.groupby("StnGroundName")["IncidentNumber"].count()
Out[26]:
StnGroundName
Acton          85
Addington      74
Barking        98
Barnet        102
Battersea      91
             ...
Whitechapel    33
Willesden      80
Wimbledon      88
Woodford      104
Woodside       91
Name: IncidentNumber, Length: 109, dtype: int64
Plot location of calls on a map¶
- note: this section uses some additional libraries; to install them:
pip install geopandas contextily
In [27]:
import geopandas
# drop missing longitude/latitude
df2 = df.dropna(subset=["Longitude", "Latitude"])
# also drop zero values
df2 = df2[df2["Latitude"] != 0]
# set crs to EPSG:4326 to specify WGS84 Latitude/Longitude
gdf = geopandas.GeoDataFrame(
df2,
geometry=geopandas.points_from_xy(df2["Longitude"], df2["Latitude"]),
crs="EPSG:4326",
)
In [28]:
gdf.head()
Out[28]:
IncidentNumber | CalYear | FinYear | TypeOfIncident | PumpCount | PumpHoursTotal | HourlyNotionalCost(£) | IncidentNotionalCost(£) | FinalDescription | AnimalGroupParent | ... | Street | USRN | PostcodeDistrict | Easting_m | Northing_m | Easting_rounded | Northing_rounded | Latitude | Longitude | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DateTimeOfCall | |||||||||||||||||||||
2009-01-01 08:51:00 | 275091 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | Redacted | Fox | ... | Grasmere Road | NaN | SE25 | 534785.0 | 167546.0 | 534750 | 167550 | 51.390954 | -0.064167 | POINT (-0.06417 51.39095) |
2009-01-04 10:07:00 | 2075091 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | Redacted | Dog | ... | Mill Lane | NaN | SM5 | 528041.0 | 164923.0 | 528050 | 164950 | 51.368941 | -0.161985 | POINT (-0.16199 51.36894) |
2009-01-05 12:27:00 | 2872091 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | Redacted | Horse | ... | Park Lane | 21401484.0 | UB9 | 504689.0 | 190685.0 | 504650 | 190650 | 51.605283 | -0.489684 | POINT (-0.48968 51.60528) |
2009-01-07 06:29:00 | 4011091 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | Redacted | Dog | ... | Holloway Road | NaN | E11 | 539013.0 | 186162.0 | 539050 | 186150 | 51.557221 | 0.003880 | POINT (0.00388 51.55722) |
2009-01-07 11:55:00 | 4211091 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | Redacted | Dog | ... | Aldersbrook Road | NaN | E12 | 541327.0 | 186654.0 | 541350 | 186650 | 51.561067 | 0.037434 | POINT (0.03743 51.56107) |
5 rows × 31 columns
In [29]:
f, ax = plt.subplots(figsize=(16, 16))
# plot location of calls involving animals
gdf.plot(ax=ax, color="black", alpha=0.3)
plt.title("Call locations")
# plt.axis("off")
plt.show()
In [30]:
import contextily as cx
f, ax = plt.subplots(figsize=(16, 16))
# plot location of calls involving animals
gdf.plot(ax=ax, color="black", alpha=0.3)
# add a basemap of the region using contextily
cx.add_basemap(ax, crs=gdf.crs)
plt.title("Call locations")
plt.axis("off")
plt.show()
In [31]:
f, ax = plt.subplots(figsize=(16, 16))
# plot location of calls involving animals
for animal, colour in [
("Cow", "black"),
("Deer", "red"),
("Fox", "blue"),
("Snake", "yellow"),
]:
gdf[gdf["AnimalGroupParent"] == animal].plot(
ax=ax, color=colour, alpha=0.5, label=animal
)
# add a basemap of the region using contextily
cx.add_basemap(ax, crs=gdf.crs)
plt.title("Call locations by animal")
plt.legend()
plt.axis("off")
plt.show()
Suggested workflow / philosophy¶
1. you want to do something but aren't sure how¶
- if you know / have a guess which function to use, look at its docstring:
?function_name
- if you don't have any idea what to try, google
how do I ... in pandas
- modern alternative: ask ChatGPT to
write python code using pandas to ...
- if in doubt, just try something!
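For example, to look up the docstring for dropna (the function name here is just an illustration): in Jupyter you could run pd.DataFrame.dropna?, and outside Jupyter the same text is available via the function's docstring:

```python
import pandas as pd

# in Jupyter: pd.DataFrame.dropna?
# outside Jupyter, the same documentation is in the docstring:
doc = pd.DataFrame.dropna.__doc__
```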
Suggested workflow / philosophy¶
2. you try something and get an error message¶
- copy & paste the last bit into google (along with the function_name and/or pandas)
- don't be intimidated by the long and apparently nonsensical error messages
- almost certainly someone else has had this exact problem
- almost certainly the solution is waiting for you
Suggested workflow / philosophy¶
3. look for a stackoverflow answer with many up-votes¶
- ignore the green tick: it just means that the person who asked the question accepted that answer
- typically an answer with many up-votes is a better option
- more recent answers can also be better: sometimes a library has changed since an older answer was written
Next steps¶
- experiment with your own datasets
- read some pandas documentation
- follow a tutorial
- free interactive kaggle courses