Data Exploration with Python and Jupyter
Basic usage of the Pandas library to download a dataset, explore its contents, clean up missing or invalid data, filter the data according to different criteria, and plot visualizations of the data.
- Part 1: Python and Jupyter
- Part 2: Pandas with toy data
- Part 3: Pandas with real data
Let's download some real data
For some reason, the London Fire Brigade provides a public spreadsheet of all animal rescue incidents since 2009:
https://data.london.gov.uk/dataset/animal-rescue-incidents-attended-by-lfb
They provide a link to the dataset in CSV (comma-separated values) format
In [1]:
# import the Pandas library & matplotlib for plotting
import pandas as pd
import matplotlib.pyplot as plt
In [2]:
# download a csv file with some data and convert it to a DataFrame
url = "https://data.london.gov.uk/download/animal-rescue-incidents-attended-by-lfb/01007433-55c2-4b8a-b799-626d9e3bc284/Animal%20Rescue%20incidents%20attended%20by%20LFB%20from%20Jan%202009.csv"
df = pd.read_csv(url)
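As a minimal sketch with toy data (not the LFB dataset): `read_csv` accepts any file-like object, so the same call can be tried on an in-memory CSV string before downloading anything.

```python
import io

import pandas as pd

# a small made-up CSV, just to show the shape of the call
csv_text = "animal,count\ncat,3\ndog,2\n"
toy = pd.read_csv(io.StringIO(csv_text))
print(toy.shape)  # (2, 2)
```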
Suggested workflow / philosophy
- you want to do something
  - if you know / have a guess which function to use, look at its docstring: ?function_name
  - if you don't have any idea what to try, google "how do I ... in pandas"
  - if in doubt, just try something!
- if you get an error, copy & paste the last bit into google (along with function_name and/or pandas)
  - don't be intimidated by the long and apparently nonsensical error messages
  - almost certainly someone else has had this exact problem
  - almost certainly the solution is waiting for you
- look for a stackoverflow answer with many up-votes
  - ignore the green tick, this just means the person asking the question liked the answer
  - typically an answer with many up-votes is a better option
  - more recent answers can also be better: sometimes a library has changed since an older answer was written

(For anyone who wasn't already doing this, that may be the most useful thing in this course)
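The ?function_name tip also works outside a notebook: the same docstring is available from plain Python via help() or the __doc__ attribute. A minimal sketch:

```python
import pandas as pd

# the text that ?pd.read_csv shows in a notebook lives on __doc__
doc = pd.read_csv.__doc__
# print just enough to confirm we found it
print("DataFrame" in doc)  # True
```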
Display the DataFrame
In [3]:
df
Out[3]:
IncidentNumber | DateTimeOfCall | CalYear | FinYear | TypeOfIncident | PumpCount | PumpHoursTotal | HourlyNotionalCost(£) | IncidentNotionalCost(£) | FinalDescription | ... | UPRN | Street | USRN | PostcodeDistrict | Easting_m | Northing_m | Easting_rounded | Northing_rounded | Latitude | Longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 139091 | 2009-01-01 03:01:00 | 2009 | 2008/09 | Special Service | 1.0 | 2.0 | 255 | 510.0 | Redacted | ... | NaN | Waddington Way | 20500146.0 | SE19 | NaN | NaN | 532350 | 170050 | NaN | NaN |
1 | 275091 | 2009-01-01 08:51:00 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | Redacted | ... | NaN | Grasmere Road | NaN | SE25 | 534785.0 | 167546.0 | 534750 | 167550 | 51.390954 | -0.064167 |
2 | 2075091 | 2009-01-04 10:07:00 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | Redacted | ... | NaN | Mill Lane | NaN | SM5 | 528041.0 | 164923.0 | 528050 | 164950 | 51.368941 | -0.161985 |
3 | 2872091 | 2009-01-05 12:27:00 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | Redacted | ... | 1.000215e+11 | Park Lane | 21401484.0 | UB9 | 504689.0 | 190685.0 | 504650 | 190650 | 51.605283 | -0.489684 |
4 | 3553091 | 2009-01-06 15:23:00 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | Redacted | ... | NaN | Swindon Lane | 21300122.0 | RM3 | NaN | NaN | 554650 | 192350 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9723 | 096744-30062023 | 2023-06-30 15:56:00 | 2023 | 2023/24 | Special Service | 1.0 | 1.0 | 388 | 388.0 | Redacted | ... | 1.000207e+11 | SELSDON PARK ROAD | 20502647.0 | CR2 | 536269.0 | 162904.0 | 536250 | 162950 | 51.348885 | -0.044629 |
9724 | 096880-30062023 | 2023-06-30 20:21:00 | 2023 | 2023/24 | Special Service | 1.0 | 1.0 | 388 | 388.0 | PIGEON STUCK BETWEEN FENCES IN PLAYING FIE... | ... | 2.070058e+08 | CAMBRIDGE GARDENS | 20702560.0 | EN1 | 534305.0 | 197280.0 | 534350 | 197250 | 51.658273 | -0.059731 |
9725 | 096884-30062023 | 2023-06-30 20:31:00 | 2023 | 2023/24 | Special Service | 1.0 | 1.0 | 388 | 388.0 | CAT STUCK BETWEEN BUILT IN FRIDGE AND WALL | ... | NaN | EASTFIELD ROAD | 20702215.0 | EN3 | NaN | NaN | 535650 | 198250 | NaN | NaN |
9726 | 096913-30062023 | 2023-06-30 21:24:00 | 2023 | 2023/24 | Special Service | 1.0 | 2.0 | 388 | 776.0 | Redacted | ... | NaN | NORBURY COURT ROAD | 20501229.0 | SW16 | NaN | NaN | 530650 | 169150 | NaN | NaN |
9727 | 096935-30062023 | 2023-06-30 22:26:00 | 2023 | 2023/24 | Special Service | 1.0 | 1.0 | 388 | 388.0 | Redacted | ... | NaN | QUEENSHURST SQUARE | 21880473.0 | KT2 | NaN | NaN | 518150 | 169750 | NaN | NaN |
9728 rows × 31 columns
Column data types
In [4]:
df.dtypes
Out[4]:
IncidentNumber                 object
DateTimeOfCall                 object
CalYear                         int64
FinYear                        object
TypeOfIncident                 object
PumpCount                     float64
PumpHoursTotal                float64
HourlyNotionalCost(£)           int64
IncidentNotionalCost(£)       float64
FinalDescription               object
AnimalGroupParent              object
OriginofCall                   object
PropertyType                   object
PropertyCategory               object
SpecialServiceTypeCategory     object
SpecialServiceType             object
WardCode                       object
Ward                           object
BoroughCode                    object
Borough                        object
StnGroundName                  object
UPRN                          float64
Street                         object
USRN                          float64
PostcodeDistrict               object
Easting_m                     float64
Northing_m                    float64
Easting_rounded                 int64
Northing_rounded                int64
Latitude                      float64
Longitude                     float64
dtype: object
Convert DateTimeOfCall to a date-time
In [5]:
df["DateTimeOfCall"].head()
Out[5]:
0    2009-01-01 03:01:00
1    2009-01-01 08:51:00
2    2009-01-04 10:07:00
3    2009-01-05 12:27:00
4    2009-01-06 15:23:00
Name: DateTimeOfCall, dtype: object
In [6]:
# this looks like what we want..
pd.to_datetime(df["DateTimeOfCall"]).head()
Out[6]:
0   2009-01-01 03:01:00
1   2009-01-01 08:51:00
2   2009-01-04 10:07:00
3   2009-01-05 12:27:00
4   2009-01-06 15:23:00
Name: DateTimeOfCall, dtype: datetime64[ns]
In [7]:
# ..but which number is the month and which is the day?
# how can we check if what we just did was correct?
pd.to_datetime(df["DateTimeOfCall"]).plot()
# should be a single monotonically increasing line: looks good!
Out[7]:
<Axes: >
In [8]:
# replace DateTimeOfCall column in dataframe with this one
df["DateTimeOfCall"] = pd.to_datetime(df["DateTimeOfCall"])
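On the day/month ambiguity checked above: with a toy string you can see how `to_datetime` resolves it by default, and how the `dayfirst` argument flips the interpretation (made-up example date, not from this dataset).

```python
import pandas as pd

# "01/02/2009" is ambiguous: by default pandas parses it month-first
s = pd.Series(["01/02/2009"])
month_first = pd.to_datetime(s).dt.month.iloc[0]
day_first = pd.to_datetime(s, dayfirst=True).dt.month.iloc[0]
print(month_first, day_first)  # 1 2
```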
Use the datetime as the index
In [9]:
df.set_index("DateTimeOfCall", inplace=True)
In [10]:
df
Out[10]:
IncidentNumber | CalYear | FinYear | TypeOfIncident | PumpCount | PumpHoursTotal | HourlyNotionalCost(£) | IncidentNotionalCost(£) | FinalDescription | AnimalGroupParent | ... | UPRN | Street | USRN | PostcodeDistrict | Easting_m | Northing_m | Easting_rounded | Northing_rounded | Latitude | Longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DateTimeOfCall | |||||||||||||||||||||
2009-01-01 03:01:00 | 139091 | 2009 | 2008/09 | Special Service | 1.0 | 2.0 | 255 | 510.0 | Redacted | Dog | ... | NaN | Waddington Way | 20500146.0 | SE19 | NaN | NaN | 532350 | 170050 | NaN | NaN |
2009-01-01 08:51:00 | 275091 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | Redacted | Fox | ... | NaN | Grasmere Road | NaN | SE25 | 534785.0 | 167546.0 | 534750 | 167550 | 51.390954 | -0.064167 |
2009-01-04 10:07:00 | 2075091 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | Redacted | Dog | ... | NaN | Mill Lane | NaN | SM5 | 528041.0 | 164923.0 | 528050 | 164950 | 51.368941 | -0.161985 |
2009-01-05 12:27:00 | 2872091 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | Redacted | Horse | ... | 1.000215e+11 | Park Lane | 21401484.0 | UB9 | 504689.0 | 190685.0 | 504650 | 190650 | 51.605283 | -0.489684 |
2009-01-06 15:23:00 | 3553091 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | Redacted | Rabbit | ... | NaN | Swindon Lane | 21300122.0 | RM3 | NaN | NaN | 554650 | 192350 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2023-06-30 15:56:00 | 096744-30062023 | 2023 | 2023/24 | Special Service | 1.0 | 1.0 | 388 | 388.0 | Redacted | Cat | ... | 1.000207e+11 | SELSDON PARK ROAD | 20502647.0 | CR2 | 536269.0 | 162904.0 | 536250 | 162950 | 51.348885 | -0.044629 |
2023-06-30 20:21:00 | 096880-30062023 | 2023 | 2023/24 | Special Service | 1.0 | 1.0 | 388 | 388.0 | PIGEON STUCK BETWEEN FENCES IN PLAYING FIE... | Bird | ... | 2.070058e+08 | CAMBRIDGE GARDENS | 20702560.0 | EN1 | 534305.0 | 197280.0 | 534350 | 197250 | 51.658273 | -0.059731 |
2023-06-30 20:31:00 | 096884-30062023 | 2023 | 2023/24 | Special Service | 1.0 | 1.0 | 388 | 388.0 | CAT STUCK BETWEEN BUILT IN FRIDGE AND WALL | Cat | ... | NaN | EASTFIELD ROAD | 20702215.0 | EN3 | NaN | NaN | 535650 | 198250 | NaN | NaN |
2023-06-30 21:24:00 | 096913-30062023 | 2023 | 2023/24 | Special Service | 1.0 | 2.0 | 388 | 776.0 | Redacted | Cat | ... | NaN | NORBURY COURT ROAD | 20501229.0 | SW16 | NaN | NaN | 530650 | 169150 | NaN | NaN |
2023-06-30 22:26:00 | 096935-30062023 | 2023 | 2023/24 | Special Service | 1.0 | 1.0 | 388 | 388.0 | Redacted | Cat | ... | NaN | QUEENSHURST SQUARE | 21880473.0 | KT2 | NaN | NaN | 518150 | 169750 | NaN | NaN |
9728 rows × 30 columns
In [11]:
# can now use the datetime index to select rows: here is January 2021
df.loc["2021-01-01":"2021-01-31", "FinalDescription"]
Out[11]:
DateTimeOfCall
2021-01-01 12:09:00    KITTEN STUCK UP TREE AL REQUESTED FROM SCENE
2021-01-01 14:06:00    Redacted
2021-01-03 18:40:00    CAT WITH LEG TRAPPED IN BATH PLUGHOLE
2021-01-04 13:39:00    Redacted
2021-01-06 10:22:00    Redacted
2021-01-06 13:09:00    CAT IN DISTRESS ON ROOF - ADDITIONAL APPLIANCE...
2021-01-06 20:35:00    DOG TRAPPED IN FOX HOLE - MEET AT CLUB HOUSE
2021-01-07 23:50:00    KITTEN STUCK BETWEEN WALL AND ROOF
2021-01-09 08:01:00    DOG STUCK IN TRENCH
2021-01-10 19:27:00    Redacted
2021-01-12 11:39:00    Redacted
2021-01-12 22:38:00    CAT TRAPPED IN DITCH
2021-01-16 18:05:00    DOG TRAPPED IN PORTER CABIN
2021-01-17 16:09:00    DOG TRAPPED IN WAREHOUSE AREA - CALLER BELIEVE...
2021-01-17 17:09:00    BIRD TRAPPED IN NETTING CALLER WILL MEET YOU
2021-01-18 15:17:00    CAT STUCK IN TREE BEING ATTACKED BY CROWS
2021-01-18 17:06:00    ASSIST RSPCA - SMALL ANIMAL RESUE - BIRD ENTAN...
2021-01-19 18:28:00    CAT TRAPPED BEHIND CUPBOARD
2021-01-19 20:24:00    Redacted
2021-01-19 20:36:00    RUNNING CALL AT ON ROOF
2021-01-20 09:35:00    CAT STUCK BETWEEN TREE BRANCHES
2021-01-21 13:15:00    SWAN TRAPPED IN NETTING
2021-01-21 18:23:00    CAT TRAPPED IN CHIMNEY
2021-01-22 14:22:00    CAT TRAPPED BETWEEN WALL AND FENCE
2021-01-23 10:18:00    CAT TRAPPED IN CHIMNEY
2021-01-23 15:43:00    CAT TRAPPED BETWEEN WALLS
2021-01-23 17:16:00    Redacted
2021-01-25 12:02:00    ASSIST RSPCA WITH FOX STUCK DOWN CULVERT
2021-01-26 13:42:00    DOG STUCK IN RAILINGS - CALLER WILL MEET YOU
2021-01-26 18:21:00    Redacted
2021-01-26 22:44:00    BIRDS TRAPPED IN BASKETBALL COURT CALLER IS ON...
2021-01-26 23:35:00    FOX TRAPPED IN FENCE IN ALLEYWAY NEXT TO
2021-01-27 09:18:00    CAT STUCK IN TREE - ATTENDED YESTERDAY AND ADV...
2021-01-27 10:12:00    BIRD TRAPPED BY LEG IN A TREE - RSPCA IN ATTEN...
2021-01-27 15:22:00    CAT UP TREE ASSIST RSPCA
2021-01-29 10:47:00    TRAPPED FOX IN FENCE IN REAR GARDEN
2021-01-30 14:53:00    CAT STUCK UNDER SHED
2021-01-30 15:28:00    BIRD CAUGHT IN NETTING - RSPCA ON SCENE
2021-01-30 17:54:00    DOG TRAPPED UNDER CAR
2021-01-31 12:53:00    CAT STUCK UP TREE - RSPCA ON SCENE
2021-01-31 13:48:00    INJURED CAT STUCK IN GREEN AREA AT REAR OF
Name: FinalDescription, dtype: object
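With a DatetimeIndex, pandas also supports partial string indexing: a string like "2021-01" selects every row in that month, equivalent to the explicit slice used above. A toy sketch:

```python
import pandas as pd

# small made-up series with a DatetimeIndex
idx = pd.to_datetime(["2021-01-05", "2021-01-20", "2021-02-03"])
s = pd.Series(["a", "b", "c"], index=idx)
# partial date string selects the whole month
print(list(s.loc["2021-01"]))  # ['a', 'b']
```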
In [12]:
# resample the timeseries by month and count incidents
df.resample("M")["IncidentNumber"].count().plot(title="Monthly Calls")
# see https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-offset-aliases
plt.show()
In [13]:
# resample by year, sum total costs, average hourly costs
fig, axs = plt.subplots(figsize=(16, 4), ncols=2)
df.resample("Y")["IncidentNotionalCost(£)"].sum().plot(
title="Year total cost", ax=axs[0]
)
df.resample("Y")["HourlyNotionalCost(£)"].mean().plot(
title="Average hourly cost", ax=axs[1]
)
plt.show()
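The same resampling pattern on a tiny synthetic series makes the binning explicit (toy data, not the LFB dataset; note that newer pandas versions prefer the "ME" alias over "M" for month-end):

```python
import pandas as pd

# 60 daily values starting 1 Jan 2021: all of January, all of February,
# and one day of March
idx = pd.date_range("2021-01-01", periods=60, freq="D")
s = pd.Series(1, index=idx)
# sum the daily values into monthly bins
monthly = s.resample("M").sum()
print(monthly.tolist())  # [31, 28, 1]
```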
Missing data
Different strategies for dealing with missing data:
- Ignore the issue
  - some things may break / not work as expected
- Remove rows/columns with missing data
  - remove all rows with missing data: df.dropna(axis=0)
  - remove all columns with missing data: df.dropna(axis=1)
- Guess (impute) missing data
  - replace all missing entries with a value: df.fillna(1)
  - replace missing entries with the mean of that column: df.fillna(df.mean())
  - replace each missing entry with the previous valid entry: df.ffill() (the older df.fillna(method="pad") spelling is deprecated)
  - replace missing entries by interpolating between valid entries: df.interpolate()
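The strategies above on a tiny made-up frame (values chosen only for illustration):

```python
import numpy as np
import pandas as pd

# toy frame with one missing value in each column
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})

print(toy.dropna(axis=0).shape)          # (1, 2): only the complete row survives
print(toy.fillna(0).isna().sum().sum())  # 0: no missing entries left
print(toy.interpolate()["a"].iloc[1])    # 2.0: halfway between 1.0 and 3.0
```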
In [14]:
# count missing entries for each column
df.isna().sum()
Out[14]:
IncidentNumber                   0
CalYear                          0
FinYear                          0
TypeOfIncident                   0
PumpCount                       65
PumpHoursTotal                  66
HourlyNotionalCost(£)            0
IncidentNotionalCost(£)         66
FinalDescription                 5
AnimalGroupParent                0
OriginofCall                     0
PropertyType                     0
PropertyCategory                 0
SpecialServiceTypeCategory       0
SpecialServiceType               0
WardCode                        10
Ward                            10
BoroughCode                     12
Borough                         12
StnGroundName                    0
UPRN                          6127
Street                           0
USRN                          1156
PostcodeDistrict                 0
Easting_m                     5108
Northing_m                    5108
Easting_rounded                  0
Northing_rounded                 0
Latitude                      5108
Longitude                     5108
dtype: int64
In [15]:
# If PumpCount is missing, typically so is PumpHoursTotal
# 66 rows are missing at least one of these
pump_missing = df["PumpCount"].isna() | df["PumpHoursTotal"].isna()
print(pump_missing.sum())
66
In [16]:
# so we could choose to drop these rows
df1 = df.drop(df.loc[pump_missing].index)
# here we made a new dataset df1 with these rows dropped
# to drop the rows from the original dataset df, could do:
#
# df = df.drop(df.loc[pump_missing].index)
#
# or:
#
# df.drop(df.loc[pump_missing].index, inplace=True)
#
#
print(len(df1))
9662
In [17]:
# another equivalent way to do this
df2 = df.dropna(subset=["PumpCount", "PumpHoursTotal"])
print(len(df2))
9662
In [18]:
# but if we drop them, we lose valid data from other columns
# let's look at the distribution of values:
fig, axs = plt.subplots(1, 2, figsize=(14, 6))
df.plot.hist(y="PumpCount", ax=axs[0])
df.plot.hist(y="PumpHoursTotal", ax=axs[1])
plt.show()
In [19]:
# looks like it would be better to replace missing PumpCount and PumpHoursTotal fields with 1
?df.fillna
df.fillna({"PumpCount": 1, "PumpHoursTotal": 1}, inplace=True)
In [20]:
df.isna().sum()
Out[20]:
IncidentNumber                   0
CalYear                          0
FinYear                          0
TypeOfIncident                   0
PumpCount                        0
PumpHoursTotal                   0
HourlyNotionalCost(£)            0
IncidentNotionalCost(£)         66
FinalDescription                 5
AnimalGroupParent                0
OriginofCall                     0
PropertyType                     0
PropertyCategory                 0
SpecialServiceTypeCategory       0
SpecialServiceType               0
WardCode                        10
Ward                            10
BoroughCode                     12
Borough                         12
StnGroundName                    0
UPRN                          6127
Street                           0
USRN                          1156
PostcodeDistrict                 0
Easting_m                     5108
Northing_m                    5108
Easting_rounded                  0
Northing_rounded                 0
Latitude                      5108
Longitude                     5108
dtype: int64
Count the unique entries in each column
In [21]:
df.nunique().sort_values()
Out[21]:
TypeOfIncident                   1
PumpCount                        4
SpecialServiceTypeCategory       4
PropertyCategory                 7
OriginofCall                     8
PumpHoursTotal                  12
HourlyNotionalCost(£)           13
CalYear                         15
FinYear                         16
SpecialServiceType              24
AnimalGroupParent               28
BoroughCode                     37
Borough                         70
IncidentNotionalCost(£)         82
StnGroundName                  108
PropertyType                   187
PostcodeDistrict               277
Northing_rounded               425
Easting_rounded                530
WardCode                       759
Ward                          1272
UPRN                          3446
Northing_m                    4188
Easting_m                     4254
Longitude                     4549
Latitude                      4549
FinalDescription              5907
USRN                          6496
Street                        7172
IncidentNumber                9728
dtype: int64
In [22]:
# "cat" and "Cat" are treated as different animals here:
df["AnimalGroupParent"].unique()
Out[22]:
array(['Dog', 'Fox', 'Horse', 'Rabbit', 'Unknown - Heavy Livestock Animal', 'Squirrel', 'Cat', 'Bird', 'Unknown - Domestic Animal Or Pet', 'Sheep', 'Deer', 'Unknown - Wild Animal', 'Snake', 'Lizard', 'Hedgehog', 'cat', 'Hamster', 'Lamb', 'Fish', 'Bull', 'Cow', 'Ferret', 'Budgie', 'Unknown - Animal rescue from water - Farm animal', 'Pigeon', 'Goat', 'Tortoise', 'Unknown - Animal rescue from below ground - Farm animal'], dtype=object)
In [23]:
# select rows where AnimalGroupParent is "cat", replace with "Cat"
df.loc[df["AnimalGroupParent"] == "cat", "AnimalGroupParent"] = "Cat"
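An alternative sketch on toy data: Series.replace with a mapping does the same normalisation, and scales better when several labels need fixing at once.

```python
import pandas as pd

# made-up labels with inconsistent casing
s = pd.Series(["Dog", "cat", "Cat", "fox"])
# map each bad label to its canonical form
cleaned = s.replace({"cat": "Cat", "fox": "Fox"})
print(sorted(cleaned.unique()))  # ['Cat', 'Dog', 'Fox']
```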
In [24]:
df["AnimalGroupParent"].unique()
Out[24]:
array(['Dog', 'Fox', 'Horse', 'Rabbit', 'Unknown - Heavy Livestock Animal', 'Squirrel', 'Cat', 'Bird', 'Unknown - Domestic Animal Or Pet', 'Sheep', 'Deer', 'Unknown - Wild Animal', 'Snake', 'Lizard', 'Hedgehog', 'Hamster', 'Lamb', 'Fish', 'Bull', 'Cow', 'Ferret', 'Budgie', 'Unknown - Animal rescue from water - Farm animal', 'Pigeon', 'Goat', 'Tortoise', 'Unknown - Animal rescue from below ground - Farm animal'], dtype=object)
In [25]:
df.groupby("AnimalGroupParent")["IncidentNumber"].count().sort_values().plot.barh(
logx=True
)
plt.show()
In [26]:
# apparently different hourly costs
# does it depend on the type of event? or does it just increase over time?
df["HourlyNotionalCost(£)"].unique()
Out[26]:
array([255, 260, 290, 295, 298, 326, 328, 333, 339, 346, 352, 364, 388])
In [27]:
# just goes up over time
df["HourlyNotionalCost(£)"].plot.line()
Out[27]:
<Axes: xlabel='DateTimeOfCall'>
In [28]:
# Group incidents by fire station & count them
df.groupby("StnGroundName")["IncidentNumber"].count()
Out[28]:
StnGroundName
Acton          74
Addington      66
Barking        91
Barnet         95
Battersea      82
               ..
Whitechapel    26
Willesden      68
Wimbledon      75
Woodford       95
Woodside       83
Name: IncidentNumber, Length: 108, dtype: int64
Plot location of calls on a map
- note: this section uses some more libraries, to install them:
pip install geopandas contextily
In [29]:
# drop missing longitude/latitude
df2 = df.dropna(subset=["Longitude", "Latitude"])
# also drop zero values
df2 = df2[df2["Latitude"] != 0]
# convert to geodataframe using geopandas
import geopandas
# set crs to EPSG:4326 to specify WGS84 Latitude/Longitude
gdf = geopandas.GeoDataFrame(
df2,
geometry=geopandas.points_from_xy(df2["Longitude"], df2["Latitude"]),
crs="EPSG:4326",
)
gdf.head()
Out[29]:
IncidentNumber | CalYear | FinYear | TypeOfIncident | PumpCount | PumpHoursTotal | HourlyNotionalCost(£) | IncidentNotionalCost(£) | FinalDescription | AnimalGroupParent | ... | Street | USRN | PostcodeDistrict | Easting_m | Northing_m | Easting_rounded | Northing_rounded | Latitude | Longitude | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DateTimeOfCall | |||||||||||||||||||||
2009-01-01 08:51:00 | 275091 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | Redacted | Fox | ... | Grasmere Road | NaN | SE25 | 534785.0 | 167546.0 | 534750 | 167550 | 51.390954 | -0.064167 | POINT (-0.06417 51.39095) |
2009-01-04 10:07:00 | 2075091 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | Redacted | Dog | ... | Mill Lane | NaN | SM5 | 528041.0 | 164923.0 | 528050 | 164950 | 51.368941 | -0.161985 | POINT (-0.16199 51.36894) |
2009-01-05 12:27:00 | 2872091 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | Redacted | Horse | ... | Park Lane | 21401484.0 | UB9 | 504689.0 | 190685.0 | 504650 | 190650 | 51.605283 | -0.489684 | POINT (-0.48968 51.60528) |
2009-01-07 06:29:00 | 4011091 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | Redacted | Dog | ... | Holloway Road | NaN | E11 | 539013.0 | 186162.0 | 539050 | 186150 | 51.557221 | 0.003880 | POINT (0.00388 51.55722) |
2009-01-07 11:55:00 | 4211091 | 2009 | 2008/09 | Special Service | 1.0 | 1.0 | 255 | 255.0 | Redacted | Dog | ... | Aldersbrook Road | NaN | E12 | 541327.0 | 186654.0 | 541350 | 186650 | 51.561067 | 0.037434 | POINT (0.03743 51.56107) |
5 rows × 31 columns
In [30]:
f, ax = plt.subplots(figsize=(16, 16))
# plot location of calls involving animals
gdf.plot(ax=ax, color="black", alpha=0.3)
plt.title("Call locations")
# plt.axis("off")
plt.show()
In [31]:
import contextily as cx
f, ax = plt.subplots(figsize=(16, 16))
# plot location of calls involving animals
gdf.plot(ax=ax, color="black", alpha=0.3)
# add a basemap of the region using contextily
cx.add_basemap(ax, crs=gdf.crs)
plt.title("Call locations")
plt.axis("off")
plt.show()
In [32]:
f, ax = plt.subplots(figsize=(16, 16))
# plot location of calls involving animals
for animal, colour in [
("Cow", "black"),
("Deer", "red"),
("Fox", "blue"),
("Snake", "yellow"),
]:
gdf[gdf["AnimalGroupParent"] == animal].plot(
ax=ax, color=colour, alpha=0.5, label=animal
)
# add a basemap of the region using contextily
cx.add_basemap(ax, crs=gdf.crs)
plt.title("Call locations by animal")
plt.legend()
plt.axis("off")
plt.show()
Next steps
- experiment with your own datasets
- read some pandas documentation
- follow a tutorial
- free interactive kaggle courses