Update as of September 21, 2021: AutoZone announced for Q4 2021 with a huge surprise beat on both sales and earnings, resulting in a roughly 4% gain. Advance Auto Parts and O'Reilly Auto Parts both saw a 2% gain at the same time. This reflects what we have concluded in our analysis here, and a hedged three-way pair trade would have resulted in a profit.

Overview¶

Disclaimer: Information is not intended as, and shall not be understood or construed as, financial advice.

Through the COVID pandemic, the United States saw a trend of deurbanization as individuals began remote work en masse and social distancing measures increased. Naturally, this led to a myriad of second order effects: a shift to e-commerce (WSJ), a spike in home demand (WSJ), chip shortages (WSJ), and much more. One of the effects of deurbanization is the acceleration of demand for cars. Naturally, as more people are moving to areas outside of urban centers that are less accessible through public transit, people have become increasingly reliant on cars (AP News). Along with a perfect storm of conditions such as chip shortages (Reuters, CNN) and pent-up demand for cars (CNBC), we are are seeing record demand coupled with record supply shortages for cars.

The used car market has been especially bizarre -- some used cars have even cost more than their new counterparts (AP News). And given this outlandish demand, paired with record averages of car ages in the United States (CNBC), auto parts retailers have seen record sales (WSJ). This, in itself, is an extraordinary fact. For a long duration toward the beginning of the pandemic, there was a brake in traffic across the country (The Brookings Institution). Of course, if few people are commuting into offices and individuals are heavily encouraged to social distance and avoid social outings, who is still driving?

Nevertheless, auto parts retailers have benefitted from the conditions in the car market, and this quarter, given extreme weather across the United States, auto parts retailers are primed for more record profits. It's actually a well documented expectation (WSJ) -- more used cars out on the market, combined with hot weather this summer, are both potential tailwinds for auto parts retailers this quarter.

However, here I argue that the used car markets and extreme weather have differential effects on auto parts retailers, resulting in fluctations of actual sales versus sales estimates for three major U.S. public auto parts retailers: AutoZone (AZO), Advance Auto Parts (AAP), and O'Reilly Auto Parts (ORLY).

Extreme Weather Effects on Auto Part Retailers¶

One natural hole in the extreme weather argument is that the auto part retailers and extreme weather are varied in their distributions across the United States. Extreme weather is not a rising tide that lifts all boats. Auto part retailers with higher concentrations in areas with more extreme weather should experience higher demand for auto parts. If auto part retailers were similarly distributed across the United States (e.g., AZO, AAP, and ORLY had equal proportions of stores in each county in the United States), differences in sales should be attributed more to pricing dynamics; however, because auto part retailers are distributed differently across the United States, we should see differences in sales that result from extreme weather.

A natural question arises: how are auto part retailers differently distributed across the United States?

There are multiple ways we can obtain store location data and visualize the differences between each of the different major auto part retailers.

Scrape the store location data from each of the respective websites;
Query the store location data from OpenStreetMaps API;
Query the store location data from Google Maps API;

Each of these has their own pros and cons.

Option 1 is clearly the most officially recognized list of stores, and should be best maintained, as it is in the best interest of the companies themselves to adequately list out their stores. However, scraping this data has high fixed cost (time).

Option 2 is the least expensive, as querying OpenStreetMaps can be done for free. However, this option provides an incomplete list of stores across the United States. Individuals maintain the location data, and all data is user-submitted, which can either lead to holes in the data or faulty data (which has been shown not to be the case here). This data source also gets some brownie points for being open source.

Option 3 is the most expensive, as each Google Maps API query is not free. Furthermore, the coverage from Google Maps is subject to some of the same data faults as available through OSM (source) -- while at the same time not being closed source.

Here, I choose to go with option 2 (OpenStreetMaps) as the data provider for store locations. This source is queryable for free, either by hitting any of their Overpass API endpoints (here), or directly queryable through a front-end web interface called Overpass Turbo (here). The only downside here is that Overpass uses its own proprietary language for querying called Overpass QL. This, however, is written with C-style syntax, so the usage is not difficult to pick up. Now, given an outlet for querying OpenStreetMap, we can use TagInfo to find tags associated with each of the store locations in OpenStreetMap, allowing us to easily determine how to query via Overpass.

In [2]:

import fiona  # Must be imported before geopandas
import geopandas as gpd
import pandas as pd 
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import numpy as np

pio.renderers.default = "png"
pio.kaleido.scope.default_format = "png"
pio.kaleido.scope.default_width = 1000
pio.kaleido.scope.default_height = 714

In [3]:

"""Get our data from OpenStreetMaps and extract columns that matter to us."""

cols = [
    "addr:housenumber", "addr:street", "addr:street:name", 
    "addr:street:type", "addr:city", "addr:state", 
    "addr:full", "addr:postcode", "addr:country", "building", 
    "name", "shop", "brand", "opening_hours", "phone", 
    "operator", "start_date", "ref:store_number", "geometry",
]

gdf_azo = gpd.read_file("data/azo.geojson")[cols]
gdf_aap = gpd.read_file("data/aap.geojson")[cols]
gdf_orly = gpd.read_file("data/orly.geojson")[cols]

gdf_azo["Company"] = "AZO"
gdf_aap["Company"] = "AAP"
gdf_orly["Company"] = "ORLY"

gdf_autos = pd.concat([gdf_azo, gdf_aap, gdf_orly])

In [4]:

"""From longitude / latitude coordinates, use OSM to clean location data."""

d = dict()

from geopy.geocoders import Nominatim
import us
geolocator = Nominatim(user_agent="store-lookup", timeout=5)

def f(row):
    try:
        lat = row.geometry.centroid.y
        lon = row.geometry.centroid.x
        location = geolocator.reverse(str(lat) + "," + str(lon))
        d[len(d)] = {
            "Latitude": location.raw["lat"],
            "Longitude": location.raw["lon"],
            "Shop": location.raw["address"].get("shop"),
            "House Number": location.raw["address"].get("house_number"),
            "Road": location.raw["address"].get("road"),
            "City": location.raw["address"].get("city"),
            "County": location.raw["address"].get("county"),
            "State": us.states.lookup(location.raw["address"].get("state")).abbr,
            "State Full": location.raw["address"].get("state"),
            "Zip": location.raw["address"].get("postcode"),
            "Country": location.raw["address"].get("country"),
            "Country Code": location.raw["address"].get("country_code"),
            "Company": row["Company"],
        }
    except AttributeError:
        print(row.geometry.centroid.y, row.geometry.centroid.x)
    except Exception as e:
        print(e)
    
gdf_autos.apply(f, axis=1)
df_autos = pd.DataFrame.from_dict(d, orient="index")

18.337560680425796 -64.88950977907058

Based on these descriptive statistics of our company locations, it appears as though for all our companies, approximately 30-45% of their total locations are actually represented. Looking into OpenStreetMap's TagInfo shows that this is a problem with the underlying data not being represented or tagged, rather than a problem of querying. In the future, either scraping from official websites should give the full store count, though it is unclear if querying through Google Maps API would similarly fully represent the store locations (OpenStreetMaps has claimed to have similar coverage as Google Maps, though this is untested for our use case in particular.).

Yet at the same time, it appears all of the different companies are represented in nearly all of the states -- it's a good sign that we're seeing a breadth of representation across the United States, though we would need to do further digging to determine the depth of the representation. Some immediate questions come to mind. Is the state-level distribution accurate for each of the companies? What can we say about the qualities of the missing data? Why is over half of the location data missing?

Next, let's visualize the location data.

In [5]:

"""Plot all of the locations of each company on a map of the United States"""

fig = px.scatter_geo(
    df_autos,
    lat=df_autos["Latitude"],
    lon=df_autos["Longitude"],
    color="Company",
)
fig.update_layout(
    geo_scope="usa",
    title="Auto Part Retailer US Locations",
    title_x=0.5,
)

fig.show()

This visualization is a little hectic and given the number of locations, we can separate each of the company locations into their own maps to determine regional influence.

In [6]:

"""Trifurcate the plotting of each store location in the United States based on the company"""

fig = px.scatter_geo(
    df_autos,
    lat=df_autos["Latitude"],
    lon=df_autos["Longitude"],
    facet_col="Company",
    color="Company",
)
fig.update_layout(
    geo_scope="usa",
    geo2_scope="usa",
    geo3_scope="usa",
    title="Auto Part Retailer US Locations",
    title_x=0.5,
)

fig.show()

It appears as though each of the different auto part retailers has varying levels of geographic representation. Some quick notes on location distribution:

Each of the different companies has roughly similar total store counts (around 5,000 to 6,000 stores in the United States), and of our data, AutoZone is best represented, though each company is seeing around 35% of total store coverage represented in the data.
AutoZone seems to be most geographically diverse through the United States.
Advance Auto Parts seems to be concentrated on the east coast and does not have any locations represented in California in our data.
O'Reilly Auto Parts is also geographically diverse, but most notably, lacks in the northeast.

Let's do another quick visualization to determine the percentage of market share per state. (Here, I define market share to be a company's percentage of total auto part retailers available in the state.) In order to get an accurate location, we can use the longitude / latitude coordinates from each location (which is available in each row) to extract out more direct location data using GeoPy and Nominatim. Note that these queries will take awhile, as we are querying external sources to get location data here. Ideally, we should be caching these calls or running on a private Nominatim server, rather than the default, public one.

In [7]:

"""
Manipulate the data to determine the number of stores of each company in each state, 
as well as percentage share within the state.
"""

d = dict()
for state in df_autos["State"].unique():
    d[state] = {
        "Total Retailers": len(df_autos[df_autos["State"] == state]),
        "Count AZO": len(df_autos[(df_autos["State"] == state) & (df_autos["Company"] == "AZO")]),
        "Count AAP": len(df_autos[(df_autos["State"] == state) & (df_autos["Company"] == "AAP")]),
        "Count ORLY": len(df_autos[(df_autos["State"] == state) & (df_autos["Company"] == "ORLY")]),
    }
df_counts = pd.DataFrame.from_dict(d, orient="index")
df_counts["Percent AZO"] = df_counts["Count AZO"] / df_counts["Total Retailers"]
df_counts["Percent AAP"] = df_counts["Count AAP"] / df_counts["Total Retailers"]
df_counts["Percent ORLY"] = df_counts["Count ORLY"] / df_counts["Total Retailers"]

def highest_percentage(row):
    if row["Percent AZO"] > row["Percent AAP"] and row["Percent AZO"] > row["Percent ORLY"]:
        return "AZO"
    elif row["Percent AAP"] > row["Percent AZO"] and row["Percent AAP"] > row["Percent ORLY"]:
        return "AAP"
    elif row["Percent ORLY"] > row["Percent AZO"] and row["Percent ORLY"] > row["Percent AAP"]:
        return "ORLY"
    else:
        return None
    
df_counts["Highest Count"] = df_counts.apply(highest_percentage, axis=1)
df_counts = df_counts.sort_index().reset_index()
df_counts.columns = ["State"] + list(df_counts.columns[1:])
# df_counts["FIPS"] = df_counts["State"].apply(lambda s: str(us.states.lookup(s).fips))

In [8]:

"""Manipulate a column in df_counts to better visualize the three companies' locations"""

def get_color(row):
    """Gets the color value associated with a company's market share in the state"""
    if row["Highest Count"] == "AZO":
        return row["Percent AZO"] / 3.
    elif row["Highest Count"] == "AAP":
        return (1. / 3) + (row["Percent AAP"] / 3.)
    elif row["Highest Count"] == "ORLY":
        return (2. / 3) + (row["Percent ORLY"] / 3.)
    else:
        return None

df_counts["Color"] = df_counts.apply(get_color, axis=1)

In [9]:

"""Create a choropleth map that shows each state's dominant auto parts retailer."""

fig = px.choropleth(
    df_counts,
    locations="State", 
    locationmode="USA-states", 
    color="Color",
    color_continuous_scale=[
        (0., "white"), (0.33, "#636EFA"),  # AZO
        (0.33, "white"), (0.66, "#EF553B"),  # AAP
        (0.66, "white"), (1., "#00CC96"),  # ORLY
    ],
    scope="usa",
)

fig.update_layout(
    coloraxis_showscale=True,
    coloraxis_colorbar_tickmode="array",
    coloraxis_colorbar_tickvals=[0.38, 0.64, 0.9],
    coloraxis_colorbar_ticktext=["AZO", "AAP", "ORLY"],
    title="Auto Parts Retailers Market Share by State",
    title_x=0.5,
    showlegend=True,
)

fig.show()

This shows a pretty interesting picture of market share of each of the auto part retailers in the United States. It appears as though they each have regional dominance:

AutoZone is a weakly dominant retailer in the southwest and some of the northeast;
O'Reilly Auto Parts is a pretty strong dominant retailer in the northwest and midwest;
Advance Auto Parts is a weakly dominant retailer in the southeast and midatlantic

However, one thing to note is that there has been different types of extreme weather that have occurred throughout the United States this last quarter:

In the northwest and California region, there was a heatwave that lasted between late June and mid-July, most affecting the northwest United States and California (Wikipedia).
In the east, Hurricane Ida had most affected Louisiana and the tristate area, though left a devastating trail in between as well (Wikipedia).

One simple way we can measure impact is simply calculate the percent of stores that are in affected regions for each of our companies, which should yield some idea of how much exposure each company has to the extreme weather. This should give us a rough idea as to the impacts, and additional work can be done later to more thoroughly quantify both the extremity of the weather as well as the extent of exposure.

In [10]:

"""For each store, get the distance to the nearest competitor."""

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    All args must be of equal length.
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km


def get_nearest_competitor(df, row):
    """
    Gets the nearest distance to competitor store in kilometers.
    """
    
    if row["Company"]:
        return min(
            haversine_np(
                row["Longitude"], 
                row["Latitude"], 
                df.loc[df["Company"] != row["Company"]]["Longitude"],
                df.loc[df["Company"] != row["Company"]]["Latitude"],
            )
        )
    else:
        return None


df_autos["Longitude"] = df_autos["Longitude"].apply(float)
df_autos["Latitude"] = df_autos["Latitude"].apply(float)
df_autos["Nearest Competitor"] = df_autos.apply(lambda row: get_nearest_competitor(df_autos, row), axis=1)

In [11]:

"""
For each company, get their percentages of stores exposed to the 
areas that have experienced heatwaves and the areas that have been
affected by Hurricane Ida. Additionally, of the exposed stores, get 
the percent that are close to competitors.
"""

d = dict()

heatwave_states = ["OR", "WA", "CA", "ID", "MT"]
ida_states = ["LA", "MS", "AL", "PA", "NJ", "NY", "MD", "VA", "CT"]

for company in df_autos["Company"].unique():
    n_heatwave = len(df_autos.loc[
        (df_autos["Company"] == company) & 
        (df_autos["State"].isin(heatwave_states))
    ])
    n_ida = len(df_autos.loc[
        (df_autos["Company"] == company) & 
        (df_autos["State"].isin(ida_states))
    ])
    n_exposed = len(df_autos.loc[
        (df_autos["Company"] == company) & 
        (df_autos["State"].isin(heatwave_states + ida_states))
    ])
    n_exposed_1 = len(df_autos.loc[
        (df_autos["Company"] == company) & 
        (df_autos["State"].isin(heatwave_states + ida_states)) & 
        (df_autos["Nearest Competitor"] < 1)
    ])
    n_exposed_5 = len(df_autos.loc[
        (df_autos["Company"] == company) & 
        (df_autos["State"].isin(heatwave_states + ida_states)) & 
        (df_autos["Nearest Competitor"] < 5)
    ])
    n_total = len(df_autos.loc[df_autos["Company"] == company])
    
    d[company] = {
        "% Stores in Heatwave": str(int(n_heatwave / n_total * 100)) + "%", 
        "% Stores in Hurricane Ida": str(int(n_ida / n_total * 100)) + "%",
        "% Stores in Extreme Weather": str(int(n_exposed / n_total * 100)) + "%",
        "% of Exposed Stores within 1km of Competitor": str(int(n_exposed_1 / n_exposed * 100)) + "%",
        "% of Exposed Stores within 5km of Competitor": str(int(n_exposed_5 / n_exposed * 100)) + "%",
    }

df_exposures = pd.DataFrame.from_dict(d, orient="index")
df_exposures = df_exposures.reset_index()
df_exposures.columns = ["Company"] + list(df_exposures.columns[1:])

In [12]:

"""Create and colorize a table of companies' extreme weather exposure and proximity to competitors."""

colors = ["#FCBABA", "#FFCCCB", "#FCF6E1", "#E4F2D5", "#CBE8BE", "#B1D9A3"]

def get_colors(series):
    """
    Given a series of numbers, return a new series of colors based on min-max normalization.
    """
    if series.name == "Company":
        return series.apply(lambda x: "#FFFFFF")
    series = series.apply(lambda x: int(x[:-1]))
    idxes = (series - series.min()) / (series.max() - series.min())
    idxes = idxes.apply(lambda x: round(x * (len(colors) - 1)))
    return idxes.apply(lambda x: colors[x])
    
    
fig = go.Figure(
    data=[
        go.Table(
            header=dict(
                values=list(df_exposures.columns), 
                align="center",
            ),
            cells=dict(
                values=[df_exposures[key] for key in df_exposures.columns],
                fill_color=[get_colors(df_exposures[key]) for key in df_exposures.columns],
                align="center",
            )
        )
    ]
)

fig.show()

Conclusion¶

Interestingly, it looks like Advance Auto Parts and O'Reilly Auto Parts have inverted exposures to the extreme weather, while AutoZone has medium exposure to both Hurricane Ida and the heatwave:

Advance Auto Parts has very little store exposure to the heatwave in the northwest, but high concentration of stores exposed to Hurricane Ida;
O'Reilly Auto Parts has high exposure to stores in the heatwave in the northwest, but few stores exposed to Hurricane Ida, resulting in the lowest total percentage of stores in extreme weather;
AutoZone seems to be a middleground, having the highest percentage of their total store exposed to some form of extreme weather, having a medium number of stores exposed to both the heatwave and Hurricane Ida.

Additionally, I added some metric of the exposed stores that are within some distance of competitors to determine how strong market dominance is for their store locations. Based on these metrics, it looks as though O'Reilly Auto Parts has the highest percentage of exposed stores located close to some competitor.

If we want to determine the biggest winner or loser regarding extreme weather in the last quarter, there are a few natural questions that arise.

To what extent does heat drive sales versus heavy wind and rainfall?
- O'Reilly Auto Parts has the highest percentage of stores exposed to the northwest heatwave, but lowest percentage of stores exposed to Hurricane Ida. Nevertheless, if extreme heat is what drives sales (e.g., driving antifreeze purchases), it might be more impactful than extreme rain and wind, resulting in O'Reilly Auto Parts coming out as the biggest winner here.
- AutoZone and Advance Auto Parts both see higher percentages of stores exposed to Hurricane Ida, but lower percentages of stores exposed to the northwest heatwave. If extreme rain and wind are what drive sales (e.g., damage to cars), it might be more impactful than extreme heat, resulting in AutoZone and Advance Auto Parts coming out as the biggest winners here.
To what extent does competition hurt sales?
- Naturally, competition hurts sales. But the nuance here could be that a company may only place stores near competitors if the target addressable market is disproportionately large in the area. If, however, a company places stores near competitors in areas with small addressable markets, then both companies lose.

Both these questions are very nuanced and in themselves can drive completely separate analyses. If we were to continue a deep dive into these questions, they would be directly answerable by regressing using consumer credit card data enriched with more acute weather data.