Overview
Back in August 2021, the New York Times published an article painting a damning picture of COVID's effects on kindergarten enrollment. The article makes the broad claim that fall kindergarten enrollment fell pretty much across the board, but particularly in low-income neighborhoods:
The analysis by The New York Times in conjunction with Stanford University shows that in those 33 states, 10,000 local public schools lost at least 20 percent of their kindergartners. In 2019 and in 2018, only 4,000 or so schools experienced such steep drops.
The months of closed classrooms took a toll on nearly all students, and families of all levels of income and education scrambled to help their children make up for the gaps. But the most startling declines were in neighborhoods below and just above the poverty line, where the average household income for a family of four was 35k or less. The drop was 28 percent larger in schools in those communities than in the rest of the country.
They go on to mention that the primary cause of this decline in enrollment is remote learning: "The data covered two-thirds of all public schools. [....] It showed that remote schooling was a main factor driving enrollment declines." Of course, this enrollment change leads to a plethora of negative impacts for some of our most vulnerable students. Kindergarten students get their first introduction to education, learn how to cooperate with one another, and get a sense of some fundamental principles (e.g., identifying letters and numbers) needed for future learning. This enrollment decline, in essence, lowers young children's education readiness and starts off their schooling experience one step behind.
However, this decline, especially among lower income areas, is actually non-intuitive for a few reasons.
First, we can expect that the lowest income areas are least capable of providing at-home child care. A lot of low-income households have frontline workers, and frontline workers are least capable of remote work. Furthermore, a parent in a low-income household might also be strapped most for cash -- how, then, can they take extended time off of work to take care of children? You may argue that the unemployment numbers looked quite awful at the beginning of COVID (March 2020 through the summer), which might allow many workers to spend more time with their younger children at home. But by September, many COVID relief measures (e.g., unemployment insurance) had begun to expire, and even more, we can see that employment had rebounded strongly in time for fall enrollment in schools.
Second, remote learning is less accessible in low income areas. Schools are less capable of providing students with resources needed (e.g., school-issued laptops), and low income areas are strapped for cash and may have more problems providing an environment for their children to remote learn (e.g., internet access, increased costs to stay at home, increased food costs).
These insights challenge the extent to which low-income areas are keeping their young children home, opting to skip out on enrolling their children in kindergarten. Can we take a look at the data to verify the New York Times' reporting?
Fact Checking NYT: Is Kindergarten Enrollment Generally Lower?
This question is quite straightforward to check. On the whole, across all the data and across the entirety of the United States, can we look at whether kindergarten enrollment declined? To what extent has enrollment declined? Can we see any obvious trends here?
First, we must extract the data and bring it into a usable form:
# Import packages we will use.
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.io as pio
from urllib.request import urlopen
import json
import us
pio.renderers.default = "png"
pio.kaleido.scope.default_width = 1000
pio.kaleido.scope.default_height = 714
# Import relevant data.
df_fips = pd.read_csv("data/county_state_to_fips.csv") # Obtained from https://github.com/kjhealy/fips-codes
df_household_size = pd.read_csv(
"data/household_size_by_county.csv",
) # Obtained from https://covid19.census.gov/datasets/USCensus::average-household-size-and-population-density-county
df_personal_income = pd.read_csv(
"data/personal_income_by_county_state.csv",
header=[1, 2, 3],
) # Obtained from https://www.bea.gov/news/2021/personal-income-county-and-metropolitan-area-2020
df_district_ids = pd.read_csv(
"data/nces_district_id_to_county_state.csv",
encoding="ISO8859-1",
) # Obtained from https://nces.ed.gov/ccd/address.asp
df_enrollments = pd.read_csv(
"data/enrollments_by_grade_and_district.csv",
) # Obtained from https://purl.stanford.edu/zf406jp4427
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
counties = json.load(response)
# Massage the data
df_fips["fips"] = df_fips["fips"].apply(lambda x: str(x).zfill(5))
df_fips["state-name"] = df_fips["state"] + "-" + df_fips["name"]
df_district_ids["state-name"] = (
df_district_ids["mstate09"] + "-" + df_district_ids["coname09"].apply(lambda x: x.title())
)
df_district_ids = df_district_ids.merge(df_fips[["fips", "state-name"]], how="left", on="state-name")
df_personal_income = df_personal_income.dropna(how="all")
df_personal_income = df_personal_income.drop(index=list(df_personal_income.index)[-5:])
df_personal_income = df_personal_income.drop(columns=df_personal_income.columns[-4:-1])
df_personal_income.columns = [
"County",
"2018 Avg Income",
"2019 Avg Income",
"2020 Avg Income",
"2020 Rank in State",
"State"
]
df_personal_income = df_personal_income.dropna(how="any", subset=["State"])
df_personal_income["county-state"] = (
df_personal_income["County"].apply(str.lower) +
" county, " +
df_personal_income["State"].apply(str.lower)
)
df_household_size["fips"] = df_household_size["GEOID"].apply(lambda x: str(x).zfill(5))
df_household_size["county-state"] = (
df_household_size["NAME"].apply(str.lower) + ", " + df_household_size["State"].apply(str.lower)
)
df_household_size = df_household_size.merge(
df_personal_income,
how="left",
on="county-state"
)
df_household_size = df_household_size.dropna(subset=["2018 Avg Income", "2019 Avg Income", "2020 Avg Income"])
df_household_size["2018 Avg Income"] = df_household_size["2018 Avg Income"].apply(lambda x: int(x.replace(',', '')))
df_household_size["2019 Avg Income"] = df_household_size["2019 Avg Income"].apply(lambda x: int(x.replace(',', '')))
df_household_size["2020 Avg Income"] = df_household_size["2020 Avg Income"].apply(lambda x: int(x.replace(',', '')))
df_household_size["2020 Rank in State"] = df_household_size["2020 Rank in State"].apply(lambda x: int(x))
df_household_size["state_abbrev"] = df_household_size["State_x"].apply(
lambda x: us.states.lookup(x).abbr
)
cols = [
"state", "county", "district_nces_id", "district",
"year", "term", "grade", "male", "female", "low_income", "total"
]
df_enrollments = df_enrollments[cols]
df_enrollments = df_enrollments.merge(
df_district_ids[["leaid", "fips"]],
how="left",
left_on="district_nces_id",
right_on="leaid"
)
cols = ["male", "female", "low_income", "total"]
df_enrollments[cols] = df_enrollments[cols].apply(pd.to_numeric, errors='coerce')
# Grab just kindergarten enrollments from data
df_enrollments_kindergarten = df_enrollments.loc[
(df_enrollments["grade"] == "kindergarten") |
(df_enrollments["grade"] == "combined_K") |
(df_enrollments["grade"] == "kindergarten_full_day") |
(df_enrollments["grade"] == "kindergarten_half_day")
]
# Create a list of districts that have data for both 2020 and 2021, at least
constant_districts = df_enrollments_kindergarten.groupby("district").filter(
lambda x: (x["year"].min() <= 2020) & (x["year"].max() >= 2021)
)["district"].unique()
df_enrollments_kindergarten = df_enrollments_kindergarten.loc[
df_enrollments_kindergarten["district"].isin(constant_districts)
]
# Aggregate by state and calculate YoY change in enrollment
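# Sum only the numeric columns; text columns such as district names cannot be meaningfully summed
df_k_enrollments_by_state = df_enrollments_kindergarten.groupby(["state", "year"]).sum(numeric_only=True)
df_k_enrollments_by_state = df_k_enrollments_by_state.reset_index()
# Compute the percent change on the "total" column only, within each state
df_k_enrollments_by_state["total_yoy"] = df_k_enrollments_by_state.groupby("state")["total"].pct_change()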
# Aggregate by county and calculate YoY change in enrollment
df_k_enrollments_by_fips = df_enrollments_kindergarten.groupby(["state", "fips", "year"]).sum(numeric_only=True)
df_k_enrollments_by_fips = df_k_enrollments_by_fips.reset_index()
df_k_enrollments_by_fips = df_k_enrollments_by_fips.dropna(subset=["fips", "year", "total", "state"])
# Expand the index to include all years for each FIPS code
idx = pd.MultiIndex.from_product(
(df_k_enrollments_by_fips["fips"].unique(), df_k_enrollments_by_fips["year"].unique()),
names=["fips", "year"]
)
df_k_enrollments_by_fips = df_k_enrollments_by_fips.set_index(["fips", "year"])
df_k_enrollments_by_fips = df_k_enrollments_by_fips.reindex(idx).reset_index()
df_k_enrollments_by_fips = df_k_enrollments_by_fips.sort_values(by=["fips", "year"])
df_k_enrollments_by_fips["total_yoy"] = (
df_k_enrollments_by_fips.groupby("fips")["total"].pct_change(fill_method=None)
)
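The reindexing step above matters because `pct_change` compares each row with the row immediately before it, whatever year that row belongs to. The sketch below uses made-up numbers (the toy counties and totals are assumptions, not values from the data) to show how a missing year would otherwise turn a two-year change into an apparent one-year change.

```python
import pandas as pd

# Toy data (made up): county "00002" has no row for 2020.
toy = pd.DataFrame({
    "fips": ["00001", "00001", "00001", "00002", "00002"],
    "year": [2019, 2020, 2021, 2019, 2021],
    "total": [100.0, 110.0, 99.0, 50.0, 40.0],
})

# Naive approach: the 2021 row for "00002" is compared against 2019.
naive = toy.groupby("fips")["total"].pct_change(fill_method=None)

# Expand to every (fips, year) pair so that missing years become NaN rows.
full_index = pd.MultiIndex.from_product(
    (toy["fips"].unique(), sorted(toy["year"].unique())),
    names=["fips", "year"],
)
toy_full = toy.set_index(["fips", "year"]).reindex(full_index).reset_index()

# With fill_method=None, a missing prior year yields NaN instead of a
# change computed against stale data.
toy_full["total_yoy"] = (
    toy_full.groupby("fips")["total"].pct_change(fill_method=None)
)
print(toy_full)
```

In the naive version, county "00002" appears to drop 20% in 2021, but that figure spans two years. After reindexing, its 2021 change is NaN and can be dropped later.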
Now that we have the data, we can do a quick visualization of the kindergarten enrollment year-over-year change for the year starting in fall 2020.
df_temp = df_k_enrollments_by_state.loc[df_k_enrollments_by_state["year"] == 2021]
fig = px.choropleth(
df_temp,
locationmode="USA-states",
locations="state",
scope="usa",
color="total_yoy",
color_continuous_scale="RdBu",
color_continuous_midpoint=0,
)
fig.update_layout(
title="Kindergarten YoY Enrollment by State",
title_x=0.5,
coloraxis_colorbar_tickformat="0%",
coloraxis_colorbar_title="Enrollment YoY",
)
fig.show()
On the whole, almost all states in the data show a year-over-year decrease in kindergarten enrollment in the fall of 2020. The one exception is New Jersey, where enrollment increased. Just eyeballing the numbers, many of these states have decreases over 10%! This is a pretty wild decline in kindergarten enrollment, and it seems to be what the NYT is referring to on the whole.
Fact Checking NYT: Is Kindergarten Enrollment Lower in Low-Income Areas?
We've already seen that kindergarten enrollment is in general decline across the United States. But if we are to bifurcate the data into impoverished areas and non-impoverished areas, there are several questions by way of methodology:
1. How do we define impoverished or "low-income"?
   - The New York Times has broadly defined a low-income area as a county with an average income lower than 35k, but this seems to be a very arbitrary amount.
2. Should our definition account for the states' different costs of living?
   - For example, living costs in California are higher than in Mississippi. Are we to take these two states to have similar poverty levels?
3. At what level should we aggregate an area?
   - We don't want areas that are so small that we will see sample size problems (e.g., year-over-year increases over 100% for counties with low student population), but we also don't want to see areas that are so big that we can't get meaningful relationships with income.
With regards to point (1), it may make more sense to define an impoverished area by income percentile. The bottom 1 or 2 percent may be a more meaningful group when addressing poverty. On the whole, 35k is an arbitrary, but reasonable, bar for poverty, but Wikipedia notes that the poverty threshold "for a single person under 65 was an annual income of 12k, or about 35 dollars per day. The threshold for a family group of four, including two children, was 26k, about 72 dollars per day." Both of these numbers are far below the line that the New York Times has led us to believe marks poverty. Additionally, income dynamics below the poverty line are themselves far from uniform: only a small percentage of households below the poverty line are in destitution, and those households will very likely behave quite differently around child care and education than the general poverty group does.
With regards to point (2), looking at a percentile of income within each particular state might make sense to account for cost of living.
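As a rough illustration of that idea, the sketch below ranks counties by income within their own state rather than nationally. The incomes, county labels, column names, and the 25% cut-off are all made up for the example; this is one possible approach, not the method used in the rest of this analysis.

```python
import pandas as pd

# Made-up incomes for illustration only.
toy = pd.DataFrame({
    "state": ["CA", "CA", "CA", "CA", "MS", "MS", "MS", "MS"],
    "county": ["a", "b", "c", "d", "e", "f", "g", "h"],
    "avg_income": [40_000, 55_000, 70_000, 90_000,
                   28_000, 33_000, 38_000, 45_000],
})

# Percentile rank across every county in the country.
toy["national_pct"] = toy["avg_income"].rank(pct=True)

# Percentile rank among counties in the same state only.
toy["in_state_pct"] = toy.groupby("state")["avg_income"].rank(pct=True)

# Hypothetical definition: bottom quarter of counties within the state.
toy["low_income_in_state"] = toy["in_state_pct"] <= 0.25
print(toy)
```

In this toy data, county "a" sits at the national median but is the poorest county in its state, so a within-state definition flags it while a single national cut-off would not.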
Lastly, with regards to point (3), we investigate both county-level and state-level change in enrollment, as these two levels are given in the data, but it might also be interesting to look at different swathes of areas. For example, in many rural counties, there are actually very few kindergarten students on the whole, which can lead to some pretty drastic year-over-year changes (e.g., going from 3 students to 10 students would yield a 233% increase in enrollment!).
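The small-base problem is easy to reproduce. The sketch below uses two made-up counties and a hypothetical minimum-enrollment cut-off (`MIN_STUDENTS` is an assumption introduced here; the analysis below instead drops changes above 100%).

```python
import pandas as pd

# Two made-up counties: one tiny, one large.
toy = pd.DataFrame({
    "county": ["tiny", "large"],
    "total_prev": [3, 3000],
    "total_curr": [10, 2700],
})
toy["total_yoy"] = toy["total_curr"] / toy["total_prev"] - 1

# Hypothetical filter: keep only counties with a large enough base year.
MIN_STUDENTS = 50
stable = toy.loc[toy["total_prev"] >= MIN_STUDENTS]
print(toy)
print(stable)
```

Seven additional students produce a 233% increase in the tiny county, while a loss of 300 students in the large county registers as only a 10% decline.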
For now, we choose to look at county- and state-level income and enrollment dynamics, focusing on both the arbitrary 35k line that the NYT has imposed, and also percentile of incomes, both in-state and across the entirety of the United States.
Impact on Counties Below 35k Average Income
# Select counties with average income of at most $35,000
df_poverty_counties = df_household_size.loc[df_household_size["2020 Avg Income"] <= 35000]
# Compare against the FIPS codes themselves, not the DataFrame's column labels
df_nonpoverty_counties = df_household_size.loc[~df_household_size["fips"].isin(df_poverty_counties["fips"])]
df_temp = df_k_enrollments_by_fips.merge(
df_household_size[[
"NAME", "State_x", "fips",
"B25010_001E", "B25010_002E", "B25010_003E",
"B01001_001E", "B01001_calc_PopDensity",
"2018 Avg Income", "2019 Avg Income", "2020 Avg Income",
"2020 Rank in State",
]],
how="left",
on="fips",
)
df_temp = df_temp.loc[df_temp["year"] == 2021]
df_temp["Impoverished"] = df_temp["fips"].isin(df_poverty_counties["fips"])
df_temp = df_temp.replace([np.inf, -np.inf], np.nan)
df_temp = df_temp.dropna(subset=["total_yoy", "2020 Avg Income"])
fig = px.scatter(
df_temp.loc[
(df_temp["2020 Avg Income"] <= df_temp["2020 Avg Income"].quantile(0.98)) &
(df_temp["2020 Avg Income"] >= df_temp["2020 Avg Income"].quantile(0.02)) &
(df_temp["total_yoy"] <= 1)
], # Account for > 100% increases in county class size
x="2020 Avg Income",
y="total_yoy",
trendline="ols",
trendline_color_override="darkblue",
)
fig.update_layout(
title="County Avg Income vs. K-Enrollment YoY Change",
title_x=0.5,
xaxis_tickformat="$,",
yaxis_title="K-Enrollment YoY Change",
yaxis_tickformat="0%",
)
fig.show()
print(
f"Percentage of impoverished counties with kindergarten enrollment decrease: {len(df_temp.loc[(df_temp['Impoverished']) & (df_temp['total_yoy'] < 0)]) / len(df_temp.loc[df_temp['Impoverished']]):.2%}"
)
print(
f"Percentage of non-impoverished counties with kindergarten enrollment decrease: {len(df_temp.loc[~(df_temp['Impoverished']) & (df_temp['total_yoy'] < 0)]) / len(df_temp.loc[~df_temp['Impoverished']]):.2%}"
)
Regressing kindergarten enrollment change on income, we don't see a strong correlation between county income and kindergarten enrollment. However, counting counties, the percentage of non-impoverished counties with kindergarten enrollment declines was actually higher than that of impoverished counties. While this alone doesn't tell us anything about how income affects the magnitude of the enrollment change, we can see that impoverished counties (below 35k in average household income) did not see enrollment declines more often than non-impoverished counties did.
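Because the share of counties with a decline says nothing about the size of those declines, it can help to report both side by side. The helper below is a hypothetical sketch (`summarize_by_group`, the group labels, and the toy numbers are not part of the original analysis) that does this for any grouping column.

```python
import pandas as pd

def summarize_by_group(df, group_col, yoy_col="total_yoy"):
    """Return the count, share with a decline, and median change per group."""
    grouped = df.groupby(group_col)[yoy_col]
    return pd.DataFrame({
        "n_counties": grouped.size(),
        "share_declined": grouped.apply(lambda s: (s < 0).mean()),
        "median_yoy": grouped.median(),
    })

# Made-up year-over-year changes for illustration only.
toy = pd.DataFrame({
    "income_group": ["below_35k"] * 3 + ["above_35k"] * 4,
    "total_yoy": [-0.20, -0.15, 0.05, -0.02, -0.03, -0.01, 0.04],
})
summary = summarize_by_group(toy, "income_group")
print(summary)
```

In this toy data the higher-income group has the larger share of declining counties (75% versus 67%), yet its median decline is far smaller (1.5% versus 15%), which is exactly the distinction the share alone cannot show.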
Impact on Bottom Percentile Counties
The picture for counties with below 35k in average household income could, at a glance, still match the New York Times' depiction. That is, we still haven't investigated the extent to which low-income households aren't sending kids back to school. What we have discovered, though, is that low-income counties (once again, 35k or less) seem to have sent their kindergarten students back to school more frequently than did their higher-income counterparts.
However, this 35k threshold is quite arbitrary. First, it does not match consensus poverty threshold numbers. Second, it does not take into account cost of living, so 35k might go further in Mississippi than in California, for example.
The first concern can be addressed by instead looking into bottom percentile counties, rather than adhering to an arbitrary standard.
df_temp = df_k_enrollments_by_fips.merge(
df_household_size[[
"NAME", "State_x", "fips",
"B25010_001E", "B25010_002E", "B25010_003E",
"B01001_001E", "B01001_calc_PopDensity",
"2018 Avg Income", "2019 Avg Income", "2020 Avg Income",
"2020 Rank in State",
]],
how="left",
on="fips",
)
df_temp = df_temp.loc[df_temp["year"] == 2021]
df_temp = df_temp.replace([np.inf, -np.inf], np.nan)
df_temp = df_temp.dropna(subset=["total_yoy", "2020 Avg Income"])
df_temp = df_temp.loc[
(df_temp["2020 Avg Income"] <= df_temp["2020 Avg Income"].quantile(0.98)) &
(df_temp["2020 Avg Income"] >= df_temp["2020 Avg Income"].quantile(0.02)) &
(df_temp["total_yoy"] <= 1)
] # Account for > 100% increases in county class size
fig = px.scatter(
df_temp,
x="2020 Avg Income",
y="total_yoy",
trendline="ols",
trendline_color_override="darkblue",
)
fig.update_layout(
title="County Avg Income vs. K-Enrollment YoY Change (Income Middle 96 Percentile)",
title_x=0.5,
xaxis_tickformat="$,",
yaxis_title="K-Enrollment YoY Change",
yaxis_tickformat="0%",
)
fig.show()
print(
f"Percentage of middle-income counties with kindergarten enrollment decrease: {len(df_temp.loc[df_temp['total_yoy'] < 0]) / len(df_temp):.2%}"
)
Looking at the middle 96 percent of United States counties by income, there don't seem to be any noticeable patterns. But what about the top 2 percent and bottom 2 percent?
df_temp = df_k_enrollments_by_fips.merge(
df_household_size[[
"NAME", "State_x", "fips",
"B25010_001E", "B25010_002E", "B25010_003E",
"B01001_001E", "B01001_calc_PopDensity",
"2018 Avg Income", "2019 Avg Income", "2020 Avg Income",
"2020 Rank in State",
]],
how="left",
on="fips",
)
df_temp = df_temp.loc[df_temp["year"] == 2021]
df_temp = df_temp.replace([np.inf, -np.inf], np.nan)
df_temp = df_temp.dropna(subset=["total_yoy", "2020 Avg Income"])
df_temp = df_temp.loc[
(df_temp["2020 Avg Income"] >= df_temp["2020 Avg Income"].quantile(0.98)) &
(df_temp["total_yoy"] <= 1)
] # Account for > 100% increases in county class size
fig = px.scatter(
df_temp,
x="2020 Avg Income",
y="total_yoy",
trendline="ols",
trendline_color_override="darkblue",
)
fig.update_layout(
title="County Avg Income vs. K-Enrollment YoY Change (Top 2%)",
title_x=0.5,
xaxis_tickformat="$,",
yaxis_title="K-Enrollment YoY Change",
yaxis_tickformat="0%",
)
fig.show()
print(
f"Percentage of high-income counties with kindergarten enrollment decrease: {len(df_temp.loc[df_temp['total_yoy'] < 0]) / len(df_temp):.2%}"
)
Once again, it doesn't seem like we can easily extrapolate a trendline among the richest counties, as many of these counties are centered around 90k average household income and very few are much higher than that. However, we notice that the percentage of high-income counties with kindergarten enrollment decreases is much higher than in the middle-income counties.
df_temp = df_k_enrollments_by_fips.merge(
df_household_size[[
"NAME", "State_x", "fips",
"B25010_001E", "B25010_002E", "B25010_003E",
"B01001_001E", "B01001_calc_PopDensity",
"2018 Avg Income", "2019 Avg Income", "2020 Avg Income",
"2020 Rank in State",
]],
how="left",
on="fips",
)
df_temp = df_temp.loc[df_temp["year"] == 2021]
df_temp = df_temp.replace([np.inf, -np.inf], np.nan)
df_temp = df_temp.dropna(subset=["total_yoy", "2020 Avg Income"])
df_temp = df_temp.loc[
(df_temp["2020 Avg Income"] <= df_temp["2020 Avg Income"].quantile(0.02)) &
(df_temp["total_yoy"] <= 1)
] # Account for > 100% increases in county class size
fig = px.scatter(
df_temp,
x="2020 Avg Income",
y="total_yoy",
trendline="ols",
trendline_color_override="darkblue",
)
fig.update_layout(
title="County Avg Income vs. K-Enrollment YoY Change (Bottom 2%)",
title_x=0.5,
xaxis_tickformat="$,",
yaxis_title="K-Enrollment YoY Change",
yaxis_tickformat="0%",
)
fig.show()
print(
f"Percentage of low-income counties with kindergarten enrollment decrease: {len(df_temp.loc[df_temp['total_yoy'] < 0]) / len(df_temp):.2%}"
)
Within the most impoverished counties in America, it's difficult to tell whether there is any statistically significant trend; among the lowest-income areas, most counties have around 30k average household income, and only a few fall much below this 30k mark. However, the low-income counties have a substantially higher percentage with kindergarten enrollment increases than the middle-income and high-income counties do. Only 64% of low-income counties saw kindergarten enrollment decreases, while 83% of high-income counties saw kindergarten enrollment decreases.
This follows the narrative we hypothesized: low-income counties are less capable of keeping their kindergarten-aged children at home.
However, let's see what this general trend actually looks like if we categorize less arbitrarily. We can bucket counties by quartile of household income and see if there exists a trend there.
df_temp = df_k_enrollments_by_fips.merge(
df_household_size[[
"NAME", "State_x", "fips",
"B25010_001E", "B25010_002E", "B25010_003E",
"B01001_001E", "B01001_calc_PopDensity",
"2018 Avg Income", "2019 Avg Income", "2020 Avg Income",
"2020 Rank in State",
]],
how="left",
on="fips",
)
df_temp = df_temp.loc[df_temp["year"] == 2021]
df_temp = df_temp.replace([np.inf, -np.inf], np.nan)
df_temp = df_temp.dropna(subset=["total_yoy", "2020 Avg Income"])
df_temp = df_temp.loc[df_temp["total_yoy"] <= 1]
df_temp["Quartile"] = pd.qcut(df_temp["2020 Avg Income"], 4, labels=list(range(1, 5)))
df_temp = df_temp.groupby("Quartile").apply(lambda x: len(x.loc[x["total_yoy"] < 0]) / len(x))
df_temp = df_temp.reset_index()
df_temp.columns = ["Quartile", "Percentage Declined"]
fig = px.bar(
df_temp,
x="Quartile",
y="Percentage Declined",
)
fig.update_layout(
title="Percentage of Counties with Enrollment Decline by Quartile",
title_x=0.5,
xaxis_nticks=5,
yaxis_title="Percentage Declined",
yaxis_tickformat="0%",
)
fig.show()
Using a less arbitrary categorization for income bracketing (i.e., not just an arbitrary cut-off at 35k per year), it seems that the relationship does not hold very well. The lowest bracket does seem to have the lowest percentage of counties with a kindergarten enrollment decline, but not by a significant margin.
Conclusion
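One way to put a number on "significant margin" is a two-proportion z-test on the counts behind two of the bars. The sketch below is only illustrative: the counts are made up, the function name is hypothetical, and the test treats counties as independent observations, which is a strong assumption.

```python
import math

def two_proportion_z(declined_a, n_a, declined_b, n_b):
    """Two-sided z-test for a difference between two proportions."""
    p_a = declined_a / n_a
    p_b = declined_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (declined_a + declined_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up counts: 140 of 200 counties declined in one quartile,
# 150 of 200 in another.
z, p = two_proportion_z(declined_a=140, n_a=200, declined_b=150, n_b=200)
print(f"z = {z:.2f}, p = {p:.2f}")
```

With these made-up counts, a five-point gap between quartiles would not be distinguishable from noise at conventional thresholds; the real counts would need to be plugged in to say anything about the actual data.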
While there does seem to be an across-the-board decline in kindergarten enrollment as a result of COVID measures, the narrative that the New York Times pushes seems to show only part of the picture. Yes, kindergarten enrollment during COVID seems to have dropped steeply across almost the entire nation, with the worst declines in states in the northeast, Pacific northwest, and California. However, it does not appear to me that the lowest-income counties were worst hit. There are fundamental reasons why this might be the case: (1) low-income households are worst equipped to deal with child care, and (2) remote learning is least accessible in low-income areas.
And the data seems to point to this being the case.
Surely, some low-income counties were hard hit. Using the thresholds set by The New York Times, the most impoverished counties seemed to see fewer enrollment declines than did the most wealthy counties. But even taking less arbitrary income bracketing, I just don't see a strong case that the lowest-income counties are enrolling students at a much lower rate.