Processing StatCan Data

Setup

Parameters

from pyprojroot import here
LABOUR_DATA_FILE = here() / "data" / "14100355.csv"

Libraries

import polars as pl
import polars.selectors as cs
from mizani.bounds import squish
from pyprojroot import here
from great_tables import GT, md, html
from plotnine import *
from labourcan.data_processing import read_labourcan

Read data

read_labourcan returns a polars dataframe with:

  • Unnecessary metadata columns removed
  • Filtered to seasonally adjusted estimates only
  • Additional YEAR, MONTH, and DATE_YMD columns extracted from REF_DATE
  • Sorted chronologically by year and month
labour = read_labourcan(LABOUR_DATA_FILE)
labour.glimpse()
Rows: 12252
Columns: 10
$ REF_DATE    <str> '1976-01', '1976-01', '1976-01', '1976-01', '1976-01', '1976-01', '1976-01', '1976-01', '1976-01', '1976-01'
$ GEO         <str> 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada'
$ Industry    <str> 'Total employed, all industries', 'Goods-producing sector', 'Agriculture [111-112, 1100, 1151-1152]', 'Forestry, fishing, mining, quarrying, oil and gas [21, 113-114, 1153, 2100]', 'Utilities [22]', 'Construction [23]', 'Manufacturing [31-33]', 'Services-producing sector', 'Wholesale and retail trade [41, 44-45]', 'Transportation and warehousing [48-49]'
$ Statistics  <str> 'Estimate', 'Estimate', 'Estimate', 'Estimate', 'Estimate', 'Estimate', 'Estimate', 'Estimate', 'Estimate', 'Estimate'
$ Data type   <str> 'Seasonally adjusted', 'Seasonally adjusted', 'Seasonally adjusted', 'Seasonally adjusted', 'Seasonally adjusted', 'Seasonally adjusted', 'Seasonally adjusted', 'Seasonally adjusted', 'Seasonally adjusted', 'Seasonally adjusted'
$ UOM         <str> 'Persons in thousands', 'Persons in thousands', 'Persons in thousands', 'Persons in thousands', 'Persons in thousands', 'Persons in thousands', 'Persons in thousands', 'Persons in thousands', 'Persons in thousands', 'Persons in thousands'
$ VALUE       <f64> 9636.7, 3312.5, 463.6, 244.2, 110.4, 654.9, 1839.5, 6324.1, 1592.9, 573.2
$ YEAR        <i32> 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976
$ MONTH       <i32> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
$ DATE_YMD   <date> 1976-01-01, 1976-01-01, 1976-01-01, 1976-01-01, 1976-01-01, 1976-01-01, 1976-01-01, 1976-01-01, 1976-01-01, 1976-01-01

Process data

For our graphic, we need to derive one variable: the signed centered rank of the % change in the number of jobs.

"Signed" means the rank is computed over negative and positive % changes separately, and in opposite directions (descending for negatives, ascending for positives).

So the highest negative value (closest to 0) gets rank = -1, and the lowest positive value (closest to 0) gets rank = 1.

For example:

  • Input is a vector of % changes, which can be positive or negative: [-0.01, -0.02, 0.01, 0.02]
  • Output: [-1, -2, 1, 2]
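The same rule in plain Python, as a reference implementation of the example above (ties between identical values are not handled, which is fine for illustration):

```python
def signed_centered_rank(values):
    # negatives ranked descending, positives ascending, both starting
    # from the value closest to zero
    neg = sorted((v for v in values if v < 0), reverse=True)
    pos = sorted(v for v in values if v > 0)
    rank = {v: -(i + 1) for i, v in enumerate(neg)}
    rank.update({v: i + 1 for i, v in enumerate(pos)})
    return [0 if v == 0 else rank[v] for v in values]

signed_centered_rank([-0.01, -0.02, 0.01, 0.02])  # -> [-1, -2, 1, 2]
```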

% Change per month

First, compute % change from previous month. This needs to be done over different subsets of data:

  • Industry
  • Geolocation
  • Labour force characteristic (if provided)
  • Gender
  • Age group

In the seasonally adjusted dataset, only Industry and Geolocation vary: the labour force characteristic is fixed at total employment, gender at both, and age group at all ages.

labour_processed = (
    # if we sort ascending by time, then the lagged value is the month before
    labour.sort(["Industry", "YEAR", "MONTH"])
    .with_columns(
        LAGGED_VALUE=pl.col("VALUE")
        .shift(1)
        .over(["Industry"])
    )
    # compute percent difference
    .with_columns((pl.col("VALUE") - pl.col("LAGGED_VALUE")).alias("DIFF"))
    .with_columns((pl.col("DIFF") / pl.col("LAGGED_VALUE")).alias("PDIFF"))
    .select(
        pl.col("Industry"),
        cs.matches("Labour force characteristics"),
        pl.col("DATE_YMD"),
        pl.col("YEAR"),
        pl.col("MONTH"),
        cs.matches("VALUE"),
        cs.matches("DIFF"),
    )
    .sort(["Industry", "YEAR", "MONTH", "PDIFF"])
)
labour_processed.glimpse()
Rows: 12252
Columns: 8
$ Industry      <str> 'Accommodation and food services [72]', 'Accommodation and food services [72]', 'Accommodation and food services [72]', 'Accommodation and food services [72]', 'Accommodation and food services [72]', 'Accommodation and food services [72]', 'Accommodation and food services [72]', 'Accommodation and food services [72]', 'Accommodation and food services [72]', 'Accommodation and food services [72]'
$ DATE_YMD     <date> 1976-01-01, 1976-02-01, 1976-03-01, 1976-04-01, 1976-05-01, 1976-06-01, 1976-07-01, 1976-08-01, 1976-09-01, 1976-10-01
$ YEAR          <i32> 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976
$ MONTH         <i32> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
$ VALUE         <f64> 391.9, 395.1, 399.7, 399.7, 407.7, 411.6, 417.6, 423.7, 423.5, 428.9
$ LAGGED_VALUE  <f64> None, 391.9, 395.1, 399.7, 399.7, 407.7, 411.6, 417.6, 423.7, 423.5
$ DIFF          <f64> None, 3.2000000000000455, 4.599999999999966, 0.0, 8.0, 3.900000000000034, 6.0, 6.099999999999966, -0.19999999999998863, 5.399999999999977
$ PDIFF         <f64> None, 0.008165348303138672, 0.011642622120981943, 0.0, 0.020015011258443835, 0.009565857247976537, 0.014577259475218658, 0.014607279693486507, -0.0004720320981826496, 0.012750885478158152

Signed Centered Rank

Now we can compute the signed centered rank.

Define a centered_rank_expr function that takes a polars expression and returns an expression. This means it can be used in a polars with_columns call, which is nice because it can take advantage of polars' lazy-evaluation optimizations.

Below is the definition and a test-case.

def centered_rank_expr(col):
    """
    - Largest negative value (closest to 0) gets rank -1
    - Smallest positive value (closest to 0) gets rank +1
    - Zero gets rank 0; nulls stay null
    """
    return (
        pl.when(col < 0)
        .then(
            # descending rank puts all non-negative values ahead of the
            # negatives, so negate and add back their count: the negative
            # closest to 0 lands at -1
            (col.rank(method="ordinal", descending=True) * -1) + (col >= 0).sum()
        )
        .when(col == 0)
        .then(pl.lit(0))
        .when(col > 0)
        .then(
            # ascending rank puts all non-positive values ahead of the
            # positives, so subtract their count: the positive closest
            # to 0 lands at +1
            col.rank(method="ordinal") - (col <= 0).sum()
        )
        .otherwise(pl.lit(None))
    )

# test it on this subset of data
test_series = (
    labour_processed
    .with_columns(pl.col("PDIFF").round(decimals=4))
    .filter(pl.col("YEAR") == 2025, pl.col("MONTH") == 1)
    .select("PDIFF")
    .sample(n=10, seed=1)
)

test_series.with_columns(centered_rank_expr(pl.col("PDIFF")).alias("rank")).sort(
    "PDIFF"
)
shape: (10, 2)
PDIFF rank
f64 i64
-0.0336 -5
-0.0207 -4
-0.0177 -3
-0.0101 -2
-0.003 -1
0.0006 1
0.0109 2
0.0122 3
0.0179 4
0.044 5

Looks good, so now we can apply it to the data:

labour_processed = labour_processed.with_columns(
    centered_rank_across_industry=centered_rank_expr(pl.col("PDIFF")).over(
        ["YEAR", "MONTH"]
    )
)
labour_processed.glimpse()
Rows: 12252
Columns: 9
$ Industry                       <str> 'Accommodation and food services [72]', 'Accommodation and food services [72]', 'Accommodation and food services [72]', 'Accommodation and food services [72]', 'Accommodation and food services [72]', 'Accommodation and food services [72]', 'Accommodation and food services [72]', 'Accommodation and food services [72]', 'Accommodation and food services [72]', 'Accommodation and food services [72]'
$ DATE_YMD                      <date> 1976-01-01, 1976-02-01, 1976-03-01, 1976-04-01, 1976-05-01, 1976-06-01, 1976-07-01, 1976-08-01, 1976-09-01, 1976-10-01
$ YEAR                           <i32> 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976
$ MONTH                          <i32> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
$ VALUE                          <f64> 391.9, 395.1, 399.7, 399.7, 407.7, 411.6, 417.6, 423.7, 423.5, 428.9
$ LAGGED_VALUE                   <f64> None, 391.9, 395.1, 399.7, 399.7, 407.7, 411.6, 417.6, 423.7, 423.5
$ DIFF                           <f64> None, 3.2000000000000455, 4.599999999999966, 0.0, 8.0, 3.900000000000034, 6.0, 6.099999999999966, -0.19999999999998863, 5.399999999999977
$ PDIFF                          <f64> None, 0.008165348303138672, 0.011642622120981943, 0.0, 0.020015011258443835, 0.009565857247976537, 0.014577259475218658, 0.014607279693486507, -0.0004720320981826496, 0.012750885478158152
$ centered_rank_across_industry  <i64> None, 8, 11, 0, 8, 8, 9, 8, -1, 5

Check output visually for one year

# check one year
(
    labour_processed
    .with_columns(pl.col("PDIFF").round(decimals=4))
    .filter(pl.col("YEAR") == 2025)
    .sort(["YEAR", "MONTH", "PDIFF"])
    .select(["YEAR", "MONTH", "Industry", "VALUE", "DIFF", "PDIFF", cs.matches("rank")])
)
shape: (168, 7)
YEAR MONTH Industry VALUE DIFF PDIFF centered_rank_across_industry
i32 i32 str f64 f64 f64 i64
2025 1 "Wholesale trade [41]" 689.4 -24.0 -0.0336 -8
2025 1 "Utilities [22]" 155.8 -3.3 -0.0207 -7
2025 1 "Other services (except public … 771.0 -13.9 -0.0177 -6
2025 1 "Forestry, fishing, mining, qua… 337.1 -5.6 -0.0163 -5
2025 1 "Business, building and other s… 726.4 -7.4 -0.0101 -4
2025 8 "Accommodation and food service… 1177.1 9.2 0.0079 5
2025 8 "Construction [23]" 1636.3 17.1 0.0106 6
2025 8 "Agriculture [111-112, 1100, 11… 217.7 4.8 0.0225 7
2025 8 "Utilities [22]" 163.5 4.7 0.0296 8
2025 8 "Wholesale trade [41]" 731.1 27.6 0.0392 9