Introduction
Standard association rules (or implications in Formal Concept Analysis) identify correlations between attributes (A -> B). However, correlation does not imply causation. A rule A -> B might be strong simply because both A and B are caused by a third, confounding variable C.
The fcaR package now supports mining causal association rules: it
implements a method that identifies likely causal relationships by
controlling for confounding variables. The method is based on the
“Fair Odds Ratio”, calculated on a “Fair Data Set” of matched pairs.
The Approach
To check whether a rule A -> B is causal, the algorithm:
- Identifies potential confounders (controlled variables): attributes that are part of neither the premise A nor the conclusion B, excluding variables that are irrelevant to B.
- Constructs a Fair Data Set by finding matched pairs of objects. Two objects o1 and o2 form a matched pair if:
  - They have the same values for all controlled variables.
  - One object has the premise (o1 has property A).
  - The other object does not (o2 does not have property A).
- Computes the Fair Odds Ratio on these matched pairs.
- Considers the rule “Causal” if the lower bound of the Confidence Interval for the Fair Odds Ratio is greater than 1.
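The steps above can be sketched in base R. This is a minimal illustration under stated assumptions, not fcaR's implementation: the helper names build_matched_pairs and fair_odds_ratio are hypothetical, pairs are formed greedily in row order within each stratum, and the interval uses the standard matched-pair formula exp(log(OR) ± z * sqrt(1/n12 + 1/n21)), where n12 and n21 count the discordant pairs. The toy data and the controlled variable Age are invented for the example.

```r
# Toy data (invented): Age is the controlled variable.
d <- data.frame(
  Age       = rep(c(1, 0), each = 8),
  Treatment = rep(c(1, 1, 1, 1, 0, 0, 0, 0), times = 2),
  Recovery  = c(1, 1, 1, 0, 0, 0, 1, 1,
                1, 1, 1, 0, 0, 0, 0, 0)
)

# Hypothetical helper: pair exposed and unexposed rows that agree on
# every controlled variable (greedy, in row order; an assumption).
build_matched_pairs <- function(data, exposure, controls) {
  key <- if (length(controls) > 0) {
    do.call(paste, c(data[controls], sep = "|"))
  } else {
    rep("all", nrow(data))
  }
  pairs <- lapply(split(seq_len(nrow(data)), key), function(idx) {
    exposed   <- idx[data[[exposure]][idx] == 1]
    unexposed <- idx[data[[exposure]][idx] == 0]
    k <- min(length(exposed), length(unexposed))
    data.frame(exposed = exposed[seq_len(k)],
               unexposed = unexposed[seq_len(k)])
  })
  do.call(rbind, pairs)
}

# Hypothetical helper: odds ratio over discordant pairs with a
# log-scale confidence interval. No zero-count correction is applied,
# so n12 = 0 or n21 = 0 yields a degenerate result.
fair_odds_ratio <- function(data, pairs, response, level = 0.95) {
  y_exp   <- data[[response]][pairs$exposed]
  y_unexp <- data[[response]][pairs$unexposed]
  n12 <- sum(y_exp == 1 & y_unexp == 0)  # only the exposed object responds
  n21 <- sum(y_exp == 0 & y_unexp == 1)  # only the unexposed object responds
  or  <- n12 / n21
  z   <- qnorm(1 - (1 - level) / 2)
  se  <- sqrt(1 / n12 + 1 / n21)
  c(odds_ratio = or,
    ci_lower = exp(log(or) - z * se),
    ci_upper = exp(log(or) + z * se))
}

pairs <- build_matched_pairs(d, exposure = "Treatment", controls = "Age")
res   <- fair_odds_ratio(d, pairs, response = "Recovery")

# Decision rule from the last step: causal only if the lower bound exceeds 1
is_causal <- res[["ci_lower"]] > 1
```

In this toy data five discordant pairs favour the treated object and one favours the untreated one, so the odds ratio is 5. With so few discordant pairs the interval is wide, its lower bound stays below 1, and the rule would not be labelled causal.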
Example 1: Direct Causality
Let’s consider a simple case where Treatment causes Recovery.
library(fcaR)

# 100 patients: 50 treated, 50 untreated
# Treated: 90% recovery (45 of 50)
# Untreated: 20% recovery (10 of 50)
n <- 100
treated <- c(rep(1, 45), rep(1, 5), rep(0, 10), rep(0, 40))
recovered <- c(rep(1, 45), rep(0, 5), rep(1, 10), rep(0, 40))
I <- matrix(c(treated, recovered), ncol = 2)
colnames(I) <- c("Treatment", "Recovery")
fc <- FormalContext$new(I)

We can mine for causal rules targeting “Recovery”:
rules <- fc$find_causal_rules(
  response_var = "Recovery",
  min_support = 0.1,
  confidence_level = 0.95
)
rules$print()
#> Rules set with 1 rules
#> Rule 1: {Treatment} -> {Recovery} [support = 0.5, confidence = 0.9,
#> fair_odds_ratio = 35, ci_lower = 4.8, ci_upper = 255.47]

The algorithm correctly identifies “Treatment” as a cause for “Recovery”.
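The printed metrics can be checked by hand in base R. Support and confidence come directly from the vectors defined in the example. The discordant-pair counts n12 = 35 and n21 = 1, by contrast, are hypothetical: they are chosen because they reproduce the printed interval under the usual matched-pair odds ratio formula, and the counts that find_causal_rules uses internally are not shown here.

```r
# Same vectors as in the example
treated   <- c(rep(1, 50), rep(0, 50))
recovered <- c(rep(1, 45), rep(0, 5), rep(1, 10), rep(0, 40))

premise_support <- mean(treated == 1)                  # 0.5
confidence      <- mean(recovered[treated == 1] == 1)  # 0.9

# ASSUMPTION: hypothetical discordant-pair counts, not taken from fcaR
n12 <- 35  # pairs where only the treated patient recovered
n21 <- 1   # pairs where only the untreated patient recovered

odds_ratio <- n12 / n21
z  <- qnorm(0.975)             # two-sided 95% interval
se <- sqrt(1 / n12 + 1 / n21)  # standard error of log(odds_ratio)
ci <- exp(log(odds_ratio) + c(-1, 1) * z * se)

round(c(odds_ratio, ci), 2)
# approximately 35, 4.8 and 255.47, in line with the printed rule
```

The printed support of 0.5 coincides with the relative frequency of the premise (treated patients); the joint frequency of Treatment and Recovery in these vectors is 0.45.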
Example 2: Simpson’s Paradox (Spurious Correlation)
A classic case where standard association rules fail is confounding, as in Simpson’s Paradox: a hidden common cause creates a spurious correlation between two attributes.
Consider a dataset relating Ice Cream consumption and Drowning. They are highly correlated because both increase during hot weather (the Heat variable).
- Heat causes Ice Cream.
- Heat causes Drowning.
- Ice Cream does not cause Drowning.
However, naive frequent itemset mining might still find the rule
Ice Cream -> Drowning.
Let’s simulate this:
set.seed(123)
n <- 200
# Heat: 50% Hot, 50% Cold
heat <- c(rep(1, 100), rep(0, 100))
# Ice Cream: Strongly dependent on Heat (80% if Hot, 20% if Cold)
ic <- numeric(200)
ic[1:100] <- rbinom(100, 1, 0.8)
ic[101:200] <- rbinom(100, 1, 0.2)
# Drowning: Strongly dependent on Heat (80% if Hot, 20% if Cold)
drown <- numeric(200)
drown[1:100] <- rbinom(100, 1, 0.8)
drown[101:200] <- rbinom(100, 1, 0.2)
I <- matrix(c(heat, ic, drown), ncol = 3)
colnames(I) <- c("Heat", "IceCream", "Drowning")
fc_spurious <- FormalContext$new(I)

If we only looked at correlations, IceCream and Drowning would appear related. But find_causal_rules controls for confounders. When testing IceCream -> Drowning:
- It controls for Heat.
- It compares days with the same Heat (Hot vs Hot, Cold vs Cold) but different Ice Cream consumption.
- Within “Hot” days, Ice Cream consumption is random with respect to the causal mechanism behind Drowning and does not increase drowning risk further.
- The odds ratio should therefore be near 1.
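The effect of stratifying by Heat can be illustrated with plain counts. The sketch below is base R and does not call fcaR. It uses idealised cell counts that follow the 80%/20% probabilities exactly (an assumption; the simulated draw will deviate from them), and the helper drowning_rate is a hypothetical name. It compares the confidence of IceCream -> Drowning over all days with the confidence inside each Heat stratum.

```r
# Idealised counts (assumption): within each Heat level, IceCream and
# Drowning are independent, with the probabilities used in the simulation.
cells <- data.frame(
  Heat     = c(1, 1, 1, 1, 0, 0, 0, 0),
  IceCream = c(1, 1, 0, 0, 1, 1, 0, 0),
  Drowning = c(1, 0, 1, 0, 1, 0, 1, 0),
  n        = c(64, 16, 16, 4, 4, 16, 16, 64)
)

# Hypothetical helper: P(Drowning = 1 | IceCream = ic) within the given rows
drowning_rate <- function(rows, ic) {
  sel <- rows$IceCream == ic
  sum(rows$n[sel & rows$Drowning == 1]) / sum(rows$n[sel])
}

hot  <- cells[cells$Heat == 1, ]
cold <- cells[cells$Heat == 0, ]

overall <- c(with_ic = drowning_rate(cells, 1), without_ic = drowning_rate(cells, 0))
in_hot  <- c(with_ic = drowning_rate(hot, 1),   without_ic = drowning_rate(hot, 0))
in_cold <- c(with_ic = drowning_rate(cold, 1),  without_ic = drowning_rate(cold, 0))

overall  # 0.68 vs 0.32: ice cream looks harmful when Heat is ignored
in_hot   # 0.8 vs 0.8: no difference among hot days
in_cold  # 0.2 vs 0.2: no difference among cold days
```

Once Heat is held fixed, the drowning rate no longer depends on Ice Cream, which is why a comparison over pairs matched on Heat yields an odds ratio near 1.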
causal_rules <- fc_spurious$find_causal_rules(
  response_var = "Drowning",
  min_support = 0.5
)
# Should contain "Heat" but NOT "IceCream"
print(causal_rules)
#> Rules set with 1 rules
#> Rule 1: {Heat} -> {Drowning} [support = 0.5, confidence = 0.81,
#> fair_odds_ratio = 27, ci_lower = 3.67, ci_upper = 198.69]

As expected, the algorithm identifies Heat as the true cause and rejects the spurious Ice Cream association.
Conclusion
The find_causal_rules method goes beyond simple association by
identifying rules that are robust to confounding, a step towards
causal inference in Formal Concept Analysis. It returns a RuleSet
object with quality metrics including Support, Confidence, and the
Fair Odds Ratio with its Confidence Interval.