How Tian Mira Built a Technical BaZi Corpus of 18,255 Canonical Profiles
Introduction
This article describes how the Tian Mira engine built a technical BaZi corpus. It is not a ranking of personalities, nor an astrological interpretation. It is foundational work: gathering public birth records, auditing them, matching cross-source entries, deduplicating, and producing consistent, documented BaZi calculations.
The final corpus contains 18,255 canonical profiles, each with a full advanced_v2 calculation. It draws on 19,394 source records from two legally distinct collections.
Why Build a Technical BaZi Corpus
A BaZi engine cannot be seriously evaluated on a handful of hand-picked cases. Checking calculation consistency, solar correction accuracy, pillar stability and element weighting requires a corpus that is large enough to be meaningful, documented enough to be audited, and open to independent verification.
Building this corpus required:
- collecting reliable source birth records;
- auditing them without altering them;
- identifying cross-source duplicates;
- deciding when two records refer to the same person;
- keeping profiles separate when identity remains uncertain.
Two Source Collections, Two Distinct Rights Regimes
The corpus combines two collections that are not governed by the same terms.
Astro-Databank C Collection
- 3,604 birth records (letter C);
- Rodden Rating AA for all records;
- verified line by line against the official C sample;
- strictly non-commercial use;
- attribution required (birth data: Astro-Databank / Astrodienst; calculations: Tian Mira);
- commercial use prohibited without explicit permission from the rights holder.
VedAstro Collection
- 15,790 valid AA records from the VedAstro dataset;
- upstream dataset labelled MIT by its publisher on HuggingFace;
- underlying provenance (link to Astro-Databank) not verified line by line by Tian Mira;
- Tian Mira does not guarantee the complete rights chain of every record.
This corpus combines records subject to different upstream terms. No single licence applies to the complete set.
From 19,394 Source Records to 18,255 Canonical Profiles
Source Records
Each birth entry from an upstream collection is a source record. The total is:
3,604 + 15,790 = 19,394 source records
Cross-Source Matching
A cross-source link is established when an Astro-Databank record and a VedAstro record are likely to refer to the same person. Matching compares name, date, time, place and coordinates.
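As a rough illustration of the comparison described above, the sketch below reports which of the five fields agree for a pair of records. The `SourceRecord` type, its field names, the text normalisation and the coordinate tolerance are assumptions made for this example; they are not taken from the Tian Mira pipeline.

```python
from dataclasses import dataclass
from datetime import date, time
import unicodedata


@dataclass(frozen=True)
class SourceRecord:
    # Hypothetical minimal schema; real source records carry more fields.
    name: str
    birth_date: date
    birth_time: time
    place: str
    lat: float
    lon: float


def normalise(text: str) -> str:
    """Lower-case, strip accents and collapse whitespace before comparing."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(without_accents.lower().split())


def field_agreement(a: SourceRecord, b: SourceRecord, coord_tol_deg: float = 0.1) -> dict:
    """Return a per-field agreement map for two candidate records."""
    return {
        "name": normalise(a.name) == normalise(b.name),
        "date": a.birth_date == b.birth_date,
        "time": a.birth_time == b.birth_time,
        "place": normalise(a.place) == normalise(b.place),
        # Assumed tolerance: both coordinates within coord_tol_deg degrees.
        "coordinates": abs(a.lat - b.lat) <= coord_tol_deg
        and abs(a.lon - b.lon) <= coord_tol_deg,
    }
```

Returning a per-field map rather than a single score keeps the reason for a disagreement visible, which matters when a link has to be documented instead of merged.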
Across the corpus, 1,236 links were examined:
- 1,139 confirmed links (same person);
- 97 non-merged links (uncertain identity, time divergence, or extreme coordinate divergence).
Canonical Profile
When two source records are confirmed as the same person, they produce a single canonical profile. The official Astro-Databank data is kept as the primary source.
When a link is uncertain or false, both records remain separate, preserving each source without deletion.
19,394 − 1,139 = 18,255 canonical profiles
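The merge rule and the resulting count can be reproduced with a small sketch. It assumes that every confirmed link joins exactly one Astro-Databank record to one VedAstro record; the function name, the dictionary layout and the placeholder ids are hypothetical and only sized to match the published totals.

```python
from collections import defaultdict


def build_canonical(astro_ids, vedastro_ids, confirmed_links):
    """Merge confirmed pairs into one profile, Astro-Databank record as primary.

    confirmed_links maps a VedAstro id to the Astro-Databank id it was
    confirmed against. Assumes each record is in at most one confirmed link.
    """
    attached = defaultdict(list)
    for ved_id, astro_id in confirmed_links.items():
        attached[astro_id].append(ved_id)

    profiles = []
    for astro_id in astro_ids:
        sources = [astro_id, *attached.get(astro_id, [])]
        profiles.append({"primary": astro_id, "sources": sources})
    for ved_id in vedastro_ids:
        # Unlinked or non-merged VedAstro records stay as their own profile.
        if ved_id not in confirmed_links:
            profiles.append({"primary": ved_id, "sources": [ved_id]})
    return profiles


# Placeholder ids sized like the published counts; not real record ids.
astro = [f"adb-{i}" for i in range(3604)]
vedastro = [f"ved-{i}" for i in range(15790)]
links = {f"ved-{i}": f"adb-{i}" for i in range(1139)}

profiles = build_canonical(astro, vedastro, links)
print(len(profiles))                              # 18255
print(sum(len(p["sources"]) for p in profiles))   # 19394
```

No source record disappears in this model: the 19,394 records are all still referenced, and only the number of profiles drops by the 1,139 confirmed merges.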
Why Some Matches Were Not Merged
The 97 non-merged links fall into three categories:
- uncertain identity (insufficient match confidence);
- time divergence (same name and date, different birth time);
- extreme coordinate divergence (over 8,000 km apart, likely indicating two distinct people or a matching error).
These 97 cases are individually documented. No source record was deleted: caution requires keeping both entries pending possible human verification.
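The three categories can be expressed as a small decision function. In the sketch below only the 8,000 km threshold comes from the text above; the great-circle distance via the haversine formula, the confidence threshold, the order of the checks and the label strings are assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def non_merge_reason(same_time, distance_km, confidence,
                     min_confidence=0.9, max_distance_km=8000.0):
    """Return a reason for keeping two records separate, or None if none applies."""
    if distance_km > max_distance_km:
        return "extreme_coordinate_divergence"
    if not same_time:
        return "time_divergence"
    if confidence < min_confidence:  # assumed threshold
        return "uncertain_identity"
    return None
```

A pair that returns `None` here is only a candidate for merging; the sketch does not replace the individual documentation of each case.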
The 18,255 advanced_v2 Calculations
Every canonical profile has a complete BaZi calculation from the Tian Mira engine:
- Four Pillars (year, month, day, hour);
- Day Master;
- Hidden Stems;
- Ten Gods;
- Na Yin;
- Advanced five-element weighting (advanced_v2);
- Luck Cycles.
The advanced_v2 method exposes for each profile:
- Wood, Fire, Earth, Metal, Water percentages (sum = 100);
- raw scores;
- root strength;
- support / pressure ratio;
- conclusion and confidence level.
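The "sum = 100" constraint is easy to check mechanically. The sketch below only shows how raw scores could be normalised into percentages and validated; how advanced_v2 actually derives the raw scores is not described here, and the function names and dictionary layout are hypothetical.

```python
ELEMENTS = ("Wood", "Fire", "Earth", "Metal", "Water")


def to_percentages(raw_scores: dict) -> dict:
    """Scale non-negative raw scores so the five elements sum to 100."""
    total = sum(raw_scores[e] for e in ELEMENTS)
    if total <= 0:
        raise ValueError("raw scores must have a positive total")
    return {e: 100.0 * raw_scores[e] / total for e in ELEMENTS}


def percentages_are_valid(percentages: dict, tol: float = 1e-6) -> bool:
    """Check that all five elements are present, non-negative and sum to 100."""
    return (
        set(percentages) == set(ELEMENTS)
        and all(p >= 0 for p in percentages.values())
        and abs(sum(percentages.values()) - 100.0) <= tol
    )


# Made-up raw scores, used only to exercise the functions.
example = to_percentages({"Wood": 2.0, "Fire": 1.0, "Earth": 3.0, "Metal": 1.5, "Water": 2.5})
print(percentages_are_valid(example))  # True
```

Running a check of this kind over every profile is one way to audit the weighting output without interpreting it.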
No narrative, predictive or divinatory interpretation is included.
Storage Deduplication and Publication Architecture
The source package contained redundant copies of the same calculations across the Astro-Databank, VedAstro and unified distributions. Physical deduplication removed those copies and cut storage volume by roughly half.
| Metric | Before | After |
|---|---|---|
| Total volume | ~672.7 MiB | ~342.7 MiB |
| Savings | – | ~330 MiB (49%) |
The planned deployment architecture separates:
- lightweight files (documentation, A–Z index, schemas, manifests) for the public site — ~20.6 MiB;
- 32 expert shards (18,255 advanced_v2 calculations) for Cloudflare R2 storage — ~322.1 MiB.
Each distribution (Astro-Databank and VedAstro) references canonical calculations without physically duplicating them.
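One common way to get this kind of reference-without-copy layout is content addressing: each calculation is stored once under a hash of its serialised form, and each distribution keeps only a list of keys. The sketch below assumes that approach; the function names, the JSON serialisation and the manifest layout are hypothetical and may differ from what Tian Mira actually uses.

```python
import hashlib
import json


def content_key(calculation: dict) -> str:
    """Stable SHA-256 key for a calculation, independent of key order."""
    payload = json.dumps(calculation, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def deduplicate(distributions: dict):
    """Store each distinct calculation once; distributions keep references only."""
    store = {}      # key -> calculation, written once
    manifests = {}  # distribution name -> ordered list of keys
    for name, calculations in distributions.items():
        keys = []
        for calc in calculations:
            key = content_key(calc)
            store.setdefault(key, calc)
            keys.append(key)
        manifests[name] = keys
    return store, manifests


# Made-up payloads: the shared profile appears in two distributions.
shared = {"profile": "p1", "day_master": "Jia"}
store, manifests = deduplicate({
    "astro_databank": [shared],
    "unified": [{"day_master": "Jia", "profile": "p1"}, {"profile": "p2", "day_master": "Yi"}],
})
print(len(store))  # 2
```

Because the serialisation sorts keys, two copies that differ only in field order resolve to the same stored object.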
Provenance, Rights and Limitations
Rights by Collection
| Collection | Regime | Commercial Use |
|---|---|---|
| Astro-Databank C | Non-commercial | Prohibited without permission |
| VedAstro | Upstream MIT declared | Permitted by declared licence, not guaranteed by Tian Mira |
The unified corpus has no global licence. Each profile retains its source rights regime.
Limitations
- VedAstro record provenance is not verified line by line.
- BaZi is a symbolic cultural system, not a scientific method.
- Tian Mira calculations are technical outputs, not predictions.
- Historical dates may carry calendar uncertainties.
What the Corpus Enables
- Auditing a BaZi engine across a large, documented dataset.
- Comparing elemental weighting methods.
- Studying pillar, Day Master and element distributions.
- Serving as a basis for technical and statistical research.
- Clearly distinguishing upstream data rights regimes.
What It Does Not Claim
- That all 18,255 profiles come from a single database.
- That all upstream data is verified or guaranteed by Tian Mira.
- That the corpus is free of rights restrictions for any commercial use.
- That BaZi is a predictive scientific method.
- That predictive, medical, legal or financial interpretation can be derived from the calculations.
Cautious Conclusion
The Tian Mira 2026 technical corpus is an audit and research tool. It documents its sources, matching decisions, merge rules, calculations, limitations and rights regimes. It claims to predict nothing.
Methodological transparency is not a selling point: it is the minimum condition for a technical corpus to be examined, challenged, corrected and improved.
Methodological Note
- Engine: Tian Mira BaZi calculation engine
- Method: advanced_v2 (normalised weighting with roots, season, stem-branch interactions and hidden stems)
- Correction: true solar time with historical timezone and equation of time
- Geocoding: local GeoNames index
- Interpretation: none (purely technical outputs)
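To make the correction line concrete, the sketch below converts a local civil time to true solar time using a longitude offset and a widely quoted three-term approximation of the equation of time. It is an assumption that this resembles the engine's correction: the approximation is only accurate to within a minute or so, and the historical timezone lookup is reduced to a UTC offset supplied by the caller. Function and parameter names are hypothetical.

```python
import math
from datetime import datetime, timedelta


def equation_of_time_minutes(day_of_year: int) -> float:
    """Approximate equation of time in minutes (common three-term formula)."""
    b = math.radians(360.0 / 365.0 * (day_of_year - 81))
    return 9.87 * math.sin(2 * b) - 7.53 * math.cos(b) - 1.5 * math.sin(b)


def true_solar_time(local_civil: datetime, utc_offset_hours: float,
                    longitude_deg: float) -> datetime:
    """Shift a naive local civil time to true solar time.

    longitude_deg is positive east of Greenwich. The zone's reference
    meridian is taken as 15 degrees per hour of UTC offset.
    """
    # 4 minutes of time per degree away from the reference meridian.
    longitude_minutes = 4.0 * (longitude_deg - 15.0 * utc_offset_hours)
    eot_minutes = equation_of_time_minutes(local_civil.timetuple().tm_yday)
    return local_civil + timedelta(minutes=longitude_minutes + eot_minutes)


# One degree east of the reference meridian adds four minutes.
noon = datetime(2001, 3, 22, 12, 0)
on_meridian = true_solar_time(noon, 1.0, 15.0)
one_degree_east = true_solar_time(noon, 1.0, 16.0)
print(one_degree_east - on_meridian)  # 0:04:00
```

Since the hour pillar depends on the corrected time, even a shift of a few minutes can matter for births close to a boundary between two double-hours.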
Recommended Citation
Tian Mira, Technical BaZi Corpus 2026 — canonical model and advanced_v2 calculations, lightweight version 2026.1, June 2026.
Birth data: Astro-Databank/Astrodienst (C collection, 3,604 records, non-commercial use) and VedAstro dataset (15,790 records, upstream MIT declared).
BaZi calculations and canonical model: Tian Mira.
Current Status
The publicly downloadable dataset contains the 3,604 Astro-Databank C profiles, free of charge and for non-commercial use only. The 18,255-profile unified corpus and the VedAstro data are not offered for public download.