Tuberculosis (TB) remains a major global health problem, especially in low- and middle-income countries, where access to rapid and accurate diagnostics is still limited. We present the Multi-Modal Stacking Ensemble for Tuberculosis (MMSE-TB), a model that combines three diverse and complementary modalities for TB detection: chest X-rays, cough audio, and clinical text. Each modality is modeled with a separate deep learning architecture: a Feature-Map-Normalized CNN that extracts radiological features, a Capsule Network that captures spatio-temporal patterns in cough spectrograms, and a BioBERT-based encoder that extracts semantic features from clinical text. The outputs of the three models are combined through a dynamically optimized weighting scheme, tuned with the Mayfly Optimization Algorithm, so that each modality contributes according to its confidence and reliability. Experimental analysis shows that the tri-modal ensemble substantially improves diagnostic accuracy, lowers the false-negative rate, and remains robust on heterogeneous datasets. The architecture offers a scalable, clinically adaptable approach to AI-based TB screening.
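To make the weighted combination step concrete, the following is a minimal sketch of late fusion over three per-modality TB probabilities. It is an illustration under stated assumptions, not the MMSE-TB implementation: the function names (`fuse`, `search_weights`), the toy probabilities, and the 0.5 decision threshold are all hypothetical, and a plain random search stands in for the Mayfly Optimization Algorithm, which is not reproduced here.

```python
import random

def fuse(probs, weights):
    """Weighted average of per-modality TB probabilities (X-ray, cough, text)."""
    return sum(w * p for w, p in zip(weights, probs))

def normalize(raw):
    """Scale non-negative raw weights so they sum to 1."""
    total = sum(raw)
    return [r / total for r in raw]

def accuracy(weights, samples, labels, threshold=0.5):
    """Fraction of samples whose fused score falls on the correct side of the threshold."""
    correct = 0
    for probs, label in zip(samples, labels):
        pred = 1 if fuse(probs, weights) >= threshold else 0
        correct += int(pred == label)
    return correct / len(labels)

def search_weights(samples, labels, n_trials=500, seed=0):
    """Random search over the weight simplex.

    ASSUMPTION: this is a simple stand-in for the Mayfly Optimization
    Algorithm named in the text; it only shows where an optimizer plugs in.
    """
    rng = random.Random(seed)
    best_w = [1 / 3, 1 / 3, 1 / 3]  # start from equal weighting
    best_acc = accuracy(best_w, samples, labels)
    for _ in range(n_trials):
        cand = normalize([rng.random() + 1e-9 for _ in range(3)])
        acc = accuracy(cand, samples, labels)
        if acc > best_acc:
            best_w, best_acc = cand, acc
    return best_w, best_acc

# Synthetic, made-up validation outputs: (xray_prob, cough_prob, text_prob).
samples = [(0.9, 0.4, 0.7), (0.2, 0.6, 0.3), (0.8, 0.3, 0.6), (0.1, 0.7, 0.2)]
labels = [1, 0, 1, 0]  # 1 = TB positive, 0 = TB negative

weights, acc = search_weights(samples, labels)
print("fusion weights:", [round(w, 3) for w in weights], "accuracy:", acc)
```

In a real setting the weights would be tuned on a held-out validation split, and the objective could penalize false negatives more heavily than false positives, in line with the screening goal stated above.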