[Diagram: Teacher–Student framework. Complete Input C feeds a pretrained Transformer Encoder T, which produces the teacher prediction ŷ_T and internal features. Student A (Transformer Encoder S_A) handles missing input, trained via feature alignment and prediction distillation against the teacher plus ground-truth RUL labels; Student B (Transformer Encoder S_B) handles missing labels, trained via prediction distillation against ŷ_T and supervision where labels exist. Each student's total loss drives backpropagation; the final output is the RUL prediction ŷ_S.]
Here is a clear background and explanation of the full method, including what each part is doing and why.
Background & Motivation
Missing values: Some input features (sensor channels) are missing for some samples due to sensor failure or corruption.
Missing labels: Not all samples have a ground-truth RUL value. For example, data collected during normal operation is often unlabeled.
Most traditional deep learning models require complete inputs and full labels. In our case, both are incomplete. If we train a model directly on such data, it will either fail to learn properly or be forced to discard valuable samples.
What We Are Doing: Overview
We solve this using a Teacher–Student knowledge distillation framework:
We train a Teacher model on a clean and complete dataset where both inputs and labels are available.
We then use that Teacher to teach two separate Student models:
Student A learns from incomplete input (some sensor values missing).
Student B learns from incomplete labels (RUL labels missing for some samples).
We use knowledge distillation to guide both students, even when labels are missing.
Why We Use Two Students
Student A handles Missing Input Features: It receives input with some features masked out. Since it cannot see the full input, we help it by transferring internal features (feature distillation) and predictions from the teacher.
Student B handles Missing RUL Labels: It receives full input but does not always have a ground-truth RUL label. We guide it using the predictions of the teacher model (prediction distillation).
Using two students allows each to specialize in solving one problem with a tailored learning strategy.
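To make the "missing input" setting concrete, here is a minimal sketch of how masked inputs for Student A could be simulated. The function name and the zero fill value are assumptions for illustration; mean imputation or a learned mask token are common alternatives.

```python
import numpy as np

def mask_features(x, missing_idx, fill_value=0.0):
    """Simulate missing sensor channels for Student A by overwriting
    the given feature indices with fill_value (an assumed choice)."""
    x = np.array(x, dtype=float)
    x[list(missing_idx)] = fill_value
    return x
```

For example, masking channel 1 of a three-sensor reading `[1.0, 2.0, 3.0]` yields `[1.0, 0.0, 3.0]`, which Student A sees while the teacher sees the complete vector.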
Detailed Explanation of the Teaching Process
1. Teacher Model (Trained First)
Input: Complete features
Label: Known RUL values
Output:
Final prediction ŷ_T (predicted RUL)
Internal features f_T (last encoder layer output)
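The key point is that the teacher exposes two things per sample: the final prediction ŷ_T and the internal features f_T. A toy stand-in (a single hidden layer playing the role of the last transformer encoder layer; all names and sizes are illustrative, not the actual architecture) makes this two-output interface explicit:

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyTeacher:
    """Toy stand-in for the pretrained teacher: the hidden activations
    play the role of the internal features f_T (last encoder layer),
    and a linear head produces the scalar RUL prediction y_T."""
    def __init__(self, n_in, n_hidden):
        self.W1 = rng.standard_normal((n_in, n_hidden)) * 0.1
        self.w2 = rng.standard_normal(n_hidden) * 0.1

    def forward(self, x):
        f_T = np.tanh(x @ self.W1)   # internal features f_T
        y_T = float(f_T @ self.w2)   # final RUL prediction y_T
        return y_T, f_T
```

In the real model, f_T would be the output of the last layer of Transformer Encoder T, and both y_T and f_T are frozen targets during student training.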
2. Student A (Handles Missing Input)
Input: Some sensor values are masked
Label: RUL label available for some samples
Output: Predicted RUL: ŷ_S^A
How the Teacher Teaches Student A:
The student sees masked inputs. It tries to reconstruct what the teacher would have done if it had the full input.
We calculate:
Prediction distillation loss: How close is ŷ_S^A to ŷ_T?
Feature distillation loss: How close are the student’s encoder features to the teacher’s? f_S^A vs. f_T
Supervised loss: Where RUL label is available, compare to ground truth.
All these losses are combined, and we update the student encoder through backpropagation.
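The three terms above can be sketched as a single weighted sum. This is a minimal illustration using MSE for every term; the weights alpha, beta, gamma and the choice of MSE are assumptions, not the paper's exact formulation:

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two arrays or scalars."""
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

def student_a_loss(y_student, y_teacher, f_student, f_teacher,
                   y_true=None, alpha=0.5, beta=0.5, gamma=1.0):
    """Combined loss for Student A (missing-input student).

    y_true is None when the RUL label is missing for this sample;
    alpha, beta, gamma are illustrative weights."""
    loss = alpha * mse(y_student, y_teacher)   # prediction distillation
    loss += beta * mse(f_student, f_teacher)   # feature distillation
    if y_true is not None:
        loss += gamma * mse(y_student, y_true) # supervised term
    return loss
```

When the student matches the teacher exactly and the label is met, the loss is zero; any mismatch in predictions, features, or ground truth adds a weighted penalty, and this scalar is what backpropagation minimizes.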
3. Student B (Handles Missing Labels)
Input: Full sensor data
Label: RUL label available only for some samples
Output: Predicted RUL: ŷ_S^B
How the Teacher Teaches Student B:
The student sees the full input, but no ground-truth RUL label.
We compute:
Prediction distillation loss: ŷ_S^B vs. ŷ_T
Supervised loss (only when RUL is available)
No feature distillation is used here — only predictions are used to guide learning.
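Student B's objective is therefore a strict subset of Student A's: prediction distillation always, supervision only when a label exists. A minimal sketch, with the same caveat that the weights and the squared-error form are illustrative assumptions:

```python
def student_b_loss(y_student, y_teacher, y_true=None,
                   alpha=0.5, gamma=1.0):
    """Combined loss for Student B (missing-label student).
    Prediction distillation only, plus an optional supervised term
    when the RUL label is present; no feature term is used."""
    loss = alpha * float((y_student - y_teacher) ** 2)  # distillation
    if y_true is not None:
        loss += gamma * float((y_student - y_true) ** 2)  # supervision
    return loss
```

For unlabeled samples, the teacher's prediction ŷ_T acts as a pseudo-label, which is exactly how the framework exploits data collected during normal operation.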
I'm a little confused about the diagram I made. I'm not sure it accurately represents what's described in the text. I need help clarifying that part; if it doesn't match, I'd like to create two separate diagrams, one for each challenge.
The knowledge distillation part seems fuzzy to me because it's not entirely clear what the teacher is teaching the student models. I need a very clear explanation of this process; everything should appear clearly on the diagrams, from the input to the final prediction, especially since I want to use a transformer-based architecture.









