
Question

Here is a clear background and explanation of the full method, including what each part is doing and why.

Background & Motivation

Missing values: Some input features (sensor channels) are missing for some samples due to sensor failure or corruption.

Missing labels: Not all samples have a ground-truth RUL value. For example, data collected during normal operation is often unlabeled.

Most traditional deep learning models require complete data and full labels. But in our case, both are incomplete. If we try to train a model directly, it will either fail to learn properly or discard valuable data.

What We Are Doing: Overview

We solve this using a Teacher–Student knowledge distillation framework:

We train a Teacher model on a clean and complete dataset where both inputs and labels are available.

We then use that Teacher to teach two separate Student models: 

Student A learns from incomplete input (some sensor values missing).

Student B learns from incomplete labels (RUL labels missing for some samples).

We use knowledge distillation to guide both students, even when labels are missing.

Why We Use Two Students

Each data problem calls for a different teaching strategy:

Student A handles Missing Input Features: It receives input with some features masked out. Since it cannot see the full input, we help it by transferring internal features (feature distillation) and predictions from the teacher.

Student B handles Missing RUL Labels: It receives full input but does not always have a ground-truth RUL label. We guide it using the predictions of the teacher model (prediction distillation).

Using two students allows each to specialize in solving one problem with a tailored learning strategy.
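To make "masked out" concrete, here is a minimal sketch of what Student A's input could look like. It assumes missing sensor readings are zero-filled and tracked with a boolean mask; the tensor shapes and the 20% missing rate are illustrative, not from the source.

```python
import torch

# Illustrative input masking for Student A (assumption: missing sensor
# readings are zero-filled and flagged with a boolean mask).
x = torch.randn(4, 30, 14)               # 4 windows, 30 time steps, 14 sensors
missing = torch.rand(4, 30, 14) < 0.2    # ~20% of readings "missing"
x_masked = x.masked_fill(missing, 0.0)   # what Student A actually sees
```

In practice the mask itself can also be fed to the student as an extra input channel, so the model knows which values are real zeros and which are missing.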

Detailed Explanation of the Teaching Process

1. Teacher Model (Trained First)

Input: Complete features

Label: Known RUL values

Output: 

Final prediction ŷ_T (predicted RUL)

Internal features f_T (last encoder layer output)
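The two teacher outputs above can be produced by one forward pass. Here is a minimal sketch of such a teacher, assuming a transformer encoder over windows of sensor readings with mean pooling before the regression head; all dimensions (14 sensors, model width 32, 2 layers) are illustrative, not from the source.

```python
import torch
import torch.nn as nn

# Hypothetical Teacher sketch: a Transformer encoder over sensor windows
# that returns both the RUL prediction (y_T) and the last encoder layer's
# features (f_T). Dimensions are illustrative assumptions.
class TeacherRUL(nn.Module):
    def __init__(self, n_sensors=14, d_model=32, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(n_sensors, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, x):                  # x: (batch, time, n_sensors)
        f_T = self.encoder(self.embed(x))  # internal features (batch, time, d_model)
        y_T = self.head(f_T.mean(dim=1))   # pool over time -> RUL (batch, 1)
        return y_T, f_T

x = torch.randn(8, 30, 14)                 # 8 windows, 30 time steps, 14 sensors
teacher = TeacherRUL()
y_T, f_T = teacher(x)
```

The students can reuse the same architecture, which makes the feature distillation term for Student A a direct comparison of tensors of the same shape.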

2. Student A (Handles Missing Input)

Input: Some sensor values are masked

Label: RUL label available for some samples

Output: Predicted RUL: ŷ_S^A

How the Teacher Teaches Student A:

The student sees masked inputs. It tries to reconstruct what the teacher would have done if it had the full input.

We calculate: 

Prediction distillation loss: How close is ŷ_S^A to ŷ_T?

Feature distillation loss: How close are the student’s encoder features to the teacher’s? f_S^A vs. f_T

Supervised loss: Where RUL label is available, compare to ground truth.

All these losses are combined, and we update the student encoder through backpropagation.
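The combined loss for Student A can be sketched as below. This is a hypothetical formulation: MSE is assumed for all three terms, and the weights alpha, beta, gamma are illustrative hyperparameters, not values from the source.

```python
import torch
import torch.nn.functional as F

# Hypothetical total loss for Student A: prediction distillation (y_S^A vs
# y_T), feature distillation (f_S^A vs f_T), and supervised loss on the
# labelled subset only. Weights alpha/beta/gamma are assumptions.
def student_a_loss(y_SA, f_SA, y_T, f_T, y_true, label_mask,
                   alpha=1.0, beta=0.5, gamma=1.0):
    pred_kd = F.mse_loss(y_SA, y_T.detach())   # match teacher predictions
    feat_kd = F.mse_loss(f_SA, f_T.detach())   # match teacher encoder features
    if label_mask.any():                       # supervised only where RUL exists
        sup = F.mse_loss(y_SA[label_mask], y_true[label_mask])
    else:
        sup = torch.zeros((), device=y_SA.device)
    return alpha * pred_kd + beta * feat_kd + gamma * sup
```

Note the `detach()` calls: gradients flow only into the student, so the pretrained teacher stays frozen during distillation.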

3. Student B (Handles Missing Labels)

Input: Full sensor data

Label: RUL label available only for some samples

Output: Predicted RUL: ŷ_S^B

How the Teacher Teaches Student B:

The student sees the full input but, for many samples, has no ground-truth RUL label.

We compute: 

Prediction distillation loss: ŷ_S^B vs. ŷ_T

Supervised loss (only when RUL is available)

No feature distillation is used here — only predictions are used to guide learning.
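Student B's loss is therefore a simpler version of Student A's. The sketch below makes the same assumptions (MSE terms, illustrative weights alpha and gamma); the key point is that the distillation term covers every sample, labelled or not, while the supervised term applies only where a label exists.

```python
import torch
import torch.nn.functional as F

# Hypothetical total loss for Student B: prediction distillation on all
# samples, supervised loss only where a ground-truth RUL label exists.
# No feature distillation, per the text. Weights alpha/gamma are assumptions.
def student_b_loss(y_SB, y_T, y_true, label_mask, alpha=1.0, gamma=1.0):
    pred_kd = F.mse_loss(y_SB, y_T.detach())   # teacher guides unlabelled samples
    if label_mask.any():
        sup = F.mse_loss(y_SB[label_mask], y_true[label_mask])
    else:
        sup = torch.zeros((), device=y_SB.device)  # fully unlabelled batch
    return alpha * pred_kd + gamma * sup
```

This is what "the teacher teaches" means operationally for Student B: on unlabelled samples, the teacher's prediction ŷ_T acts as a stand-in for the missing RUL label.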

I'm a little confused about the diagram I made; I'm not sure it accurately represents what's described in the text. I need help clarifying that: if it doesn't match, I'd like to create two separate diagrams, one for each challenge.

The knowledge distillation part seems a bit fuzzy to me because it's not entirely clear what the teacher is teaching the student models. I need a very clear explanation of this process; everything should appear clearly on the diagrams, from the input to the final prediction, especially since I want to use a transformer-based architecture.

Transcription of the attached diagram (cleaned up):

Inputs: Input C (complete data); Input M (missing data)
Teacher Model (pretrained): Transformer Encoder T → teacher prediction ŷ_T and internal features f_T
Knowledge distillation (Student A): prediction loss (ŷ_S^A vs ŷ_T) + feature alignment (f_S^A vs f_T) → Total Loss A → backpropagation
Knowledge distillation (Student B): prediction loss (ŷ_S^B vs ŷ_T) → Total Loss B → backpropagation
Student Model A (handles missing input): Transformer Encoder S_A → Student A prediction ŷ_S^A
Student Model B (handles missing labels): Transformer Encoder S_B → Student B prediction ŷ_S^B
Ground truth: RUL labels
Final output: final RUL prediction ŷ_S