
Question
Could you help me to solve this question?
Sign language recognition is the task of recognizing the signs performed by deaf people. Sign gestures can be
isolated (one sign per video) or continuous (several signs per video). In this assignment, you will develop a deep
learning model for sign language recognition at the sentence level. The task is similar to video captioning:
the input is a sign video, and the output is the sign(s) performed by the signer, in text form.
An example of the system's input and output is shown below.
[Figure: example system — the frames of a sign video are fed to an encoder, and a decoder produces the output sentence "In the name of Allah".]
In this assignment, you need to do the following:

1. Read and prepare the dataset.
   - The dataset is provided as images (access link).
   - There are 10 different sentences, each performed by three signers.
   - 80 frames are extracted from each video sample.
   - Use word embeddings to represent the ground-truth text when you feed it to the model.
   - For each frame, use MobileNetV2 to extract features from the last fully connected layer before the
     classification layer of the pretrained MobileNetV2 model. You will use these features as inputs for the
     following questions. (Feature-extraction and tokenization sketches follow this item.)
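Below is a minimal sketch of the per-frame feature extraction, assuming PyTorch and torchvision (the assignment does not prescribe a framework). In torchvision's MobileNetV2 the classifier is the only fully connected layer, so the sketch takes the 1280-dimensional globally pooled features immediately before the classification layer as the penultimate representation; the function name `extract_frame_features` and the preprocessing constants are illustrative, not given in the assignment.

```python
import torch
import torchvision
from torchvision import transforms

# Pretrained MobileNetV2; only its convolutional backbone is used.
weights = torchvision.models.MobileNet_V2_Weights.DEFAULT
mobilenet = torchvision.models.mobilenet_v2(weights=weights).eval()

# Standard ImageNet preprocessing expected by the pretrained weights.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_frame_features(pil_frames):
    """Map one video's frames (a list of 80 PIL images) to an (80, 1280) feature tensor."""
    batch = torch.stack([preprocess(f) for f in pil_frames])   # (80, 3, 224, 224)
    fmap = mobilenet.features(batch)                            # (80, 1280, 7, 7)
    pooled = torch.nn.functional.adaptive_avg_pool2d(fmap, 1)   # (80, 1280, 1, 1)
    return pooled.flatten(1)                                    # (80, 1280)
```

Running this once per video and caching the resulting (80, 1280) tensors keeps the later models cheap to train.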
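For the ground-truth sentences, one straightforward preparation is a word-level vocabulary with padding, start-of-sentence, and end-of-sentence tokens; the helper names `build_vocab` and `encode` and the `max_len` value below are illustrative. The resulting token ids are what the decoders' `nn.Embedding` layers turn into word embeddings.

```python
from collections import Counter

SPECIALS = ["<pad>", "<sos>", "<eos>", "<unk>"]

def build_vocab(sentences):
    """Word-level vocabulary over the ground-truth sentences."""
    counter = Counter(w for s in sentences for w in s.lower().split())
    itos = SPECIALS + sorted(counter)
    stoi = {w: i for i, w in enumerate(itos)}
    return stoi, itos

def encode(sentence, stoi, max_len=16):
    """Sentence -> padded list of token ids framed by <sos>/<eos>."""
    ids = [stoi.get(w, stoi["<unk>"]) for w in sentence.lower().split()]
    ids = [stoi["<sos>"]] + ids + [stoi["<eos>"]]
    return ids + [stoi["<pad>"]] * (max_len - len(ids))
```

Pretrained embeddings (e.g., word2vec) could instead be loaded into the embedding layer; the sketches below simply learn the embeddings from scratch.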
2. Encoder-decoder: Develop an encoder-decoder model to recognize the sign video (video captioning).
   - Select the architecture that gives the best results. (A minimal seq2seq sketch follows this item.)
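A minimal encoder-decoder sketch, assuming the cached (80, 1280) frame features and the token ids from the sketches above. The GRU choice, layer sizes, and the class name `Seq2Sign` are illustrative starting points, not a prescribed architecture.

```python
import torch.nn as nn

class Seq2Sign(nn.Module):
    """GRU encoder over the 80 frame-feature vectors, GRU decoder over word embeddings."""
    def __init__(self, vocab_size, feat_dim=1280, emb_dim=128, hid_dim=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hid_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, frame_feats, tgt_ids):
        # frame_feats: (B, 80, 1280); tgt_ids: (B, T) shifted-right target sentence
        _, h = self.encoder(frame_feats)                    # final hidden state summarizes the video
        dec_out, _ = self.decoder(self.embed(tgt_ids), h)   # teacher-forced decoding: (B, T, hid_dim)
        return self.out(dec_out)                            # word logits: (B, T, vocab_size)
```

Training would use teacher forcing: feed the target sentence shifted right and apply `nn.CrossEntropyLoss(ignore_index=pad_id)` to the logits; at inference, decode greedily from `<sos>` until `<eos>`.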
3. Attention: Develop an encoder-decoder model with attention to recognize the sign video (video captioning).
   (An attention sketch follows this item.)
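One way to add attention is Luong-style dot-product attention, where each decoder step attends over all 80 encoder states instead of relying only on the final hidden state; the sketch below applies it in parallel over the teacher-forced decoder outputs. The dimensions and the class name `AttnSeq2Sign` are again illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnSeq2Sign(nn.Module):
    """As Seq2Sign, but every decoder step attends over all 80 encoder states (dot-product attention)."""
    def __init__(self, vocab_size, feat_dim=1280, emb_dim=128, hid_dim=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hid_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(2 * hid_dim, vocab_size)

    def forward(self, frame_feats, tgt_ids):
        enc_out, h = self.encoder(frame_feats)                  # (B, 80, hid), (1, B, hid)
        dec_out, _ = self.decoder(self.embed(tgt_ids), h)       # (B, T, hid)
        scores = torch.bmm(dec_out, enc_out.transpose(1, 2))    # similarity of each word step to each frame
        weights = F.softmax(scores, dim=-1)                     # (B, T, 80) attention weights
        context = torch.bmm(weights, enc_out)                   # (B, T, hid) attended video summary per step
        return self.out(torch.cat([dec_out, context], dim=-1))  # (B, T, vocab_size)
```

The attention weights can also be visualized as a (words x frames) map to check which frames each predicted word attends to.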
4. Report the results of the models using the word error rate (WER) metric. (A WER sketch follows this item.)
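WER is the word-level Levenshtein distance (substitutions + insertions + deletions) between the reference sentence and the predicted sentence, divided by the number of reference words. A small self-contained implementation is sketched below; libraries such as `jiwer` compute the same metric.

```python
def wer(reference, hypothesis):
    """Word error rate: edit distance over words, normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one deleted word out of five reference words -> WER = 0.2
print(wer("in the name of allah", "in name of allah"))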
5. Transformer: Develop a transformer model for the same problem. (A transformer sketch follows this item.)
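For the transformer, PyTorch's built-in `nn.Transformer` can serve as the encoder-decoder backbone: project the 1280-d frame features to the model dimension on the encoder side, embed word ids on the decoder side, and apply a causal mask to the target. The learned positional parameters, layer sizes, and the class name `SignTransformer` are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SignTransformer(nn.Module):
    """Transformer encoder-decoder over frame features (source) and word ids (target)."""
    def __init__(self, vocab_size, feat_dim=1280, d_model=256,
                 nhead=4, num_layers=2, max_len=32):
        super().__init__()
        self.in_proj = nn.Linear(feat_dim, d_model)                     # 1280-d frame feature -> d_model
        self.embed = nn.Embedding(vocab_size, d_model)                  # word id -> d_model
        self.pos_src = nn.Parameter(torch.zeros(1, 80, d_model))        # learned positions for the 80 frames
        self.pos_tgt = nn.Parameter(torch.zeros(1, max_len, d_model))   # learned positions for target words
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            dim_feedforward=4 * d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, frame_feats, tgt_ids):
        # frame_feats: (B, 80, 1280); tgt_ids: (B, T) shifted-right targets
        src = self.in_proj(frame_feats) + self.pos_src[:, :frame_feats.size(1)]
        tgt = self.embed(tgt_ids) + self.pos_tgt[:, :tgt_ids.size(1)]
        causal = nn.Transformer.generate_square_subsequent_mask(
            tgt_ids.size(1)).to(frame_feats.device)                     # no peeking at future words
        dec = self.transformer(src, tgt, tgt_mask=causal)               # (B, T, d_model)
        return self.out(dec)                                            # (B, T, vocab_size)
```

Training mirrors the recurrent models (teacher forcing with cross-entropy over the shifted targets), so all three models can be compared with the same WER evaluation.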