The model structure is shown below.

Wav2Vec2CMSCModel(
  (dropout_input): Dropout(p=0.1, inplace=False)
  (dropout_features): Dropout(p=0.1, inplace=False)
  (encoder): TransformerEncoder(
    (layers): ModuleList(
      (0-11): 12 x TransformerEncoderLayer(
        (self_attn): MultiHeadAttention(
          (dropout): Dropout()
          (k_proj): Linear(in_features=768, out_features=768, bias=True)
          (v_proj): Linear(in_features=768, out_features=768, bias=True)
          (q_proj): Linear(in_features=768, out_features=768, bias=True)
          (out_proj): Linear(in_features=768, out_features=768, bias=True)
        )
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.0, inplace=False)
        (dropout3): Dropout(p=0.1, inplace=False)
        (self_attn_layer_norm): FusedLayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=768, out_features=3072, bias=True)
        (fc2): Linear(in_features=3072, out_features=768, bias=True)
        (final_layer_norm): FusedLayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
      )
    )
    (layer_norm): FusedLayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
  )
  (feature_extractor): ConvFeatureExtraction(
    (conv_layers): ModuleList(
      (0): Sequential(
        (0): Conv1d(12, 256, kernel_size=(2,), stride=(2,), bias=False)
        (1): Dropout(p=0.0, inplace=False)
        (2): Fp32GroupNorm(256, 256, eps=1e-05, affine=True)
        (3): GELU(approximate='none')
      )
      (1-3): 3 x Sequential(
        (0): Conv1d(256, 256, kernel_size=(2,), stride=(2,), bias=False)
        (1): Dropout(p=0.0, inplace=False)
        (2): GELU(approximate='none')
      )
    )
  )
  (post_extract_proj): Linear(in_features=256, out_features=768, bias=True)
  (conv_pos): ConvPositionalEncoding(
    (pos_conv): Sequential(
      (0): Conv1d(768, 768, kernel_size=(128,), stride=(1,), padding=(64,), groups=16)
      (1): SamePad()
      (2): GELU(approximate='none')
    )
  )
  (layer_norm): FusedLayerNorm(torch.Size([256]), eps=1e-05, elementwise_affine=True)
  (quantizer): GumbelVectorQuantizer(
    (weight_proj): Linear(in_features=256, out_features=640, bias=True)
  )
  (project_q): Linear(in_features=256, out_features=256, bias=True)
  (final_proj): Linear(in_features=768, out_features=256, bias=True)
)


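This seems to behave differently from an ordinary trained model.

Even when I run outputs = model_pretrained(source=inputs), the returned keys do not include logits; the model only returns features.

As I understand it, a pretrained foundation model is meant to extract representative features, and specializing it for a downstream task has to be implemented separately through fine-tuning afterwards. Does anyone know a concrete way to do this?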
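For reference, here is a minimal sketch of the kind of fine-tuning setup I have in mind, in case it helps frame an answer. It wraps an encoder with a mean-pooling step and a linear classification head so that the wrapper produces logits. Everything here beyond the `source=` keyword and the "features" output key is an assumption for illustration: `DummyPretrainedEncoder` is a hypothetical stand-in for `model_pretrained`, `FineTuneClassifier` is a made-up name, and the feature shape (batch, time, 768), the 5 classes, and the frozen encoder are guesses rather than anything confirmed by the library.

```python
import torch
import torch.nn as nn


class DummyPretrainedEncoder(nn.Module):
    """Hypothetical stand-in for model_pretrained.

    Assumed contract: called as encoder(source=x) and returns a dict
    with a "features" tensor of shape (batch, time, feature_dim).
    """

    def __init__(self, in_channels=12, feature_dim=768):
        super().__init__()
        # One strided conv mimics the 16x downsampling of four stride-2 layers.
        self.conv = nn.Conv1d(in_channels, feature_dim, kernel_size=16, stride=16)

    def forward(self, source):
        x = self.conv(source)                    # (batch, feature_dim, time)
        return {"features": x.transpose(1, 2)}   # (batch, time, feature_dim)


class FineTuneClassifier(nn.Module):
    """Adds a task-specific head on top of the encoder's features."""

    def __init__(self, encoder, feature_dim, num_classes, freeze_encoder=True):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(feature_dim, num_classes)
        if freeze_encoder:
            # Linear probing: only the head is trained.
            for p in self.encoder.parameters():
                p.requires_grad = False

    def forward(self, source):
        feats = self.encoder(source=source)["features"]
        pooled = feats.mean(dim=1)               # average over the time axis
        return self.head(pooled)                 # (batch, num_classes) logits


torch.manual_seed(0)
encoder = DummyPretrainedEncoder()
model = FineTuneClassifier(encoder, feature_dim=768, num_classes=5)

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One training step on random data: 4 samples, 12 channels, 1600 time steps.
inputs = torch.randn(4, 12, 1600)
labels = torch.randint(0, 5, (4,))

model.train()
logits = model(inputs)
loss = criterion(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

With a real checkpoint, the stand-in would be swapped for the loaded model and `feature_dim` set to the last dimension of the actual features tensor. Passing `freeze_encoder=False` would fine-tune the whole network instead of just the head, usually with a smaller learning rate.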