
x86/x64 SIMD Instruction List (SSE to AVX512) Beta

Instructions that operate on MMX registers (64-bit) are omitted.

S1=SSE  S2=SSE2 S3=SSE3 SS3=SSSE3 S4.1=SSE4.1 S4.2=SSE4.2 V1=AVX V2=AVX2 V5=AVX512

*: changing PS/PD/DQ to SS/SD/SI gives the scalar instruction (it operates only on the lowest data element).
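The difference between a packed instruction and its scalar counterpart can be sketched with ADDPS and ADDSS. This is a minimal illustration, assuming an x86 compiler with SSE enabled; the function names `add_packed` and `add_scalar` are hypothetical and not part of the table.

```c
#include <xmmintrin.h>  /* SSE: _mm_add_ps, _mm_add_ss */

/* Packed add (ADDPS): all four float lanes are summed. */
void add_packed(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* Scalar add (ADDSS): only lane 0 is summed; lanes 1..3 are copied from a. */
void add_scalar(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ss(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the packed form yields {11, 22, 33, 44} and the scalar form yields {11, 2, 3, 4}.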

The names listed in blue under each instruction are the corresponding C/C++ intrinsics.


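This table was made so that you can find a half-remembered instruction name and then look it up in the manual. When programming, do not rely on the contents of this table; always check the manual.

Intel manuals → http://www.intel.co.jp/content/www/jp/ja/processors/architectures-software-developer-manuals.html

If you notice anything that needs correcting, I would be grateful if you let me know through the feedback form or by email at the address at the end of the page.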

 


MOVE     ○MM = XMM/YMM/ZMM

  Integer Floating-Point YMM lane (128-bit)
QWORD DWORD WORD BYTE Double Single Half
Whole ○MM
↑↓
○MM/mem
MOVDQA (S2     for 512-bit, only the instructions with a numeric suffix listed below
_mm_load_si128
_mm_store_si128
MOVDQU (S2
_mm_loadu_si128
_mm_storeu_si128
MOVAPD (S2
_mm_load_pd
_mm_loadr_pd
_mm_store_pd
_mm_storer_pd

MOVUPD (S2
_mm_loadu_pd
_mm_storeu_pd
MOVAPS (S1
_mm_load_ps
_mm_loadr_ps
_mm_store_ps
_mm_storer_ps

MOVUPS (S1
_mm_loadu_ps
_mm_storeu_ps
 
VMOVDQA64 (V5…
_mm_mask_load_epi64
_mm_mask_store_epi64
etc.
VMOVDQU64 (V5…
_mm_mask_loadu_epi64
_mm_mask_storeu_epi64
etc.
VMOVDQA32 (V5…
_mm_mask_load_epi32
_mm_mask_store_epi32
etc.
VMOVDQU32 (V5…
_mm_mask_loadu_epi32
_mm_mask_storeu_epi32
etc.
VMOVDQU16 (V5+BW…
_mm_mask_loadu_epi16
_mm_mask_storeu_epi16
etc.
VMOVDQU8 (V5+BW…
_mm_mask_loadu_epi8
_mm_mask_storeu_epi8
etc.
Upper half of XMM
↑↓
mem
MOVHPD (S2
_mm_loadh_pd
_mm_storeh_pd
MOVHPS (S1
_mm_loadh_pi
_mm_storeh_pi
 
Upper half of XMM
↑↓
Lower half of XMM
MOVHLPS (S1
_mm_movehl_ps
MOVLHPS (S1
_mm_movelh_ps
 
Lower half of XMM
↑↓
mem
        MOVLPD (S2
_mm_loadl_pd
_mm_storel_pd
MOVLPS (S1
_mm_loadl_pi
_mm_storel_pi
   
Lowest element of XMM
↑↓
r/m
MOVQ (S2
_mm_cvtsi64_si128
_mm_cvtsi128_si64
MOVD (S2
_mm_cvtsi32_si128
_mm_cvtsi128_si32
   
Lowest element of XMM
↑↓
XMM/mem
MOVQ (S2
_mm_move_epi64
      MOVSD (S2
_mm_load_sd
_mm_store_sd
_mm_move_sd
MOVSS (S1
_mm_load_ss
_mm_store_ss
_mm_move_ss
   
Whole XMM

One data element
TIP 2
_mm_set1_epi64x
VPBROADCASTQ (V2
_mm_broadcastq_epi64
TIP 2
_mm_set1_epi32
VPBROADCASTD (V2
_mm_broadcastd_epi32
TIP 2
_mm_set1_epi16
VPBROADCASTW (V2
_mm_broadcastw_epi16
_mm_set1_epi8
VPBROADCASTB (V2
_mm_broadcastb_epi8
TIP 2
_mm_set1_pd
_mm_load1_pd
MOVDDUP (S3
_mm_movedup_pd
_mm_loaddup_pd

TIP 2
_mm_set1_ps
_mm_load1_ps

VBROADCASTSS
 from mem (V1
 from XMM (V2
_mm_broadcast_ss
Whole YMM/ZMM

One data element
VPBROADCASTQ (V2
_mm256_broadcastq_epi64
VPBROADCASTD (V2
_mm256_broadcastd_epi32
VPBROADCASTW (V2
_mm256_broadcastw_epi16
VPBROADCASTB (V2
_mm256_broadcastb_epi8
VBROADCASTSD
 from mem (V1
 from XMM (V2
_mm256_broadcast_sd
VBROADCASTSS
 from mem (V1
 from XMM (V2
_mm256_broadcast_ss
  VBROADCASTF128 (V1
_mm256_broadcast_ps
_mm256_broadcast_pd

VBROADCASTI128 (V2
_mm256_broadcastsi128_si256
Whole YMM/ZMM

2/4/8 data elements
VBROADCASTI64X2 (V5+DQ…
_mm512_broadcast_i64x2
VBROADCASTI64X4 (V5
_mm512_broadcast_i64x4
VBROADCASTI32X2 (V5+DQ…
_mm512_broadcast_i32x2
VBROADCASTI32X4 (V5…
_mm512_broadcast_i32x4
VBROADCASTI32X8 (V5+DQ
_mm512_broadcast_i32x8
VBROADCASTF64X2 (V5+DQ…
_mm512_broadcast_f64x2
VBROADCASTF64X4 (V5
_mm512_broadcast_f64x4
VBROADCASTF32X2 (V5+DQ…
_mm512_broadcast_f32x2
VBROADCASTF32X4 (V5…
_mm512_broadcast_f32x4
VBROADCASTF32X8 (V5+DQ
_mm512_broadcast_f32x8
○MM

Multiple data elements
_mm_set_epi64x
_mm_setr_epi64x
_mm_set_epi32
_mm_setr_epi32
_mm_set_epi16
_mm_setr_epi16
_mm_set_epi8
_mm_setr_epi8
_mm_set_pd
_mm_setr_pd
_mm_set_ps
_mm_setr_ps
   
Whole ○MM

Zero
TIP 1
_mm_setzero_si128
TIP 1
_mm_setzero_pd
TIP 1
_mm_setzero_ps
   
extract PEXTRQ (S4.1
_mm_extract_epi64
PEXTRD (S4.1
_mm_extract_epi32
PEXTRW to r (S2
PEXTRW to r/m (S4.1
_mm_extract_epi16
PEXTRB (S4.1
_mm_extract_epi8
  EXTRACTPS (S4.1
_mm_extract_ps
  VEXTRACTF128 (V1
_mm256_extractf128_ps
_mm256_extractf128_pd
_mm256_extractf128_si256

VEXTRACTI128 (V2
_mm256_extracti128_si256
VEXTRACTI64X2 (V5+DQ…
_mm512_extracti64x2_epi64
VEXTRACTI64X4 (V5
_mm512_extracti64x4_epi64
VEXTRACTI32X4 (V5…
_mm512_extracti32x4_epi32
VEXTRACTI32X8 (V5+DQ
_mm512_extracti32x8_epi32
VEXTRACTF64X2 (V5+DQ…
_mm512_extractf64x2_pd
VEXTRACTF64X4 (V5
_mm512_extractf64x4_pd
VEXTRACTF32X4 (V5…
_mm512_extractf32x4_ps
VEXTRACTF32X8 (V5+DQ
_mm512_extractf32x8_ps
insert PINSRQ (S4.1
_mm_insert_epi64
PINSRD (S4.1
_mm_insert_epi32
PINSRW (S2
_mm_insert_epi16
PINSRB (S4.1
_mm_insert_epi8
  INSERTPS (S4.1
_mm_insert_ps
  VINSERTF128 (V1
_mm256_insertf128_ps
_mm256_insertf128_pd
_mm256_insertf128_si256

VINSERTI128 (V2
_mm256_inserti128_si256
VINSERTI64X2 (V5+DQ…
_mm512_inserti64x2
VINSERTI64X4 (V5…
_mm512_inserti64x4
VINSERTI32X4 (V5…
_mm512_inserti32x4
VINSERTI32X8 (V5+DQ
_mm512_inserti32x8
VINSERTF64X2 (V5+DQ…
_mm512_insertf64x2
VINSERTF64X4 (V5
_mm512_insertf64x4
VINSERTF32X4 (V5…
_mm512_insertf32x4
VINSERTF32X8 (V5+DQ
_mm512_insertf32x8
unpack
PUNPCKHQDQ (S2
_mm_unpackhi_epi64
PUNPCKLQDQ (S2
_mm_unpacklo_epi64
PUNPCKHDQ (S2
_mm_unpackhi_epi32
PUNPCKLDQ (S2
_mm_unpacklo_epi32
PUNPCKHWD (S2
_mm_unpackhi_epi16
PUNPCKLWD (S2
_mm_unpacklo_epi16
PUNPCKHBW (S2
_mm_unpackhi_epi8
PUNPCKLBW (S2
_mm_unpacklo_epi8
UNPCKHPD (S2
_mm_unpackhi_pd
UNPCKLPD (S2
_mm_unpacklo_pd
UNPCKHPS (S1
_mm_unpackhi_ps
UNPCKLPS (S1
_mm_unpacklo_ps
   
shuffle/permute
VPERMQ (V2
_mm256_permute4x64_epi64
VPERMI2Q (V5…
_mm_permutex2var_epi64
PSHUFD (S2
_mm_shuffle_epi32
VPERMD (V2
_mm256_permutevar8x32_epi32
_mm256_permutexvar_epi32
VPERMI2D (V5…
_mm_permutex2var_epi32
PSHUFHW (S2
_mm_shufflehi_epi16
PSHUFLW (S2
_mm_shufflelo_epi16
VPERMW (V5+BW…
_mm_permutexvar_epi16
VPERMI2W (V5+BW…
_mm_permutex2var_epi16
PSHUFB (SS3
_mm_shuffle_epi8
SHUFPD (S2
_mm_shuffle_pd
VPERMILPD (V1
_mm_permute_pd
_mm_permutevar_pd

VPERMPD (V2
_mm256_permute4x64_pd
VPERMI2PD (V5…
_mm_permutex2var_pd
SHUFPS (S1
_mm_shuffle_ps
VPERMILPS (V1
_mm_permute_ps
_mm_permutevar_ps

VPERMPS (V2
_mm256_permutevar8x32_ps
VPERMI2PS (V5…
_mm_permutex2var_ps
  VPERM2F128 (V1
_mm256_permute2f128_ps
_mm256_permute2f128_pd
_mm256_permute2f128_si256

VPERM2I128 (V2
_mm256_permute2x128_si256
VSHUFI64X2 (V5…
_mm512_shuffle_i64x2
VSHUFI32X4 (V5…
_mm512_shuffle_i32x4
VSHUFF64X2 (V5…
_mm512_shuffle_f64x2
VSHUFF32X4 (V5…
_mm512_shuffle_f32x4
blend
VPBLENDMQ (V5…
_mm_mask_blend_epi64
VPBLENDD (V2
_mm_blend_epi32
VPBLENDMD (V5…
_mm_mask_blend_epi32
PBLENDW (S4.1
_mm_blend_epi16
VPBLENDMW (V5+BW…
_mm_mask_blend_epi16
PBLENDVB (S4.1
_mm_blendv_epi8
VPBLENDMB (V5+BW…
_mm_mask_blend_epi8
BLENDPD (S4.1
_mm_blend_pd
BLENDVPD (S4.1
_mm_blendv_pd
VBLENDMPD (V5…
_mm_mask_blend_pd
BLENDPS (S4.1
_mm_blend_ps
BLENDVPS (S4.1
_mm_blendv_ps
VBLENDMPS (V5…
_mm_mask_blend_ps
   
move and duplicate MOVDDUP (S3
_mm_movedup_pd
_mm_loaddup_pd
MOVSHDUP (S3
_mm_movehdup_ps
MOVSLDUP (S3
_mm_moveldup_ps
 
mask move VPMASKMOVQ (V2
_mm_maskload_epi64
_mm_maskstore_epi64
VPMASKMOVD (V2
_mm_maskload_epi32
_mm_maskstore_epi32
    VMASKMOVPD (V1
_mm_maskload_pd
_mm_maskstore_pd
VMASKMOVPS (V1
_mm_maskload_ps
_mm_maskstore_ps
   
extract most significant bit       PMOVMSKB (S2
_mm_movemask_epi8
MOVMSKPD (S2
_mm_movemask_pd
MOVMSKPS (S1
_mm_movemask_ps
   
VPMOVQ2M (V5+DQ…
_mm_movepi64_mask
VPMOVD2M (V5+DQ…
_mm_movepi32_mask
VPMOVW2M (V5+BW…
_mm_movepi16_mask
VPMOVB2M (V5+BW…
_mm_movepi8_mask
gather
VPGATHERDQ (V2
_mm_i32gather_epi64
_mm_mask_i32gather_epi64

VPGATHERQQ (V2
_mm_i64gather_epi64
_mm_mask_i64gather_epi64
VPGATHERDD (V2
_mm_i32gather_epi32
_mm_mask_i32gather_epi32

VPGATHERQD (V2
_mm_i64gather_epi32
_mm_mask_i64gather_epi32
    VGATHERDPD (V2
_mm_i32gather_pd
_mm_mask_i32gather_pd

VGATHERQPD (V2
_mm_i64gather_pd
_mm_mask_i64gather_pd
VGATHERDPS (V2
_mm_i32gather_ps
_mm_mask_i32gather_ps

VGATHERQPS (V2
_mm_i64gather_ps
_mm_mask_i64gather_ps
   
scatter
VPSCATTERDQ (V5…
_mm_i32scatter_epi64
_mm_mask_i32scatter_epi64

VPSCATTERQQ (V5…
_mm_i64scatter_epi64
_mm_mask_i64scatter_epi64
VPSCATTERDD (V5…
_mm_i32scatter_epi32
_mm_mask_i32scatter_epi32

VPSCATTERQD (V5…
_mm_i64scatter_epi32
_mm_mask_i64scatter_epi32
    VSCATTERDPD (V5…
_mm_i32scatter_pd
_mm_mask_i32scatter_pd

VSCATTERQPD (V5…
_mm_i64scatter_pd
_mm_mask_i64scatter_pd
VSCATTERDPS (V5…
_mm_i32scatter_ps
_mm_mask_i32scatter_ps

VSCATTERQPS (V5…
_mm_i64scatter_ps
_mm_mask_i64scatter_ps
   
compress
VPCOMPRESSQ (V5…
_mm_mask_compress_epi64
_mm_mask_compressstoreu_epi64
VPCOMPRESSD (V5…
_mm_mask_compress_epi32
_mm_mask_compressstoreu_epi32
VCOMPRESSPD (V5…
_mm_mask_compress_pd
_mm_mask_compressstoreu_pd
VCOMPRESSPS (V5…
_mm_mask_compress_ps
_mm_mask_compressstoreu_ps
expand
VPEXPANDQ (V5…
_mm_mask_expand_epi64
_mm_mask_expandloadu_epi64
VPEXPANDD (V5…
_mm_mask_expand_epi32
_mm_mask_expandloadu_epi32
VEXPANDPD (V5…
_mm_mask_expand_pd
_mm_mask_expandloadu_pd
VEXPANDPS (V5…
_mm_mask_expand_ps
_mm_mask_expandloadu_ps
align right VALIGNQ (V5…
_mm_alignr_epi64
VALIGND (V5…
_mm_alignr_epi32
PALIGNR (SS3
_mm_alignr_epi8
expand Opmask bits VPMOVM2Q (V5+DQ…
_mm_movm_epi64
VPMOVM2D (V5+DQ…
_mm_movm_epi32
VPMOVM2W (V5+BW…
_mm_movm_epi16
VPMOVM2B (V5+BW…
_mm_movm_epi8

 

Conversion

from \ to Integer Floating-Point
QWORD DWORD WORD BYTE Double Single Half
Integer QWORD VPMOVQD (V5…
_mm_cvtepi64_epi32
VPMOVSQD (V5…
_mm_cvtsepi64_epi32
VPMOVUSQD (V5…
_mm_cvtusepi64_epi32
VPMOVQW (V5…
_mm_cvtepi64_epi16
VPMOVSQW (V5…
_mm_cvtsepi64_epi16
VPMOVUSQW (V5…
_mm_cvtusepi64_epi16
VPMOVQB (V5…
_mm_cvtepi64_epi8
VPMOVSQB (V5…
_mm_cvtsepi64_epi8
VPMOVUSQB (V5…
_mm_cvtusepi64_epi8
CVTSI2SD (S2 scalar only
_mm_cvtsi64_sd
VCVTQQ2PD* (V5+DQ…
_mm_cvtepi64_pd
VCVTUQQ2PD* (V5+DQ…
_mm_cvtepu64_pd
CVTSI2SS (S1 scalar only
_mm_cvtsi64_ss
VCVTQQ2PS* (V5+DQ…
_mm_cvtepi64_ps
VCVTUQQ2PS* (V5+DQ…
_mm_cvtepu64_ps
DWORD TIP 3
PMOVSXDQ (S4.1
_mm_cvtepi32_epi64
PMOVZXDQ (S4.1
_mm_cvtepu32_epi64
  PACKSSDW (S2
_mm_packs_epi32
PACKUSDW (S4.1
_mm_packus_epi32
VPMOVDW (V5…
_mm_cvtepi32_epi16
VPMOVSDW (V5…
_mm_cvtsepi32_epi16
VPMOVUSDW (V5…
_mm_cvtusepi32_epi16
VPMOVDB (V5…
_mm_cvtepi32_epi8
VPMOVSDB (V5…
_mm_cvtsepi32_epi8
VPMOVUSDB (V5…
_mm_cvtusepi32_epi8
CVTDQ2PD* (S2
_mm_cvtepi32_pd
VCVTUDQ2PD* (V5…
_mm_cvtepu32_pd
CVTDQ2PS* (S2
_mm_cvtepi32_ps
VCVTUDQ2PS* (V5…
_mm_cvtepu32_ps
WORD PMOVSXWQ (S4.1
_mm_cvtepi16_epi64
PMOVZXWQ (S4.1
_mm_cvtepu16_epi64
TIP 3
PMOVSXWD (S4.1
_mm_cvtepi16_epi32
PMOVZXWD (S4.1
_mm_cvtepu16_epi32
PACKSSWB (S2
_mm_packs_epi16
PACKUSWB (S2
_mm_packus_epi16
VPMOVWB (V5+BW…
_mm_cvtepi16_epi8
VPMOVSWB (V5+BW…
_mm_cvtsepi16_epi8
VPMOVUSWB (V5+BW…
_mm_cvtusepi16_epi8
BYTE PMOVSXBQ (S4.1
_mm_cvtepi8_epi64
PMOVZXBQ (S4.1
_mm_cvtepu8_epi64
PMOVSXBD (S4.1
_mm_cvtepi8_epi32
PMOVZXBD (S4.1
_mm_cvtepu8_epi32
TIP 3
PMOVSXBW (S4.1
_mm_cvtepi8_epi16
PMOVZXBW (S4.1
_mm_cvtepu8_epi16
Floating-Point Double CVTSD2SI / CVTTSD2SI (S2 scalar only
_mm_cvtsd_si64 / _mm_cvttsd_si64
VCVTPD2QQ* / VCVTTPD2QQ* (V5+DQ…
_mm_cvtpd_epi64 / _mm_cvttpd_epi64
VCVTPD2UQQ* / VCVTTPD2UQQ* (V5+DQ…
_mm_cvtpd_epu64 / _mm_cvttpd_epu64
the one on the right truncates the fractional part
CVTPD2DQ* / CVTTPD2DQ* (S2
_mm_cvtpd_epi32 / _mm_cvttpd_epi32
VCVTPD2UDQ* / VCVTTPD2UDQ* (V5…
_mm_cvtpd_epu32 / _mm_cvttpd_epu32
the one on the right truncates the fractional part
CVTPD2PS* (S2
_mm_cvtpd_ps
Single CVTSS2SI / CVTTSS2SI (S1 scalar only
_mm_cvtss_si64 / _mm_cvttss_si64
VCVTPS2QQ* / VCVTTPS2QQ* (V5+DQ…
_mm_cvtps_epi64 / _mm_cvttps_epi64
VCVTPS2UQQ* / VCVTTPS2UQQ* (V5+DQ…
_mm_cvtps_epu64 / _mm_cvttps_epu64
the one on the right truncates the fractional part
CVTPS2DQ* / CVTTPS2DQ* (S2
_mm_cvtps_epi32 / _mm_cvttps_epi32
VCVTPS2UDQ* / VCVTTPS2UDQ* (V5…
_mm_cvtps_epu32 / _mm_cvttps_epu32
the one on the right truncates the fractional part
  CVTPS2PD* (S2
_mm_cvtps_pd
VCVTPS2PH (F16C
_mm_cvtps_ph
Half VCVTPH2PS (F16C
_mm_cvtph_ps

 

Arithmetic Operations

  Integer Floating-Point
QWORD DWORD WORD BYTE Double Single Half
add PADDQ (S2
_mm_add_epi64
PADDD (S2
_mm_add_epi32
PADDW (S2
_mm_add_epi16
PADDSW (S2
_mm_adds_epi16
PADDUSW (S2
_mm_adds_epu16
PADDB (S2
_mm_add_epi8
PADDSB (S2
_mm_adds_epi8
PADDUSB (S2
_mm_adds_epu8
ADDPD* (S2
_mm_add_pd
ADDPS* (S1
_mm_add_ps
sub PSUBQ (S2
_mm_sub_epi64
PSUBD (S2
_mm_sub_epi32
PSUBW (S2
_mm_sub_epi16
PSUBSW (S2
_mm_subs_epi16
PSUBUSW (S2
_mm_subs_epu16
PSUBB (S2
_mm_sub_epi8
PSUBSB (S2
_mm_subs_epi8
PSUBUSB (S2
_mm_subs_epu8
SUBPD* (S2
_mm_sub_pd
SUBPS* (S1
_mm_sub_ps
 
mul VPMULLQ (V5+DQ…
_mm_mullo_epi64
PMULDQ (S4.1
_mm_mul_epi32
PMULUDQ (S2
_mm_mul_epu32
PMULLD (S4.1
_mm_mullo_epi32
PMULHW (S2
_mm_mulhi_epi16
PMULHUW (S2
_mm_mulhi_epu16
PMULLW (S2
_mm_mullo_epi16
MULPD* (S2
_mm_mul_pd
MULPS* (S1
_mm_mul_ps
div DIVPD* (S2
_mm_div_pd
DIVPS* (S1
_mm_div_ps
reciprocal         VRCP14PD* (V5…
_mm_rcp14_pd
VRCP28PD* (V5+ER
_mm512_rcp28_pd
RCPPS* (S1
_mm_rcp_ps
VRCP14PS* (V5…
_mm_rcp14_ps
VRCP28PS* (V5+ER
_mm512_rcp28_ps
 
square root         SQRTPD* (S2
_mm_sqrt_pd
SQRTPS* (S1
_mm_sqrt_ps
 
reciprocal of square root         VRSQRT14PD* (V5…
_mm_rsqrt14_pd
VRSQRT28PD* (V5+ER
_mm512_rsqrt28_pd
RSQRTPS* (S1
_mm_rsqrt_ps
VRSQRT14PS* (V5…
_mm_rsqrt14_ps
VRSQRT28PS* (V5+ER
_mm512_rsqrt28_ps
 
power of 2         VEXP2PD* (V5+ER
_mm512_exp2a23_round_pd
VEXP2PS* (V5+ER
_mm512_exp2a23_round_ps
 
multiply by 2^n VSCALEFPD* (V5…
_mm_scalef_pd
VSCALEFPS* (V5…
_mm_scalef_ps
max TIP 8
VPMAXSQ (V5…
_mm_max_epi64
VPMAXUQ (V5…
_mm_max_epu64
TIP 8
PMAXSD (S4.1
_mm_max_epi32
PMAXUD (S4.1
_mm_max_epu32
PMAXSW (S2
_mm_max_epi16
PMAXUW (S4.1
_mm_max_epu16
TIP 8
PMAXSB (S4.1
_mm_max_epi8
PMAXUB (S2
_mm_max_epu8
TIP 8
MAXPD* (S2
_mm_max_pd
TIP 8
MAXPS* (S1
_mm_max_ps
 
min TIP 8
VPMINSQ (V5…
_mm_min_epi64
VPMINUQ (V5…
_mm_min_epu64
TIP 8
PMINSD (S4.1
_mm_min_epi32
PMINUD (S4.1
_mm_min_epu32
PMINSW (S2
_mm_min_epi16
PMINUW (S4.1
_mm_min_epu16
TIP 8
PMINSB (S4.1
_mm_min_epi8
PMINUB (S2
_mm_min_epu8
TIP 8
MINPD* (S2
_mm_min_pd
TIP 8
MINPS* (S1
_mm_min_ps
average     PAVGW (S2
_mm_avg_epu16
PAVGB (S2
_mm_avg_epu8
     
absolute value TIP 4
VPABSQ (V5…
_mm_abs_epi64
TIP 4
PABSD (SS3
_mm_abs_epi32
TIP 4
PABSW (SS3
_mm_abs_epi16
TIP 4
PABSB (SS3
_mm_abs_epi8
TIP 5 TIP 5
sign operation   PSIGND (SS3
_mm_sign_epi32
PSIGNW (SS3
_mm_sign_epi16
PSIGNB (SS3
_mm_sign_epi8
     
rounding         ROUNDPD* (S4.1
_mm_round_pd
_mm_floor_pd
_mm_ceil_pd

VRNDSCALEPD* (V5…
_mm_roundscale_pd
ROUNDPS* (S4.1
_mm_round_ps
_mm_floor_ps
_mm_ceil_ps

VRNDSCALEPS* (V5…
_mm_roundscale_ps
 
difference from rounded value         VREDUCEPD* (V5+DQ…
_mm_reduce_pd
VREDUCEPS* (V5+DQ…
_mm_reduce_ps
 
add/sub         ADDSUBPD (S3
_mm_addsub_pd
ADDSUBPS (S3
_mm_addsub_ps
 
horizontal add   PHADDD (SS3
_mm_hadd_epi32
PHADDW (SS3
_mm_hadd_epi16
PHADDSW (SS3
_mm_hadds_epi16
  HADDPD (S3
_mm_hadd_pd
HADDPS (S3
_mm_hadd_ps
 
horizontal sub   PHSUBD (SS3
_mm_hsub_epi32
PHSUBW (SS3
_mm_hsub_epi16
PHSUBSW (SS3
_mm_hsubs_epi16
  HSUBPD (S3
_mm_hsub_pd
HSUBPS (S3
_mm_hsub_ps
 
dot product         DPPD (S4.1
_mm_dp_pd
DPPS (S4.1
_mm_dp_ps
 
multiply and add PMADDWD (S2
_mm_madd_epi16
PMADDUBSW (SS3
_mm_maddubs_epi16
fused multiply and add / sub         VFMADDxxxPD* (FMA
_mm_fmadd_pd
VFMSUBxxxPD* (FMA
_mm_fmsub_pd
VFMADDSUBxxxPD (FMA
_mm_fmaddsub_pd
VFMSUBADDxxxPD (FMA
_mm_fmsubadd_pd
VFNMADDxxxPD* (FMA
_mm_fnmadd_pd
VFNMSUBxxxPD* (FMA
_mm_fnmsub_pd
xxx=132/213/231
VFMADDxxxPS* (FMA
_mm_fmadd_ps
VFMSUBxxxPS* (FMA
_mm_fmsub_ps
VFMADDSUBxxxPS (FMA
_mm_fmaddsub_ps
VFMSUBADDxxxPS (FMA
_mm_fmsubadd_ps
VFNMADDxxxPS* (FMA
_mm_fnmadd_ps
VFNMSUBxxxPS* (FMA
_mm_fnmsub_ps
xxx=132/213/231
 

 

Comparison

  Integer
QWORD DWORD WORD BYTE
compare for == PCMPEQQ (S4.1
_mm_cmpeq_epi64
_mm_cmpeq_epi64_mask (V5…
VPCMPEQUQ (V5…
_mm_cmpeq_epu64_mask
PCMPEQD (S2
_mm_cmpeq_epi32
_mm_cmpeq_epi32_mask (V5…
VPCMPEQUD (V5…
_mm_cmpeq_epu32_mask
PCMPEQW (S2
_mm_cmpeq_epi16
_mm_cmpeq_epi16_mask (V5+BW…
VPCMPEQUW (V5+BW…
_mm_cmpeq_epu16_mask
PCMPEQB (S2
_mm_cmpeq_epi8
_mm_cmpeq_epi8_mask (V5+BW…
VPCMPEQUB (V5+BW…
_mm_cmpeq_epu8_mask
compare for < VPCMPLTQ (V5…
_mm_cmplt_epi64_mask
VPCMPLTUQ (V5…
_mm_cmplt_epu64_mask
VPCMPLTD (V5…
_mm_cmplt_epi32_mask
VPCMPLTUD (V5…
_mm_cmplt_epu32_mask
VPCMPLTW (V5+BW…
_mm_cmplt_epi16_mask
VPCMPLTUW (V5+BW…
_mm_cmplt_epu16_mask
VPCMPLTB (V5+BW…
_mm_cmplt_epi8_mask
VPCMPLTUB (V5+BW…
_mm_cmplt_epu8_mask
compare for <= VPCMPLEQ (V5…
_mm_cmple_epi64_mask
VPCMPLEUQ (V5…
_mm_cmple_epu64_mask
VPCMPLED (V5…
_mm_cmple_epi32_mask
VPCMPLEUD (V5…
_mm_cmple_epu32_mask
VPCMPLEW (V5+BW…
_mm_cmple_epi16_mask
VPCMPLEUW (V5+BW…
_mm_cmple_epu16_mask
VPCMPLEB (V5+BW…
_mm_cmple_epi8_mask
VPCMPLEUB (V5+BW…
_mm_cmple_epu8_mask
compare for > PCMPGTQ (S4.2
_mm_cmpgt_epi64
VPCMPNLEQ (V5…
_mm_cmpgt_epi64_mask
VPCMPNLEUQ (V5…
_mm_cmpgt_epu64_mask
PCMPGTD (S2
_mm_cmpgt_epi32
VPCMPNLED (V5…
_mm_cmpgt_epi32_mask
VPCMPNLEUD (V5…
_mm_cmpgt_epu32_mask
PCMPGTW (S2
_mm_cmpgt_epi16
VPCMPNLEW (V5+BW…
_mm_cmpgt_epi16_mask
VPCMPNLEUW (V5+BW…
_mm_cmpgt_epu16_mask
PCMPGTB (S2
_mm_cmpgt_epi8
VPCMPNLEB (V5+BW…
_mm_cmpgt_epi8_mask
VPCMPNLEUB (V5+BW…
_mm_cmpgt_epu8_mask
compare for >= VPCMPNLTQ (V5…
_mm_cmpge_epi64_mask
VPCMPNLTUQ (V5…
_mm_cmpge_epu64_mask
VPCMPNLTD (V5…
_mm_cmpge_epi32_mask
VPCMPNLTUD (V5…
_mm_cmpge_epu32_mask
VPCMPNLTW (V5+BW…
_mm_cmpge_epi16_mask
VPCMPNLTUW (V5+BW…
_mm_cmpge_epu16_mask
VPCMPNLTB (V5+BW…
_mm_cmpge_epi8_mask
VPCMPNLTUB (V5+BW…
_mm_cmpge_epu8_mask
compare for != VPCMPNEQQ (V5…
_mm_cmpneq_epi64_mask
VPCMPNEQUQ (V5…
_mm_cmpneq_epu64_mask
VPCMPNEQD (V5…
_mm_cmpneq_epi32_mask
VPCMPNEQUD (V5…
_mm_cmpneq_epu32_mask
VPCMPNEQW (V5+BW…
_mm_cmpneq_epi16_mask
VPCMPNEQUW (V5+BW…
_mm_cmpneq_epu16_mask
VPCMPNEQB (V5+BW…
_mm_cmpneq_epi8_mask
VPCMPNEQUB (V5+BW…
_mm_cmpneq_epu8_mask

 

Floating-Point
Double Single Half
When one or both operands are NaN   condition false   condition true   condition false   condition true  
Exception on QNaN YES NO YES NO YES NO YES NO  
compare for == VCMPEQ_OSPD* (V1
_mm_cmp_pd
CMPEQPD* (S2
_mm_cmpeq_pd
VCMPEQ_USPD* (V1
_mm_cmp_pd
VCMPEQ_UQPD* (V1
_mm_cmp_pd
VCMPEQ_OSPS* (V1
_mm_cmp_ps
CMPEQPS* (S1
_mm_cmpeq_ps
VCMPEQ_USPS* (V1
_mm_cmp_ps
VCMPEQ_UQPS* (V1
_mm_cmp_ps
 
compare for < CMPLTPD* (S2
_mm_cmplt_pd
VCMPLT_OQPD* (V1
_mm_cmp_pd
    CMPLTPS* (S1
_mm_cmplt_ps
VCMPLT_OQPS* (V1
_mm_cmp_ps
     
compare for <= CMPLEPD* (S2
_mm_cmple_pd
VCMPLE_OQPD* (V1
_mm_cmp_pd
CMPLEPS* (S1
_mm_cmple_ps
VCMPLE_OQPS* (V1
_mm_cmp_ps
 
compare for > VCMPGTPD* (V1
_mm_cmpgt_pd (S2
VCMPGT_OQPD* (V1
_mm_cmp_pd
    VCMPGTPS* (V1
_mm_cmpgt_ps (S1
VCMPGT_OQPS* (V1
_mm_cmp_ps
     
compare for >= VCMPGEPD* (V1
_mm_cmpge_pd (S2
VCMPGE_OQPD* (V1
_mm_cmp_pd
    VCMPGEPS* (V1
_mm_cmpge_ps (S1
VCMPGE_OQPS* (V1
_mm_cmp_ps
     
compare for != VCMPNEQ_OSPD* (V1
_mm_cmp_pd
VCMPNEQ_OQPD* (V1
_mm_cmp_pd
VCMPNEQ_USPD* (V1
_mm_cmp_pd
CMPNEQPD* (S2
_mm_cmpneq_pd
VCMPNEQ_OSPS* (V1
_mm_cmp_ps
VCMPNEQ_OQPS* (V1
_mm_cmp_ps
VCMPNEQ_USPS* (V1
_mm_cmp_ps
CMPNEQPS* (S1
_mm_cmpneq_ps
 
compare for ! < CMPNLTPD* (S2
_mm_cmpnlt_pd
VCMPNLT_UQPD* (V1
_mm_cmp_pd
CMPNLTPS* (S1
_mm_cmpnlt_ps
VCMPNLT_UQPS* (V1
_mm_cmp_ps
 
compare for ! <=     CMPNLEPD* (S2
_mm_cmpnle_pd
VCMPNLE_UQPD* (V1
_mm_cmp_pd
    CMPNLEPS* (S1
_mm_cmpnle_ps
VCMPNLE_UQPS* (V1
_mm_cmp_ps
 
compare for ! > VCMPNGTPD* (V1
_mm_cmpngt_pd (S2
VCMPNGT_UQPD* (V1
_mm_cmp_pd
VCMPNGTPS* (V1
_mm_cmpngt_ps (S1
VCMPNGT_UQPS* (V1
_mm_cmp_ps
 
compare for ! >=     VCMPNGEPD* (V1
_mm_cmpnge_pd (S2
VCMPNGE_UQPD* (V1
_mm_cmp_pd
    VCMPNGEPS* (V1
_mm_cmpnge_ps (S1
VCMPNGE_UQPS* (V1
_mm_cmp_ps
 
compare for ordered VCMPORD_SPD* (V1
_mm_cmp_pd
CMPORDPD* (S2
_mm_cmpord_pd
VCMPORD_SPS* (V1
_mm_cmp_ps
CMPORDPS* (S1
_mm_cmpord_ps
 
compare for unordered     VCMPUNORD_SPD* (V1
_mm_cmp_pd
CMPUNORDPD* (S2
_mm_cmpunord_pd
    VCMPUNORD_SPS* (V1
_mm_cmp_ps
CMPUNORDPS* (S1
_mm_cmpunord_ps
 
TRUE VCMPTRUE_USPD* (V1
_mm_cmp_pd
VCMPTRUEPD* (V1
_mm_cmp_pd
VCMPTRUE_USPS* (V1
_mm_cmp_ps
VCMPTRUEPS* (V1
_mm_cmp_ps
 
FALSE VCMPFALSE_OSPD* (V1
_mm_cmp_pd
VCMPFALSEPD* (V1
_mm_cmp_pd
    VCMPFALSE_OSPS* (V1
_mm_cmp_ps
VCMPFALSEPS* (V1
_mm_cmp_ps
     

 

  Floating-Point
Double Single Half
scalar compare and
set the result in the flags
COMISD (S2
_mm_comieq_sd
_mm_comilt_sd
_mm_comile_sd
_mm_comigt_sd
_mm_comige_sd
_mm_comineq_sd

UCOMISD (S2
_mm_ucomieq_sd
_mm_ucomilt_sd
_mm_ucomile_sd
_mm_ucomigt_sd
_mm_ucomige_sd
_mm_ucomineq_sd
COMISS (S1
_mm_comieq_ss
_mm_comilt_ss
_mm_comile_ss
_mm_comigt_ss
_mm_comige_ss
_mm_comineq_ss

UCOMISS (S1
_mm_ucomieq_ss
_mm_ucomilt_ss
_mm_ucomile_ss
_mm_ucomigt_ss
_mm_ucomige_ss
_mm_ucomineq_ss
 

 

Bitwise Logical Operations

  Integer Floating-Point
QWORD DWORD WORD BYTE Double Single Half
and PAND (S2
_mm_and_si128
ANDPD (S2
_mm_and_pd
ANDPS (S1
_mm_and_ps
 
VPANDQ (V5…
_mm512_and_epi64
etc.
VPANDD (V5…
_mm512_and_epi32
etc.
and not PANDN (S2
_mm_andnot_si128
ANDNPD (S2
_mm_andnot_pd
ANDNPS (S1
_mm_andnot_ps
 
VPANDNQ (V5…
_mm512_andnot_epi64
etc.
VPANDND (V5…
_mm512_andnot_epi32
etc.
or POR (S2
_mm_or_si128
ORPD (S2
_mm_or_pd
ORPS (S1
_mm_or_ps
 
VPORQ (V5…
_mm512_or_epi64
etc.
VPORD (V5…
_mm512_or_epi32
etc.
xor PXOR (S2
_mm_xor_si128
XORPD (S2
_mm_xor_pd
XORPS (S1
_mm_xor_ps
VPXORQ (V5…
_mm512_xor_epi64
etc.
VPXORD (V5…
_mm512_xor_epi32
etc.
test PTEST (S4.1
_mm_testz_si128
_mm_testc_si128
_mm_testnzc_si128
VTESTPD (V1
_mm_testz_pd
_mm_testc_pd
_mm_testnzc_pd
VTESTPS (V1
_mm_testz_ps
_mm_testc_ps
_mm_testnzc_ps
 
VPTESTMQ (V5…
_mm_test_epi64_mask
VPTESTNMQ (V5…
_mm_testn_epi64_mask
VPTESTMD (V5…
_mm_test_epi32_mask
VPTESTNMD (V5…
_mm_testn_epi32_mask
VPTESTMW (V5+BW…
_mm_test_epi16_mask
VPTESTNMW (V5+BW…
_mm_testn_epi16_mask
VPTESTMB (V5+BW…
_mm_test_epi8_mask
VPTESTNMB (V5+BW…
_mm_testn_epi8_mask
ternary logic VPTERNLOGQ (V5…
_mm_ternarylogic_epi64
VPTERNLOGD (V5…
_mm_ternarylogic_epi32

 

Bit Shift / Rotate

  Integer
QWORD DWORD WORD BYTE
shift left logical PSLLQ (S2
_mm_slli_epi64
_mm_sll_epi64
PSLLD (S2
_mm_slli_epi32
_mm_sll_epi32
PSLLW (S2
_mm_slli_epi16
_mm_sll_epi16
 
VPSLLVQ (V2
_mm_sllv_epi64
VPSLLVD (V2
_mm_sllv_epi32
VPSLLVW (V5+BW…
_mm_sllv_epi16
 
shift right logical PSRLQ (S2
_mm_srli_epi64
_mm_srl_epi64
PSRLD (S2
_mm_srli_epi32
_mm_srl_epi32
PSRLW (S2
_mm_srli_epi16
_mm_srl_epi16
 
VPSRLVQ (V2
_mm_srlv_epi64
VPSRLVD (V2
_mm_srlv_epi32
VPSRLVW (V5+BW…
_mm_srlv_epi16
 
shift right arithmetic VPSRAQ (V5…
_mm_srai_epi64
_mm_sra_epi64
PSRAD (S2
_mm_srai_epi32
_mm_sra_epi32
PSRAW (S2
_mm_srai_epi16
_mm_sra_epi16
 
VPSRAVQ (V5…
_mm_srav_epi64
VPSRAVD (V2
_mm_srav_epi32
VPSRAVW (V5+BW…
_mm_srav_epi16
 
rotate left VPROLQ (V5…
_mm_rol_epi64
VPROLD (V5…
_mm_rol_epi32
VPROLVQ (V5…
_mm_rolv_epi64
VPROLVD (V5…
_mm_rolv_epi32
rotate right VPRORQ (V5…
_mm_ror_epi64
VPRORD (V5…
_mm_ror_epi32
VPRORVQ (V5…
_mm_rorv_epi64
VPRORVD (V5…
_mm_rorv_epi32

 

Byte Shift

128bit
shift left logical PSLLDQ (S2
_mm_slli_si128
shift right logical PSRLDQ (S2
_mm_srli_si128
packed align right PALIGNR (SS3
_mm_alignr_epi8

 

String Comparison

explicit length implicit length
return index PCMPESTRI (S4.2
_mm_cmpestri
_mm_cmpestra
_mm_cmpestrc
_mm_cmpestro
_mm_cmpestrs
_mm_cmpestrz
PCMPISTRI (S4.2
_mm_cmpistri
_mm_cmpistra
_mm_cmpistrc
_mm_cmpistro
_mm_cmpistrs
_mm_cmpistrz
return mask PCMPESTRM (S4.2
_mm_cmpestrm
_mm_cmpestra
_mm_cmpestrc
_mm_cmpestro
_mm_cmpestrs
_mm_cmpestrz
PCMPISTRM (S4.2
_mm_cmpistrm
_mm_cmpistra
_mm_cmpistrc
_mm_cmpistro
_mm_cmpistrs
_mm_cmpistrz

 

Others

LDMXCSR (S1
_mm_setcsr
Load MXCSR register
STMXCSR (S1
_mm_getcsr
Save MXCSR register state

PSADBW (S2
_mm_sad_epu8
Compute sum of absolute differences
MPSADBW (S4.1
_mm_mpsadbw_epu8
Performs eight 4-byte wide Sum of Absolute Differences operations to produce eight word integers.
VDBPSADBW (V5+BW…
_mm_dbsad_epu8
Double Block Packed Sum-Absolute-Differences (SAD) on Unsigned Bytes

PMULHRSW (SS3
_mm_mulhrs_epi16
Packed Multiply High with Round and Scale

PHMINPOSUW (S4.1
_mm_minpos_epu16
Finds the value and location of the minimum unsigned word from one of 8 horizontally packed unsigned words. The resulting value and location (offset within the source) are packed into the low dword of the destination XMM register.

VPCONFLICTQ (V5+CD…
_mm512_conflict_epi64
VPCONFLICTD (V5+CD…
_mm512_conflict_epi32
Detect Conflicts Within a Vector of Packed Dword/Qword Values into Dense Memory/ Register

VPLZCNTQ (V5+CD…
_mm_lzcnt_epi64
VPLZCNTD (V5+CD…
_mm_lzcnt_epi32
Count the Number of Leading Zero Bits for Packed Dword, Packed Qword Values

VFIXUPIMMPD* (V5…
_mm512_fixupimm_pd
VFIXUPIMMPS* (V5…
_mm512_fixupimm_ps
Fix Up Special Packed Float64/32 Values
VFPCLASSPD* (V5+DQ…
_mm512_fpclass_pd_mask
VFPCLASSPS* (V5+DQ…
_mm512_fpclass_ps_mask
Tests Types Of a Packed Float64/32 Values
VRANGEPD* (V5+DQ…
_mm_range_pd
VRANGEPS* (V5+DQ…
_mm_range_ps
Range Restriction Calculation For Packed Pairs of Float64/32 Values
VGETEXPPD* (V5…
_mm512_getexp_pd
VGETEXPPS* (V5…
_mm512_getexp_ps
Convert Exponents of Packed DP/SP FP Values to FP Values
VGETMANTPD* (V5…
_mm512_getmant_pd
VGETMANTPS* (V5…
_mm512_getmant_ps
Extract Float64/32 Vector of Normalized Mantissas from Float64/32 Vector

AESDEC (AESNI
_mm_aesdec_si128
Perform an AES decryption round using a 128-bit state and a round key
AESDECLAST (AESNI
_mm_aesdeclast_si128
Perform the last AES decryption round using a 128-bit state and a round key
AESENC (AESNI
_mm_aesenc_si128
Perform an AES encryption round using a 128-bit state and a round key
AESENCLAST (AESNI
_mm_aesenclast_si128
Perform the last AES encryption round using a 128-bit state and a round key
AESIMC (AESNI
_mm_aesimc_si128
Perform an inverse mix column transformation primitive
AESKEYGENASSIST (AESNI
_mm_aeskeygenassist_si128
Assist the creation of round keys with a key expansion schedule
PCLMULQDQ (PCLMULQDQ
_mm_clmulepi64_si128
Perform carryless multiplication of two 64-bit numbers

SHA1RNDS4 (SHA
_mm_sha1rnds4_epu32
Perform Four Rounds of SHA1 Operation
SHA1NEXTE (SHA
_mm_sha1nexte_epu32
Calculate SHA1 State Variable E after Four Rounds
SHA1MSG1 (SHA
_mm_sha1msg1_epu32
Perform an Intermediate Calculation for the Next Four SHA1 Message Dwords
SHA1MSG2 (SHA
_mm_sha1msg2_epu32
Perform a Final Calculation for the Next Four SHA1 Message Dwords
SHA256RNDS2 (SHA
_mm_sha256rnds2_epu32
Perform Two Rounds of SHA256 Operation
SHA256MSG1 (SHA
_mm_sha256msg1_epu32
Perform an Intermediate Calculation for the Next Four SHA256 Message Dwords
SHA256MSG2 (SHA
_mm_sha256msg2_epu32
Perform a Final Calculation for the Next Four SHA256 Message Dwords

VPBROADCASTMB2Q (V5+CD…
_mm_broadcastmb_epi64
VPBROADCASTMW2D (V5+CD…
_mm_broadcastmw_epi32
Broadcast Mask to Vector Register

VZEROALL (V1
_mm256_zeroall
Zero all YMM registers
VZEROUPPER (V1
_mm256_zeroupper
Zero upper 128 bits of all YMM registers

MOVNTPS (S1
_mm_stream_ps
Non-temporal store of four packed single-precision floating-point values from an XMM register into memory
MASKMOVDQU (S2
_mm_maskmoveu_si128
Non-temporal store of selected bytes from an XMM register into memory
MOVNTPD (S2
_mm_stream_pd
Non-temporal store of two packed double-precision floating-point values from an XMM register into memory
MOVNTDQ (S2
_mm_stream_si128
Non-temporal store of double quadword from an XMM register into memory
LDDQU (S3
_mm_lddqu_si128
Special 128-bit unaligned load designed to avoid cache line splits
MOVNTDQA (S4.1
_mm_stream_load_si128
Provides a non-temporal hint that can cause adjacent 16-byte items within an aligned 64-byte region (a streaming line) to be fetched and held in a small set of temporary buffers (“streaming load buffers”). Subsequent streaming loads to other aligned 16-byte items in the same streaming line may be supplied from the streaming load buffer and can improve throughput.

VGATHERPFxDPS (V5+PF
_mm512_mask_prefetch_i32gather_ps
VGATHERPFxQPS (V5+PF
_mm512_mask_prefetch_i64gather_ps
VGATHERPFxDPD (V5+PF
_mm512_mask_prefetch_i32gather_pd
VGATHERPFxQPD (V5+PF
_mm512_mask_prefetch_i64gather_pd
x=0/1
Sparse Prefetch Packed SP/DP Data Values with Signed Dword, Signed Qword Indices Using T0/T1 Hint
VSCATTERPFxDPS (V5+PF
_mm512_prefetch_i32scatter_ps
VSCATTERPFxQPS (V5+PF
_mm512_prefetch_i64scatter_ps
VSCATTERPFxDPD (V5+PF
_mm512_prefetch_i32scatter_pd
VSCATTERPFxQPD (V5+PF
_mm512_prefetch_i64scatter_pd
x=0/1
Sparse Prefetch Packed SP/DP Data Values with Signed Dword, Signed Qword Indices Using T0/T1 Hint with Intent to Write

 

 

Tips

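TIP 1 Zero clear

Use the XOR instructions. This also works for floating-point types.

Example: set the 2 QWORDs (or 4 DWORDs, 8 WORDs, or 16 BYTEs) of XMM1 to 0

        pxor         xmm1, xmm1

Example: set the 4 floats of XMM1 to 0.0f

        xorps        xmm1, xmm1

Example: set the 2 doubles of XMM1 to 0.0

        xorpd        xmm1, xmm1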
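From C, the same zeroing is usually written with the `_mm_setzero_*` intrinsics listed in the MOVE table. The following is a minimal sketch, assuming SSE2 is available; the helper names (`zero_int`, `zero_float`, `zero_double`, `is_all_zero`) are hypothetical and exist only for illustration. Which instruction the compiler actually emits is up to the compiler.

```c
#include <emmintrin.h>  /* SSE2 */

/* Zeroed registers of each element type. */
__m128i zero_int(void)    { return _mm_setzero_si128(); }
__m128  zero_float(void)  { return _mm_setzero_ps(); }
__m128d zero_double(void) { return _mm_setzero_pd(); }

/* Returns 1 if all 16 bytes of v are zero, 0 otherwise. */
int is_all_zero(__m128i v)
{
    /* Compare every byte with 0, then gather the 16 results into a bit mask. */
    return _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128())) == 0xFFFF;
}
```

The all-zero bit pattern represents 0 for integers and +0.0 for both float and double, which is why one XOR idiom covers every type.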

 

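TIP 2 Copying one value to the whole XMM register

Use the shuffle instructions.

Example: copy the float value in the lowest 32 bits of XMM1 to the other three 32-bit lanes of XMM1

        shufps       xmm1, xmm1, 0

Example: copy the WORD value in the lowest 16 bits of XMM1 to the other seven 16-bit lanes of XMM1

        pshuflw       xmm1, xmm1, 0
        pshufd        xmm1, xmm1, 0

Example: copy the QWORD value in the lower 64 bits of XMM1 to the upper 64 bits of XMM1

         pshufd        xmm1, xmm1, 44h     ; 01 00 01 00 B = 44h

For this case, the following is better.

         punpcklqdq    xmm1, xmm1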
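The three assembly examples above map directly onto intrinsics. This is a sketch under the assumption that SSE2 is available; the function names are made up for illustration. In everyday code the `_mm_set1_*` intrinsics from the table are the usual way to request a broadcast.

```c
#include <emmintrin.h>  /* SSE2 (also pulls in the SSE header) */

/* SHUFPS with immediate 0: every lane selects element 0. */
__m128 broadcast_low_float(__m128 v)
{
    return _mm_shuffle_ps(v, v, 0);
}

/* PSHUFLW fills the low four WORDs, then PSHUFD copies the low DWORD to all four. */
__m128i broadcast_low_word(__m128i v)
{
    v = _mm_shufflelo_epi16(v, 0);
    return _mm_shuffle_epi32(v, 0);
}

/* PUNPCKLQDQ with the same source twice duplicates the low QWORD. */
__m128i broadcast_low_qword(__m128i v)
{
    return _mm_unpacklo_epi64(v, v);
}
```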

 

TIP 3 Sign extension and zero extension of integers

Use the unpack instructions.

Example: zero-extend the 8 WORD values of XMM1 to DWORDs and put them in XMM1 (lower 4) and XMM2 (upper 4)

        movdqa     xmm2, xmm1     ; source data WORD[7] [6] [5] [4] [3] [2] [1] [0]
        pxor       xmm3, xmm3     ; upper 16 bits to attach to each WORD = all 0
        punpcklwd  xmm1, xmm3     ; lower 4: 0 [3] 0 [2] 0 [1] 0 [0]
        punpckhwd  xmm2, xmm3     ; upper 4: 0 [7] 0 [6] 0 [5] 0 [4]

Example: sign-extend the 16 BYTE values of XMM1 to WORDs and put them in XMM1 (lower 8) and XMM2 (upper 8)

        pxor       xmm3, xmm3
        movdqa     xmm2, xmm1
        pcmpgtb    xmm3, xmm1     ; upper 8 bits to attach to each BYTE: 0 if non-negative, -1 if negative
        punpcklbw  xmm1, xmm3     ; lower 8
        punpckhbw  xmm2, xmm3     ; upper 8

Example (intrinsics): sign-extend the 8 WORD values in the __m128i variable words8 and put them in dwords4lo (lower 4) and dwords4hi (upper 4)

    const __m128i izero = _mm_setzero_si128();
    __m128i words8hi = _mm_cmpgt_epi16(izero, words8);
    __m128i dwords4lo = _mm_unpacklo_epi16(words8, words8hi);
    __m128i dwords4hi = _mm_unpackhi_epi16(words8, words8hi);
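For completeness, zero extension can be written with intrinsics in the same way: interleave with a zero register instead of a sign mask. A minimal sketch for BYTE to WORD, assuming SSE2; the function name `zero_extend_u8_to_u16` and its out-parameters are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 */

/* Zero-extend 16 unsigned bytes to 16-bit words.
   *lo receives bytes 0..7, *hi receives bytes 8..15. */
void zero_extend_u8_to_u16(__m128i bytes16, __m128i *lo, __m128i *hi)
{
    const __m128i izero = _mm_setzero_si128();  /* upper 8 bits of every result = 0 */
    *lo = _mm_unpacklo_epi8(bytes16, izero);    /* PUNPCKLBW */
    *hi = _mm_unpackhi_epi8(bytes16, izero);    /* PUNPCKHBW */
}
```

On CPUs with SSE4.1, the PMOVZX/PMOVSX instructions in the conversion table do the lower half of this in one instruction.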

 

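TIP 4 Absolute value of integers

Integers are in two's complement, so the absolute value is obtained by doing nothing if the value is positive or 0, and by inverting all bits and then adding 1 if it is negative.

Example: put the absolute values of the 8 signed WORD values in XMM1 into XMM1

                                 ; if source is positive/0 ; if source is negative
       pxor      xmm2, xmm2      
       pcmpgtw   xmm2, xmm1      ; xmm2←0                  ; xmm2←-1
       pxor      xmm1, xmm2      ; xor with 0 (no change)  ; xor with -1 (invert all bits)
       psubw     xmm1, xmm2      ; subtract 0 (no change)  ; subtract -1 (add 1)

Example (intrinsics): put the absolute values of the 4 signed DWORD values in the __m128i variable dwords4 into dwords4

    const __m128i izero = _mm_setzero_si128();
    __m128i tmp = _mm_cmpgt_epi32(izero, dwords4);
    dwords4 = _mm_xor_si128(dwords4, tmp);
    dwords4 = _mm_sub_epi32(dwords4, tmp);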
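The WORD version of the same trick, wrapped in a function so it can be checked, might look like the sketch below (SSE2 assumed; `abs_epi16_sse2` is a hypothetical name). One edge case is worth knowing: the most negative value has no positive counterpart in two's complement, so -32768 comes back unchanged because the final subtraction wraps around.

```c
#include <emmintrin.h>  /* SSE2 */

/* Absolute value of 8 signed 16-bit lanes using compare / xor / subtract. */
__m128i abs_epi16_sse2(__m128i v)
{
    __m128i neg = _mm_cmpgt_epi16(_mm_setzero_si128(), v); /* -1 where v < 0, else 0 */
    v = _mm_xor_si128(v, neg);                             /* invert bits of negative lanes */
    return _mm_sub_epi16(v, neg);                          /* add 1 to negative lanes */
}
```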

 

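TIP 5 Absolute value of floating-point numbers

Floating-point numbers are not stored in two's complement, so clearing only the sign (the most significant bit) gives the absolute value.

Example: put the absolute values of the 4 single-precision values in XMM1 into XMM1

; data
              align   16
signoffmask   dd      4 dup (7fffffffH)       ; mask that clears only the most significant bit
        
; code
              andps   xmm1, xmmword ptr signoffmask        

Example (intrinsics): put the absolute values of the 4 single-precision values in the __m128 variable floats4 into floats4

    const __m128 signmask = _mm_set1_ps(-0.0f); // 0x80000000

    floats4 = _mm_andnot_ps(signmask, floats4);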
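The double-precision case works the same way, with bit 63 as the sign bit of each lane. A minimal sketch assuming SSE2; `abs_pd` is a hypothetical helper name.

```c
#include <emmintrin.h>  /* SSE2 */

/* Absolute value of 2 doubles: clear the sign bit of each lane with ANDNPD. */
__m128d abs_pd(__m128d v)
{
    const __m128d signmask = _mm_set1_pd(-0.0); /* only bit 63 set in each lane */
    return _mm_andnot_pd(signmask, v);          /* (~signmask) & v */
}
```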

 

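TIP 6 Are integer multiply instructions missing?

Signed and unsigned multiplication differ only in the high half of the product. The low half is identical, so one instruction serves both.

Unsigned WORD×WORD → high half PMULHUW, low half PMULLW

Signed WORD×WORD → high half PMULHW, low half PMULLW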
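To get full 32-bit products, the high and low halves can be interleaved with the unpack instructions. The sketch below does this for the signed case, assuming SSE2; `mul_epi16_full` and its out-parameters are hypothetical names. For unsigned inputs, `_mm_mulhi_epu16` (PMULHUW) would replace `_mm_mulhi_epi16`.

```c
#include <emmintrin.h>  /* SSE2 */

/* Signed 16x16 -> 32-bit products for all 8 lanes.
   *lo4 receives the products of lanes 0..3, *hi4 those of lanes 4..7. */
void mul_epi16_full(__m128i a, __m128i b, __m128i *lo4, __m128i *hi4)
{
    __m128i lo = _mm_mullo_epi16(a, b);  /* PMULLW: low 16 bits of each product */
    __m128i hi = _mm_mulhi_epi16(a, b);  /* PMULHW: high 16 bits, signed */
    *lo4 = _mm_unpacklo_epi16(lo, hi);   /* low word first, then high word */
    *hi4 = _mm_unpackhi_epi16(lo, hi);
}
```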

 

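TIP 7 Why do MOV and the logical operations have integer and floating-point variants?

MOV instructions just copy a bit pattern as-is, so why is there a separate instruction for each type?

Do they behave differently? No. From the software's point of view they behave identically. It is safe to movaps an arbitrary bit pattern that may be invalid as a floating-point value. It is also safe to load an XMMWORD containing single-precision data into an XMM register with movdqa and then do single-precision arithmetic on it. The specification explicitly states that moving data with an instruction of a different type is allowed.

The difference is in the internal behavior of the CPU, which is not directly visible from outside. Using the instruction of the correct type lets the CPU work consistently internally, so processing may be faster. When the type is known, it is best to use the instruction for that type.

The point is that in cases where there is no way to know what type of data a register holds, such as saving and restoring XMM registers at the entry and exit of a subroutine, moving with an instruction of a different type still works correctly.

The same applies to the bitwise instructions.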
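In C the vector types `__m128`, `__m128d` and `__m128i` are distinct, so applying an integer instruction to floating-point data needs the cast intrinsics, which only reinterpret the bits and perform no conversion. A minimal sketch assuming SSE2; `negate_ps_via_int` is a hypothetical example function that flips float sign bits with an integer XOR.

```c
#include <emmintrin.h>  /* SSE2 */

/* Negate four floats by XORing their sign bits with the integer PXOR. */
__m128 negate_ps_via_int(__m128 v)
{
    __m128i bits = _mm_castps_si128(v);                     /* reinterpret only */
    __m128i sign = _mm_castps_si128(_mm_set1_ps(-0.0f));    /* 0x80000000 per lane */
    return _mm_castsi128_ps(_mm_xor_si128(bits, sign));     /* back to float view */
}
```

The result is the same as using `_mm_xor_ps` directly; as the tip says, the choice between the two affects only how the CPU handles the data internally.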

 

TIP 8 max / min

Get a mask with a compare instruction, then combine with bitwise operations.

Example: do a signed comparison of the 4 DWORD values in XMM1 with those in XMM2 and put the smaller ones in XMM1

; A=xmm1  B=xmm2                    ; when A>B      ; when A<=B
        movdqa      xmm0, xmm1
        pcmpgtd     xmm1, xmm2      ; xmm1=-1       ; xmm1=0
        pand        xmm2, xmm1      ; xmm2=B        ; xmm2=0
        pandn       xmm1, xmm0      ; xmm1=0        ; xmm1=A
        por         xmm1, xmm2      ; xmm1=B        ; xmm1=A

Example (intrinsics): do a signed comparison of the 16 byte values in the __m128i variables a and b and put the larger ones in maxAB

    __m128i mask = _mm_cmpgt_epi8(a, b);
    __m128i selectedA = _mm_and_si128(mask, a);
    __m128i selectedB = _mm_andnot_si128(mask, b);
    __m128i maxAB = _mm_or_si128(selectedA, selectedB);
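The assembly example for the signed DWORD minimum can be expressed with intrinsics as well. This sketch assumes SSE2 only; `min_epi32_sse2` is a hypothetical name. Where SSE4.1 is available, `_mm_min_epi32` (PMINSD) from the table does the same job in one instruction.

```c
#include <emmintrin.h>  /* SSE2 */

/* Signed minimum of 4 DWORD lanes using compare + and / andnot / or. */
__m128i min_epi32_sse2(__m128i a, __m128i b)
{
    __m128i mask = _mm_cmpgt_epi32(a, b);          /* -1 where a > b, else 0 */
    __m128i fromB = _mm_and_si128(mask, b);        /* b where a > b */
    __m128i fromA = _mm_andnot_si128(mask, a);     /* a where a <= b */
    return _mm_or_si128(fromA, fromB);
}
```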

 

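TIP 9 What is the difference between 128-bit AVX instructions and SSE instructions?

It is self-evident that many instructions can take a destination register different from the sources, which makes extra movs unnecessary, but there are other differences as well.

In general, a 128-bit AVX instruction zero-clears the upper 128 bits of the destination register (for example, specifying XMM0 as the destination sets the upper 128 bits of YMM0 to 0). SSE instructions leave the upper bits untouched.

In general, SSE instructions require memory operands to be 16-byte aligned, whereas AVX instructions execute without 16-byte alignment (except for instructions such as vmovdqa that explicitly require alignment). For performance, though, aligning is still preferable.

Mixing 256-bit AVX instructions, 128-bit AVX instructions and 128-bit SSE instructions can apparently degrade performance significantly in some cases. When using 256-bit and 128-bit AVX instructions, it may be best to avoid 128-bit SSE instructions (rewrite them as 128-bit AVX instructions).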

 

TIP 10 Setting all bits

Use the PCMPEQx instructions.

Example: set the 2 QWORDs (or 4 DWORDs, 8 WORDs, or 16 BYTEs) of XMM1 to -1

        pcmpeqb         xmm1, xmm1
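With intrinsics, comparing a register with itself gives the same all-ones value, and XORing with it yields a bitwise NOT, for which there is no dedicated instruction in the logical-operation table. A minimal sketch assuming SSE2; `all_ones` and `not_si128` are hypothetical helper names.

```c
#include <emmintrin.h>  /* SSE2 */

/* All 128 bits set: every byte compares equal to itself (PCMPEQB). */
__m128i all_ones(void)
{
    __m128i x = _mm_setzero_si128();
    return _mm_cmpeq_epi8(x, x);
}

/* Bitwise NOT: XOR with all ones. */
__m128i not_si128(__m128i v)
{
    return _mm_xor_si128(v, all_ones());
}
```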

 


ver 2017092800

Home page http://www.officedaytime.com/