
x86/x64 SIMD Instruction List (SSE to AVX512) Beta

MMX register (64-bit) instructions are omitted.

S1=SSE  S2=SSE2 S3=SSE3 SS3=SSSE3 S4.1=SSE4.1 S4.2=SSE4.2 V1=AVX V2=AVX2 V5=AVX512

Instructions marked * become scalar instructions (only the lowest element is calculated) when PS/PD/DQ is changed to SS/SD/SI.

The C/C++ intrinsic names are written below each instruction in blue.

AVX/AVX2

AVX512

This document is intended to help you find the correct name of an instruction you are not sure of, so that you can then look it up in the manuals. Refer to the manuals before coding.

Intel's manuals -> https://software.intel.com/en-us/articles/intel-sdm

If you find any errors, please post to the feedback form or email me at the address at the bottom of this page.

 


MOVE     ?MM = XMM / YMM / ZMM

  Integer Floating-Point YMM lane (128-bit)
QWORD DWORD WORD BYTE Double Single Half
?MM whole
from / to
?MM/mem
MOVDQA (S2
_mm_load_si128
_mm_store_si128
MOVDQU (S2
_mm_loadu_si128
_mm_storeu_si128
MOVAPD (S2
_mm_load_pd
_mm_loadr_pd
_mm_store_pd
_mm_storer_pd

MOVUPD (S2
_mm_loadu_pd
_mm_storeu_pd
MOVAPS (S1
_mm_load_ps
_mm_loadr_ps
_mm_store_ps
_mm_storer_ps

MOVUPS (S1
_mm_loadu_ps
_mm_storeu_ps
 
VMOVDQA64 (V5...
_mm_mask_load_epi64
_mm_mask_store_epi64
etc
VMOVDQU64 (V5...
_mm_mask_loadu_epi64
_mm_mask_storeu_epi64
etc
VMOVDQA32 (V5...
_mm_mask_load_epi32
_mm_mask_store_epi32
etc
VMOVDQU32 (V5...
_mm_mask_loadu_epi32
_mm_mask_storeu_epi32
etc
VMOVDQU16 (V5+BW...
_mm_mask_loadu_epi16
_mm_mask_storeu_epi16
etc
VMOVDQU8 (V5+BW...
_mm_mask_loadu_epi8
_mm_mask_storeu_epi8
etc
XMM upper half
from / to
mem
MOVHPD (S2
_mm_loadh_pd
_mm_storeh_pd
MOVHPS (S1
_mm_loadh_pi
_mm_storeh_pi
 
XMM upper half
from / to
XMM lower half
MOVHLPS (S1
_mm_movehl_ps
MOVLHPS (S1
_mm_movelh_ps
 
XMM lower half
from / to
mem
        MOVLPD (S2
_mm_loadl_pd
_mm_storel_pd
MOVLPS (S1
_mm_loadl_pi
_mm_storel_pi
   
XMM lowest 1 elem
from / to
r/m
MOVQ (S2
_mm_cvtsi64_si128
_mm_cvtsi128_si64
MOVD (S2
_mm_cvtsi32_si128
_mm_cvtsi128_si32
   
XMM lowest 1 elem
from / to
XMM/mem
MOVQ (S2
_mm_move_epi64
      MOVSD (S2
_mm_load_sd
_mm_store_sd
_mm_move_sd
MOVSS (S1
_mm_load_ss
_mm_store_ss
_mm_move_ss
   
XMM whole
from
1 elem
TIP 2
_mm_set1_epi64x
VPBROADCASTQ (V2
_mm_broadcastq_epi64
TIP 2
_mm_set1_epi32
VPBROADCASTD (V2
_mm_broadcastd_epi32
TIP 2
_mm_set1_epi16
VPBROADCASTW (V2
_mm_broadcastw_epi16
_mm_set1_epi8
VPBROADCASTB (V2
_mm_broadcastb_epi8
TIP 2
_mm_set1_pd
_mm_load1_pd
MOVDDUP (S3
_mm_movedup_pd
_mm_loaddup_pd

TIP 2
_mm_set1_ps
_mm_load1_ps

VBROADCASTSS
from mem (V1
from XMM (V2
_mm_broadcast_ss
YMM / ZMM whole
from
1 elem
VPBROADCASTQ (V2
_mm256_broadcastq_epi64
VPBROADCASTD (V2
_mm256_broadcastd_epi32
VPBROADCASTW (V2
_mm256_broadcastw_epi16
VPBROADCASTB (V2
_mm256_broadcastb_epi8
VBROADCASTSD
 from mem (V1
 from XMM (V2
_mm256_broadcast_sd
VBROADCASTSS
 from mem (V1
 from XMM (V2
_mm256_broadcast_ss
  VBROADCASTF128 (V1
_mm256_broadcast_ps
_mm256_broadcast_pd

VBROADCASTI128 (V2
_mm256_broadcastsi128_si256
YMM / ZMM whole
from
2/4/8 elems
VBROADCASTI64X2 (V5+DQ...
_mm512_broadcast_i64x2
VBROADCASTI64X4 (V5
_mm512_broadcast_i64x4
VBROADCASTI32X2 (V5+DQ...
_mm512_broadcast_i32x2
VBROADCASTI32X4 (V5...
_mm512_broadcast_i32x4
VBROADCASTI32X8 (V5+DQ
_mm512_broadcast_i32x8
VBROADCASTF64X2 (V5+DQ...
_mm512_broadcast_f64x2
VBROADCASTF64X4 (V5
_mm512_broadcast_f64x4
VBROADCASTF32X2 (V5+DQ...
_mm512_broadcast_f32x2
VBROADCASTF32X4 (V5...
_mm512_broadcast_f32x4
VBROADCASTF32X8 (V5+DQ
_mm512_broadcast_f32x8
?MM
from
multiple elems
_mm_set_epi64x
_mm_setr_epi64x
_mm_set_epi32
_mm_setr_epi32
_mm_set_epi16
_mm_setr_epi16
_mm_set_epi8
_mm_setr_epi8
_mm_set_pd
_mm_setr_pd
_mm_set_ps
_mm_setr_ps
   
?MM whole
from
zero
TIP 1
_mm_setzero_si128
TIP 1
_mm_setzero_pd
TIP 1
_mm_setzero_ps
   
extract PEXTRQ (S4.1
_mm_extract_epi64
PEXTRD (S4.1
_mm_extract_epi32
PEXTRW to r (S2
PEXTRW to r/m (S4.1
_mm_extract_epi16
PEXTRB (S4.1
_mm_extract_epi8
  EXTRACTPS (S4.1
_mm_extract_ps
  VEXTRACTF128 (V1
_mm256_extractf128_ps
_mm256_extractf128_pd
_mm256_extractf128_si256

VEXTRACTI128 (V2
_mm256_extracti128_si256
VEXTRACTI64X2 (V5+DQ...
_mm512_extracti64x2_epi64
VEXTRACTI64X4 (V5
_mm512_extracti64x4_epi64
VEXTRACTI32X4 (V5...
_mm512_extracti32x4_epi32
VEXTRACTI32X8 (V5+DQ
_mm512_extracti32x8_epi32
VEXTRACTF64X2 (V5+DQ...
_mm512_extractf64x2_pd
VEXTRACTF64X4 (V5
_mm512_extractf64x4_pd
VEXTRACTF32X4 (V5...
_mm512_extractf32x4_ps
VEXTRACTF32X8 (V5+DQ
_mm512_extractf32x8_ps
insert PINSRQ (S4.1
_mm_insert_epi64
PINSRD (S4.1
_mm_insert_epi32
PINSRW (S2
_mm_insert_epi16
PINSRB (S4.1
_mm_insert_epi8
  INSERTPS (S4.1
_mm_insert_ps
  VINSERTF128 (V1
_mm256_insertf128_ps
_mm256_insertf128_pd
_mm256_insertf128_si256

VINSERTI128 (V2
_mm256_inserti128_si256
VINSERTI64X2 (V5+DQ...
_mm512_inserti64x2
VINSERTI64X4 (V5...
_mm512_inserti64x4
VINSERTI32X4 (V5...
_mm512_inserti32x4
VINSERTI32X8 (V5+DQ
_mm512_inserti32x8
VINSERTF64X2 (V5+DQ...
_mm512_insertf64x2
VINSERTF64X4 (V5
_mm512_insertf64x4
VINSERTF32X4 (V5...
_mm512_insertf32x4
VINSERTF32X8 (V5+DQ
_mm512_insertf32x8
unpack
PUNPCKHQDQ (S2
_mm_unpackhi_epi64
PUNPCKLQDQ (S2
_mm_unpacklo_epi64
PUNPCKHDQ (S2
_mm_unpackhi_epi32
PUNPCKLDQ (S2
_mm_unpacklo_epi32
PUNPCKHWD (S2
_mm_unpackhi_epi16
PUNPCKLWD (S2
_mm_unpacklo_epi16
PUNPCKHBW (S2
_mm_unpackhi_epi8
PUNPCKLBW (S2
_mm_unpacklo_epi8
UNPCKHPD (S2
_mm_unpackhi_pd
UNPCKLPD (S2
_mm_unpacklo_pd
UNPCKHPS (S1
_mm_unpackhi_ps
UNPCKLPS (S1
_mm_unpacklo_ps
   
shuffle/permute
VPERMQ (V2
_mm256_permute4x64_epi64
VPERMI2Q (V5...
_mm_permutex2var_epi64
PSHUFD (S2
_mm_shuffle_epi32
VPERMD (V2
_mm256_permutevar8x32_epi32
_mm256_permutexvar_epi32
VPERMI2D (V5...
_mm_permutex2var_epi32
PSHUFHW (S2
_mm_shufflehi_epi16
PSHUFLW (S2
_mm_shufflelo_epi16
VPERMW (V5+BW...
_mm_permutexvar_epi16
VPERMI2W (V5+BW...
_mm_permutex2var_epi16
PSHUFB (SS3
_mm_shuffle_epi8
SHUFPD (S2
_mm_shuffle_pd
VPERMILPD (V1
_mm_permute_pd
_mm_permutevar_pd

VPERMPD (V2
_mm256_permute4x64_pd
VPERMI2PD (V5...
_mm_permutex2var_pd
SHUFPS (S1
_mm_shuffle_ps
VPERMILPS (V1
_mm_permute_ps
_mm_permutevar_ps

VPERMPS (V2
_mm256_permutevar8x32_ps
VPERMI2PS (V5...
_mm_permutex2var_ps
  VPERM2F128 (V1
_mm256_permute2f128_ps
_mm256_permute2f128_pd
_mm256_permute2f128_si256

VPERM2I128 (V2
_mm256_permute2x128_si256
VSHUFI64X2 (V5...
_mm512_shuffle_i64x2
VSHUFI32X4 (V5...
_mm512_shuffle_i32x4
VSHUFF64X2 (V5...
_mm512_shuffle_f64x2
VSHUFF32X4 (V5...
_mm512_shuffle_f32x4
blend
VPBLENDMQ (V5...
_mm_mask_blend_epi64
VPBLENDD (V2
_mm_blend_epi32
VPBLENDMD (V5...
_mm_mask_blend_epi32
PBLENDW (S4.1
_mm_blend_epi16
VPBLENDMW (V5+BW...
_mm_mask_blend_epi16
PBLENDVB (S4.1
_mm_blendv_epi8
VPBLENDMB (V5+BW...
_mm_mask_blend_epi8
BLENDPD (S4.1
_mm_blend_pd
BLENDVPD (S4.1
_mm_blendv_pd
VBLENDMPD (V5...
_mm_mask_blend_pd
BLENDPS (S4.1
_mm_blend_ps
BLENDVPS (S4.1
_mm_blendv_ps
VBLENDMPS (V5...
_mm_mask_blend_ps
   
move and duplicate MOVDDUP (S3
_mm_movedup_pd
_mm_loaddup_pd
MOVSHDUP (S3
_mm_movehdup_ps
MOVSLDUP (S3
_mm_moveldup_ps
 
mask move VPMASKMOVQ (V2
_mm_maskload_epi64
_mm_maskstore_epi64
VPMASKMOVD (V2
_mm_maskload_epi32
_mm_maskstore_epi32
    VMASKMOVPD (V1
_mm_maskload_pd
_mm_maskstore_pd
VMASKMOVPS (V1
_mm_maskload_ps
_mm_maskstore_ps
   
extract highest bit       PMOVMSKB (S2
_mm_movemask_epi8
MOVMSKPD (S2
_mm_movemask_pd
MOVMSKPS (S1
_mm_movemask_ps
   
VPMOVQ2M (V5+DQ...
_mm_movepi64_mask
VPMOVD2M (V5+DQ...
_mm_movepi32_mask
VPMOVW2M (V5+BW...
_mm_movepi16_mask
VPMOVB2M (V5+BW...
_mm_movepi8_mask
gather
VPGATHERDQ (V2
_mm_i32gather_epi64
_mm_mask_i32gather_epi64

VPGATHERQQ (V2
_mm_i64gather_epi64
_mm_mask_i64gather_epi64
VPGATHERDD (V2
_mm_i32gather_epi32
_mm_mask_i32gather_epi32

VPGATHERQD (V2
_mm_i64gather_epi32
_mm_mask_i64gather_epi32
    VGATHERDPD (V2
_mm_i32gather_pd
_mm_mask_i32gather_pd

VGATHERQPD (V2
_mm_i64gather_pd
_mm_mask_i64gather_pd
VGATHERDPS (V2
_mm_i32gather_ps
_mm_mask_i32gather_ps

VGATHERQPS (V2
_mm_i64gather_ps
_mm_mask_i64gather_ps
   
scatter
VPSCATTERDQ (V5...
_mm_i32scatter_epi64
_mm_mask_i32scatter_epi64

VPSCATTERQQ (V5...
_mm_i64scatter_epi64
_mm_mask_i64scatter_epi64
VPSCATTERDD (V5...
_mm_i32scatter_epi32
_mm_mask_i32scatter_epi32

VPSCATTERQD (V5...
_mm_i64scatter_epi32
_mm_mask_i64scatter_epi32
    VSCATTERDPD (V5...
_mm_i32scatter_pd
_mm_mask_i32scatter_pd

VSCATTERQPD (V5...
_mm_i64scatter_pd
_mm_mask_i64scatter_pd
VSCATTERDPS (V5...
_mm_i32scatter_ps
_mm_mask_i32scatter_ps

VSCATTERQPS (V5...
_mm_i64scatter_ps
_mm_mask_i64scatter_ps
   
compress
VPCOMPRESSQ (V5...
_mm_mask_compress_epi64
_mm_mask_compressstoreu_epi64
VPCOMPRESSD (V5...
_mm_mask_compress_epi32
_mm_mask_compressstoreu_epi32
VCOMPRESSPD (V5...
_mm_mask_compress_pd
_mm_mask_compressstoreu_pd
VCOMPRESSPS (V5...
_mm_mask_compress_ps
_mm_mask_compressstoreu_ps
expand
VPEXPANDQ (V5...
_mm_mask_expand_epi64
_mm_mask_expandloadu_epi64
VPEXPANDD (V5...
_mm_mask_expand_epi32
_mm_mask_expandloadu_epi32
VEXPANDPD (V5...
_mm_mask_expand_pd
_mm_mask_expandloadu_pd
VEXPANDPS (V5...
_mm_mask_expand_ps
_mm_mask_expandloadu_ps
align right VALIGNQ (V5...
_mm_alignr_epi64
VALIGND (V5...
_mm_alignr_epi32
PALIGNR (SS3
_mm_alignr_epi8
expand Opmask bits VPMOVM2Q (V5+DQ...
_mm_movm_epi64
VPMOVM2D (V5+DQ...
_mm_movm_epi32
VPMOVM2W (V5+BW...
_mm_movm_epi16
VPMOVM2B (V5+BW...
_mm_movm_epi8
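
As a quick orientation to the load/store and broadcast rows above, here is a minimal C sketch. The helper names are made up for illustration; it assumes SSE2, which is baseline on x86-64, and the <emmintrin.h> header.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Move 4 DWORDs from unaligned memory through an XMM register and
   back out (the MOVDQU row). MOVDQA requires 16-byte alignment. */
static void copy4_dwords(const int32_t *src, int32_t *dst)
{
    __m128i v = _mm_loadu_si128((const __m128i *)src); /* movdqu load  */
    _mm_storeu_si128((__m128i *)dst, v);               /* movdqu store */
}

/* Fill all 4 DWORD elements from 1 element (the "?MM whole from
   1 elem" row; see also TIP 2). */
static void splat_dword(int32_t x, int32_t out[4])
{
    __m128i v = _mm_set1_epi32(x);
    _mm_storeu_si128((__m128i *)out, v);
}
```

Without AVX-512 the masked-store variants above are not available, so unaligned loads/stores plus set1 are the portable baseline.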

 

Conversions

from \ to Integer Floating-Point
QWORD DWORD WORD BYTE Double Single Half
Integer QWORD VPMOVQD (V5...
_mm_cvtepi64_epi32
VPMOVSQD (V5...
_mm_cvtsepi64_epi32
VPMOVUSQD (V5...
_mm_cvtusepi64_epi32
VPMOVQW (V5...
_mm_cvtepi64_epi16
VPMOVSQW (V5...
_mm_cvtsepi64_epi16
VPMOVUSQW (V5...
_mm_cvtusepi64_epi16
VPMOVQB (V5...
_mm_cvtepi64_epi8
VPMOVSQB (V5...
_mm_cvtsepi64_epi8
VPMOVUSQB (V5...
_mm_cvtusepi64_epi8
CVTSI2SD (S2 scalar only
_mm_cvtsi64_sd
VCVTQQ2PD* (V5+DQ...
_mm_cvtepi64_pd
VCVTUQQ2PD* (V5+DQ...
_mm_cvtepu64_pd
CVTSI2SS (S1 scalar only
_mm_cvtsi64_ss
VCVTQQ2PS* (V5+DQ...
_mm_cvtepi64_ps
VCVTUQQ2PS* (V5+DQ...
_mm_cvtepu64_ps
DWORD TIP 3
PMOVSXDQ (S4.1
_mm_cvtepi32_epi64
PMOVZXDQ (S4.1
_mm_cvtepu32_epi64
  PACKSSDW (S2
_mm_packs_epi32
PACKUSDW (S4.1
_mm_packus_epi32
VPMOVDW (V5...
_mm_cvtepi32_epi16
VPMOVSDW (V5...
_mm_cvtsepi32_epi16
VPMOVUSDW (V5...
_mm_cvtusepi32_epi16
VPMOVDB (V5...
_mm_cvtepi32_epi8
VPMOVSDB (V5...
_mm_cvtsepi32_epi8
VPMOVUSDB (V5...
_mm_cvtusepi32_epi8
CVTDQ2PD* (S2
_mm_cvtepi32_pd
VCVTUDQ2PD* (V5...
_mm_cvtepu32_pd
CVTDQ2PS* (S2
_mm_cvtepi32_ps
VCVTUDQ2PS* (V5...
_mm_cvtepu32_ps
WORD PMOVSXWQ (S4.1
_mm_cvtepi16_epi64
PMOVZXWQ (S4.1
_mm_cvtepu16_epi64
TIP 3
PMOVSXWD (S4.1
_mm_cvtepi16_epi32
PMOVZXWD (S4.1
_mm_cvtepu16_epi32
PACKSSWB (S2
_mm_packs_epi16
PACKUSWB (S2
_mm_packus_epi16
VPMOVWB (V5+BW...
_mm_cvtepi16_epi8
VPMOVSWB (V5+BW...
_mm_cvtsepi16_epi8
VPMOVUSWB (V5+BW...
_mm_cvtusepi16_epi8
BYTE PMOVSXBQ (S4.1
_mm_cvtepi8_epi64
PMOVZXBQ (S4.1
_mm_cvtepu8_epi64
PMOVSXBD (S4.1
_mm_cvtepi8_epi32
PMOVZXBD (S4.1
_mm_cvtepu8_epi32
TIP 3
PMOVSXBW (S4.1
_mm_cvtepi8_epi16
PMOVZXBW (S4.1
_mm_cvtepu8_epi16
Floating-Point Double CVTSD2SI / CVTTSD2SI (S2 scalar only
_mm_cvtsd_si64 / _mm_cvttsd_si64
VCVTPD2QQ* / VCVTTPD2QQ* (V5+DQ...
_mm_cvtpd_epi64 / _mm_cvttpd_epi64
VCVTPD2UQQ* / VCVTTPD2UQQ* (V5+DQ...
_mm_cvtpd_epu64 / _mm_cvttpd_epu64
right ones are with truncation
CVTPD2DQ* / CVTTPD2DQ* (S2
_mm_cvtpd_epi32 / _mm_cvttpd_epi32
VCVTPD2UDQ* / VCVTTPD2UDQ* (V5...
_mm_cvtpd_epu32 / _mm_cvttpd_epu32
right ones are with truncation
CVTPD2PS* (S2
_mm_cvtpd_ps
Single CVTSS2SI / CVTTSS2SI (S1 scalar only
_mm_cvtss_si64 / _mm_cvttss_si64
VCVTPS2QQ* / VCVTTPS2QQ* (V5+DQ...
_mm_cvtps_epi64 / _mm_cvttps_epi64
VCVTPS2UQQ* / VCVTTPS2UQQ* (V5+DQ...
_mm_cvtps_epu64 / _mm_cvttps_epu64
right ones are with truncation
CVTPS2DQ* / CVTTPS2DQ* (S2
_mm_cvtps_epi32 / _mm_cvttps_epi32
VCVTPS2UDQ* / VCVTTPS2UDQ* (V5...
_mm_cvtps_epu32 / _mm_cvttps_epu32
right ones are with truncation
  CVTPS2PD* (S2
_mm_cvtps_pd
VCVTPS2PH (V1
_mm_cvtps_ph
Half VCVTPH2PS (V1
_mm_cvtph_ps
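
The CVT/CVTT pairs above differ only in rounding: the plain form uses the current MXCSR rounding mode (round-to-nearest-even by default), while the T form truncates toward zero. A small C sketch (helper names invented; assumes SSE2):

```c
#include <emmintrin.h>  /* SSE2 */

/* CVTPS2DQ: float -> int32 using the current rounding mode
   (round-to-nearest-even unless MXCSR has been changed). */
static int cvt_round(float x)
{
    return _mm_cvtsi128_si32(_mm_cvtps_epi32(_mm_set1_ps(x)));
}

/* CVTTPS2DQ: float -> int32 with truncation toward zero,
   like a C cast. */
static int cvt_trunc(float x)
{
    return _mm_cvtsi128_si32(_mm_cvttps_epi32(_mm_set1_ps(x)));
}
```

Note that 2.5 rounds to 2 (nearest even), not 3, under the default mode.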

 

Arithmetic Operations

  Integer Floating-Point
QWORD DWORD WORD BYTE Double Single Half
add PADDQ (S2
_mm_add_epi64
PADDD (S2
_mm_add_epi32
PADDW (S2
_mm_add_epi16
PADDSW (S2
_mm_adds_epi16
PADDUSW (S2
_mm_adds_epu16
PADDB (S2
_mm_add_epi8
PADDSB (S2
_mm_adds_epi8
PADDUSB (S2
_mm_adds_epu8
ADDPD* (S2
_mm_add_pd
ADDPS* (S1
_mm_add_ps
sub PSUBQ (S2
_mm_sub_epi64
PSUBD (S2
_mm_sub_epi32
PSUBW (S2
_mm_sub_epi16
PSUBSW (S2
_mm_subs_epi16
PSUBUSW (S2
_mm_subs_epu16
PSUBB (S2
_mm_sub_epi8
PSUBSB (S2
_mm_subs_epi8
PSUBUSB (S2
_mm_subs_epu8
SUBPD* (S2
_mm_sub_pd
SUBPS* (S1
_mm_sub_ps
 
mul VPMULLQ (V5+DQ...
_mm_mullo_epi64
PMULDQ (S4.1
_mm_mul_epi32
PMULUDQ (S2
_mm_mul_epu32
PMULLD (S4.1
_mm_mullo_epi32
PMULHW (S2
_mm_mulhi_epi16
PMULHUW (S2
_mm_mulhi_epu16
PMULLW (S2
_mm_mullo_epi16
MULPD* (S2
_mm_mul_pd
MULPS* (S1
_mm_mul_ps
div DIVPD* (S2
_mm_div_pd
DIVPS* (S1
_mm_div_ps
reciprocal         VRCP14PD* (V5...
_mm_rcp14_pd
VRCP28PD* (V5+ER
_mm512_rcp28_pd
RCPPS* (S1
_mm_rcp_ps
VRCP14PS* (V5...
_mm_rcp14_ps
VRCP28PS* (V5+ER
_mm512_rcp28_ps
 
square root         SQRTPD* (S2
_mm_sqrt_pd
SQRTPS* (S1
_mm_sqrt_ps
 
reciprocal of square root         VRSQRT14PD* (V5...
_mm_rsqrt14_pd
VRSQRT28PD* (V5+ER
_mm512_rsqrt28_pd
RSQRTPS* (S1
_mm_rsqrt_ps
VRSQRT14PS* (V5...
_mm_rsqrt14_ps
VRSQRT28PS* (V5+ER
_mm512_rsqrt28_ps
 
power of two         VEXP2PD* (V5+ER
_mm512_exp2a23_round_pd
VEXP2PS* (V5+ER
_mm512_exp2a23_round_ps
 
multiply nth power of 2 VSCALEFPD* (V5...
_mm_scalef_pd
VSCALEFPS* (V5...
_mm_scalef_ps
max TIP 8
VPMAXSQ (V5...
_mm_max_epi64
VPMAXUQ (V5...
_mm_max_epu64
TIP 8
PMAXSD (S4.1
_mm_max_epi32
PMAXUD (S4.1
_mm_max_epu32
PMAXSW (S2
_mm_max_epi16
PMAXUW (S4.1
_mm_max_epu16
TIP 8
PMAXSB (S4.1
_mm_max_epi8
PMAXUB (S2
_mm_max_epu8
TIP 8
MAXPD* (S2
_mm_max_pd
TIP 8
MAXPS* (S1
_mm_max_ps
 
min TIP 8
VPMINSQ (V5...
_mm_min_epi64
VPMINUQ (V5...
_mm_min_epu64
TIP 8
PMINSD (S4.1
_mm_min_epi32
PMINUD (S4.1
_mm_min_epu32
PMINSW (S2
_mm_min_epi16
PMINUW (S4.1
_mm_min_epu16
TIP 8
PMINSB (S4.1
_mm_min_epi8
PMINUB (S2
_mm_min_epu8
TIP 8
MINPD* (S2
_mm_min_pd
TIP 8
MINPS* (S1
_mm_min_ps
average     PAVGW (S2
_mm_avg_epu16
PAVGB (S2
_mm_avg_epu8
     
absolute TIP 4
VPABSQ (V5...
_mm_abs_epi64
TIP 4
PABSD (SS3
_mm_abs_epi32
TIP 4
PABSW (SS3
_mm_abs_epi16
TIP 4
PABSB (SS3
_mm_abs_epi8
TIP 5 TIP 5  
sign operation   PSIGND (SS3
_mm_sign_epi32
PSIGNW (SS3
_mm_sign_epi16
PSIGNB (SS3
_mm_sign_epi8
     
round         ROUNDPD* (S4.1
_mm_round_pd
_mm_floor_pd
_mm_ceil_pd

VRNDSCALEPD* (V5...
_mm_roundscale_pd
ROUNDPS* (S4.1
_mm_round_ps
_mm_floor_ps
_mm_ceil_ps

VRNDSCALEPS* (V5...
_mm_roundscale_ps
 
difference from rounded value         VREDUCEPD* (V5+DQ...
_mm_reduce_pd
VREDUCEPS* (V5+DQ...
_mm_reduce_ps
 
add / sub         ADDSUBPD (S3
_mm_addsub_pd
ADDSUBPS (S3
_mm_addsub_ps
 
horizontal add   PHADDD (SS3
_mm_hadd_epi32
PHADDW (SS3
_mm_hadd_epi16
PHADDSW (SS3
_mm_hadds_epi16
  HADDPD (S3
_mm_hadd_pd
HADDPS (S3
_mm_hadd_ps
 
horizontal sub   PHSUBD (SS3
_mm_hsub_epi32
PHSUBW (SS3
_mm_hsub_epi16
PHSUBSW (SS3
_mm_hsubs_epi16
  HSUBPD (S3
_mm_hsub_pd
HSUBPS (S3
_mm_hsub_ps
 
dot product         DPPD (S4.1
_mm_dp_pd
DPPS (S4.1
_mm_dp_ps
 
multiply and add PMADDWD (S2
_mm_madd_epi16
PMADDUBSW (SS3
_mm_maddubs_epi16
fused multiply and add / sub         VFMADDxxxPD* (FMA
_mm_fmadd_pd
VFMSUBxxxPD* (FMA
_mm_fmsub_pd
VFMADDSUBxxxPD (FMA
_mm_fmaddsub_pd
VFMSUBADDxxxPD (FMA
_mm_fmsubadd_pd
VFNMADDxxxPD* (FMA
_mm_fnmadd_pd
VFNMSUBxxxPD* (FMA
_mm_fnmsub_pd
xxx=132/213/231
VFMADDxxxPS* (FMA
_mm_fmadd_ps
VFMSUBxxxPS* (FMA
_mm_fmsub_ps
VFMADDSUBxxxPS (FMA
_mm_fmaddsub_ps
VFMSUBADDxxxPS (FMA
_mm_fmsubadd_ps
VFNMADDxxxPS* (FMA
_mm_fnmadd_ps
VFNMSUBxxxPS* (FMA
_mm_fnmsub_ps
xxx=132/213/231
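
The add/sub rows distinguish wrapping (PADDW) from saturating (PADDSW/PADDUSW) arithmetic. A minimal C sketch of the difference (helper names invented; assumes SSE2):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* PADDW: 16-bit add with two's-complement wraparound. */
static int16_t add_wrap(int16_t a, int16_t b)
{
    __m128i r = _mm_add_epi16(_mm_set1_epi16(a), _mm_set1_epi16(b));
    return (int16_t)_mm_extract_epi16(r, 0);   /* PEXTRW */
}

/* PADDSW: 16-bit add saturating to [-32768, 32767]. */
static int16_t add_sat(int16_t a, int16_t b)
{
    __m128i r = _mm_adds_epi16(_mm_set1_epi16(a), _mm_set1_epi16(b));
    return (int16_t)_mm_extract_epi16(r, 0);
}
```

Saturation is what makes the *S/​*US forms useful for pixel and audio math, where wraparound would produce visible artifacts.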
 

 

Compare

  Integer
QWORD DWORD WORD BYTE
compare for == PCMPEQQ (S4.1
_mm_cmpeq_epi64
_mm_cmpeq_epi64_mask (V5...
VPCMPEQUQ (V5...
_mm_cmpeq_epu64_mask
PCMPEQD (S2
_mm_cmpeq_epi32
_mm_cmpeq_epi32_mask (V5...
VPCMPEQUD (V5...
_mm_cmpeq_epu32_mask
PCMPEQW (S2
_mm_cmpeq_epi16
_mm_cmpeq_epi16_mask (V5+BW...
VPCMPEQUW (V5+BW...
_mm_cmpeq_epu16_mask
PCMPEQB (S2
_mm_cmpeq_epi8
_mm_cmpeq_epi8_mask (V5+BW...
VPCMPEQUB (V5+BW...
_mm_cmpeq_epu8_mask
compare for < VPCMPLTQ (V5...
_mm_cmplt_epi64_mask
VPCMPLTUQ (V5...
_mm_cmplt_epu64_mask
VPCMPLTD (V5...
_mm_cmplt_epi32_mask
VPCMPLTUD (V5...
_mm_cmplt_epu32_mask
VPCMPLTW (V5+BW...
_mm_cmplt_epi16_mask
VPCMPLTUW (V5+BW...
_mm_cmplt_epu16_mask
VPCMPLTB (V5+BW...
_mm_cmplt_epi8_mask
VPCMPLTUB (V5+BW...
_mm_cmplt_epu8_mask
compare for <= VPCMPLEQ (V5...
_mm_cmple_epi64_mask
VPCMPLEUQ (V5...
_mm_cmple_epu64_mask
VPCMPLED (V5...
_mm_cmple_epi32_mask
VPCMPLEUD (V5...
_mm_cmple_epu32_mask
VPCMPLEW (V5+BW...
_mm_cmple_epi16_mask
VPCMPLEUW (V5+BW...
_mm_cmple_epu16_mask
VPCMPLEB (V5+BW...
_mm_cmple_epi8_mask
VPCMPLEUB (V5+BW...
_mm_cmple_epu8_mask
compare for > PCMPGTQ (S4.2
_mm_cmpgt_epi64
VPCMPNLEQ (V5...
_mm_cmpgt_epi64_mask
VPCMPNLEUQ (V5...
_mm_cmpgt_epu64_mask
PCMPGTD (S2
_mm_cmpgt_epi32
VPCMPNLED (V5...
_mm_cmpgt_epi32_mask
VPCMPNLEUD (V5...
_mm_cmpgt_epu32_mask
PCMPGTW (S2
_mm_cmpgt_epi16
VPCMPNLEW (V5+BW...
_mm_cmpgt_epi16_mask
VPCMPNLEUW (V5+BW...
_mm_cmpgt_epu16_mask
PCMPGTB (S2
_mm_cmpgt_epi8
VPCMPNLEB (V5+BW...
_mm_cmpgt_epi8_mask
VPCMPNLEUB (V5+BW...
_mm_cmpgt_epu8_mask
compare for >= VPCMPNLTQ (V5...
_mm_cmpge_epi64_mask
VPCMPNLTUQ (V5...
_mm_cmpge_epu64_mask
VPCMPNLTD (V5...
_mm_cmpge_epi32_mask
VPCMPNLTUD (V5...
_mm_cmpge_epu32_mask
VPCMPNLTW (V5+BW...
_mm_cmpge_epi16_mask
VPCMPNLTUW (V5+BW...
_mm_cmpge_epu16_mask
VPCMPNLTB (V5+BW...
_mm_cmpge_epi8_mask
VPCMPNLTUB (V5+BW...
_mm_cmpge_epu8_mask
compare for != VPCMPNEQQ (V5...
_mm_cmpneq_epi64_mask
VPCMPNEQUQ (V5...
_mm_cmpneq_epu64_mask
VPCMPNEQD (V5...
_mm_cmpneq_epi32_mask
VPCMPNEQUD (V5...
_mm_cmpneq_epu32_mask
VPCMPNEQW (V5+BW...
_mm_cmpneq_epi16_mask
VPCMPNEQUW (V5+BW...
_mm_cmpneq_epu16_mask
VPCMPNEQB (V5+BW...
_mm_cmpneq_epi8_mask
VPCMPNEQUB (V5+BW...
_mm_cmpneq_epu8_mask
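
The pre-AVX-512 compares above produce a vector of all-ones/all-zeros lanes rather than an opmask; PMOVMSKB/MOVMSKPS then condense that into an ordinary integer. A sketch (helper name invented; assumes SSE2):

```c
#include <emmintrin.h>  /* SSE2 */

/* PCMPGTD sets each DWORD lane to -1 where a > b, else 0;
   MOVMSKPS packs one sign bit per DWORD lane into an int. */
static int gt_mask(int a0, int a1, int a2, int a3, int b)
{
    __m128i a = _mm_setr_epi32(a0, a1, a2, a3);
    __m128i m = _mm_cmpgt_epi32(a, _mm_set1_epi32(b));
    return _mm_movemask_ps(_mm_castsi128_ps(m)); /* bit i = lane i */
}
```

With AVX-512, the _mask intrinsics listed above return this bitmask directly as a __mmask8/16.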

 

Floating-Point
Double Single Half
when either (or both) is NaN condition unmet condition met condition unmet condition met  
Exception on QNaN YES NO YES NO YES NO YES NO  
compare for == VCMPEQ_OSPD* (V1
_mm_cmp_pd
CMPEQPD* (S2
_mm_cmpeq_pd
VCMPEQ_USPD* (V1
_mm_cmp_pd
VCMPEQ_UQPD* (V1
_mm_cmp_pd
VCMPEQ_OSPS* (V1
_mm_cmp_ps
CMPEQPS* (S1
_mm_cmpeq_ps
VCMPEQ_USPS* (V1
_mm_cmp_ps
VCMPEQ_UQPS* (V1
_mm_cmp_ps
 
compare for < CMPLTPD* (S2
_mm_cmplt_pd
VCMPLT_OQPD* (V1
_mm_cmp_pd
    CMPLTPS* (S1
_mm_cmplt_ps
VCMPLT_OQPS* (V1
_mm_cmp_ps
     
compare for <= CMPLEPD* (S2
_mm_cmple_pd
VCMPLE_OQPD* (V1
_mm_cmp_pd
CMPLEPS* (S1
_mm_cmple_ps
VCMPLE_OQPS* (V1
_mm_cmp_ps
 
compare for > VCMPGTPD* (V1
_mm_cmpgt_pd (S2
VCMPGT_OQPD* (V1
_mm_cmp_pd
    VCMPGTPS* (V1
_mm_cmpgt_ps (S1
VCMPGT_OQPS* (V1
_mm_cmp_ps
     
compare for >= VCMPGEPD* (V1
_mm_cmpge_pd (S2
VCMPGE_OQPD* (V1
_mm_cmp_pd
    VCMPGEPS* (V1
_mm_cmpge_ps (S1
VCMPGE_OQPS* (V1
_mm_cmp_ps
     
compare for != VCMPNEQ_OSPD* (V1
_mm_cmp_pd
VCMPNEQ_OQPD* (V1
_mm_cmp_pd
VCMPNEQ_USPD* (V1
_mm_cmp_pd
CMPNEQPD* (S2
_mm_cmpneq_pd
VCMPNEQ_OSPS* (V1
_mm_cmp_ps
VCMPNEQ_OQPS* (V1
_mm_cmp_ps
VCMPNEQ_USPS* (V1
_mm_cmp_ps
CMPNEQPS* (S1
_mm_cmpneq_ps
 
compare for ! < CMPNLTPD* (S2
_mm_cmpnlt_pd
VCMPNLT_UQPD* (V1
_mm_cmp_pd
CMPNLTPS* (S1
_mm_cmpnlt_ps
VCMPNLT_UQPS* (V1
_mm_cmp_ps
 
compare for ! <=     CMPNLEPD* (S2
_mm_cmpnle_pd
VCMPNLE_UQPD* (V1
_mm_cmp_pd
    CMPNLEPS* (S1
_mm_cmpnle_ps
VCMPNLE_UQPS* (V1
_mm_cmp_ps
 
compare for ! > VCMPNGTPD* (V1
_mm_cmpngt_pd (S2
VCMPNGT_UQPD* (V1
_mm_cmp_pd
VCMPNGTPS* (V1
_mm_cmpngt_ps (S1
VCMPNGT_UQPS* (V1
_mm_cmp_ps
 
compare for ! >=     VCMPNGEPD* (V1
_mm_cmpnge_pd (S2
VCMPNGE_UQPD* (V1
_mm_cmp_pd
    VCMPNGEPS* (V1
_mm_cmpnge_ps (S1
VCMPNGE_UQPS* (V1
_mm_cmp_ps
 
compare for ordered VCMPORD_SPD* (V1
_mm_cmp_pd
CMPORDPD* (S2
_mm_cmpord_pd
VCMPORD_SPS* (V1
_mm_cmp_ps
CMPORDPS* (S1
_mm_cmpord_ps
 
compare for unordered     VCMPUNORD_SPD* (V1
_mm_cmp_pd
CMPUNORDPD* (S2
_mm_cmpunord_pd
    VCMPUNORD_SPS* (V1
_mm_cmp_ps
CMPUNORDPS* (S1
_mm_cmpunord_ps
 
TRUE VCMPTRUE_USPD* (V1
_mm_cmp_pd
VCMPTRUEPD* (V1
_mm_cmp_pd
VCMPTRUE_USPS* (V1
_mm_cmp_ps
VCMPTRUEPS* (V1
_mm_cmp_ps
 
FALSE VCMPFALSE_OSPD* (V1
_mm_cmp_pd
VCMPFALSEPD* (V1
_mm_cmp_pd
    VCMPFALSE_OSPS* (V1
_mm_cmp_ps
VCMPFALSEPS* (V1
_mm_cmp_ps
     

 

  Floating-Point
Double Single Half
compare scalar values
to set flag register
COMISD (S2
_mm_comieq_sd
_mm_comilt_sd
_mm_comile_sd
_mm_comigt_sd
_mm_comige_sd
_mm_comineq_sd

UCOMISD (S2
_mm_ucomieq_sd
_mm_ucomilt_sd
_mm_ucomile_sd
_mm_ucomigt_sd
_mm_ucomige_sd
_mm_ucomineq_sd
COMISS (S1
_mm_comieq_ss
_mm_comilt_ss
_mm_comile_ss
_mm_comigt_ss
_mm_comige_ss
_mm_comineq_ss

UCOMISS (S1
_mm_ucomieq_ss
_mm_ucomilt_ss
_mm_ucomile_ss
_mm_ucomigt_ss
_mm_ucomige_ss
_mm_ucomineq_ss
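
Unlike the packed compares, COMISD/UCOMISD compare only the lowest elements and set the CPU flags; the intrinsics hand the flag result back as 0 or 1. A sketch (helper name invented; assumes SSE2):

```c
#include <emmintrin.h>  /* SSE2 */

/* COMISD: compare the lowest doubles; _mm_comilt_sd returns
   1 if a < b, else 0 (0 also when either operand is NaN). */
static int dbl_less(double a, double b)
{
    return _mm_comilt_sd(_mm_set_sd(a), _mm_set_sd(b));
}
```

The UCOMISD variants behave the same except they raise an exception only on signaling NaN, not QNaN.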
 

 

Bitwise Logical Operations

  Integer Floating-Point
QWORD DWORD WORD BYTE Double Single Half
and PAND (S2
_mm_and_si128
ANDPD (S2
_mm_and_pd
ANDPS (S1
_mm_and_ps
 
VPANDQ (V5...
_mm512_and_epi64
etc
VPANDD (V5...
_mm512_and_epi32
etc
and not PANDN (S2
_mm_andnot_si128
ANDNPD (S2
_mm_andnot_pd
ANDNPS (S1
_mm_andnot_ps
 
VPANDNQ (V5...
_mm512_andnot_epi64
etc
VPANDND (V5...
_mm512_andnot_epi32
etc
or POR (S2
_mm_or_si128
ORPD (S2
_mm_or_pd
ORPS (S1
_mm_or_ps
 
VPORQ (V5...
_mm512_or_epi64
etc
VPORD (V5...
_mm512_or_epi32
etc
xor PXOR (S2
_mm_xor_si128
XORPD (S2
_mm_xor_pd
XORPS (S1
_mm_xor_ps
VPXORQ (V5...
_mm512_xor_epi64
etc
VPXORD (V5...
_mm512_xor_epi32
etc
test PTEST (S4.1
_mm_testz_si128
_mm_testc_si128
_mm_testnzc_si128
VTESTPD (V1
_mm_testz_pd
_mm_testc_pd
_mm_testnzc_pd
VTESTPS (V1
_mm_testz_ps
_mm_testc_ps
_mm_testnzc_ps
 
VPTESTMQ (V5...
_mm_test_epi64_mask
VPTESTNMQ (V5...
_mm_testn_epi64_mask
VPTESTMD (V5...
_mm_test_epi32_mask
VPTESTNMD (V5...
_mm_testn_epi32_mask
VPTESTMW (V5+BW...
_mm_test_epi16_mask
VPTESTNMW (V5+BW...
_mm_testn_epi16_mask
VPTESTMB (V5+BW...
_mm_test_epi8_mask
VPTESTNMB (V5+BW...
_mm_testn_epi8_mask
ternary operation VPTERNLOGQ (V5...
_mm_ternarylogic_epi64
VPTERNLOGD (V5...
_mm_ternarylogic_epi32

 

Bit Shift / Rotate

  Integer
QWORD DWORD WORD BYTE
shift left logical PSLLQ (S2
_mm_slli_epi64
_mm_sll_epi64
PSLLD (S2
_mm_slli_epi32
_mm_sll_epi32
PSLLW (S2
_mm_slli_epi16
_mm_sll_epi16
 
VPSLLVQ (V2
_mm_sllv_epi64
VPSLLVD (V2
_mm_sllv_epi32
VPSLLVW (V5+BW...
_mm_sllv_epi16
 
shift right logical PSRLQ (S2
_mm_srli_epi64
_mm_srl_epi64
PSRLD (S2
_mm_srli_epi32
_mm_srl_epi32
PSRLW (S2
_mm_srli_epi16
_mm_srl_epi16
 
VPSRLVQ (V2
_mm_srlv_epi64
VPSRLVD (V2
_mm_srlv_epi32
VPSRLVW (V5+BW...
_mm_srlv_epi16
 
shift right arithmetic VPSRAQ (V5...
_mm_srai_epi64
_mm_sra_epi64
PSRAD (S2
_mm_srai_epi32
_mm_sra_epi32
PSRAW (S2
_mm_srai_epi16
_mm_sra_epi16
 
VPSRAVQ (V5...
_mm_srav_epi64
VPSRAVD (V2
_mm_srav_epi32
VPSRAVW (V5+BW...
_mm_srav_epi16
 
rotate left VPROLQ (V5...
_mm_rol_epi64
VPROLD (V5...
_mm_rol_epi32
VPROLVQ (V5...
_mm_rolv_epi64
VPROLVD (V5...
_mm_rolv_epi32
rotate right VPRORQ (V5...
_mm_ror_epi64
VPRORD (V5...
_mm_ror_epi32
VPRORVQ (V5...
_mm_rorv_epi64
VPRORVD (V5...
_mm_rorv_epi32
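
The logical/arithmetic distinction in the rows above only matters for right shifts of negative values: PSRLD shifts in zeros, PSRAD replicates the sign bit. A sketch (helper names invented; assumes SSE2):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* PSRLD: logical right shift, zero-fills from the left. */
static int32_t srl32(int32_t x, int n)
{
    return _mm_cvtsi128_si32(_mm_srli_epi32(_mm_set1_epi32(x), n));
}

/* PSRAD: arithmetic right shift, copies the sign bit. */
static int32_t sra32(int32_t x, int n)
{
    return _mm_cvtsi128_si32(_mm_srai_epi32(_mm_set1_epi32(x), n));
}
```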

 

Byte Shift

128-bit
shift left logical PSLLDQ (S2
_mm_slli_si128
shift right logical PSRLDQ (S2
_mm_srli_si128
packed align right PALIGNR (SS3
_mm_alignr_epi8

 

Compare String

explicit length implicit length
return index PCMPESTRI (S4.2
_mm_cmpestri
_mm_cmpestra
_mm_cmpestrc
_mm_cmpestro
_mm_cmpestrs
_mm_cmpestrz
PCMPISTRI (S4.2
_mm_cmpistri
_mm_cmpistra
_mm_cmpistrc
_mm_cmpistro
_mm_cmpistrs
_mm_cmpistrz
return mask PCMPESTRM (S4.2
_mm_cmpestrm
_mm_cmpestra
_mm_cmpestrc
_mm_cmpestro
_mm_cmpestrs
_mm_cmpestrz
PCMPISTRM (S4.2
_mm_cmpistrm
_mm_cmpistra
_mm_cmpistrc
_mm_cmpistro
_mm_cmpistrs
_mm_cmpistrz

 

Others

LDMXCSR (S1
_mm_setcsr
Load MXCSR register
STMXCSR (S1
_mm_getcsr
Save MXCSR register state

PSADBW (S2
_mm_sad_epu8
Compute sum of absolute differences
MPSADBW (S4.1
_mm_mpsadbw_epu8
Performs eight 4-byte wide Sum of Absolute Differences operations to produce eight word integers.
VDBPSADBW (V5+BW...
_mm_dbsad_epu8
Double Block Packed Sum-Absolute-Differences (SAD) on Unsigned Bytes

PMULHRSW (SS3
_mm_mulhrs_epi16
Packed Multiply High with Round and Scale

PHMINPOSUW (S4.1
_mm_minpos_epu16
Finds the value and location of the minimum unsigned word from one of 8 horizontally packed unsigned words. The resulting value and location (offset within the source) are packed into the low dword of the destination XMM register.

VPCONFLICTQ (V5+CD...
_mm512_conflict_epi64
VPCONFLICTD (V5+CD...
_mm512_conflict_epi32
Detect Conflicts Within a Vector of Packed Dword/Qword Values into Dense Memory/Register

VPLZCNTQ (V5+CD...
_mm_lzcnt_epi64
VPLZCNTD (V5+CD...
_mm_lzcnt_epi32
Count the Number of Leading Zero Bits for Packed Dword, Packed Qword Values

VFIXUPIMMPD* (V5...
_mm512_fixupimm_pd
VFIXUPIMMPS* (V5...
_mm512_fixupimm_ps
Fix Up Special Packed Float64/32 Values
VFPCLASSPD* (V5...
_mm512_fpclass_pd_mask
VFPCLASSPS* (V5...
_mm512_fpclass_ps_mask
Tests Types Of a Packed Float64/32 Values
VRANGEPD* (V5+DQ...
_mm_range_pd
VRANGEPS* (V5+DQ...
_mm_range_ps
Range Restriction Calculation For Packed Pairs of Float64/32 Values
VGETEXPPD* (V5...
_mm512_getexp_pd
VGETEXPPS* (V5...
_mm512_getexp_ps
Convert Exponents of Packed DP/SP FP Values to FP Values
VGETMANTPD* (V5...
_mm512_getmant_pd
VGETMANTPS* (V5...
_mm512_getmant_ps
Extract Float64/32 Vector of Normalized Mantissas from Float64/32 Vector

AESDEC (AESNI
_mm_aesdec_si128
Perform an AES decryption round using a 128-bit state and a round key
AESDECLAST (AESNI
_mm_aesdeclast_si128
Perform the last AES decryption round using a 128-bit state and a round key
AESENC (AESNI
_mm_aesenc_si128
Perform an AES encryption round using a 128-bit state and a round key
AESENCLAST (AESNI
_mm_aesenclast_si128
Perform the last AES encryption round using a 128-bit state and a round key
AESIMC (AESNI
_mm_aesimc_si128
Perform an inverse mix column transformation primitive
AESKEYGENASSIST (AESNI
_mm_aeskeygenassist_si128
Assist the creation of round keys with a key expansion schedule
PCLMULQDQ (PCLMULQDQ
_mm_clmulepi64_si128
Perform carryless multiplication of two 64-bit numbers

SHA1RNDS4 (SHA
_mm_sha1rnds4_epu32
Perform Four Rounds of SHA1 Operation
SHA1NEXTE (SHA
_mm_sha1nexte_epu32
Calculate SHA1 State Variable E after Four Rounds
SHA1MSG1 (SHA
_mm_sha1msg1_epu32
Perform an Intermediate Calculation for the Next Four SHA1 Message Dwords
SHA1MSG2 (SHA
_mm_sha1msg2_epu32
Perform a Final Calculation for the Next Four SHA1 Message Dwords
SHA256RNDS2 (SHA
_mm_sha256rnds2_epu32
Perform Two Rounds of SHA256 Operation
SHA256MSG1 (SHA
_mm_sha256msg1_epu32
Perform an Intermediate Calculation for the Next Four SHA256 Message Dwords
SHA256MSG2 (SHA
_mm_sha256msg2_epu32
Perform a Final Calculation for the Next Four SHA256 Message Dwords

VPBROADCASTMB2Q (V5+CD...
_mm_broadcastmb_epi64
VPBROADCASTMW2D (V5+CD...
_mm_broadcastmw_epi32
Broadcast Mask to Vector Register

VZEROALL (V1
_mm256_zeroall
Zero all YMM registers
VZEROUPPER (V1
_mm256_zeroupper
Zero upper 128 bits of all YMM registers

MOVNTPS (S1
_mm_stream_ps
Non-temporal store of four packed single-precision floating-point values from an XMM register into memory
MASKMOVDQU (S2
_mm_maskmoveu_si128
Non-temporal store of selected bytes from an XMM register into memory
MOVNTPD (S2
_mm_stream_pd
Non-temporal store of two packed double-precision floating-point values from an XMM register into memory
MOVNTDQ (S2
_mm_stream_si128
Non-temporal store of double quadword from an XMM register into memory
LDDQU (S3
_mm_lddqu_si128
Special 128-bit unaligned load designed to avoid cache line splits
MOVNTDQA (S4.1
_mm_stream_load_si128
Provides a non-temporal hint that can cause adjacent 16-byte items within an aligned 64-byte region (a streaming line) to be fetched and held in a small set of temporary buffers ("streaming load buffers"). Subsequent streaming loads to other aligned 16-byte items in the same streaming line may be supplied from the streaming load buffer and can improve throughput.

VGATHERPFxDPS (V5+PF
_mm512_mask_prefetch_i32gather_ps
VGATHERPFxQPS (V5+PF
_mm512_mask_prefetch_i64gather_ps
VGATHERPFxDPD (V5+PF
_mm512_mask_prefetch_i32gather_pd
VGATHERPFxQPD (V5+PF
_mm512_mask_prefetch_i64gather_pd
x=0/1
Sparse Prefetch Packed SP/DP Data Values with Signed Dword, Signed Qword Indices Using T0/T1 Hint
VSCATTERPFxDPS (V5+PF
_mm512_prefetch_i32scatter_ps
VSCATTERPFxQPS (V5+PF
_mm512_prefetch_i64scatter_ps
VSCATTERPFxDPD (V5+PF
_mm512_prefetch_i32scatter_pd
VSCATTERPFxQPD (V5+PF
_mm512_prefetch_i64scatter_pd
x=0/1
Sparse Prefetch Packed SP/DP Data Values with Signed Dword, Signed Qword Indices Using T0/T1 Hint with Intent to Write

 

 

TIPS

TIP 1: Zero Clear

XOR instructions work for both integer and floating-point data.

Example: Zero all of 2 QWORDS / 4 DWORDS / 8 WORDS / 16 BYTES in XMM1

        pxor         xmm1, xmm1

Example: Set 0.0f to 4 floats in XMM1

        xorps        xmm1, xmm1

Example: Set 0.0 to 2 doubles in XMM1

        xorpd        xmm1, xmm1

 

TIP 2: Copy the lowest 1 element to other elements in XMM register

Shuffle instructions do the job.

Example: Copy the lowest float element to other 3 elements in XMM1.

        shufps       xmm1, xmm1, 0

Example: Copy the lowest WORD element to other 7 elements in XMM1

        pshuflw       xmm1, xmm1, 0
        pshufd        xmm1, xmm1, 0

Example: Copy the lower QWORD element to the upper element in XMM1

        pshufd        xmm1, xmm1, 44h     ; 01 00 01 00 B = 44h

Is this better?

        punpcklqdq    xmm1, xmm1
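
The same trick in intrinsics: PSHUFD with control 0 copies element 0 to every lane (helper name invented; assumes SSE2):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Copy the lowest DWORD element to all 4 elements. */
static void splat_lowest_dword(const int32_t in[4], int32_t out[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    v = _mm_shuffle_epi32(v, 0);              /* pshufd v, v, 0 */
    _mm_storeu_si128((__m128i *)out, v);
}
```

For floats, _mm_shuffle_ps(v, v, 0) is the equivalent of the shufps example above.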

 

TIP 3: Integer Sign Extension / Zero Extension

Unpack instructions do the job.

Example: Zero extend 8 WORDS in XMM1 to DWORDS in XMM1 (lower 4) and XMM2 (upper 4).

        movdqa     xmm2, xmm1     ; src data WORD[7] [6] [5] [4] [3] [2] [1] [0]
        pxor       xmm3, xmm3     ; upper 16-bit to attach to each WORD = all 0
        punpcklwd  xmm1, xmm3     ; lower 4 DWORDS:  0 [3] 0 [2] 0 [1] 0 [0] 
        punpckhwd  xmm2, xmm3     ; upper 4 DWORDS:  0 [7] 0 [6] 0 [5] 0 [4]

Example: Sign extend 16 BYTES in XMM1 to WORDS in XMM1 (lower 8) and XMM2 (upper 8).

        pxor       xmm3, xmm3
        movdqa     xmm2, xmm1
        pcmpgtb    xmm3, xmm1     ; upper 8-bit to attach to each BYTE = src >= 0 ? 0 : -1
        punpcklbw  xmm1, xmm3     ; lower 8 WORDS
        punpckhbw  xmm2, xmm3     ; upper 8 WORDS

Example (intrinsics): Sign extend 8 WORDS in __m128i variable words8 to DWORDS in dwords4lo (lower 4) and dwords4hi (upper 4)

    const __m128i izero = _mm_setzero_si128();
    __m128i words8hi = _mm_cmpgt_epi16(izero, words8);
    __m128i dwords4lo = _mm_unpacklo_epi16(words8, words8hi);
    __m128i dwords4hi = _mm_unpackhi_epi16(words8, words8hi);

 

TIP 4: Absolute Values of Integers

If an integer value is positive or zero, it is already the absolute value. Otherwise, complementing all bits and then adding 1 gives the absolute value.

Example: Set absolute values of 8 signed WORDS in XMM1 to XMM1

                                  ; if src is positive or 0; if src is negative
        pxor      xmm2, xmm2      
        pcmpgtw   xmm2, xmm1      ; xmm2 <- 0              ; xmm2 <- -1
        pxor      xmm1, xmm2      ; xor with 0(do nothing) ; xor with -1(complement all bits)
        psubw     xmm1, xmm2      ; subtract 0(do nothing) ; subtract -1(add 1)

Example (intrinsics): Set absolute values of 4 DWORDS in __m128i variable dwords4 to dwords4

    const __m128i izero = _mm_setzero_si128();
    __m128i tmp = _mm_cmpgt_epi32(izero, dwords4);
    dwords4 = _mm_xor_si128(dwords4, tmp);
    dwords4 = _mm_sub_epi32(dwords4, tmp);

 

TIP 5: Absolute Values of Floating-Points

Floating-point values are not stored in two's complement, so just clearing the sign (highest) bit gives the absolute value.

Example: Set absolute values of 4 floats in XMM1 to XMM1

; data
              align   16
signoffmask   dd      4 dup (7fffffffH)       ; mask for clearing the highest bit
        
; code
        andps   xmm1, xmmword ptr signoffmask        

Example (intrinsics): Set absolute values of 4 floats in __m128 variable floats4 to floats4

        const __m128 signmask = _mm_set1_ps(-0.0f); // 0x80000000

        floats4 = _mm_andnot_ps(signmask, floats4);

 

TIP 6: Lacking some integer MUL instructions?

Signed/unsigned makes a difference only for the calculation of the upper part. For the lower part, the same instruction can be used for both signed and unsigned.

unsigned WORD * unsigned WORD -> Upper WORD: PMULHUW, Lower WORD: PMULLW

signed WORD * signed WORD -> Upper WORD: PMULHW, Lower WORD: PMULLW
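
Putting the two halves together in intrinsics: PMULLW and PMULHW give the low and high 16 bits of each product, and an unpack interleaves them into full 32-bit results (helper name invented; assumes SSE2):

```c
#include <emmintrin.h>  /* SSE2 */

/* Full 32-bit products of 8 signed WORD pairs:
   PMULLW gives the low halves, PMULHW the high halves,
   PUNPCKLWD/PUNPCKHWD interleave them into DWORDs. */
static void mul16x16_to_32(__m128i a, __m128i b,
                           __m128i *lo4, __m128i *hi4)
{
    __m128i lo = _mm_mullo_epi16(a, b);
    __m128i hi = _mm_mulhi_epi16(a, b);
    *lo4 = _mm_unpacklo_epi16(lo, hi);  /* products of elements 0..3 */
    *hi4 = _mm_unpackhi_epi16(lo, hi);  /* products of elements 4..7 */
}
```

For unsigned inputs, substitute _mm_mulhi_epu16 (PMULHUW) for the high half; the low half stays the same.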

 

TIP 8: max / min

Getting a mask by comparison and then applying bitwise operations does the job.

Example: Compare each signed DWORD in XMM1 and XMM2 and set smaller one to XMM1

; A=xmm1  B=xmm2                    ; if A>B        ; if A<=B
        movdqa      xmm0, xmm1
        pcmpgtd     xmm1, xmm2      ; xmm1=-1       ; xmm1=0
        pand        xmm2, xmm1      ; xmm2=B        ; xmm2=0
        pandn       xmm1, xmm0      ; xmm1=0        ; xmm1=A
        por         xmm1, xmm2      ; xmm1=B        ; xmm1=A

Example (intrinsics): Compare each signed byte in __m128i variables a, b and set larger one to maxAB

    __m128i mask = _mm_cmpgt_epi8(a, b);
    __m128i selectedA = _mm_and_si128(mask, a);
    __m128i selectedB = _mm_andnot_si128(mask, b);
    __m128i maxAB = _mm_or_si128(selectedA, selectedB);

 

TIP 10: Set all bits

PCMPEQx instruction does.

Example: set -1 to all of the 2 QWORDS / 4 DWORDS / 8 WORDS / 16 BYTES in XMM1.

        pcmpeqb         xmm1, xmm1
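
The intrinsics version of the same idiom: comparing a register with itself for equality sets every bit (helper name invented; assumes SSE2; the register is zeroed first only so the value is well defined in C).

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Set all 128 bits: every lane compares equal to itself,
   so PCMPEQD yields all-ones (-1 in each DWORD). */
static void all_ones(int32_t out[4])
{
    __m128i v = _mm_setzero_si128();
    v = _mm_cmpeq_epi32(v, v);          /* pcmpeqd v, v */
    _mm_storeu_si128((__m128i *)out, v);
}
```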

 


ver 2017101400