
x86/x64 SIMD Instruction List (SSE to AVX512) Beta

MMX register (64-bit) instructions are omitted.

S1=SSE  S2=SSE2 S3=SSE3 SS3=SSSE3 S4.1=SSE4.1 S4.2=SSE4.2 V1=AVX V2=AVX2 V5=AVX512

Instructions marked * become scalar instructions (only the lowest element is calculated) when PS/PD/DQ is changed to SS/SD/SI.

The C/C++ intrinsic names are written below each instruction in blue.

AVX/AVX2

AVX512

This document is intended to help you find the correct name of an instruction you are not sure of, so that you can then look it up in the manuals. Refer to the manuals before coding.

Intel's manuals -> https://software.intel.com/en-us/articles/intel-sdm

If you find any errors, please post to the feedback form or email me at the address at the bottom of this page.

 


MOVE     ?MM = XMM / YMM / ZMM

  Integer Floating-Point YMM lane (128-bit)
QWORD DWORD WORD BYTE Double Single Half
?MM whole
from / to
?MM/mem
MOVDQA (S2
_mm_load_si128
_mm_store_si128
MOVDQU (S2
_mm_loadu_si128
_mm_storeu_si128
MOVAPD (S2
_mm_load_pd
_mm_loadr_pd
_mm_store_pd
_mm_storer_pd

MOVUPD (S2
_mm_loadu_pd
_mm_storeu_pd
MOVAPS (S1
_mm_load_ps
_mm_loadr_ps
_mm_store_ps
_mm_storer_ps

MOVUPS (S1
_mm_loadu_ps
_mm_storeu_ps
 
VMOVDQA64 (V5...
_mm_mask_load_epi64
_mm_mask_store_epi64
etc
VMOVDQU64 (V5...
_mm_mask_loadu_epi64
_mm_mask_storeu_epi64
etc
VMOVDQA32 (V5...
_mm_mask_load_epi32
_mm_mask_store_epi32
etc
VMOVDQU32 (V5...
_mm_mask_loadu_epi32
_mm_mask_storeu_epi32
etc
VMOVDQU16 (V5+BW...
_mm_mask_loadu_epi16
_mm_mask_storeu_epi16
etc
VMOVDQU8 (V5+BW...
_mm_mask_loadu_epi8
_mm_mask_storeu_epi8
etc
XMM upper half
from / to
mem
MOVHPD (S2
_mm_loadh_pd
_mm_storeh_pd
MOVHPS (S1
_mm_loadh_pi
_mm_storeh_pi
 
XMM upper half
from / to
XMM lower half
MOVHLPS (S1
_mm_movehl_ps
MOVLHPS (S1
_mm_movelh_ps
 
XMM lower half
from / to
mem
        MOVLPD (S2
_mm_loadl_pd
_mm_storel_pd
MOVLPS (S1
_mm_loadl_pi
_mm_storel_pi
   
XMM lowest 1 elem
from / to
r/m
MOVQ (S2
_mm_cvtsi64_si128
_mm_cvtsi128_si64
MOVD (S2
_mm_cvtsi32_si128
_mm_cvtsi128_si32
   
XMM lowest 1 elem
from / to
XMM/mem
MOVQ (S2
_mm_move_epi64
      MOVSD (S2
_mm_load_sd
_mm_store_sd
_mm_move_sd
MOVSS (S1
_mm_load_ss
_mm_store_ss
_mm_move_ss
   
XMM whole
from
1 elem
TIP 2
_mm_set1_epi64x
VPBROADCASTQ (V2
_mm_broadcastq_epi64
TIP 2
_mm_set1_epi32
VPBROADCASTD (V2
_mm_broadcastd_epi32
TIP 2
_mm_set1_epi16
VPBROADCASTW (V2
_mm_broadcastw_epi16
_mm_set1_epi8
VPBROADCASTB (V2
_mm_broadcastb_epi8
TIP 2
_mm_set1_pd
_mm_load1_pd
MOVDDUP (S3
_mm_movedup_pd
_mm_loaddup_pd

TIP 2
_mm_set1_ps
_mm_load1_ps

VBROADCASTSS
from mem (V1
from XMM (V2
_mm_broadcast_ss
YMM / ZMM whole
from
1 elem
VPBROADCASTQ (V2
_mm256_broadcastq_epi64
VPBROADCASTD (V2
_mm256_broadcastd_epi32
VPBROADCASTW (V2
_mm256_broadcastw_epi16
VPBROADCASTB (V2
_mm256_broadcastb_epi8
VBROADCASTSD
 from mem (V1
 from XMM (V2
_mm256_broadcast_sd
VBROADCASTSS
 from mem (V1
 from XMM (V2
_mm256_broadcast_ss
  VBROADCASTF128 (V1
_mm256_broadcast_ps
_mm256_broadcast_pd

VBROADCASTI128 (V2
_mm256_broadcastsi128_si256
YMM / ZMM whole
from
2/4/8 elems
VBROADCASTI64X2 (V5+DQ...
_mm512_broadcast_i64x2
VBROADCASTI64X4 (V5
_mm512_broadcast_i64x4
VBROADCASTI32X2 (V5+DQ...
_mm512_broadcast_i32x2
VBROADCASTI32X4 (V5...
_mm512_broadcast_i32x4
VBROADCASTI32X8 (V5+DQ
_mm512_broadcast_i32x8
VBROADCASTF64X2 (V5+DQ...
_mm512_broadcast_f64x2
VBROADCASTF64X4 (V5
_mm512_broadcast_f64x4
VBROADCASTF32X2 (V5+DQ...
_mm512_broadcast_f32x2
VBROADCASTF32X4 (V5...
_mm512_broadcast_f32x4
VBROADCASTF32X8 (V5+DQ
_mm512_broadcast_f32x8
?MM
from
multiple elems
_mm_set_epi64x
_mm_setr_epi64x
_mm_set_epi32
_mm_setr_epi32
_mm_set_epi16
_mm_setr_epi16
_mm_set_epi8
_mm_setr_epi8
_mm_set_pd
_mm_setr_pd
_mm_set_ps
_mm_setr_ps
   
?MM whole
from
zero
TIP 1
_mm_setzero_si128
TIP 1
_mm_setzero_pd
TIP 1
_mm_setzero_ps
   
extract PEXTRQ (S4.1
_mm_extract_epi64
PEXTRD (S4.1
_mm_extract_epi32
PEXTRW to r (S2
PEXTRW to r/m (S4.1
_mm_extract_epi16
PEXTRB (S4.1
_mm_extract_epi8
  EXTRACTPS (S4.1
_mm_extract_ps
  VEXTRACTF128 (V1
_mm256_extractf128_ps
_mm256_extractf128_pd
_mm256_extractf128_si256

VEXTRACTI128 (V2
_mm256_extracti128_si256
VEXTRACTI64X2 (V5+DQ...
_mm512_extracti64x2_epi64
VEXTRACTI64X4 (V5
_mm512_extracti64x4_epi64
VEXTRACTI32X4 (V5...
_mm512_extracti32x4_epi32
VEXTRACTI32X8 (V5+DQ
_mm512_extracti32x8_epi32
VEXTRACTF64X2 (V5+DQ...
_mm512_extractf64x2_pd
VEXTRACTF64X4 (V5
_mm512_extractf64x4_pd
VEXTRACTF32X4 (V5...
_mm512_extractf32x4_ps
VEXTRACTF32X8 (V5+DQ
_mm512_extractf32x8_ps
insert PINSRQ (S4.1
_mm_insert_epi64
PINSRD (S4.1
_mm_insert_epi32
PINSRW (S2
_mm_insert_epi16
PINSRB (S4.1
_mm_insert_epi8
  INSERTPS (S4.1
_mm_insert_ps
  VINSERTF128 (V1
_mm256_insertf128_ps
_mm256_insertf128_pd
_mm256_insertf128_si256

VINSERTI128 (V2
_mm256_inserti128_si256
VINSERTI64X2 (V5+DQ...
_mm512_inserti64x2
VINSERTI64X4 (V5...
_mm512_inserti64x4
VINSERTI32X4 (V5...
_mm512_inserti32x4
VINSERTI32X8 (V5+DQ
_mm512_inserti32x8
VINSERTF64X2 (V5+DQ...
_mm512_insertf64x2
VINSERTF64X4 (V5
_mm512_insertf64x4
VINSERTF32X4 (V5...
_mm512_insertf32x4
VINSERTF32X8 (V5+DQ
_mm512_insertf32x8
unpack
PUNPCKHQDQ (S2
_mm_unpackhi_epi64
PUNPCKLQDQ (S2
_mm_unpacklo_epi64
PUNPCKHDQ (S2
_mm_unpackhi_epi32
PUNPCKLDQ (S2
_mm_unpacklo_epi32
PUNPCKHWD (S2
_mm_unpackhi_epi16
PUNPCKLWD (S2
_mm_unpacklo_epi16
PUNPCKHBW (S2
_mm_unpackhi_epi8
PUNPCKLBW (S2
_mm_unpacklo_epi8
UNPCKHPD (S2
_mm_unpackhi_pd
UNPCKLPD (S2
_mm_unpacklo_pd
UNPCKHPS (S1
_mm_unpackhi_ps
UNPCKLPS (S1
_mm_unpacklo_ps
   
shuffle/permute
VPERMQ (V2
_mm256_permute4x64_epi64
VPERMI2Q (V5...
_mm_permutex2var_epi64
PSHUFD (S2
_mm_shuffle_epi32
VPERMD (V2
_mm256_permutevar8x32_epi32
_mm256_permutexvar_epi32
VPERMI2D (V5...
_mm_permutex2var_epi32
PSHUFHW (S2
_mm_shufflehi_epi16
PSHUFLW (S2
_mm_shufflelo_epi16
VPERMW (V5+BW...
_mm_permutexvar_epi16
VPERMI2W (V5+BW...
_mm_permutex2var_epi16
PSHUFB (SS3
_mm_shuffle_epi8
SHUFPD (S2
_mm_shuffle_pd
VPERMILPD (V1
_mm_permute_pd
_mm_permutevar_pd

VPERMPD (V2
_mm256_permute4x64_pd
VPERMI2PD (V5...
_mm_permutex2var_pd
SHUFPS (S1
_mm_shuffle_ps
VPERMILPS (V1
_mm_permute_ps
_mm_permutevar_ps

VPERMPS (V2
_mm256_permutevar8x32_ps
VPERMI2PS (V5...
_mm_permutex2var_ps
  VPERM2F128 (V1
_mm256_permute2f128_ps
_mm256_permute2f128_pd
_mm256_permute2f128_si256

VPERM2I128 (V2
_mm256_permute2x128_si256
VSHUFI64X2 (V5...
_mm512_shuffle_i64x2
VSHUFI32X4 (V5...
_mm512_shuffle_i32x4
VSHUFF64X2 (V5...
_mm512_shuffle_f64x2
VSHUFF32X4 (V5...
_mm512_shuffle_f32x4
blend
VPBLENDMQ (V5...
_mm_mask_blend_epi64
VPBLENDD (V2
_mm_blend_epi32
VPBLENDMD (V5...
_mm_mask_blend_epi32
PBLENDW (S4.1
_mm_blend_epi16
VPBLENDMW (V5+BW...
_mm_mask_blend_epi16
PBLENDVB (S4.1
_mm_blendv_epi8
VPBLENDMB (V5+BW...
_mm_mask_blend_epi8
BLENDPD (S4.1
_mm_blend_pd
BLENDVPD (S4.1
_mm_blendv_pd
VBLENDMPD (V5...
_mm_mask_blend_pd
BLENDPS (S4.1
_mm_blend_ps
BLENDVPS (S4.1
_mm_blendv_ps
VBLENDMPS (V5...
_mm_mask_blend_ps
   
move and duplicate MOVDDUP (S3
_mm_movedup_pd
_mm_loaddup_pd
MOVSHDUP (S3
_mm_movehdup_ps
MOVSLDUP (S3
_mm_moveldup_ps
 
mask move VPMASKMOVQ (V2
_mm_maskload_epi64
_mm_maskstore_epi64
VPMASKMOVD (V2
_mm_maskload_epi32
_mm_maskstore_epi32
    VMASKMOVPD (V1
_mm_maskload_pd
_mm_maskstore_pd
VMASKMOVPS (V1
_mm_maskload_ps
_mm_maskstore_ps
   
extract highest bit       PMOVMSKB (S2
_mm_movemask_epi8
MOVMSKPD (S2
_mm_movemask_pd
MOVMSKPS (S1
_mm_movemask_ps
   
VPMOVQ2M (V5+DQ...
_mm_movepi64_mask
VPMOVD2M (V5+DQ...
_mm_movepi32_mask
VPMOVW2M (V5+BW...
_mm_movepi16_mask
VPMOVB2M (V5+BW...
_mm_movepi8_mask
gather
VPGATHERDQ (V2
_mm_i32gather_epi64
_mm_mask_i32gather_epi64

VPGATHERQQ (V2
_mm_i64gather_epi64
_mm_mask_i64gather_epi64
VPGATHERDD (V2
_mm_i32gather_epi32
_mm_mask_i32gather_epi32

VPGATHERQD (V2
_mm_i64gather_epi32
_mm_mask_i64gather_epi32
    VGATHERDPD (V2
_mm_i32gather_pd
_mm_mask_i32gather_pd

VGATHERQPD (V2
_mm_i64gather_pd
_mm_mask_i64gather_pd
VGATHERDPS (V2
_mm_i32gather_ps
_mm_mask_i32gather_ps

VGATHERQPS (V2
_mm_i64gather_ps
_mm_mask_i64gather_ps
   
scatter
VPSCATTERDQ (V5...
_mm_i32scatter_epi64
_mm_mask_i32scatter_epi64

VPSCATTERQQ (V5...
_mm_i64scatter_epi64
_mm_mask_i64scatter_epi64
VPSCATTERDD (V5...
_mm_i32scatter_epi32
_mm_mask_i32scatter_epi32

VPSCATTERQD (V5...
_mm_i64scatter_epi32
_mm_mask_i64scatter_epi32
    VSCATTERDPD (V5...
_mm_i32scatter_pd
_mm_mask_i32scatter_pd

VSCATTERQPD (V5...
_mm_i64scatter_pd
_mm_mask_i64scatter_pd
VSCATTERDPS (V5...
_mm_i32scatter_ps
_mm_mask_i32scatter_ps

VSCATTERQPS (V5...
_mm_i64scatter_ps
_mm_mask_i64scatter_ps
   
compress
VPCOMPRESSQ (V5...
_mm_mask_compress_epi64
_mm_mask_compressstoreu_epi64
VPCOMPRESSD (V5...
_mm_mask_compress_epi32
_mm_mask_compressstoreu_epi32
VCOMPRESSPD (V5...
_mm_mask_compress_pd
_mm_mask_compressstoreu_pd
VCOMPRESSPS (V5...
_mm_mask_compress_ps
_mm_mask_compressstoreu_ps
expand
VPEXPANDQ (V5...
_mm_mask_expand_epi64
_mm_mask_expandloadu_epi64
VPEXPANDD (V5...
_mm_mask_expand_epi32
_mm_mask_expandloadu_epi32
VEXPANDPD (V5...
_mm_mask_expand_pd
_mm_mask_expandloadu_pd
VEXPANDPS (V5...
_mm_mask_expand_ps
_mm_mask_expandloadu_ps
align right VALIGNQ (V5...
_mm_alignr_epi64
VALIGND (V5...
_mm_alignr_epi32
PALIGNR (SS3
_mm_alignr_epi8
expand Opmask bits VPMOVM2Q (V5+DQ...
_mm_movm_epi64
VPMOVM2D (V5+DQ...
_mm_movm_epi32
VPMOVM2W (V5+BW...
_mm_movm_epi16
VPMOVM2B (V5+BW...
_mm_movm_epi8
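
As a quick orientation to the load/store and broadcast rows above, here is a minimal C sketch. The helper names are made up for illustration; it assumes SSE2, which is baseline on x86-64, and the <emmintrin.h> header.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Move 4 DWORDs from unaligned memory through an XMM register and
   back out (the MOVDQU row). MOVDQA requires 16-byte alignment. */
static void copy4_dwords(const int32_t *src, int32_t *dst)
{
    __m128i v = _mm_loadu_si128((const __m128i *)src); /* movdqu load  */
    _mm_storeu_si128((__m128i *)dst, v);               /* movdqu store */
}

/* Fill all 4 DWORD elements from 1 element (the "?MM whole from
   1 elem" row; see also TIP 2). */
static void splat_dword(int32_t x, int32_t out[4])
{
    __m128i v = _mm_set1_epi32(x);
    _mm_storeu_si128((__m128i *)out, v);
}
```

Without AVX-512 the masked-store variants above are not available, so unaligned loads/stores plus set1 are the portable baseline.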

 

Conversions

from \ to Integer Floating-Point
QWORD DWORD WORD BYTE Double Single Half
Integer QWORD VPMOVQD (V5...
_mm_cvtepi64_epi32
VPMOVSQD (V5...
_mm_cvtsepi64_epi32
VPMOVUSQD (V5...
_mm_cvtusepi64_epi32
VPMOVQW (V5...
_mm_cvtepi64_epi16
VPMOVSQW (V5...
_mm_cvtsepi64_epi16
VPMOVUSQW (V5...
_mm_cvtusepi64_epi16
VPMOVQB (V5...
_mm_cvtepi64_epi8
VPMOVSQB (V5...
_mm_cvtsepi64_epi8
VPMOVUSQB (V5...
_mm_cvtusepi64_epi8
CVTSI2SD (S2 scalar only
_mm_cvtsi64_sd
VCVTQQ2PD* (V5+DQ...
_mm_cvtepi64_pd
VCVTUQQ2PD* (V5+DQ...
_mm_cvtepu64_pd
CVTSI2SS (S1 scalar only
_mm_cvtsi64_ss
VCVTQQ2PS* (V5+DQ...
_mm_cvtepi64_ps
VCVTUQQ2PS* (V5+DQ...
_mm_cvtepu64_ps
DWORD TIP 3
PMOVSXDQ (S4.1
_mm_cvtepi32_epi64
PMOVZXDQ (S4.1
_mm_cvtepu32_epi64
  PACKSSDW (S2
_mm_packs_epi32
PACKUSDW (S4.1
_mm_packus_epi32
VPMOVDW (V5...
_mm_cvtepi32_epi16
VPMOVSDW (V5...
_mm_cvtsepi32_epi16
VPMOVUSDW (V5...
_mm_cvtusepi32_epi16
VPMOVDB (V5...
_mm_cvtepi32_epi8
VPMOVSDB (V5...
_mm_cvtsepi32_epi8
VPMOVUSDB (V5...
_mm_cvtusepi32_epi8
CVTDQ2PD* (S2
_mm_cvtepi32_pd
VCVTUDQ2PD* (V5...
_mm_cvtepu32_pd
CVTDQ2PS* (S2
_mm_cvtepi32_ps
VCVTUDQ2PS* (V5...
_mm_cvtepu32_ps
WORD PMOVSXWQ (S4.1
_mm_cvtepi16_epi64
PMOVZXWQ (S4.1
_mm_cvtepu16_epi64
TIP 3
PMOVSXWD (S4.1
_mm_cvtepi16_epi32
PMOVZXWD (S4.1
_mm_cvtepu16_epi32
PACKSSWB (S2
_mm_packs_epi16
PACKUSWB (S2
_mm_packus_epi16
VPMOVWB (V5+BW...
_mm_cvtepi16_epi8
VPMOVSWB (V5+BW...
_mm_cvtsepi16_epi8
VPMOVUSWB (V5+BW...
_mm_cvtusepi16_epi8
BYTE PMOVSXBQ (S4.1
_mm_cvtepi8_epi64
PMOVZXBQ (S4.1
_mm_cvtepu8_epi64
PMOVSXBD (S4.1
_mm_cvtepi8_epi32
PMOVZXBD (S4.1
_mm_cvtepu8_epi32
TIP 3
PMOVSXBW (S4.1
_mm_cvtepi8_epi16
PMOVZXBW (S4.1
_mm_cvtepu8_epi16
Floating-Point Double CVTSD2SI / CVTTSD2SI (S2 scalar only
_mm_cvtsd_si64 / _mm_cvttsd_si64
VCVTPD2QQ* / VCVTTPD2QQ* (V5+DQ...
_mm_cvtpd_epi64 / _mm_cvttpd_epi64
VCVTPD2UQQ* / VCVTTPD2UQQ* (V5+DQ...
_mm_cvtpd_epu64 / _mm_cvttpd_epu64
right ones are with truncation
CVTPD2DQ* / CVTTPD2DQ* (S2
_mm_cvtpd_epi32 / _mm_cvttpd_epi32
VCVTPD2UDQ* / VCVTTPD2UDQ* (V5...
_mm_cvtpd_epu32 / _mm_cvttpd_epu32
right ones are with truncation
CVTPD2PS* (S2
_mm_cvtpd_ps
Single CVTSS2SI / CVTTSS2SI (S1 scalar only
_mm_cvtss_si64 / _mm_cvttss_si64
VCVTPS2QQ* / VCVTTPS2QQ* (V5+DQ...
_mm_cvtps_epi64 / _mm_cvttps_epi64
VCVTPS2UQQ* / VCVTTPS2UQQ* (V5+DQ...
_mm_cvtps_epu64 / _mm_cvttps_epu64
right ones are with truncation
CVTPS2DQ* / CVTTPS2DQ* (S2
_mm_cvtps_epi32 / _mm_cvttps_epi32
VCVTPS2UDQ* / VCVTTPS2UDQ* (V5...
_mm_cvtps_epu32 / _mm_cvttps_epu32
right ones are with truncation
  CVTPS2PD* (S2
_mm_cvtps_pd
VCVTPS2PH (V1
_mm_cvtps_ph
Half VCVTPH2PS (V1
_mm_cvtph_ps
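
The CVT/CVTT pairs above differ only in rounding: the plain form uses the current MXCSR rounding mode (round-to-nearest-even by default), while the T form truncates toward zero. A small C sketch (helper names invented; assumes SSE2):

```c
#include <emmintrin.h>  /* SSE2 */

/* CVTPS2DQ: float -> int32 using the current rounding mode
   (round-to-nearest-even unless MXCSR has been changed). */
static int cvt_round(float x)
{
    return _mm_cvtsi128_si32(_mm_cvtps_epi32(_mm_set1_ps(x)));
}

/* CVTTPS2DQ: float -> int32 with truncation toward zero,
   like a C cast. */
static int cvt_trunc(float x)
{
    return _mm_cvtsi128_si32(_mm_cvttps_epi32(_mm_set1_ps(x)));
}
```

Note that 2.5 rounds to 2 (nearest even), not 3, under the default mode.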

 

Arithmetic Operations

  Integer Floating-Point
QWORD DWORD WORD BYTE Double Single Half
add PADDQ (S2
_mm_add_epi64
PADDD (S2
_mm_add_epi32
PADDW (S2
_mm_add_epi16
PADDSW (S2
_mm_adds_epi16
PADDUSW (S2
_mm_adds_epu16
PADDB (S2
_mm_add_epi8
PADDSB (S2
_mm_adds_epi8
PADDUSB (S2
_mm_adds_epu8
ADDPD* (S2
_mm_add_pd
ADDPS* (S1
_mm_add_ps
sub PSUBQ (S2
_mm_sub_epi64
PSUBD (S2
_mm_sub_epi32
PSUBW (S2
_mm_sub_epi16
PSUBSW (S2
_mm_subs_epi16
PSUBUSW (S2
_mm_subs_epu16
PSUBB (S2
_mm_sub_epi8
PSUBSB (S2
_mm_subs_epi8
PSUBUSB (S2
_mm_subs_epu8
SUBPD* (S2
_mm_sub_pd
SUBPS* (S1
_mm_sub_ps
 
mul VPMULLQ (V5+DQ...
_mm_mullo_epi64
PMULDQ (S4.1
_mm_mul_epi32
PMULUDQ (S2
_mm_mul_epu32
PMULLD (S4.1
_mm_mullo_epi32
PMULHW (S2
_mm_mulhi_epi16
PMULHUW (S2
_mm_mulhi_epu16
PMULLW (S2
_mm_mullo_epi16
MULPD* (S2
_mm_mul_pd
MULPS* (S1
_mm_mul_ps
div DIVPD* (S2
_mm_div_pd
DIVPS* (S1
_mm_div_ps
reciprocal         VRCP14PD* (V5...
_mm_rcp14_pd
VRCP28PD* (V5+ER
_mm512_rcp28_pd
RCPPS* (S1
_mm_rcp_ps
VRCP14PS* (V5...
_mm_rcp14_ps
VRCP28PS* (V5+ER
_mm512_rcp28_ps
 
square root         SQRTPD* (S2
_mm_sqrt_pd
SQRTPS* (S1
_mm_sqrt_ps
 
reciprocal of square root         VRSQRT14PD* (V5...
_mm_rsqrt14_pd
VRSQRT28PD* (V5+ER
_mm512_rsqrt28_pd
RSQRTPS* (S1
_mm_rsqrt_ps
VRSQRT14PS* (V5...
_mm_rsqrt14_ps
VRSQRT28PS* (V5+ER
_mm512_rsqrt28_ps
 
power of two         VEXP2PD* (V5+ER
_mm512_exp2a23_round_pd
VEXP2PS* (V5+ER
_mm512_exp2a23_round_ps
 
multiply nth power of 2 VSCALEFPD* (V5...
_mm_scalef_pd
VSCALEFPS* (V5...
_mm_scalef_ps
max TIP 8
VPMAXSQ (V5...
_mm_max_epi64
VPMAXUQ (V5...
_mm_max_epu64
TIP 8
PMAXSD (S4.1
_mm_max_epi32
PMAXUD (S4.1
_mm_max_epu32
PMAXSW (S2
_mm_max_epi16
PMAXUW (S4.1
_mm_max_epu16
TIP 8
PMAXSB (S4.1
_mm_max_epi8
PMAXUB (S2
_mm_max_epu8
TIP 8
MAXPD* (S2
_mm_max_pd
TIP 8
MAXPS* (S1
_mm_max_ps
 
min TIP 8
VPMINSQ (V5...
_mm_min_epi64
VPMINUQ (V5...
_mm_min_epu64
TIP 8
PMINSD (S4.1
_mm_min_epi32
PMINUD (S4.1
_mm_min_epu32
PMINSW (S2
_mm_min_epi16
PMINUW (S4.1
_mm_min_epu16
TIP 8
PMINSB (S4.1
_mm_min_epi8
PMINUB (S2
_mm_min_epu8
TIP 8
MINPD* (S2
_mm_min_pd
TIP 8
MINPS* (S1
_mm_min_ps
average     PAVGW (S2
_mm_avg_epu16
PAVGB (S2
_mm_avg_epu8
     
absolute TIP 4
VPABSQ (V5...
_mm_abs_epi64
TIP 4
PABSD (SS3
_mm_abs_epi32
TIP 4
PABSW (SS3
_mm_abs_epi16
TIP 4
PABSB (SS3
_mm_abs_epi8
TIP 5 TIP 5  
sign operation   PSIGND (SS3
_mm_sign_epi32
PSIGNW (SS3
_mm_sign_epi16
PSIGNB (SS3
_mm_sign_epi8
     
round         ROUNDPD* (S4.1
_mm_round_pd
_mm_floor_pd
_mm_ceil_pd

VRNDSCALEPD* (V5...
_mm_roundscale_pd
ROUNDPS* (S4.1
_mm_round_ps
_mm_floor_ps
_mm_ceil_ps

VRNDSCALEPS* (V5...
_mm_roundscale_ps
 
difference from rounded value         VREDUCEPD* (V5+DQ...
_mm_reduce_pd
VREDUCEPS* (V5+DQ...
_mm_reduce_ps
 
add / sub         ADDSUBPD (S3
_mm_addsub_pd
ADDSUBPS (S3
_mm_addsub_ps
 
horizontal add   PHADDD (SS3
_mm_hadd_epi32
PHADDW (SS3
_mm_hadd_epi16
PHADDSW (SS3
_mm_hadds_epi16
  HADDPD (S3
_mm_hadd_pd
HADDPS (S3
_mm_hadd_ps
 
horizontal sub   PHSUBD (SS3
_mm_hsub_epi32
PHSUBW (SS3
_mm_hsub_epi16
PHSUBSW (SS3
_mm_hsubs_epi16
  HSUBPD (S3
_mm_hsub_pd
HSUBPS (S3
_mm_hsub_ps
 
dot product         DPPD (S4.1
_mm_dp_pd
DPPS (S4.1
_mm_dp_ps
 
multiply and add PMADDWD (S2
_mm_madd_epi16
PMADDUBSW (SS3
_mm_maddubs_epi16
fused multiply and add / sub         VFMADDxxxPD* (FMA
_mm_fmadd_pd
VFMSUBxxxPD* (FMA
_mm_fmsub_pd
VFMADDSUBxxxPD (FMA
_mm_fmaddsub_pd
VFMSUBADDxxxPD (FMA
_mm_fmsubadd_pd
VFNMADDxxxPD* (FMA
_mm_fnmadd_pd
VFNMSUBxxxPD* (FMA
_mm_fnmsub_pd
xxx=132/213/231
VFMADDxxxPS* (FMA
_mm_fmadd_ps
VFMSUBxxxPS* (FMA
_mm_fmsub_ps
VFMADDSUBxxxPS (FMA
_mm_fmaddsub_ps
VFMSUBADDxxxPS (FMA
_mm_fmsubadd_ps
VFNMADDxxxPS* (FMA
_mm_fnmadd_ps
VFNMSUBxxxPS* (FMA
_mm_fnmsub_ps
xxx=132/213/231
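
The add/sub rows distinguish wrapping (PADDW) from saturating (PADDSW/PADDUSW) arithmetic. A minimal C sketch of the difference (helper names invented; assumes SSE2):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* PADDW: 16-bit add with two's-complement wraparound. */
static int16_t add_wrap(int16_t a, int16_t b)
{
    __m128i r = _mm_add_epi16(_mm_set1_epi16(a), _mm_set1_epi16(b));
    return (int16_t)_mm_extract_epi16(r, 0);   /* PEXTRW */
}

/* PADDSW: 16-bit add saturating to [-32768, 32767]. */
static int16_t add_sat(int16_t a, int16_t b)
{
    __m128i r = _mm_adds_epi16(_mm_set1_epi16(a), _mm_set1_epi16(b));
    return (int16_t)_mm_extract_epi16(r, 0);
}
```

Saturation is what makes the *S/​*US forms useful for pixel and audio math, where wraparound would produce visible artifacts.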
 

 

Compare

  Integer
QWORD DWORD WORD BYTE
compare for == PCMPEQQ (S4.1
_mm_cmpeq_epi64
_mm_cmpeq_epi64_mask (V5...
VPCMPEQUQ (V5...
_mm_cmpeq_epu64_mask
PCMPEQD (S2
_mm_cmpeq_epi32
_mm_cmpeq_epi32_mask (V5...
VPCMPEQUD (V5...
_mm_cmpeq_epu32_mask
PCMPEQW (S2
_mm_cmpeq_epi16
_mm_cmpeq_epi16_mask (V5+BW...
VPCMPEQUW (V5+BW...
_mm_cmpeq_epu16_mask
PCMPEQB (S2
_mm_cmpeq_epi8
_mm_cmpeq_epi8_mask (V5+BW...
VPCMPEQUB (V5+BW...
_mm_cmpeq_epu8_mask
compare for < VPCMPLTQ (V5...
_mm_cmplt_epi64_mask
VPCMPLTUQ (V5...
_mm_cmplt_epu64_mask
VPCMPLTD (V5...
_mm_cmplt_epi32_mask
VPCMPLTUD (V5...
_mm_cmplt_epu32_mask
VPCMPLTW (V5+BW...
_mm_cmplt_epi16_mask
VPCMPLTUW (V5+BW...
_mm_cmplt_epu16_mask
VPCMPLTB (V5+BW...
_mm_cmplt_epi8_mask
VPCMPLTUB (V5+BW...
_mm_cmplt_epu8_mask
compare for <= VPCMPLEQ (V5...
_mm_cmple_epi64_mask
VPCMPLEUQ (V5...
_mm_cmple_epu64_mask
VPCMPLED (V5...
_mm_cmple_epi32_mask
VPCMPLEUD (V5...
_mm_cmple_epu32_mask
VPCMPLEW (V5+BW...
_mm_cmple_epi16_mask
VPCMPLEUW (V5+BW...
_mm_cmple_epu16_mask
VPCMPLEB (V5+BW...
_mm_cmple_epi8_mask
VPCMPLEUB (V5+BW...
_mm_cmple_epu8_mask
compare for > PCMPGTQ (S4.2
_mm_cmpgt_epi64
VPCMPNLEQ (V5...
_mm_cmpgt_epi64_mask
VPCMPNLEUQ (V5...
_mm_cmpgt_epu64_mask
PCMPGTD (S2
_mm_cmpgt_epi32
VPCMPNLED (V5...
_mm_cmpgt_epi32_mask
VPCMPNLEUD (V5...
_mm_cmpgt_epu32_mask
PCMPGTW (S2
_mm_cmpgt_epi16
VPCMPNLEW (V5+BW...
_mm_cmpgt_epi16_mask
VPCMPNLEUW (V5+BW...
_mm_cmpgt_epu16_mask
PCMPGTB (S2
_mm_cmpgt_epi8
VPCMPNLEB (V5+BW...
_mm_cmpgt_epi8_mask
VPCMPNLEUB (V5+BW...
_mm_cmpgt_epu8_mask
compare for >= VPCMPNLTQ (V5...
_mm_cmpge_epi64_mask
VPCMPNLTUQ (V5...
_mm_cmpge_epu64_mask
VPCMPNLTD (V5...
_mm_cmpge_epi32_mask
VPCMPNLTUD (V5...
_mm_cmpge_epu32_mask
VPCMPNLTW (V5+BW...
_mm_cmpge_epi16_mask
VPCMPNLTUW (V5+BW...
_mm_cmpge_epu16_mask
VPCMPNLTB (V5+BW...
_mm_cmpge_epi8_mask
VPCMPNLTUB (V5+BW...
_mm_cmpge_epu8_mask
compare for != VPCMPNEQQ (V5...
_mm_cmpneq_epi64_mask
VPCMPNEQUQ (V5...
_mm_cmpneq_epu64_mask
VPCMPNEQD (V5...
_mm_cmpneq_epi32_mask
VPCMPNEQUD (V5...
_mm_cmpneq_epu32_mask
VPCMPNEQW (V5+BW...
_mm_cmpneq_epi16_mask
VPCMPNEQUW (V5+BW...
_mm_cmpneq_epu16_mask
VPCMPNEQB (V5+BW...
_mm_cmpneq_epi8_mask
VPCMPNEQUB (V5+BW...
_mm_cmpneq_epu8_mask
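
The pre-AVX-512 compares above produce a vector of all-ones/all-zeros lanes rather than an opmask; PMOVMSKB/MOVMSKPS then condense that into an ordinary integer. A sketch (helper name invented; assumes SSE2):

```c
#include <emmintrin.h>  /* SSE2 */

/* PCMPGTD sets each DWORD lane to -1 where a > b, else 0;
   MOVMSKPS packs one sign bit per DWORD lane into an int. */
static int gt_mask(int a0, int a1, int a2, int a3, int b)
{
    __m128i a = _mm_setr_epi32(a0, a1, a2, a3);
    __m128i m = _mm_cmpgt_epi32(a, _mm_set1_epi32(b));
    return _mm_movemask_ps(_mm_castsi128_ps(m)); /* bit i = lane i */
}
```

With AVX-512, the _mask intrinsics listed above return this bitmask directly as a __mmask8/16.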

 

Floating-Point
Double Single Half
when either (or both) is NaN condition unmet condition met condition unmet condition met  
Exception on QNaN YES NO YES NO YES NO YES NO  
compare for == VCMPEQ_OSPD* (V1
_mm_cmp_pd
CMPEQPD* (S2
_mm_cmpeq_pd
VCMPEQ_USPD* (V1
_mm_cmp_pd
VCMPEQ_UQPD* (V1
_mm_cmp_pd
VCMPEQ_OSPS* (V1
_mm_cmp_ps
CMPEQPS* (S1
_mm_cmpeq_ps
VCMPEQ_USPS* (V1
_mm_cmp_ps
VCMPEQ_UQPS* (V1
_mm_cmp_ps
 
compare for < CMPLTPD* (S2
_mm_cmplt_pd
VCMPLT_OQPD* (V1
_mm_cmp_pd
    CMPLTPS* (S1
_mm_cmplt_ps
VCMPLT_OQPS* (V1
_mm_cmp_ps
     
compare for <= CMPLEPD* (S2
_mm_cmple_pd
VCMPLE_OQPD* (V1
_mm_cmp_pd
CMPLEPS* (S1
_mm_cmple_ps
VCMPLE_OQPS* (V1
_mm_cmp_ps
 
compare for > VCMPGTPD* (V1
_mm_cmpgt_pd (S2
VCMPGT_OQPD* (V1
_mm_cmp_pd
    VCMPGTPS* (V1
_mm_cmpgt_ps (S1
VCMPGT_OQPS* (V1
_mm_cmp_ps
     
compare for >= VCMPGEPD* (V1
_mm_cmpge_pd (S2
VCMPGE_OQPD* (V1
_mm_cmp_pd
    VCMPGEPS* (V1
_mm_cmpge_ps (S1
VCMPGE_OQPS* (V1
_mm_cmp_ps
     
compare for != VCMPNEQ_OSPD* (V1
_mm_cmp_pd
VCMPNEQ_OQPD* (V1
_mm_cmp_pd
VCMPNEQ_USPD* (V1
_mm_cmp_pd
CMPNEQPD* (S2
_mm_cmpneq_pd
VCMPNEQ_OSPS* (V1
_mm_cmp_ps
VCMPNEQ_OQPS* (V1
_mm_cmp_ps
VCMPNEQ_USPS* (V1
_mm_cmp_ps
CMPNEQPS* (S1
_mm_cmpneq_ps
 
compare for ! < CMPNLTPD* (S2
_mm_cmpnlt_pd
VCMPNLT_UQPD* (V1
_mm_cmp_pd
CMPNLTPS* (S1
_mm_cmpnlt_ps
VCMPNLT_UQPS* (V1
_mm_cmp_ps
 
compare for ! <=     CMPNLEPD* (S2
_mm_cmpnle_pd
VCMPNLE_UQPD* (V1
_mm_cmp_pd
    CMPNLEPS* (S1
_mm_cmpnle_ps
VCMPNLE_UQPS* (V1
_mm_cmp_ps
 
compare for ! > VCMPNGTPD* (V1
_mm_cmpngt_pd (S2
VCMPNGT_UQPD* (V1
_mm_cmp_pd
VCMPNGTPS* (V1
_mm_cmpngt_ps (S1
VCMPNGT_UQPS* (V1
_mm_cmp_ps
 
compare for ! >=     VCMPNGEPD* (V1
_mm_cmpnge_pd (S2
VCMPNGE_UQPD* (V1
_mm_cmp_pd
    VCMPNGEPS* (V1
_mm_cmpnge_ps (S1
VCMPNGE_UQPS* (V1
_mm_cmp_ps
 
compare for ordered VCMPORD_SPD* (V1
_mm_cmp_pd
CMPORDPD* (S2
_mm_cmpord_pd
VCMPORD_SPS* (V1
_mm_cmp_ps
CMPORDPS* (S1
_mm_cmpord_ps
 
compare for unordered     VCMPUNORD_SPD* (V1
_mm_cmp_pd
CMPUNORDPD* (S2
_mm_cmpunord_pd
    VCMPUNORD_SPS* (V1
_mm_cmp_ps
CMPUNORDPS* (S1
_mm_cmpunord_ps
 
TRUE VCMPTRUE_USPD* (V1
_mm_cmp_pd
VCMPTRUEPD* (V1
_mm_cmp_pd
VCMPTRUE_USPS* (V1
_mm_cmp_ps
VCMPTRUEPS* (V1
_mm_cmp_ps
 
FALSE VCMPFALSE_OSPD* (V1
_mm_cmp_pd
VCMPFALSEPD* (V1
_mm_cmp_pd
    VCMPFALSE_OSPS* (V1
_mm_cmp_ps
VCMPFALSEPS* (V1
_mm_cmp_ps
     

 

  Floating-Point
Double Single Half
compare scalar values
to set flag register
COMISD (S2
_mm_comieq_sd
_mm_comilt_sd
_mm_comile_sd
_mm_comigt_sd
_mm_comige_sd
_mm_comineq_sd

UCOMISD (S2
_mm_ucomieq_sd
_mm_ucomilt_sd
_mm_ucomile_sd
_mm_ucomigt_sd
_mm_ucomige_sd
_mm_ucomineq_sd
COMISS (S1
_mm_comieq_ss
_mm_comilt_ss
_mm_comile_ss
_mm_comigt_ss
_mm_comige_ss
_mm_comineq_ss

UCOMISS (S1
_mm_ucomieq_ss
_mm_ucomilt_ss
_mm_ucomile_ss
_mm_ucomigt_ss
_mm_ucomige_ss
_mm_ucomineq_ss
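
Unlike the packed compares, COMISD/UCOMISD compare only the lowest elements and set the CPU flags; the intrinsics hand the flag result back as 0 or 1. A sketch (helper name invented; assumes SSE2):

```c
#include <emmintrin.h>  /* SSE2 */

/* COMISD: compare the lowest doubles; _mm_comilt_sd returns
   1 if a < b, else 0 (0 also when either operand is NaN). */
static int dbl_less(double a, double b)
{
    return _mm_comilt_sd(_mm_set_sd(a), _mm_set_sd(b));
}
```

The UCOMISD variants behave the same except they raise an exception only on signaling NaN, not QNaN.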
 

 

Bitwise Logical Operations

  Integer Floating-Point
QWORD DWORD WORD BYTE Double Single Half
and PAND (S2
_mm_and_si128
ANDPD (S2
_mm_and_pd
ANDPS (S1
_mm_and_ps
 
VPANDQ (V5...
_mm512_and_epi64
etc
VPANDD (V5...
_mm512_and_epi32
etc
and not PANDN (S2
_mm_andnot_si128
ANDNPD (S2
_mm_andnot_pd
ANDNPS (S1
_mm_andnot_ps
 
VPANDNQ (V5...
_mm512_andnot_epi64
etc
VPANDND (V5...
_mm512_andnot_epi32
etc
or POR (S2
_mm_or_si128
ORPD (S2
_mm_or_pd
ORPS (S1
_mm_or_ps
 
VPORQ (V5...
_mm512_or_epi64
etc
VPORD (V5...
_mm512_or_epi32
etc
xor PXOR (S2
_mm_xor_si128
XORPD (S2
_mm_xor_pd
XORPS (S1
_mm_xor_ps
VPXORQ (V5...
_mm512_xor_epi64
etc
VPXORD (V5...
_mm512_xor_epi32
etc
test PTEST (S4.1
_mm_testz_si128
_mm_testc_si128
_mm_testnzc_si128
VTESTPD (V1
_mm_testz_pd
_mm_testc_pd
_mm_testnzc_pd
VTESTPS (V1
_mm_testz_ps
_mm_testc_ps
_mm_testnzc_ps
 
VPTESTMQ (V5...
_mm_test_epi64_mask
VPTESTNMQ (V5...
_mm_testn_epi64_mask
VPTESTMD (V5...
_mm_test_epi32_mask
VPTESTNMD (V5...
_mm_testn_epi32_mask
VPTESTMW (V5+BW...
_mm_test_epi16_mask
VPTESTNMW (V5+BW...
_mm_testn_epi16_mask
VPTESTMB (V5+BW...
_mm_test_epi8_mask
VPTESTNMB (V5+BW...
_mm_testn_epi8_mask
ternary operation VPTERNLOGQ (V5...
_mm_ternarylogic_epi64
VPTERNLOGD (V5...
_mm_ternarylogic_epi32

 

Bit Shift / Rotate

  Integer
QWORD DWORD WORD BYTE
shift left logical PSLLQ (S2
_mm_slli_epi64
_mm_sll_epi64
PSLLD (S2
_mm_slli_epi32
_mm_sll_epi32
PSLLW (S2
_mm_slli_epi16
_mm_sll_epi16
 
VPSLLVQ (V2
_mm_sllv_epi64
VPSLLVD (V2
_mm_sllv_epi32
VPSLLVW (V5+BW...
_mm_sllv_epi16
 
shift right logical PSRLQ (S2
_mm_srli_epi64
_mm_srl_epi64
PSRLD (S2
_mm_srli_epi32
_mm_srl_epi32
PSRLW (S2
_mm_srli_epi16
_mm_srl_epi16
 
VPSRLVQ (V2
_mm_srlv_epi64
VPSRLVD (V2
_mm_srlv_epi32
VPSRLVW (V5+BW...
_mm_srlv_epi16
 
shift right arithmetic VPSRAQ (V5...
_mm_srai_epi64
_mm_sra_epi64
PSRAD (S2
_mm_srai_epi32
_mm_sra_epi32
PSRAW (S2
_mm_srai_epi16
_mm_sra_epi16
 
VPSRAVQ (V5...
_mm_srav_epi64
VPSRAVD (V2
_mm_srav_epi32
VPSRAVW (V5+BW...
_mm_srav_epi16
 
rotate left VPROLQ (V5...
_mm_rol_epi64
VPROLD (V5...
_mm_rol_epi32
VPROLVQ (V5...
_mm_rolv_epi64
VPROLVD (V5...
_mm_rolv_epi32
rotate right VPRORQ (V5...
_mm_ror_epi64
VPRORD (V5...
_mm_ror_epi32
VPRORVQ (V5...
_mm_rorv_epi64
VPRORVD (V5...
_mm_rorv_epi32
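
The logical/arithmetic distinction in the rows above only matters for right shifts of negative values: PSRLD shifts in zeros, PSRAD replicates the sign bit. A sketch (helper names invented; assumes SSE2):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* PSRLD: logical right shift, zero-fills from the left. */
static int32_t srl32(int32_t x, int n)
{
    return _mm_cvtsi128_si32(_mm_srli_epi32(_mm_set1_epi32(x), n));
}

/* PSRAD: arithmetic right shift, copies the sign bit. */
static int32_t sra32(int32_t x, int n)
{
    return _mm_cvtsi128_si32(_mm_srai_epi32(_mm_set1_epi32(x), n));
}
```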

 

Byte Shift

128-bit
shift left logical PSLLDQ (S2
_mm_slli_si128
shift right logical PSRLDQ (S2
_mm_srli_si128
packed align right PALIGNR (SS3
_mm_alignr_epi8

 

Compare String

explicit length implicit length
return index PCMPESTRI (S4.2
_mm_cmpestri
_mm_cmpestra
_mm_cmpestrc
_mm_cmpestro
_mm_cmpestrs
_mm_cmpestrz
PCMPISTRI (S4.2
_mm_cmpistri
_mm_cmpistra
_mm_cmpistrc
_mm_cmpistro
_mm_cmpistrs
_mm_cmpistrz
return mask PCMPESTRM (S4.2
_mm_cmpestrm
_mm_cmpestra
_mm_cmpestrc
_mm_cmpestro
_mm_cmpestrs
_mm_cmpestrz
PCMPISTRM (S4.2
_mm_cmpistrm
_mm_cmpistra
_mm_cmpistrc
_mm_cmpistro
_mm_cmpistrs
_mm_cmpistrz

 

Others

LDMXCSR (S1
_mm_setcsr
Load MXCSR register
STMXCSR (S1
_mm_getcsr
Save MXCSR register state

PSADBW (S2
_mm_sad_epu8
Compute sum of absolute differences
MPSADBW (S4.1
_mm_mpsadbw_epu8
Performs eight 4-byte wide Sum of Absolute Differences operations to produce eight word integers.
VDBPSADBW (V5+BW...
_mm_dbsad_epu8
Double Block Packed Sum-Absolute-Differences (SAD) on Unsigned Bytes

PMULHRSW (SS3
_mm_mulhrs_epi16
Packed Multiply High with Round and Scale

PHMINPOSUW (S4.1
_mm_minpos_epu16
Finds the value and location of the minimum unsigned word from one of 8 horizontally packed unsigned words. The resulting value and location (offset within the source) are packed into the low dword of the destination XMM register.

VPCONFLICTQ (V5+CD...
_mm512_conflict_epi64
VPCONFLICTD (V5+CD...
_mm512_conflict_epi32
Detect Conflicts Within a Vector of Packed Dword/Qword Values into Dense Memory/Register

VPLZCNTQ (V5+CD...
_mm_lzcnt_epi64
VPLZCNTD (V5+CD...
_mm_lzcnt_epi32
Count the Number of Leading Zero Bits for Packed Dword, Packed Qword Values

VFIXUPIMMPD* (V5...
_mm512_fixupimm_pd
VFIXUPIMMPS* (V5...
_mm512_fixupimm_ps
Fix Up Special Packed Float64/32 Values
VFPCLASSPD* (V5...
_mm512_fpclass_pd_mask
VFPCLASSPS* (V5...
_mm512_fpclass_ps_mask
Tests Types Of a Packed Float64/32 Values
VRANGEPD* (V5+DQ...
_mm_range_pd
VRANGEPS* (V5+DQ...
_mm_range_ps
Range Restriction Calculation For Packed Pairs of Float64/32 Values
VGETEXPPD* (V5...
_mm512_getexp_pd
VGETEXPPS* (V5...
_mm512_getexp_ps
Convert Exponents of Packed DP/SP FP Values to FP Values
VGETMANTPD* (V5...
_mm512_getmant_pd
VGETMANTPS* (V5...
_mm512_getmant_ps
Extract Float64/32 Vector of Normalized Mantissas from Float64/32 Vector

AESDEC (AESNI
_mm_aesdec_si128
Perform an AES decryption round using a 128-bit state and a round key
AESDECLAST (AESNI
_mm_aesdeclast_si128
Perform the last AES decryption round using a 128-bit state and a round key
AESENC (AESNI
_mm_aesenc_si128
Perform an AES encryption round using a 128-bit state and a round key
AESENCLAST (AESNI
_mm_aesenclast_si128
Perform the last AES encryption round using a 128-bit state and a round key
AESIMC (AESNI
_mm_aesimc_si128
Perform an inverse mix column transformation primitive
AESKEYGENASSIST (AESNI
_mm_aeskeygenassist_si128
Assist the creation of round keys with a key expansion schedule
PCLMULQDQ (PCLMULQDQ
_mm_clmulepi64_si128
Perform carryless multiplication of two 64-bit numbers

SHA1RNDS4 (SHA
_mm_sha1rnds4_epu32
Perform Four Rounds of SHA1 Operation
SHA1NEXTE (SHA
_mm_sha1nexte_epu32
Calculate SHA1 State Variable E after Four Rounds
SHA1MSG1 (SHA
_mm_sha1msg1_epu32
Perform an Intermediate Calculation for the Next Four SHA1 Message Dwords
SHA1MSG2 (SHA
_mm_sha1msg2_epu32
Perform a Final Calculation for the Next Four SHA1 Message Dwords
SHA256RNDS2 (SHA
_mm_sha256rnds2_epu32
Perform Two Rounds of SHA256 Operation
SHA256MSG1 (SHA
_mm_sha256msg1_epu32
Perform an Intermediate Calculation for the Next Four SHA256 Message Dwords
SHA256MSG2 (SHA
_mm_sha256msg2_epu32
Perform a Final Calculation for the Next Four SHA256 Message Dwords

VPBROADCASTMB2Q (V5+CD...
_mm_broadcastmb_epi64
VPBROADCASTMW2D (V5+CD...
_mm_broadcastmw_epi32
Broadcast Mask to Vector Register

VZEROALL (V1
_mm256_zeroall
Zero all YMM registers
VZEROUPPER (V1
_mm256_zeroupper
Zero upper 128 bits of all YMM registers

MOVNTPS (S1
_mm_stream_ps
Non-temporal store of four packed single-precision floating-point values from an XMM register into memory
MASKMOVDQU (S2
_mm_maskmoveu_si128
Non-temporal store of selected bytes from an XMM register into memory
MOVNTPD (S2
_mm_stream_pd
Non-temporal store of two packed double-precision floating-point values from an XMM register into memory
MOVNTDQ (S2
_mm_stream_si128
Non-temporal store of double quadword from an XMM register into memory
LDDQU (S3
_mm_lddqu_si128
Special 128-bit unaligned load designed to avoid cache line splits
MOVNTDQA (S4.1
_mm_stream_load_si128
Provides a non-temporal hint that can cause adjacent 16-byte items within an aligned 64-byte region (a streaming line) to be fetched and held in a small set of temporary buffers ("streaming load buffers"). Subsequent streaming loads to other aligned 16-byte items in the same streaming line may be supplied from the streaming load buffer and can improve throughput.

VGATHERPFxDPS (V5+PF
_mm512_mask_prefetch_i32gather_ps
VGATHERPFxQPS (V5+PF
_mm512_mask_prefetch_i64gather_ps
VGATHERPFxDPD (V5+PF
_mm512_mask_prefetch_i32gather_pd
VGATHERPFxQPD (V5+PF
_mm512_mask_prefetch_i64gather_pd
x=0/1
Sparse Prefetch Packed SP/DP Data Values with Signed Dword, Signed Qword Indices Using T0/T1 Hint
VSCATTERPFxDPS (V5+PF
_mm512_prefetch_i32scatter_ps
VSCATTERPFxQPS (V5+PF
_mm512_prefetch_i64scatter_ps
VSCATTERPFxDPD (V5+PF
_mm512_prefetch_i32scatter_pd
VSCATTERPFxQPD (V5+PF
_mm512_prefetch_i64scatter_pd
x=0/1
Sparse Prefetch Packed SP/DP Data Values with Signed Dword, Signed Qword Indices Using T0/T1 Hint with Intent to Write

 

 

TIPS

TIP 1: Zero Clear

XOR instructions work for both integer and floating-point data.

Example: Zero all of 2 QWORDS / 4 DWORDS / 8 WORDS / 16 BYTES in XMM1

        pxor         xmm1, xmm1

Example: Set 0.0f to 4 floats in XMM1

        xorps        xmm1, xmm1

Example: Set 0.0 to 2 doubles in XMM1

        xorpd        xmm1, xmm1

 

TIP 2: Copy the lowest 1 element to other elements in XMM register

Shuffle instructions do the job.

Example: Copy the lowest float element to other 3 elements in XMM1.

        shufps       xmm1, xmm1, 0

Example: Copy the lowest WORD element to other 7 elements in XMM1

        pshuflw       xmm1, xmm1, 0
        pshufd        xmm1, xmm1, 0

Example: Copy the lower QWORD element to the upper element in XMM1

        pshufd        xmm1, xmm1, 44h     ; 01 00 01 00 B = 44h

Is this better?

        punpcklqdq    xmm1, xmm1
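
The same trick in intrinsics: PSHUFD with control 0 copies element 0 to every lane (helper name invented; assumes SSE2):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Copy the lowest DWORD element to all 4 elements. */
static void splat_lowest_dword(const int32_t in[4], int32_t out[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    v = _mm_shuffle_epi32(v, 0);              /* pshufd v, v, 0 */
    _mm_storeu_si128((__m128i *)out, v);
}
```

For floats, _mm_shuffle_ps(v, v, 0) is the equivalent of the shufps example above.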

 

TIP 3: Integer Sign Extension / Zero Extension

Unpack instructions do the job.

Example: Zero extend 8 WORDS in XMM1 to DWORDS in XMM1 (lower 4) and XMM2 (upper 4).

        movdqa     xmm2, xmm1     ; src data WORD[7] [6] [5] [4] [3] [2] [1] [0]
        pxor       xmm3, xmm3     ; upper 16-bit to attach to each WORD = all 0
        punpcklwd  xmm1, xmm3     ; lower 4 DWORDS:  0 [3] 0 [2] 0 [1] 0 [0] 
        punpckhwd  xmm2, xmm3     ; upper 4 DWORDS:  0 [7] 0 [6] 0 [5] 0 [4]

Example: Sign extend 16 BYTES in XMM1 to WORDS in XMM1 (lower 8) and XMM2 (upper 8).

        pxor       xmm3, xmm3
        movdqa     xmm2, xmm1
        pcmpgtb    xmm3, xmm1     ; upper 8-bit to attach to each BYTE = src >= 0 ? 0 : -1
        punpcklbw  xmm1, xmm3     ; lower 8 WORDS
        punpckhbw  xmm2, xmm3     ; upper 8 WORDS

Example (intrinsics): Sign extend 8 WORDS in __m128i variable words8 to DWORDS in dwords4lo (lower 4) and dwords4hi (upper 4)

    const __m128i izero = _mm_setzero_si128();
    __m128i words8hi = _mm_cmpgt_epi16(izero, words8);
    __m128i dwords4lo = _mm_unpacklo_epi16(words8, words8hi);
    __m128i dwords4hi = _mm_unpackhi_epi16(words8, words8hi);

 

TIP 4: Absolute Values of Integers

If an integer value is positive or zero, it is already the absolute value. Otherwise, complementing all bits and then adding 1 gives the absolute value.

Example: Set absolute values of 8 signed WORDS in XMM1 to XMM1

                                  ; if src is positive or 0; if src is negative
        pxor      xmm2, xmm2      
        pcmpgtw   xmm2, xmm1      ; xmm2 <- 0              ; xmm2 <- -1
        pxor      xmm1, xmm2      ; xor with 0(do nothing) ; xor with -1(complement all bits)
        psubw     xmm1, xmm2      ; subtract 0(do nothing) ; subtract -1(add 1)

Example (intrinsics): Set absolute values of 4 DWORDS in __m128i variable dwords4 to dwords4

    const __m128i izero = _mm_setzero_si128();
    __m128i tmp = _mm_cmpgt_epi32(izero, dwords4);
    dwords4 = _mm_xor_si128(dwords4, tmp);
    dwords4 = _mm_sub_epi32(dwords4, tmp);

 

TIP 5: Absolute Values of Floating-Points

Floating-point values are not stored in two's complement, so just clearing the sign (highest) bit gives the absolute value.

Example: Set absolute values of 4 floats in XMM1 to XMM1

; data
              align   16
signoffmask   dd      4 dup (7fffffffH)       ; mask for clearing the highest bit
        
; code
        andps   xmm1, xmmword ptr signoffmask        

Example (intrinsics): Set absolute values of 4 floats in __m128 variable floats4 to floats4

        const __m128 signmask = _mm_set1_ps(-0.0f); // 0x80000000

        floats4 = _mm_andnot_ps(signmask, floats4);

 

TIP 6: Lacking some integer MUL instructions?

Signed/unsigned makes a difference only for the calculation of the upper part. For the lower part, the same instruction can be used for both signed and unsigned.

unsigned WORD * unsigned WORD -> Upper WORD: PMULHUW, Lower WORD: PMULLW

signed WORD * signed WORD -> Upper WORD: PMULHW, Lower WORD: PMULLW
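
Putting the two halves together in intrinsics: PMULLW and PMULHW give the low and high 16 bits of each product, and an unpack interleaves them into full 32-bit results (helper name invented; assumes SSE2):

```c
#include <emmintrin.h>  /* SSE2 */

/* Full 32-bit products of 8 signed WORD pairs:
   PMULLW gives the low halves, PMULHW the high halves,
   PUNPCKLWD/PUNPCKHWD interleave them into DWORDs. */
static void mul16x16_to_32(__m128i a, __m128i b,
                           __m128i *lo4, __m128i *hi4)
{
    __m128i lo = _mm_mullo_epi16(a, b);
    __m128i hi = _mm_mulhi_epi16(a, b);
    *lo4 = _mm_unpacklo_epi16(lo, hi);  /* products of elements 0..3 */
    *hi4 = _mm_unpackhi_epi16(lo, hi);  /* products of elements 4..7 */
}
```

For unsigned inputs, substitute _mm_mulhi_epu16 (PMULHUW) for the high half; the low half stays the same.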

 

TIP 8: max / min

Getting a mask by comparison and then applying bitwise operations does the job.

Example: Compare each signed DWORD in XMM1 and XMM2 and set smaller one to XMM1

; A=xmm1  B=xmm2                    ; if A>B        ; if A<=B
        movdqa      xmm0, xmm1
        pcmpgtd     xmm1, xmm2      ; xmm1=-1       ; xmm1=0
        pand        xmm2, xmm1      ; xmm2=B        ; xmm2=0
        pandn       xmm1, xmm0      ; xmm1=0        ; xmm1=A
        por         xmm1, xmm2      ; xmm1=B        ; xmm1=A

Example (intrinsics): Compare each signed byte in __m128i variables a, b and set larger one to maxAB

    __m128i mask = _mm_cmpgt_epi8(a, b);
    __m128i selectedA = _mm_and_si128(mask, a);
    __m128i selectedB = _mm_andnot_si128(mask, b);
    __m128i maxAB = _mm_or_si128(selectedA, selectedB);

 

TIP 10: Set all bits

PCMPEQx instruction does.

Example: set -1 to all of the 2 QWORDS / 4 DWORDS / 8 WORDS / 16 BYTES in XMM1.

        pcmpeqb         xmm1, xmm1
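
The intrinsics version of the same idiom: comparing a register with itself for equality sets every bit (helper name invented; assumes SSE2; the register is zeroed first only so the value is well defined in C).

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Set all 128 bits: every lane compares equal to itself,
   so PCMPEQD yields all-ones (-1 in each DWORD). */
static void all_ones(int32_t out[4])
{
    __m128i v = _mm_setzero_si128();
    v = _mm_cmpeq_epi32(v, v);          /* pcmpeqd v, v */
    _mm_storeu_si128((__m128i *)out, v);
}
```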

 


ver 2017101400