cd ~/examples/cuda/
wget http://www.es.ele.tue.nl/~mwijtvliet/5KK73/downloads/cuda.zip -O cuda.zip
unzip ./cuda.zip
cd cuda
make
./matrixmul
void main(){
    define A, B, C
    for i = 0 to M do
        for j = 0 to N do
            /* compute element C(i,j) */
            for k = 0 to K do
                C(i,j) <= C(i,j) + A(i,k) * B(k,j)
            end
        end
    end
}
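For reference, a minimal C sketch of this triple loop is shown below. The flattened row-major layout (an M x K matrix A, a K x N matrix B, an M x N matrix C) is an assumption of this sketch, not something the pseudocode prescribes.

/* CPU reference sketch: row-major A (MxK), B (KxN), C (MxN) assumed */
void matrixMulCPU(const float *A, const float *B, float *C,
                  int M, int N, int K)
{
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;                        /* element C(i,j) */
            for (int k = 0; k < K; k++)
                sum += A[i * K + k] * B[k * N + j];  /* A(i,k) * B(k,j) */
            C[i * N + j] = sum;
        }
    }
}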
/* Code running on the CPU */
void main(){
    define A_cpu, B_cpu, C_cpu in the CPU memory
    define A_gpu, B_gpu, C_gpu in the GPU memory
    memcopy A_cpu to A_gpu
    memcopy B_cpu to B_gpu
    dim3 dimBlock(16, 16)
    dim3 dimGrid(N/dimBlock.x, M/dimBlock.y)
    matrixMul<<<dimGrid, dimBlock>>>(A_gpu, B_gpu, C_gpu, K)
    memcopy C_gpu to C_cpu
}
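A concrete host-side sketch of this pseudocode is given below. It assumes row-major float matrices, M and N that are multiples of 16, and a kernel that also takes N (flat row-major indexing needs the width of C); the wrapper name runMatrixMul is illustrative, and error checking is omitted for brevity.

#include <cuda_runtime.h>

void runMatrixMul(const float *A_cpu, const float *B_cpu, float *C_cpu,
                  int M, int N, int K)
{
    float *A_gpu, *B_gpu, *C_gpu;

    /* define A_gpu, B_gpu, C_gpu in the GPU memory */
    cudaMalloc(&A_gpu, M * K * sizeof(float));
    cudaMalloc(&B_gpu, K * N * sizeof(float));
    cudaMalloc(&C_gpu, M * N * sizeof(float));

    /* memcopy A_cpu to A_gpu and B_cpu to B_gpu */
    cudaMemcpy(A_gpu, A_cpu, M * K * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(B_gpu, B_cpu, K * N * sizeof(float), cudaMemcpyHostToDevice);

    /* one 16x16 thread block computes one 16x16 tile of C */
    dim3 dimBlock(16, 16);
    dim3 dimGrid(N / dimBlock.x, M / dimBlock.y);
    matrixMul<<<dimGrid, dimBlock>>>(A_gpu, B_gpu, C_gpu, N, K);

    /* memcopy C_gpu to C_cpu */
    cudaMemcpy(C_cpu, C_gpu, M * N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(A_gpu);
    cudaFree(B_gpu);
    cudaFree(C_gpu);
}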
/* Code running on the GPU */
__global__ void matrixMul(A_gpu, B_gpu, C_gpu, K){
    accu <= 0
    // Row i of matrix C
    i <= blockIdx.y * blockDim.y + threadIdx.y
    // Column j of matrix C
    j <= blockIdx.x * blockDim.x + threadIdx.x
    for k = 0 to K-1 do
        accu <= accu + A_gpu(i,k) * B_gpu(k,j)
    end
    C_gpu(i,j) <= accu
}
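The same kernel written as compilable CUDA C might look as follows. The extra N parameter, the row-major layout, and the assumption that M, N and K are exact multiples of the 16x16 block size are additions of this sketch, not part of the pseudocode.

__global__ void matrixMul(const float *A_gpu, const float *B_gpu,
                          float *C_gpu, int N, int K)
{
    /* row i and column j of the C element this thread computes */
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;

    float accu = 0.0f;
    for (int k = 0; k < K; k++)
        accu += A_gpu[i * K + k] * B_gpu[k * N + j];  /* A(i,k) * B(k,j) */

    C_gpu[i * N + j] = accu;                          /* store C(i,j) */
}

In this naive version every thread reads an entire row of A and an entire column of B from global memory; the tiled versions below reduce that global traffic by staging tiles in shared memory.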
/* Code running on the GPU */
__global__ void matrixMul(A_gpu, B_gpu, C_gpu, K){
    __shared__ float A_tile(blockDim.y, blockDim.x)
    __shared__ float B_tile(blockDim.x, blockDim.y)
    accu <= 0
    /* Accumulate C tile by tile. */
    for tileIdx = 0 to (K/blockDim.x - 1) do
        /* Load one tile of A and one tile of B into shared mem */
        // Row i of matrix A
        i <= blockIdx.y * blockDim.y + threadIdx.y
        // Column j of matrix A
        j <= tileIdx * blockDim.x + threadIdx.x
        // Load A(i,j) to shared mem
        A_tile(threadIdx.y, threadIdx.x) <= A_gpu(i,j)
        // Load B(j,i) to shared mem; this global memory access is not coalesced
        B_tile(threadIdx.x, threadIdx.y) <= B_gpu(j,i)
        // Synchronize before computation
        __sync()
        /* Accumulate one tile of C from the tiles of A and B in shared mem */
        for k = 0 to blockDim.x - 1 do
            // Accumulate for matrix C
            accu <= accu + A_tile(threadIdx.y,k) * B_tile(k,threadIdx.x)
        end
        // Synchronize
        __sync()
    end
    // Row i of matrix C
    i <= blockIdx.y * blockDim.y + threadIdx.y
    // Column j of matrix C
    j <= blockIdx.x * blockDim.x + threadIdx.x
    // Store the accumulated value to C(i,j)
    C_gpu(i,j) <= accu
}
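Written out as compilable CUDA C (a sketch under the same assumptions as before: 16x16 blocks, row-major storage, dimensions that are multiples of TILE, an extra N parameter, illustrative kernel name), this first tiled version keeps the column-strided B load, so the non-coalesced access is visible in the index arithmetic.

#define TILE 16

__global__ void matrixMulTiledNoCoalescing(const float *A_gpu, const float *B_gpu,
                                           float *C_gpu, int N, int K)
{
    __shared__ float A_tile[TILE][TILE];
    __shared__ float B_tile[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;   /* row i of C */
    int col = blockIdx.x * TILE + threadIdx.x;   /* column j of C */
    float accu = 0.0f;

    for (int tileIdx = 0; tileIdx < K / TILE; tileIdx++) {
        /* A load: consecutive threadIdx.x -> consecutive addresses (coalesced) */
        A_tile[threadIdx.y][threadIdx.x] =
            A_gpu[row * K + tileIdx * TILE + threadIdx.x];
        /* B load: consecutive threadIdx.x -> addresses N floats apart
           (not coalesced); the tile is stored transposed in shared memory */
        B_tile[threadIdx.x][threadIdx.y] =
            B_gpu[(tileIdx * TILE + threadIdx.x) * N
                  + blockIdx.x * TILE + threadIdx.y];
        __syncthreads();

        for (int k = 0; k < TILE; k++)
            accu += A_tile[threadIdx.y][k] * B_tile[k][threadIdx.x];
        __syncthreads();
    }

    C_gpu[row * N + col] = accu;
}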
/* Code running on the GPU */
__global__ void matrixMul(A_gpu, B_gpu, C_gpu, K){
    __shared__ float A_tile(blockDim.y, blockDim.x)
    __shared__ float B_tile(blockDim.x, blockDim.y)
    accu <= 0
    /* Accumulate C tile by tile. */
    for tileIdx = 0 to (K/blockDim.x - 1) do
        /* Load one tile of A and one tile of B into shared mem */
        // Row i of matrix A
        i <= blockIdx.y * blockDim.y + threadIdx.y
        // Column j of matrix A
        j <= tileIdx * blockDim.x + threadIdx.x
        // Load A(i,j) to shared mem
        A_tile(threadIdx.y, threadIdx.x) <= A_gpu(i,j)
        // Load B(i,j) to shared mem; this global memory access is coalesced
        B_tile(threadIdx.x, threadIdx.y) <= B_gpu(i,j)
        // Synchronize before computation
        __sync()
        /* Accumulate one tile of C from the tiles of A and B in shared mem */
        for k = 0 to blockDim.x - 1 do
            // Accumulate for matrix C; this shared memory access causes bank conflicts
            accu <= accu + A_tile(threadIdx.y,k) * B_tile(threadIdx.x,k)
        end
        // Synchronize
        __sync()
    end
    // Row i of matrix C
    i <= blockIdx.y * blockDim.y + threadIdx.y
    // Column j of matrix C
    j <= blockIdx.x * blockDim.x + threadIdx.x
    // Store the accumulated value to C(i,j)
    C_gpu(i,j) <= accu
}
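The second variant as compilable CUDA C, under the same assumptions as the previous sketch: the B load is now coalesced, but because the tile is still stored transposed, the inner-loop read B_tile[threadIdx.x][k] makes the threads of a warp access shared-memory addresses TILE floats apart and run into bank conflicts.

#define TILE 16

__global__ void matrixMulTiledBankConflict(const float *A_gpu, const float *B_gpu,
                                           float *C_gpu, int N, int K)
{
    __shared__ float A_tile[TILE][TILE];
    __shared__ float B_tile[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float accu = 0.0f;

    for (int tileIdx = 0; tileIdx < K / TILE; tileIdx++) {
        A_tile[threadIdx.y][threadIdx.x] =
            A_gpu[row * K + tileIdx * TILE + threadIdx.x];
        /* coalesced: consecutive threadIdx.x -> consecutive B addresses,
           but the tile is still stored transposed */
        B_tile[threadIdx.x][threadIdx.y] =
            B_gpu[(tileIdx * TILE + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; k++)
            /* the B_tile read is strided by TILE floats -> bank conflicts */
            accu += A_tile[threadIdx.y][k] * B_tile[threadIdx.x][k];
        __syncthreads();
    }

    C_gpu[row * N + col] = accu;
}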
/* Code running on the GPU */
__global__ void matrixMul(A_gpu, B_gpu, C_gpu, K){
    __shared__ float A_tile(blockDim.y, blockDim.x)
    __shared__ float B_tile(blockDim.x, blockDim.y)
    accu <= 0
    /* Accumulate C tile by tile. */
    for tileIdx = 0 to (K/blockDim.x - 1) do
        /* Load one tile of A and one tile of B into shared mem */
        // Row i of matrix A
        i <= blockIdx.y * blockDim.y + threadIdx.y
        // Column j of matrix A
        j <= tileIdx * blockDim.x + threadIdx.x
        // Load A(i,j) to shared mem
        A_tile(threadIdx.y, threadIdx.x) <= A_gpu(i,j)
        // Load B(i,j) to shared mem; no shared memory bank conflict
        B_tile(threadIdx.y, threadIdx.x) <= B_gpu(i,j)
        // Synchronize before computation
        __sync()
        /* Accumulate one tile of C from the tiles of A and B in shared mem */
        for k = 0 to blockDim.x - 1 do
            // Accumulate for matrix C; no shared memory bank conflict
            accu <= accu + A_tile(threadIdx.y,k) * B_tile(k,threadIdx.x)
        end
        // Synchronize
        __sync()
    end
    // Row i of matrix C
    i <= blockIdx.y * blockDim.y + threadIdx.y
    // Column j of matrix C
    j <= blockIdx.x * blockDim.x + threadIdx.x
    // Store the accumulated value to C(i,j)
    C_gpu(i,j) <= accu
}
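The final variant in compilable CUDA C, again as a sketch under the same assumptions (16x16 blocks, row-major storage, dimensions that are multiples of TILE, illustrative kernel name): both tiles are stored in their natural orientation, so the global loads stay coalesced and the shared-memory reads in the inner loop are conflict free.

#define TILE 16

__global__ void matrixMulTiled(const float *A_gpu, const float *B_gpu,
                               float *C_gpu, int N, int K)
{
    __shared__ float A_tile[TILE][TILE];
    __shared__ float B_tile[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;   /* row i of C */
    int col = blockIdx.x * TILE + threadIdx.x;   /* column j of C */
    float accu = 0.0f;

    /* accumulate C tile by tile */
    for (int tileIdx = 0; tileIdx < K / TILE; tileIdx++) {
        /* both loads are coalesced: consecutive threadIdx.x ->
           consecutive global addresses */
        A_tile[threadIdx.y][threadIdx.x] =
            A_gpu[row * K + tileIdx * TILE + threadIdx.x];
        B_tile[threadIdx.y][threadIdx.x] =
            B_gpu[(tileIdx * TILE + threadIdx.y) * N + col];
        __syncthreads();

        /* B_tile[k][threadIdx.x] walks one row of shared memory:
           consecutive banks, no conflicts */
        for (int k = 0; k < TILE; k++)
            accu += A_tile[threadIdx.y][k] * B_tile[k][threadIdx.x];
        __syncthreads();
    }

    C_gpu[row * N + col] = accu;   /* store C(i,j) */
}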
/* CUDA code for inner product */
accu <= accu + A_tile(threadIdx.y,k) * B_tile(k, threadIdx.x)
/* Disassembled from cubin binary */
mov.b32 $r0, s[$ofs4+0x0000]
mad.rn.f32 $r9, s[$ofs1+0x002c], $r0, $r9
/* CUDA code for outer product */
/* accu[i] and b are stored in register file */
accu[i] <= accu[i] + A_tile(i) * b
/* Disassembled from cubin binary */
mad.rn.f32 $r9, s[$ofs2+0x0010], $r29, $r9
Each thread stores one element of B0,0 in a register. Each thread also stores one column of C0,0 in its registers.
Iteration 1: outer product between the first column of A0,0 and the first row of B0,0, and update C0,0. The inner accumulation loop can be unrolled by placing #pragma unroll in front of it, which removes the loop overhead.
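A hypothetical sketch of this outer-product formulation is given below. It assumes a one-dimensional thread block of TS threads in which each thread keeps one column of a TM x TS tile of C in registers (accu[]) and one element of B in a register (b); the names matrixMulOuter, TS and TM, the tile sizes, and the extra N parameter are illustrative, not taken from the assignment code, and dimensions are assumed to be multiples of the tile sizes.

#define TS 16   /* width of the C tile: one thread per column */
#define TM 16   /* height of the C tile: elements accumulated per thread */

__global__ void matrixMulOuter(const float *A_gpu, const float *B_gpu,
                               float *C_gpu, int N, int K)
{
    __shared__ float A_tile[TM][TS];
    float accu[TM];                            /* one column of the C tile */
    int col = blockIdx.x * TS + threadIdx.x;   /* column of C for this thread */
    int rowBase = blockIdx.y * TM;             /* first row of the C tile */

    #pragma unroll
    for (int i = 0; i < TM; i++)
        accu[i] = 0.0f;

    for (int t = 0; t < K; t += TS) {
        /* each thread loads one column of the A tile (coalesced across threads) */
        for (int i = 0; i < TM; i++)
            A_tile[i][threadIdx.x] = A_gpu[(rowBase + i) * K + t + threadIdx.x];
        __syncthreads();

        for (int k = 0; k < TS; k++) {
            float b = B_gpu[(t + k) * N + col];   /* one element of B in a register */
            /* rank-1 (outer product) update of the C column held in registers:
               one operand from shared memory, one from a register, matching
               the mad.rn.f32 pattern shown above */
            #pragma unroll
            for (int i = 0; i < TM; i++)
                accu[i] += A_tile[i][k] * b;
        }
        __syncthreads();
    }

    for (int i = 0; i < TM; i++)
        C_gpu[(rowBase + i) * N + col] = accu[i];
}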
/* Code running on the GPU */
__global__ void matrixMul(A_gpu, B_gpu, C_gpu, K){
    __shared__ float A_tile0(blockDim.y, blockDim.x)
    __shared__ float A_tile1(blockDim.y, blockDim.x)
    float *pointer0 = A_tile0
    float *pointer1 = A_tile1
    fetch one tile of matrix A_gpu to pointer0
    __sync()
    /* Accumulate C tile by tile. */
    for tileIdx = 0 to (K/blockDim.x - 1) do
        prefetch one tile of matrix A_gpu to pointer1
        accumulate C using pointer0
        __sync()
        swap pointer0 and pointer1
    end
    store the tile of C to global memory
}
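As compilable CUDA C, the double-buffering idea might look like the sketch below. Instead of swapping raw shared-memory pointers, each thread prefetches its element of the next A tile into a register during the current computation and moves it into the spare shared buffer after the synchronization; the kernel name, the extra N parameter, and the assumption of 16x16 blocks with dimensions that are multiples of TILE are, again, illustrative.

#define TILE 16

__global__ void matrixMulPrefetch(const float *A_gpu, const float *B_gpu,
                                  float *C_gpu, int N, int K)
{
    __shared__ float A_tiles[2][TILE][TILE];   /* two buffers, swapped each iteration */
    __shared__ float B_tile[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    int numTiles = K / TILE;
    float accu = 0.0f;
    int buf = 0;                               /* buffer the compute loop reads from */

    /* fetch the first A tile into buffer 0 */
    A_tiles[0][threadIdx.y][threadIdx.x] = A_gpu[row * K + threadIdx.x];

    for (int tileIdx = 0; tileIdx < numTiles; tileIdx++) {
        /* prefetch one element of the next A tile into a register; its
           global-memory latency is hidden behind the computation below */
        float a_next = 0.0f;
        if (tileIdx + 1 < numTiles)
            a_next = A_gpu[row * K + (tileIdx + 1) * TILE + threadIdx.x];

        /* load the B tile for the current iteration */
        B_tile[threadIdx.y][threadIdx.x] =
            B_gpu[(tileIdx * TILE + threadIdx.y) * N + col];
        __syncthreads();                       /* current A and B tiles are ready */

        for (int k = 0; k < TILE; k++)
            accu += A_tiles[buf][threadIdx.y][k] * B_tile[k][threadIdx.x];
        __syncthreads();                       /* everyone is done with the tiles */

        /* move the prefetched element into the spare buffer and swap */
        A_tiles[1 - buf][threadIdx.y][threadIdx.x] = a_next;
        buf = 1 - buf;
    }

    C_gpu[row * N + col] = accu;
}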