本文最后更新于：2024年12月20日下午

排序算法

用过C语言的sort函数，也用过python的排序函数，这次居然轮到设计硬件电路来排序了，还真是应了那一句老话：我可以有100种方法点亮一个LED灯，也有100种输出“hello world”的方法。

思绪回到大一本科，虽不是软件设计科班出身，但隐约也知道C语言有一种很经典的排序算法叫什么冒泡排序，上网一查不得了，又是比较排序又是非比较排序，加起来林林总总数十种算法，让人眼花缭乱。

查阅了相关资料：冒泡排序与插入排序实现起来最简单，但是速度最慢，时间复杂度居然有O(n^{2)，这对于硬件设计来说是不可接受的，于是我果断选择使用其他算法，寻寻觅觅后，一些文章中提到的全排序引起了我的兴趣，号称“以极限的面积换取极限的速度”，一次性能够进行所有数据的并行排序，居然只需要两到三个时钟周期就能完成所有数据的排序，但其带来的面积开销也是非常恐怖：每个数据都要与其余所有数据进行比较，也就是说n个数据就几乎需要n}2个比较器，这还没有考虑恐怖的加法操作。

于是我想着能不能做一些tradeoff，速度不需要那么极限，最好面积也不要那么极限。对于32个8位数据的比较操作，我将32个数据分为4组，每组8个数据与其余数据进行比较，这样相较于一次使用32个*32个8位数据比较器，已经能节省不小的硬件开销，当然代价就是花费更多的时钟周期，粗略估计一次比较32个数据需要10个时钟周期，这也在我们能够接受的范围之内。

需要注意的是，我这里相当于是起了一个头，如果还觉得面积开销不能接受，完全可以将32个数据分为8组、16组等不同的数字，取决于系统对面积开销敏感还是对性能开销更敏感，是一个不断tradeoff的过程。

具体的设计思路如下：

首先将所有的数据缓存在寄存器堆中，利用case语句定义一个4路选择器，同时定义一个计数器用来在这4路中不断的切换，实现每次执调用一组8个的数据进行比较；
接着就是8*32的8bit比较器，一次将8个数据与所有的32个数据进行比较，计算出得分（这里用于使用了generate中的for循环结构描述，所以比较器是冗余的，因为每个数据都和自己比较了一次，这其实是不必要的）；
然后使用8个32位加法器对得分求和，在等待4个时钟周期后将四组总计32个数据比较完毕，同时也计算出32个数据的得分；
利用得分将输入数据进行排序；
输出最终的排序结果。

全比较排序电路的核心就是比较器+计分模块，这里我需要解释一下，由于排序的数据中可能会涉及到相同的数据，所以在进行数据比较的时候，按照原数据谁在前谁优先的原则：在每个数据与其他数据进行比较的时候，根据两个数据在原来数组中的顺序，比较器的类型也要做出相应的调整。

例如，Am与An进行比较时：

若m>n，则使用>=比较器；

若m<=n，则使用>比较器。

//将8组数据与32组数据进行比较，每组数据获得32个打分结果
generate
    genvar i;
    integer j;
    for (i = 0;i < 8;i = i + 1 ) 
    begin:CMP_DATA
        always @(posedge clk or negedge rst_n) 
        begin
            if(!rst_n)
                cmp_score[i] <= 32'd0;
            else begin
                for(j = 0; j < 32; j = j + 1)begin
                    case ((mux_cnt[2] * 8 + i) <= j)
                    1'b1: cmp_score[i][j] <= (cmp_data[i] > input_temp[j]) ? 1'b1 : 1'b0;
                    1'b0: cmp_score[i][j] <= (cmp_data[i] >=input_temp[j]) ? 1'b1 : 1'b0;
                    endcase
                end
            end
        end
    end
endgenerate

假设输入数据为{1, 5, 2, 6, 6, 3}并设A0=1, A1=5, A2=2, A3=6, A4=6, A5=3，那么在进行比较计分时可以用下面这个表格进行表述：

	A0(1)	A1(5)	A2(2)	A3(6)	A5(3)	总分
A0(1)	0	0	0	0	0	0
A1(5)	1	0	1	0	1	3
A2(2)	1	0	0	0	0	1
A3(6)	1	1	1	0	1	4
A4(6)	1	1	1	1	1	5
A5(3)	1	0	1	0	0	2

比较器类型的调整能够保证出现相同数据时，数据的排序能够按照输入数据的顺序输出，也就是能够保证稳定性。

这一点对数字电路设计来说是必要的，你可以试试不调整比较器类型，也就是统一使用>比较器，最后A3与A4的得分是相同的，这样在最后利用得分进行数据排序时，A3与A4的顺序是一样的，最后都放入了相同的寄存器，那必然会导致有一个寄存器是空的，那么就不能得到正确的排序结果，就像我下面这张图展示的一样：

代码+仿真

module sort_32_u8 (
    input   wire        clk,
    input   wire        rst_n,
    input   wire        vld_in,
    input   wire    [7:0]   din_0,
    input   wire    [7:0]   din_1,
    input   wire    [7:0]   din_2,
    input   wire    [7:0]   din_3,
    input   wire    [7:0]   din_4,
    input   wire    [7:0]   din_5,
    input   wire    [7:0]   din_6,
    input   wire    [7:0]   din_7,
    input   wire    [7:0]   din_8,
    input   wire    [7:0]   din_9,
    input   wire    [7:0]   din_10,
    input   wire    [7:0]   din_11,
    input   wire    [7:0]   din_12,
    input   wire    [7:0]   din_13,
    input   wire    [7:0]   din_14,
    input   wire    [7:0]   din_15,
    input   wire    [7:0]   din_16,
    input   wire    [7:0]   din_17,
    input   wire    [7:0]   din_18,
    input   wire    [7:0]   din_19,
    input   wire    [7:0]   din_20,
    input   wire    [7:0]   din_21,
    input   wire    [7:0]   din_22,
    input   wire    [7:0]   din_23,
    input   wire    [7:0]   din_24,
    input   wire    [7:0]   din_25,
    input   wire    [7:0]   din_26,
    input   wire    [7:0]   din_27,
    input   wire    [7:0]   din_28,
    input   wire    [7:0]   din_29,
    input   wire    [7:0]   din_30,
    input   wire    [7:0]   din_31,

    output  wire            vld_out,
    output  reg     [7:0]   dout_0,
    output  reg     [7:0]   dout_1,
    output  reg     [7:0]   dout_2,
    output  reg     [7:0]   dout_3,
    output  reg     [7:0]   dout_4,
    output  reg     [7:0]   dout_5,
    output  reg     [7:0]   dout_6,
    output  reg     [7:0]   dout_7,
    output  reg     [7:0]   dout_8,
    output  reg     [7:0]   dout_9,
    output  reg     [7:0]   dout_10,
    output  reg     [7:0]   dout_11,
    output  reg     [7:0]   dout_12,
    output  reg     [7:0]   dout_13,
    output  reg     [7:0]   dout_14,
    output  reg     [7:0]   dout_15,
    output  reg     [7:0]   dout_16,
    output  reg     [7:0]   dout_17,
    output  reg     [7:0]   dout_18,
    output  reg     [7:0]   dout_19,
    output  reg     [7:0]   dout_20,
    output  reg     [7:0]   dout_21,
    output  reg     [7:0]   dout_22,
    output  reg     [7:0]   dout_23,
    output  reg     [7:0]   dout_24,
    output  reg     [7:0]   dout_25,
    output  reg     [7:0]   dout_26,
    output  reg     [7:0]   dout_27,
    output  reg     [7:0]   dout_28,
    output  reg     [7:0]   dout_29,
    output  reg     [7:0]   dout_30,
    output  reg     [7:0]   dout_31
);
    
reg [7:0]   input_temp[0:31];       //暂存输入
reg [7:0]   output_temp[0:31];      //暂存输出
reg [4:0]   seq_num[0:31];          //排序后的序号

reg [7:0]   cmp_data[0:7];          //每次进行比较的数据
reg [31:0]  cmp_score[0:7];         //比较后的得分
reg [4:0]   sum_score[0:7];         //得分求和
reg [1:0]   mux_cnt[0:3];           //mux计数器及对应的打拍

reg [1:0]   dmux_cnt;               //dmux计数器，本质上是mux_cnt打了4拍

reg         state_cnt[0:9];         //数据排序需要10个周期，因此vld_out为vld_in打10拍

//初始化mux与dmux计数器
always @(posedge clk or negedge rst_n) begin
    if((!rst_n) || (mux_cnt[0] == 2'd3) || (!vld_in))begin
        mux_cnt[0] <= 2'd0;
    end
    else
        mux_cnt[0] <= mux_cnt[0] + 1'b1;
end

always @(posedge clk or negedge rst_n) begin
    if(!rst_n)begin
        mux_cnt[1] <= 2'd0;
        mux_cnt[2] <= 2'd0;
        mux_cnt[3] <= 2'd0;
        dmux_cnt <= 2'd0;
    end
    else if(vld_in)begin
        mux_cnt[1] <= mux_cnt[0];
        mux_cnt[2] <= mux_cnt[1];
        mux_cnt[3] <= mux_cnt[2];
        dmux_cnt <= mux_cnt[3];
    end
    else begin
        mux_cnt[1] <= 2'd0;
        mux_cnt[2] <= 2'd0;
        mux_cnt[3] <= 2'd0;
        dmux_cnt <= 2'd0;
    end
        
end

integer in;
always @(posedge clk or negedge rst_n) begin
    if(!rst_n)begin
        for(in = 0; in < 32; in = in + 1)begin
            input_temp[in] <= 32'd0;
        end
    end
    else if(vld_in)begin
        input_temp[0] <= din_0;
        input_temp[1] <= din_1;
        input_temp[2] <= din_2;
        input_temp[3] <= din_3;
        input_temp[4] <= din_4;
        input_temp[5] <= din_5;
        input_temp[6] <= din_6;
        input_temp[7] <= din_7;
        input_temp[8] <= din_8;
        input_temp[9] <= din_9;
        input_temp[10] <= din_10;
        input_temp[11] <= din_11;
        input_temp[12] <= din_12;
        input_temp[13] <= din_13;
        input_temp[14] <= din_14;
        input_temp[15] <= din_15;
        input_temp[16] <= din_16;
        input_temp[17] <= din_17;
        input_temp[18] <= din_18;
        input_temp[19] <= din_19;
        input_temp[20] <= din_20;
        input_temp[21] <= din_21;
        input_temp[22] <= din_22;
        input_temp[23] <= din_23;
        input_temp[24] <= din_24;
        input_temp[25] <= din_25;
        input_temp[26] <= din_26;
        input_temp[27] <= din_27;
        input_temp[28] <= din_28;
        input_temp[29] <= din_29;
        input_temp[30] <= din_30;
        input_temp[31] <= din_31;
    end
    else begin
        for(in = 0; in < 32; in = in + 1)begin
            input_temp[in] <= 32'd0;
        end
    end
end

//将数据分成4组，每组8个数据
always @(posedge clk or negedge rst_n) begin
    if(!rst_n)begin
        cmp_data[0] <= 8'd0;
        cmp_data[1] <= 8'd0;
        cmp_data[2] <= 8'd0;
        cmp_data[3] <= 8'd0;
        cmp_data[4] <= 8'd0;
        cmp_data[5] <= 8'd0;
        cmp_data[6] <= 8'd0;
        cmp_data[7] <= 8'd0;
    end
    else if(vld_in)begin
        case (mux_cnt[1])
            2'd0:begin
                cmp_data[0] <= input_temp[0];
                cmp_data[1] <= input_temp[1];
                cmp_data[2] <= input_temp[2];
                cmp_data[3] <= input_temp[3];
                cmp_data[4] <= input_temp[4];
                cmp_data[5] <= input_temp[5];
                cmp_data[6] <= input_temp[6];
                cmp_data[7] <= input_temp[7];   
            end
            2'd1:begin
                cmp_data[0] <= input_temp[8];
                cmp_data[1] <= input_temp[9];
                cmp_data[2] <= input_temp[10];
                cmp_data[3] <= input_temp[11];
                cmp_data[4] <= input_temp[12];
                cmp_data[5] <= input_temp[13];
                cmp_data[6] <= input_temp[14];
                cmp_data[7] <= input_temp[15];   
            end
            2'd2:begin
                cmp_data[0] <= input_temp[16];
                cmp_data[1] <= input_temp[17];
                cmp_data[2] <= input_temp[18];
                cmp_data[3] <= input_temp[19];
                cmp_data[4] <= input_temp[20];
                cmp_data[5] <= input_temp[21];
                cmp_data[6] <= input_temp[22];
                cmp_data[7] <= input_temp[23];   
            end
            2'd3:begin
                cmp_data[0] <= input_temp[24];
                cmp_data[1] <= input_temp[25];
                cmp_data[2] <= input_temp[26];
                cmp_data[3] <= input_temp[27];
                cmp_data[4] <= input_temp[28];
                cmp_data[5] <= input_temp[29];
                cmp_data[6] <= input_temp[30];
                cmp_data[7] <= input_temp[31];   
            end
        endcase
    end
    else begin
        cmp_data[0] <= 8'd0;
        cmp_data[1] <= 8'd0;
        cmp_data[2] <= 8'd0;
        cmp_data[3] <= 8'd0;
        cmp_data[4] <= 8'd0;
        cmp_data[5] <= 8'd0;
        cmp_data[6] <= 8'd0;
        cmp_data[7] <= 8'd0;
    end
end

//将8组数据与32组数据进行比较，每组数据获得32个打分结果
generate
    genvar i;
    integer j;
    for (i = 0;i < 8;i = i + 1 ) 
    begin:CMP_DATA
        always @(posedge clk or negedge rst_n) 
        begin
            if(!rst_n)
                cmp_score[i] <= 32'd0;
            else begin
                for(j = 0; j < 32; j = j + 1)begin
                    case ((mux_cnt[2] * 8 + i) <= j)
                    1'b1: cmp_score[i][j] <= (cmp_data[i] > input_temp[j]) ? 1'b1 : 1'b0;
                    1'b0: cmp_score[i][j] <= (cmp_data[i] >=input_temp[j]) ? 1'b1 : 1'b0;
                    endcase
                end
            end
        end
    end
endgenerate

//将每组32个打分求和
integer sum;
always @(posedge clk or negedge rst_n) begin
    if(!rst_n)begin
        sum_score[0] <= 5'd0;
        sum_score[1] <= 5'd0;
        sum_score[2] <= 5'd0;
        sum_score[3] <= 5'd0;
        sum_score[4] <= 5'd0;
        sum_score[5] <= 5'd0;
        sum_score[6] <= 5'd0;
        sum_score[7] <= 5'd0;
    end
    else if(vld_in)begin
        for(sum = 0; sum < 8; sum = sum + 1)begin
            sum_score[sum] <= cmp_score[sum][0] + cmp_score[sum][1] + cmp_score[sum][2] + cmp_score[sum][3] + cmp_score[sum][4] + cmp_score[sum][5] + cmp_score[sum][6] + cmp_score[sum][7] + 
                        cmp_score[sum][8] + cmp_score[sum][9] + cmp_score[sum][10] + cmp_score[sum][11] + cmp_score[sum][12] + cmp_score[sum][13] + cmp_score[sum][14] + cmp_score[sum][15] + 
                        cmp_score[sum][16] + cmp_score[sum][17] + cmp_score[sum][18] + cmp_score[sum][19] + cmp_score[sum][20] + cmp_score[sum][21] + cmp_score[sum][22] + cmp_score[sum][23] + 
                        cmp_score[sum][24] + cmp_score[sum][25] + cmp_score[sum][26] + cmp_score[sum][27] + cmp_score[sum][28] + cmp_score[sum][29] + cmp_score[sum][30] + cmp_score[sum][31]; 
        end
    end
    else begin
        sum_score[0] <= 5'd0;
        sum_score[1] <= 5'd0;
        sum_score[2] <= 5'd0;
        sum_score[3] <= 5'd0;
        sum_score[4] <= 5'd0;
        sum_score[5] <= 5'd0;
        sum_score[6] <= 5'd0;
        sum_score[7] <= 5'd0;
    end
end


integer seq;
//将四次比较的得分汇总
always @(posedge clk or negedge rst_n) begin
    if(!rst_n)begin
        for (seq = 0; seq < 32; seq = seq + 1) begin
            seq_num[seq] <= 5'd0;
        end
    end
    else if(vld_in)begin
        case (dmux_cnt)
            2'd0:begin
                seq_num[0] <= sum_score[0];
                seq_num[1] <= sum_score[1];
                seq_num[2] <= sum_score[2];
                seq_num[3] <= sum_score[3];
                seq_num[4] <= sum_score[4];
                seq_num[5] <= sum_score[5];
                seq_num[6] <= sum_score[6];
                seq_num[7] <= sum_score[7];
            end
            2'd1:begin
                seq_num[8] <= sum_score[0];
                seq_num[9] <= sum_score[1];
                seq_num[10] <= sum_score[2];
                seq_num[11] <= sum_score[3];
                seq_num[12] <= sum_score[4];
                seq_num[13] <= sum_score[5];
                seq_num[14] <= sum_score[6];
                seq_num[15] <= sum_score[7];
            end
            2'd2:begin
                seq_num[16] <= sum_score[0];
                seq_num[17] <= sum_score[1];
                seq_num[18] <= sum_score[2];
                seq_num[19] <= sum_score[3];
                seq_num[20] <= sum_score[4];
                seq_num[21] <= sum_score[5];
                seq_num[22] <= sum_score[6];
                seq_num[23] <= sum_score[7];
            end
            2'd3:begin
                seq_num[24] <= sum_score[0];
                seq_num[25] <= sum_score[1];
                seq_num[26] <= sum_score[2];
                seq_num[27] <= sum_score[3];
                seq_num[28] <= sum_score[4];
                seq_num[29] <= sum_score[5];
                seq_num[30] <= sum_score[6];
                seq_num[31] <= sum_score[7];
            end
        endcase
    end
    else begin
        for (seq = 0; seq < 32; seq = seq + 1) begin
            seq_num[seq] <= 5'd0;
        end
    end
end


//按照得分进行数据排序
integer out;
always @(posedge clk or negedge rst_n) begin
    if(!rst_n)begin
        for (out = 0; out < 32; out = out + 1) begin
            output_temp[out] <= 32'd0;
        end
    end
    else if(vld_in)begin
        for (out = 0; out < 32; out = out + 1) begin
            output_temp[seq_num[out]] <= input_temp[out];
        end
    end
    else begin
        for (out = 0; out < 32; out = out + 1) begin
            output_temp[out] <= 32'd0;
        end
    end
end

//输出
always @(posedge clk or negedge rst_n) begin
    if(!rst_n)begin
        dout_0 <= 8'd0;
        dout_1 <= 8'd0;
        dout_2 <= 8'd0;
        dout_3 <= 8'd0;
        dout_4 <= 8'd0;
        dout_5 <= 8'd0;
        dout_6 <= 8'd0;
        dout_7 <= 8'd0;
        dout_8 <= 8'd0;
        dout_9 <= 8'd0;
        dout_10 <= 8'd0;
        dout_11 <= 8'd0;
        dout_12 <= 8'd0;
        dout_13 <= 8'd0;
        dout_14 <= 8'd0;
        dout_15 <= 8'd0;
        dout_16 <= 8'd0;
        dout_17 <= 8'd0;
        dout_18 <= 8'd0;
        dout_19 <= 8'd0;
        dout_20 <= 8'd0;
        dout_21 <= 8'd0;
        dout_22 <= 8'd0;
        dout_23 <= 8'd0;
        dout_24 <= 8'd0;
        dout_25 <= 8'd0;
        dout_26 <= 8'd0;
        dout_27 <= 8'd0;
        dout_28 <= 8'd0;
        dout_29 <= 8'd0;
        dout_30 <= 8'd0;
        dout_31 <= 8'd0;
    end
    else if(vld_in)begin
        dout_0 <= output_temp[0];
        dout_1 <= output_temp[1];
        dout_2 <= output_temp[2];
        dout_3 <= output_temp[3];
        dout_4 <= output_temp[4];
        dout_5 <= output_temp[5];
        dout_6 <= output_temp[6];
        dout_7 <= output_temp[7];
        dout_8 <= output_temp[8];
        dout_9 <= output_temp[9];
        dout_10 <= output_temp[10];
        dout_11 <= output_temp[11];
        dout_12 <= output_temp[12];
        dout_13 <= output_temp[13];
        dout_14 <= output_temp[14];
        dout_15 <= output_temp[15];
        dout_16 <= output_temp[16];
        dout_17 <= output_temp[17];
        dout_18 <= output_temp[18];
        dout_19 <= output_temp[19];
        dout_20 <= output_temp[20];
        dout_21 <= output_temp[21];
        dout_22 <= output_temp[22];
        dout_23 <= output_temp[23];
        dout_24 <= output_temp[24];
        dout_25 <= output_temp[25];
        dout_26 <= output_temp[26];
        dout_27 <= output_temp[27];
        dout_28 <= output_temp[28];
        dout_29 <= output_temp[29];
        dout_30 <= output_temp[30];
        dout_31 <= output_temp[31];
    end
    else begin
        dout_0 <= 8'd0;
        dout_1 <= 8'd0;
        dout_2 <= 8'd0;
        dout_3 <= 8'd0;
        dout_4 <= 8'd0;
        dout_5 <= 8'd0;
        dout_6 <= 8'd0;
        dout_7 <= 8'd0;
        dout_8 <= 8'd0;
        dout_9 <= 8'd0;
        dout_10 <= 8'd0;
        dout_11 <= 8'd0;
        dout_12 <= 8'd0;
        dout_13 <= 8'd0;
        dout_14 <= 8'd0;
        dout_15 <= 8'd0;
        dout_16 <= 8'd0;
        dout_17 <= 8'd0;
        dout_18 <= 8'd0;
        dout_19 <= 8'd0;
        dout_20 <= 8'd0;
        dout_21 <= 8'd0;
        dout_22 <= 8'd0;
        dout_23 <= 8'd0;
        dout_24 <= 8'd0;
        dout_25 <= 8'd0;
        dout_26 <= 8'd0;
        dout_27 <= 8'd0;
        dout_28 <= 8'd0;
        dout_29 <= 8'd0;
        dout_30 <= 8'd0;
        dout_31 <= 8'd0;
    end
end

//计算vld_out
integer state;
always @(posedge clk or negedge rst_n) begin
    if(!rst_n)begin
        for(state = 0; state < 10; state = state + 1)begin
            state_cnt[state] <= 1'd0;   
        end
    end
    else if(vld_in)begin
        state_cnt[0] <= vld_in;
        state_cnt[1] <= state_cnt[0];
        state_cnt[2] <= state_cnt[1];
        state_cnt[3] <= state_cnt[2];
        state_cnt[4] <= state_cnt[3];
        state_cnt[5] <= state_cnt[4];
        state_cnt[6] <= state_cnt[5];
        state_cnt[7] <= state_cnt[6];
        state_cnt[8] <= state_cnt[7];
        state_cnt[9] <= state_cnt[8];
    end
    else begin
        for(state = 0; state < 10; state = state + 1)begin
            state_cnt[state] <= 1'd0;   
        end
    end
end

assign vld_out = state_cnt[9];

endmodule

最后只用了一组数据进行仿真，感兴趣的可以多试试几组数据。

芯片IO资源不出意料的爆表了，想想也是，32个8bit输入+32个8bit输入累计512个端口，zynq确实承受不起。

不过在这里只是为了测试功能，所以不用管，如果实在是看着不舒服，可以额外写一个顶层模块将这个模块封装起来，这样能省掉32个8bit输出的端口，然后为了防止信号被优化，可以把32个output与一下。

后话

在写代码的时候，我突然意识到一个问题，搞了这么久硬件设计，也写了一些硬件电路了，在写verilog时，好像都是想当然的认为所有的操作都可以在一个时钟周期内搞定。比如你可能在这个时钟为1时执行了一个乘法操作，然后你在时钟为2时就需要用到这个乘法的结果。这好像与我们在体系结构中学到的知识相悖，老师一般告诉我们乘法往往需要十几甚至几十个时钟才能完成。甚至在学数集的时候，一个加法电路，就算是使用了CLA架构，也不可能用一个时钟周期就能完成加法。那我们到底写了个啥东西出来？modelsim到底仿了个啥玩意出来？我们在写verilog的时候，如果真的用到了加法和乘法，肯定都是理所当然的认为一个周期就算出来了，应该没有人会去考虑到底是耗费了多少个时钟，然后在此基础上考虑接下来的电路设计吧？

我突然感觉好像道心破碎了，写了这么久的硬件电路，怎么感觉连数字电路是啥都不知道。

于是我又去查阅相关资料，然后我就得出了一个结论：归根结底来说，我不是一个合格的IC工程师，而是一个FPGA使用者。

迄今为止，我几乎所有的数字电路设计都是在半导体厂商提供的FPGA上完成的，没有参与过真正意义上的IC设计，所以对具体的时序、延迟以及面积开销的认知还比较模糊：当模块中涉及到加法或者乘法时，有认真考虑过具体的硬件实现吗？有考虑过路径延迟、建立保持时间、时序违例吗？有考虑过面积与功耗开销吗？

答案是没有，我只是考虑算法的实现，具体实现起来的硬件电路如何与我无关。

正如老石在这篇文章中提到的，一个简单的assign c=a*b，综合工具会根据a和b的位宽自动决定使用哪种乘法器的实现形式，比如是使用基于FPGA片上DSP里的乘法单元还是使用通过组合逻辑实现的乘法器IP，二者都是FPGA厂商提供的经过高度优化的硬件模块，速度很快，相比逻辑延时和布线延时等可以忽略不计，且调用时不会占用额外的可编程逻辑资源。实现一个9位乘法与17位乘法的关键路径延迟分别是4.85ns与6.09ns，分别相当于206MHz与164MHz，而我们一般跑个仿真都是在50MHz的时钟下进行测试，自然也就看不出什么端倪。

在我看来，IC设计更类似于造ASIC，所有的模块——哪怕一个乘法器一个加法器，都需需要定制化：做处理器的设计ALU模块、做FPGA的设计DSP模块等等，然后封装起来，提供给上层开发者使用，好处是使用起来很方便，但也屏蔽了底层的细节，只能查阅相关的芯片手册。对于我这种懒人来说，尤其是没有接触过正规的项目，时间一长，就把一个时钟处理一个操作当成理所应当。

各种排序算法总结(全面)-CSDN博客

FPGA篇（三）基于FPGA的几种排序算法_fpga 排序-CSDN博客

数据结构合集 - 基数排序(算法过程, 效率分析, 稳定性分析)_哔哩哔哩_bilibili

256种排序算法，全网最全的排序算法_哔哩哔哩_bilibili

1.0 十大经典排序算法 | 菜鸟教程

读论文之《基于 FPGA 的并行全比较排序算法》-云社区-华为云

在Verilog中直接调用*实现乘法器，其延迟和占用资源如何？

请问如何verilog实现100个变量的加法 - 数字IC设计讨论(IC前端|FPGA|ASIC) - EETOP 创芯网论坛 (原名：电子顶级开发网) -

学习笔记 > 数字IC

#硬件算法 #sort

纳尼，硬件排序电路

http://example.com/2024/12/20/纳尼，硬件排序电路/