The Cascaded Integrator Comb (CIC) decimation filter is a pivotal technology extensively employed in digital signal processing (DSP). This paper delves into a comprehensive examination of the CIC algorithm within software-defined radio (SDR) systems from the perspective of parallel computing and introduces a novel Non-Recursive Implementation (NR-I) on an NVIDIA GPU using CUDA. The NR-I approach significantly reduces computational load by unfolding the recursive CIC structure with pre-derived Unfold Factors. Further optimization was achieved through data-transfer enhancements using PM Implementation (PM-I) and ODT Implementation (ODT-I). Experimental results demonstrate that NR-I achieves a speedup of over 449.48. Additionally, the data-transfer optimizations resulted in substantial performance improvements, with PM-I and ODT-I reducing execution time by 43.24% and 64.22%, respectively. The GPU implementation’s speedup is significantly greater than that of OpenMP, ranging from 3.34 to 10.22 times. These results underscore the effectiveness of the proposed Non-Recursive Implementation in accelerating time-intensive and data-intensive computations.