Generative adversarial network (GAN) technology has enabled the automatic synthesis of realistic face images from text. This paper proposes a model for generating face images from Chinese text by integrating a text mapping module with the StyleGAN generator. The text mapping module uses the CLIP model to pre-train on Chinese text, employs a convolutional-deconvolutional structure to enhance feature extraction, and incorporates a BiLSTM to assemble complete sentence representations as input to the StyleGAN generator. The generator interprets these semantic features to synthesize face images. Validation on the Face2Text and COCO datasets yields F1 scores of 83.43% and 84.97%, respectively, along with the lowest FID and FSD scores of 103.25 and 1.26. The combination of CLIP pre-training and word-level semantic embedding improves image quality, offering a new approach for face recognition applications in public safety.
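To make the described pipeline concrete, the following is a minimal PyTorch sketch of a text mapping module of this kind: per-token text embeddings (as would come from a CLIP text encoder) pass through a conv-deconv refinement block and a BiLSTM, and the resulting sentence vector is projected into a StyleGAN-style latent space. All layer sizes, the embedding interface, and the latent-space projection are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch of the text mapping module; shapes and layer sizes are assumptions.
import torch
import torch.nn as nn

class TextMappingModule(nn.Module):
    def __init__(self, clip_dim=512, hidden_dim=256, w_dim=512):
        super().__init__()
        # 1-D conv-deconv pair over the token axis: a stand-in for the paper's
        # convolutional-deconvolutional feature-extraction structure.
        self.conv = nn.Conv1d(clip_dim, hidden_dim, kernel_size=3, padding=1)
        self.deconv = nn.ConvTranspose1d(hidden_dim, clip_dim, kernel_size=3, padding=1)
        # BiLSTM aggregates refined token features into a sentence-level representation.
        self.bilstm = nn.LSTM(clip_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Linear head maps the sentence vector into a StyleGAN-style latent space (assumed W space).
        self.to_w = nn.Linear(2 * hidden_dim, w_dim)

    def forward(self, token_embeddings):
        # token_embeddings: (batch, seq_len, clip_dim), e.g. from a Chinese CLIP text encoder
        x = token_embeddings.transpose(1, 2)            # (batch, clip_dim, seq_len)
        x = self.deconv(torch.relu(self.conv(x)))       # conv-deconv refinement, length-preserving
        x = x.transpose(1, 2)                           # back to (batch, seq_len, clip_dim)
        _, (h_n, _) = self.bilstm(x)                    # final hidden states of both directions
        sentence = torch.cat([h_n[0], h_n[1]], dim=-1)  # (batch, 2 * hidden_dim)
        return self.to_w(sentence)                      # latent code to feed the StyleGAN generator

# Shape check with dummy per-token embeddings (the CLIP encoder and
# StyleGAN generator are assumed to be external components):
tokens = torch.randn(2, 16, 512)
w = TextMappingModule()(tokens)  # (2, 512) latent codes
```

The conv-deconv pair here preserves sequence length (stride 1, padding 1), so it acts purely as a token-level feature refiner before the BiLSTM builds the sentence representation; the paper's actual module may differ in depth and dimensionality.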