Forum Discussion

Raodmehr
Copper Contributor
Nov 29, 2024

Generating Video with Voiceover from a Video Script

Can anybody provide a step-by-step guide for a beginner to build an Azure-based app like Visla (https://app.visla.us/) that converts a video text script into high-quality videos with an Azure voiceover?

2 Replies

  • Mks_1973
    Iron Contributor

    Creating an application similar to Visla, which converts text scripts into high-quality videos with Azure voiceovers, involves several steps. This guide will walk you through the process using Azure's AI services and Python programming.

    1. Set Up Your Azure Environment
    Create an Azure Account: If you don't have one, sign up at the Azure portal.

    Provision Necessary Services:

    Azure OpenAI Service: provides access to language models for text summarization and DALL·E image generation.
    Azure Cognitive Services - Speech Service: enables text-to-speech conversion.
    Azure Cognitive Services - Language (Text Analytics): used for key-phrase extraction in step 4.
    Please refer to Azure's documentation for detailed steps on creating these resources.
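    If you prefer the command line to the portal, the same resources can be provisioned with the Azure CLI. This is a sketch only: the resource group, resource names, and region below are placeholders of my own choosing, and Azure OpenAI access must be approved for your subscription before the last command succeeds.

    ```shell
    # Requires the Azure CLI ("az") and an authenticated session (az login)
    az group create --name video-gen-rg --location eastus

    # Speech resource for text-to-speech (step 6)
    az cognitiveservices account create \
        --name my-speech-resource --resource-group video-gen-rg \
        --kind SpeechServices --sku S0 --location eastus

    # Language resource for key-phrase extraction (step 4)
    az cognitiveservices account create \
        --name my-language-resource --resource-group video-gen-rg \
        --kind TextAnalytics --sku S0 --location eastus

    # Azure OpenAI resource for summarization and image generation (steps 3 and 5)
    az cognitiveservices account create \
        --name my-openai-resource --resource-group video-gen-rg \
        --kind OpenAI --sku S0 --location eastus
    ```

    Afterwards, `az cognitiveservices account keys list` retrieves the keys you will paste into the code below.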

    2. Prepare Your Development Environment

    Install Python: Ensure Python is installed on your system.

    Set Up a Virtual Environment:
    python -m venv azure_video_env
    source azure_video_env/bin/activate  # On Windows: azure_video_env\Scripts\activate


    Install Required Libraries:
    # Version pins match the API style used below (legacy openai client, moviepy.editor)
    pip install "openai<1.0" azure-cognitiveservices-speech azure-ai-textanalytics "moviepy<2.0" requests


    3. Summarize the Text Script
    Utilize Azure OpenAI to generate a concise summary of your script:

    import openai

    openai.api_type = "azure"
    openai.api_base = "https://<Your_Resource_Name>.openai.azure.com/"
    openai.api_version = "2022-12-01"
    openai.api_key = "<Your_API_Key>"

    def summarize_text(content, num_sentences=5):
        prompt = f'Provide a summary of the text below in {num_sentences} sentences:\n{content}'
        response = openai.Completion.create(
            engine="<Your_Deployment_Name>",  # the deployment name you created in your Azure OpenAI resource
            prompt=prompt,
            temperature=0.3,
            max_tokens=250,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0
        )
        return response.choices[0].text.strip()

    # Example usage
    script = "Your full text script here."
    summary = summarize_text(script)
    print(summary)


    4. Extract Key Phrases
    Use Azure Cognitive Services to identify key phrases:

    from azure.ai.textanalytics import TextAnalyticsClient
    from azure.core.credentials import AzureKeyCredential

    def extract_key_phrases(text):
        credential = AzureKeyCredential("<Your_Cognitive_Service_Key>")
        endpoint = "https://<Your_Cognitive_Service>.cognitiveservices.azure.com/"
        client = TextAnalyticsClient(endpoint=endpoint, credential=credential)
        response = client.extract_key_phrases(documents=[text])
        result = response[0]
        if result.is_error:
            raise RuntimeError(f"Key phrase extraction failed: {result.error.message}")
        return result.key_phrases

    # Example usage
    key_phrases = extract_key_phrases(summary)
    print(key_phrases)


    5. Generate Images with DALL·E
    Create prompts from key phrases to generate images using Azure's DALL·E API:

    import openai
    import requests

    # Reuses the Azure OpenAI configuration from step 3; image generation
    # requires a resource/region where DALL·E is available.

    def generate_image(prompt, output_path):
        response = openai.Image.create(
            prompt=prompt,
            n=1,
            size="1024x1024"
        )
        image_url = response['data'][0]['url']
        # Download the generated image and save it locally
        image_data = requests.get(image_url, timeout=60).content
        with open(output_path, "wb") as f:
            f.write(image_data)
        return output_path

    # Example usage
    import os
    os.makedirs("images", exist_ok=True)
    for phrase in key_phrases:
        saved_path = generate_image(phrase, f"images/{phrase}.png")
        print(f"Image for '{phrase}' saved at {saved_path}")


    6. Convert Text to Speech
    Generate audio from the summarized text:

    import azure.cognitiveservices.speech as speechsdk

    def text_to_speech(text, output_path):
        speech_config = speechsdk.SpeechConfig(subscription="<Your_Speech_Key>", region="<Your_Speech_Region>")
        audio_config = speechsdk.audio.AudioOutputConfig(filename=output_path)
        synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
        result = synthesizer.speak_text_async(text).get()
        if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
            print(f"Audio saved to {output_path}")
        else:
            cancellation = result.cancellation_details
            print(f"Error: {cancellation.reason} - {cancellation.error_details}")

    # Example usage
    import os
    os.makedirs("audio", exist_ok=True)
    text_to_speech(summary, "audio/summary.wav")


    7. Compile the Video
    Combine the generated images and audio into a video:

    from moviepy.editor import ImageClip, AudioFileClip, concatenate_videoclips

    def create_video(image_paths, audio_path, output_path):
        clips = []
        audio = AudioFileClip(audio_path)
        duration_per_image = audio.duration / len(image_paths)
        for image_path in image_paths:
            clip = ImageClip(image_path).set_duration(duration_per_image)
            clips.append(clip)
        video = concatenate_videoclips(clips, method="compose")
        video = video.set_audio(audio)
        video.write_videofile(output_path, fps=24)

    # Example usage
    image_files = [f"images/{phrase}.png" for phrase in key_phrases]
    create_video(image_files, "audio/summary.wav", "final_video.mp4")
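
    The steps above can be wired together into a single pipeline. This is a hypothetical glue sketch (run_pipeline and its parameter names are my own, not part of any Azure SDK): the step functions are passed in as callables, so the flow can be dry-run with stubs before plugging in the real Azure-backed functions from steps 3-7.

    ```python
    def run_pipeline(script, summarize, extract_phrases, make_image,
                     synthesize, compile_video):
        """Chain the steps: script -> summary -> phrases -> images + audio -> video."""
        summary = summarize(script)                       # step 3
        phrases = extract_phrases(summary)                # step 4
        image_paths = [make_image(p, f"images/{p}.png")   # step 5
                       for p in phrases]
        audio_path = "audio/summary.wav"
        synthesize(summary, audio_path)                   # step 6
        compile_video(image_paths, audio_path, "final_video.mp4")  # step 7
        return "final_video.mp4"
    ```

    Injecting the functions keeps each Azure dependency swappable, which also makes the overall flow easy to test without credentials.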


    8. Review and Refine
    Ensure the video and audio are synchronized and meet quality standards.
    Modify image durations, transitions, or re-generate assets as needed.
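
    One concrete refinement is to account for crossfade transitions in the per-image duration math. A minimal sketch (plan_clip_timings is a hypothetical helper, not part of moviepy): with n clips each overlapping the next by c seconds, the total running time is n*d - (n-1)*c, so solving for d keeps the slideshow exactly as long as the narration.

    ```python
    def plan_clip_timings(num_images, audio_duration, crossfade=0.5):
        """Return (start, duration) pairs so crossfaded clips exactly span the audio.

        With n clips of duration d overlapping by c seconds, the total running
        time is n*d - (n-1)*c; solving for d spreads the audio evenly.
        """
        if num_images < 1:
            raise ValueError("need at least one image")
        duration = (audio_duration + (num_images - 1) * crossfade) / num_images
        return [(i * (duration - crossfade), duration) for i in range(num_images)]
    ```

    In moviepy 1.x these timings could drive ImageClip(...).set_start(start).set_duration(duration).crossfadein(crossfade) composited in a CompositeVideoClip, instead of the plain concatenation used in step 7.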
