Generating Audio Reviews with DocsGPT

As a developer, I often struggled to keep up with the latest research due to a hectic schedule. To solve this, I leveraged DocsGPT to create audio reviews of scientific papers. This allowed me to listen to the newest research developments on the go. It made staying updated so much easier, and now I want to share how you can do the same by automating this process.

Introduction

DocsGPT is a powerful AI tool for working with your documents. By combining its strengths with TTS technology and following the steps outlined in this guide, you can create detailed audio reviews of scientific papers (or any content of your choice), making it easier to stay informed without dedicating extra time to reading.

Step 1: Upload the Paper to DocsGPT

Start by uploading the paper or document you want to convert into an audio format to DocsGPT. This can be done through the DocsGPT interface or via API.

Step 2: Create New Prompts

In the DocsGPT settings, create a prompt to generate a detailed plan for the audio review. Here is the one I've used for summary generation:


I will create an audio review for this science paper. Please provide a detailed plan for the review, 
separated into clear, standalone parts. Each part should be understandable on its own, even if 
generated independently. Use the format `<part number="1">...</part>`, `<part number="2">...</part>`, 
etc., to separate the parts.

**Example Plan:**

<review>
    <part number="1">
        <title>Introduction to the Paper</title>
        <content>
            This section introduces the paper's topic and significance. It outlines the key 
            questions the paper aims to address and provides a brief background on the subject.
        </content>
    </part>
    <part number="2">
        <title>Summary of Methods</title>
        <content>
            This section explains the methodologies used in the research. It highlights the 
            importance of the chosen methods in the context of the study and how they contribute 
            to the overall research.
        </content>
    </part>
    <part number="3">
        <title>Key Findings</title>
        <content>
            This section summarizes the major results and discoveries of the paper. It discusses 
            the implications of these findings and their impact on the field.
        </content>
    </part>
    <part number="4">
        <title>Discussion</title>
        <content>
            This section analyzes the findings in detail. It compares the results with previous 
            research, discusses the study's limitations, and explores the broader implications 
            of the findings.
        </content>
    </part>
    <part number="5">
        <title>Conclusion</title>
        <content>
            This section provides final thoughts on the paper. It summarizes the main points and 
            suggests future research directions based on the findings of the study.
        </content>
    </part>
</review>

Please use the above structure and format to generate the plan.
----------------
{summaries}

And this one for transcript generation:


I will create an audio review for a science paper. Please write a transcript for the specified 
part of the paper. The language should be clear, concise, and suitable for text-to-speech conversion. 
Avoid using special characters or overly complex terminology. Output only transcript, no need for 
comments that will not be in the review.
----------------
{summaries}

Step 3: Generate and Edit the Initial Summary

Use the first prompt to generate the initial summary plan for the paper. Review and edit the plan as needed to ensure accuracy and clarity. For better results, increase the "Chunks processed per query" setting so that each query draws on larger sections of the text.
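Because the script in Step 6 parses the plan as XML, it may be worth checking the edited plan before saving it. The following is a minimal sketch, not part of DocsGPT: `validate_plan` is a hypothetical helper, and it assumes the plan follows the `<review>`/`<part>` structure from the example prompt.

```python
import xml.etree.ElementTree as ET


def validate_plan(plan_text: str) -> list[tuple[str, str]]:
    """Return (number, title) pairs for each part, or raise ValueError.

    Hypothetical helper: assumes a <review> root with <part number="...">
    children, each holding a <title> and a <content> element.
    """
    try:
        root = ET.fromstring(plan_text.strip())
    except ET.ParseError as exc:
        raise ValueError(f"Plan is not well-formed XML: {exc}") from exc

    if root.tag != "review":
        raise ValueError(f"Expected <review> as the root element, got <{root.tag}>")

    parts = []
    for part in root.findall("part"):
        number = part.get("number")
        title = part.findtext("title", default="").strip()
        content = part.findtext("content", default="").strip()
        if not number or not title or not content:
            raise ValueError(f"Part {number!r} is missing a number, title, or content")
        parts.append((number, title))

    if not parts:
        raise ValueError("Plan contains no <part> elements")
    return parts


if __name__ == "__main__":
    sample = """
    <review>
        <part number="1">
            <title>Introduction to the Paper</title>
            <content>Topic and significance.</content>
        </part>
    </review>
    """
    print(validate_plan(sample))
```

A check like this catches unclosed tags or empty sections early, before any transcript requests are sent.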

Step 4: Create an API Key

Navigate to the API Keys section in the DocsGPT settings and create an API key linked to your document. This key will be used for accessing the DocsGPT API programmatically.
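As a rough illustration of how the key might be used, the sketch below builds a request for the answer endpoint using only the standard library. The environment variable name `DOCSGPT_API_KEY` and the helpers `build_request` and `ask_docsgpt` are assumptions made for this example. The URL and the `question`/`api_key`/`answer` fields are the ones used by the script later in this guide.

```python
import json
import os
import urllib.request

# Cloud endpoint used in this guide; point this at your own instance if you self-host.
DOCSGPT_API_URL = "https://gptcloud.arc53.com/api/answer"


def build_request(question: str, api_key: str, url: str = DOCSGPT_API_URL) -> urllib.request.Request:
    """Build a JSON POST request carrying the question and the API key."""
    body = json.dumps({"question": question, "api_key": api_key}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def ask_docsgpt(question: str, timeout: float = 120.0) -> str:
    """Send the question and return the 'answer' field of the JSON response."""
    # Hypothetical variable name: keeps the key out of the source file.
    api_key = os.environ["DOCSGPT_API_KEY"]
    with urllib.request.urlopen(build_request(question, api_key), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["answer"]
```

Reading the key from the environment is only one option. The point is to avoid committing it alongside the script.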

Choosing a Text-to-Speech Method

You can choose from many different TTS methods to convert the text into audio. For this article I tried three, each with its own strengths and weaknesses. You can listen to and compare the results of each method to determine which one suits your needs best. In this guide, I will go into detail about setting up the OpenAI API, but I will also provide code for all three methods below.

Method	Trade-offs
OpenAI API	Highest quality, but incurs a cost
TTS library from Coqui-AI	Open source and capable of high-quality results, but slower
Google Text-to-Speech (gTTS)	Fast and easy to set up, but with the lowest audio quality

Step 5: Setting Up the Environment

The setup will differ depending on the chosen method. Here are the installation commands for each:

For OpenAI API:

pip install openai

For Coqui-AI/TTS:

pip install TTS

For gTTS:

pip install gTTS
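One detail that is easy to trip over: the name passed to `pip install` is not always the name used in `import`. The sketch below is an illustrative check, not part of any of these libraries, that reports which of the three backends can be imported in the current environment. The helper name `available_backends` is hypothetical.

```python
from importlib.util import find_spec

# pip package name -> module name used in `import` statements
TTS_BACKENDS = {
    "openai": "openai",
    "TTS": "TTS",    # Coqui TTS is imported as `from TTS.api import TTS`
    "gTTS": "gtts",  # gTTS is imported in lowercase: `from gtts import gTTS`
}


def available_backends() -> dict[str, bool]:
    """Map each pip package name to whether its module can be found."""
    return {
        package: find_spec(module) is not None
        for package, module in TTS_BACKENDS.items()
    }


if __name__ == "__main__":
    for package, installed in available_backends().items():
        print(f"{package}: {'installed' if installed else 'missing'}")
```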

Step 6: Generate the Audio Review

Here is the full script. This version uses the Coqui TTS library to generate the audio files:

import xml.etree.ElementTree as ET
from pathlib import Path

import requests
from TTS.api import TTS

# Set your DocsGPT API URL and key
DOCSGPT_API_URL = 'https://gptcloud.arc53.com/api/answer'  # I used the cloud API, but a local instance also works
DOCSGPT_API_KEY = 'YOUR_API'

# Input path to the summary plan .txt
input_path = 'sum.txt'

# Capture the name of the file (without extension) to create folders
file_name = Path(input_path).stem
text_folder = Path(f'{file_name}_text')
audio_folder = Path(f'{file_name}_audio')

# Create the folders
text_folder.mkdir(parents=True, exist_ok=True)
audio_folder.mkdir(parents=True, exist_ok=True)

# Read the content of the input file
with open(input_path, 'r', encoding='utf-8') as file:
    plan_content = file.read()

# Parse the XML content
root = ET.fromstring(plan_content)

# Send a question to the DocsGPT API and return the answer text
def request_docsgpt(question):
    headers = {'Content-Type': 'application/json; charset=utf-8'}
    data = {
        "question": question,
        "api_key": DOCSGPT_API_KEY,
    }
    response = requests.post(DOCSGPT_API_URL, json=data, headers=headers)
    # Fail early on HTTP errors instead of on a missing 'answer' key
    response.raise_for_status()
    return response.json()['answer']

# Initialize the TTS model once, before the loop, so it is not reloaded for every part
model_name = "tts_models/en/ljspeech/tacotron2-DDC"
tts = TTS(model_name)

# Process each part in the plan
for part in root.findall('part'):
    number = part.get('number')
    title = part.find('title').text.strip()
    content = part.find('content').text.strip()

    # Create the question for DocsGPT API
    question = f"{title} {content}"

    # Request DocsGPT API
    answer = request_docsgpt(question)

    # Write the answer to a text file
    base_name = f'{number}.{title.replace(" ", "_")}'
    text_file_path = text_folder / f'{base_name}.txt'
    with open(text_file_path, 'w', encoding='utf-8') as text_file:
        text_file.write(answer)

    print(f"Generated text: {text_file_path}")

    # Generate the audio file for this part
    audio_file_path = audio_folder / f'{base_name}.wav'
    tts.tts_to_file(text=answer, file_path=str(audio_file_path))

    print(f"Generated audio: {audio_file_path}")

print("Processing completed.")

To use the script, save the summary plan generated earlier as a `.txt` file and set `input_path` to its location. The plan must be well-formed XML, since the script parses it to find each part. For an input named `sum.txt`, the script writes one transcript per part to the `sum_text` folder and the matching audio file to the `sum_audio` folder.
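The script builds file names from part titles by replacing spaces with underscores. If a generated title contains characters such as `/` or `:`, the resulting path may be invalid on some systems. A minimal sketch of a stricter approach follows. `safe_stem` is a hypothetical helper, and it deliberately keeps only ASCII letters and digits, which would need adjusting for non-English titles.

```python
import re


def safe_stem(number: str, title: str) -> str:
    """Build a file name stem like '3.Key_Findings' from a part number and title.

    Runs of characters other than ASCII letters and digits collapse into a
    single underscore; leading and trailing underscores are removed.
    """
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_")
    return f"{number}.{cleaned or 'untitled'}"


if __name__ == "__main__":
    print(safe_stem("3", "Key Findings"))        # 3.Key_Findings
    print(safe_stem("4", "Results: A/B Test?"))  # 4.Results_A_B_Test
```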

Here are the links to this script using the OpenAI API and the gTTS library.
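For orientation, the synthesis step might be swapped for the other two backends roughly as follows. This is a sketch under stated assumptions rather than the linked scripts themselves: `synthesize`, `synthesize_gtts`, `synthesize_openai`, and the `BACKENDS` table are names invented for this example, and the OpenAI model `tts-1` and voice `alloy` are example choices. The third-party imports sit inside the functions so that only the backend you actually call needs to be installed.

```python
from pathlib import Path


def synthesize_gtts(text: str, out_path: Path) -> None:
    """Write speech for `text` to an MP3 file using gTTS."""
    from gtts import gTTS  # requires: pip install gTTS

    gTTS(text=text, lang="en").save(str(out_path))


def synthesize_openai(text: str, out_path: Path) -> None:
    """Write speech for `text` to an MP3 file using the OpenAI speech endpoint."""
    from openai import OpenAI  # requires: pip install openai

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    # "tts-1" and "alloy" are example choices, not settings taken from this guide.
    with client.audio.speech.with_streaming_response.create(
        model="tts-1", voice="alloy", input=text
    ) as response:
        response.stream_to_file(out_path)


# Hypothetical lookup table: backend name -> (function, file suffix)
BACKENDS = {
    "gtts": (synthesize_gtts, ".mp3"),
    "openai": (synthesize_openai, ".mp3"),
}


def synthesize(backend: str, text: str, audio_folder: str, stem: str) -> Path:
    """Synthesize `text` with the chosen backend and return the output path."""
    func, suffix = BACKENDS[backend]
    # Append the suffix by hand: Path.with_suffix would treat the part after
    # the dot in a stem like "1.Intro" as an existing extension and replace it.
    out_path = Path(audio_folder) / f"{stem}{suffix}"
    func(text, out_path)
    return out_path
```

Both of these backends produce MP3 files, whereas the Coqui script above writes WAV. Long transcripts may also need to be split before being sent to a hosted service, since such endpoints typically cap the input length per request.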

Conclusion

The OpenAI TTS is incredibly realistic, and with the latest spring update, it can be hard to distinguish from a real human voice. However, the open-source options are not far behind, with some truly high-quality models available.

Automations like this, powered by DocsGPT, are transforming the way we interact with documents. They help us optimize our workflows and save valuable time. Whether you prefer the highest quality audio with OpenAI, the open-source flexibility of Coqui-AI, or the speed of gTTS, this guide has you covered. If you want to implement something like this or have a completely different project in mind, our team at Arc53 is here to help.

Enhance Your Learning Experience with AI-Powered Text-to-Speech