Add GCP Text-to-Speech and Cloud Run (#1)

tszumowski · web-flow · commit 4124f273d615 · 2023-03-15T17:36:24.000-04:00
* add gcp tts interface

* add TODOs to readme

* add optional TTS and adjust autoplay

* more updates

* add precommit and flake8

* parameterize more

* parameterize more

* formatting

* config update

* add dockerfile

* update readme

* docker updates

* more docker updates, now tested

* more docker updates, now tested

* add cloud run tested
diff --git a/.flake8 b/.flake8
@@ -0,0 +1,18 @@
+[flake8]
+# Some sane defaults for the code style checker flake8
+exclude =
+    .tox
+    build
+    dist
+    .eggs
+    docs/conf.py
+max-line-length = 88
+ignore =
+    # Whitespace before ':'
+    E203
+    # Whitespace at end of line
+    W291
+    # Line break before binary operator
+    W503
+    # Missing f-string placeholders
+    F541
diff --git a/.gitignore b/.gitignore
@@ -127,3 +127,11 @@ dmypy.json
 
 # Pyre type checker
 .pyre/
+
+# Other
+**/*.jpg
+**/*.jpeg
+**/*.mp3
+**/*.wav
+**/*.json
+**/*.txt
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -0,0 +1,31 @@
+---
+fail_fast: true
+repos:
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v2.1.0
+    hooks:
+      - id: check-executables-have-shebangs
+      - id: check-json
+      - id: pretty-format-json
+        args: ["--autofix"]
+      - id: check-merge-conflict
+      - id: debug-statements
+      - id: detect-private-key
+      - id: forbid-new-submodules
+      - id: trailing-whitespace
+      - id: requirements-txt-fixer
+  - repo: https://github.com/adrienverge/yamllint
+    rev: v1.14.0
+    hooks:
+      - id: yamllint
+        args: ['-d {rules: {line-length: disable}}', '-s']
+  - repo: https://github.com/ambv/black
+    rev: 22.3.0
+    hooks:
+      - id: black
+        language_version: python3
+  - repo: https://github.com/pycqa/flake8
+    rev: 3.8.3
+    hooks:
+      - id: flake8
+        args: ["--config=.flake8"]
diff --git a/Dockerfile b/Dockerfile
@@ -0,0 +1,23 @@
+# Use the official Python image as the parent image
+FROM python:3.8-slim-buster
+
+# Set the working directory to /app
+WORKDIR /app
+
+# Install FFmpeg and other system dependencies
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    ffmpeg \
+    && rm -rf /var/lib/apt/lists/*
+
+# Install the required dependencies
+COPY requirements.txt ./
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Copy the necessary files to the Docker image
+COPY storyteller.py config.py ./
+
+# Expose port 7860
+EXPOSE 7860
+
+# Set the default command to execute the `storyteller.py` script
+CMD ["python", "storyteller.py"]
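The Dockerfile's default command runs `storyteller.py` with no arguments, while the README's `docker run` example passes `--address`, `--port`, `--user`, and `--password`. The `storyteller.py` diff is not shown here, so the following is only a minimal sketch of how such flags might be parsed and mapped onto Gradio's `launch()` keyword arguments (`server_name`, `server_port`, `auth`). The helper names `parse_args` and `launch_kwargs` and the default values are assumptions, not the author's code.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags used in the README's `docker run` example (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Audio storyteller launcher (sketch)")
    # Defaults are assumptions: Gradio's usual local address and the exposed port 7860.
    parser.add_argument("--address", default="127.0.0.1")
    parser.add_argument("--port", type=int, default=7860)
    parser.add_argument("--user", default=None)
    parser.add_argument("--password", default=None)
    return parser.parse_args(argv)


def launch_kwargs(args):
    """Build keyword arguments for Gradio's `launch()` from parsed flags (hypothetical helper)."""
    kwargs = {"server_name": args.address, "server_port": args.port}
    # Only enable basic auth when both a user and a password were supplied.
    if args.user and args.password:
        kwargs["auth"] = (args.user, args.password)
    return kwargs
```

Inside a container the address would need to be `0.0.0.0` so that the published port is reachable from the host, which is presumably why the README example overrides it.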
diff --git a/README.md b/README.md
@@ -2,21 +2,107 @@
 
 ---
 
-TODO: Document. Medium?
-Make note of pricing.
+This is a Gradio UI application that takes in a request for a story from the microphone
+and speaks an interactive Choose-Your-Own-Adventure style children's story. It leverages:
+
+- [OpenAI Whisper](https://openai.com/research/whisper): to transcribe the user's spoken request
+- [OpenAI ChatGPT (3.5-turbo)](https://platform.openai.com/docs/models/gpt-3-5):
+  to generate a story chapter given the user's inputs
+- (Optional) [Google Cloud Text-to-Speech](https://cloud.google.com/text-to-speech/):
+  to use realistic voices when telling the story.
+
+## Pricing
+
+**WARNING: This application uses paid API services. Create quotas and watch your usage.**
+
+At the time of writing, the pricing is as follows:
+
+- [whisper](https://openai.com/pricing): $0.006 / minute (rounded to the nearest second)
+- [gpt-3.5-turbo](https://openai.com/pricing): $0.002 / 1K tokens
+- [Google Text-to-Speech](https://cloud.google.com/text-to-speech/pricing):
+  - 0 to 1 million bytes free per month
+  - $0.000016 USD per byte ($16.00 USD per 1 million bytes)
+
+Check the links, as these prices can change often. At the time of writing, light use
+costs less than one USD.
+
+Both OpenAI and Google offer free credits for new users.
 
 ## Setup
 
-1. Get an OpenAI API key.
+There are two ways to speak the story: the Mac `say` command or GCP Text-to-Speech. On a Mac,
+the `say` command is the easiest and fastest route to running this.
+It uses the system voice set in the Accessibility settings.
+If you are not on a Mac, or prefer a more realistic voice, use GCP Text-to-Speech instead.
+This requires (a) a GCP project, (b) the TTS API enabled, and (c) your account authenticated
+in gcloud (or the `GOOGLE_APPLICATION_CREDENTIALS` environment variable set).
+
+This application has only been tested on a MacBook.
+
+1. Sign up at OpenAI and acquire an [OpenAI API key](https://platform.openai.com/account/api-keys).
 1. Add to environment variable with: `export OPENAI_API_KEY="sk-xxxxxxxxxxxxxxx`"
 1. Create virtual environment
-1. `pip install -r requirements.txt`
-1. Brew install `ffmpeg`: `brew install ffmpeg`
-1. Update config in `config.py` as desired
+1. Run `pip install -r requirements.txt`
+1. If on Mac, install `ffmpeg` with Homebrew: `brew install ffmpeg`
+
+   - Linux may also need `ffmpeg` installed, but this is untested.
+
+1. Review and update config in `config.py` as desired
+1. If using GCP TTS:
+   1. Set in `config.py`: `SPEECH_METHOD = SpeechMethod.GCP`
+   1. Navigate to the [Google API page](https://console.cloud.google.com/apis/api/texttospeech.googleapis.com/) and enable the API
+   1. Confirm you are authenticated in gcloud and your account has access to that API.
 1. Run with: `python storyteller.py`
 1. Navigate to `http://127.0.0.1:7860/` and have fun!
 
-## TODO
+## Running as Docker Container
+
+Replace `<image-name>` with a name of your choice.
+
+1. Build Docker image: `docker build -t <image-name> .`
+1. Run locally with something similar to:
+
+```
+docker run -it --rm \
+    -e GOOGLE_APPLICATION_CREDENTIALS=/tmp/creds.json \
+    -v ${HOME}/.config/gcloud/application_default_credentials.json:/tmp/creds.json \
+    -e OPENAI_API_KEY=<openai-api-key> \
+    -p <port>:7860 \
+    <image-name> \
+    python storyteller.py \
+    --address=0.0.0.0 \
+    --port=7860 \
+    --user=<username> \
+    --password=<password>
+```
+
+Fill in `<openai-api-key>`, `<port>`, and the optional `<username>` and `<password>`.
+Once the container is running, navigate in a browser to `127.0.0.1:<port>` and enter the
+optional username and password you provided.
+
+## Deploying to Google Cloud Run
+
+1. Follow the directions above to create a local docker image.
+1. Tag and push (Note: Follow [these directions](https://cloud.google.com/container-registry/docs/advanced-authentication) to authenticate)
+   ```
+   docker tag <image-name> gcr.io/<project-id>/<image-name>
+   docker push gcr.io/<project-id>/<image-name>
+   ```
+1. Create a service account on your GCP project IAM page named: `audio-storytelling-bot@<project-id>.iam.gserviceaccount.com`
+1. Deploy with the following command, setting anything in `<>` appropriately:
+
+   ```
+   gcloud run deploy audio-storytelling-bot \
+       --image gcr.io/<project-id>/<image-name> \
+       --platform managed \
+       --service-account=audio-storytelling-bot@<project-id>.iam.gserviceaccount.com \
+       --set-env-vars=OPENAI_API_KEY=<openai-key-string> \
+       --no-allow-unauthenticated \
+       --port=7860 \
+       --cpu=1 \
+       --memory=512Mi \
+       --min-instances=0 \
+       --max-instances=1
+   ```
 
-- [ ] Fix the audio thread error that pops up
-- [ ] Document
+With `--min-instances=0` and `--max-instances=1`, Cloud Run scales between zero and one instance based on incoming traffic. You can access the deployed Gradio application via the URL provided by the Cloud Run service.
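The per-unit prices listed in the README's Pricing section can be combined into a rough session estimate. The sketch below hard-codes the figures quoted there at the time of writing. The function name `estimate_cost_usd` and its parameters are hypothetical, and the sketch ignores the README's note about rounding Whisper usage, so treat the result as an approximation only.

```python
# Prices quoted in the README at the time of writing; check the linked pages for changes.
WHISPER_USD_PER_MINUTE = 0.006
GPT35_USD_PER_1K_TOKENS = 0.002
TTS_USD_PER_BYTE = 0.000016
TTS_FREE_BYTES_PER_MONTH = 1_000_000


def estimate_cost_usd(audio_seconds, chat_tokens, tts_bytes, tts_bytes_already_used=0):
    """Rough cost of one session in USD (hypothetical helper, not part of the repo)."""
    whisper = WHISPER_USD_PER_MINUTE * audio_seconds / 60
    chat = GPT35_USD_PER_1K_TOKENS * chat_tokens / 1000
    # Only bytes beyond what remains of the monthly free tier are billed.
    free_left = max(0, TTS_FREE_BYTES_PER_MONTH - tts_bytes_already_used)
    billable_bytes = max(0, tts_bytes - free_left)
    tts = TTS_USD_PER_BYTE * billable_bytes
    return whisper + chat + tts
```

For example, one minute of recorded audio plus 1,000 chat tokens, with speech output still inside the free tier, comes to roughly $0.008 under these assumptions.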
diff --git a/config.py b/config.py
@@ -1,9 +1,41 @@
+from enum import Enum
 import time
 
+"""
+Speech method
+    SpeechMethod.NONE: No speech
+    SpeechMethod.GCP: Google Cloud Platform Text-to-Speech API
+    SpeechMethod.MAC: macOS `say` command
+
+Note: For GCP, you must be authenticated with the gcloud CLI or set the
+GOOGLE_APPLICATION_CREDENTIALS environment variable
+"""
+
+
+# Enum of the supported speech methods
+class SpeechMethod(Enum):
+    NONE = 1
+    GCP = 2
+    MAC = 3
+
+
+# Set the method here
+SPEECH_METHOD = SpeechMethod.GCP
+
+
+"""
+Other configuration
+"""
 RESOLUTION = "512x512"  # One of 256x256, 512x512, 1024x1024
 PROMPT_MAX_LEN = 1000  # Max length of prompt for DALL-E
 IMAGE_PATH = "generated_image.jpg"  # path to save generated image
 TRANSCRIPT_PATH = f"transcript-{int(time.time())}.txt"
+GENERATED_SPEECH_PATH = "generated_speech.mp3"
+TTS_SPEECH_DELAY = 5.0  # seconds to wait before playing generated speech
+
+# Voice for GCP Text-to-Speech API
+# Samples: https://cloud.google.com/text-to-speech/docs/voices
+TTS_VOICE = "en-GB-Neural2-C"
 
 """
 Example Prompts
@@ -33,3 +65,9 @@
     of each chapter first pause for a moment. Then ask the reader a single question
     that chooses the path for their next chapter in their story.
 """
+
+"""
+DERIVED CONFIG
+"""
+# Derive the language code (e.g. "en-GB") from the first two parts of TTS_VOICE
+TTS_VOICE_LANGUAGE_CODE = "-".join(TTS_VOICE.split("-")[0:2])
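The derived `TTS_VOICE_LANGUAGE_CODE` relies on voice names having the form `<language>-<region>-<type>-<variant>`. As a minimal sketch, the same derivation can be wrapped in a function that rejects names with too few parts. The function name `language_code_from_voice` and the validation step are assumptions added for illustration; the config itself performs only the split and join.

```python
def language_code_from_voice(voice_name: str) -> str:
    """Return the leading language code of a voice name, e.g. "en-GB" (hypothetical helper)."""
    parts = voice_name.split("-")
    # Assumption: a valid voice name has a language, a region, and at least one more part.
    if len(parts) < 3:
        raise ValueError(f"Unexpected voice name format: {voice_name!r}")
    # Same expression as in config.py: keep only the first two dash-separated parts.
    return "-".join(parts[0:2])
```

With the configured voice `"en-GB-Neural2-C"`, this returns `"en-GB"`.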
diff --git a/requirements.txt b/requirements.txt
@@ -1,2 +1,6 @@
+black
+flake8
+google-cloud-texttospeech
+gradio
 openai
-gradio
+pre-commit
diff --git a/storyteller.py b/storyteller.py