# Using Claude 3 to transform a video tutorial into a blog post

This notebook provides a baseline to reproduce Anthropic's solution to Karpathy's challenge of converting a video tutorial into a blog post. See the associated Medium article [here](https://medium.com/@ya-lb/using-claude-3-to-transform-a-video-tutorial-in-a-blog-post-d2c1e04e7a7b).

### Original data

- Karpathy's video tutorial on tokenization : https://www.youtube.com/watch?v=zduSFxRajkE
- Hand-written tutorial summary : https://github.com/karpathy/minbpe/blob/master/lecture.md




# Install and import libraries

- `pytube`: used to download a YouTube video
- `youtube-transcript-api`: used to directly download the video transcript from YouTube, if available
- `faster_whisper`: used to get the transcript from the audio
- `anthropic`: used to access the Claude 3 large multimodal model

In [9]:
%%capture
!pip install -q pytube
!pip install -q youtube-transcript-api
!pip install -q anthropic
!pip install faster_whisper

Let us load the libraries

In [1]:
import os
import glob
from pathlib import Path
import re

import pytube
from youtube_transcript_api import YouTubeTranscriptApi
from faster_whisper import WhisperModel
import torch
import anthropic

import cv2 #Used to extract frames from video
import base64 #Used to convert JPG image in base64 format


Put your Anthropic API key here:

In [2]:
ANTHROPIC_API_KEY = "sk-..."
os.environ["ANTHROPIC_API_KEY"] = ANTHROPIC_API_KEY

client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

Define the YouTube video ID and the folders used to store the video, the chapters, and the resulting blog post.

In [3]:
# Andrej Karpathy : Let's build the GPT Tokenizer - https://www.youtube.com/watch?v=zduSFxRajkE
youtube_video_id = "zduSFxRajkE"

DATA_DIR = youtube_video_id
CHAPTERS_DIR = DATA_DIR+"/chapters"
MERGE_DIR = DATA_DIR+"/final_output"

if not os.path.exists(DATA_DIR):
    os.makedirs(DATA_DIR)

if not os.path.exists(CHAPTERS_DIR):
    os.makedirs(CHAPTERS_DIR)

if not os.path.exists(MERGE_DIR):
    os.makedirs(MERGE_DIR)
    

## Download video and get transcript



### Download video

In [4]:
def download_youtube_video(video_id, output_path):
    """
    Download a YouTube video given its ID, store it in output_path with the video ID as filename, and return the path of the downloaded file.
    """

    # Create a YouTube object with the video ID
    youtube = pytube.YouTube(f"https://www.youtube.com/watch?v={video_id}")
    # Get the highest resolution video stream
    stream = youtube.streams.get_highest_resolution()
    # Download the video
    video_path = stream.download(output_path=output_path, filename=video_id+".mp4")

    return video_path



In [5]:
# About 20 seconds for 330MB video
%time video_path=download_youtube_video(youtube_video_id, DATA_DIR)

CPU times: user 483 ms, sys: 51.7 ms, total: 534 ms
Wall time: 3.44 s


### Get transcript



#### With YouTubeTranscriptApi

In [6]:
transcript = YouTubeTranscriptApi.get_transcript(youtube_video_id, languages=["en"])
transcript[0:4]

[{'text': "hi everyone so in this video I'd like us",
  'start': 0.04,
  'duration': 4.04},
 {'text': 'to cover the process of tokenization in',
  'start': 2.04,
  'duration': 4.4},
 {'text': 'large language models now you see here',
  'start': 4.08,
  'duration': 4.2},
 {'text': "that I have a set face and that's",
  'start': 6.44,
  'duration': 3.88}]

In [7]:
len(transcript)

3422

#### With Whisper

In [None]:
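# float16 requires a GPU; fall back to int8 when running on CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
whisper_model = WhisperModel("large-v3",
                              device=device,
                              compute_type="float16" if device == "cuda" else "int8",
                            )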

In [22]:
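def speech_to_text(whisper_model, audio_file, initial_prompt="", language="en"):
    """
    Transcribe audio_file with Whisper and return a list of segments in the same
    format as YouTubeTranscriptApi (dicts with 'start', 'duration' and 'text').
    """
    segments, transcript_info = whisper_model.transcribe(audio_file, initial_prompt=initial_prompt, language=language)

    # transcribe() returns a generator: the transcription actually runs while it is consumed
    return [
        {
            "start": round(s.start, 2),
            "duration": round(s.end - s.start, 2),
            "text": s.text,
        }
        for s in segments
    ]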

In [None]:
#25 minutes for a 2h13 video on T4
transcript = speech_to_text(whisper_model, video_path)

## Chop up into chapters of aligned text and screenshots

In [8]:
def get_text_chapter(transcript, chapter_start_time, chapter_end_time, output_dir):
    """
    Extract and save a specified chapter's text from a transcript.

    This function iterates through a transcript, extracting text that falls within the specified start and end times
    of a chapter. The extracted text is concatenated into a single string, which is then saved to a file named
    'transcript.txt' within the specified output directory.

    Args:
        transcript (list of dicts): The transcript from which to extract text, where each entry in the list
            represents a segment of the transcript with a start time, end time, and text.
        chapter_start_time (int): The start time of the chapter, used to identify which segments of the transcript to include.
        chapter_end_time (int): The end time of the chapter, used to identify which segments of the transcript to include.
        output_dir (str): The directory where the extracted chapter text will be saved.

    The function does not return any value but writes the extracted chapter text to 'transcript.txt' in the specified directory.
    """
    text_chapter = ""

    for i in range(len(transcript)):
        transcript_i = transcript[i]

        # Check if the current transcript segment falls within the chapter's start and end times
        if int(transcript_i['start']) >= chapter_start_time and int(transcript_i['start']) <= chapter_end_time:
            # Concatenate text from the audio transcript, removing any new lines and leading/trailing whitespace
            text_chapter += transcript_i['text'].replace('\n', ' ').strip() + " "

    # Define the path to the output transcript file
    transcript_file = output_dir + '/transcript.txt'

    # Save the concatenated chapter text to the specified file
    with open(transcript_file, "w") as f:
        f.write(text_chapter)
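
To see what the time filter keeps, here is a minimal sketch that applies the same start-time test to a small in-memory transcript. `select_chapter_text` is a hypothetical helper introduced only for illustration: it returns the text instead of writing `transcript.txt`, and the toy segments are made up in the format shown earlier.

```python
def select_chapter_text(transcript, chapter_start_time, chapter_end_time):
    """Hypothetical helper: same filter as get_text_chapter, but returns the text."""
    kept = [
        seg["text"].replace("\n", " ").strip()
        for seg in transcript
        # A segment belongs to the chapter if its (truncated) start time is in range
        if chapter_start_time <= int(seg["start"]) <= chapter_end_time
    ]
    return " ".join(kept)


# Toy transcript in the same format as YouTubeTranscriptApi / speech_to_text
toy_transcript = [
    {"text": "hi everyone", "start": 0.04, "duration": 4.04},
    {"text": "to cover the process\nof tokenization", "start": 2.04, "duration": 4.4},
    {"text": "next chapter starts here", "start": 350.2, "duration": 3.0},
]

chapter_0_text = select_chapter_text(toy_transcript, 0, 349)
print(chapter_0_text)
```

Segments are assigned by their start time only, so a segment that straddles a chapter boundary stays entirely in the chapter where it begins.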

In [9]:
def get_frames_chapter(video_path, chapter_start_time, chapter_end_time, output_dir, timestamps_screenshots=None):
    """
    Extract and save frames from a specified chapter of a video at given timestamps or at regular intervals.

    This function calculates a list of timestamps to take screenshots if not provided, defaulting to 10 evenly spaced
    intervals within the chapter duration. If the calculated interval is less than 60 seconds, it defaults to 60 seconds.
    It then opens the video file, iterates over the calculated or provided timestamps, captures frames at these timestamps,
    and saves them as JPEG files in the specified output directory.

    Args:
        video_path (str): The path to the video file.
        chapter_start_time (int): The start time of the chapter in seconds.
        chapter_end_time (int): The end time of the chapter in seconds.
        output_dir (str): The directory where the extracted frames will be saved.
        timestamps_screenshots (list of int, optional): Specific timestamps to capture screenshots. If None,
            screenshots will be taken at regular intervals within the chapter.

    The function does not return any value but saves the captured frames in the specified output directory.
    """
    # Calculate default timestamps if not provided
    if timestamps_screenshots is None:
        screenshot_interval = int((chapter_end_time - chapter_start_time) / 10)
        # Ensure a minimum interval of 60 seconds between screenshots
        if screenshot_interval < 60:
            screenshot_interval = 60
        timestamps_screenshots = list(range(chapter_start_time, chapter_end_time, screenshot_interval))
    else:
        #Make sure timestamps are integers
        timestamps_screenshots = [int(ts) for ts in timestamps_screenshots]

    # Open the video file using OpenCV
    video = cv2.VideoCapture(video_path)

    # Determine the frames per second (FPS) of the video for frame index calculation
    fps = video.get(cv2.CAP_PROP_FPS)

    # Capture and save frames at each specified timestamp
    for timestamp in timestamps_screenshots:
        # Calculate the frame index based on the timestamp and video FPS
        index = int(timestamp * fps)
        video.set(cv2.CAP_PROP_POS_FRAMES, index)

        # Attempt to read the frame at the calculated index
        success, frame = video.read()

        # If the frame is successfully read, save it as a JPEG file
        if success:
            # Format the timestamp for the output filename
            timestamp_str = "{:05d}".format(timestamp)
            output_path = f"{output_dir}/{timestamp_str}.jpg"
            # Save the frame to the output directory
            cv2.imwrite(output_path, frame)

    # Release the video file resources
    video.release()


In [10]:
def chop_up_in_chapters(chapters_list, video_path, transcript, timestamps_screenshots_list_seconds=None):
    """
    Split the video into chapters based on the chapters list. The last entry of the list only marks the end of the final chapter.
    """

    n_chapters=len(chapters_list)-1
    print(f"Number of chunks: {n_chapters}")

    # Iterate over the timestamps and topics
    for current_chapter in range(n_chapters):

        output_dir=CHAPTERS_DIR+"/"+str(current_chapter)

        # Create the output directory if it does not exist
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)

        # Get the current and next timestamp
        current_chunk_start_time=chapters_list[current_chapter]['timestamp']
        current_chunk_end_time=chapters_list[current_chapter+1]['timestamp']-1

        print(f"Chapter {current_chapter}; Start: {current_chunk_start_time}, End: {current_chunk_end_time}")

        # Extract text and frames for the current chapter
        get_text_chapter(transcript, current_chunk_start_time, current_chunk_end_time, output_dir)
        
        if timestamps_screenshots_list_seconds is not None:
            get_frames_chapter(video_path, current_chunk_start_time, current_chunk_end_time, output_dir,timestamps_screenshots_list_seconds[current_chapter])
        else:
            get_frames_chapter(video_path, current_chunk_start_time, current_chunk_end_time, output_dir)

In [11]:
chapters_24="""
00:00:00 intro: Tokenization, GPT-2 paper, tokenization-related issues
00:05:50 tokenization by example in a Web UI (tiktokenizer)
00:14:56 strings in Python, Unicode code points
00:18:15 Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32
00:22:47 daydreaming: deleting tokenization
00:23:50 Byte Pair Encoding (BPE) algorithm walkthrough
00:27:02 starting the implementation
00:28:35 counting consecutive pairs, finding most common pair
00:30:36 merging the most common pair
00:34:58 training the tokenizer: adding the while loop, compression ratio
00:39:20 tokenizer/LLM diagram: it is a completely separate stage
00:42:47 decoding tokens to strings
00:48:21 encoding strings to tokens
00:57:36 regex patterns to force splits across categories
01:11:38 tiktoken library intro, differences between GPT-2/GPT-4 regex
01:14:59 GPT-2 encoder.py released by OpenAI walkthrough
01:18:26 special tokens, tiktoken handling of, GPT-2/GPT-4 differences
01:25:28 minbpe exercise time! write your own GPT-4 tokenizer
01:28:42 sentencepiece library intro, used to train Llama 2 vocabulary
01:43:27 how to set vocabulary set? revisiting gpt.py transformer
01:48:11 training new tokens, example of prompt compression
01:49:58 multimodal [image, video, audio] tokenization with vector quantization
01:51:41 revisiting and explaining the quirks of LLM tokenization
02:10:20 final recommendations
02:12:50 ??? :)
"""

def chapters_to_list(chapters):
    chapters_list = chapters.strip().split('\n')
    chapters_dict_list = []

    for chapter in chapters_list:
        time_str, topic = chapter.split(' ', 1)
        hours, minutes, seconds = map(int, time_str.split(':'))
        total_seconds = hours * 3600 + minutes * 60 + seconds
        chapters_dict_list.append({"timestamp": total_seconds, "topic": topic})

    return chapters_dict_list

chapters_list = chapters_to_list(chapters_24)
last_timestamp=int(transcript[-1]['start']+transcript[-1]['duration'])
#chapters_list.append({"timestamp": last_timestamp, "topic": "end"})
chapters_list

[{'timestamp': 0,
  'topic': 'intro: Tokenization, GPT-2 paper, tokenization-related issues'},
 {'timestamp': 350,
  'topic': 'tokenization by example in a Web UI (tiktokenizer)'},
 {'timestamp': 896, 'topic': 'strings in Python, Unicode code points'},
 {'timestamp': 1095,
  'topic': 'Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32'},
 {'timestamp': 1367, 'topic': 'daydreaming: deleting tokenization'},
 {'timestamp': 1430,
  'topic': 'Byte Pair Encoding (BPE) algorithm walkthrough'},
 {'timestamp': 1622, 'topic': 'starting the implementation'},
 {'timestamp': 1715,
  'topic': 'counting consecutive pairs, finding most common pair'},
 {'timestamp': 1836, 'topic': 'merging the most common pair'},
 {'timestamp': 2098,
  'topic': 'training the tokenizer: adding the while loop, compression ratio'},
 {'timestamp': 2360,
  'topic': 'tokenizer/LLM diagram: it is a completely separate stage'},
 {'timestamp': 2567, 'topic': 'decoding tokens to strings'},
 {'timestamp': 2901, 'topic': 'encodi
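
As a small worked example of how the chapter list is used, the sketch below converts three of the timestamps to seconds and pairs each start with the second before the next start, as `chop_up_in_chapters` does. `to_seconds` and `chapter_bounds` are hypothetical helpers written for this illustration.

```python
def to_seconds(time_str):
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = map(int, time_str.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def chapter_bounds(timestamps):
    """Hypothetical helper: pair each start with the second before the next start."""
    return [(start, next_start - 1)
            for start, next_start in zip(timestamps, timestamps[1:])]


starts = [to_seconds(t) for t in ["00:00:00", "00:05:50", "00:14:56"]]
bounds = chapter_bounds(starts)
print(bounds)
```

N timestamps give N-1 chapters, which is why the 25 entries above produce 24 chapters: the final `02:12:50` entry only closes the last one.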

In [12]:
chop_up_in_chapters(chapters_list, video_path, transcript) 

Number of chunks: 24
Chapter 0; Start: 0, End: 349
Chapter 1; Start: 350, End: 895
Chapter 2; Start: 896, End: 1094
Chapter 3; Start: 1095, End: 1366
Chapter 4; Start: 1367, End: 1429
Chapter 5; Start: 1430, End: 1621
Chapter 6; Start: 1622, End: 1714
Chapter 7; Start: 1715, End: 1835
Chapter 8; Start: 1836, End: 2097
Chapter 9; Start: 2098, End: 2359
Chapter 10; Start: 2360, End: 2566
Chapter 11; Start: 2567, End: 2900
Chapter 12; Start: 2901, End: 3455
Chapter 13; Start: 3456, End: 4297
Chapter 14; Start: 4298, End: 4498
Chapter 15; Start: 4499, End: 4705
Chapter 16; Start: 4706, End: 5127
Chapter 17; Start: 5128, End: 5321
Chapter 18; Start: 5322, End: 6206
Chapter 19; Start: 6207, End: 6490
Chapter 20; Start: 6491, End: 6597
Chapter 21; Start: 6598, End: 6700
Chapter 22; Start: 6701, End: 7819
Chapter 23; Start: 7820, End: 7969


## LLM transform



This is the core step. For each chapter, the audio transcript and selected screenshots are provided to the large multimodal model (LMM), with the goal of transforming this input into an output suitable for inclusion in a textbook.

Documentation for querying Claude with images: https://docs.anthropic.com/claude/docs/vision

Prompt inspired by https://github.com/hundredblocks/transcription_demo/tree/main

In [13]:
prompt_instructions = f"""
<instructions>
You have been given images of a video at different timestamps, followed by the audio transcript in <transcript>
The transcript was generated by an AI speech recognition tool and may contain some errors/infelicities.
Your task is to transform the transcript into a markdown blog post.
This transcript is noisy. Please rewrite it using the following guidelines:
- output valid markdown
- insert section headings and other formatting where appropriate
- you are given only part of a transcript, so do not include introductory or concluding paragraphs. Only include the main topics discussed in the transcript
- use styling to make images, text, code, callouts and the page layout and margins look like a typical blog post or textbook
- remove any verbal tics
- if there are redundant pieces of information, only present it once
- keep the conversational content in the style of the transcript. Including headings to make the narrative structure easier to follow along
- the transcript includes too many images, so you should only include the most important 1-2 images in your output
- choose images that provide illustrations that are relevant to the transcript
- prefer to include images which display complete code, rather than in progress
- when relevant transcribe important pieces of code and other valuable text
- if an image would help illustrate a part of a transcript, include it
- to include an image, insert a tag with <img src="xxxxx.jpg"/> where xxxxx is replaced by the exact image timestamp inserted above the image data
- do not add any extraneous information: only include what is either mentioned in the transcript or the images

Your final output should be suitable for inclusion in a textbook.
</instructions>
"""

Transform the JPG screenshots into a format suitable for Anthropic's API.

The function iterates over all screenshots and describes each of them with two messages:

- a text message that specifies the timestamp of the screenshot,
- an image message containing its base64-encoded representation.

The text message with the timestamp later makes it possible to add a hyperlink from the final document to the original video.

In [14]:
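def get_screenshots_as_messages(screenshots):
    """
    Convert a list of JPG file paths into content blocks for Anthropic's API:
    for each screenshot, a text block giving its timestamp followed by a base64-encoded image block.
    """
    screenshots_as_messages = []

    for screenshot in screenshots:
        # Read and base64-encode the image (the context manager closes the file)
        with open(screenshot, "rb") as f:
            image_data = base64.b64encode(f.read()).decode("utf-8")

        screenshots_as_messages.extend([
            {
                "type": "text",
                "text": f"The timestamp for the following image is {Path(screenshot).stem}."
            },
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/jpeg",
                    "data": image_data,
                }
            }
        ])

    return screenshots_as_messages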
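
The structure of these content blocks can be inspected offline, without any image file or API call. The sketch below builds the two blocks for a single screenshot from raw bytes; `image_bytes_to_blocks` is a hypothetical helper, and the bytes are a placeholder rather than a real JPEG.

```python
import base64


def image_bytes_to_blocks(timestamp_label, jpeg_bytes):
    """Hypothetical helper: one text block naming the timestamp, one base64 image block."""
    return [
        {
            "type": "text",
            "text": f"The timestamp for the following image is {timestamp_label}.",
        },
        {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/jpeg",
                "data": base64.b64encode(jpeg_bytes).decode("utf-8"),
            },
        },
    ]


# Placeholder bytes, NOT a real JPEG: enough to inspect the structure
placeholder = b"\xff\xd8placeholder\xff\xd9"
blocks = image_bytes_to_blocks("00060", placeholder)
print([block["type"] for block in blocks])
```

The timestamp label is the file stem (for example `00060`), which is the same string the model is asked to reuse in its `<img src="xxxxx.jpg"/>` tags.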


Bring together the screenshots, transcript and instructions. 

The function additionally prefills Claude's output to make it start its answer with a markdown title - https://docs.anthropic.com/claude/docs/prefill-claudes-response

In [15]:
def get_prompt_as_messages(chapter_id):

    folder_path=CHAPTERS_DIR+'/'+str(chapter_id)

    with open(folder_path+'/transcript.txt', "r") as f:
        transcript = f.read()

    screenshots=sorted(glob.glob(folder_path+'/*.jpg'))
    
    screenshots_as_messages=get_screenshots_as_messages(screenshots)

    prompt_as_messages = [
        {
            "role": "user",
            "content": screenshots_as_messages+
            [
                {
                    "type": "text",
                    "text": f"<transcript>\n{transcript}\n</transcript>"
                },
                {
                    "type": "text",
                    "text": prompt_instructions
                }
            ],
        },
        {
            "role": "assistant",
            "content": [
                {
                    "type": "text",
                    "text": "#"
                }
            ]
        }
    ]

    return prompt_as_messages
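
Because the assistant turn is prefilled with `#`, the response contains only the continuation of that text, so the prefill has to be added back before saving the result. A minimal sketch of that step, with a made-up continuation standing in for the model's answer (`restore_prefill` is a hypothetical helper):

```python
def restore_prefill(prefill, completion):
    """Hypothetical helper: re-attach the prefilled text to the model's continuation."""
    return prefill + completion


# Made-up continuation, standing in for the text returned by the API
example_completion = " Tokenization by example\n\nSome body text."
restored_markdown = restore_prefill("#", example_completion)
print(restored_markdown.splitlines()[0])
```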

In [None]:
# Check content
prompt_as_messages = get_prompt_as_messages(0)
prompt_as_messages

Call Claude for each chapter and write the result as a markdown file in the corresponding chapter folder.

In [17]:

# Iterate through the list of chapters
for chapter in range(len(chapters_list)-1): 

    # Display the current processing chapter number to the console.
    print(f"Processing chunk {chapter}")

    # Generate the prompt for the current chapter (list of messages with screenshots, transcript and instructions).
    prompt_generate_markdown = get_prompt_as_messages(chapter)

    # Create a message by invoking Claude with the prompt.
    message = client.messages.create(
        model="claude-3-opus-20240229",
        system="You are an expert at writing markdown blog posts.",
        temperature=0,
        max_tokens=4000,
        messages=prompt_generate_markdown
    )

    # Extract the generated markdown content from the response.
    answer = message.content[0].text
    markdown = "#"+answer  # Prepend a header tag to the markdown content.
    
    # Define the path for the markdown file corresponding to the current chapter.
    markdown_file = CHAPTERS_DIR + '/' + str(chapter) + '/markdown.md'

    # Write the generated markdown content to the file.
    with open(markdown_file, "w") as f:
        f.write(markdown)



Processing chunk 0
Processing chunk 1
Processing chunk 2
Processing chunk 3
Processing chunk 4
Processing chunk 5
Processing chunk 6
Processing chunk 7
Processing chunk 8
Processing chunk 9
Processing chunk 10
Processing chunk 11
Processing chunk 12
Processing chunk 13
Processing chunk 14
Processing chunk 15
Processing chunk 16
Processing chunk 17
Processing chunk 18
Processing chunk 19
Processing chunk 20
Processing chunk 21
Processing chunk 22
Processing chunk 23


## Merge all chapters and finalize blog post

The final step of the workflow consists of two tasks. First, it merges the markdown outputs of the different chapters. Second, it adds hyperlinks to chapter titles and images, which connects the final markdown file to the original YouTube video at the relevant timestamps.

In [19]:
merged_markdown=""

# Iterate over the chapter folders to merge the markdown files
for chapter in range(len(chapters_list)-1):

    markdown_file=CHAPTERS_DIR+'/'+str(chapter)+'/markdown.md'

    with open(markdown_file, "r") as f:
        # splitlines() drops the trailing newlines, so the join below does not double them
        markdown = f.read().splitlines()

    # Let us add, for each chapter title, a hyperlink to the video at the right timestamp
    url_chapter = f"https://www.youtube.com/watch?v={youtube_video_id}&t={chapters_list[chapter]['timestamp']}s"
    markdown[0] = f"# [{chapter+1}) {markdown[0][2:].strip()}]({url_chapter})"
    markdown = '\n'.join(markdown)

    merged_markdown+="\n"+markdown

# Find all <img> tags with timestamps in the src attribute, so we can add a hyperlink to the video at the right timestamp
timestamps_screenshots = re.findall(r'<img src="(\d+)\.jpg"/>', merged_markdown)
# Remove duplicates (keeping order) so that an image used twice does not get its link added twice
timestamps_screenshots = list(dict.fromkeys(timestamps_screenshots))

# Add a hyperlink to the video at the right timestamp for each image
for timestamp in timestamps_screenshots:
    video_link = f'<a href="https://www.youtube.com/watch?v={youtube_video_id}&t={int(timestamp)}s">Link to video</a>'
    merged_markdown = merged_markdown.replace(f'<img src="{timestamp}.jpg"/>', f'<img src="{timestamp}.jpg"/>\n\n{video_link}')

# Get frames based on screenshots effectively selected in the merged markdown and save in merge folder
get_frames_chapter(video_path, None, None, MERGE_DIR, timestamps_screenshots=timestamps_screenshots)

# Save the merged markdown to a markdown blogpost.md file
markdown_file=MERGE_DIR+'/blogpost.md'
with open(markdown_file, "w") as f:
        f.write(merged_markdown)
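
The image-linking step can be tried on a short string. The sketch below is an alternative formulation, not the notebook's code: it uses `re.sub` with a callback instead of the `findall` plus `replace` loop, so each `<img>` tag is handled exactly once. `add_video_links` is a hypothetical helper and the sample markdown is made up.

```python
import re

video_id = "zduSFxRajkE"  # same ID as youtube_video_id in the notebook


def add_video_links(markdown_text, video_id):
    """Hypothetical helper: append a 'Link to video' anchor after each timestamped <img> tag."""
    def link_after(match):
        seconds = int(match.group(1))  # "00350" -> 350
        url = f"https://www.youtube.com/watch?v={video_id}&t={seconds}s"
        return f'{match.group(0)}\n\n<a href="{url}">Link to video</a>'

    return re.sub(r'<img src="(\d+)\.jpg"/>', link_after, markdown_text)


sample = 'Intro text.\n\n<img src="00350.jpg"/>\n\nMore text.'
linked = add_video_links(sample, video_id)
print(linked)
```

The zero-padded file name is converted with `int`, so `00350.jpg` links to `t=350s`, that is 00:05:50 in the video.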

## Useful links

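- [Companion Medium article](https://medium.com/@ya-lb/using-claude-3-to-transform-a-video-tutorial-in-a-blog-post-d2c1e04e7a7b)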
- [Claude 3 - Vision documentation](https://docs.anthropic.com/claude/docs/vision)
- Karpathy's challenge and [Ameisen and colleagues' repository](https://github.com/hundredblocks/transcription_demo/tree/main)
