Note: I am currently working on an AI side project with a friend, and we are searching for a full-stack developer (see details). Please reach out to me via obifarin [at] yahoo [dot] com or simply reply to this newsletter so we can schedule a time to chat.
My Other Publications:
Around the Web Issue 33: Artificializing Intelligence
Video Essay: Best Bet You Can Make.
Podcast
Note: this was AI-generated with NotebookLM; read the actual newsletter and source materials, as applicable, for grounding.
In this newsletter:
🧬
LAB-Bench: Evaluating AI in Scientific Research
Nobel Laureates Discuss the Future of AI in Science
Smart People and Boring Jobs in Biology
👨💻
Human Layer: Human in the Loop for AI Agents
Anthropic’s Computer Use
Anthropic’s New Agent Protocol
🏭
ChatGPT: Work with Apps
Microsoft Launches 10 New Agents
OpenAI: 12 Days of Shipmas
📝
Chegg, ChatGPT, and the Economics of BigTech
When We Become Cogs
What Jobs Do AIs Do Better than Humans?
15 Times to use AI, and 5 Not to
[LLM/AI for Science et al] 🤖 🦠 🧬
[I]: 🧪 LAB-Bench: Evaluating AI in Scientific Research
I have been planning to write about this story for several months now, and I finally got to it. FutureHouse recently introduced the Language Agent Biology Benchmark (LAB-Bench), a new evaluation suite aimed at assessing AI’s ability to perform tasks critical to scientific research. With 2,457 questions across eight categories, LAB-Bench emphasizes procedural evaluations over knowledge-based ones, testing capabilities such as literature extraction (LitQA2), database retrieval (DbQA), biological sequence manipulation (SeqQA), and protocol design (ProtocolQA). It also features "Cloning Scenarios," complex multi-step problems designed to reflect real-world challenges in molecular biology, serving as a stringent measure of AI’s potential as a scientific collaborator. The dataset is available on Hugging Face.
Initial results indicate that current AI models lag significantly behind human performance in most tasks, except for Claude 3.5 Sonnet, which outperformed humans in TableQA precision. Performance gaps are especially notable in visual reasoning tasks like FigQA, highlighting deficiencies in models’ attention to detail and multi-step reasoning. These findings underscore the need for better tools, infrastructure, and benchmarks to advance AI’s role in scientific research, paving the way for future innovations in AI-assisted science.
I am looking forward to how the new OpenAI o1 and o1 pro mode models, and the new Reinforcement Fine-Tuning, will perform on this benchmark. My guess is that they will do really well. (See the AI X Industry + Products section below for a summary of the new OpenAI offerings.)
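For context, LAB-Bench scores models on precision (the fraction of answered questions that are correct) and coverage (the fraction of questions answered at all), since models are allowed to abstain. A minimal sketch of that scoring logic — the function and field layout here are illustrative, not the benchmark's actual harness:

```python
def score(predictions, answers):
    """Compute LAB-Bench-style precision and coverage.

    predictions: the model's chosen option per question, or None to abstain.
    answers: the correct option per question.
    """
    answered = [(p, a) for p, a in zip(predictions, answers) if p is not None]
    correct = sum(1 for p, a in answered if p == a)
    precision = correct / len(answered) if answered else 0.0
    coverage = len(answered) / len(answers)
    return precision, coverage

# Example: 4 questions, one abstention, 2 of the 3 answered are correct
preds = ["B", None, "A", "C"]
gold = ["B", "D", "A", "D"]
p, c = score(preds, gold)
print(p, c)  # precision ≈ 0.667, coverage = 0.75
```

Separating the two numbers matters: a model that abstains aggressively can post high precision on a small slice of questions, which is why Claude 3.5 Sonnet's TableQA precision result is noteworthy but not the whole story.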
[II]: 🏅Nobel Laureates Discuss the Future of AI in Science
In a recent AI for Science forum co-hosted by Google DeepMind and The Royal Society, one of the events featured a panel of four Nobel laureates – Demis Hassabis, Sir Paul Nurse, Jennifer Doudna, and John Jumper – discussing the groundbreaking applications of AI in science and the transformative potential of this technology. The conversation touched on various topics, including the impact of AlphaFold on protein structure prediction, the challenges of building a virtual cell, and the importance of incorporating social sciences into AI development. The panelists also emphasized the need for interdisciplinary collaboration and public engagement to ensure the responsible development and application of AI for the benefit of society.
[III]: 🧫Smart People and Boring Jobs in Biology
In this really nice recent essay, the author argues that biotech is often defined by its pursuit of ambitious moonshot ideas, from curing aging to de novo protein synthesis. However, the field tends to overlook the "boring but impactful" solutions that drive meaningful progress. By contrast, software companies like Stripe and DocuSign thrive by addressing practical problems at scale. Biotech could benefit from adopting a similar mindset, focusing on areas such as lab services, drug repurposing, and automation.
He pointed out that Contract Research Organizations (CROs) frequently fall short due to thin margins and inexperienced staff, leaving researchers to handle routine tasks themselves. Meanwhile, drug repurposing—reviving shelved drugs or identifying new uses for existing ones—offers a highly impactful yet underutilized approach. Lab automation, through advanced robotics and remote control, could significantly boost efficiency, while reproducibility and standardization remain pressing challenges, as inconsistent protocols often lead to wasted time and resources. These are some of the few things discussed in the essay.
As a Biochemistry and Molecular Biology PhD who did my fair share of wet-lab work, I find that his arguments resonate with me. And I would like to stick a tiny tree branch into the essay: how can AI, particularly large language models (LLMs), help? I think they present an opportunity to transform a lot of these areas. Here are a few:
Improving CRO Operations: AI agents could help streamline workflows in CROs by automating experiment planning, protocol optimization, and even quality control. LLMs could serve as virtual lab assistants, interpreting protocols and guiding inexperienced staff through complex procedures. This could address the lack of skilled personnel and enhance consistency in results.
Drug Repurposing and Repositioning: LLMs could comb through existing literature, clinical trial data, and real-world evidence to identify overlooked drug candidates for new indications. By combining natural language processing with knowledge graphs, they could connect dots that human researchers might miss, making drug repurposing more efficient.
Market Accessibility for Tools and Reagents: LLM-based platforms could democratize access to biotech tools by offering intelligent search and recommendation systems for reagents, lab supplies, or CRO services. For example, researchers could describe their project, and the AI would suggest the best tools or services, saving time and effort.
Scaling “Boring but Necessary” Ideas: We could even imagine using AI agents to identify unmet needs in biotech and help entrepreneurs develop business models around “boring” but essential innovations. By analyzing gaps in the ecosystem, they could guide new ventures toward areas where operational improvements are most needed.
[AI/LLM Engineering] 🤖🖥⚙
[I]: 🕵️Human Layer: Human in the Loop for AI Agents
A very interesting YC company here: they are building a human-in-the-loop layer for AI agents.
HumanLayer is an API and SDK that enables AI agents to contact humans for feedback, input, and approvals. It guarantees human oversight of high-stakes function calls with approval workflows across Slack, email, and more: bring your LLM and framework of choice and give your AI agents safe access to the world.
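The underlying pattern is simple: gate high-stakes tool calls behind a human approval step. Here is a minimal, framework-agnostic sketch of the idea — the names are illustrative, not HumanLayer's actual API, and the approver callback stands in for a real channel like Slack or email:

```python
from typing import Callable

def require_approval(approver: Callable[[str], bool]):
    """Decorator that blocks a high-stakes function until a human approves.

    `approver` is any channel that asks a human and returns True/False --
    in production a Slack message or email thread, here just a callback.
    """
    def wrap(fn):
        def gated(*args, **kwargs):
            request = f"Approve call to {fn.__name__} with {args}, {kwargs}?"
            if not approver(request):
                # Human said no: return a structured denial instead of acting
                return {"status": "denied", "call": fn.__name__}
            return fn(*args, **kwargs)
        return gated
    return wrap

@require_approval(approver=lambda msg: True)  # auto-approve for the demo
def send_wire_transfer(amount: int, to: str):
    return {"status": "sent", "amount": amount, "to": to}

print(send_wire_transfer(500, to="acct-123"))
```

The value of a dedicated product is everything around this stub: routing the request to the right person, timeouts, audit logs, and async approvals while the agent keeps working on other tasks.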
[II]: 🖱Anthropic’s Computer Use
Claude’s computer-use functionality, unveiled in the new Claude 3.5 Sonnet model, enables the AI model to interact with computer interfaces similarly to a human. This beta feature allows Claude to execute tasks by visually analyzing a screen, moving a cursor, clicking on elements, and inputting text. It can automate operations, orchestrate tasks, and even write code. Although experimental and occasionally prone to errors, this capability significantly enhances the AI’s potential to complete multi-step tasks autonomously, reducing the need for continuous user involvement and potentially revolutionizing AI-assisted workflows. To mitigate associated risks, Anthropic advises employing safeguards such as dedicated virtual machines and restricted access to sensitive information.
See Anthropic Computer Use with Replit. Needless to say, computer use is going to be a game-changer.
[III]: 👨💻Anthropic’s New Agent Protocol
Anthropic recently introduced the Model Context Protocol (MCP), an innovative open-source framework poised to transform AI interaction with data repositories. MCP establishes a standardized method for linking AI assistants to diverse databases, enterprise tools, and development platforms. This approach addresses the challenges of fragmented integrations, enabling developers to create secure, bi-directional connections between AI-driven applications and various data sources using a straightforward client-server architecture. Early adopters like Block and Apollo have already implemented MCP, and tools such as Zed, Replit, and Codeium are in the process of incorporating support. By facilitating AI systems in maintaining context across multiple tools and datasets, MCP promises to enhance the precision and utility of AI outputs.
Here is a demo on how to set it up for your computer; see also the documentation.
Some asked how LangChain tools compare with MCP on the LangChain subreddit. See this Perplexity Page I curated on the subject.
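To give a feel for how lightweight the client side is: connecting Claude Desktop to a local MCP server is just an entry in its configuration file. A minimal sketch using the official filesystem server, with a placeholder directory path you would swap for your own:

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/allowed/dir"]
    }
  }
}
```

The client launches the server as a subprocess and speaks the protocol over stdio, which is what lets any MCP-aware app reuse the same server without bespoke integration code.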
[AI X Industry + Products] 🤖🖥👨🏿💻
[I]: 👨🏿💻 ChatGPT: Work with Apps
OpenAI’s “Work with Apps” feature for the ChatGPT desktop app on macOS introduces a seamless way for the AI to interact with compatible applications, offering smarter, context-aware assistance. It can read and process content from tools like VS Code, Xcode, TextEdit, and Terminal, allowing users to streamline workflows by automatically sharing relevant text or code without the need for manual copy-pasting. Users can also highlight specific sections to guide the AI’s focus, enhancing productivity in coding and writing tasks. While it doesn’t write code directly into applications or handle visual elements like images or videos, this feature is designed to excel in text-based environments and requires extensions for some apps, such as VS Code, to function effectively. Here is a short tutorial.
If you don't have a Cursor subscription, for example, you will find this new feature very impressive.
[II]: 🕵️ Microsoft Launches 10 New Agents
Microsoft recently unveiled a series of new AI agents at Ignite 2024.
They announced 10 new autonomous AI agents that, Microsoft says, will revolutionize the way businesses operate. These agents are integrated into Microsoft's Dynamics 365 platform and target key areas such as sales, customer service, finance, and supply chain management. Some notable examples include a sales qualification agent that automates lead generation and prioritization, a customer intent agent that intelligently routes customer inquiries, and a case management agent that streamlines customer service interactions. Microsoft's aggressive approach to developing and deploying these agents underscores its commitment to providing businesses with AI-powered solutions that enhance efficiency, reduce manual workloads, and improve overall productivity.
[III]: 🌲OpenAI: 12 Days of Shipmas
As a spin on the 12 days of Christmas, OpenAI is running its own 12 Days of OpenAI, launching a feature or a product every weekday for 12 days. At the time of writing this newsletter, they are four days in.
Day 1: OpenAI o1 and o1 pro mode
This demo introduced the full version of the o1 model, which is smarter, faster, and multimodal compared to its preview version. It excels in various tasks, including math competitions, coding, and general question answering. They also launched ChatGPT Pro, a new tier offering unlimited access to models like o1 and GPT-4o, advanced voice mode, and a special o1 pro mode for even more compute power. The pro mode is ideal for power users tackling complex problems, they say. o1 pro mode also prioritizes reliability, ensuring more accurate answers. In the demo, the team demonstrated o1's capabilities, including its speed, proficiency in handling hard problems, and multimodal capabilities, particularly with image understanding.
Day 2: Reinforcement Fine-Tuning
OpenAI introduced a new model customization feature called reinforcement fine-tuning (RFT) that allows users to fine-tune OpenAI's o1 models on their own datasets using reinforcement learning (see how reinforcement fine-tuning is different). This technique enables the models to reason more effectively over custom domains, achieving expert-level performance with just a few dozen examples. A computational biologist from Berkeley Lab joined the demo, and they demonstrated how RFT helped improve the model's ability to identify potential genetic causes for rare diseases based on patient symptoms. Additionally, they showed how RFT significantly boosted the performance of o1-mini, surpassing even the full o1 model on this task.
Day 3: Sora.
OpenAI released Sora to the general public after several months in closed release. The AI model generates videos from text prompts, images, or even a sequence of instructions. The demo showcased Sora's ability to create realistic and creative videos, along with features like "storyboard" for directing multi-scene videos, "remix" for easily modifying existing videos, and "blend" for merging different scenes. They highlighted the "explore" feed where users can share and learn from each other's creations, much like Midjourney's. I have tried it myself to generate this not-so-complicated scene.
I used it, together with Runway ML, Midjourney, and some other AI tools, to create this demo video of a Pixar-styled animation of myself.
If I can make out the time, I plan to launch a full series of this animation on my Instagram page in 2025.
Day 4: Canvas
On day four, they introduced significant updates to Canvas, their collaborative workspace within ChatGPT. Previously a beta feature for Plus users, Canvas is now available to all users, including those on the free plan. I have tried Canvas but I haven’t really gotten into it, probably because of my use cases. The new updates include the ability to run Python code directly within Canvas (fun), complete with syntax highlighting, auto-completion, and a WebAssembly Python emulator for instant execution and graphic generation. They also integrated Canvas with custom GPTs, allowing users to build specialized AI assistants that leverage Canvas for tasks like drafting letters or generating reports.
[AI + Commentary] 📝🤖📰
[I]: 💵Chegg, ChatGPT, and the Economics of BigTech
As a backdrop to the ongoing AI revolution, this very well written essay explores how companies must adapt to a new era of rapid and disruptive change.
Here is the gist of it. Chegg, once a leading provider of online homework help, has seen its value plummet due to the rise of AI, particularly ChatGPT. This highlights a shift from operational uncertainty (challenges within a stable environment) to structural uncertainty (where the entire playing field changes). Companies like Chegg, built on providing human-generated answers, now face competition from AI that offers instant and comprehensive knowledge. To survive such shifts, the author argues that businesses must rethink the competitive landscape, focusing on substitutes rather than just competitors, and reimagine their winning game. This may involve adapting to new dominant designs, as BlackBerry failed to do with the iPhone, or leveraging temporary demand shifts to create lasting supply-side advantages, as seen in the quick commerce industry. Ultimately, navigating this new era of rapid change requires a clear understanding of the nature of change itself and the ability to distinguish between operational and structural responses.
I like this one quote from the essay:
“Stay focused on your competitors and watch substitutes eat your lunch.”
[II]: ⚙ When We Become Cogs
This essay is a short, fascinating exploration of the paradox of AI and automation, where tools that dramatically boost productivity often erode job satisfaction by removing the most engaging aspects of work. Examples from materials scientists and software developers, cited in the essay, show how automation shifts the focus from creative, exploratory tasks to evaluating or executing machine-generated outputs. While this empowers less skilled individuals, closing productivity gaps, it alienates experts whose strengths lie in creative or strategic thinking. This reflects a broader historical trend where automation fragments roles, reducing autonomy and purpose, from assembly lines to algorithmic management.
Fulfillment in work is closely tied to mastery, autonomy, and purpose, yet automation often undermines these pillars. I suppose the challenge is to redefine what makes work meaningful in an AI-driven world. For some, maybe it means finding satisfaction in curating and refining machine outputs or shifting to broader, interdisciplinary pursuits. For others, it might involve stepping outside work entirely to embrace hobbies, arts, or community contributions. Ultimately, the future hinges on whether humans adapt to find new forms of creativity and purpose or face deeper alienation as automation takes over increasingly complex tasks.
[III]: 💼What Jobs Do AIs Do Better than Humans?
Theory Ventures writes about domains where AI has a structural advantage in job automation. They argue that as LLMs mature, it’s clear they can’t consistently match human-level expertise in many roles, but they excel in areas where high volume and consistency are more important than nuanced judgment. Just as a team of near-infinite interns might review documents or draft communications at scale, AI systems thrive in “high-throughput” tasks: combing through thousands of pages, crafting highly personalized marketing messages, analyzing massive sets of financial data, or managing endless supplier emails—all with tireless focus and perfect recall.
In these domains, AI doesn’t merely keep pace with human workers; it surpasses them, enabling entirely new capabilities rather than just cutting costs. Security operations centers can investigate every alert, not just a subset. Marketers can tailor campaigns to millions of users individually. Investment analysts can evaluate entire sectors, not just a handful of companies. Supply chain managers can track countless vendors and detect issues in real-time. For startups and enterprises alike, the future lies in embracing these AI-driven roles to enhance productivity, strategy, and growth.
[IV]: 💻15 Times to use AI, and 5 Not to
This essay explores when and how to use AI tools effectively in various kinds of work, emphasizing that understanding their capabilities and limits is a form of wisdom rather than a simple checklist. AI can help generate large volumes of ideas, summarize and translate information, offer multiple perspectives, and provide a solid first pass for certain tasks—particularly when you already have the expertise to verify its suggestions. However, you should be wary of using AI when you need deep learning, absolute accuracy, or when the process of struggling through a task is itself valuable. The balance between AI’s transformative potential and its subtle pitfalls is constantly shifting as the technology evolves, reminding us that discernment and adaptability are crucial in deciding when, and when not, to rely on AI.
[V]: 🎙 Podcast on AI and GenAI
(Additional) podcast episodes I listened to over the past few weeks
Please share this post with your friends and network if you found it informative!