Forem: Srav Nayani

AI Agent Building Block: Native App Automation

Srav Nayani — Thu, 09 Oct 2025 04:03:38 +0000

Code for this article is available at https://github.com/shravyanayani/automation

What is Artificial Intelligence

Artificial Intelligence (AI) refers to a computer system's ability to perform tasks that usually need human intelligence. These tasks include learning, reasoning, problem-solving, and understanding language.

The key components of AI are data, algorithms, and models.

Data provides the examples or information that AI learns from. Algorithms are the step-by-step methods that process this data to find patterns or make decisions.
The model is the outcome of the trained algorithms. It uses what it has learned from the data to predict or act on new inputs.
These components work together so AI systems can keep improving their performance and make smart decisions in real-world situations.

What is AI Agent

An AI agent is a system that can sense its environment, make decisions, and act to reach specific goals, often without ongoing human help.

The key components of an AI agent include the perception module, decision-making module, and action module.

The perception module collects information from the environment using sensors or data inputs and makes sense of it.
The decision-making module uses algorithms or models to select the best action based on goals, rules, or past experiences.
Finally, the action module executes those decisions, and the agent continuously repeats this cycle to learn and improve over time.
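
The perceive, decide, act cycle described above can be sketched in a few lines of Python. This is only a toy illustration: the `CounterEnvironment` and `SimpleAgent` names are invented for this example, the environment is a single number, and the learning step is left out.

```python
from dataclasses import dataclass, field


@dataclass
class CounterEnvironment:
    """Toy environment: a number the agent tries to drive to a goal value."""
    value: int = 0


@dataclass
class SimpleAgent:
    goal: int
    history: list = field(default_factory=list)  # (observation, action) pairs

    def perceive(self, env: CounterEnvironment) -> int:
        # Perception module: read the current state of the environment.
        return env.value

    def decide(self, observation: int) -> str:
        # Decision-making module: pick the action that moves toward the goal.
        if observation < self.goal:
            return "increment"
        if observation > self.goal:
            return "decrement"
        return "stop"

    def act(self, env: CounterEnvironment, action: str) -> None:
        # Action module: apply the chosen action to the environment.
        if action == "increment":
            env.value += 1
        elif action == "decrement":
            env.value -= 1

    def run(self, env: CounterEnvironment, max_steps: int = 100) -> int:
        # Repeat the sense -> decide -> act cycle until the goal is reached.
        for step in range(max_steps):
            observation = self.perceive(env)
            action = self.decide(observation)
            self.history.append((observation, action))
            if action == "stop":
                return step
            self.act(env, action)
        return max_steps
```

Calling `SimpleAgent(goal=3).run(CounterEnvironment())` walks the counter up to the goal and then stops. A real agent would replace the counter with screen or API observations and the fixed rules with a model.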

What is Native Desktop App Automation

Native desktop app automation, or native automation, uses scripts to automatically perform tasks on native desktop apps, such as Windows native applications. This includes filling out forms, clicking buttons, scraping data, recognizing data, or testing applications without manual effort.

A key component of native automation is the object detection technology: Windows native object detection, OCR (Optical Character Recognition) based text detection, or AI-based pattern and object detection.

Native Object Detection: Apps built with native technology, or with any high-level code that compiles to native code, use native operating system controls. For example, the buttons in such Windows apps are native Windows button objects, which can be detected and controlled through the Windows SDK. Java AWT code, for instance, uses native Windows controls when run on Windows.
OCR (Optical Character Recognition): OCR uses pattern recognition to detect characters and objects on the screen. The detected characters can be scraped or extracted as text in code, and the detected objects can be controlled, for example by clicking them or entering text.
AI-based pattern and object detection: AI can detect objects by pattern, for example finding a specific button from the text on it, or associating labels with controls by proximity.
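
As a rough illustration of the "associating labels and controls by proximity" idea, the sketch below pairs each detected control with the label whose center is closest. The `Box` type and the coordinates are assumptions made up for this example; a real detector (OCR or a vision model) would supply the bounding boxes, and a production heuristic would probably also weigh alignment and reading direction.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Box:
    """A detected screen element: a name or text plus its bounding box in pixels."""
    name: str
    left: int
    top: int
    width: int
    height: int

    @property
    def center(self) -> tuple[float, float]:
        return (self.left + self.width / 2, self.top + self.height / 2)


def nearest_label(control: Box, labels: list[Box]) -> Box:
    """Return the label whose center is closest to the control's center."""
    cx, cy = control.center
    return min(labels, key=lambda lb: hypot(lb.center[0] - cx, lb.center[1] - cy))


def associate(controls: list[Box], labels: list[Box]) -> dict[str, str]:
    """Map each control name to the text of its nearest label."""
    return {c.name: nearest_label(c, labels).name for c in controls}


# Hypothetical detections: two text labels, each with an edit box to its right.
labels = [Box("Symbol", 10, 10, 60, 20), Box("Quantity", 10, 50, 60, 20)]
controls = [Box("edit_1", 80, 10, 100, 20), Box("edit_2", 80, 50, 100, 20)]
pairs = associate(controls, labels)
```

Here `pairs` maps `edit_1` to the "Symbol" label and `edit_2` to the "Quantity" label, which is what a human would infer from the layout.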

Once the objects are identified, native mouse and keyboard APIs can be used to simulate human actions.

Apart from object detection, scripts are needed to perform actions on the controls, simulating human mouse and keyboard actions. Once such building-block scripts are available, they can be combined into higher-level functionality that serves specific goals.

These components enable developers to test, monitor, or interact with apps effectively and reliably.

Native Automation vs API

API (Application Programming Interface) integration lets an AI agent interact directly with an external system’s backend through structured requests. API integration is quick, efficient, and dependable. However, it requires that the external system offers an API and that the AI agent is granted access to it.

Native Automation works through the app's user interface, so no API needs to be set up. Its drawbacks are occasional unreliability, slowness, and the complexity of the automation.

Web Automation vs Native App Automation

Web automation and native app automation involve using software to automatically test or interact with applications, but they target different platforms and use different tools.

Web automation focuses on automating actions in web browsers, such as Chrome or Edge. It uses tools like Selenium or Playwright. This method interacts with web elements, including HTML, CSS, and JavaScript, through a web driver. It works across various browsers and operating systems.

Native app automation, in contrast, targets mobile or desktop applications specifically designed for platforms like Android, iOS, or Windows. It employs tools like Appium, Espresso, or XCUITest, which communicate directly with the app’s native UI components instead of going through a browser.

Web automation relies on the Web Browser's DOM (Document Object Model) while native app automation relies on UI elements defined by the operating system.

In short, web automation tests websites, while native app automation tests standalone apps. Both ensure that software functions correctly in their respective environments.

How Native Automation can be a building block for AI Agents

Native automation can be one of the most powerful capabilities of an AI system, because it lets the system act on native apps and interact with them much as a human user would.

Through native automation, an AI agent can navigate apps, collect data, fill in forms, or start a process, which gives it access to up-to-date information and lets it perform tasks without human intervention.

The automation layer physically does the clicking, typing, or scraping, while the AI layer supplies the intelligence by deciding what to do, why, and when, based on objectives or learned patterns.

As an illustration, an AI agent may use natural language understanding to interpret a request (“purchase a stock at a specific price”) and then, through native automation, open the relevant app, check the stock price, wait until the price condition is met, and purchase the stock.

Ultimately, combining AI-driven decision-making with native automation execution lets agents move from thought to deed and apply their insights in the real world.
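
One way to picture the split between the deciding layer and the automation layer in the stock example is to put the UI behind a small interface. The sketch below is an assumption for illustration only: `BrokerageUI`, `ScriptedUI`, and `buy_when_price_met` are made-up names, and the scripted stand-in replays a fixed list of prices instead of driving a real app.

```python
from typing import Optional, Protocol


class BrokerageUI(Protocol):
    """What the deciding layer needs from the automation layer (hypothetical)."""

    def read_price(self, symbol: str) -> float: ...

    def place_buy_order(self, symbol: str, shares: int) -> None: ...


def buy_when_price_met(ui: BrokerageUI, symbol: str, target_price: float,
                       shares: int, max_checks: int) -> Optional[int]:
    """Poll the price; buy on the first check at or below the target.

    Returns the number of the check that triggered the order, or None if the
    condition was never met. A real loop would pause between checks.
    """
    for check in range(1, max_checks + 1):
        price = ui.read_price(symbol)
        if price <= target_price:
            ui.place_buy_order(symbol, shares)
            return check
    return None


class ScriptedUI:
    """Stand-in for a real automation layer: replays a fixed list of prices."""

    def __init__(self, prices: list[float]) -> None:
        self._prices = iter(prices)
        self.orders: list[tuple[str, int]] = []

    def read_price(self, symbol: str) -> float:
        return next(self._prices)

    def place_buy_order(self, symbol: str, shares: int) -> None:
        self.orders.append((symbol, shares))
```

Because the decision function only sees the interface, the same logic could later be pointed at an AutoIt or PyWinAuto backed implementation, or replaced by an AI engine, without touching the clicking and typing code.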

Technology Choices for Native Automation

The technology choices for native automation depend mainly on the tasks to be executed, the platforms to be targeted, and the degree of intelligence or scalability desired.

The main classes of technology involved are automation frameworks, programming languages, and supporting tools.

Automation Frameworks – The most commonly cited approaches are native SDK based, OCR based, and AI based.

Native SDK based detection is the most fundamental, and comparatively the most reliable, of the approaches listed here.
OCR based object and text detection works when the app doesn't use native objects. For example, if the UI is built with the Java Swing API, the controls are not native ones; Swing draws them with low-level drawing operations. In such cases the native SDK cannot detect the UI controls, so OCR is used to identify the shapes and coordinates of the controls.
AI-based object detection goes one step further than simple OCR. It can tolerate changes such as a control being renamed or moved, and still identify the object much as a human would.
A combination of these techniques can be used to build reliable automation systems.
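
A simple way to combine the techniques is to try them in order of reliability and fall back to the next one when a control is not found. The sketch below assumes each technique is wrapped in a function that takes a control description and returns screen coordinates or `None`. The two detectors here are stubs with made-up coordinates, not real SDK or OCR calls.

```python
from typing import Callable, Optional

# A detector takes a control description and returns (x, y) screen coordinates,
# or None when it cannot find the control.
Detector = Callable[[str], Optional[tuple[int, int]]]


def locate(description: str,
           detectors: list[tuple[str, Detector]]) -> tuple[str, tuple[int, int]]:
    """Try each detector in order; return (detector_name, coordinates) of the first hit."""
    for name, detect in detectors:
        try:
            coordinates = detect(description)
        except Exception:
            # Treat a crashing detector like a miss and move on to the next one.
            coordinates = None
        if coordinates is not None:
            return name, coordinates
    raise LookupError(f"No detector could find {description!r}")


def native_sdk_detector(description: str) -> Optional[tuple[int, int]]:
    # Stub: pretend the native SDK only knows about the Trade button.
    return {"Trade button": (400, 100)}.get(description)


def ocr_detector(description: str) -> Optional[tuple[int, int]]:
    # Stub: pretend OCR can find a custom-drawn price label.
    return {"Price label": (200, 200)}.get(description)


detectors = [("native", native_sdk_detector), ("ocr", ocr_detector)]
```

With this ordering, `locate("Trade button", detectors)` is answered by the native detector, while `locate("Price label", detectors)` falls through to OCR. An AI-based detector could be appended as a third entry.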

Programming Languages – The usual picks are Python, Java, JavaScript, and C#, and the choice depends on the proficiency of the development team and the integration requirements.

Supporting Tools – Automation frameworks are commonly integrated with a scheduler or trigger that runs the automation conditionally.

These are complementary rather than competing technologies: the language expresses the logic, the framework interacts with the apps, and the tools integrate the automation with its environment.

Challenges with Native Automation

Here are a few key challenges with native automation:

Non-Native Elements – Apps built with a non-native SDK are difficult to automate; in that case pattern-based object detection such as OCR or AI must be used.
Platform Compatibility – Apps built for a specific platform generally do not run on other platforms, and neither do their automation scripts. If apps are built with platform-neutral technologies like Java Swing, special-purpose tools are needed to automate them.
Synchronization Issues – "Element not found" errors appear when the script looks for an element even a fraction of a second before it is available, so proper waiting or timing control should be used.
Maintenance Overhead – When an app's layout or functionality changes, the automation scripts must be updated before they can run again.
Authentication and Security Barriers – Security features such as MFA (Multi-Factor Authentication) can block automation.
Scalability and Performance – Large-scale automation (e.g., running parallel tests) can require a lot of resources and a well-thought-out infrastructure.
Handling Non-Standard Elements – Unlike regular UI components, complex ones such as canvases, pop-ups, and drag-and-drop areas can be hard to automate.

These challenges can make app automation difficult, but thoughtful design, robust frameworks, and continuous maintenance make it powerful.
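
For the synchronization issue in particular, a common pattern is to poll for a condition with a timeout instead of sleeping for a fixed time. The helper below is a minimal, library-independent sketch; `wait_until` and its parameters are names chosen for this example. Libraries such as pywinauto ship their own wait helpers, which would normally be preferred.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until(probe: Callable[[], T], timeout: float = 10.0,
               interval: float = 0.25) -> T:
    """Call probe() repeatedly until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        time.sleep(interval)


# Example probe: a control that only "appears" on the third look.
attempts = {"count": 0}


def find_button():
    attempts["count"] += 1
    return "button" if attempts["count"] >= 3 else None


found = wait_until(find_button, timeout=2.0, interval=0.01)
```

The script continues as soon as the element shows up, and a missing element fails with a clear `TimeoutError` rather than an obscure error several steps later.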

Coding sample Native Automation using AutoIT

The following Windows app automation code is written in AutoIt, a BASIC-like scripting language. The automated app is a brokerage app that provides stock quotes, including price and volume. The script polls the price and volume to decide whether all the conditions for purchasing the stock are met. In practice, the decision making can be handed off to an AI engine; the automation script then provides automation services to the AI agent, so that each tool handles the part it does best. This is written for educational purposes only. The script cannot be used as-is, because real-world use of a stock trading application needs more sophisticated logic and far more safeguards for handling financial information.

#include <AutoItConstants.au3>
#include <MsgBoxConstants.au3>
#include <WindowsConstants.au3>
#include <GUIConstantsEx.au3>

; Configuration Variables
Global $SYMBOL = "AAPL"              ; Stock symbol to monitor
Global $TARGET_PRICE = 150.00        ; Target price to trigger buy
Global $TARGET_VOLUME = 100000       ; Minimum volume required
Global $SHARES_TO_BUY = 100          ; Number of shares to buy
Global $CHECK_INTERVAL = 30          ; Seconds between each check
Global $FIDELITY_WINDOW = "Fidelity Active Trader Pro"  ; Main window title

; Function to check if Fidelity application is running
Func IsFidelityRunning()
    If WinExists($FIDELITY_WINDOW) Then
        Return True
    Else
        MsgBox($MB_ICONERROR, "Error", "Fidelity Active Trader Pro is not running!")
        Return False
    EndIf
EndFunc

; Function to activate Fidelity window
Func ActivateFidelity()
    If Not WinActivate($FIDELITY_WINDOW) Then
        Return False
    EndIf
    Sleep(1000)  ; Wait for window to activate
    Return True
EndFunc

; Function to enter symbol
Func EnterSymbol()
    ; Activate symbol input field (you'll need to adjust coordinates)
    MouseClick("left", 100, 100)
    Sleep(500)
    Send($SYMBOL)
    Sleep(500)
    Send("{ENTER}")
    Sleep(1000)
EndFunc

; Function to get current price
Func GetCurrentPrice()
    ; PLACEHOLDER: PixelGetColor returns the color of a single pixel, not the price.
    ; Replace it with OCR of the price region before comparing against $TARGET_PRICE.
    Local $price = PixelGetColor(200, 200)  ; Replace with actual coordinates
    Return $price
EndFunc

; Function to get current volume
Func GetCurrentVolume()
    ; PLACEHOLDER: as above, this reads a pixel color, not the volume.
    ; Replace it with OCR of the volume region before comparing against $TARGET_VOLUME.
    Local $volume = PixelGetColor(300, 200)  ; Replace with actual coordinates
    Return $volume
EndFunc

; Function to place buy order
Func PlaceBuyOrder()
    ; Click Trade button
    MouseClick("left", 400, 100)  ; Adjust coordinates
    Sleep(1000)

    ; Enter number of shares
    Send($SHARES_TO_BUY)
    Sleep(500)

    ; Click Buy button
    MouseClick("left", 500, 150)  ; Adjust coordinates
    Sleep(1000)

    ; Confirm order
    MouseClick("left", 550, 200)  ; Adjust coordinates
    Sleep(1000)
EndFunc

; Main monitoring loop
While 1
    If Not IsFidelityRunning() Then
        Exit
    EndIf

    If Not ActivateFidelity() Then
        Sleep($CHECK_INTERVAL * 1000)  ; Avoid a tight retry loop if activation fails
        ContinueLoop
    EndIf

    EnterSymbol()

    Local $currentPrice = GetCurrentPrice()
    Local $currentVolume = GetCurrentVolume()

    ConsoleWrite("Current Price: " & $currentPrice & " Volume: " & $currentVolume & @CRLF)

    ; Check if conditions are met
    If $currentPrice <= $TARGET_PRICE And $currentVolume >= $TARGET_VOLUME Then
        PlaceBuyOrder()
        MsgBox($MB_ICONINFORMATION, "Order Placed", "Buy order placed for " & $SHARES_TO_BUY & " shares of " & $SYMBOL)
        Exit
    EndIf

    Sleep($CHECK_INTERVAL * 1000)
WEnd

Coding sample Native Automation using PyWinAuto

The following Windows app automation code is written in Python using the PyWinAuto library. The automated app is a brokerage app that provides stock quotes, including price and volume. The script polls the price and volume to decide whether all the conditions for purchasing the stock are met. In practice, the decision making can be handed off to an AI engine; the automation script then provides automation services to the AI agent, so that each tool handles the part it does best. This is written for educational purposes only. The script cannot be used as-is, because real-world use of a stock trading application needs more sophisticated logic and far more safeguards for handling financial information.

import os
import time
import logging
from datetime import datetime
from dotenv import load_dotenv
from pywinauto.application import Application
from pywinauto.keyboard import send_keys
import cv2
import numpy as np
import pytesseract
from PIL import ImageGrab

# Load environment variables
load_dotenv()

# Configure logging
logging.basicConfig(
    filename='fidelity_trader.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

class FidelityTrader:
    def __init__(self):
        # Trading Configuration
        self.symbol = os.getenv('STOCK_SYMBOL', 'AAPL')
        self.target_price = float(os.getenv('TARGET_PRICE', '150.0'))
        self.target_volume = int(os.getenv('TARGET_VOLUME', '100000'))
        self.shares_to_buy = int(os.getenv('SHARES_TO_BUY', '100'))
        self.check_interval = int(os.getenv('CHECK_INTERVAL', '30'))

        # Application Configuration
        self.app_path = os.getenv('FIDELITY_APP_PATH', '')
        self.window_title = "Fidelity Active Trader Pro"
        self.app = None
        self.main_window = None

    def connect_to_application(self):
        """Connect to Fidelity application"""
        try:
            # Try to connect to running instance
            self.app = Application(backend="uia").connect(title=self.window_title)
            logging.info("Connected to existing Fidelity application")
        except Exception as e:
            try:
                # Launch new instance if not running
                self.app = Application(backend="uia").start(self.app_path)
                logging.info("Launched new Fidelity application")
            except Exception as launch_error:
                logging.error(f"Failed to launch Fidelity: {launch_error}")
                raise

        self.main_window = self.app.window(title=self.window_title)
        self.main_window.set_focus()

    def capture_region(self, region):
        """Capture a specific region of the screen"""
        screenshot = ImageGrab.grab(bbox=region)
        return np.array(screenshot)

    def get_text_from_image(self, image):
        """Extract text from image using OCR"""
        gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)  # ImageGrab returns RGB, not BGR
        text = pytesseract.image_to_string(gray, config='--psm 6')
        return text.strip()

    def get_price(self):
        """Get current stock price from the application"""
        try:
            # Adjust coordinates based on where price appears in the application
            price_region = (100, 100, 200, 130)  # Example coordinates
            price_image = self.capture_region(price_region)
            price_text = self.get_text_from_image(price_image)
            return float(price_text.replace('$', '').strip())
        except Exception as e:
            logging.error(f"Error getting price: {e}")
            return None

    def get_volume(self):
        """Get current trading volume from the application"""
        try:
            # Adjust coordinates based on where volume appears in the application
            volume_region = (300, 100, 400, 130)  # Example coordinates
            volume_image = self.capture_region(volume_region)
            volume_text = self.get_text_from_image(volume_image)
            return int(volume_text.replace(',', '').strip())
        except Exception as e:
            logging.error(f"Error getting volume: {e}")
            return None

    def enter_symbol(self):
        """Enter stock symbol in the application"""
        try:
            # Find and click symbol input field
            symbol_input = self.main_window.child_window(auto_id="symbolInput")
            symbol_input.click_input()
            send_keys(self.symbol)
            send_keys('{ENTER}')
            logging.info(f"Entered symbol: {self.symbol}")
        except Exception as e:
            logging.error(f"Error entering symbol: {e}")
            raise

    def place_buy_order(self):
        """Place a buy order"""
        try:
            # Click trade button
            trade_button = self.main_window.child_window(title="Trade")
            trade_button.click_input()

            # Enter number of shares
            shares_input = self.main_window.child_window(auto_id="sharesInput")
            shares_input.type_keys(str(self.shares_to_buy))

            # Click buy button
            buy_button = self.main_window.child_window(title="Buy")
            buy_button.click_input()

            # Confirm order
            confirm_button = self.main_window.child_window(title="Confirm")
            confirm_button.click_input()

            logging.info(f"Buy order placed for {self.shares_to_buy} shares of {self.symbol}")
            return True
        except Exception as e:
            logging.error(f"Error placing buy order: {e}")
            return False

    def monitor_stock(self):
        """Main monitoring loop"""
        logging.info(f"Starting monitoring for {self.symbol}")
        logging.info(f"Target price: ${self.target_price}")
        logging.info(f"Target volume: {self.target_volume}")

        while True:
            try:
                self.enter_symbol()
                current_price = self.get_price()
                current_volume = self.get_volume()

                if current_price is not None and current_volume is not None:
                    logging.info(f"Current price: ${current_price}, Volume: {current_volume}")

                    if current_price <= self.target_price and current_volume >= self.target_volume:
                        if self.place_buy_order():
                            logging.info("Order executed successfully")
                            break
                        else:
                            logging.error("Failed to execute order")

                time.sleep(self.check_interval)

            except Exception as e:
                logging.error(f"Error in monitoring loop: {e}")
                time.sleep(self.check_interval)

if __name__ == "__main__":
    try:
        trader = FidelityTrader()
        trader.connect_to_application()
        trader.monitor_stock()
    except Exception as e:
        logging.critical(f"Critical error: {e}")
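
The `get_price` and `get_volume` methods above pass the OCR text almost directly to `float` and `int`, which fails as soon as the OCR result carries extra characters. A slightly more forgiving parser could look like the sketch below. The function names are made up for this example, and it only strips surrounding text and thousands separators; it does not correct OCR misreads such as a letter "O" in place of a zero.

```python
import re
from typing import Optional

# First run of digits, optionally with thousands separators and a decimal part.
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_price(text: str) -> Optional[float]:
    """Extract the first number from OCR text as a float, or None if there is none."""
    match = _NUMBER.search(text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


def parse_volume(text: str) -> Optional[int]:
    """Extract the first number from OCR text as a whole number, or None."""
    match = _NUMBER.search(text)
    if match is None:
        return None
    return int(float(match.group().replace(",", "")))
```

Returning `None` for unreadable text lets the monitoring loop skip that cycle instead of acting on a bad reading.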

Testing Native Automation

Testing native automation code is vital to make sure that it works consistently and copes with the situations that arise in real-world use.

Native automation testing can use assertions to verify that expected controls have appeared, that the data is accurate, and that navigation steps have completed successfully.

Wait conditions, rather than fixed delays, are an effective way to handle data that loads dynamically.

Try your automated actions on various machines and operating system versions to check whether the behavior is the same.

Integrating Native Automation with AI Agent

A native automation script is excellent at handling repetitive and predictable tasks, such as clicking buttons, filling out forms, moving between screens, or extracting data.

Conversely, an AI agent is capable of logical thinking, planning, learning, and making judgments based on available data.

When you combine these two, you get a smart system where the AI is in charge of the operations and the automation carries out the tasks.

Typical AI Agent System Components

AI Decision Engine - Processes goals, rules, or user commands and decides the sequence of actions. Can use ML models, NLP, or rule-based systems.
Native Automation Layer - Executes low-level actions on apps via tools like AutoIt, PyWinAuto or WinAppDriver. Handles clicks, inputs, scrolling, navigation.
Perception / Data Extraction Module - Observes the app environment and extracts relevant information (prices, volumes, stock data). Feeds this back to the AI agent.
Feedback / Learning Module - Evaluates outcomes of automated actions (e.g., did the AI decide to purchase the stock under the right conditions?) and updates decision-making models for future improvements.
Scheduler / Controller - Coordinates the flow: triggers native automation when needed, handles retries, logs progress, and ensures proper sequencing of tasks.
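
As a small illustration of the "handles retries, logs progress" part of the Scheduler / Controller, the sketch below wraps any automation step in a retry loop with logging. The `run_with_retries` name, its parameters, and the logger name are assumptions for this example; a real controller would also decide which errors are worth retrying.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("agent.controller")  # hypothetical logger name


def run_with_retries(task: Callable[[], T], attempts: int = 3,
                     delay: float = 1.0) -> T:
    """Run task(), retrying on any exception, up to `attempts` tries in total."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            result = task()
            logger.info("Attempt %d succeeded", attempt)
            return result
        except Exception as error:
            logger.warning("Attempt %d of %d failed: %s", attempt, attempts, error)
            if attempt == attempts:
                raise  # Out of attempts: let the caller see the last error.
            time.sleep(delay)
    raise AssertionError("unreachable")  # The loop always returns or raises.


# Example task that fails twice (window not ready) and then succeeds.
outcomes = iter([RuntimeError("window not ready"),
                 RuntimeError("window not ready"),
                 "order placed"])


def flaky_step() -> str:
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


result = run_with_retries(flaky_step, attempts=3, delay=0)
```

The automation layer stays free of retry logic, and the controller keeps a log of every failed attempt for the feedback module to evaluate.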

AI Agent Building Block: Web Automation

Srav Nayani — Wed, 08 Oct 2025 23:52:46 +0000

Code for this article is available at https://github.com/shravyanayani/automation

What is Artificial Intelligence

The key components of AI are data, algorithms, and models.

Data provides the examples or information that AI learns from. Algorithms are the step-by-step methods that process this data to find patterns or make decisions.
The model is the outcome of the trained algorithms. It uses what it has learned from the data to predict or act on new inputs.
These components work together so AI systems can keep improving their performance and make smart decisions in real-world situations.

What is AI Agent

An AI agent is a system that can sense its environment, make decisions, and act to reach specific goals, often without ongoing human help.

The key components of an AI agent include the perception module, decision-making module, and action module.

The perception module collects information from the environment using sensors or data inputs and makes sense of it.
The decision-making module uses algorithms or models to select the best action based on goals, rules, or past experiences.
Finally, the action module executes those decisions, and the agent continuously repeats this cycle to learn and improve over time.

What is Web Automation

Web automation uses software or scripts to automatically perform tasks on websites. This includes filling out forms, clicking buttons, scraping data, or testing web applications without needing manual effort.

The key components of web automation are the web driver or automation tool, like Selenium, the scripts or test code, and the browser interface. The scripts instruct the automation tool on which actions to take. The automation tool then controls the browser to carry out those actions, while the browser renders the pages and responds as it would for a human user.

These components enable developers to test, monitor, or interact with websites effectively and reliably.

Web Automation vs API

Web Automation works through the site's user interface, so no API needs to be set up. Its drawbacks are occasional unreliability, slowness, and websites that block automation.

Web Automation vs Native App Automation

Web automation and native app automation involve using software to automatically test or interact with applications, but they target different platforms and use different tools.

Web automation relies on the Web Browser's DOM (Document Object Model) while native app automation relies on UI elements defined by the operating system.

In short, web automation tests websites, while native app automation tests standalone apps. Both ensure that software functions correctly in their respective environments.

How Web Automation can be a building block for AI Agents

Web automation can be one of the most powerful capabilities of an AI system, because it lets the system act on the web and interact with it much as a human user would.

Through web automation, an AI agent can navigate sites, collect data, fill in forms, or start a process, which gives it access to up-to-date information and lets it perform online tasks without human intervention.

As an illustration, an AI agent may use natural language understanding to interpret a request (“book a flight to Austin”) and then, through web automation, visit the relevant travel websites, compare prices, and make the booking.

Ultimately, combining AI-driven decision-making with web automation execution lets agents move from thought to deed and apply their insights in the real world.

Technology Choices for Web Automation

The technology choices for web automation depend mainly on the tasks to be executed, the platforms to be targeted, and the degree of intelligence or scalability desired.

The main classes of technology involved are automation frameworks, programming languages, and supporting tools.

Automation Frameworks – Selenium, Playwright, Cypress, and Puppeteer are the most commonly cited tools.

Selenium works with different browsers and with several languages (Java, Python, C#, among others).
Playwright and Puppeteer are more recent and somewhat faster alternatives, with headless browsing and parallel testing supported out of the box.
Cypress is a popular choice for front-end developers using modern JavaScript frameworks.

Programming Languages – The usual picks are Python, Java, JavaScript, and C#, and the choice depends on the proficiency of the development team and the integration requirements.

Supporting Tools – Automation frameworks are commonly integrated with a scheduler or trigger that runs the automation conditionally.

These are complementary rather than competing technologies: the language expresses the logic, the framework interacts with the browser, and the tools integrate the automation with its environment.

Challenges with Web Automation

Here are a few key challenges with web automation:

Dynamic Web Elements – Websites often use JavaScript or AJAX to update their content, which can change element IDs or structures and break automation scripts.
Browser Compatibility – Browsers render the same page in slightly different ways, so scripts need to be tested in several browsers to be sure they work consistently.
Synchronization Issues – "Element not found" errors appear when the script looks for an element even a fraction of a second before it has loaded, so proper waiting or timing control should be used.
Maintenance Overhead – When a website's layout or functionality changes, the test scripts must be updated before the tests can run again.
Authentication and Security Barriers – Security features such as captchas, multi-factor authentication, and rate limits can block automation.
Scalability and Performance – Large-scale automation (e.g., running parallel tests) can require a lot of resources and a well-thought-out infrastructure.
Handling Non-Standard Elements – Unlike regular UI components, complex ones such as canvases, pop-ups, and drag-and-drop areas can be hard to automate.

These challenges can make web automation difficult, but thoughtful design, robust frameworks, and continuous maintenance make it powerful.

Coding sample Web Automation

The following Selenium web automation code, written in Python, searches for flights on a travel site. This is written for educational purposes only. It cannot be used against the real travel website, because websites are generally protected by no-bot usage policies, which quickly come into play with a captcha or a puzzle asking the user to prove that a human is using the site. Web automation cannot pass this hurdle, so the automation script cannot be used without human supervision.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from datetime import datetime
import time

# Configuration
ORIGIN = "New York"  # Or airport code like "JFK"
DESTINATION = "Los Angeles"  # Or "LAX"
DEPARTURE_DATE = "15/10/2025"  # Format: DD/MM/YYYY (adjust based on site)

def setup_driver():
    """Set up Chrome driver in headless mode."""
    options = Options()
    options.add_argument("--headless")  # Run without UI
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    driver = webdriver.Chrome(options=options)
    return driver

def search_flights(driver):
    """Perform flight search."""
    wait = WebDriverWait(driver, 10)

    # Step 1: Navigate to Skyscanner
    driver.get("https://www.skyscanner.com/")
    time.sleep(2)  # Allow page load

    # Step 2: Select one-way trip (click if needed; Skyscanner defaults to round-trip, so toggle)
    try:
        one_way_button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[data-testid="trip-type-selector-one-way"]')))
        one_way_button.click()
        time.sleep(1)
    except Exception:  # Avoid a bare except, which would also swallow KeyboardInterrupt
        print("One-way button not found; assuming default.")

    # Step 3: Enter origin
    origin_input = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '[data-testid="origin-input"] input')))
    origin_input.clear()
    origin_input.send_keys(ORIGIN)
    time.sleep(1)

    # Click suggestion if appears (e.g., JFK)
    try:
        origin_suggestion = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[data-testid="suggestion-card"]')))
        origin_suggestion.click()
        time.sleep(1)
    except Exception:
        print("No origin suggestion; proceeding.")

    # Step 4: Enter destination
    dest_input = driver.find_element(By.CSS_SELECTOR, '[data-testid="destination-input"] input')
    dest_input.clear()
    dest_input.send_keys(DESTINATION)
    time.sleep(1)

    # Click suggestion (e.g., LAX)
    try:
        dest_suggestion = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[data-testid="suggestion-card"]')))
        dest_suggestion.click()
        time.sleep(1)
    except Exception:
        print("No destination suggestion; proceeding.")

    # Step 5: Enter departure date
    date_input = driver.find_element(By.CSS_SELECTOR, '[data-testid="date-picker"] input')
    date_input.clear()
    date_input.send_keys(DEPARTURE_DATE)
    time.sleep(1)

    # Select the date from the calendar if it pops up
    try:
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '[data-testid="date-picker-month"]')))
        # Derive the day from DEPARTURE_DATE instead of hardcoding it;
        # the displayed month/year may still need adjusting
        day = datetime.strptime(DEPARTURE_DATE, "%d/%m/%Y").day
        specific_date = driver.find_element(By.XPATH, f"//td[@data-testid='day-{day}']")
        specific_date.click()
        time.sleep(1)
    except Exception:
        print("Date input direct; calendar may not have triggered.")

    # Step 6: Click search button
    search_button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[data-testid="search-button"]')))
    search_button.click()
    time.sleep(5)  # Allow results to load

def extract_lowest_price(driver):
    """Extract flight prices and find the lowest."""
    wait = WebDriverWait(driver, 10)
    flights = []

    try:
        # Wait for results to load
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '[data-testid="flight-result"]')))

        # Extract all flight cards
        flight_cards = driver.find_elements(By.CSS_SELECTOR, '[data-testid="flight-result"]')

        for card in flight_cards[:10]:  # Limit to top 10 for brevity
            try:
                price_elem = card.find_element(By.CSS_SELECTOR, '[data-testid="price"]')
                price_text = price_elem.text.strip().replace('$', '').replace(',', '')
                if price_text.isdigit():
                    price = int(price_text)
                    airline = card.find_element(By.CSS_SELECTOR, '[data-testid="airline"]').text
                    flights.append({'airline': airline, 'price': price})
            except Exception:  # Skip cards that lack a price or airline element
                continue

        if flights:
            lowest = min(flights, key=lambda x: x['price'])
            print(f"Lowest priced flight: {lowest['airline']} for ${lowest['price']}")
            return lowest
        else:
            print("No prices extracted.")
            return None
    except Exception as e:
        print(f"Error extracting prices: {e}")
        return None

# Main execution
if __name__ == "__main__":
    driver = setup_driver()
    try:
        search_flights(driver)
        lowest_flight = extract_lowest_price(driver)
        if lowest_flight:
            print(f"Found lowest flight: {lowest_flight}")
        else:
            print("No flights found or error in extraction.")
    finally:
        driver.quit()

Testing Web Automation

Testing web automation code is essential to confirm that it behaves consistently and copes with the situations it will meet on real websites, such as slow page loads, missing elements, and layout changes.

Web automation tests typically use assertions to verify that expected elements have appeared, extracted data is accurate, and navigation steps have completed successfully.
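As a minimal sketch of such assertions, the price-handling logic can be pulled into small pure functions and checked without opening a browser. The helper names parse_price and lowest_flight, and the sample airline names, are hypothetical and not part of the script above.

```python
import re

def parse_price(text):
    """Return the whole-number price as an int, or None if the text has no digits."""
    whole_part = text.split(".")[0]          # drop cents, if any
    digits = re.sub(r"[^\d]", "", whole_part)  # strip currency symbols and commas
    return int(digits) if digits else None

def lowest_flight(flights):
    """Return the cheapest flight dict, or None for an empty list."""
    return min(flights, key=lambda f: f["price"]) if flights else None

# Assertions a test might make about the extracted data
assert parse_price("$1,234") == 1234
assert parse_price("N/A") is None

sample = [
    {"airline": "Airline A", "price": 320},
    {"airline": "Airline B", "price": 275},
]
assert lowest_flight(sample)["airline"] == "Airline B"
assert lowest_flight([]) is None
```

Keeping the parsing separate from the Selenium calls means these checks run in milliseconds and do not break when the site layout changes.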

Explicit wait conditions (such as WebDriverWait) handle dynamic page loads more reliably than fixed delays: they continue as soon as the condition is met and fail with a clear timeout when it is not.
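The idea behind an explicit wait can be sketched in plain Python as a polling loop. This is an illustration of the technique, not Selenium's implementation; wait_until and FakePage are hypothetical names, and FakePage only simulates a page whose results appear after a few polls.

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll condition() until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)

class FakePage:
    """Simulated page: results become available on the Nth poll."""
    def __init__(self, ready_after):
        self.polls = 0
        self.ready_after = ready_after

    def results_loaded(self):
        self.polls += 1
        return self.polls >= self.ready_after

page = FakePage(ready_after=3)
wait_until(page.results_loaded, timeout=2, poll_interval=0.01)
print(page.polls)  # prints 3
```

Unlike time.sleep(5), the loop returns as soon as the results are ready and raises an error instead of silently continuing when they never arrive.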

Run your automated actions on various browsers and devices to check that they behave consistently.

Integrating Web Automation with AI Agent

A script for web automation is excellent for handling repetitive and predictable tasks, such as clicking on buttons, filling out forms, moving through pages, or extracting data.

Conversely, an AI agent is capable of logical thinking, planning, learning, and making judgments based on available data.

When you combine these two, you get a smart system where the AI is in charge of the operations and the automation carries out the tasks.

Typical AI Agent System Components

AI Decision Engine - Processes goals, rules, or user commands and decides the sequence of actions. Can use ML models, NLP, or rule-based systems.
Web Automation Layer - Executes low-level actions on web pages via tools like Selenium, Playwright, or Puppeteer. Handles clicks, inputs, scrolling, navigation.
Perception / Data Extraction Module - Observes the web environment and extracts relevant information (prices, flight options, stock data). Feeds this back to the AI agent.
Feedback / Learning Module - Evaluates outcomes of automated actions (e.g., did the AI pick the lowest flight price?) and updates decision-making models for future improvements.
Scheduler / Controller - Coordinates the flow: triggers web automation when needed, handles retries, logs progress, and ensures proper sequencing of tasks.
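
One way these components might fit together is sketched below. It is a simplified, rule-based illustration under stated assumptions: automation_layer is a stub that returns made-up rows in place of the Selenium script, and the names Flight, perceive, decide, and Agent are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Flight:
    airline: str
    price: int

def automation_layer(origin, destination):
    """Stand-in for the web automation script: returns raw (airline, price) text rows."""
    return [("Airline A", "$320"), ("Airline B", "$275"), ("Airline C", "n/a")]

def perceive(raw_rows):
    """Perception module: turn raw text into structured flights, skipping bad rows."""
    flights = []
    for airline, price_text in raw_rows:
        digits = price_text.replace("$", "").replace(",", "")
        if digits.isdigit():
            flights.append(Flight(airline, int(digits)))
    return flights

def decide(flights, budget):
    """Decision engine (rule-based): cheapest flight within budget, else None."""
    affordable = [f for f in flights if f.price <= budget]
    return min(affordable, key=lambda f: f.price) if affordable else None

@dataclass
class Agent:
    budget: int
    max_retries: int = 2
    history: list = field(default_factory=list)  # feedback log of every attempt

    def run(self, origin, destination, fetch=automation_layer):
        """Controller: trigger automation, retry on failure, record the outcome."""
        for attempt in range(1, self.max_retries + 1):
            try:
                raw = fetch(origin, destination)
            except Exception as exc:
                self.history.append((attempt, f"error: {exc}"))
                continue
            choice = decide(perceive(raw), self.budget)
            self.history.append((attempt, choice))
            return choice
        return None

agent = Agent(budget=300)
print(agent.run("JFK", "LAX"))  # prints Flight(airline='Airline B', price=275)
```

A learning module could later read agent.history to adjust the budget rule or the retry count; here the history is only recorded, not acted on.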

Implementing Software Design Principles in a no-code tool, such as MIT App Inventor

Srav Nayani — Sun, 28 Sep 2025 20:00:52 +0000

UI Design in MIT App Inventor

Blocks to code in MIT App Inventor

Parameterizing the UI widgets to not repeat the code

Below are examples of Software Design Principles commonly applied when designing code in object-oriented languages like Java.

Single Responsibility Principle (SRP): A class should have only one reason to change, meaning it should have a single, well-defined purpose.
Open-Closed Principle (OCP): Software entities (like classes and modules) should be open for extension but closed for modification.
Liskov Substitution Principle (LSP): Objects of a superclass should be replaceable with objects of a subclass without affecting the program's correctness.
Interface Segregation Principle (ISP): Clients should not be forced to depend on interfaces they do not use; it's better to have many specific interfaces than one general-purpose interface.
Dependency Inversion Principle (DIP): High-level modules should not depend on low-level modules; both should depend on abstractions, and abstractions should not depend on details, but details should depend on abstractions.
DRY (Don't Repeat Yourself): Avoid duplicating code to reduce complexity and improve maintainability.
Keep It Simple: Design systems to be as simple as possible, avoiding unnecessary complexity.
YAGNI (You Aren't Gonna Need It): Do not implement functionality until it is actually required, preventing the accumulation of unnecessary code.
Abstraction: Hide complex details and provide a clear, understandable interface to users.
Modularity: Break down a software system into smaller, independent, and manageable components.
Testability and Maintainability: Design code that is easy to test, understand, and maintain by fixing bugs or adding new features.
Reusability: Design software that can be used in other projects with minimal modification.

The above principles are straightforward to implement in object-oriented languages like Java. However, in a no-code environment such as MIT App Inventor, things can quickly become complex. While drag-and-drop, Lego-like code blocks make it easy to get started, the number of blocks can grow rapidly and become overwhelming as the application scales.

This is where good code design and organization practices prove invaluable. They help developers manage complexity and build maintainable, real-world applications. In the following sections, I will demonstrate how the above design principles can be applied effectively within MIT App Inventor and other similar no-code systems.

Single Responsibility Principle (SRP):

Unlike traditional code-based environments, in MIT App Inventor, the Screen serves as the primary unit of code—similar to a class in Java. Each Screen has its own properties, contains child UI components arranged in a tree-like hierarchy, and is controlled by Lego-style blocks that manage both UI events and backend services such as CloudDB.

Because there is no built-in mechanism for reusing UI elements across different Screens, it is best to design each Screen with a single responsibility. This approach keeps the widget hierarchy and associated blocks simple, organized, and easier to maintain.

When a Screen must serve multiple purposes with only minor variations, you can adopt a delegation model. By introducing a flag to indicate the current mode, the Screen’s behavior can be adjusted without duplicating the entire structure. However, avoid applying this technique to Screens with vastly different roles, as overuse can increase complexity and make the app harder to maintain.

Open-Closed Principle (OCP):

In object-oriented programming (OOP) languages like Java, extension is typically achieved through inheritance. In no-code environments, achieving the same effect is more challenging.

A recommended practice is to design your blocks and UI hierarchy so that new functionality can be added as an extension, rather than modifying existing features. Avoid altering existing behavior, as this can introduce errors and compromise stability. By extending instead of modifying, you ensure your app remains reliable while still allowing it to evolve over time.

Additionally, the Open-Closed Principle can be applied when creating custom components, enabling them to be extended without changing their original implementation.

Liskov Substitution Principle (LSP):

Since MIT App Inventor does not directly support inheritance, the concepts of subclass and superclass can only be implemented through custom components. In this approach, a subclass can be used wherever its superclass would normally appear. For example, a specialized Button component can be created and substituted for a default Button, adhering to this principle.

Interface Segregation Principle (ISP):

Rather than creating a single screen or block setup that handles many unrelated features, break your app into smaller, purpose-driven modules. Develop utility procedures (functions) that perform one well-defined task instead of large, catch-all procedures. Similarly, design each screen to serve one primary purpose rather than bundling multiple features together.

Dependency Inversion Principle (DIP):

Rather than tying your logic directly to a specific implementation, encapsulate it in a procedure that represents the abstract behavior. Use flags or configuration variables to switch between implementations when minor variations are needed. Additionally, create helper screens or components that act as reusable “providers” for shared functionality.
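
App Inventor blocks have no textual form, so the following Python sketch is only an analogy for the block structure described above. The CONFIG flag, the mode names "local" and "cloud", and the procedure names are all hypothetical; plain dictionaries stand in for TinyDB- and CloudDB-style storage.

```python
# Configuration flag, comparable to a global variable block in App Inventor.
CONFIG = {"storage_mode": "local"}

# Plain dictionaries stand in for the two storage implementations.
_local_store = {}
_cloud_store = {}

def _active_store():
    """Pick the concrete implementation based on the configuration flag."""
    return _cloud_store if CONFIG["storage_mode"] == "cloud" else _local_store

def save_value(tag, value):
    """Abstract 'save' procedure: callers never reference a concrete store."""
    _active_store()[tag] = value

def load_value(tag, default=None):
    """Abstract 'load' procedure with a fallback when the tag is missing."""
    return _active_store().get(tag, default)

# Event-handler style code depends only on the abstract procedures.
save_value("username", "sam")
print(load_value("username"))  # prints sam
```

Switching the flag changes which store is used without touching any of the calling code, which is the point of depending on the abstraction.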

DRY (Don't Repeat Yourself):

If a sequence of blocks is used multiple times, extract it into a procedure. Use constants or configuration values stored in global variables instead of repeating them across screens. Avoid creating multiple screens that are nearly identical but differ only in minor ways; instead, repurpose a single screen and control variations with mode flags. For logic shared across multiple screens—such as file picking or error handling—consider creating a dedicated screen or custom component to centralize that functionality.

Keep It Simple:

Avoid creating too many screens; instead, reuse screens by using flags or parameters to modify behavior. Keep your blocks organized—don’t create spaghetti code, and avoid cramming multiple complex algorithms into a single block; separate them into smaller, manageable procedures. Similarly, don’t overload screens with too many buttons or labels; use clear labels and group controls logically for a clean and user-friendly interface.

YAGNI (You Aren't Gonna Need It):

Avoid adding screens prematurely based on assumptions or anticipated features. Keep the UI minimal and allow it to evolve according to actual needs and requirements. Similarly, don’t over-generalize procedures; while it may be tempting to create a “super procedure” that handles every possible case, breaking logic into focused, well-defined procedures is more maintainable and easier to debug.

Abstraction:

Use procedures to encapsulate logic instead of repeating long sequences of blocks. Abstract UI actions into events and triggers, and move all complex logic into dedicated procedures. Create reusable helper procedures to handle recurring or complex tasks. Additionally, keep data storage blocks separate from UI update blocks to maintain clarity and modularity.

Modularity:

Use multiple screens for different features, ensuring each screen represents a single main functionality or module. Define procedures as mini-modules that perform one specific task. Avoid mixing UI event handling (buttons, sliders) with data handling (TinyDB, lists). If a group of UI elements and blocks serves the same purpose in multiple places, create a custom component (using extensions or templates) for reuse. Finally, don’t overload a single screen with too many features; keep each screen focused and manageable.

Testability:

Make screens as testable as possible by keeping them independent. Make data storage mockable by using intermediary procedures, which can also be leveraged for testing. Additionally, add debug labels or temporary notifiers to display intermediate results and facilitate easier debugging.
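
As a rough text-language analogy, the sketch below routes all storage access through one intermediary procedure so that a test double can replace the real storage. InMemoryStore, add_score, and debug_log are hypothetical names; debug_log plays the role of a temporary debug label.

```python
class InMemoryStore:
    """Test double standing in for TinyDB-style tag/value storage."""
    def __init__(self):
        self.data = {}

    def get(self, tag, default):
        return self.data.get(tag, default)

    def put(self, tag, value):
        self.data[tag] = value

debug_log = []  # collects intermediate results, like a temporary debug label

def add_score(store, player, points):
    """Intermediary procedure: the only place that reads and writes storage."""
    total = store.get(player, 0) + points
    debug_log.append(f"{player}: {total}")
    store.put(player, total)
    return total

store = InMemoryStore()
add_score(store, "ana", 5)
print(add_score(store, "ana", 3))  # prints 8
```

Because add_score receives the store as a parameter, the same logic can be exercised against the in-memory double during testing and against real storage in the app.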

Maintainability:

Use self-explanatory names for procedures, variables, and components. If you find yourself copy-pasting the same blocks multiple times, encapsulate them into a procedure. Add block comments to clarify any tricky logic. Since MIT App Inventor does not support direct Git integration, export your .aia files frequently and maintain versioned backups (v1, v2, v3). Consider using cloud storage to store these backups and snapshots of key features.

Reusability:

Encapsulate commonly used operations into procedures instead of repeating blocks. Parameterize procedures rather than hardcoding values, so they can be reused in different contexts. Organize shared data access by creating wrapper procedures for TinyDB, CloudDB, file operations, and similar tasks.

Since MIT App Inventor does not support “fragments” like Android, reuse screens for similar purposes with minor variations by using flags or parameters. Abstract repeated UI patterns: if multiple screens share a UI pattern (e.g., a “list of items + delete button”), copy the UI once and adjust the data source or title dynamically.

Take advantage of extensions and community modules to implement reusable functionality, and export blocks or use the backpack feature to transfer procedures and components to other projects.

Develop a native Android app: PDF Voice Reader

Srav Nayani — Sat, 27 Sep 2025 21:11:18 +0000

The full code for this project is on GitHub at https://github.com/shravyanayani/AndroidPdfVoiceReader

Why Choose Native Apps?

Native apps on Android mobile devices are designed to take full advantage of the underlying platform’s features and capabilities. In this case, the Text-to-Speech (TTS) functionality of the phone can be seamlessly integrated, allowing the app to read PDF content aloud without limitations or external dependencies.

Beyond feature access, Android native apps also excel in performance and responsiveness. Since they are optimized for the specific operating system, they can handle tasks like parsing text and generating speech more efficiently, resulting in a smoother, faster, and more reliable user experience.

Why Create PDF Voice Reader?

PDF is one of the most widely used document formats, valued for its portability across devices and operating systems. Books, research papers, articles, and even web pages or documents can easily be saved as PDFs, making them a universal standard for digital reading.

While PDFs are convenient for distribution, reading them visually isn’t always practical or desirable. In many situations—such as while driving, before sleep, exercising, or when reducing screen time—having the document read aloud can be far more convenient.

Although there are existing apps in app stores that provide PDF-to-speech functionality, many of them come with drawbacks. They often include intrusive advertisements and lack the customization options users truly need. For example, most do not allow skipping repetitive elements like headers, footers, or page numbers, which disrupt the listening experience.

By creating PDF Voice Reader, these limitations can be overcome. The app not only eliminates ads but also offers greater flexibility, allowing users to tailor the reading experience to their needs. This makes it a more personalized, efficient, and user-friendly solution for anyone who wants to consume PDF content through voice.

Key Features of PDF Voice Reader

1. Native Text-to-Speech (TTS) Integration

PDF Voice Reader leverages the built-in Text-to-Speech engine of the mobile operating system. This ensures seamless performance without external dependencies. The app converts the text extracted from a PDF into high-quality speech, using the same voice and settings already available on the Android device. Users can also customize the voice directly from their device’s system settings.

2. File Selection with Native File Picker

Users can easily select a PDF file from their device using the native Android file picker dialog. Once chosen, the selected document is displayed in the app, ready to be read aloud. This makes the process quick, intuitive, and consistent with the device’s user experience.

3. Playback Controls

The app includes simple but powerful controls for listening:

Play/Pause/Resume the reading at any time.

Adjust the reading speed through a dropdown menu with options for slower or faster playback.

4. Page Navigation

Reading doesn’t have to start at the beginning of a document. Users can:

Enter a specific page number to jump directly to that section.

Restart playback from the chosen page once the controls are activated.

Use Next Page and Previous Page buttons to skip directly to different sections.

This feature is especially useful for textbooks, research papers, or long-form PDFs.

5. Phrase Ignoring for Cleaner Listening

A distinctive feature of PDF Voice Reader is the ability to ignore repetitive phrases such as headers, footers, or page numbers.

Users can add these phrases to an “Ignore List” so they won’t be read aloud.

Each ignored phrase is displayed in a list with a delete icon, allowing users to manage or remove phrases at any time.
This customization significantly improves the listening experience, making the content flow more naturally.
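
The filtering idea can be illustrated with a short Python sketch. The app itself is written in Kotlin, so this is not its implementation; remove_ignored_phrases is a hypothetical helper that removes each ignored phrase case-insensitively and tidies the leftover spacing.

```python
import re

def remove_ignored_phrases(page_text, ignore_list):
    """Remove each ignored phrase (case-insensitive) and collapse extra spaces."""
    for phrase in ignore_list:
        if phrase.strip():  # skip empty or whitespace-only entries
            page_text = re.sub(re.escape(phrase), "", page_text, flags=re.IGNORECASE)
    return re.sub(r"[ \t]{2,}", " ", page_text).strip()

print(remove_ignored_phrases("Acme Press  Hello world", ["acme press"]))  # prints Hello world
```

Escaping each phrase with re.escape keeps characters such as parentheses or dots in a header from being treated as regular-expression syntax.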

6. Android Theme and Controls

The PDF Voice Reader app is built using Android native controls, ensuring a familiar look, feel, and behavior consistent with other Android apps. This not only enhances user-friendliness but also makes the interface more intuitive, as users can rely on the interactions they already know. Additionally, the app automatically adapts to the system’s chosen theme—whether light mode or dark mode—providing a seamless and visually consistent experience.

Why Android Studio for Building PDF Voice Reader?

To create the PDF Voice Reader Android app from scratch, I chose Android Studio as the development environment. Android Studio is the official IDE (Integrated Development Environment) for Android app development, designed specifically for building, testing, and deploying apps on Android devices. Its tight integration with the native Android SDK makes it the most reliable and future-proof choice for native development.

1. Access to Native SDKs

Android Studio's SDK Manager provides the latest and previous versions of the Android SDK, which helps ensure compatibility across a wide range of Android versions. This is critical for building apps that not only use the newest platform features but also remain accessible to users on slightly older devices.

2. Built-In Device Emulator

One of Android Studio's most powerful features is its built-in Android Emulator, which allows developers to test the app on multiple device models and Android versions without needing the physical hardware. This makes it possible to verify performance, behavior, and UI responsiveness across a wide variety of scenarios, saving significant development time.

3. Standardized Layouts and Controls

Android Studio also provides native UI components and layout tools that follow Android's Material Design guidelines. By leveraging these, the PDF Voice Reader app automatically inherits key Android features such as theming (light and dark modes), accessibility standards, and a familiar look-and-feel. This ensures the app feels natural to users while maintaining high compatibility with Android design principles.

4. Streamlined Development Workflow

From code editing and debugging to interface design and deployment, Android Studio offers a comprehensive workflow in one place. This integration reduces complexity and allows for faster, more efficient development compared to using third-party tools.

Why Use Kotlin for PDF Voice Reader?

For developing the PDF Voice Reader app, I chose Kotlin as the programming language. Kotlin is a modern, expressive language developed by JetBrains and officially recommended by Google for building apps across the Android ecosystem, including phones, tablets, and wearables.

1. Native Performance and Compatibility

Kotlin is fully supported by the Android SDK and development tools, making it a strong choice for native development. Apps written in Kotlin run efficiently, take advantage of the latest Android features, and integrate seamlessly with system services like Text-to-Speech.

2. Simplicity and Readability

Kotlin’s syntax is clean, concise, and expressive, making it easier to write and maintain code compared to older languages like Java. This simplicity helps speed up development while reducing the chances of errors, making the codebase more maintainable over time.

3. Safety and Reliability

One of Kotlin’s strengths is its focus on safety. Static typing and null safety (nullable types checked by the compiler) help catch errors during compilation rather than at runtime, while garbage collection handles memory management automatically. This leads to more reliable and stable apps, which is crucial for providing a smooth reading experience to users.

4. Modern Features for Faster Development

Kotlin offers powerful features such as lambdas, generics, and coroutines with structured concurrency, which make coding more efficient and expressive. These modern tools enable developers to implement features like customizable playback or phrase filtering with less code and greater clarity.

5. Future-Proof and Actively Supported

Kotlin is actively maintained and improved by JetBrains, Google, and the open-source community. Choosing Kotlin makes it likely that the app will remain compatible with future versions of Android and benefit from ongoing performance improvements, security updates, and new language features.

Steps to Create the Project in Android Studio

Since this is a single-screen app, we can start with the Empty Activity app template in Android Studio. Follow these steps:

Open Android Studio
From the File menu, select New, then New Project. In the Phone and Tablet section, select Empty Activity and click Next.

Configure the project with the following settings:

Name: PDFReadAloud

Package Name: com.productivity

Language: Kotlin

Minimum SDK: API 33 (you can choose a different SDK version)

Build Configuration Language: Kotlin

Click Finish to generate the project.

At this point, Android Studio will scaffold the project with the necessary files and structure, and you’ll be ready to start coding the app.

Significant Code Fragments

Code to open a PDF file selection dialog and display file name.

                    Button(
                        onClick = {
                            pdfPickerLauncher.launch(arrayOf("application/pdf"))
                        },
                        modifier = Modifier.fillMaxWidth()
                    ) {
                        Icon(
                            imageVector = Icons.Default.FileOpen,
                            contentDescription = "Select PDF File"
                        )
                        Spacer(modifier = Modifier.width(8.dp))
                        Text("Select PDF File")
                    }

                    selectedPdfName?.let {
                        Text(
                            text = "Selected file: $it",
                            style = MaterialTheme.typography.bodyMedium
                        )
                    }

    fun openPdfFile(uri: Uri): Boolean {
        try {
            closeCurrentDocument()

            // Close the input stream once the document has been loaded
            context.contentResolver.openInputStream(uri)?.use { inputStream ->
                pdfDocument = PDDocument.load(inputStream)
            }
            totalPages = pdfDocument?.numberOfPages ?: 0
            currentPageNumber = 1

            if (totalPages > 0) {
                _state.value = ReaderState.Loaded(currentPageNumber, totalPages)
                return true
            } else {
                _state.value = ReaderState.Error("No pages found in PDF")
                return false
            }
        } catch (e: Exception) {
            _state.value = ReaderState.Error("Failed to open PDF: ${e.message}")
            return false
        }
    }

Code to select a specific page number to begin reading from.

                    Text(
                        text = "Page Controls",
                        style = MaterialTheme.typography.titleMedium
                    )

                    Row(
                        verticalAlignment = Alignment.CenterVertically,
                        modifier = Modifier.fillMaxWidth()
                    ) {
                        OutlinedTextField(
                            value = pageNumber,
                            onValueChange = { value ->
                                if (value.isEmpty() || value.all { it.isDigit() }) {
                                    pageNumber = value
                                }
                            },
                            label = { Text("Page #") },
                            keyboardOptions = KeyboardOptions(
                                keyboardType = KeyboardType.Number,
                                imeAction = ImeAction.Done
                            ),
                            keyboardActions = KeyboardActions(
                                onDone = { keyboardController?.hide() }
                            ),
                            modifier = Modifier.weight(1f)
                        )

                        // Both branches of the original when showed the same text,
                        // so compute the total page count once (0 when nothing is loaded)
                        val totalPageCount = when (val state = readerState) {
                            is PdfReaderService.ReaderState.Loaded -> state.totalPages
                            is PdfReaderService.ReaderState.Paused -> state.totalPages
                            is PdfReaderService.ReaderState.Reading -> state.totalPages
                            else -> 0
                        }

                        Text(
                            text = "of $totalPageCount",
                            modifier = Modifier
                                .padding(horizontal = 8.dp)
                                .align(Alignment.CenterVertically)
                        )
                    }

    fun nextPage() {
        if (currentPageNumber < totalPages) {
            readPage(currentPageNumber + 1)
        }
    }

    fun previousPage() {
        if (currentPageNumber > 1) {
            readPage(currentPageNumber - 1)
        }
    }

    fun readPage(pageNumber: Int = currentPageNumber) {
        if (!isInitialized || pdfDocument == null) {
            _state.value = ReaderState.Error("Reader not initialized or no PDF loaded")
            return
        }

        if (pageNumber < 1 || pageNumber > totalPages) {
            _state.value = ReaderState.Error("Invalid page number")
            return
        }

        try {
            currentPageNumber = pageNumber
            val stripper = PDFTextStripper()
            stripper.startPage = pageNumber
            stripper.endPage = pageNumber

            currentText = stripper.getText(pdfDocument).lowercase()

            // Apply excluded text filtering
            var textToRead = currentText
            excludedTexts.forEach { excludedText ->
                if (excludedText.isNotBlank()) {
                    textToRead = textToRead.replace(excludedText.lowercase(), "")
                }
            }

            if (textToRead.isBlank()) {
                _state.value = ReaderState.Reading(currentPageNumber, totalPages)
                if (continuousReading && currentPageNumber < totalPages) {
                    // If page is blank and continuous reading is enabled, skip to next page
                    readPage(currentPageNumber + 1)
                }
                return
            }

            // Stop any current speech first: stopReading() switches a Reading state
            // to Paused, so the Reading state must be set afterwards
            stopReading()
            _state.value = ReaderState.Reading(currentPageNumber, totalPages)
            textToSpeech?.speak(textToRead, TextToSpeech.QUEUE_FLUSH, null, "pdf_page_$pageNumber")
        } catch (e: Exception) {
            _state.value = ReaderState.Error("Failed to read page: ${e.message}")
        }
    }

    fun goToPage(pageNumber: Int) {
        val validPageNumber = min(max(1, pageNumber), totalPages)
        readPage(validPageNumber)
    }

    fun setContinuousReading(enabled: Boolean) {
        continuousReading = enabled
    }

    fun stopReading() {
        textToSpeech?.stop()
        if (_state.value is ReaderState.Reading) {
            _state.value = ReaderState.Paused(currentPageNumber, totalPages)
        }
    }

    fun closeCurrentDocument() {
        stopReading()
        pdfDocument?.close()
        pdfDocument = null
        currentText = ""
        totalPages = 0
        _state.value = ReaderState.Idle
    }

    fun shutdown() {
        stopReading()
        textToSpeech?.shutdown()
        textToSpeech = null
        closeCurrentDocument()
    }

Code to change the reading speed, either faster or slower.

                        OutlinedButton(
                            onClick = { isSpeedMenuExpanded = true }
                        ) {
                            Text("Speed: ${selectedSpeed}x")
                        }

                        DropdownMenu(
                            expanded = isSpeedMenuExpanded,
                            onDismissRequest = { isSpeedMenuExpanded = false },
                            modifier = Modifier.wrapContentSize()
                        ) {
                            speedOptions.forEach { speed ->
                                DropdownMenuItem(
                                    onClick = {
                                        selectedSpeed = speed
                                        pdfReaderService.setReadingSpeed(speed)
                                        isSpeedMenuExpanded = false
                                    },
                                    text = { Text("${speed}x") }
                                )
                            }
                        }

    fun setReadingSpeed(speedFactor: Float) {
        textToSpeech?.setSpeechRate(speedFactor)
    }

Code to pause, stop, or restart the reading.

                        // Previous Page Button
                        // Binding the state to a local val lets Kotlin smart-cast it in each branch.
                        val isPreviousEnabled = when (val state = readerState) {
                            is PdfReaderService.ReaderState.Loaded -> state.currentPage > 1
                            is PdfReaderService.ReaderState.Reading -> state.currentPage > 1
                            is PdfReaderService.ReaderState.Paused -> state.currentPage > 1
                            else -> false
                        }

                        Button(
                            onClick = { pdfReaderService.previousPage() },
                            enabled = isPreviousEnabled,
                            modifier = Modifier.weight(1f)
                        ) {
                            Text("Prev Page")
                        }

                        Spacer(modifier = Modifier.width(4.dp))

                        // Next Page Button
                        // Binding the state to a local val lets Kotlin smart-cast it in each branch.
                        val isNextEnabled = when (val state = readerState) {
                            is PdfReaderService.ReaderState.Loaded -> state.currentPage < state.totalPages
                            is PdfReaderService.ReaderState.Reading -> state.currentPage < state.totalPages
                            is PdfReaderService.ReaderState.Paused -> state.currentPage < state.totalPages
                            else -> false
                        }

                        Button(
                            onClick = { pdfReaderService.nextPage() },
                            enabled = isNextEnabled,
                            modifier = Modifier.weight(1f)
                        ) {
                            Text("Next Page")
                        }

    fun pauseReading() {
        if (textToSpeech?.isSpeaking == true) {
            textToSpeech?.stop()
            _state.value = ReaderState.Paused(currentPageNumber, totalPages)
        }
    }

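    // Android's TextToSpeech has no pause API, so pauseReading() stops playback
    // and resuming re-reads the current page from its start.
    fun resumeReading() {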
        if (_state.value is ReaderState.Paused) {
            val textToRead = currentText
            textToSpeech?.speak(textToRead, TextToSpeech.QUEUE_FLUSH, null, "pdf_resume_$currentPageNumber")
            _state.value = ReaderState.Reading(currentPageNumber, totalPages)
        }
    }

Code to add phrases to an exclusion list and remove them individually when needed.

                    Text(
                        text = "Exclude Text",
                        style = MaterialTheme.typography.titleMedium
                    )

                    Row(
                        verticalAlignment = Alignment.CenterVertically,
                        modifier = Modifier.fillMaxWidth()
                    ) {
                        OutlinedTextField(
                            value = excludeText,
                            onValueChange = { excludeText = it },
                            label = { Text("Text to exclude") },
                            keyboardOptions = KeyboardOptions(imeAction = ImeAction.Done),
                            keyboardActions = KeyboardActions(
                                onDone = {
                                    if (excludeText.isNotBlank()) {
                                        coroutineScope.launch {
                                            preferenceRepository.addExcludedText(excludeText)
                                            excludeText = ""
                                        }
                                    }
                                    keyboardController?.hide()
                                }
                            ),
                            modifier = Modifier.weight(1f)
                        )

                        Spacer(modifier = Modifier.width(8.dp))

                        IconButton(
                            onClick = {
                                if (excludeText.isNotBlank()) {
                                    coroutineScope.launch {
                                        preferenceRepository.addExcludedText(excludeText)
                                        excludeText = ""
                                    }
                                }
                            }
                        ) {
                            Icon(
                                imageVector = Icons.Default.Add,
                                contentDescription = "Add excluded text"
                            )
                        }
                    }

                    Spacer(modifier = Modifier.height(8.dp))

                    if (excludedTexts.isNotEmpty()) {
                        Divider()
                        Spacer(modifier = Modifier.height(8.dp))

                        Text(
                            text = "Excluded Texts:",
                            style = MaterialTheme.typography.bodyMedium
                        )

                        FlowRow(
                            horizontalArrangement = Arrangement.spacedBy(8.dp),
                            verticalArrangement = Arrangement.spacedBy(8.dp),
                            modifier = Modifier.fillMaxWidth()
                        ) {
                            excludedTexts.forEach { text ->
                                ExcludedTextChip(
                                    text = text,
                                    onRemove = {
                                        coroutineScope.launch {
                                            preferenceRepository.removeExcludedText(text)
                                        }
                                    }
                                )
                            }
                        }
                    }
                }

    // Entries are stored as one comma-separated string, so a phrase that
    // itself contains a comma is split into several entries.
    suspend fun addExcludedText(text: String) {
        context.dataStore.edit { preferences ->
            // An empty stored string splits into [""], so drop blank entries.
            val currentList = preferences[EXCLUDED_TEXT_KEY]?.split(",")
                ?.filter { it.isNotBlank() } ?: emptyList()
            if (text.isNotBlank() && !currentList.contains(text)) {
                preferences[EXCLUDED_TEXT_KEY] = (currentList + text).joinToString(",")
            }
        }
    }

    suspend fun removeExcludedText(text: String) {
        context.dataStore.edit { preferences ->
            val currentList = preferences[EXCLUDED_TEXT_KEY]?.split(",")
                ?.filter { it.isNotBlank() } ?: emptyList()
            val newList = currentList.filter { it != text }
            preferences[EXCLUDED_TEXT_KEY] = newList.joinToString(",")
        }
    }

The full code for this project is on GitHub at https://github.com/shravyanayani/AndroidPdfVoiceReader

Develop a native iOS app: PDF Voice Reader

Srav Nayani — Sat, 27 Sep 2025 16:27:52 +0000

The full code for this project is on GitHub at https://github.com/shravyanayani/iosPdfVoiceReader

Why Choose Native Apps?

Native apps on iOS can use the underlying platform’s features and capabilities directly. In this case, the phone’s built-in Text-to-Speech (TTS) engine can be integrated into the app, allowing it to read PDF content aloud without external dependencies.

Beyond feature access, iOS native apps also excel in performance and responsiveness. Since they are optimized for the specific operating system, they can handle tasks like parsing text and generating speech more efficiently, resulting in a smoother, faster, and more reliable user experience.

Why Create PDF Voice Reader?

Key Features of PDF Voice Reader

1. Native Text-to-Speech (TTS) Integration

PDF Voice Reader leverages the built-in Text-to-Speech engine of the mobile operating system. This ensures seamless performance without external dependencies. The app converts the text extracted from a PDF into high-quality speech, using the same voice and settings already available on the iOS device. Users can also customize the voice directly from their device’s system settings.

2. File Selection with Native File Picker

Users can easily select a PDF file from their device using the native iOS file picker dialog. Once chosen, the selected document is displayed in the app, ready to be read aloud. This makes the process quick, intuitive, and consistent with the device’s user experience.

3. Playback Controls

The app includes simple but powerful controls for listening:

Play/Pause/Resume the reading at any time.

Adjust the reading speed through a dropdown menu with options for slower or faster playback.
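
As a rough sketch of how these controls might map onto the system speech engine, the wrapper below uses `AVSpeechSynthesizer` from AVFoundation. The `SpeechPlayer` class, its method names, and the way a speed multiplier is scaled against the default rate are assumptions made for illustration, not the app's actual implementation.

```swift
import AVFoundation

/// Hypothetical wrapper around the system speech synthesizer.
final class SpeechPlayer {
    private let synthesizer = AVSpeechSynthesizer()

    /// Converts a multiplier such as 0.5 or 2.0 into a rate the engine accepts.
    /// Scaling the default rate is an assumption; the result is clamped
    /// to the range AVFoundation allows.
    static func rate(for multiplier: Float) -> Float {
        let scaled = AVSpeechUtteranceDefaultSpeechRate * multiplier
        return min(max(scaled, AVSpeechUtteranceMinimumSpeechRate),
                   AVSpeechUtteranceMaximumSpeechRate)
    }

    /// Starts reading `text`, discarding anything still being spoken.
    func play(_ text: String, speedMultiplier: Float = 1.0) {
        _ = synthesizer.stopSpeaking(at: .immediate)
        let utterance = AVSpeechUtterance(string: text)
        utterance.rate = Self.rate(for: speedMultiplier)
        synthesizer.speak(utterance)
    }

    /// Pauses at the next word boundary so a word is not cut in half.
    func pause() {
        _ = synthesizer.pauseSpeaking(at: .word)
    }

    /// Continues from the point where `pause()` stopped.
    func resume() {
        _ = synthesizer.continueSpeaking()
    }
}
```

Because the rate is set per utterance, a speed change in this sketch would take effect the next time `play` is called rather than in the middle of a sentence.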

4. Page Navigation

Reading doesn’t have to start at the beginning of a document. Users can:

Enter a specific page number to jump directly to that page.

Restart playback from the chosen page once the controls are activated.

Use the Next Page and Previous Page buttons to move between pages.

This feature is especially useful for textbooks, research papers, or long-form PDFs.
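
The navigation rules above can be sketched as a few small functions. The names `clampPage`, `nextPage`, and `previousPage` are hypothetical, and the sketch assumes pages are numbered from 1 up to the document's page count.

```swift
/// Clamps a requested page number into 1...totalPages.
/// Returns nil when the document has no pages.
func clampPage(_ requested: Int, totalPages: Int) -> Int? {
    guard totalPages > 0 else { return nil }
    return min(max(1, requested), totalPages)
}

/// Page reached by the Next Page button; stays on the last page at the end.
func nextPage(after current: Int, totalPages: Int) -> Int? {
    return clampPage(current + 1, totalPages: totalPages)
}

/// Page reached by the Previous Page button; stays on page 1 at the start.
func previousPage(before current: Int, totalPages: Int) -> Int? {
    return clampPage(current - 1, totalPages: totalPages)
}
```

Clamping instead of rejecting out-of-range input means a typed page number that is too large simply lands on the last page.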

5. Phrase Ignoring for Cleaner Listening

A distinctive feature of PDF Voice Reader is the ability to ignore repetitive phrases such as headers, footers, or page numbers.

Users can add these phrases to an “Ignore List” so they won’t be read aloud.
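
One way such filtering could work is to strip each ignored phrase from a page's text before handing it to the speech engine. The function name `removeIgnoredPhrases` and the case-insensitive matching are assumptions for this sketch; the app may match phrases differently.

```swift
import Foundation

/// Removes every ignored phrase from `text` before it is spoken.
/// `ignoredPhrases` stands in for the user's Ignore List (hypothetical name).
func removeIgnoredPhrases(from text: String, ignoredPhrases: [String]) -> String {
    var cleaned = text
    // Skip empty entries, which have nothing to match.
    for phrase in ignoredPhrases where !phrase.isEmpty {
        // Case-insensitive matching is an assumption of this sketch.
        cleaned = cleaned.replacingOccurrences(of: phrase,
                                               with: "",
                                               options: .caseInsensitive)
    }
    return cleaned
}
```

Note that this sketch leaves the whitespace around a removed phrase untouched, so a real implementation might also collapse repeated spaces.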

6. iOS Theme and Controls

The PDF Voice Reader app is built using iOS native controls, ensuring a familiar look, feel, and behavior consistent with other iOS apps. This not only enhances user-friendliness but also makes the interface more intuitive, as users can rely on the interactions they already know. Additionally, the app automatically adapts to the system’s chosen theme—whether light mode or dark mode—providing a seamless and visually consistent experience.

Why Xcode for Building PDF Voice Reader?

To create the PDF Voice Reader iOS app from scratch, I chose Xcode as the development environment. Xcode is Apple’s official IDE (Integrated Development Environment) for iOS, designed for building, testing, and deploying apps on Apple devices. Its tight integration with the native iOS SDK makes it a reliable choice for native development.

1. Access to Native SDKs

Xcode ships with the current iOS SDK, and the project’s deployment target sets the oldest iOS version the app will run on. This is critical for building apps that not only use the newest platform features but also remain accessible to users on slightly older devices.

2. Built-In Device Simulators

One of Xcode’s most powerful features is its built-in iOS Simulator, which allows developers to test the app on multiple device models and iOS versions without needing the physical hardware. This makes it possible to check behavior and UI layout across a wide variety of scenarios, saving significant development time. Performance is still best measured on a physical device.

3. Standardized Layouts and Controls

Xcode also provides native UI components and layout tools that strictly follow Apple’s Human Interface Guidelines. By leveraging these, the PDF Voice Reader app automatically inherits key iOS features such as theming (light and dark modes), accessibility standards, and a familiar look-and-feel. This ensures the app feels natural to users while maintaining high compatibility with iOS design principles.

4. Streamlined Development Workflow

From code editing and debugging to interface design and deployment, Xcode offers a comprehensive workflow in one place. This integration reduces complexity and allows for faster, more efficient development compared to using third-party tools.

Why Use Swift for PDF Voice Reader?

For developing the PDF Voice Reader app, I chose Swift as the programming language. Swift is Apple’s modern, powerful, and intuitive language designed specifically for building apps across the Apple ecosystem, including iOS, iPadOS, watchOS, and macOS.

1. Native Performance and Compatibility

Swift is fully integrated with the iOS SDK and Apple’s development tools, making it the best choice for achieving native performance. Apps written in Swift run efficiently, take advantage of the latest iOS features, and integrate seamlessly with system services like Text-to-Speech.

2. Simplicity and Readability

Swift’s syntax is clean, concise, and expressive, making it easier to write and maintain code compared to older languages like Objective-C. This simplicity helps speed up development while reducing the chances of errors, making the codebase more maintainable over time.

3. Safety and Reliability

One of Swift’s strengths is its focus on safety. Features like strong typing, optionals, and automatic memory management help catch errors early during compilation rather than at runtime. This leads to more reliable and stable apps—crucial for providing a smooth reading experience to users.

4. Modern Features for Faster Development

Swift offers powerful features such as closures, generics, and structured concurrency, which make coding more efficient and expressive. These modern tools enable developers to implement features like customizable playback or phrase filtering with less code and greater clarity.

5. Future-Proof and Actively Supported

Swift is actively maintained and improved by Apple and the open-source community. Choosing Swift helps the app remain compatible with future versions of iOS and benefit from ongoing performance improvements, security updates, and new language features.

Steps to Create the Project in Xcode

Since this is a single-screen app, we can start with the standard iOS app template in Xcode. Follow these steps:

Open Xcode

From the top menu, go to:
File → New → Project

Choose Template

In the dialog that appears, select the iOS tab.

Under Application, choose App.

Click Next.

Configure the Project with the Following Settings

Product Name: PDFReadAloud

Organization Identifier: com.productivity

Interface: SwiftUI

Language: Swift

Check the box for Include Tests to add a testing target.

Click Next.

Select Project Location

Create or select a folder named PDFReadAloud.

Click Create to generate the project.

At this point, Xcode will scaffold the project with the necessary files and structure, and you’ll be ready to start coding the app.