<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Apify</title>
    <description>The latest articles on Forem by Apify (@apifyblog).</description>
    <link>https://forem.com/apifyblog</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1199146%2F25846ccf-9cfd-46a9-a3d0-0b947c365101.png</url>
      <title>Forem: Apify</title>
      <link>https://forem.com/apifyblog</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/apifyblog"/>
    <language>en</language>
    <item>
      <title>Edge AI vs. Cloud AI</title>
      <dc:creator>Apify</dc:creator>
      <pubDate>Wed, 22 Nov 2023 23:00:00 +0000</pubDate>
      <link>https://forem.com/apify/edge-ai-vs-cloud-ai-oak</link>
      <guid>https://forem.com/apify/edge-ai-vs-cloud-ai-oak</guid>
      <description>&lt;p&gt;&lt;strong&gt;Hi, we're&lt;/strong&gt; &lt;a href="https://apify.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;, a cloud platform that helps you build reliable web scrapers fast and automate anything you can do manually in a web browser. This article on Edge AI vs. Cloud AI was inspired by our work on&lt;/strong&gt; &lt;a href="https://apify.it/data-for-ai" rel="noopener noreferrer"&gt;&lt;strong&gt;collecting data for AI and machine learning applications&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The rise of Edge AI&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Five years ago, Gartner predicted that &lt;a href="https://www.gartner.com/smarterwithgartner/what-edge-computing-means-for-infrastructure-and-operations-leaders" rel="noopener noreferrer"&gt;75% of enterprise data would be created and processed outside the cloud&lt;/a&gt; by 2025. Whether that will prove entirely accurate remains to be seen. But what is clear is that Edge AI is rapidly growing in popularity.&lt;/p&gt;

&lt;p&gt;The rise of edge computing accelerated in the 2010s with the explosion of IoT devices necessitating smarter, faster processing at the edge, in other words, closer to the data source. This gave rise to Edge AI, where AI algorithms are processed locally on a hardware device.&lt;/p&gt;

&lt;p&gt;The growing interest in Edge AI has generated a myth that edge computing will replace cloud computing. But in reality, Edge and Cloud can work hand-in-hand by synchronizing a decentralized edge and a centralized cloud.&lt;/p&gt;

&lt;p&gt;The purpose of this article isn't to tell you which of the two - Edge or Cloud - is better but to highlight the pros and cons of each so you can know which is suitable for your AI tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is Edge AI?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Sometimes referred to as AI at the edge, Edge AI is the implementation of artificial intelligence in an edge computing environment. In other words, Edge AI allows computation to be done close to where data is collected rather than at an offsite data center. Because the data is processed on the device itself, response times are swift.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Pros and cons of Edge AI&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Edge AI advantages&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reduced latency and bandwidth:&lt;/strong&gt; By processing data at the edge, close to its source, the need to transmit information over the network is reduced.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Swift response times:&lt;/strong&gt; Fully on-device processing provides quick services, which eliminates wait times due to remote server responses.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Privacy and security:&lt;/strong&gt; Edge AI offers better security for personal data than transmitting it across networks, where it can be vulnerable to cyberattacks.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Edge AI disadvantages&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Limited computing power:&lt;/strong&gt; Edge devices often have less computing power than cloud servers, limiting the complexity of AI models they can run.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost and scalability challenges:&lt;/strong&gt; Scaling Edge AI solutions across numerous devices can be complex and expensive due to the amount of money required to acquire, maintain, and operate computing resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Maintenance and upgrades:&lt;/strong&gt; Regular maintenance and updates of each edge device can be more challenging compared to centralized cloud updates.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Machine variations:&lt;/strong&gt; There's more variation in machine types with edge devices, leading to more frequent failures.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3sm2ye6qbfxpl3jq8yz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3sm2ye6qbfxpl3jq8yz.jpg" alt="Cloud AI is where data processing and AI model execution occur in cloud-based servers." width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is Cloud AI?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Cloud AI refers to artificial intelligence systems where the data processing and AI model execution occur in cloud-based servers rather than on local devices.&lt;/p&gt;

&lt;p&gt;The foundation for Cloud AI was laid with the advent of cloud computing in the early noughties. The introduction of cloud AI services, such as Google's Cloud AI, AWS's SageMaker, and Microsoft's Azure AI, some ten years later, was a significant milestone. These platforms provided tools for machine learning, data analytics, and cognitive services (like &lt;a href="https://blog.apify.com/nlp-techniques/" rel="noopener noreferrer"&gt;natural language processing&lt;/a&gt; and &lt;a href="https://blog.apify.com/data-collection-for-computer-vision/" rel="noopener noreferrer"&gt;computer vision&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Because Cloud AI operates on data sent to remote servers, it's more scalable and flexible than Edge AI. That's the main thing that gives Cloud the edge (see what I did there?), but there are other advantages too:&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Pros and cons of Cloud AI&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cloud AI advantages&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Big data handling&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI algorithms thrive on voluminous data for training and accuracy. Cloud storage is integral here, providing the capacity to store and process terabytes of data. This capability is essential for developing machine learning models that learn from vast, varied datasets to enhance their predictive accuracy and reliability.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Parallel processing&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before cloud infrastructure, processing limitations were a significant bottleneck in AI development. Cloud computing introduced parallel processing nodes, which dramatically enhanced computing power. This means complex AI models can be computed much faster, accelerating the development and deployment of AI solutions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GPU acceleration&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Advanced AI computations, especially those in machine learning and deep learning, require significant processing power. GPUs, known for their &lt;a href="https://blog.apify.com/concurrency-vs-parallelism/" rel="noopener noreferrer"&gt;parallel processing&lt;/a&gt; capabilities, are ideal for these tasks. Cloud AI utilizes GPU acceleration to handle intensive AI computations efficiently.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Scalability and flexibility&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One of the most significant advantages of cloud storage in AI is scalability. Cloud-based AI systems can adapt to varying computational demands, scaling up or down as needed. This flexibility allows for efficient management of resources and costs, which is particularly vital for fluctuating AI workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cloud AI disadvantages&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Latency issues:&lt;/strong&gt; Depending on internet connectivity, there can be latency in data processing, which may not be suitable for real-time applications.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Security concerns:&lt;/strong&gt; Transmitting data to and from cloud servers can pose security risks, especially if sensitive data is involved. That said, cloud providers offer strong security measures and compliance standards, so the cloud can be a viable option if properly configured.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dependence on internet connectivity:&lt;/strong&gt; Cloud AI's effectiveness is contingent on reliable internet connectivity, which can be a limitation in remote or unstable network areas.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Key takeaways: when to use Edge and when to use Cloud&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Edge computing&lt;/strong&gt; minimizes latency by processing data locally but has limitations in terms of computational resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud computing&lt;/strong&gt; provides powerful processing capabilities but introduces latency due to data transmission.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The choice&lt;/strong&gt; between Edge and Cloud depends on the latency tolerance of your application, the available network bandwidth, and the computational needs of your AI tasks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use Edge AI&lt;/strong&gt; when real-time processing, data privacy, and reduced bandwidth usage are critical.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use Cloud AI&lt;/strong&gt; for complex computations, large-scale data analysis, and applications where latency is less of a concern.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Apify as a data cloud platform for AI&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If data in the cloud is what you need, Apify is a cloud platform that helps you &lt;a href="https://apify.com/web-scraping" rel="noopener noreferrer"&gt;build reliable web scrapers&lt;/a&gt; for real-time data collection, and automate anything you can do manually in a web browser. This makes it an ideal platform for extracting web data at scale for AI and machine learning.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;🧑🏻‍💻&lt;/strong&gt; &lt;a href="https://apify.com/data-for-generative-ai" rel="noopener noreferrer"&gt;&lt;strong&gt;Web scraping for AI data&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Apify excels in extracting vast amounts of data from the web, which is crucial for training and fine-tuning AI models like ChatGPT and LLaMA. Its ability to crawl and extract relevant information from various sources makes it a go-to solution for feeding AI algorithms.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;🧩&lt;/strong&gt; &lt;a href="https://python.langchain.com/docs/integrations/tools/apify" rel="noopener noreferrer"&gt;&lt;strong&gt;Easy integration with AI tools&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Apify facilitates the integration of scraped data into AI platforms. It supports the loading of data into &lt;a href="https://blog.apify.com/what-is-a-vector-database/" rel="noopener noreferrer"&gt;vector databases&lt;/a&gt; for querying to &lt;a href="https://blog.apify.com/how-to-ai-chatbot-python/" rel="noopener noreferrer"&gt;enhance the capabilities of AI chatbots&lt;/a&gt; and other applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;📈&lt;/strong&gt; &lt;a href="https://apify.com/enterprise" rel="noopener noreferrer"&gt;&lt;strong&gt;Customizable and scalable&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Whether it's using &lt;a href="https://apify.com/store/categories/ai" rel="noopener noreferrer"&gt;pre-built scrapers&lt;/a&gt; or developing custom ones, Apify offers &lt;a href="https://apify.com/enterprise" rel="noopener noreferrer"&gt;tailored solutions&lt;/a&gt;. This flexibility is vital for AI applications that require specific, up-to-date data from diverse web sources.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://blog.apify.com/intercom-customer-support-ai-chatbot-web-scraping/" rel="noopener noreferrer"&gt;&lt;strong&gt;Practical applications&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;In customer service, Apify's web scraping abilities are already &lt;a href="https://blog.apify.com/intercom-customer-support-ai-chatbot-web-scraping/" rel="noopener noreferrer"&gt;enhancing AI chatbots&lt;/a&gt;, enabling them to provide accurate and relevant responses based on real-time data.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🦾You might be interested in how you can add &lt;a href="https://blog.apify.com/add-custom-actions-to-your-gpts/" rel="noopener noreferrer"&gt;custom actions to your GPTs&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>data</category>
      <category>webcrawling</category>
    </item>
    <item>
      <title>How to download and upload files in Puppeteer</title>
      <dc:creator>Apify</dc:creator>
      <pubDate>Tue, 21 Nov 2023 23:00:00 +0000</pubDate>
      <link>https://forem.com/apify/how-to-download-and-upload-files-in-puppeteer-3cmb</link>
      <guid>https://forem.com/apify/how-to-download-and-upload-files-in-puppeteer-3cmb</guid>
      <description>&lt;p&gt;&lt;strong&gt;Hey, we're&lt;/strong&gt; &lt;a href="https://apify.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;, the only full-stack web scraping and automation library. Check out some of our easy-to-use&lt;/strong&gt; &lt;a href="https://apify.com/templates" rel="noopener noreferrer"&gt;&lt;strong&gt;web scraping code templates&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;if you want to get started on building your own Puppeteer scrapers.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pptr.dev/" rel="noopener noreferrer"&gt;Puppeteer&lt;/a&gt; is a Node.js library that allows you to interact with &lt;code&gt;headless&lt;/code&gt; and &lt;code&gt;headful&lt;/code&gt; Chrome browsers. It enables you to perform lots of tasks, such as navigating web pages, taking &lt;a href="https://blog.apify.com/puppeteer-screenshot-pdf-how-to/" rel="noopener noreferrer"&gt;screenshots, generating PDFs&lt;/a&gt;, and handling filesboth downloads and uploads. Puppeteer essentially allows you to automate tasks that would typically require manual intervention in a web browser.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiw9ci2datcmvshsmz2ol.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiw9ci2datcmvshsmz2ol.png" alt="How to download and upload files in Puppeteer" width="800" height="558"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why handle file downloads and uploads?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;File downloads and uploads are common activities when automating web interactions. Whether you're scraping data from websites, testing file upload functionality, or automating document retrieval, Puppeteer's ability to download and upload files makes it an invaluable tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Setting up Puppeteer&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To set up Puppeteer, you need to have &lt;a href="https://nodejs.org/" rel="noopener noreferrer"&gt;Node.js&lt;/a&gt; installed on your computer. If you haven't installed it yet, you can follow this &lt;a href="https://blog.apify.com/how-to-install-nodejs/" rel="noopener noreferrer"&gt;guide&lt;/a&gt; for a step-by-step procedure on how to do it.&lt;/p&gt;

&lt;p&gt;Installing Node.js also installs &lt;code&gt;npm&lt;/code&gt; for you. You can verify this by opening your command prompt (CMD) or terminal and running this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node -v &amp;amp;&amp;amp; npm -v
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output will look like the image below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdojyryrv5haf6g1y7lpk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdojyryrv5haf6g1y7lpk.png" alt="terminal screenshot, showing node and npm version installed on a computer" width="800" height="166"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Node and NPM version installed on a computer&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;After that, create a folder for your project wherever you like on your computer (Desktop or Documents, for example). Initialize a new Node.js project in that folder with &lt;code&gt;npm init&lt;/code&gt;, which will bring up some prompts. Follow through with the prompts, and that will set you up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1cxx317e1m6othtqzv3v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1cxx317e1m6othtqzv3v.png" alt="Installing Node.js" width="800" height="784"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Installing Node.js&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the image above, a folder called &lt;code&gt;puppeteer-download-upload&lt;/code&gt; is created. The &lt;code&gt;cd puppeteer-download-upload&lt;/code&gt; command changes the directory into the folder, and the &lt;code&gt;npm init&lt;/code&gt; command initializes a Node.js project in it, with the prompts filled in accordingly.&lt;/p&gt;

&lt;p&gt;This lets you run Node.js operations within the project.&lt;/p&gt;

&lt;p&gt;The next step is for you to install Puppeteer.&lt;/p&gt;

&lt;p&gt;In the same project folder, open your terminal, change your directory, and run this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install puppeteer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;📒&lt;em&gt;Note: Anytime you're working on a new Puppeteer project, you'll have to perform these operations:&lt;/em&gt;  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a new project folder
&lt;/li&gt;
&lt;li&gt;Initialize Node into the folder
&lt;/li&gt;
&lt;li&gt;Install Puppeteer&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;With Puppeteer successfully installed, you're ready to start automating.&lt;/p&gt;
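One setup detail worth checking before you run any code: the samples later in this article use ES module `import` syntax and top-level `await`, which Node.js only supports in ES modules. If Node reports an error like "Cannot use import statement outside a module", add `"type": "module"` to the `package.json` that `npm init` generated. A minimal sketch of the relevant file follows; the field values here are example defaults, not taken from the article:

```json
{
  "name": "puppeteer-download-upload",
  "version": "1.0.0",
  "type": "module",
  "dependencies": {
    "puppeteer": "^21.0.0"
  }
}
```

Alternatively, you can keep the default CommonJS setup and name your script files with the `.mjs` extension.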

&lt;h2&gt;
  
  
  &lt;strong&gt;Performing download operations with Puppeteer&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To perform a download operation with Puppeteer, you need to choose a method for handling the download, specify the path where you want the file to be saved, and then trigger the download action by navigating to the page or link and clicking the download link.&lt;/p&gt;

&lt;p&gt;There are various methods and approaches you can use to perform download operations with Puppeteer. Here are two of them:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Intercepting network requests using &lt;code&gt;page.setRequestInterception(true)&lt;/code&gt;. You can use this to detect a request for a file download based on the content. If you want to learn more about this method, you should read about &lt;a href="https://pptr.dev/api/puppeteer.page.setrequestinterception" rel="noopener noreferrer"&gt;request interception&lt;/a&gt; in the Puppeteer documentation.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;await page.setRequestInterception(true);page.on('request', (interceptedRequest) =&amp;gt; { // Check the URL or content type to detect a download request if (interceptedRequest.url().endsWith('.pdf')) || interceptedRequest.url().endsWith('.jpg') { // Handle the download here... } interceptedRequest.continue();});// If the file download doesn't start right away, click a button to trigger itawait page.click('#downloadButton')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Browser contexts and &lt;code&gt;setDownloadBehavior&lt;/code&gt;. This is a more direct way to handle downloads with Puppeteer: once the download is triggered, the file is automatically saved to the specified directory or path. Earlier versions used the &lt;code&gt;page._client&lt;/code&gt; private API, but it was deprecated. Instead, you should create your own &lt;a href="https://pptr.dev/api/puppeteer.cdpsession" rel="noopener noreferrer"&gt;CDP session&lt;/a&gt; for direct access to the &lt;a href="https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-setDownloadBehavior" rel="noopener noreferrer"&gt;Chrome DevTools Protocol&lt;/a&gt;, like so:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const browser = await puppeteer.launch({ headless: false });const page = await browser.newPage()const client = await page.target().createCDPSession()await client.send('Page.setDownloadBehavior', { behavior: 'allow', downloadPath: '/path/to/save/downloads' });// If the file download doesn't start right away, click a button to trigger itawait page.click('#downloadButton')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each of the methods listed above is suited to different download scenarios. Just choose the one that best fits the project or task you're working on.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Tips for handling errors&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Handle timeout within your code to give more time for the download operation to be completed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To ensure the file is downloaded successfully, you can use &lt;a href="https://pptr.dev/api/puppeteer.page.waitforresponse" rel="noopener noreferrer"&gt;&lt;code&gt;page.waitForResponse&lt;/code&gt;&lt;/a&gt; or listen for the &lt;code&gt;targetcreated&lt;/code&gt; event.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can also check the specified directory manually to ensure the file exists and is of the expected size.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
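The second and third tips can be combined into a small helper. The sketch below is not from the original article; the file path, size threshold, and retry values are illustrative choices, and it reuses the same `delay` pattern as the examples later in this post. It polls the download directory until the file exists and reaches a minimum size:

```javascript
// Sketch: poll for a downloaded file until it exists and reaches a minimum size.
// Parameter defaults are illustrative values, not Puppeteer APIs.
import * as fs from 'fs';

function delay(time) {
  return new Promise((resolve) => setTimeout(resolve, time));
}

async function verifyDownload(filePath, minBytes = 1, retries = 5, waitMs = 1000) {
  for (let i = 0; i < retries; i++) {
    // The file may appear late or still be growing while the download
    // is in progress, so check both existence and size on every attempt.
    if (fs.existsSync(filePath) && fs.statSync(filePath).size >= minBytes) {
      return true;
    }
    await delay(waitMs);
  }
  return false;
}
```

After triggering a download you could call, for example, `await verifyDownload('./downloads/testdoc.pdf', 1024)` to confirm the download completed instead of checking the folder by hand.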

&lt;h2&gt;
  
  
  &lt;strong&gt;Uploading files with Puppeteer&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To perform an upload operation with Puppeteer, you'll need a method that lets you perform the file selection option, specify the path at which you want the file to be selected from, and then finally take the upload action.&lt;/p&gt;

&lt;p&gt;You can use the &lt;code&gt;elementHandle.uploadFile(...path)&lt;/code&gt; method that allows you to upload a file by providing the path, or you can use the &lt;code&gt;fileChooser&lt;/code&gt; method.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;FileChooser&lt;/code&gt; works with pages that open a file chooser dialog, while &lt;code&gt;elementHandle.uploadFile&lt;/code&gt; works directly with the file input element. The method you use depends on the scenario you're working with. If the webpage has a custom button or hides the original file input, &lt;code&gt;FileChooser&lt;/code&gt; is advisable. If you're dealing with a standard file input, &lt;code&gt;elementHandle.uploadFile&lt;/code&gt; is the better option.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;//the code sample of FileChooserconst [fileChooser] = await Promise.all([page.waitForFileChooser(), page.click('#customUploadButton')]);await fileChooser.accept(['/path/file.jpg']);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;








&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;//code sample for ElementHandle.uploadFileconst UploadElement = await page.$('input[type="file"]');await uploadElement.uploadFile('/path/file.jpg');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After deciding on the method, specify the path and then trigger the upload action (navigate to the page or link, click the upload button, and submit).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Best practices for secure file uploads&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Check the file type required for upload and make sure you're uploading the correct file type.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Check and verify the file input and button selectors.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Monitor and handle network requests accordingly.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Examples of using upload and download&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In this section, you'll be trying out the upload and download options available in Puppeteer.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Automating a file upload with Puppeteer&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In the code below, you'll set up your Puppeteer project as explained earlier, create a new file called &lt;code&gt;upload.js&lt;/code&gt;, import Puppeteer, and add a test PDF file (&lt;code&gt;testdoc.pdf&lt;/code&gt;) to the root folder for the purpose of the example.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import puppeteer from "puppeteer";// function to handle timeout for every action to be completedfunction delay(time) { return new Promise(resolve =&amp;gt; setTimeout(resolve, time));} const browser = await puppeteer.launch({ headless: false }); const page = await browser.newPage(); await page.goto('&amp;lt;https://easyupload.io/&amp;gt;'); await page.waitForSelector('input[type=file]'); const inputUploadHandle = await page.$('input[type=file]'); // path to the file you want to upload await inputUploadHandle.uploadFile('./testdoc.pdf'); await page.click('#upload'); // Introduce a timeout if necessary (in case of internet speed) await delay(20000); // Wait for a success message. await page.waitForSelector('.upload-success'); await browser.close();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;2. Automating the download of the file you uploaded earlier&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In the previous example, you uploaded a file to &lt;a href="https://easyupload.io" rel="noopener noreferrer"&gt;easyupload.io&lt;/a&gt;, after which you were given a download link. Copy that download link and use it to replace the URL in &lt;code&gt;page.goto('...')&lt;/code&gt;. Also, create a folder named &lt;code&gt;downloads&lt;/code&gt;. This is where your file will be downloaded.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import puppeteer from 'puppeteer'import * as fs from 'fs';//function to handle timeout function delay(time) { return new Promise(resolve =&amp;gt; setTimeout(resolve, time)); } const browser = await puppeteer.launch({ headless: false }); const page = await browser.newPage(); // Set download behavior const client = await page.target().createCDPSession() await client.send('Page.setDownloadBehavior', { behavior: 'allow', // the download path can be set to a folder in your project root downloadPath: './downloads' }); // Navigate to the download page. (change the download URL) await page.goto('&amp;lt;https://easyupload.io/x2na1r&amp;gt;'); // Download the file. await page.click('#hd1'); // Wait for the download to complete. Adjust this based on your network speed. await delay(10000);; //check the download folder to know if the downloaded file exists there if (fs.existsSync('./downloads/testdoc.pdf')) { console.log('file downloaded successfully!'); } else { console.log('Download failed.'); } await browser.close();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;3. Another code sample for upload and download actions with Puppeteer&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import puppeteer from 'puppeteer'//function to handle timeoutfunction delay(time) { return new Promise(resolve =&amp;gt; setTimeout(resolve, time));} const browser = await puppeteer.launch({ headless: false }); const page = await browser.newPage(); const client = await page.target().createCDPSession(); await client.send('Page.setDownloadBehavior', { behavior: 'allow', downloadPath: './downloads' }); await page.goto('&amp;lt;https://imgur.com/upload&amp;gt;'); const uploadSelector = '#file-input'; await page.waitForSelector(uploadSelector); const inputUploadHandle = await page.$(uploadSelector); await inputUploadHandle.uploadFile('./cap.jpeg'); //wait for the upload to be completed await delay(10000); // initiate the download process and click the download button const downloadLinkSelector = '.upload-download'; await page.waitForSelector(downloadLinkSelector); await page.click(downloadLinkSelector); //wait for the file download to await delay(10000); await browser.close();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can find some more examples of &lt;a href="https://docs.apify.com/academy/puppeteer-playwright/common-use-cases/downloading-files" rel="noopener noreferrer"&gt;downloading files in Puppeteer&lt;/a&gt; and &lt;a href="https://docs.apify.com/academy/puppeteer-playwright/common-use-cases/submitting-a-form-with-a-file-attachment" rel="noopener noreferrer"&gt;submitting a form with an attachment&lt;/a&gt; in the Apify docs.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;GitHub gists&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You can explore all the sample code in this tutorial in the GitHub gists below.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://gist.github.com/CodeLeom/bd4c7c17a0749044f7a1a9b606c3e5ae" rel="noopener noreferrer"&gt;Upload&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://gist.github.com/CodeLeom/6f3055951fca57d36819feb5143187cf" rel="noopener noreferrer"&gt;Download&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://gist.github.com/CodeLeom/c5ce35fec15359786045208ddc24a4e0" rel="noopener noreferrer"&gt;Upload and Download&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;💻&lt;a href="https://blog.apify.com/puppeteer-web-scraping-tutorial/" rel="noopener noreferrer"&gt;How to scrape the web with Puppeteer in 2023&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>puppeteer</category>
      <category>node</category>
      <category>automation</category>
    </item>
    <item>
      <title>Step-by-step guide to scraping Steam Store</title>
      <dc:creator>Apify</dc:creator>
      <pubDate>Wed, 15 Nov 2023 23:00:00 +0000</pubDate>
      <link>https://forem.com/apify/step-by-step-guide-to-scraping-steam-store-4kec</link>
      <guid>https://forem.com/apify/step-by-step-guide-to-scraping-steam-store-4kec</guid>
      <description>&lt;p&gt;&lt;strong&gt;Hey, we're Apify. You can run any scrapers on the Apify platform. Build your own or use one of&lt;/strong&gt; &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;&lt;strong&gt;1,500+ pre-built scrapers&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;for popular websites.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're looking to extract detailed information from Steam store pages, Steam Store Scraper on Apify is a tool designed to simplify your data collection process. This tutorial will walk you through the steps to configure and use Steam Store Scraper to gather data from Steam's front page, which you can then use for analysis or integration with other applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Getting started with Steam Store Scraper&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Steam Store Scraper is a straightforward &lt;a href="https://apify.com/actors" rel="noopener noreferrer"&gt;Apify Actor&lt;/a&gt;, crafted to scrape Steam's front page and individual store pages, extracting data like game titles, descriptions, prices, reviews, and more. The output is structured in a convenient dataset that can be exported in various formats, such as JSON, XML, CSV, and Excel.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fum864gexl0lcxtya42tp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fum864gexl0lcxtya42tp.png" alt="Steam Store Scraper on Apify Store" width="800" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Steam Store Scraper on Apify Store&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 1: Accessing Steam Store Scraper&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To begin, navigate to the &lt;a href="https://apify.com/m0uka/steam-store-scraper" rel="noopener noreferrer"&gt;Steam Store Scraper page on Apify Store&lt;/a&gt;. There you will find a brief description of the tool and a "Try for free" button, which lets you start using the scraper without a credit card. The scraper itself is free; you pay only for your usage of the Apify platform.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 2: Setting scraping input parameters&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once you're on the Steam Store Scraper page, switch to the &lt;code&gt;Input&lt;/code&gt; tab. This is where you will define the parameters for the data you wish to scrape. The input settings include options such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Only on sale:&lt;/strong&gt; Toggle this if you're only interested in games currently on sale.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Only released:&lt;/strong&gt; Use this filter to exclude games that haven't been released yet.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Expand the &lt;code&gt;Run options&lt;/code&gt; if you need to customize advanced settings like the proxy configuration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa15u66p19xfkmwecotj3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa15u66p19xfkmwecotj3.png" alt="Steam Store parameters" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Steam Store parameters&lt;/em&gt;&lt;/p&gt;
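&lt;p&gt;When you run the scraper via the API instead of the Console form, the same options are passed as a JSON input object. The sketch below is illustrative only; the field names are assumptions, not the Actor's documented input schema.&lt;/p&gt;

```javascript
// Hypothetical input object for Steam Store Scraper.
// Field names here are illustrative assumptions, not the documented schema.
const input = {
  onlyOnSale: true,    // scrape only games currently on sale
  onlyReleased: true,  // exclude games that have not been released yet
  proxyConfiguration: { useApifyProxy: true }, // run options: proxy settings
};

// The object is sent as the JSON body when starting a run over the API.
const body = JSON.stringify(input);
console.log(body);
```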

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 3: Running the scraper&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;After setting your desired parameters, click the &lt;code&gt;Save&lt;/code&gt; button to preserve your configuration. Then, hit the &lt;code&gt;Run&lt;/code&gt; button to initiate the scraping process. The tool will now start crawling through the Steam store based on your specified criteria.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 4: Extracting the data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once the scraper has completed its run, you can download the dataset. Navigate to the &lt;code&gt;Runs&lt;/code&gt; tab to view the finished runs. Click on a specific run to see the results and choose the format for data export. For instance, if you need to integrate the data with a web application, you might opt for JSON.&lt;/p&gt;
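&lt;p&gt;The same dataset can also be fetched over the Apify API. A minimal sketch of building the export URL (the dataset ID below is a placeholder; the &lt;code&gt;format&lt;/code&gt; query parameter accepts values such as &lt;code&gt;json&lt;/code&gt; or &lt;code&gt;csv&lt;/code&gt;):&lt;/p&gt;

```javascript
// Build the export URL for the items of an Apify dataset.
// The dataset ID is shown on the run's detail page in Apify Console.
function buildDatasetExportUrl(datasetId, format = 'json') {
  return `https://api.apify.com/v2/datasets/${datasetId}/items?format=${format}`;
}

// 'YOUR_DATASET_ID' is a placeholder, not a real dataset ID.
const url = buildDatasetExportUrl('YOUR_DATASET_ID', 'json');
console.log(url);
```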

&lt;h2&gt;
  
  
  &lt;strong&gt;Example of extracted data&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is the kind of data you can expect from Steam Store Scraper:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;URL:&lt;/strong&gt; Direct link to the game's Steam page.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Title:&lt;/strong&gt; Name of the game.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Description:&lt;/strong&gt; A brief summary or pitch of the game.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Price:&lt;/strong&gt; Current price, along with sale information if applicable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Release date:&lt;/strong&gt; Official release date of the game.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Genres:&lt;/strong&gt; Game genres such as Action, RPG, or Strategy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Popular tags:&lt;/strong&gt; Tags that users have associated with the game.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reviews:&lt;/strong&gt; Recent and overall review status, like "Very Positive" or "Mixed".&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Languages:&lt;/strong&gt; Supported languages for interface, sound, and subtitles.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Developers and publishers:&lt;/strong&gt; Companies responsible for the game's development and distribution.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For instance, a sample entry for "Age of Empires IV: Anniversary Edition" would include its price, sale status, release date, genres, supported platforms, and more.&lt;/p&gt;
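&lt;p&gt;An illustrative dataset item might look like the object below. The field names and values are assumptions for demonstration, not actual scraper output.&lt;/p&gt;

```javascript
// Illustrative dataset item; field names and values are assumptions,
// not actual Steam Store Scraper output.
const sampleItem = {
  url: 'https://store.steampowered.com/app/0000000/', // placeholder URL
  title: 'Age of Empires IV: Anniversary Edition',
  description: 'Real-time strategy game set across historical campaigns.',
  price: { current: 39.99, currency: 'USD', onSale: false },
  releaseDate: '2021-10-28',
  genres: ['Strategy'],
  reviews: { overall: 'Very Positive' },
  languages: ['English', 'French', 'German'],
  developers: ['Relic Entertainment'],
};

console.log(sampleItem.title);
```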

&lt;h2&gt;
  
  
  &lt;strong&gt;How to use the extracted Steam data&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The data obtained from the Steam Store Scraper can be used for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Market analysis:&lt;/strong&gt; Understand trends, popular genres, and pricing strategies in the gaming market.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Competitive analysis:&lt;/strong&gt; Compare game features, pricing, and reviews to gauge the competition.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data integration:&lt;/strong&gt; Incorporate the dataset into your applications, websites, or research projects.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Or you could just use it to find your next game!&lt;/p&gt;

</description>
      <category>webscraping</category>
    </item>
    <item>
      <title>Add custom actions to your GPTs with Apify Actors</title>
      <dc:creator>Apify</dc:creator>
      <pubDate>Tue, 14 Nov 2023 23:00:00 +0000</pubDate>
      <link>https://forem.com/apify/add-custom-actions-to-your-gpts-with-apify-actors-3a77</link>
      <guid>https://forem.com/apify/add-custom-actions-to-your-gpts-with-apify-actors-3a77</guid>
      <description>&lt;p&gt;&lt;strong&gt;Hi, were&lt;/strong&gt; &lt;a href="https://apify.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;, a full-stack web scraping and browser automation platform. A big part of what we do is getting better&lt;/strong&gt; &lt;a href="https://apify.com/data-for-generative-ai" rel="noopener noreferrer"&gt;&lt;strong&gt;data for AI&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At the DevDays keynote on November 6, 2023, OpenAI released &lt;a href="https://openai.com/blog/introducing-gpts" rel="noopener noreferrer"&gt;GPTs&lt;/a&gt; - a new way to build custom AI agents based on ChatGPT - which can use custom instructions and actions provided by third-party APIs to give the agent new capabilities for new use cases.&lt;/p&gt;

&lt;p&gt;Apify now makes it easy to connect any of the 1,500+ Actors from &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;Apify Store&lt;/a&gt; to your GPTs to give them &lt;a href="https://blog.apify.com/what-is-web-scraping/" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt; and browser automation capabilities, such as fetching data from search engines, maps, social media, travel sites, or extracting data from any website.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What are GPTs?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;OpenAI &lt;a href="https://openai.com/blog/new-models-and-developer-products-announced-at-devday" rel="noopener noreferrer"&gt;released GPTs&lt;/a&gt; at the November 2023 DevDays keynote as a new way to build your own custom AI agents based on the ChatGPT model. GPTs enhance ChatGPT with custom text instructions to prime the model to behave a certain way using the following &lt;a href="https://blog.apify.com/what-is-retrieval-augmented-generation/" rel="noopener noreferrer"&gt;retrieval-augmented generation&lt;/a&gt; (RAG) extensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Built-in browser, data analytics, DALL-E 3 - they now run together in the same chat session; you don't need to pick just one as before.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;External knowledge imported as static files, such as TXT, JSON, PDF, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;External actions described using an OpenAPI specification and invoked using the model's &lt;a href="https://platform.openai.com/docs/guides/function-calling" rel="noopener noreferrer"&gt;function calling&lt;/a&gt; capability.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvi2tjvicmdob8nop1bol.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvi2tjvicmdob8nop1bol.png" alt="GPTs from OpenAI" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://openai.com/blog/introducing-gpts" rel="noopener noreferrer"&gt;GPTs&lt;/a&gt; from OpenAI&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Evolution of ChatGPT Plugins&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;GPTs are an iteration of the ChatGPT plugins introduced in March 2023. The big change is that while ChatGPT plugins can only use actions provided by an API running on the same domain as the plugin itself, the GPTs can perform actions from any API, even a third-party one.&lt;/p&gt;

&lt;p&gt;For example, the &lt;a href="https://docs.apify.com/platform/integrations/chatgpt-plugin" rel="noopener noreferrer"&gt;Apify ChatGPT plugin&lt;/a&gt; could only invoke actions running on &lt;code&gt;api.apify.com&lt;/code&gt;, but not on &lt;code&gt;api.example.com&lt;/code&gt;. The restriction also meant that other ChatGPT plugins couldn't use &lt;code&gt;api.apify.com&lt;/code&gt;. Since GPTs can be used with any API, this greatly expands the number of things that chat-based agents can do.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Examples of GPTs with actions&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;It's hard to keep track of what GPTs people are building (pro tip: the easiest way is to type &lt;code&gt;site:chat.openai.com/g&lt;/code&gt; into &lt;a href="https://www.google.com/search?q=site%3Achat.openai.com%2Fg" rel="noopener noreferrer"&gt;Google Search&lt;/a&gt;). Here are some interesting examples of GPTs that use custom actions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://chat.openai.com/g/g-LYDRCiZB9-yc-application-gpt" rel="noopener noreferrer"&gt;&lt;strong&gt;YC Application GPT&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;:&lt;/strong&gt; Automatically fills YC applications for you based on your website or pitch deck.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://chat.openai.com/g/g-n7Rs0IK86-grimoire" rel="noopener noreferrer"&gt;&lt;strong&gt;Grimoire&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;:&lt;/strong&gt; A coding GPT that lets you create a website from a sentence, or even from a paper sketch or screenshot.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://chat.openai.com/g/g-tBei7TkK0-spotify-explorer-gpt" rel="noopener noreferrer"&gt;&lt;strong&gt;Spotify Explorer GPT&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;:&lt;/strong&gt; Lets you drop a Spotify link to a song, playlist, user, or artist and explore it with insights.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://chat.openai.com/g/g-8OcWVLenu-calendar-assistant-gpt" rel="noopener noreferrer"&gt;&lt;strong&gt;Calendar GPT&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;:&lt;/strong&gt; Helps you prepare for your day. Powered by Zapier's AI Actions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://chat.openai.com/g/g-cwigWCh11-code-gpt-gpt" rel="noopener noreferrer"&gt;&lt;strong&gt;Code GPT GPT&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;:&lt;/strong&gt; Helps you understand the rules of the Code GPT repository at &lt;a href="https://github.com/Decron/Code-GPT/" rel="noopener noreferrer"&gt;https://github.com/Decron/Code-GPT/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Using Apify Actors as GPT actions&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Apify Actors can now automatically generate the OpenAPI specification for their API, which makes it extremely easy to call them from your GPTs. You can add any of the &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;1,500+ Actors published in Apify Store&lt;/a&gt; to your GPTs, or link your newly built Actors for your own use case.&lt;/p&gt;

&lt;p&gt;For example, here's how to create a new GPT agent that can scrape Google Search results and use them to answer user questions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Make sure you have an&lt;/strong&gt; &lt;a href="https://console.apify.com/sign-up" rel="noopener noreferrer"&gt;&lt;strong&gt;Apify account&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's free. No credit card required, and no time limit on the free plan. You get $5 worth of credit every month to get you started.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Go to&lt;/strong&gt; &lt;a href="https://console.apify.com/store" rel="noopener noreferrer"&gt;&lt;strong&gt;Apify Store&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;and pick&lt;/strong&gt; &lt;a href="https://console.apify.com/actors/nFJndFXA5zjCTuudP/console" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;Google Search Scraper&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc1og0aaeymrhj01qyosq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc1og0aaeymrhj01qyosq.png" alt="Apify Store" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. On the Actor detail page, select &lt;em&gt;API&lt;/em&gt; → &lt;em&gt;OpenAPI specification&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvpnxq2pobuldz6e5f2w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvpnxq2pobuldz6e5f2w.png" alt="Google Search Scraper Actor details" width="800" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Copy the JSON with OpenAPI specification into the clipboard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flih2jlfnramfy73rpe7v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flih2jlfnramfy73rpe7v.png" alt="Export of Actor's OpenAPI schema" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Go to the ChatGPT&lt;/strong&gt; &lt;a href="https://chat.openai.com/gpts/discovery" rel="noopener noreferrer"&gt;&lt;strong&gt;Explore&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;page and click "&lt;em&gt;Create a GPT&lt;/em&gt;"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyzhf886idkn88jjz99zr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyzhf886idkn88jjz99zr.png" alt="My GPTs view in ChatGPT" width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Configure your GPT&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Set its Name, Description, Instructions, and Conversation starters. Uncheck the &lt;em&gt;Web Browsing&lt;/em&gt; capability to avoid interference of the native web browser with Google Search.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkjeb2ar4dn6iqrnznzh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkjeb2ar4dn6iqrnznzh.png" alt="Configuration of a GPT" width="800" height="607"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Click the &lt;em&gt;Add actions&lt;/em&gt; button, and paste the OpenAPI specification of the Actor from step 4 into the Schema field.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Add a link to Apify's privacy policy &lt;a href="https://apify.com/privacy-policy" rel="noopener noreferrer"&gt;https://apify.com/privacy-policy&lt;/a&gt;, so that you can later publish your GPT.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxppsjyv5haxqlmkxcth3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxppsjyv5haxqlmkxcth3.png" alt="Add actions to a GPT" width="800" height="606"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. Open a new browser window, go to&lt;/strong&gt; &lt;a href="https://console.apify.com/account/integrations" rel="noopener noreferrer"&gt;&lt;strong&gt;Settings → Integrations&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;in Apify Console, and copy your Apify API token&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjpz3h42up9g4fp4or64d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjpz3h42up9g4fp4or64d.png" alt="Integrations page in Apify Console" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;9. Edit your GPT's Authentication settings&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the settings of your GPT, click Edit in the Authentication section, select the API key Authentication Type and the Bearer Auth Type, and paste your Apify API token.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmzil9r0wt2jr6hzvnotd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmzil9r0wt2jr6hzvnotd.png" alt="Authentication settings for a GPT action" width="800" height="690"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;10. Save and publish&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Save the Authentication settings, click Save on your GPT, and publish it to &lt;em&gt;Only people with a link&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmb49ngqvtzqfu24h96y3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmb49ngqvtzqfu24h96y3.png" alt="Publishing of a GPT" width="800" height="491"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And voilà! Your GPT is ready to use, and published at a link (&lt;a href="https://chat.openai.com/g/g-dGlk5uXNj-serp-scraper" rel="noopener noreferrer"&gt;like this one&lt;/a&gt;) that you can share with any ChatGPT user!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0bjratg0sf0c75iv82uu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0bjratg0sf0c75iv82uu.png" alt="Google Search scraper GPT example" width="800" height="857"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Benefits of Actors&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://apify.com/actors" rel="noopener noreferrer"&gt;Apify Actors&lt;/a&gt; are a way to package your code that makes it easy to share, integrate, and build upon. Actors are based on Docker images extended with an input and output schema to enable linking, automated generation of user interfaces, and sharing with other users. Now, they can also export the &lt;a href="https://swagger.io/specification/" rel="noopener noreferrer"&gt;OpenAPI specification&lt;/a&gt; for their invocation and return of results.&lt;/p&gt;

&lt;p&gt;Actors run on the Apify cloud platform, which provides compute, storage, and proxy services. It takes care of scaling, authentication, security, and billing. Apify Store features more than 1,500 ready-made Actors built by the community for various use cases, which you can use right away in your GPTs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxx7z7axudpffdisen39o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxx7z7axudpffdisen39o.png" alt="Apify Store" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apify also provides extensive documentation and open-source SDKs, including the popular library &lt;a href="https://crawlee.dev/" rel="noopener noreferrer"&gt;Crawlee&lt;/a&gt;. It also offers &lt;a href="https://apify.com/templates" rel="noopener noreferrer"&gt;Actor templates&lt;/a&gt; for JavaScript/TypeScript and Python, covering scraping libraries such as Puppeteer, Playwright, and Selenium. All of this means you can quickly build new Actors for your unique web scraping and automation projects and then incorporate them into your GPTs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv9jqz6ohe8iewqfbjfyv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv9jqz6ohe8iewqfbjfyv.png" alt="Example of Crawlee code" width="800" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Example of &lt;a href="https://crawlee.dev/" rel="noopener noreferrer"&gt;Crawlee&lt;/a&gt; code&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Limitations of Actors and GPTs&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Not every Actor is suitable for use as a GPT action.&lt;/p&gt;

&lt;p&gt;OpenAI enforces a timeout of 45 seconds for the action to return a result, and thus, the Actor needs to be able to return its output in that timeframe. Otherwise, users of the GPT will see timeout errors.&lt;/p&gt;

&lt;p&gt;The exported OpenAPI specification from Apify Console is intended for running the Actor with a custom input and synchronously returning its output dataset, using the &lt;a href="https://docs.apify.com/api/v2#/reference/actors/run-actor-synchronously-and-get-dataset-items/run-actor-synchronously-with-input-and-get-dataset-items" rel="noopener noreferrer"&gt;Run actor synchronously with input and get dataset items&lt;/a&gt; API endpoint. Therefore, the Actor needs to generate its output to the default dataset to provide results to the GPT. For Actors that generate their results to the key-value store, you'll need to manually update the OpenAPI specification to use the &lt;a href="https://docs.apify.com/api/v2#/reference/actors/run-actor-synchronously/with-input" rel="noopener noreferrer"&gt;Run Actor synchronously&lt;/a&gt; API endpoint.&lt;/p&gt;
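&lt;p&gt;The difference between the two endpoints is only the path suffix. A small sketch of building either URL (the helper name is ours, and the Actor ID is just an example):&lt;/p&gt;

```javascript
// Build the URL for running an Actor synchronously via the Apify API.
// With `withDatasetItems` set to true, the run's default dataset items
// are returned directly; otherwise the plain synchronous run endpoint is used.
function buildRunSyncUrl(actorId, withDatasetItems) {
  const suffix = withDatasetItems ? 'run-sync-get-dataset-items' : 'run-sync';
  return `https://api.apify.com/v2/acts/${actorId}/${suffix}`;
}

console.log(buildRunSyncUrl('apify~google-search-scraper', true));
```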

&lt;p&gt;The exported OpenAPI specification covers all input parameters, but in most cases, you only need the model to set a few and keep the rest at their default values. It can therefore make sense to edit the specification and remove irrelevant input parameters to avoid eating into the GPT model's context window.&lt;/p&gt;

&lt;p&gt;To use multiple Actors in a single GPT, you'll need to merge their respective OpenAPI specifications into one.&lt;/p&gt;
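&lt;p&gt;A minimal sketch of such a merge in JavaScript, assuming both specs use the same OpenAPI version and have no colliding paths or schema names (a real merge would also need to resolve conflicts):&lt;/p&gt;

```javascript
// Naively merge two OpenAPI specs by combining their paths and schemas.
// Assumes the specs share a version and have no name collisions.
function mergeOpenApiSpecs(specA, specB) {
  return {
    ...specA,
    paths: { ...specA.paths, ...specB.paths },
    components: {
      ...specA.components,
      schemas: {
        ...(specA.components?.schemas ?? {}),
        ...(specB.components?.schemas ?? {}),
      },
    },
  };
}
```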

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;GPTs are a new, exciting release from OpenAI, and we're curious to see what GPTs people will build, with or without Apify Actors. If you have any feedback or ideas for improvements to the Actor integration for GPTs, we'd love to hear from you! Simply email us at &lt;a href="mailto:ai@apify.com"&gt;ai@apify.com&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>chatgpt</category>
      <category>llms</category>
    </item>
    <item>
      <title>What is an anonymous proxy? A detailed guide for 2024</title>
      <dc:creator>Apify</dc:creator>
      <pubDate>Mon, 13 Nov 2023 23:00:00 +0000</pubDate>
      <link>https://forem.com/apify/what-is-an-anonymous-proxy-a-detailed-guide-for-2024-4iic</link>
      <guid>https://forem.com/apify/what-is-an-anonymous-proxy-a-detailed-guide-for-2024-4iic</guid>
      <description>&lt;p&gt;Have you ever felt like someone's watching every click you make online?&lt;/p&gt;

&lt;p&gt;In the digital realm, protecting your privacy and security isn't just important; it's essential. Think of anonymous proxies as your personal guardians, safeguarding your online persona by masking your actual IP address and giving you the freedom to surf the web under the radar.&lt;/p&gt;

&lt;p&gt;By the end of this comprehensive guide, you'll have a solid understanding of the intricacies of anonymous proxies, their types, benefits, risks, and how to set them up on various devices. So, get ready to become an anonymous proxy pro!&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Understanding anonymous proxies&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Serving as intermediary servers, anonymous proxies conceal the user's IP address while offering varying levels of anonymity. Their primary purpose is to provide online security and anonymity by altering network traffic and HTTP headers, safeguarding a client's identity. This makes an anonymous proxy a powerful tool for maintaining privacy and anonymity when browsing the internet.&lt;/p&gt;

&lt;p&gt;Proxy anonymity comes in different tiers to match all kinds of user needs and preferences. A transparent proxy, for example, can be used in educational institutions and public Wi-Fi networks, managing website traffic without hiding the user's IP address. On the other hand, highly anonymous proxies and elite proxies, due to their higher levels of anonymity, are more suited for users who prioritize online privacy.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Types of proxies&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Based on their level of anonymity and functionality, &lt;a href="https://blog.apify.com/types-of-proxies/" rel="noopener noreferrer"&gt;proxies can be grouped&lt;/a&gt; into three main types: elite, anonymous, and transparent proxies.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Elite proxies provide the highest security, completely masking both the user's IP address and the fact that a proxy is being used.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Anonymous proxies offer a mid-level of anonymity: they hide the user's real IP address but allow some headers to pass along to the server.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Transparent proxies don't provide anonymity to the user, allowing web activities to be traced. To keep the user entirely anonymous, a proxy might remove the following headers:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Authorization&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;From&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Proxy-Authorization&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Proxy-Connection&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Via&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;X-Forwarded-For&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
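&lt;p&gt;As an illustrative sketch (not how any particular proxy is implemented), removing these headers before forwarding a request might look like this:&lt;/p&gt;

```javascript
// Headers that can reveal the client's identity or the use of a proxy.
const REVEALING_HEADERS = [
  'authorization',
  'from',
  'proxy-authorization',
  'proxy-connection',
  'via',
  'x-forwarded-for',
];

// Return a copy of the request headers with the revealing ones removed.
function stripRevealingHeaders(headers) {
  return Object.fromEntries(
    Object.entries(headers).filter(
      ([name]) => !REVEALING_HEADERS.includes(name.toLowerCase())
    )
  );
}
```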

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ad8ee6nl5h2san3g1p6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ad8ee6nl5h2san3g1p6.jpg" alt="What is an anonymous proxy?" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How do anonymous proxies work?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Anonymous proxies enhance online privacy and security by channeling user requests via a &lt;a href="https://blog.apify.com/what-is-a-proxy-server/" rel="noopener noreferrer"&gt;proxy server&lt;/a&gt;, which replaces the user's original IP address with its own. This creates a layer of anonymity, making it seem as though the user is accessing the internet from a different location. Such proxies are crucial for personal and business use, offering a shield against data breaches and unwanted surveillance by concealing the user's digital footprint.&lt;/p&gt;

&lt;p&gt;In short, by routing requests through an anonymous proxy server and obscuring the user's IP address along the way, these services offer a higher level of protection in both personal and business applications, making an anonymous proxy an essential tool for maintaining online privacy.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Benefits of using anonymous proxies&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The use of anonymous proxies comes with many advantages, such as bypassing online censorship, accessing geo-limited content, and preserving user privacy. By concealing your IP address, anonymous proxies keep you anonymous while browsing and can help you bypass geo-restrictions to access content that is otherwise unavailable in your area.&lt;/p&gt;

&lt;p&gt;🔍 Let's examine the benefits of anonymous proxies in both personal and business scenarios.&lt;/p&gt;

&lt;p&gt;For &lt;em&gt;personal&lt;/em&gt; use, anonymous proxies enable users to bypass geo-restrictions and preserve online privacy. Bypassing geo-restrictions allows users to access content that is otherwise blocked in their region, such as streaming services, multiple social media accounts, or news websites.&lt;/p&gt;

&lt;p&gt;Other personal applications of anonymous proxies include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Improving the browsing experience&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Accessing restricted content&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Preserving online identity&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Conducting secure online transactions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Engaging in anonymous communication&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;💼 In the &lt;em&gt;business&lt;/em&gt; world, anonymous proxies find applications in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/what-is-web-scraping/" rel="noopener noreferrer"&gt;Web scraping&lt;/a&gt;: involves extracting data from websites, and anonymous proxies help prevent scraping bots from being blocked by the target website by masking the bots IP address.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Brand protection: anonymous proxies can be used to monitor online platforms and detect any unauthorized use of a brands intellectual property.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reputation intelligence missions: anonymous proxies can be used to gather information about a companys online reputation and monitor online discussions and reviews.&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/ZFHXj_lQnOQ"&gt;
&lt;/iframe&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;📣 Anonymous proxies can be used for brand protection, ensuring that businesses can monitor online activities and safeguard their online presence effectively.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Risks and limitations of anonymous proxies&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;While anonymous proxies offer advantages, they also come with risks and limitations, such as security vulnerabilities, reduced speeds, and questionable reliability. Free anonymous proxies can be slow and overused, may inject malware into your device or browser, and some are even set up as traps by governments or hackers to identify and compromise user data.&lt;/p&gt;

&lt;p&gt;Risk mitigation involves making an informed choice between free and premium proxies and selecting an appropriate proxy service.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;💰 Free vs. premium proxies&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Free proxies often come with risks and performance issues, unlike premium proxies, which are known for heightened security and effectiveness. Free proxies might compromise personal information or introduce malware to a user's system, whereas premium proxies ensure a more secure and stable online experience.&lt;/p&gt;




&lt;p&gt;Check out these &lt;a href="https://blog.apify.com/get-proxies-free-proxy-services/" rel="noopener noreferrer"&gt;10 free or low-cost proxy services&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;To ensure optimal security and performance, it's crucial to choose a reputable proxy provider and consider factors such as the provider's reputation, the types of proxies they offer, and the locations of their servers.&lt;/p&gt;




&lt;p&gt;Read more about &lt;a href="https://blog.apify.com/automatically-scrape-free-proxy-lists-to-check-for-working-proxies/" rel="noopener noreferrer"&gt;how to scrape free proxy lists&lt;/a&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Choosing the right proxy service&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Choosing a suitable anonymous proxy and service is essential to ensure optimal security and efficiency. When choosing an anonymous proxy service, consider the following factors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Compatibility&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Proxy types&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Anonymity levels&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;IP pools and performance&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use case and budget&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can ensure a secure and efficient browsing experience by assessing your needs against the offerings of different proxy providers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1lrh2h7rap9zygcdayb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1lrh2h7rap9zygcdayb.jpg" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💓&lt;a href="https://blog.apify.com/best-proxy-providers/" rel="noopener noreferrer"&gt;&lt;strong&gt;Best proxy providers&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;🔧 Setting up and configuring anonymous proxies&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To set up and configure anonymous proxies on various devices and browsers, you'll need to adjust the settings within the browser or device. This includes altering the proxy settings, activating the proxy, and entering the proxy server address.&lt;/p&gt;

&lt;p&gt;We'll now focus on the configuration of anonymous proxies on Windows and Mac computers, various popular web browsers, and the web server settings for mobile devices.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Browser settings&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To configure anonymous proxies in popular browsers like Chrome, Firefox, and Safari, you'll need to adjust their respective proxy settings. In Chrome, open Settings from the toolbar menu, go to System (under Advanced), click Open your computer's proxy settings, and select Manual proxy setup. Enter the address of your proxy server and the port number, then click Save.&lt;/p&gt;

&lt;p&gt;In Safari, go to the toolbar and select Preferences, open the Advanced tab, select Proxies, and then click Change settings.&lt;/p&gt;

&lt;p&gt;📱 Setting up anonymous proxies on mobile devices, including Android and iOS, may require a different approach. There isn't a single, uniform process for configuring proxies on mobile, and for Android devices it's generally recommended to &lt;a href="https://nordvpn.com/blog/how-to-use-a-vpn/" rel="noopener noreferrer"&gt;use a VPN&lt;/a&gt; service instead.&lt;/p&gt;

&lt;p&gt;For iOS devices, you can configure proxy settings within the Wi-Fi settings by entering the server address, port number, and any necessary authentication information.&lt;/p&gt;
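Whichever device you configure, the values you enter (host, port, optional credentials) usually end up combined into a standard proxy connection string. A minimal sketch, where the function and parameter names are our own for illustration:

```javascript
// Sketch: assemble the scheme://username:password@host:port connection
// string that most manual proxy configurations expect.
// buildProxyUrl and its parameters are hypothetical names.
function buildProxyUrl({ scheme = 'http', host, port, username, password }) {
  // Credentials must be percent-encoded so characters like '@' don't
  // break the URL structure.
  const auth = username
    ? `${encodeURIComponent(username)}:${encodeURIComponent(password)}@`
    : '';
  return `${scheme}://${auth}${host}:${port}`;
}
```

For example, `buildProxyUrl({ host: 'proxy.example.com', port: 8000, username: 'user', password: 'p@ss' })` yields `http://user:p%40ss@proxy.example.com:8000`, the same shape of URL most proxy providers hand out.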

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8no57azhac0v43w51aj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8no57azhac0v43w51aj.jpg" alt="Setting up and configuring anonymous proxies" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;🔎 Comparing anonymous proxies and VPNs&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Both anonymous proxies and VPNs offer a degree of anonymity and security. However, there are some key differences between the two. VPNs, such as &lt;a href="https://cybernews.com/best-vpn/nordvpn-vs-expressvpn/" rel="noopener noreferrer"&gt;ExpressVPN or NordVPN&lt;/a&gt;, offer more features and greater security, as they encrypt all of your internet traffic, while proxy servers only conceal your IP address for specific websites or applications.&lt;/p&gt;

&lt;p&gt;Now let's delve into the distinctions in anonymity, security, speed, and performance between anonymous proxies and VPNs.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Anonymity and security&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;While both anonymous proxies and &lt;a href="https://www.privacyjournal.net/top-best-vpn/" rel="noopener noreferrer"&gt;VPNs provide anonymity&lt;/a&gt;, VPNs offer better security through encryption. VPNs secure all network traffic from a single IP address, while anonymous proxies can make use of multiple IP addresses.&lt;/p&gt;

&lt;p&gt;Additionally, VPNs encrypt traffic, providing a higher degree of security in comparison to anonymous proxies, which do not offer the same level of encryption.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Speed and performance&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;VPNs may offer better speed and performance compared to anonymous proxies, depending on the provider and server locations. Generally, VPNs are faster because they are built to manage and encrypt all internet traffic, providing a more effective and secure connection. Proxies, on the other hand, tend to only function with a single application or service, which can lead to slower speeds.&lt;/p&gt;

&lt;p&gt;Factors that can impact the speed and performance of both anonymous proxies and VPNs include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Server location&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Number of users connected to the server&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Type of encryption employed&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Type of traffic being transmitted&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Troubleshooting common proxy issues&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Anonymous proxies often come with issues like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;a high number of website IP bans&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;proxy provider failures&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;improper configuration for a specific browser or operating system&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;proxy server errors like HTTP error status codes.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Addressing these issues may require handling "anonymous proxy detected" errors and ensuring the optimal performance of the proxy.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Resolving "anonymous proxy detected" errors&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If you encounter an "anonymous proxy detected" error message, you might need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Close all applications before using proxies&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Clear the cookies and cache of the browser&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Update the proxy rotation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Attempt shorter intervals per session&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
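The "update the proxy rotation" step above can be sketched as a simple round-robin rotator that retires proxies once they trigger a detection error. The class and method names below are hypothetical, for illustration only:

```javascript
// Sketch of proxy rotation: cycle through a pool round-robin and skip
// proxies that have triggered an "anonymous proxy detected" block.
class ProxyRotator {
  constructor(proxyUrls) {
    this.pool = proxyUrls.slice(); // defensive copy of the pool
    this.blocked = new Set();
    this.index = 0;
  }

  // Return the next proxy that is not blocked, or null if none remain.
  next() {
    for (let i = 0; i !== this.pool.length; i += 1) {
      const candidate = this.pool[this.index];
      this.index = (this.index + 1) % this.pool.length;
      if (!this.blocked.has(candidate)) {
        return candidate;
      }
    }
    return null;
  }

  // Call this when a proxy hits an "anonymous proxy detected" error.
  markBlocked(proxyUrl) {
    this.blocked.add(proxyUrl);
  }
}
```

A scraper would call `next()` before each session and `markBlocked()` whenever a request comes back with a detection error, so blocked addresses drop out of the rotation automatically.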

&lt;p&gt;These measures can help address the error message and ensure a smoother browsing experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;🔑 Key takeaways&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Anonymous proxies enhance security and anonymity by modifying network traffic and HTTP headers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Different types of proxies offer different levels of privacy, making them beneficial for personal and business use cases.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When choosing a proxy service, consider compatibility, type, level of anonymity, and performance to ensure optimal results.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;That's a wrap! Now you know how proxy servers work 🦾&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To ensure optimal proxy performance, choose a reputable provider with diverse servers, reliable performance, and secure connections. Regularly update your proxy settings for a secure and high-performing connection. If proxy settings are not functioning as intended, verify that the proxy settings are properly configured and that the connection is secure.&lt;/p&gt;

&lt;p&gt;Anonymous proxies are powerful tools for maintaining online privacy and anonymity. By understanding their types, benefits, risks, and how to set them up on various devices, you can harness their potential to safeguard your online identity and access restricted content. However, it is essential to weigh the pros and cons of using proxy servers versus VPNs and to choose a reputable provider for optimal security and performance.&lt;/p&gt;

&lt;p&gt;Check out &lt;a href="https://apify.com/proxy" rel="noopener noreferrer"&gt;Apify Proxy&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>proxy</category>
      <category>webcrawling</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Connecting web scrapers: a guide to Actor-to-Actor integrations</title>
      <dc:creator>Apify</dc:creator>
      <pubDate>Sun, 12 Nov 2023 23:00:00 +0000</pubDate>
      <link>https://forem.com/apify/connecting-web-scrapers-a-guide-to-actor-to-actor-integrations-1d83</link>
      <guid>https://forem.com/apify/connecting-web-scrapers-a-guide-to-actor-to-actor-integrations-1d83</guid>
      <description>&lt;p&gt;If you're an Apify user, you're no doubt familiar with the serverless cloud programs (or micro-apps) we call &lt;a href="https://apify.com/actors" rel="noopener noreferrer"&gt;&lt;strong&gt;Actors&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You're also probably aware that you can &lt;a href="https://apify.com/integrations" rel="noopener noreferrer"&gt;&lt;strong&gt;integrate Actors&lt;/strong&gt;&lt;/a&gt; and their tasks with your favorite web apps and cloud services. But now it's possible to &lt;a href="https://docs.apify.com/platform/integrations/actors" rel="noopener noreferrer"&gt;&lt;strong&gt;integrate Actors with other Actors&lt;/strong&gt;&lt;/a&gt;. That means you can reuse existing Actors instead of building your own to complete certain processes.&lt;/p&gt;

&lt;p&gt;Say you have one Actor with a dataset containing URLs and another that takes URLs and downloads them as images. Integrating the two makes the whole process of retrieving and downloading images so much easier than it used to be.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4d90kt4701efldosazaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4d90kt4701efldosazaf.png" alt="Connecting two Actors can simplify your workflow" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Connecting two Actors can simplify your workflow&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You may have noticed that some Actors have an Actor-specific integration available. For example, &lt;a href="https://apify.com/compass/crawler-google-places" rel="noopener noreferrer"&gt;&lt;strong&gt;Google Maps Scraper&lt;/strong&gt;&lt;/a&gt; has the &lt;a href="https://apify.com/geneea-analytics/reviews-text-nlp-analyzer" rel="noopener noreferrer"&gt;&lt;strong&gt;AI Text Analyzer for Google Reviews&lt;/strong&gt;&lt;/a&gt; as a recommended integration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3nlbs9c9fznpv9pbddnf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3nlbs9c9fznpv9pbddnf.png" alt="AI Text Analyzer under Google Maps Scraper's integration tab" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;AI Text Analyzer under Google Maps Scraper's integration tab&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But this doesn't mean you're limited to integrating just this one Actor. You can connect any Actor or task with the Apify integration.&lt;/p&gt;

&lt;p&gt;In this step-by-step guide, we're going to show you an example by connecting two Actors via the Apify integration.&lt;/p&gt;

&lt;p&gt;Prefer video? Watch this instead:&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/zExnYbvFoBM"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 1. Choose an Actor&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Start by going straight to &lt;a href="https://console.apify.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Apify Console&lt;/strong&gt;&lt;/a&gt;. If you're not an Apify user, you'll need to sign up for a free account before you start.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🆓&lt;a href="https://console.apify.com/sign-up" rel="noopener noreferrer"&gt;&lt;strong&gt;Sign up for free&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We're going to choose &lt;a href="https://apify.com/apify/cheerio-scraper" rel="noopener noreferrer"&gt;&lt;strong&gt;Cheerio Scraper&lt;/strong&gt;&lt;/a&gt; with the purpose of connecting it with the &lt;a href="https://apify.com/zuzka/download-images-from-dataset" rel="noopener noreferrer"&gt;&lt;strong&gt;Download Images from Dataset&lt;/strong&gt;&lt;/a&gt; Actor.&lt;/p&gt;

&lt;p&gt;What we want to do is grab all image URLs from the &lt;a href="https://blog.apify.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Apify Blog&lt;/strong&gt;&lt;/a&gt; (because it's awesome) with &lt;strong&gt;Cheerio Scraper&lt;/strong&gt; (because it's super fast) and then download them into a zip file using &lt;strong&gt;Download Images from Dataset&lt;/strong&gt;. So, we're using one Actor to grab the URLs of the images, and with the other, we're downloading the images (not just the URLs) as a zip file.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 2. Create a task with the Actor&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;First, you need to create a task. Tasks are great for organizing your inputs, especially if you want to connect more than two Actors.&lt;/p&gt;

&lt;p&gt;In this example, we'll create a task with Cheerio Scraper, which we'll call &lt;em&gt;Apify Blog Image Grabber.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo665a5y8s50xhybjzeik.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo665a5y8s50xhybjzeik.png" alt="Hit " width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hit "Create new task" at the top of the Actor input page&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Next, we'll put the URL for the Apify Blog in the Start URLs field.&lt;/p&gt;

&lt;p&gt;We have no need for &lt;em&gt;Glob patterns&lt;/em&gt; or &lt;em&gt;Link selectors&lt;/em&gt;, so we'll leave those blank. What we do need is code in the &lt;em&gt;Page function&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2of2oyqusapakj7xyz0u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2of2oyqusapakj7xyz0u.png" alt="Edit your scraper's input accordingly" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Edit your scraper's input accordingly&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here, we're creating a &lt;code&gt;pageFunction&lt;/code&gt; where we're iterating through all images on the page, getting a URL of each image, and pushing the full image URL to a dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F16ri1lw2rztqfw6dci00.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F16ri1lw2rztqfw6dci00.png" alt="Our code will push image URLs found on the website to a dataset" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Our code will push image URLs found on the website to a dataset&lt;/em&gt;&lt;/p&gt;
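Our rough reconstruction of the idea behind that code (not the exact snippet from the screenshot): gather every `img` src on the page, resolve relative paths against the page URL, and return the result so the scraper stores it in the dataset. The helper name `collectImageUrls` is ours, for illustration:

```javascript
// Hypothetical helper: resolve each img src against the page URL,
// dropping empty ones, so relative paths become full image URLs.
function collectImageUrls(srcs, pageUrl) {
  return srcs
    .filter((src) => Boolean(src))
    .map((src) => ({ url: new URL(src, pageUrl).href }));
}

// Inside Cheerio Scraper, a pageFunction along these lines could use it:
async function pageFunction(context) {
  const { $, request } = context;
  const srcs = $('img')
    .map((i, el) => $(el).attr('src'))
    .get();
  // Returned objects are stored in the run's default dataset.
  return collectImageUrls(srcs, request.url);
}
```

Resolving against `request.url` matters because blog pages often use relative image paths, and the downstream Download Images Actor needs absolute URLs.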

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 3. Choose the Actor to integrate with&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now go to &lt;strong&gt;Integrations&lt;/strong&gt;, click the &lt;strong&gt;Apify integration&lt;/strong&gt;, and then choose the Actor you want to integrate it with.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuz9nspe29s6jqxy7k660.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuz9nspe29s6jqxy7k660.png" alt="Connect the Actor you want to integrate" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Connect the Actor you want to integrate&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The first Actors you'll see (in alphabetical order) are integration-ready Actors. You can also find these in &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;&lt;strong&gt;Apify Store&lt;/strong&gt;&lt;/a&gt; under the &lt;em&gt;Integrations&lt;/em&gt; category.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4kgxecb9fqoxhogh9wm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4kgxecb9fqoxhogh9wm.png" alt="You can find integration-ready Actors in Apify Store" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;You can find integration-ready Actors in Apify Store&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We'll select &lt;em&gt;Download Images From Dataset&lt;/em&gt; and click &lt;em&gt;Connect&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 4. Choose a trigger&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now you can choose the &lt;em&gt;Trigger&lt;/em&gt; that will start the Download Images Actor. We'll stick with &lt;em&gt;Run succeeded&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Because we're using an integration, the dataset ID field is already prefilled with the dataset ID that Cheerio Scraper will generate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fihy8vxdpjh4uwfzjltim.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fihy8vxdpjh4uwfzjltim.png" alt="Choose a trigger to run the second Actor: note the Dataset ID is prefilled from Cheerio Scraper!" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Choose a trigger to run the second Actor: note the Dataset ID is prefilled from Cheerio Scraper!&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 5. Save and test settings&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now you can save the integration settings. Once saved, you can test the integration with the test button. You have multiple test settings. We'll test it with &lt;em&gt;last run&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft904y7na1hz5roi0844b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft904y7na1hz5roi0844b.png" alt="Test your integrations via the Test button" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Test your integrations via the Test button&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can see the test results in the &lt;em&gt;log&lt;/em&gt; underneath.&lt;/p&gt;

&lt;p&gt;If you go back to the integrations tab, you can see how the Actors and tasks are connected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc8vvqqizs63j1dgf32jm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc8vvqqizs63j1dgf32jm.png" alt="View the schema of your Actor connections in the task's Integrations tab" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;View the schema of your Actor connections in the task's Integrations tab&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So now, when we start the &lt;em&gt;Apify Blog Image Grabber&lt;/em&gt; task and go to our runs, we can see that the image downloader was also triggered.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3e9ol0ndu8ta7qjpesms.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3e9ol0ndu8ta7qjpesms.png" alt="See how the relevant Actors are triggered in the Runs tab" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;See how the relevant Actors are triggered in the Runs tab&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Since the run is finished, let's look at the results. If we go to &lt;em&gt;Storage&lt;/em&gt; and the key-value store, we can see our images archive.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feqte8ey7vrmlm9rf9g74.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feqte8ey7vrmlm9rf9g74.png" alt="You can find your pictures in the image-archive file in the run's Key-value store" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;You can find your pictures in the image-archive file in the run's Key-value store&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now we can download it to our device.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Bonus step: schedule tasks&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you want to automate your workflow further, you can schedule your integration-infused tasks to run at specific times.&lt;/p&gt;

&lt;p&gt;See how to set up a schedule in our video on the topic:&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/GRFW_Loo2dk"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Which Actors do you want to integrate?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;So, now you know how it works, which Actors will you choose to integrate first? There are well over a thousand pre-built web scraping and automation tools in Apify Store to choose from. Take your pick!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🛒&lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;&lt;strong&gt;Browse Apify Store&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Create integration-ready Actors&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Don't forget, you can build your own Actors, run them locally or on Apify's cloud platform, and publish them in Apify Store to reach people who need your solution and get paid.&lt;/p&gt;

&lt;p&gt;If you want to know how to build integration-ready Actors, the &lt;a href="https://docs.apify.com/platform/integrations/actors/integration-ready-actors" rel="noopener noreferrer"&gt;&lt;strong&gt;guidelines in our documentation&lt;/strong&gt;&lt;/a&gt; will show you what to do.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;😱 &lt;a href="https://docs.apify.com/platform/integrations/actors/integration-ready-actors" rel="noopener noreferrer"&gt;&lt;strong&gt;Create integration-ready Actors&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>integration</category>
      <category>tutorial</category>
      <category>automation</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>How to scrape Google search results</title>
      <dc:creator>Apify</dc:creator>
      <pubDate>Fri, 10 Nov 2023 23:00:00 +0000</pubDate>
      <link>https://forem.com/apify/how-to-scrape-google-search-results-1mnm</link>
      <guid>https://forem.com/apify/how-to-scrape-google-search-results-1mnm</guid>
<description>&lt;p&gt;&lt;strong&gt;Hey, we're&lt;/strong&gt; &lt;a href="https://apify.it/web-scraping" rel="noopener noreferrer"&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;, and we've been scraping the web for over 8 years. You can build, deploy, share, and monitor fast, reliable web scrapers on the Apify platform.&lt;/strong&gt; &lt;a href="https://apify.it/web-scraping" rel="noopener noreferrer"&gt;&lt;strong&gt;Check us out&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Does Google even need an introduction? If our phones have become extensions of our hands, Google is one of the main reasons for that evolution. Google is a synonym for answers, speed, and accessibility - but also for billions of dollars, users, clicks, searches, and &lt;a href="https://lewisdgavin.medium.com/googles-data-footprint-will-blow-your-mind-2237cf8e0d4" rel="noopener noreferrer"&gt;billions of terabytes&lt;/a&gt; of data. That data can be extracted automatically and effortlessly if you know the proper methods and have the right tools at hand.&lt;/p&gt;

&lt;p&gt;In this brief how-to article, we're going to show you exactly how to scrape the most extensive library in the world by using a ready-made tool on the Apify platform called &lt;a href="https://apify.com/apify/google-search-scraper" rel="noopener noreferrer"&gt;Google Search Results Scraper&lt;/a&gt;. This is your step-by-step guide to scraping any information available from Google, including organic and paid results, ads, queries, People Also Ask boxes, prices, and reviews. Let's get started!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🪚 &lt;a href="https://apify.com/apify/google-search-scraper" rel="noopener noreferrer"&gt;&lt;strong&gt;Try out Google SERP Scraper&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;🕵 What are Google SERPs?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A Google SERP is a page containing the list of search results that Google displays to you when you type in your query and hit Enter. SERP, in this case, stands for &lt;a href="https://en.wikipedia.org/wiki/Search_engine_results_page" rel="noopener noreferrer"&gt;Search Engine Results Page&lt;/a&gt;, and you'll find SERPs not only on Google, which controls 90% of the search engine market, but also on other search engines, such as Bing, Yahoo, and others. We need to know this term to understand how to use &lt;a href="https://blog.apify.com/what-is-web-scraping/" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt; on the Google Search Engine. You can consider the terms &lt;code&gt;Google page&lt;/code&gt;, &lt;code&gt;Google search page&lt;/code&gt;, and &lt;code&gt;Google SERP&lt;/code&gt; to be equal and interchangeable, but we'll stick with &lt;code&gt;Google SERP&lt;/code&gt; for the sake of being technically correct.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What do Google SERPs look like?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Google SERPs have changed a lot over the years, with the most prominent features being those infoboxes we all know too well - Knowledge Graphs, Carousels, Featured Snippets - so ubiquitous these days that we can't imagine the Google SERP interface looking any other way. Those now-classic Google SERP features were part of the &lt;a href="https://en.wikipedia.org/wiki/Google_Hummingbird" rel="noopener noreferrer"&gt;Hummingbird algorithm&lt;/a&gt; release in 2013.&lt;/p&gt;

&lt;p&gt;It's a far cry from the 2003 version of Google results. Does this prehistoric SERP interface ring a bell? Luckily, we're not there anymore.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh6.googleusercontent.com%2FC3CGRotvHqIJ-oj7wmHrWZ9KDvYBgCmVxpEQyuV5FQt0tGVMncFrQoZpkvzXjMLcLxD3mDK216dctfeTNFEf7UQ1EKk_JVOsNp7kKAuan41V1klvJa8LV_V5cCCP0ke7-kFC8ZxM%3Ds0" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh6.googleusercontent.com%2FC3CGRotvHqIJ-oj7wmHrWZ9KDvYBgCmVxpEQyuV5FQt0tGVMncFrQoZpkvzXjMLcLxD3mDK216dctfeTNFEf7UQ1EKk_JVOsNp7kKAuan41V1klvJa8LV_V5cCCP0ke7-kFC8ZxM%3Ds0" width="760" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Google SERP interface in the beginning of the 2000s&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How do you scrape Google SERP?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To scrape Google search results, we first need to understand how Google sees and prioritizes our searches. When you search for things on Google, what you see is not just an index of pages with URLs or so-called &lt;code&gt;organic searches&lt;/code&gt;. While it used to be like that in the past (as seen above), the primary purpose and driving force of Google - or any search engine for that matter - has always been to have your queries answered as quickly and efficiently as possible, and in a way that will attract your attention but be easy on the eyes.&lt;/p&gt;

&lt;p&gt;That's why, over time, Google search results have become much more multilayered, including results of varying complexity and formats, like a giant layer cake. And that cake-like structure isn't going away any time soon, with voice command search, apps, and mobile search introducing significant changes to the way we google stuff. Today, Google search results consist of various levels, depending on the complexity and type of search, as you can see in this example of a &lt;em&gt;James Webb Space Telescope&lt;/em&gt; query 🔭&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.apify.com%2Fcontent%2Fimages%2F2022%2F09%2Fscreencapture-google-search-2022-09-07-16_07_35-edit.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.apify.com%2Fcontent%2Fimages%2F2022%2F09%2Fscreencapture-google-search-2022-09-07-16_07_35-edit.png" alt="Search results for James Webb Telescope" width="800" height="2146"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Search results for James Webb Telescope&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What are the elements of a Google search page?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;So these days, Google SERP is packed with various content: featured snippets, so-called snap packs, ads, and organic results. Additional types may also show up: product ads, related searches, and multiple snap pack types (Wikipedia, Google Maps, YouTube videos, etc.).&lt;/p&gt;

&lt;p&gt;The elements you receive will depend on the type of search query. For instance, as the James Webb Telescope is rather scientific and educational content, you won't see paid ads there or a carousel with products. But you will see those when googling everyday objects like shoes or headphones. &lt;strong&gt;Leveraging this complicated Google structure means giving access to an incredible amount of useful data, all of which you can scrape.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;🔧 How to use data extracted from Google&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Google is the main entry point to the internet for billions of people. This makes appearing in Google Search results a key factor for almost every business. And Google reviews and ratings have a massive impact on local businesses' online profiles. Marketing agencies, especially those with a large number of clients from various industries, rely heavily on obtaining reliable &lt;a href="https://www.dealify.com/all-in-one-seo-tools/" rel="noopener noreferrer"&gt;SEO tools&lt;/a&gt;, including advanced &lt;a href="https://simplified.com/ai-tools/" rel="noopener noreferrer"&gt;AI tools&lt;/a&gt;. These are a means not only of performing multiple tasks effectively but also of managing and analyzing results. You can look for things like how the top-ranking pages write their page titles, the keywords they're targeting, and how they format their content, or take it a stage further and do some deeper link analysis.&lt;/p&gt;

&lt;p&gt;Typical use cases for Google Search scraping are, among thousands of others:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Analyze Google algorithm&lt;/strong&gt; and identify its main trends&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Gain insights for search engine optimization&lt;/strong&gt; (SEO): monitor how your website performs in Google for specific queries over a period of time&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Analyze ads&lt;/strong&gt; ranking for a given set of keywords&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitor competition&lt;/strong&gt; in both organic and paid results&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Build a URL list for specific keywords.&lt;/strong&gt; This is useful if you, for example, need good relevant starting points when scraping web pages containing specific phrases.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Is it legal to scrape Google?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Google search results fall into the category of publicly available data, so scraping Google search results is legal. But there is still some data you should not be collecting, such as personal information or copyrighted content. Learn more about the regulations and laws connected to scraping in our &lt;a href="https://blog.apify.com/is-web-scraping-legal/" rel="noopener noreferrer"&gt;article on the legality of web scraping&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;🤖 Can I use AI to scrape Google?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;AI is currently unable to scrape websites directly, but it can help generate code for scraping Google if you prompt it with the target elements you want to scrape. Note that the code may not be functional, and website structure and design changes may impact the targeted elements and attributes.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;🦾 Does Google search have an API?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Scraping Google search results is how you can create your own Google SERP API (read &lt;a href="https://blog.apify.com/what-is-an-api/" rel="noopener noreferrer"&gt;What is an API?&lt;/a&gt; for more info on APIs) to extract data from Google. But why? Technically, you can fish out some insights into the way Google works and displays results without the need to use any specific tools: just google your keyword and see what you get. Now google the same thing in incognito mode and see what you get again. But there are two problems with this approach.&lt;/p&gt;

&lt;p&gt;First, the process is pretty &lt;strong&gt;time-consuming&lt;/strong&gt; to do manually and at scale - an inefficient monkey job, essentially. Second, the results you get &lt;strong&gt;can't be considered objective&lt;/strong&gt;, even if you're using incognito mode. At the beginning of the 2000s, when Google SERPs were first introduced, they looked much the same to every user of each country's localized Google version. Now Google's algorithms have evolved to really zero in on the most relevant results, so they give out customized results tailored to each user. Google search results these days take into account many factors, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Type of device 📱&lt;/strong&gt; If a user searches using their smartphone, the search results will look different since, &lt;a href="https://developers.google.com/search/mobile-sites/mobile-first-indexing" rel="noopener noreferrer"&gt;starting in 2015&lt;/a&gt;, Google prefers showing mobile-optimized web pages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Registration 🔒&lt;/strong&gt; If a Google user is logged into their account, what they see on SERPs will be aligned with their history and user behavior, provided that's allowed within their data-related settings.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Browser history 📖&lt;/strong&gt; If a user rarely empties their browser cache, Google will include that information concerning previous search queries with cookies and adjust the results.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Location 📍&lt;/strong&gt; If the geolocalization option is activated, Google aligns the SERPs with the user's location. That's why search results for the &lt;em&gt;sushi takeaway&lt;/em&gt; query in Prague will be different from those in Los Angeles. If we're talking about local search, the results will be a combination of data from Google Search and Google Maps.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc64g20thg5r2t7u11s6q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc64g20thg5r2t7u11s6q.png" width="800" height="878"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22tw489eve5mbtncx68x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22tw489eve5mbtncx68x.png" width="800" height="875"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Just look at how differently Google shows the results for&lt;/em&gt;&lt;/strong&gt; hot air balloon &lt;strong&gt;&lt;em&gt;for users in Australia and Ukraine. Google's algorithm is highly adaptable to location, browser history, and device type.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What about the official Google Search API?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;That's a funny question. Google doesn't provide its own SERP API for web search - so Google doesn't make it that easy to extract data from Google at scale. Moreover, only a limited subset of the information available on any search results page can be provided to you via Google services such as &lt;a href="https://ads.google.com/" rel="noopener noreferrer"&gt;Google Ads&lt;/a&gt; or &lt;a href="https://analytics.google.com/analytics/" rel="noopener noreferrer"&gt;Google Analytics&lt;/a&gt;. The two official methods suggested by Google for getting data are the Google Custom Search API (&lt;a href="https://developers.google.com/search/blog/2020/12/search-console-api-updates" rel="noopener noreferrer"&gt;deprecated in April 2018&lt;/a&gt;) and scraping via the &lt;a href="https://developers.google.com/apps-script/reference/url-fetch/url-fetch-app" rel="noopener noreferrer"&gt;URLFetch method&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡If you're interested in the Google APIs topic in more depth, see our quick guide through Google APIs: &lt;a href="https://blog.apify.com/top-google-search-api/" rel="noopener noreferrer"&gt;&lt;strong&gt;Top Google search APIs in 2022&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How do you scrape a search engine?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;So there's a lack of a working solution from Google, a demand to see search results objectively, and lots of manual work ahead. The solution to these issues is an automated crawler that is simple enough to use and complex enough to scrape such a massive website as Google. In other words, an alternative SERP API - that's a lot of caps, but essentially it's a program that will automatically collect data from Google SERPs for you to analyze and use. This is precisely what our &lt;a href="https://apify.com/apify/google-search-scraper" rel="noopener noreferrer"&gt;Google Search Results Scraper&lt;/a&gt; was created for.&lt;/p&gt;

&lt;p&gt;Our SERP API supports the extraction of all data on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;organic and paid results 🔍&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ads 🛍&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;queries &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;People Also Ask 🙋&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;prices 🏷&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;reviews &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you need additional attributes, you can also include a short snippet of JavaScript code to extract them from the HTML. But for now, you can try it out for free on our platform.&lt;/p&gt;
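&lt;p&gt;As a hedged illustration of what such a snippet could look like - the function name, signature, and selector below are assumptions for illustration, not the Actor's documented input schema, so check the scraper's input options for the exact contract:&lt;/p&gt;

```javascript
// Hypothetical sketch of a custom extraction function. We assume the scraper
// calls it with a Cheerio-like `$` for each results page and merges the
// returned object into the output item. Selector and field names are illustrative.
const customDataFunction = ($) => ({
  resultStats: $('#result-stats').text().trim(),
});

// A minimal stand-in for `$` so the sketch can be exercised without a browser:
const fake$ = (selector) => ({
  text: () => (selector === '#result-stats' ? '  About 42 results  ' : ''),
});

console.log(customDataFunction(fake$).resultStats); // 'About 42 results'
```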

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://apify.com/zuzka/easy-google-scraper" rel="noopener noreferrer"&gt;😉 &lt;strong&gt;If you want a simpler version of Google search results scraper, we have a great alternative for you: Fast Google Search Scraper&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;🔍 How to scrape Google search pages&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now that we've covered all the aspects and reasons for scraping Google, let's get started with the tutorial. We promise it won't take long :), but if you would prefer a short video tutorial, we have one for you right here 📼:&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/eQoO3Wh9JWM"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;How to scrape data from Google search pages&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 1. Go to Google Search Results Scraper&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Go to the &lt;a href="https://apify.com/apify/google-search-scraper" rel="noopener noreferrer"&gt;scraper's page&lt;/a&gt;, and click the &lt;strong&gt;Try for free&lt;/strong&gt; button. You will be redirected to Apify Console, which is your workspace to run tasks for your scrapers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0fv6ijkqn3zrn6dur9uh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0fv6ijkqn3zrn6dur9uh.png" alt="Google Search Results Scraper's page on Apify Store" width="800" height="187"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Google Search Results Scraper's page on Apify Store&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you are not signed in, you'll find yourself on the sign-up page. &lt;a href="https://my.apify.com/sign-in" rel="noopener noreferrer"&gt;Sign up&lt;/a&gt; using your email account, Google, or GitHub - no credit card required. You will be redirected to the scraper's page on your &lt;a href="https://console.apify.com/" rel="noopener noreferrer"&gt;Apify Console&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbko588lhyn3euec59is6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbko588lhyn3euec59is6.png" alt="Log in or sign up with Google, GitHub, or your email address" width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Log in or sign up with Google, GitHub, or your email address&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 2. Insert the keyword you want to scrape&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now fill in the input fields. You can provide as many keywords or Google Search URLs as you want. Let's see what Google data we'll get for &lt;em&gt;hot air balloon&lt;/em&gt;. We can use either a search URL or just the keyword; either will work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn5q36o9g4l7u3y5oad9w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn5q36o9g4l7u3y5oad9w.png" alt="Google Search Results Scraper on Apify Console - typing in the search query" width="800" height="504"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Google Search Results Scraper on Apify Console - typing in the search query&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 3. Choose the number of pages for extraction&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now set how many Google pages you want to scrape and how many results you want to see on each page. So if you're interested in the first 2 pages with 50 results on each, you'll scrape 100 results in total. The same is true if you set the scraper to the first 5 pages with 20 results on each.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fseufqhhkkpw6xu9y07aj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fseufqhhkkpw6xu9y07aj.jpg" alt="Choose the number of results and Google pages to scrape" width="800" height="204"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Choose the number of results and Google pages to scrape&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Note that Google will always show you that it has found a gazillion results for almost any query. But in reality, the number of found pages rarely exceeds a few thousand 🤥 This is why it's unlikely that you will find a gazillion pages in your scraped run. You can check it yourself if you set &lt;strong&gt;Results per page&lt;/strong&gt; to max on your &lt;a href="https://www.google.com/preferences?hl=en" rel="noopener noreferrer"&gt;Google account&lt;/a&gt; settings and see how the &lt;em&gt;000000&lt;/em&gt; part of Google pages shrinks to just a few.&lt;/p&gt;
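&lt;p&gt;To make the pages-times-results arithmetic concrete, here's a small sketch of how those two settings map onto Google's long-standing &lt;code&gt;num&lt;/code&gt; (results per page) and &lt;code&gt;start&lt;/code&gt; (result offset) URL parameters. This is an illustration of the arithmetic only, not the scraper's internal code:&lt;/p&gt;

```javascript
// Illustration of the pagination arithmetic: "pages" and "results per page"
// expressed as Google's classic `num` and `start` URL parameters.
function buildSerpUrls(query, pages, resultsPerPage) {
  return Array.from({ length: pages }, (_, page) => {
    const params = new URLSearchParams({
      q: query,
      num: String(resultsPerPage),
      start: String(page * resultsPerPage), // offset of this page's first result
    });
    return `https://www.google.com/search?${params}`;
  });
}

// 2 pages x 50 results and 5 pages x 20 results both cover 100 results:
console.log(buildSerpUrls('hot air balloon', 2, 50).length * 50); // 100
console.log(buildSerpUrls('hot air balloon', 5, 20).length * 20); // 100
```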

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 4. Set up country domain and language of search&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Last but not least - where are we searching? This is your time to specify the domain or country + the language. You can mix &amp;amp; match them as you like (for instance, find all French-speaking results on Canadian Google), but we're going to go with Czech-in-Czech.&lt;/p&gt;
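&lt;p&gt;Under the hood, this choice roughly amounts to picking a Google country domain plus the &lt;code&gt;hl&lt;/code&gt; (interface language) URL parameter. The sketch below is illustrative only - the scraper exposes these as input options, so you never build these URLs yourself:&lt;/p&gt;

```javascript
// Sketch: country + language selection as a Google country domain plus the
// `hl` (interface language) parameter. Illustrative, not the Actor's code.
function buildLocalizedSearchUrl(query, countryDomain, languageCode) {
  const params = new URLSearchParams({ q: query, hl: languageCode });
  return `https://www.google.${countryDomain}/search?${params}`;
}

// Czech Google in the Czech language:
console.log(buildLocalizedSearchUrl('hot air balloon', 'cz', 'cs'));
```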

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvsdbgbk33pv4ld35kpqy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvsdbgbk33pv4ld35kpqy.png" alt="Set up country and language of search. We're going to go for Czech Google in the Czech language 🇨🇿" width="800" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Set up country and language of search. We're going to go for Czech Google in the Czech language 🇨🇿&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 5. Collect your data extracted from Google search&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once you are all set, click the &lt;strong&gt;Start&lt;/strong&gt; button. Your task will change its status to &lt;em&gt;Running&lt;/em&gt;; wait for the scraper's run to finish. It will be just a minute before you see the status switch to &lt;em&gt;Succeeded&lt;/em&gt;. In our case - 13 seconds, wow! 🚀&lt;/p&gt;

&lt;p&gt;Note that the overview shows only &lt;strong&gt;4 results&lt;/strong&gt;, which simply means the scraper finished extracting data from the Google pages just as we asked it to. &lt;strong&gt;The actual number of results is 100&lt;/strong&gt; because we decided to scrape 2 pages with 50 results each (see &lt;strong&gt;Step 3&lt;/strong&gt;). You can view all results, only organic results, or only paid results. You can also preview them as a table or in JSON.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F13ux27d15pkxlu9s12a9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F13ux27d15pkxlu9s12a9.png" width="800" height="566"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fim2n5nirn3hrz1x7jdwl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fim2n5nirn3hrz1x7jdwl.png" width="800" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;After Google Scraper has finished extracting search data, you can preview the results: all, organic and paid.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can also preview the data before downloading it by clicking the &lt;em&gt;Preview&lt;/em&gt; 👁 button, or by viewing it in another tab if the dataset is too large.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 6. View and download your Google data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once you're ready to download Google data, click the &lt;strong&gt;Export&lt;/strong&gt; button to see the results of your scraping. You can download your scraped data in many formats, including HTML table, JSON, CSV, Excel, XML, and RSS feed.&lt;/p&gt;

&lt;p&gt;You can choose to download only paid results, only organic results or all of them. Before downloading, you can also narrow down your dataset to a few specific fields that you want to keep or discard. Last option: getting your results &lt;a href="https://apify.com/apify/google-search-scraper/api/client/nodejs" rel="noopener noreferrer"&gt;via an API&lt;/a&gt;.&lt;/p&gt;
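&lt;p&gt;The API route boils down to Apify's public "dataset items" endpoint. Here's a minimal sketch of building an export URL for it; the dataset ID is a placeholder you'd copy from your run's Storage tab:&lt;/p&gt;

```javascript
// Sketch of building an export URL for a run's dataset via the Apify API v2
// "dataset items" endpoint. The dataset ID below is a placeholder; supported
// formats include json, csv, xlsx, html, xml, and rss.
function datasetExportUrl(datasetId, format) {
  const params = new URLSearchParams({ format });
  return `https://api.apify.com/v2/datasets/${datasetId}/items?${params}`;
}

console.log(datasetExportUrl('YOUR_DATASET_ID', 'csv'));
// https://api.apify.com/v2/datasets/YOUR_DATASET_ID/items?format=csv
```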

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5proih6j9f6zqe98bvbj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5proih6j9f6zqe98bvbj.png" alt="View and download your data in different formats" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;View and download your data in different formats&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;🛡 Do I need proxies for scraping Google SERPs?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Yes. Apify has proxies designed specifically for SERPs. Our &lt;a href="https://apify.com/proxy" rel="noopener noreferrer"&gt;SERP proxies&lt;/a&gt; will make your scraping much faster, and you'll be able to dynamically switch between countries so that you can get search information from any location. If you sign up for a free Apify account, you get a 30-day free trial of our SERP proxy service.&lt;/p&gt;
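&lt;p&gt;For a rough idea of what using the SERP proxy group looks like, here's a sketch of an Apify Proxy connection string, based on our understanding of Apify Proxy's comma-separated username format; the password placeholder is yours to fill in from Apify Console:&lt;/p&gt;

```javascript
// Sketch of an Apify Proxy connection string for Google SERPs, following
// Apify Proxy's username format (comma-separated parameters such as
// groups- and country-). The password placeholder comes from Apify Console.
function serpProxyUrl(password, countryCode) {
  const username = countryCode
    ? `groups-GOOGLE_SERP,country-${countryCode}`
    : 'groups-GOOGLE_SERP';
  return `http://${username}:${password}@proxy.apify.com:8000`;
}

console.log(serpProxyUrl('YOUR_PROXY_PASSWORD', 'US'));
// http://groups-GOOGLE_SERP,country-US:YOUR_PROXY_PASSWORD@proxy.apify.com:8000
```

&lt;p&gt;Switching the &lt;code&gt;country-&lt;/code&gt; part is what lets you get search results as seen from different locations.&lt;/p&gt;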

&lt;p&gt;Now that you're all ready, go ahead and start your &lt;a href="https://apify.com/pricing" rel="noopener noreferrer"&gt;first month&lt;/a&gt; with Apify by using our &lt;a href="https://apify.com/apify/google-search-scraper" rel="noopener noreferrer"&gt;Google Search Scraper&lt;/a&gt;. If you need to scrape other parts of the Google giant, check out our other Google scrapers in &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;Apify Store&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What are other tools for scraping Google services?&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;👁 &lt;a href="https://apify.com/alexey/google-lens" rel="noopener noreferrer"&gt;Google Lens Actor&lt;/a&gt;
&lt;/th&gt;
&lt;th&gt;📍 &lt;a href="https://apify.com/compass/crawler-google-places" rel="noopener noreferrer"&gt;Google Maps Scraper&lt;/a&gt;
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🔍 &lt;a href="https://apify.com/hooli/google-trending-searches" rel="noopener noreferrer"&gt;Google Trending Searches&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;&lt;a href="https://apify.com/emastra/google-shopping-scraper" rel="noopener noreferrer"&gt;🛍 Google Shopping Scraper&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;📈 &lt;a href="https://apify.com/emastra/google-trends-scraper" rel="noopener noreferrer"&gt;Google Trends Scraper&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;⭐️ &lt;a href="https://apify.com/compass/google-maps-reviews-scraper" rel="noopener noreferrer"&gt;Google Maps Reviews Scraper&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;📩📍 &lt;a href="https://apify.com/lukaskrivka/google-maps-with-contact-details" rel="noopener noreferrer"&gt;Google Maps Email Extractor&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;🤟 &lt;a href="https://apify.com/hamza.alwan/google-datasets-translator" rel="noopener noreferrer"&gt;Google Datasets Translator&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>google</category>
      <category>seo</category>
      <category>tutorial</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>AIOps vs. MLOps</title>
      <dc:creator>Apify</dc:creator>
      <pubDate>Wed, 08 Nov 2023 23:00:00 +0000</pubDate>
      <link>https://forem.com/apify/aiops-vs-mlops-49f2</link>
      <guid>https://forem.com/apify/aiops-vs-mlops-49f2</guid>
<description>&lt;p&gt;&lt;strong&gt;Hi, we're Apify, a full-stack&lt;/strong&gt; &lt;a href="https://apify.com/web-scraping" rel="noopener noreferrer"&gt;&lt;strong&gt;web scraping&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;and browser automation platform. This article about AIOps vs. MLOps was inspired by our work on getting&lt;/strong&gt; &lt;a href="https://apify.it/data-for-ai" rel="noopener noreferrer"&gt;&lt;strong&gt;better data for AI&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;AIOps and MLOps are not synonymous&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;It's commonplace to mistakenly use artificial intelligence (AI) and machine learning (ML) interchangeably. Thankfully, the same is not true of AIOps and MLOps. They're two distinct approaches in the field of IT and data operations, each serving unique purposes.&lt;/p&gt;

&lt;p&gt;I already covered Artificial Intelligence for IT Operations in &lt;a href="https://blog.apify.com/what-is-aiops/" rel="noopener noreferrer"&gt;What is AIOps?&lt;/a&gt; and Machine Learning Operations in &lt;a href="https://blog.apify.com/what-is-mlops/" rel="noopener noreferrer"&gt;What is MLOps?&lt;/a&gt; So, what we'll do here is home in on the differences and explain when you need one and when you need the other. In the end, we'll look at how and why you might combine them.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is AIOps?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AIOps&lt;/strong&gt; refers to the application of AI and machine learning techniques to enhance and automate IT operations. Its primary goal is to improve the management and monitoring of complex IT environments by analyzing data from various sources to provide actionable insights and predictive capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why use AIOps?&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;For proactive problem resolution:&lt;/strong&gt; To detect and predict issues before they impact users and allow for quicker problem resolution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;For automation:&lt;/strong&gt; To automate routine tasks and responses and reduce the workload on IT staff.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;For enhanced visibility:&lt;/strong&gt; To provide a holistic view of the entire IT infrastructure and identify bottlenecks and areas for improvement.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;For reduced downtime:&lt;/strong&gt; To predict and prevent outages so as to minimize downtime for systems and applications.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;For cost optimization:&lt;/strong&gt; To help optimize resource allocation and reduce unnecessary spending.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is MLOps?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;MLOps&lt;/strong&gt;, on the other hand, is a set of practices and tools aimed at streamlining the deployment, monitoring, and management of machine learning models in production environments. It focuses on the operationalization of ML models and on maintaining their reliability over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why use MLOps?&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;For reproducibility:&lt;/strong&gt; To ensure that machine learning experiments are reproducible and that models are auditable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;For scalability:&lt;/strong&gt; To facilitate the scaling of ML models to handle large datasets and increased workloads.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;For model governance:&lt;/strong&gt; To provide tools for tracking model versions, lineage, and compliance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;For reliability:&lt;/strong&gt; To maintain model performance over time with automated monitoring and retraining.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2z4b5jlhobwe8gr5srrn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2z4b5jlhobwe8gr5srrn.jpg" alt="AIOps vs MLOps" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What are the differences between AIOps and MLOps?&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Focus&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AIOps&lt;/strong&gt; focuses on IT operations and infrastructure management.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MLOps&lt;/strong&gt; focuses on managing machine learning models and their lifecycle.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Use of AI/ML&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AIOps&lt;/strong&gt; uses AI/ML for monitoring, alerting, and optimizing IT environments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MLOps&lt;/strong&gt; uses AI/ML for model training, deployment, and monitoring.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Primary domain&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AIOps&lt;/strong&gt; is mainly used in the IT and DevOps domain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MLOps&lt;/strong&gt; is primarily used in data science and machine learning projects.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What are the use cases of AIOps and MLOps?&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;AIOps use cases&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Network performance monitoring:&lt;/strong&gt; AIOps can analyze network data to identify anomalies, predict congestion, and optimize network performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Incident management:&lt;/strong&gt; It can automatically classify and prioritize incidents, which reduces response times.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Capacity planning:&lt;/strong&gt; It helps in optimizing resource allocation by predicting demand and load patterns.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Root cause analysis:&lt;/strong&gt; It can identify the root causes of problems, which helps IT teams resolve issues faster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Security:&lt;/strong&gt; It detects and responds to security threats by analyzing patterns in log and event data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;MLOps use cases&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Recommendation systems:&lt;/strong&gt; MLOps is used to deploy and maintain recommendation models in applications like e-commerce.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Predictive maintenance:&lt;/strong&gt; It helps deploy predictive maintenance models to minimize equipment downtime in manufacturing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Natural language processing:&lt;/strong&gt; It's used to manage NLP models for chatbots, sentiment analysis, and language translation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Financial forecasting:&lt;/strong&gt; It ensures that predictive models for stock prices or credit risk are up-to-date and reliable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Healthcare diagnostics:&lt;/strong&gt; It's employed to deploy and monitor ML models for disease diagnosis and patient monitoring.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;When to use AIOps and when to use MLOps?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use AIOps when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;You need to optimize and automate IT operations and infrastructure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You want to detect and resolve IT issues proactively.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You're dealing with monitoring and managing IT systems and networks.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use MLOps when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;You're developing, deploying, and maintaining machine learning models in production.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You need to ensure model reliability and scalability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You're involved in data science and AI projects where ML models play a central role.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;When can you combine AIOps and MLOps?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Organizations may sometimes combine AIOps and MLOps to enhance their overall operations and derive more value from their AI and ML investments. So, let's end with some examples of how you can integrate these two disciplines.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Automated incident resolution with ML predictions&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Using AIOps to monitor IT infrastructure for anomalies and incidents continuously&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When an incident is detected, AIOps can trigger an MLOps pipeline to analyze relevant data and predict the root cause. ML models may suggest resolutions or actions for IT teams based on historical data. This combination streamlines incident management and thus reduces resolution time.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Dynamic resource allocation for ML workloads&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Using AIOps to monitor resource utilization in data centers or cloud environments&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AIOps may trigger an MLOps process when resource constraints or performance issues are detected. ML models are able to predict resource requirements for upcoming machine-learning tasks based on historical patterns. Resources can be dynamically allocated to meet these requirements to optimize cost and performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Security threat detection and response&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Using AIOps to monitor logs, network traffic, and system behavior for security anomalies&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AIOps can trigger an MLOps pipeline when there's suspicious activity. ML models analyze the detected anomalies to determine if they represent real threats. If a threat is confirmed, automated responses or alerts can mitigate the risk.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Optimizing ML model deployment&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Using MLOps to manage the entire ML model lifecycle, from training to deployment&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AIOps is able to monitor the performance of deployed ML models in production environments. If AIOps detects a drop in model accuracy or unusual behavior, it can trigger MLOps to retrain or update the model automatically.&lt;/p&gt;
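The monitoring-and-retraining handoff described here can be sketched in a few lines of Python. Everything below is hypothetical: the accuracy threshold, the reading format, and the "trigger_retraining_pipeline" action stand in for whatever your monitoring and MLOps stack actually provides.

```python
# Minimal sketch of an AIOps-style check that hands off to an MLOps
# retraining step when a deployed model's accuracy drifts.

def should_retrain(recent_accuracy, baseline_accuracy, tolerance=0.05):
    """Flag the model for retraining when accuracy drops past the tolerance."""
    return (baseline_accuracy - recent_accuracy) > tolerance

def monitor(readings, baseline=0.92):
    """Map each (window, accuracy) reading to an action."""
    actions = []
    for window, accuracy in readings:
        if should_retrain(accuracy, baseline):
            actions.append((window, "trigger_retraining_pipeline"))
        else:
            actions.append((window, "ok"))
    return actions

# Hourly accuracy readings from a (hypothetical) monitoring system:
readings = [("09:00", 0.91), ("10:00", 0.90), ("11:00", 0.84)]
print(monitor(readings))
```

In a real deployment, the "trigger" branch would call out to an orchestration tool rather than append a tuple, but the decision logic is the same.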

&lt;h3&gt;
  
  
  &lt;strong&gt;Predictive capacity planning for ML infrastructure&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Using AIOps to analyze the utilization and performance of servers, GPUs, and other infrastructure components&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ML models are able to predict future capacity requirements based on historical data and upcoming ML workloads. So, if AIOps identifies capacity constraints or bottlenecks, it could trigger MLOps processes to help scale ML infrastructure efficiently.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Anomaly detection in ML model behavior&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Using AIOps to monitor the behavior of deployed ML models, such as input data distribution and model output&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;MLOps may be triggered by AIOps when deviations from expected behavior are detected. The ML models analyze the anomalies to identify potential issues with data quality, model drift, or external factors.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cost optimization for ML workloads&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Using AIOps to track the costs associated with IT resources used for ML workloads&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AIOps can trigger MLOps processes to analyze cost data in relation to model performance and business objectives. ML models can then recommend resource allocations that achieve cost-efficiency without compromising performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;AIOps and MLOps are different but not mutually exclusive&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you want to optimize and automate IT operations and infrastructure, detect and resolve IT issues, or monitor and manage IT systems, AIOps is what you need.&lt;/p&gt;

&lt;p&gt;If you're developing, deploying, and maintaining machine learning models in production and want to ensure model reliability and scalability, you need MLOps.&lt;/p&gt;

&lt;p&gt;That being said, combining the two allows organizations to create a closed-loop system where AI-driven insights from AIOps inform and automate actions within MLOps. This ensures both efficient IT operations and the reliability of machine learning applications.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Whichever one you want to use, you can't build an AIOps or MLOps solution without data. Both solutions begin with data collection, aka&lt;/em&gt; &lt;a href="https://apify.com/web-scraping" rel="noopener noreferrer"&gt;&lt;em&gt;web scraping&lt;/em&gt;&lt;/a&gt;&lt;em&gt;. If you need a web scraping platform for data acquisition, Apify provides the tools and infrastructure you need to harvest data from any website, including&lt;/em&gt; &lt;a href="https://apify.com/data-for-generative-ai" rel="noopener noreferrer"&gt;&lt;em&gt;scrapers and integrations for collecting data for AI and machine learning&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>A guide to data collection for training computer vision models</title>
      <dc:creator>Apify</dc:creator>
      <pubDate>Tue, 07 Nov 2023 23:00:00 +0000</pubDate>
      <link>https://forem.com/apify/a-guide-to-data-collection-for-training-computer-vision-models-3bf0</link>
      <guid>https://forem.com/apify/a-guide-to-data-collection-for-training-computer-vision-models-3bf0</guid>
      <description>&lt;p&gt;&lt;strong&gt;Hi, we're Apify, a full-stack&lt;/strong&gt; &lt;a href="https://apify.com/web-scraping" rel="noopener noreferrer"&gt;&lt;strong&gt;web scraping&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;and browser automation platform. This article about computer vision was inspired by our work on getting&lt;/strong&gt; &lt;a href="https://apify.it/data-for-ai" rel="noopener noreferrer"&gt;&lt;strong&gt;better data for AI&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is computer vision?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Have you ever used Google Translate by pointing your smartphone at a sign in a foreign language to get an almost immediate translation of it? I have, and I thank computer vision for it every time.&lt;/p&gt;

&lt;p&gt;You've heard of self-driving cars, right? Computer vision is behind those, too.&lt;/p&gt;

&lt;p&gt;What about face detection? Computer vision again.&lt;/p&gt;

&lt;p&gt;The history of computer vision goes back to 1959 - almost as far back as AI itself. It's a field of machine learning that helps computers see the world around them. It involves developing algorithms, techniques, and systems that allow computers to analyze and extract meaning from images, videos, and other visual data.&lt;/p&gt;

&lt;p&gt;The applications of computer vision are very broad, covering fields as wide-ranging as automotive manufacturing, optical character recognition, and face detection.&lt;/p&gt;

&lt;p&gt;It doesn't look like this field is going to slow down any time soon, either, with the market for computer vision predicted to reach &lt;a href="https://www.alliedmarketresearch.com/computer-vision-market-A12701" rel="noopener noreferrer"&gt;$82.1 billion&lt;/a&gt; by 2032. And with the rise of &lt;a href="https://blog.apify.com/multimodal-ai-what-can-it-do/" rel="noopener noreferrer"&gt;multimodal AI&lt;/a&gt;, computer vision use cases are likely to expand.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Datasets: ground truth for machine learning&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;For computer vision to work, AI models require access to datasets that serve as their "ground truth" for learning. The process of collecting data for such datasets is pivotal in the development of efficient computer vision models, as the quality and quantity of the data directly influence their accuracy and performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is meant by ground truth?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;"Ground truth" refers to the correct values or labels used for training and evaluation purposes. It serves as a reference or benchmark against which the performance of an AI model is measured. Ground truth data is critical for &lt;a href="https://blog.apify.com/what-is-data-labeling-in-ai/#labeled-data-and-supervised-learning" rel="noopener noreferrer"&gt;supervised learning&lt;/a&gt;, where models are trained using labeled examples and then assessed for their ability to make accurate predictions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8wsvnjkm6uv6zq5xmc2z.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8wsvnjkm6uv6zq5xmc2z.jpg" alt="Data collection for AI and computer vision" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is data collection in AI and computer vision?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Data collection is a broad term. But in AI, it's the process of aggregating relevant data and structuring it into datasets suitable for machine learning. The choice of data type, such as video sequences, frames, photos, or patterns, depends on the specific problem the AI model aims to solve.&lt;/p&gt;

&lt;p&gt;In the domain of computer vision, AI models are trained using image datasets to make predictions related to things such as image classification, object detection, and image segmentation. These image or video datasets must contain meaningful information for training the model to recognize patterns and make predictions based on them.&lt;/p&gt;

&lt;p&gt;For example, in industrial automation, image data is collected to identify specific part defects: cameras capture footage from assembly lines, and the resulting video frames or photos form the dataset.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Data sources for computer vision&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Generating a high-quality machine learning dataset requires identifying sources that will be used to train the model. There are two ways of sourcing and collecting image or video data for computer vision tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Public image datasets&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Public machine learning datasets are readily available online and are often open-source and free to use. However, it's important to review the dataset's licensing terms, as some may require payment for commercial projects. Public datasets work well for common computer vision tasks but may not cover unique or highly specific problems.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Custom datasets&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Custom datasets can be created by collecting data using &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;web scrapers&lt;/a&gt;, cameras, and other sensor-equipped devices like mobile phones or webcams. Third-party dataset service providers can assist in collecting data for machine learning tasks, and modern computer vision platforms, such as &lt;a href="https://blog.apify.com/pytorch-vs-tensorflow/" rel="noopener noreferrer"&gt;TensorFlow or PyTorch&lt;/a&gt;, host datasets for AI model deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Image annotation and data labeling&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Once data is collected, the next step is image annotation and &lt;a href="https://blog.apify.com/what-is-data-labeling-in-ai/" rel="noopener noreferrer"&gt;data labeling&lt;/a&gt;, where humans manually provide information about the ground truth within the data. This involves indicating the location and characteristics of objects that the AI model should learn to recognize. For example, training a deep learning model to detect giraffes would involve annotating each image or video frame with bounding boxes around the giraffes linked to the label "giraffe." The trained model can then identify giraffes in new images.&lt;/p&gt;
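An annotation is usually nothing more than structured metadata attached to an image. Here's a minimal sketch of what a bounding-box record for the giraffe example could look like; the file name, field names, and coordinates are invented, and real schemas vary by labeling format (COCO, Pascal VOC, and so on).

```python
# A hypothetical bounding-box annotation for one image, in a COCO-like shape.
# Each bbox is (x, y, width, height) in pixels; the exact schema depends on
# the labeling tool you use.
annotation = {
    "image": "safari_0001.jpg",
    "objects": [
        {"label": "giraffe", "bbox": [412, 96, 180, 344]},
        {"label": "giraffe", "bbox": [40, 150, 120, 260]},
    ],
}

# Count how many labeled giraffes this image contributes to the dataset:
giraffes = [o for o in annotation["objects"] if o["label"] == "giraffe"]
print(len(giraffes))  # 2
```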

&lt;h2&gt;
  
  
  &lt;strong&gt;Data preparation and characteristics of image data&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Most computer vision models are trained on datasets comprising hundreds or thousands of images. The quality of these images is crucial to the AI model's ability to classify or predict outcomes accurately. There are several key characteristics that can help identify a good image dataset:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Quality:&lt;/strong&gt; Images should be detailed enough for the AI model to identify and locate target objects effectively.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Variety:&lt;/strong&gt; Diverse images in the dataset improve the model's performance in various scenarios.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Quantity:&lt;/strong&gt; More data is generally better, as training on a large, accurately labeled dataset increases the model's chances of making accurate predictions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Density:&lt;/strong&gt; The density of objects within the images also matters, as images containing more labeled object instances give the model more examples to learn from.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Video data collection&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;While computer vision models are predominantly trained on image datasets, certain applications, like video classification, motion detection, and human activity recognition, require video data. Videos are essentially sequences of images, and the process of collecting video data involves identifying sources, scraping video content, recording video files, extracting frames, and preprocessing the data for machine learning.&lt;/p&gt;
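One common preprocessing step, sampling every Nth frame rather than keeping all of them, can be sketched as a small helper. The 30 fps clip length and sampling interval below are illustrative assumptions, not fixed recommendations.

```python
def frames_to_sample(total_frames, every_nth):
    """Return the frame indices to extract when keeping every Nth frame."""
    return list(range(0, total_frames, every_nth))

# A 10-second clip at 30 fps has 300 frames; sampling every 30th frame
# yields one training image per second of video.
indices = frames_to_sample(300, 30)
print(len(indices))  # 10
print(indices[:3])   # [0, 30, 60]
```

In practice, you would pass these indices to a video library's frame reader and then resize and normalize each extracted image before training.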

&lt;h2&gt;
  
  
  &lt;strong&gt;The best way to collect image and video data&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To train computer vision models, you need vast amounts of data. The go-to solution for collecting real-time data at scale for computer vision and other AI applications is &lt;a href="https://apify.com/web-scraping" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt;. This is a method of retrieving unstructured data from websites and converting it into a structured format so machines can process it. One way to go about this is to build your own custom scrapers (you can learn how in these &lt;a href="https://docs.apify.com/academy" rel="noopener noreferrer"&gt;free web scraping courses&lt;/a&gt;). Another option is to use pre-built scraping and automation tools for extracting image and video data from the web. Here are five options for starters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://apify.com/lukaskrivka/images-download-upload" rel="noopener noreferrer"&gt;&lt;strong&gt;Dataset Image Downloader &amp;amp; Uploader&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://apify.com/m0uka/bulk-image-downloader" rel="noopener noreferrer"&gt;&lt;strong&gt;Bulk Image Downloader&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://apify.com/alexey/google-lens" rel="noopener noreferrer"&gt;&lt;strong&gt;Google Lens (API alternative)&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://apify.com/streamers/youtube-scraper" rel="noopener noreferrer"&gt;&lt;strong&gt;YouTube Scraper&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://apify.com/clockworks/tiktok-scraper" rel="noopener noreferrer"&gt;&lt;strong&gt;TikTok Scraper&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The challenges of web scraping for computer vision&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Getting blocked&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Anyone who has done large-scale &lt;a href="https://blog.apify.com/web-data-extraction/" rel="noopener noreferrer"&gt;data extraction&lt;/a&gt; knows that the biggest challenge for web scraping is &lt;a href="https://blog.apify.com/crawl-without-getting-blocked/" rel="noopener noreferrer"&gt;getting blocked&lt;/a&gt; by anti-bot protections.&lt;/p&gt;

&lt;p&gt;To deal with these and other challenges, you don't just need a web scraper but infrastructure that sets you up to &lt;a href="https://crawlee.dev/" rel="noopener noreferrer"&gt;scrape successfully at scale&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Unclean data&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Another challenge, particularly in the field of computer vision, where high-quality images are required, is data cleanliness.&lt;/p&gt;

&lt;p&gt;The web is full of low-quality images, videos, and audio. So you not only need to perform web scraping, but you also need to &lt;a href="https://blog.apify.com/webscraping-ai-data-for-llms/" rel="noopener noreferrer"&gt;clean and process web data to feed AI models&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At Apify, we're well aware of these challenges and have a lot of experience in dealing with them. So, if you want a reliable platform to create your own web scrapers for computer vision models, Apify provides the infrastructure you need. If you prefer a ready-made scraper designed to handle the complexities of a particular website, there's a range of &lt;a href="https://apify.com/data-for-generative-ai" rel="noopener noreferrer"&gt;web scraping and automation tools for AI&lt;/a&gt; available in &lt;a href="https://apify.com/store/categories/ai" rel="noopener noreferrer"&gt;Apify Store&lt;/a&gt;, where developers publish the micro-apps (Actors) they've created for web scraping and automation projects.&lt;/p&gt;

&lt;p&gt;Take your pick!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>dataextraction</category>
    </item>
    <item>
      <title>What are embeddings in AI?</title>
      <dc:creator>Apify</dc:creator>
      <pubDate>Mon, 06 Nov 2023 23:00:00 +0000</pubDate>
      <link>https://forem.com/apify/what-are-embeddings-in-ai-44g7</link>
      <guid>https://forem.com/apify/what-are-embeddings-in-ai-44g7</guid>
      <description>&lt;p&gt;&lt;strong&gt;Apify is all about collecting the right data at scale in the most efficient possible way. So if you need a&lt;/strong&gt; &lt;a href="https://apify.it/data-for-ai" rel="noopener noreferrer"&gt;&lt;strong&gt;web scraping platform&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;to extract large volumes of real-time web data for embeddings and other AI solutions, your project begins with Apify.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is an embedding?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Imagine you're in a bar with a couple of your fellow programming buddies.&lt;/p&gt;

&lt;p&gt;You're huddled together in the corner, geeking out over programming stuff. Over in the other corner is a group of loud, thuggish football fans, annoying everyone with their obnoxious behavior. Meanwhile, all the pretty girls are strutting their stuff on the dance floor, having a good time.&lt;/p&gt;

&lt;p&gt;Notice how different people in this one space are grouped together based on similarities and shared interests.&lt;/p&gt;

&lt;p&gt;In AI, embeddings work the same way. They group similar features in virtually any data type.&lt;/p&gt;

&lt;p&gt;Embeddings are a way to represent data points (like words or even users) in a mathematical space, often multi-dimensional. These embeddings are generated in such a way that similar items are closer together while dissimilar items are farther apart, much like in the bar.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgqbz2xtxji82reeyxafo.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgqbz2xtxji82reeyxafo.jpg" alt="What are embeddings?" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Embeddings are a way to represent data points in a mathematical, multi-dimensional space. An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors.&lt;/p&gt;
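The "closer together, farther apart" idea is literal: every item becomes a vector, and the distance between vectors measures similarity. A toy two-dimensional version of the bar analogy, with made-up coordinates, might look like this:

```python
import math

# Invented 2-D "positions" for people in the bar analogy.
positions = {
    "programmer_a": (1.0, 1.2),
    "programmer_b": (1.1, 1.0),
    "football_fan": (8.0, 7.5),
}

def distance(a, b):
    """Euclidean distance between two named points."""
    return math.dist(positions[a], positions[b])

# The two programmers sit close together; the football fan is far away.
print(distance("programmer_a", "programmer_b"))
print(distance("programmer_a", "football_fan"))
```

Real embeddings work the same way, just with hundreds of dimensions instead of two.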

&lt;h2&gt;
  
  
  &lt;strong&gt;Applications of embeddings&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Recommendation systems&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You've probably watched a few movies on a streaming platform at some point in your life. Whenever you do this, the system tries to understand your taste by placing you in a certain position within its "taste space." Based on your position (and the position of movies you liked), it recommends other movies that are close by in this space.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Search engines&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When you type a query, the search engine translates your words into their embedding space and fetches documents that are semantically closer to your query. The closer the semantic relationship, the more likely it is that it's relevant to your search.&lt;/p&gt;
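That ranking step can be sketched with cosine similarity over toy vectors. Real search engines use learned, high-dimensional embeddings; the three-dimensional vectors below are invented purely for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings: the query is about cooking, "recipe" is a cooking page,
# "sports" is a match report.
query = (0.9, 0.1, 0.0)
docs = {"recipe": (0.8, 0.2, 0.1), "sports": (0.0, 0.1, 0.9)}

ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked)  # ['recipe', 'sports']
```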

&lt;h3&gt;
  
  
  &lt;strong&gt;Text generation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Embeddings also help in understanding the context of words. If a language model wants to generate a sentence following "The king and the...", it would look at the proximity of words to "king" in the embedding space to predict the next word. That's pretty much how LLMs like ChatGPT work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2j0d7el0l2xrgmaj2axb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2j0d7el0l2xrgmaj2axb.jpg" alt="What are word embeddings?" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Large language models don't understand text, only numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Word embeddings: why are they important for LLMs?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Despite the widespread (mis)use of the term 'AI' for large language models, LLMs are not intelligent. Machines, by their nature, don't understand text. They understand numbers (yet LLMs suck at math: go figure!). An NLP task, whether it's sentiment analysis, machine translation, or document classification, needs numerical data.&lt;/p&gt;

&lt;p&gt;Word embeddings convert words into numbers, but not just any numbers. These numbers (or vectors) capture the semantics, or meaning, of a word in relation to other words in a given dataset. A well-trained set of word embeddings will place words with similar meanings or contexts close to each other in this multi-dimensional space. The word "king", for example, might be close to "queen" and "monarch", but farther from "apple" or "car". These are what we call semantic relationships.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://blog.apify.com/nlp-techniques/" rel="noopener noreferrer"&gt;&lt;strong&gt;Learn more about NLP in the context of LLMs&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Semantic relationships&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When we say "semantic relationships," we mean the connections between words that have similar meanings or associations.&lt;/p&gt;

&lt;p&gt;Let's continue with the monarch example: consider the words "king," "man," "queen," and "woman." These words have relationships based on gender and monarchy. Good word embeddings can detect these relationships. They can tell us that the difference between "king" and "man" is similar to the difference between "queen" and "woman." In other words, embeddings can mathematically capture the idea that kings and queens are associated with their respective genders.&lt;/p&gt;
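With toy vectors, the claim that the difference between "king" and "man" resembles the difference between "queen" and "woman" becomes simple arithmetic. The numbers below are invented for illustration; real embeddings are learned from data.

```python
# Invented 2-D embeddings encoding (royalty, gender) for illustration.
vectors = {
    "king":  (0.75, 0.75),
    "man":   (0.25, 0.75),
    "queen": (0.75, 0.25),
    "woman": (0.25, 0.25),
}

def diff(a, b):
    """Component-wise difference between two word vectors."""
    return tuple(x - y for x, y in zip(vectors[a], vectors[b]))

# king - man and queen - woman both isolate the "royalty" component.
print(diff("king", "man"))     # (0.5, 0.0)
print(diff("queen", "woman"))  # (0.5, 0.0)
```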

&lt;p&gt;This is super useful in various language tasks because it allows computers to recognize not just individual words but also how words relate to each other in meaning.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Dimensionality&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;While a word can be represented as a one-hot encoded vector of the size of the entire vocabulary, embeddings compress this representation into a dense vector of much smaller dimensions.&lt;/p&gt;

&lt;p&gt;Initially, each word is like a huge switchboard with as many switches as there are words in the entire language - it's gigantic! If we have 100,000 words in our vocabulary, that's 100,000 switches for each word. This is what we call a "one-hot encoded vector".&lt;/p&gt;

&lt;p&gt;Now, think about the mess and inefficiency of handling these giant "switchboards". It's not only impractical but also computationally heavy. This is where dimensionality comes into play. We can make things more manageable.&lt;/p&gt;

&lt;p&gt;Word embeddings are like smart compressors. They take each word's "switchboard" and squeeze it down into a much smaller size, typically between 50 to 300 "switches" (dimensions) instead of tens of thousands.&lt;/p&gt;
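Here's the "switchboard" compression in code, using a made-up five-word vocabulary (a real vocabulary would have tens of thousands of entries, and the dense vectors would be learned rather than random):

```python
import numpy as np

VOCAB = ["king", "queen", "apple", "car", "bank"]  # toy vocabulary

# One-hot: a vector as wide as the whole vocabulary, with a single 1.
def one_hot(word):
    vec = np.zeros(len(VOCAB))
    vec[VOCAB.index(word)] = 1.0
    return vec

print(one_hot("apple"))  # [0. 0. 1. 0. 0.]

# A dense embedding squeezes the same word into far fewer dimensions.
EMBED_DIM = 3  # real models typically use 50-300
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(VOCAB), EMBED_DIM))

dense = embedding_matrix[VOCAB.index("apple")]
print(dense.shape)  # (3,)
```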

&lt;h3&gt;
  
  
  &lt;strong&gt;Contextual information&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Words don't live in isolation; they depend on the words around them to convey their full meaning. Think of a word like "bank." Depending on the context, it could mean a financial institution or the side of a river. Understanding these contextual nuances is vital.&lt;/p&gt;

&lt;p&gt;Modern word embeddings, especially contextual ones, can grasp the meaning of a word based on the words that surround it in a sentence or a paragraph.&lt;/p&gt;

&lt;p&gt;For instance, if you see the word "bank" in a sentence like "I deposited money in the bank," a contextual embedding knows it's talking about a financial institution. But if you see "I sat by the bank," it understands that it's referring to the side of a river. This ability to consider context makes embeddings incredibly powerful for tasks like language comprehension, translation, and text generation because they capture the subtleties and nuances of meaning that words carry in different situations.&lt;/p&gt;

&lt;p&gt;These contextual embeddings are the reason LLMs can understand the subtleties of language and accurately interpret the meaning of words in different sentences.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi3pujfge7d0x4z8dwhot.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi3pujfge7d0x4z8dwhot.jpg" alt="Contextual embeddings in LLMs" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Large language models produce &lt;em&gt;contextual embeddings&lt;/em&gt;. The embedding for a word may be different based on its surrounding words and context.&lt;/p&gt;
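This is not how a transformer computes anything internally, but the disambiguation idea can be sketched with toy vectors: average the context words and see which sense of "bank" they point toward. All the numbers below are invented for illustration:

```python
import numpy as np

# Toy 2-d vectors: one axis loosely means "finance", the other "nature".
word_vecs = {
    "deposited": np.array([0.9, 0.1]),
    "money":     np.array([0.8, 0.0]),
    "sat":       np.array([0.1, 0.5]),
    "river":     np.array([0.0, 0.9]),
}
senses = {
    "bank (financial)": np.array([1.0, 0.0]),
    "bank (riverside)": np.array([0.0, 1.0]),
}

def contextual_sense(context_words):
    """Pick the sense of 'bank' closest to the average of its context."""
    ctx = np.mean([word_vecs[w] for w in context_words], axis=0)
    return max(senses, key=lambda s: np.dot(ctx, senses[s]))

print(contextual_sense(["deposited", "money"]))  # bank (financial)
print(contextual_sense(["sat", "river"]))        # bank (riverside)
```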

&lt;h2&gt;
  
  
  &lt;strong&gt;How to create embeddings (tools and methods)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now that we've established the necessity of embeddings, the question naturally arises: how on earth do you generate them?&lt;/p&gt;

&lt;p&gt;Fortunately, there's a range of tools, methods, libraries, and platforms out there for creating embeddings. Here are some of the most popular, together with some of the features they offer:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Word2Vec&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.tensorflow.org/text/tutorials/word2vec" rel="noopener noreferrer"&gt;Word2Vec&lt;/a&gt; is a popular method developed by Google that uses neural networks to learn word representations. It can capture semantic relationships between words.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Continuous Bag of Words (CBOW) &amp;amp; Skip-Gram: two architectures to produce a distributed representation of words.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Captures semantic relationships. For instance, the vector math "King" - "Man" + "Woman" might get you close to "Queen".&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Generates a fixed-size dense vector for each word.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
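A minimal sketch of how skip-gram training pairs are extracted from a sentence: each word is paired with its neighbors within a window, and the model learns to predict the neighbor from the center word (CBOW simply flips the direction of prediction):

```python
def skipgram_pairs(tokens, window=1):
    """Generate (center, context) pairs for skip-gram training."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the king wears a crown".split()
pairs = skipgram_pairs(sentence, window=1)
print(pairs)  # [('the', 'king'), ('king', 'the'), ('king', 'wears'), ...]
```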

&lt;h3&gt;
  
  
  &lt;strong&gt;GloVe (Global Vectors for Word Representation)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Developed by Stanford, &lt;a href="https://nlp.stanford.edu/projects/glove/" rel="noopener noreferrer"&gt;GloVe&lt;/a&gt; is an unsupervised learning algorithm to obtain vector representations for words. It does this by aggregating global word-word co-occurrence statistics from a corpus.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Focuses on word co-occurrence statistics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tries to capture meaning based on how frequently words appear together in large corpora.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Generates fixed-size dense vectors for words.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
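The global co-occurrence statistics GloVe starts from can be sketched in a few lines. This toy example just counts how often word pairs appear within the same window across a (tiny, invented) corpus; GloVe then factorizes these counts into dense vectors:

```python
from collections import Counter

corpus = [
    "ice is cold".split(),
    "steam is hot".split(),
    "ice and steam are water".split(),
]

cooc = Counter()
window = 2
for sentence in corpus:
    for i, w in enumerate(sentence):
        # Count each pair once, regardless of order.
        for j in range(i + 1, min(len(sentence), i + window + 1)):
            pair = tuple(sorted((w, sentence[j])))
            cooc[pair] += 1

print(cooc[("ice", "is")])    # 1
print(cooc[("is", "steam")])  # 1
```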

&lt;h3&gt;
  
  
  &lt;strong&gt;BERT&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/1810.04805" rel="noopener noreferrer"&gt;BERT&lt;/a&gt; (Bidirectional Encoder Representations from Transformers) is a deep learning model that generates contextualized word embeddings. Unlike Word2Vec or GloVe, which generate a single word embedding for each word, this model produces embeddings that consider the context in which a word appears.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Contextual embeddings that represent words based on their meaning in a sentence.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deep bidirectional transformer architecture.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Can handle polysemy (words with multiple meanings).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Hugging Face&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://huggingface.co/" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt; is a wildly popular deep learning library that offers a wide array of pre-trained models for NLP tasks, from BERT and GPT-2 to newer models like T5 and BART.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Comes with tokenizers for different models to ensure that text is preprocessed in a manner consistent with the model's original training data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;While offering pre-trained models, the library is designed to fine-tune these models on custom datasets easily.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Provides out-of-the-box solutions for tasks like text classification, named entity recognition, and translation, simplifying the process for developers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Built on both TensorFlow and PyTorch, which gives users the freedom to choose their preferred framework.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://blog.apify.com/how-to-use-hugging-face-transformers-pipelines/" rel="noopener noreferrer"&gt;&lt;strong&gt;Learn more about Hugging Face in this introduction to transformers and pipelines&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;TensorFlow&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Developed by Google, &lt;a href="https://www.tensorflow.org/" rel="noopener noreferrer"&gt;TensorFlow&lt;/a&gt; is designed to provide a flexible platform for building and deploying ML models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Supports training on multiple GPUs and TPUs, which enables faster computations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Eager execution allows for more intuitive operations and debugging, behaving more like standard Python code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;TensorBoard, an integrated visualization tool, lets you monitor the training process, visualize model architecture, and more.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://blog.apify.com/pytorch-vs-tensorflow/" rel="noopener noreferrer"&gt;&lt;strong&gt;Learn more about TensorFlow in this PyTorch and TensorFlow comparison&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Keras&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://keras.io/" rel="noopener noreferrer"&gt;Keras&lt;/a&gt; acts as an interface for TensorFlow, which makes it simpler to develop deep learning models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Comes with an integrated &lt;code&gt;Embedding&lt;/code&gt; layer, which simplifies the process of creating word embeddings.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Allows you to use pre-trained embeddings or train them from scratch.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Converts tokenized text to dense vectors of fixed size.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Provides sequential and functional APIs for building complex architectures.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Offers various pre-trained models for tasks like classification, segmentation, etc., which can be easily imported and fine-tuned.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://blog.apify.com/deep-learning-with-keras/" rel="noopener noreferrer"&gt;&lt;strong&gt;Learn more about Keras&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Gensim&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A Python library specialized for topic modeling and document similarity analysis, &lt;a href="https://pypi.org/project/gensim/" rel="noopener noreferrer"&gt;Gensim&lt;/a&gt; provides tools for working with Word2Vec and other embedding models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Designed to work with large text corpora without consuming vast memory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Supports Word2Vec, FastText, and other popular embedding algorithms out of the box.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Known for its implementation of the Latent Dirichlet Allocation (LDA) algorithm for topic modeling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Provides functionalities to compute document similarities based on their semantic meanings.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Uses a dictionary to manage and map tokens to their IDs, which can be updated without having to recompute the entire dictionary.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Supports incremental training, meaning you can update your model with new data without starting from scratch.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvnnyqaeol00j5aj5hk2y.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvnnyqaeol00j5aj5hk2y.jpg" alt="LLMs use deep neural architectures to generate embeddings." width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LLMs use deep neural architectures to generate embeddings.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How LLMs generate embeddings&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Deep architectures&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;LLMs use deep neural architectures, often transformer-based, which allow them to consider a wide context when generating embeddings. They don't just look at one or two words; they consider the entire context, capturing the complex patterns and relationships in language.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Token and position embeddings&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In models like BERT, each word (or rather, each token) is first represented by a token embedding, essentially a traditional word-embedding lookup. The model then adds position embeddings, which encode each token's position in the sentence. This way, the model knows the order of tokens, and the transformer layers can turn these static starting points into contextual representations.&lt;/p&gt;
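A simplified sketch of that input step: two learned lookup tables whose rows are summed element-wise. (Real BERT also adds segment embeddings and layer normalization; the sizes and IDs below are invented.)

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, dim = 1000, 16, 8

# Two lookup tables, both learned during training.
token_emb = rng.normal(size=(vocab_size, dim))
position_emb = rng.normal(size=(max_len, dim))

token_ids = np.array([5, 42, 7])  # a hypothetical 3-token input

# The input representation is the element-wise sum of the two lookups.
x = token_emb[token_ids] + position_emb[np.arange(len(token_ids))]
print(x.shape)  # (3, 8)
```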

&lt;h3&gt;
  
  
  &lt;strong&gt;Pre-training and fine-tuning&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Generating embeddings isn't enough, though. The models that produce them need to be pre-trained on vast amounts of text and then fine-tuned.&lt;/p&gt;

&lt;p&gt;For models like BERT, during pre-training, some words in a sentence are masked, which is to say they're hidden, and the model is trained to predict them based on surrounding words. This helps the model learn contextual representations.&lt;/p&gt;

&lt;p&gt;After pre-training on massive datasets, models can be fine-tuned on specific tasks. This allows the embeddings to be further refined for particular applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;But hang on! We forgot the first step!&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Collecting data to create embeddings&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Before you can train high-quality embeddings, you'll need to get your paws on vast amounts of data. Then that &lt;a href="https://blog.apify.com/what-is-data-ingestion-for-large-language-models/" rel="noopener noreferrer"&gt;data needs to be ingested&lt;/a&gt; and prepared for machine learning purposes. So the first step to generating and training embeddings is data acquisition and pre-processing. Here's a brief overview of how to go about it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Collect text data:&lt;/strong&gt; Use open datasets, &lt;a href="https://apify.com/web-scraping" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt;, or access APIs to collect textual data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Clean the data:&lt;/strong&gt; Remove noise and irrelevant information, and perform &lt;a href="https://blog.apify.com/nlp-techniques/#stemming-and-lemmatization" rel="noopener noreferrer"&gt;stemming and lemmatization&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Label the data (if needed):&lt;/strong&gt; While embeddings often use unsupervised learning, &lt;a href="https://blog.apify.com/what-is-data-labeling-in-ai/" rel="noopener noreferrer"&gt;some models might benefit from labeled data&lt;/a&gt;. This can be done manually, via crowdsourcing, or using &lt;a href="https://blog.apify.com/what-is-data-labeling-in-ai/#labeled-and-unlabeled-data-in-semi-supervised-learning" rel="noopener noreferrer"&gt;semi-supervised techniques&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
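A minimal cleaning pass over collected text might look like the sketch below; real pipelines add stemming or lemmatization (e.g. with NLTK or spaCy) on top:

```python
import string

def clean(text):
    """Lowercase, strip punctuation, and tokenize on whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()

raw = "Word embeddings turn WORDS into numbers!"
print(clean(raw))  # ['word', 'embeddings', 'turn', 'words', 'into', 'numbers']
```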

&lt;h2&gt;
  
  
  &lt;strong&gt;The crucial last step: every embedding needs a home&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;So, now we've covered the beginning (data acquisition) and the middle (generating and training embeddings). But the vital final part is storing the embeddings, and that's where vector databases come in.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Storing embeddings in a vector database&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once embeddings are generated, they can be stored in a &lt;a href="https://blog.apify.com/what-is-a-vector-database/" rel="noopener noreferrer"&gt;vector database&lt;/a&gt; for quick retrieval and similarity search. Because embeddings are high-dimensional vectors, traditional databases aren't optimized for vector search operations. A vector database lets you find the most similar items quickly, handle billions of vectors without sacrificing search speed, and avoid scanning the entire dataset. To use one, you typically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Convert embeddings to a suitable format (like numpy arrays).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use database-specific APIs or tools to insert the vectors.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Index the database for efficient search.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Examples of vector databases:&lt;/strong&gt; Pinecone, Milvus, Weaviate, Chroma, FAISS.&lt;/p&gt;
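Conceptually, a similarity search over stored embeddings is the brute-force scan below; a vector database replaces the scan with an approximate index (HNSW, IVF, and so on) so it scales to billions of vectors. The document IDs and dimensions here are invented:

```python
import numpy as np

rng = np.random.default_rng(7)
ids = ["doc_a", "doc_b", "doc_c", "doc_d"]
stored = rng.normal(size=(4, 64))  # 4 stored embeddings, 64 dimensions
stored /= np.linalg.norm(stored, axis=1, keepdims=True)  # unit-normalize

# A query embedding that should be closest to doc_c.
query = stored[2] + rng.normal(scale=0.05, size=64)
query /= np.linalg.norm(query)

# With unit vectors, the dot product IS the cosine similarity.
scores = stored @ query
print(ids[int(np.argmax(scores))])  # doc_c
```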

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;You can read more about these and other vector databases in&lt;/strong&gt; &lt;a href="https://blog.apify.com/what-is-pinecone-why-use-it-with-llms/" rel="noopener noreferrer"&gt;&lt;strong&gt;What is Pinecone?&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;and&lt;/strong&gt; &lt;a href="https://blog.apify.com/pinecone-alternatives/" rel="noopener noreferrer"&gt;&lt;strong&gt;6 open-source Pinecone alternatives&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Let's end where we began&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now you know what embeddings are, how they work, and how to generate and use them. But where does Apify fit into all this?&lt;/p&gt;

&lt;p&gt;We gave the game away right at the start:&lt;/p&gt;

&lt;p&gt;Apify is all about collecting the right data at scale in the most efficient possible way. So if you need a &lt;a href="https://apify.com/data-for-generative-ai" rel="noopener noreferrer"&gt;web scraping platform&lt;/a&gt; to extract large volumes of real-time web data to generate and train embeddings and other AI solutions, your project begins with Apify.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>data</category>
    </item>
    <item>
      <title>Python dictionaries: a comprehensive guide for devs</title>
      <dc:creator>Apify</dc:creator>
      <pubDate>Sun, 05 Nov 2023 23:00:00 +0000</pubDate>
      <link>https://forem.com/apify/python-dictionaries-a-comprehensive-guide-for-devs-4841</link>
      <guid>https://forem.com/apify/python-dictionaries-a-comprehensive-guide-for-devs-4841</guid>
      <description>&lt;p&gt;&lt;a href="https://apify.it/platform" rel="noopener noreferrer"&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;is all about making the web more programmable. Our&lt;/strong&gt; &lt;a href="https://apify.it/python-sdk" rel="noopener noreferrer"&gt;&lt;strong&gt;SDK for Python&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;is a great toolkit to help simplify the process of making scrapers to collect web data. This tutorial aims to give you a solid understanding of Python dictionaries for storing data.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is a Python dictionary?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Python has a variety of built-in data structures that can store different types of data. One such data structure is the Python dictionary, which can store data in the form of &lt;code&gt;key:value&lt;/code&gt; pairs and allows for quick access to values associated with keys. You can think of it like a regular dictionary, where words are keys and their definitions are values.&lt;/p&gt;

&lt;p&gt;The equivalent data structure in some other languages is called a &lt;em&gt;hashtable&lt;/em&gt;, because keys are stored and looked up by their hash values. Python dictionaries are dynamic and mutable, which means that their data can be changed.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What are dictionaries used for in Python?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Python dictionaries are used to store data in key-value pairs, where each key is unique within a dictionary, while values may not be. The values of a dictionary can be of any type, but the keys must be of an immutable data type, such as strings, numbers, or &lt;a href="https://blog.apify.com/python-tuple-vs-list/" rel="noopener noreferrer"&gt;tuples&lt;/a&gt;. This is because the Python dictionary is implemented internally as a hash table. If a key were mutable, its hash could change after insertion, and the dictionary would no longer be able to find it.&lt;/p&gt;

&lt;p&gt;Python dictionaries are optimized for fast lookups, making them more efficient than lists for this purpose.&lt;/p&gt;

&lt;p&gt;In Python, the average time complexity of a dictionary key lookup is O(1) because dictionaries are implemented as hash tables, and keys are hashable. On the other hand, the time complexity of a lookup in a list is O(n) on average.&lt;/p&gt;
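You can see this difference with a quick (unscientific) timing of membership tests; exact numbers vary by machine, but the dict lookup should be orders of magnitude faster:

```python
import timeit

n = 100_000
data_list = list(range(n))
data_dict = dict.fromkeys(range(n))

# Worst case for the list: look for the last element (full O(n) scan).
t_list = timeit.timeit(lambda: (n - 1) in data_list, number=100)
t_dict = timeit.timeit(lambda: (n - 1) in data_dict, number=100)

print(f"list membership: {t_list:.4f}s")
print(f"dict membership: {t_dict:.4f}s")
```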

&lt;p&gt;Using a Python dictionary makes the most sense under the following conditions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;If you want to &lt;strong&gt;store data and objects&lt;/strong&gt; using names rather than just index numbers or positions. Use a list if you want to store elements so that you can retrieve them by their index number.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you need to &lt;strong&gt;look up data and objects by names&lt;/strong&gt; quickly. Dictionaries are optimized for constant-time lookups.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you need to &lt;strong&gt;store data efficiently and flexibly&lt;/strong&gt;. Dictionaries store unique keys, so if you've got a lot of duplicate data, dictionaries will only store unique data.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How to use dictionaries in Python&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A dictionary is a group of key-value pairs. Using a dictionary in Python means working with the key-value pairs and performing various operations, such as retrieving data, deleting data, and inserting data.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is a key-value pair?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;key-value pair is a combination of two elements: a key and a value.&lt;/strong&gt; The key in a key-value pair must be immutable, meaning that it cannot be changed. Examples of immutable keys include numbers, strings, and tuples. Values in key-value pairs can be any type of data, including numbers, lists, strings, tuples, and even dictionaries.&lt;/p&gt;

&lt;p&gt;The values can repeat, but the keys must remain unique. You cannot assign multiple values to the same key. However, you can assign a list of values as a single value.&lt;/p&gt;
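For example (the extra "storage" entry is invented for illustration):

```python
# One key cannot hold two separate values, but it can hold one list.
company = {"name": "apify", "products": ["crawlee", "actors", "proxy"]}
company["products"].append("storage")  # grow the list behind an existing key
print(company["products"])  # ['crawlee', 'actors', 'proxy', 'storage']
```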

&lt;h3&gt;
  
  
  &lt;strong&gt;How to create a dictionary?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To create a dictionary in Python, you can use curly braces &lt;code&gt;{}&lt;/code&gt; to enclose a sequence of items separated by commas. Each item consists of a key and a value. There are two primary ways to define a dictionary: a literal (curly braces) or the built-in &lt;code&gt;dict()&lt;/code&gt; constructor.&lt;/p&gt;

&lt;p&gt;First, let's create an empty dictionary and then fill it with some items.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;test_dict = {}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's fill the dictionary with some data. Suppose we've got integer keys and string values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;test_dict = {1: "apify", 2: "crawlee"}print(test_dict)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is not necessary that all keys be of the same type.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;test_dict = {1: [2, 3, 4], "product": "crawlee"}print(test_dict) # {1: 'apify', 2: 'crawlee'}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alternatively, you can create a dictionary by explicitly calling the Python &lt;code&gt;dict()&lt;/code&gt; constructor. The built-in &lt;code&gt;dict()&lt;/code&gt; function builds a dictionary from another mapping, an iterable of key-value pairs, or keyword arguments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;test_dict = dict({1: "apify", 2: "crawlee"})print(test_dict) # {1: 'apify', 2: 'crawlee'}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A dictionary can also be created by passing keys and values to the &lt;code&gt;dict()&lt;/code&gt; constructor as keyword arguments, as shown below (in this case, the keys must be valid Python identifiers).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;test_dict = dict(company="apify", product="crawlee")print(test_dict) # {'company': 'apify', 'product': 'crawlee'}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A dictionary can also be created by passing a list of tuples to the &lt;code&gt;dict()&lt;/code&gt; constructor.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;test_dict = dict([(1, [2, 3, 4]), ("product", "apify")])print(test_dict) # {1: [2, 3, 4], 'product': 'apify'}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can create a dictionary inside another dictionary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nested_dict = { "product1": {"name": "Crawlee", "year": 2019}, "product2": {"name": "Actors", "year": 2022},}print(nested_dict) # {'product1': {'name': 'Crawlee', 'year': 2019}, 'product2': {'name': 'Actors', 'year': 2022}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As discussed earlier, each key in the dictionary should be unique. The last assignment of the key overwrites the previous ones.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;test_dict = {1: "apify", 1: "crawlee", 1: "proxy"}print(test_dict) # {1: 'proxy'}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Python dictionaries, keys must be hashable (which in practice means immutable). Thus, mutable data types like lists aren't allowed. Let's call &lt;code&gt;hash()&lt;/code&gt; on different data types and see what happens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hashing of an integerprint(hash(1))# Hashing of a floatprint(hash(1.2))# Hashing of a stringprint(hash("apify"))# Hashing of a tupleprint(hash((1, 2)))# Hashing of a list# Lists are not hashable, so this will raise a TypeErrorprint(f"Hash of list [1, 2, 3]: {hash([1, 2, 3])}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the code result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F32ns5ckooyaww3e5z0nv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F32ns5ckooyaww3e5z0nv.png" alt="Python dictionary. Example of a code result" width="800" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Example of a code result&lt;/p&gt;

&lt;p&gt;Integers, floats, strings, and tuples are &lt;em&gt;hashable&lt;/em&gt; data types, whereas lists are &lt;em&gt;unhashable&lt;/em&gt; data types.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Accessing values using keys&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To access the value from a Python dictionary using a key, you can use the square bracket notation or the &lt;code&gt;get()&lt;/code&gt; method.&lt;/p&gt;

&lt;p&gt;To access the value using square bracket notation, place the key inside the square brackets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;test_dict[key]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the following code, we access the values using these keys: "&lt;em&gt;name&lt;/em&gt;" and "&lt;em&gt;products&lt;/em&gt;".&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = { "name": "apify", "year": 2015, "solution": "web scraping", "products": ["crawlee", "actors", "proxy"], "active": True,}print(company["name"]) # apifyprint(company["products"]) # ['crawlee', 'actors', 'proxy']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also use the &lt;code&gt;get()&lt;/code&gt; method to access the dictionary elements. Pass the name of the key as the argument to the method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = { "name": "apify", "year": 2015, "solution": "web scraping", "products": ["crawlee", "actors", "proxy"], "active": True,}print(company.get("name")) # web scraping
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, what if you're searching for the key and the key does not exist in the dictionary? If you use the square bracket notation, the program will throw a &lt;code&gt;KeyError&lt;/code&gt;. However, if you use the &lt;code&gt;get()&lt;/code&gt; method, you'll get &lt;code&gt;None&lt;/code&gt; by default, and you can also set the default return value by passing it as the second argument. The following examples return "&lt;em&gt;Key not found!&lt;/em&gt;" if the key doesn't exist.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = { "name": "apify", "year": 2015, "solution": "web scraping", "products": ["crawlee", "actors", "proxy"], "active": True,}print(company.get("founder")) # Noneprint(company.get("founder", "Key not found!")) # Key not found!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;How to modify dictionaries&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Python dictionaries are mutable, which means you can modify the dictionary items. You can add, update, or remove key-value pairs. Here are some basic operations to modify dictionaries.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Adding and updating key-value pairs&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;There are various ways to add new elements to a dictionary. A common way is to add a new key and assign a value to it using the &lt;code&gt;=&lt;/code&gt; operator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = { "name": "apify", "year": 2015, "solution": "web scraping", "products": ["crawlee", "actors", "proxy"], "active": True,}company["repositories"] = 112print(company)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the code result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08dqb7z0vqq0icw5i1gi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08dqb7z0vqq0icw5i1gi.png" alt="New key and value in code result" width="724" height="96"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;New key and value in code result&lt;/p&gt;

&lt;p&gt;New key-value pairs have been added successfully.&lt;/p&gt;

&lt;p&gt;You can add multiple values to a single key using the &lt;code&gt;=&lt;/code&gt; operator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = { "name": "apify", "year": 2015, "solution": "web scraping", "active": True,}company["products"] = "crawlee", "actors", "proxy"print(company)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the code output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2i0p6e92vhsm301gy41p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2i0p6e92vhsm301gy41p.png" alt="Code output with multiple values" width="714" height="81"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Code output with multiple values&lt;/p&gt;

&lt;p&gt;New values have been added in the form of a tuple.&lt;/p&gt;

&lt;p&gt;If you would like to add a key without a value, you can use &lt;code&gt;None&lt;/code&gt; instead of a value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = {
    "name": "apify",
    "year": 2015,
    "solution": "web scraping",
    "active": True,
}
company["price"] = None
print(company)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the code result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnt935m1unycqzzks3lfa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnt935m1unycqzzks3lfa.png" alt="Example of code result without a value" width="694" height="78"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Example of code result without a value&lt;/p&gt;

&lt;p&gt;You can add multiple key-value pairs to an existing dictionary with the &lt;code&gt;update()&lt;/code&gt; method. If a key is already present in the dictionary, its value gets overwritten with the new one. This is the most common way to add several key-value pairs to a Python dictionary at once.&lt;/p&gt;
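&lt;p&gt;As a minimal sketch (the dictionary contents here are illustrative), adding several pairs at once with &lt;code&gt;update()&lt;/code&gt; looks like this:&lt;/p&gt;

```python
# Illustrative example: add several key-value pairs at once with update()
company = {"name": "apify", "year": 2015}
company.update({"solution": "web scraping", "year": 2023})
print(company)  # {'name': 'apify', 'year': 2023, 'solution': 'web scraping'}
```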

&lt;p&gt;Here's the code result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4gxomlidagmzxsemls1y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4gxomlidagmzxsemls1y.png" alt="Code result with multiple key-value pairs" width="800" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Code result with multiple key-value pairs&lt;/p&gt;

&lt;p&gt;One last method involves using the Merge (&lt;code&gt;|&lt;/code&gt;) and Update (&lt;code&gt;|=&lt;/code&gt;) operators. They were introduced in Python 3.9. The merge (&lt;code&gt;|&lt;/code&gt;) operator creates a new dictionary with the keys and values from both of the given dictionaries. You can then assign this newly created dictionary to a new variable.&lt;/p&gt;

&lt;p&gt;The Update (&lt;code&gt;|=&lt;/code&gt;) operator adds the key-value pairs of the second dictionary to the first dictionary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = {"name": "apify"}
new_dict = {"year": 2015, "solution": "web scraping"}

# Merge operator
result = company | new_dict
print(result)

# Update operator
company |= new_dict
print(company)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the code result in the &lt;strong&gt;Python (3.9+)&lt;/strong&gt; interpreter:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkr48xsqpwhjsctdpwphl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkr48xsqpwhjsctdpwphl.png" alt="Screenshot 2023-10-11 103145.png" width="771" height="80"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Example of code result in the Python 3.9+ interpreter&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Removing key-value pairs&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The removal of key-value pairs can be done in several ways, which we'll discuss one by one.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;del&lt;/code&gt; keyword can be used to delete key-value pairs from the dictionary. Just pass the key of the key-value pair that you want to delete.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = {
    "name": "apify",
    "year": 2015,
    "solution": "web scraping",
    "active": True,
}
del company["year"]
print(company)  # {'name': 'apify', 'solution': 'web scraping', 'active': True}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "&lt;em&gt;year&lt;/em&gt;" entry has been deleted from the Python dictionary.&lt;/p&gt;

&lt;p&gt;Another way is to use the &lt;code&gt;pop()&lt;/code&gt; method to delete the key-value pair. The main difference is that &lt;code&gt;pop&lt;/code&gt; returns the popped value, whereas &lt;code&gt;del&lt;/code&gt; does not.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = {
    "name": "apify",
    "year": 2015,
    "solution": "web scraping",
    "active": True,
}
popped_item = company.pop("year")
print(popped_item)  # 2015
print(company)  # {'name': 'apify', 'solution': 'web scraping', 'active': True}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;popitem()&lt;/code&gt; method takes no key name at all: it removes and returns the last item inserted into the dictionary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = {
    "name": "apify",
    "year": 2015,
    "solution": "web scraping",
    "active": True,
}
company.popitem()
print(company)  # {'name': 'apify', 'year': 2015, 'solution': 'web scraping'}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Clearing a dictionary&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you want to delete the entire dictionary, applying the above methods to every single key would be tedious. Instead, you can use the &lt;code&gt;del&lt;/code&gt; keyword to delete the whole dictionary object.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = {
    "name": "apify",
    "year": 2015,
    "solution": "web scraping",
    "active": True,
}
del company
print(company)  # Raises NameError: name 'company' is not defined
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you then try to access the dictionary, you'll encounter a &lt;code&gt;NameError&lt;/code&gt; because the dictionary no longer exists.&lt;/p&gt;

&lt;p&gt;A more appropriate method to delete all elements from a dictionary without deleting the dictionary itself is to use the &lt;code&gt;clear()&lt;/code&gt; method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = {
    "name": "apify",
    "year": 2015,
    "solution": "web scraping",
    "active": True,
}
company.clear()
print(company)  # Output: {}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code outputs an empty dictionary. &lt;code&gt;del&lt;/code&gt; deletes the whole object, while &lt;code&gt;clear()&lt;/code&gt; only empties the dictionary's contents; the object itself remains.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Built-in Python dictionary methods&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Python provides a set of built-in methods to make common dictionary operations like adding, deleting, and updating easier. These methods make code more concise and consistent. Some common methods include &lt;code&gt;get()&lt;/code&gt;, &lt;code&gt;keys()&lt;/code&gt;, &lt;code&gt;values()&lt;/code&gt;, &lt;code&gt;items()&lt;/code&gt;, &lt;code&gt;pop()&lt;/code&gt;, and &lt;code&gt;update()&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;get()&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;get()&lt;/code&gt; method returns the value for a key if it exists in the dictionary or returns &lt;code&gt;None&lt;/code&gt; if the key does not exist. You can also set a default value, which will be returned when the key does not exist in the dictionary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = {
    "name": "apify",
    "year": 2015,
    "solution": "web scraping",
    "products": ["crawlee", "actors", "proxy"],
}
print(company.get("name"))
print(company.get("founder"))
print(company.get("founder", "Sorry, key not exist!"))
print(company["founder"])  # Raises KeyError: 'founder'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the code result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1h9y6gul8lep3weihm3y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1h9y6gul8lep3weihm3y.png" alt="Python dictionary Get method example of code result" width="800" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Get method example of code result&lt;/p&gt;

&lt;p&gt;In the above output, when you try to access the key &lt;code&gt;name&lt;/code&gt;, the value &lt;code&gt;apify&lt;/code&gt; is returned. When you try to access the key &lt;code&gt;founder&lt;/code&gt;, which does not exist in the dictionary, the value &lt;code&gt;None&lt;/code&gt; is returned. When you access the key &lt;code&gt;founder&lt;/code&gt; again, the default value that you set to return if the key is not found is returned. Finally, when you try to access the key without using the &lt;code&gt;.get()&lt;/code&gt; method, the program throws an error. This is why the &lt;code&gt;.get()&lt;/code&gt; method is useful.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;keys()&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This method returns a view object (&lt;code&gt;dict_keys&lt;/code&gt;) containing all the keys in the dictionary, which you can iterate over.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = {
    "name": "apify",
    "year": 2015,
    "solution": "web scraping",
    "products": ["crawlee", "actors", "proxy"],
}
print(company.keys())  # dict_keys(['name', 'year', 'solution', 'products'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Often, this method is used to iterate through each key in the dictionary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = {
    "name": "apify",
    "year": 2015,
    "solution": "web scraping",
    "products": ["crawlee", "actors", "proxy"],
}
for k in company.keys():
    print(k)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the code output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pvlmlc2p39ef6rc9kg1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pvlmlc2p39ef6rc9kg1.png" alt="Python dictionary Keys method example of code output" width="712" height="127"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Keys method example of code output&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;values()&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Returns all values in the dictionary. The &lt;code&gt;values()&lt;/code&gt; method returns a view object (&lt;code&gt;dict_values&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = {
    "name": "apify",
    "year": 2015,
    "solution": "web scraping",
    "products": ["crawlee", "actors", "proxy"],
}
print(company.values())
# Output: dict_values(['apify', 2015, 'web scraping', ['crawlee', 'actors', 'proxy']])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;items()&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This method returns a view object of key-value pairs as tuples, where the first item in each tuple is the key and the second item is the value. The returned object is iterable, so this method is primarily used when you want to iterate through the dictionary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = {
    "name": "apify",
    "year": 2015,
    "solution": "web scraping",
    "products": ["crawlee", "actors", "proxy"],
}
print(company.items())
# Output: dict_items([('name', 'apify'), ('year', 2015), ('solution', 'web scraping'), ('products', ['crawlee', 'actors', 'proxy'])])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's how it looks when you iterate through the dictionary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = {
    "name": "apify",
    "year": 2015,
    "solution": "web scraping",
    "products": ["crawlee", "actors", "proxy"],
}
for k, v in company.items():
    print(k, v)

# or
for x in company:
    print(x, company[x])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the code output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyj09ao2oyno5upnhy4ut.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyj09ao2oyno5upnhy4ut.png" alt="Python dictionary Items method example of code output" width="687" height="124"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Items method example of code output&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note that&lt;/strong&gt; when you change the value in the dictionary, the &lt;code&gt;items&lt;/code&gt; object will also be updated to reflect the changes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = {
    "name": "apify",
    "year": 2015,
    "solution": "web scraping",
    "products": ["crawlee", "actors", "proxy"],
}
pairs = company.items()
print(pairs)
company["year"] = 2023
print(pairs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the code result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftooylk19g145nc7ks2qk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftooylk19g145nc7ks2qk.png" alt="Example of code result with changed value in the dictionary" width="800" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Example of code result with changed value in the dictionary&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;pop()&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;pop()&lt;/code&gt; method removes a key from a dictionary if it is present and returns its associated value.&lt;/p&gt;
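&lt;p&gt;A minimal example (the values here are illustrative):&lt;/p&gt;

```python
company = {"name": "apify", "year": 2015, "solution": "web scraping"}
popped = company.pop("solution")  # removes the key and returns its value
print(popped)   # web scraping
print(company)  # {'name': 'apify', 'year': 2015}
```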

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhbyyu7ox7cfyteeekzy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhbyyu7ox7cfyteeekzy.png" alt="Screenshot 2023-10-11 105411.png" width="800" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Example of code output pop method&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;pop()&lt;/code&gt; method raises a &lt;code&gt;KeyError&lt;/code&gt; if the key is not found.&lt;/p&gt;
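&lt;p&gt;For example (a small sketch):&lt;/p&gt;

```python
company = {"name": "apify", "year": 2015}
try:
    company.pop("founder")  # "founder" is not a key, so KeyError is raised
except KeyError as err:
    print("Missing key:", err)  # Missing key: 'founder'
```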

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Firms2rv908mdgajk4iha.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Firms2rv908mdgajk4iha.png" alt="Python dictionaries Pop method example of error" width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pop method example of error&lt;/p&gt;

&lt;p&gt;But if you don't want this error to be raised, you can set a default value.&lt;/p&gt;
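&lt;p&gt;A small sketch of passing a default value as the second argument:&lt;/p&gt;

```python
company = {"name": "apify", "year": 2015}
value = company.pop("founder", "not found")  # the default prevents the KeyError
print(value)    # not found
print(company)  # {'name': 'apify', 'year': 2015}
```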

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F139cvf7oqw21cchk2kge.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F139cvf7oqw21cchk2kge.png" alt="Pop method setting a default value" width="754" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pop method setting a default value&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;update()&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;update()&lt;/code&gt; method is useful whenever you want to merge a dictionary with another dictionary or with an iterable of key-value pairs. Suppose the dictionary &lt;code&gt;company&lt;/code&gt; is updated with the entries from the dictionary &lt;code&gt;new_dict&lt;/code&gt;. For each key in &lt;code&gt;new_dict&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;If the key is not already present in &lt;code&gt;company&lt;/code&gt;, the key-value pair from &lt;code&gt;new_dict&lt;/code&gt; is added to &lt;code&gt;company&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the key is already present in &lt;code&gt;company&lt;/code&gt;, its value is updated with the value from &lt;code&gt;new_dict&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = {"name": "apify", "year": 2015, "solution": "web scraping"}
new_dict = {
    "active": True,
    "solution": "web scraping and automation",
}
company.update(new_dict)
print(company)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the code output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3mughy6djroutkrm8vx8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3mughy6djroutkrm8vx8.png" alt="Screenshot 2023-10-11 113707.png" width="800" height="67"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Update method example of code output&lt;/p&gt;

&lt;p&gt;As shown above, the new key &lt;code&gt;active&lt;/code&gt; is added to the &lt;code&gt;company&lt;/code&gt; dictionary, and the &lt;code&gt;solution&lt;/code&gt; key is updated.&lt;/p&gt;

&lt;p&gt;Now, &lt;code&gt;new_dict&lt;/code&gt; can also be specified as a sequence of (key, value) tuples.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = {"name": "apify", "year": 2015, "solution": "web scraping"}
new_dict = (("active", True), ("solution", "web scraping and automation"))
company.update(new_dict)
print(company)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the code result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsd8hoopp69pchukq86rw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsd8hoopp69pchukq86rw.png" alt="Update method with a list of tuples code result" width="800" height="71"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Update method with a list of tuples code result&lt;/p&gt;

&lt;p&gt;Values to be merged can also be specified as keyword arguments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = {"name": "apify", "year": 2015, "solution": "web scraping"}
company.update(active=True, solution="web scraping and automation")
print(company)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the code output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9pjvxrv00w1vwy56ydj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9pjvxrv00w1vwy56ydj.png" alt="Update method with values as a list of arguments code output example" width="800" height="71"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Update method with values as a list of arguments code output example&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;fromkeys() method&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;fromkeys()&lt;/code&gt; method is used to create a dictionary from a given sequence of keys (which can be a string, tuple, list, etc.) and a value. The &lt;code&gt;value&lt;/code&gt; parameter is optional; if it is not provided, &lt;code&gt;None&lt;/code&gt; is assigned to the keys. The &lt;code&gt;fromkeys()&lt;/code&gt; method returns a new dictionary with the specified keys and value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;keys = {"name", "year", "solution"}
value = "apify"
new = dict.fromkeys(keys, value)
print(new)
# e.g. {'name': 'apify', 'solution': 'apify', 'year': 'apify'} (key order may vary, since keys is a set)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a value is not provided, &lt;code&gt;None&lt;/code&gt; will be assigned as the value to all the keys.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;keys = {"name", "year", "solution"}
new = dict.fromkeys(keys)
print(new)
# Output: {'year': None, 'solution': None, 'name': None}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's take a look at one more feature provided by &lt;code&gt;fromkeys()&lt;/code&gt;. This time, use a mutable value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;keys = {"name", "year", "solution"}
value = ["apify"]
new = dict.fromkeys(keys, value)
print(new)
value.append(2015)
print(new)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the code output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcwzrzdwqqdiy138fnrul.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcwzrzdwqqdiy138fnrul.png" alt="Python dictionary Fromkeys method code output example" width="800" height="67"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fromkeys method code output example&lt;/p&gt;

&lt;p&gt;In the above result, when we update the list value using &lt;code&gt;append&lt;/code&gt;, the new value appears under every key. This is because every key references the same list object in memory.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;copy()&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This method returns a shallow copy of the existing dictionary. Modifications to the copied dictionary's top-level entries won't affect the original one (though mutable values, such as nested lists, are shared between the two).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = {"name": "apify", "year": 2015, "solution": "web scraping"}
new_copy = company.copy()
print(new_copy)
new_copy["name"] = "apify technology"
print(company)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the code output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzm6xyj979nicu8avc1f4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzm6xyj979nicu8avc1f4.png" alt="Python dictionary Copy method example of code output" width="723" height="69"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Copy method example of code output&lt;/p&gt;

&lt;p&gt;In the code, we are trying to modify the name of the &lt;code&gt;new_copy&lt;/code&gt; dictionary, but it does not affect the original dictionary (&lt;code&gt;company&lt;/code&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Iterating over dictionaries&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The dictionary allows access to the data in &lt;code&gt;O(1)&lt;/code&gt; average time. So, it is important to understand how to iterate over it so that you can access the data you've stored in the dictionary. Three commonly used methods for iterating over dictionaries are &lt;code&gt;keys()&lt;/code&gt;, &lt;code&gt;values()&lt;/code&gt;, and &lt;code&gt;items()&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Using loop with keys()&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This method lets you iterate over all the dictionary's keys. It returns a view object, which is a dynamic view of the dictionary's keys. You can iterate over this returned object without any problems. However, if you want to store the list of keys, you need to materialize it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = {"name": "apify", "year": 2015, "solution": "web scraping"}
all_keys = company.keys()
print(all_keys)  # dict_keys(['name', 'year', 'solution'])

# Materializing the keys into a list of keys
keys_list = list(company.keys())
print(keys_list)  # ['name', 'year', 'solution']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first result is a view object, and the second result is a list of keys because you materialize the view object into a list.&lt;/p&gt;

&lt;p&gt;Now, let's iterate over the dictionary, demonstrating two simple ways.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = {"name": "apify", "year": 2015, "solution": "web scraping"}
for key in company.keys():
    print(key, company[key])

for key in company:
    print(key, company[key])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the code output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvmb6p5oqfudz2mehv59.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvmb6p5oqfudz2mehv59.png" alt="Loop with keys example of code output" width="681" height="178"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Loop with keys example of code output&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; When you loop over a dictionary with the &lt;code&gt;in&lt;/code&gt; keyword, the dictionary invokes the &lt;code&gt;__iter__()&lt;/code&gt; method, which returns an iterator. This iterator is then used to iterate through the keys of the dictionary implicitly.&lt;/p&gt;
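&lt;p&gt;A quick illustration of this (a minimal sketch):&lt;/p&gt;

```python
company = {"name": "apify", "year": 2015}
it = iter(company)  # equivalent to company.__iter__(); yields the keys
print(next(it))  # name  (dicts preserve insertion order in Python 3.7+)
print(next(it))  # year
```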

&lt;h3&gt;
  
  
  &lt;strong&gt;Using loop with values()&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Like the &lt;code&gt;keys()&lt;/code&gt; method, the &lt;code&gt;values()&lt;/code&gt; method also returns a view object that allows you to iterate over the values. Unlike the &lt;code&gt;keys()&lt;/code&gt; method, this method provides only values. So, if you are not concerned about the keys, use this method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = {"name": "apify", "year": 2015, "solution": "web scraping"}
for val in company.values():
    print(val)

# Output:
# apify
# 2015
# web scraping
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Using loop with items()&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Similar to the previous method, the &lt;code&gt;items()&lt;/code&gt; method also returns a view object, but here you can iterate through the (key, value) pairs instead of iterating through either keys or values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = {"name": "apify", "year": 2015, "solution": "web scraping"}
for item in company.items():
    print(item)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the code output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fax5973secmorazz3h8gk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fax5973secmorazz3h8gk.png" alt="Screenshot 2023-10-12 073359.png" width="574" height="109"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Loop with items example of code output&lt;/p&gt;

&lt;p&gt;The above code returns the key-value pairs as tuples. To assign the keys and values to separate variables directly, you can use tuple unpacking in the loop.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = {"name": "apify", "year": 2015, "solution": "web scraping"}
for key, value in company.items():
    print(key, value)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the code result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0a0pcb7iyxlwbab6v24j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0a0pcb7iyxlwbab6v24j.png" alt="Loop with items assigning keys and values using variables code result example" width="562" height="112"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Loop with items assigning keys and values using variables code result example&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Dictionary comprehensions&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;First, let's remind ourselves what comprehension is in Python. It simply means applying a specific kind of operation on each element of an iterable (tuple, list, or dictionary). Dictionary comprehension is similar to list comprehension, but it creates a dictionary instead of a list.&lt;/p&gt;

&lt;p&gt;It is a useful feature that allows us to create dictionaries on a single line concisely and efficiently. The common uses of dictionary comprehension include constructing dictionaries, transforming existing dictionaries, and filtering dictionary content.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{key: value for (key, value) in iterable}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Suppose you've got a dictionary mapping each person's name to an amount of money, and you want to transform the values, i.e., increment each amount by 10 dollars.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;person = {"satyam": 22, "john": 18, "elon": 19}# Dictionary comprehension approachnew_person = {name: money + 10 for name, money in person.items()}print(new_person) # {'satyam': 32, 'john': 28, 'elon': 29}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now suppose you want to keep only the people whose money is less than 20 dollars.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;person = {"satyam": 22, "john": 18, "elon": 19}new_person = {name: money for name, money in person.items() if money &amp;lt; 20}print(new_person) # {'john': 18, 'elon': 19}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's combine &lt;code&gt;if&lt;/code&gt; and &lt;code&gt;else&lt;/code&gt; with dictionary comprehension. Suppose you want to create a new dictionary where any amount of money less than 20 gets a 20% increase.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;person = {"satyam": 22, "john": 18, "elon": 19}new_person = { name: money * 1.2 if money &amp;lt; 20 else money for name, money in person.items()}print(new_person) # {'satyam': 22, 'john': 21.6, 'elon': 22.8}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can perform a lot of operations in one line using dictionary comprehension.&lt;/p&gt;
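&lt;p&gt;As a further illustration (with hypothetical data), comprehensions can also build a dictionary from two parallel lists, or invert an existing mapping:&lt;/p&gt;

```python
names = ["satyam", "john", "elon"]
money = [22, 18, 19]

# Build a dictionary from two parallel lists using zip()
person = {name: amount for name, amount in zip(names, money)}
print(person)  # {'satyam': 22, 'john': 18, 'elon': 19}

# Invert the mapping (safe here because the values are unique)
by_amount = {amount: name for name, amount in person.items()}
print(by_amount)  # {22: 'satyam', 18: 'john', 19: 'elon'}
```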

&lt;h2&gt;
  
  
  &lt;strong&gt;Advanced Python dictionary operations&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Python dictionaries allow you to perform various advanced operations, such as handling missing keys using the &lt;code&gt;get()&lt;/code&gt; and &lt;code&gt;setdefault()&lt;/code&gt; methods, merging dictionaries, and working with nested dictionaries. This ultimately provides a lot of flexibility for working with data in a more structured way.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Handling missing keys with get() and setdefault()&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Sometimes you don't know whether a key exists in a dictionary before you try to access it, and a missing key raises an error. There are a few ways to handle missing keys. One way is to use the &lt;code&gt;get()&lt;/code&gt; and &lt;code&gt;setdefault()&lt;/code&gt; methods to supply a default value for the key. This way, you don't have to handle the error yourself.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;get()&lt;/code&gt; method returns the associated value if the key is found in the dictionary. Otherwise, it returns the default value (which is &lt;code&gt;None&lt;/code&gt; by default), instead of raising an error. You can also specify a custom default value as the second argument.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = {"name": "Apify", "year": 2015, "solution": "web scraping"}# Accessing a key using the get() methodprint(company.get("founder"))# Accessing a key using the get() method with a default valueprint(company.get("founder", "Sorry, key not found!"))# Accessing a key without using the get() methodprint(company["founder"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the code result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26zhbw9wn96l11n9olk8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26zhbw9wn96l11n9olk8.png" alt="Missing keys example of code output" width="800" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Missing keys example of code output&lt;/p&gt;

&lt;p&gt;Another method is &lt;code&gt;setdefault()&lt;/code&gt;. If the key is present in the dictionary, it returns the value of the key. Otherwise, this method inserts a new key with the default value passed as an argument in the method as its value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create a dictionary to store person's informationperson = {"name": "Satyam", "age": 21}# If the 'age' key is present, retrieve its value; otherwise, set a default valueage = person.setdefault("age", "Key not found!")# If the 'city' key is not present, set it to the default value "Key not found!"city = person.setdefault("city", "Key not found!")print("Age:", age)print("City:", city)# The dictionary 'person' will be updated as the new key 'city' is addedprint("Updated dictionary:", person)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the code output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr3g2t59j19g42ceokwao.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr3g2t59j19g42ceokwao.png" alt="Using setdefault for missing keys code output example" width="800" height="87"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using setdefault for missing keys code output example&lt;/p&gt;

&lt;p&gt;In the above code, the &lt;code&gt;age&lt;/code&gt; key is already in the dictionary, so &lt;code&gt;setdefault()&lt;/code&gt; leaves its value unchanged at 21 and simply returns it. The &lt;code&gt;city&lt;/code&gt; key is not in the dictionary, so it is inserted with the default value &lt;code&gt;Key not found!&lt;/code&gt;. If you now access the &lt;code&gt;city&lt;/code&gt; key, that default value is printed. The dictionary has been updated because the new key &lt;code&gt;city&lt;/code&gt; was added.&lt;/p&gt;

&lt;p&gt;The main difference between &lt;code&gt;get()&lt;/code&gt; and &lt;code&gt;setdefault()&lt;/code&gt; is that &lt;code&gt;get()&lt;/code&gt; only returns the value of the key if it exists in the dictionary, while &lt;code&gt;setdefault()&lt;/code&gt; will set the key to the default value if it does not exist.&lt;/p&gt;
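&lt;p&gt;To make that difference concrete, here is a minimal sketch (with made-up data) showing that get() leaves the dictionary untouched, while setdefault() inserts the missing key:&lt;/p&gt;

```python
company = {"name": "Apify"}

# get() looks up a key without modifying the dictionary
print(company.get("year", 2015))  # 2015
print(company)                    # {'name': 'Apify'} -- unchanged

# setdefault() inserts the default value when the key is missing
print(company.setdefault("year", 2015))  # 2015
print(company)                           # {'name': 'Apify', 'year': 2015}
```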

&lt;h2&gt;
  
  
  &lt;strong&gt;Merging dictionaries&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When merging dictionary A with dictionary B (written A | B), values from the second dictionary take precedence: if a key is common to both dictionaries, the value from B overwrites the value in A. As shown in the illustration below, key 1 exists in both dictionary A and dictionary B, and the value of key 1 in dictionary B overwrites the value of key 1 in dictionary A.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.apify.com%2Fcontent%2Fimages%2F2023%2F11%2Fmergingdictionaries--1-.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.apify.com%2Fcontent%2Fimages%2F2023%2F11%2Fmergingdictionaries--1-.png" alt="Merging dictionaries" width="800" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Merging dictionaries&lt;/p&gt;

&lt;p&gt;Let's look at some ways to merge Python dictionaries.&lt;/p&gt;

&lt;p&gt;From Python version 3.9 onward, we can use the merge (&lt;code&gt;|&lt;/code&gt;) and update (&lt;code&gt;|=&lt;/code&gt;) operators. The merge (&lt;code&gt;|&lt;/code&gt;) operator creates a new dictionary with the keys and values from both of the given dictionaries, which you can assign to a new variable. The update (&lt;code&gt;|=&lt;/code&gt;) operator adds the key-value pairs of the second dictionary to the first dictionary in place.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dict_a = {"name": "apify", "year": 2015, "solution": "web scraping"}dict_b = {"name":"apify tech", "active": True, "repos": 112}# Merge Operatorresult = dict_a | dict_bprint(result)# Update Operatordict_a |= dict_bprint(dict_a)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the code output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxurt6y5d7fz5kdtgb0v7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxurt6y5d7fz5kdtgb0v7.png" alt="Merging dictionaries code output example" width="800" height="56"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Merging dictionaries code output example&lt;/p&gt;

&lt;p&gt;Another method is to merge by unpacking both dictionaries using double asterisks (**).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dict_a = {"name": "apify", "year": 2015, "solution": "web scraping"}dict_b = {"name":"apify tech", "active": True, "repos": 112}result = { **dict_a,** dict_b}print(result)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the code result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2y2vie66y9w2q9ntsox6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2y2vie66y9w2q9ntsox6.png" alt="Merging dictionaries by unpacking them code result exampl" width="800" height="77"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Merging dictionaries by unpacking them code result example&lt;/p&gt;

&lt;p&gt;Another way is to copy one of the dictionaries and update it with other dictionaries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dict_a = {"name": "apify", "year": 2015, "solution": "web scraping"}dict_b = {"name":"apify tech", "active": True, "repos": 112}copy_dict_a = dict_a.copy()copy_dict_a.update(dict_b)print(copy_dict_a)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the code result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29mfhr6wlc294b4e5knw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29mfhr6wlc294b4e5knw.png" alt="Merging dictionaries by copying one of them code result example" width="800" height="77"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Merging dictionaries by copying one of them code result example&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Nested dictionaries&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A nested dictionary is a collection of dictionaries within a single dictionary. In short, a nested dictionary is a dictionary inside another dictionary. Below, we have created a nested dictionary for two companies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;companies = { "Apify": { "founded_year": 2012, "HQ_location": "Prague, Czech Republic", "CEO": "Jan Curn", "products": ["Web Scraping", "Data Extraction", "Automation"], }, "Google": { "founded_year": 1998, "HQ_location": "Mountain View, California", "CEO": "Sundar Pichai", "products": ["Search", "Android", "YouTube"], },}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can access individual items in a nested dictionary by chaining square brackets, specifying one key per level.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;companies = { "Apify": { "founded_year": 2012, "HQ_location": "Prague, Czech Republic", "CEO": "Jan Curn", "products": ["Web Scraping", "Data Extraction", "Automation"], }, "Google": { "founded_year": 1998, "HQ_location": "Mountain View, California", "CEO": "Sundar Pichai", "products": ["Search", "Android", "YouTube"], },}print(companies["Apify"]["HQ_location"]) # Prague, Czech Republicprint(companies["Apify"]["CEO"]) # Jan Curn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the key is not present, the program will raise an exception. To avoid this exception, you can use the &lt;code&gt;get()&lt;/code&gt; method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;companies = { "Apify": { "founded_year": 2012, "HQ_location": "Prague, Czech Republic", "CEO": "Jan Curn", "products": ["Web Scraping", "Data Extraction", "Automation"], }, "Google": { "founded_year": 1998, "HQ_location": "Mountain View, California", "CEO": "Sundar Pichai", "products": ["Search", "Android", "YouTube"], },}# Accessing the HQ location of 'Apify'print(companies["Apify"].get("HQ_location")) # Output: Prague, Czech Republic# Attempting to access a non-existent key ('Founder') in 'Apify'print(companies["Apify"].get("Founder")) # Output: None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can change the value of a specific item in a nested dictionary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;companies = { "Apify": { "founded_year": 2012, "HQ_location": "Prague, Czech Republic", "CEO": "Jan Curn", "products": ["Web Scraping", "Data Extraction", "Automation"], }, "Google": { "founded_year": 1998, "HQ_location": "Mountain View, California", "CEO": "Sundar Pichai", "products": ["Search", "Android", "YouTube"], },}companies["Apify"]["founded_year"] = 2015companies["Apify"]["CEO"] = "Curn Jan"print(companies)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the code result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3k60fcyxjtgbmt5w6c8n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3k60fcyxjtgbmt5w6c8n.png" alt="Nested dictionaries code result example" width="800" height="102"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nested dictionaries code result example&lt;/p&gt;

&lt;p&gt;You can add or update nested dictionary items. If the key is not present in the dictionary, the key will be added. If the key is already present in the dictionary, its value is replaced by the new one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;companies = { "Apify": { "founded_year": 2012, "HQ_location": "Prague, Czech Republic", "CEO": "Jan Curn", "products": ["Web Scraping", "Data Extraction", "Automation"], }, "Google": { "founded_year": 1998, "HQ_location": "Mountain View, California", "CEO": "Sundar Pichai", "products": ["Search", "Android", "YouTube"], },}# Adding a new company ('Amazon') to the dictionarycompanies["Amazon"] = { "founded_year": 1994, "HQ_location": "Seattle, Washington", "CEO": "Andy Jassy", "products": ["eCommerce", "Amazon Web Services", "Kindle"],}print(companies)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the code output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm0ia9hsvy4d1f1qcrtm5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm0ia9hsvy4d1f1qcrtm5.png" alt="Nested dictionary items code output example" width="800" height="158"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nested dictionary items code output example&lt;/p&gt;

&lt;p&gt;Use the &lt;code&gt;del&lt;/code&gt; statement to delete elements from the nested dictionary. In the code below, we are trying to delete the &lt;code&gt;Google&lt;/code&gt; dictionary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;companies = { "Apify": { "founded_year": 2012, "HQ_location": "Prague, Czech Republic", "CEO": "Jan Curn", "products": ["Web Scraping", "Data Extraction", "Automation"], }, "Google": { "founded_year": 1998, "HQ_location": "Mountain View, California", "CEO": "Sundar Pichai", "products": ["Search", "Android", "YouTube"], },}del companies["Google"]print(companies)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the code result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgtqqj2rxii0eln0hr2px.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgtqqj2rxii0eln0hr2px.png" alt="Using del statement code result example" width="800" height="81"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using del statement code result example&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Real-world applications for Python dictionaries&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Python dictionaries are a very useful data structure used in many real-world scenarios because they are easy to use, flexible, and efficient.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Benefits&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ease of use:&lt;/strong&gt; Python dictionaries are a powerful data structure that makes it easy to store and retrieve data. They provide quick access to data and make finding values straightforward. With very little code, you can modify dictionaries, add and delete items, and more.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flexible:&lt;/strong&gt; Python dictionaries are flexible and can store many different types of data, such as numbers, strings, and lists. This makes it easy to access and manipulate the data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Efficiency:&lt;/strong&gt; Python dictionaries use hash tables to store data, which allows them to find the value associated with a key quickly. Although hash tables trade some extra memory for this speed, the trade-off makes dictionaries ideal for applications that need fast access to large amounts of data. Dictionary methods like &lt;code&gt;get()&lt;/code&gt; and &lt;code&gt;setdefault()&lt;/code&gt; let you efficiently look up data in a dictionary, while methods like &lt;code&gt;pop()&lt;/code&gt;, &lt;code&gt;update()&lt;/code&gt;, and &lt;code&gt;clear()&lt;/code&gt; are efficient for manipulating it.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
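&lt;p&gt;A quick sketch of the manipulation methods mentioned above, using made-up data:&lt;/p&gt;

```python
company = {"name": "apify", "year": 2015, "solution": "web scraping"}

# pop() removes a key and returns its value
solution = company.pop("solution")
print(solution)  # web scraping

# update() merges another dictionary's key-value pairs in place
company.update({"repos": 112})
print(company)  # {'name': 'apify', 'year': 2015, 'repos': 112}

# clear() removes all items, leaving an empty dictionary
company.clear()
print(company)  # {}
```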

&lt;h3&gt;
  
  
  &lt;strong&gt;Example&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Let's take a look at a real-life application where Python dictionaries are used a lot. This example shows how to store student data efficiently using a Python dictionary. You can access the student data without getting an error if the student information is not found. You can also list all the student information and remove specific student details.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class StudentDatabase: def __init__ (self): self.students = {} def add_student(self, student_id, name, grade): self.students[student_id] = {"name": name, "grade": grade} def get_student_info(self, student_id): return self.students.get(student_id, None) def list_all_students(self): for student_id, student_info in self.students.items(): print( f"Student ID: {student_id}, Name: {student_info['name']}, Grade: {student_info['grade']}" ) def remove_student(self, student_id): if student_id in self.students: del self.students[student_id] print("Student removed from the database.") else: print("Student not found in the database")def main(): student_manager = StudentDatabase() while True: print("\\nStudent Management Menu:") print("1. Add Student") print("2. Retrieve Student Information") print("3. List All Students") print("4. Remove Student") print("5. Exit") choice = input("Enter your choice: ") if choice == "1": student_id = input("Enter student ID: ") name = input("Enter student name: ") grade = input("Enter student grade: ") student_manager.add_student(student_id, name, grade) print("Student added to the database.") elif choice == "2": student_id = input("Enter student ID to retrieve: ") student = student_manager.get_student_info(student_id) if student: print( f"Student ID: {student_id}, Name: {student['name']}, Grade: {student['grade']}" ) else: print("Student not found in the database.") elif choice == "3": student_manager.list_all_students() elif choice == "4": student_id = input("Enter student ID to remove: ") student_manager.remove_student(student_id) elif choice == "5": break else: print("Invalid choice. Please try again.") print("Goodbye!")if __name__ == " __main__": main()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the code output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5mik0it4gwb5jb4clnrl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5mik0it4gwb5jb4clnrl.png" alt="How to store student data using a Python dictionary code output" width="800" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How to store student data using a Python dictionary code output&lt;/p&gt;

&lt;p&gt;Similarly, Python dictionaries can be used in other real-life applications, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Creating a contact book to store all of your contacts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Building a user authentication service to store all of your users' data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
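&lt;p&gt;For instance, a minimal contact book (with hypothetical names and numbers) is just a dictionary keyed by contact name:&lt;/p&gt;

```python
contacts = {}

def add_contact(name, phone):
    """Store or overwrite a contact's phone number."""
    contacts[name] = phone

def find_contact(name):
    """Return the phone number, or None if the contact is unknown."""
    return contacts.get(name)

add_contact("Alice", "+420 123 456 789")
add_contact("Bob", "+420 987 654 321")
print(find_contact("Alice"))  # +420 123 456 789
print(find_contact("Carol"))  # None
```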

&lt;h2&gt;
  
  
  &lt;strong&gt;Best practices for Python dictionaries&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Following best practices can help you get the most out of dictionaries and avoid common pitfalls. The best practices ensure that your code will remain maintainable, reliable, and perform at its best.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Ensuring unique keys&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Keys in a Python dictionary must be unique. If you try to insert a duplicate key, the new value will overwrite the existing one. The most common and straightforward method to ensure unique keys is to check whether the key exists in the dictionary before inserting a new key-value pair. You can use the &lt;code&gt;in&lt;/code&gt; operator for this.&lt;/p&gt;

&lt;p&gt;In the following code, we check if the key exists. If it does not exist, we simply insert it. Otherwise, we handle it accordingly, such as ignoring the new value, replacing the existing value with the new value, or raising an exception.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;company = {"name": "Apify", "founded": 2012}key = "solution"# Check if the 'key' exists in the 'company' dictionaryif key not in company: # If it doesn't exist, add the 'key' with the value 'web scraping' company[key] = "web scraping"else: # Handle the case when 'key' already exists in the dictionary (duplicate key)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Mutable vs. immutable key types&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Dictionary keys must be of an immutable type, such as integers, floats, strings, Booleans, or tuples (provided the tuple itself contains only immutable elements). Lists and dictionaries cannot be dictionary keys because they are mutable. Values, however, can be of any type and may be repeated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qnhk22a6jopjjpm4lnt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qnhk22a6jopjjpm4lnt.png" alt="Mutable vs. immutable key types error example" width="800" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Mutable vs. immutable key types error example&lt;/p&gt;

&lt;p&gt;The output is a &lt;code&gt;TypeError&lt;/code&gt; exception. What does &lt;strong&gt;unhashable&lt;/strong&gt; mean?&lt;/p&gt;

&lt;p&gt;It simply means that the Python interpreter cannot compute a hash for the given key because its data type (a list) is mutable. So, the values of a dictionary can be of any type, but the keys must be of an &lt;strong&gt;immutable&lt;/strong&gt; data type.&lt;/p&gt;
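&lt;p&gt;A minimal sketch of both cases (with hypothetical data): a tuple key works because it is immutable, while a list key raises the TypeError described above:&lt;/p&gt;

```python
# Immutable types such as strings and tuples are hashable and work as keys
locations = {("Prague", "CZ"): "Apify HQ"}
print(locations[("Prague", "CZ")])  # Apify HQ

# Mutable types such as lists are unhashable and raise a TypeError
try:
    bad = {["Prague", "CZ"]: "Apify HQ"}
except TypeError as error:
    print(error)  # unhashable type: 'list'
```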

&lt;h3&gt;
  
  
  &lt;strong&gt;Performance considerations&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Python dictionaries are an excellent choice for fast lookups, insertions, and deletions. However, some factors may affect their performance.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Use dictionary comprehension when creating dictionaries. Dictionary comprehension is more concise and readable than a loop, and it is also more efficient in some cases, especially for large datasets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flat dictionaries are faster than nested dictionaries. If you want to store nested data, consider using a different data structure, such as a list of dictionaries or a tuple of dictionaries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When initializing a new dictionary, using &lt;code&gt;{}&lt;/code&gt; is more efficient than calling &lt;code&gt;dict()&lt;/code&gt;. With &lt;code&gt;{}&lt;/code&gt;, there are no function call overheads. Calling &lt;code&gt;dict()&lt;/code&gt; requires Python to look up and execute a function, which incurs a slight performance penalty.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
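&lt;p&gt;Point 3 can be checked with the standard-library timeit module; the exact numbers depend on your machine, but the literal form avoids the function-call overhead:&lt;/p&gt;

```python
import timeit

# Time one million creations of an empty dictionary, both ways
literal_time = timeit.timeit("{}", number=1_000_000)
call_time = timeit.timeit("dict()", number=1_000_000)

print(f"{{}} literal: {literal_time:.3f}s")
print(f"dict() call: {call_time:.3f}s")
```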

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0hnw24ymcqltlgg8b7i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0hnw24ymcqltlgg8b7i.png" alt="Python dictionary in use" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Python dictionary in use&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Comparing dictionaries with other Python data structures&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Python dictionaries have their own unique characteristics and use cases when compared to other data structures like lists, sets, and tuples.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Differences and similarities with lists, sets, and tuples&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lists&lt;/strong&gt; are ordered collections of elements that can be modified after creation. Items can be accessed by index, and duplicates are allowed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sets&lt;/strong&gt; are unordered collections of unique elements. Sets themselves are mutable (you can add and remove elements), but each element must be hashable. Sets are commonly used for operations such as finding the union, intersection, or difference between sets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tuples&lt;/strong&gt; are ordered collections of elements that cannot be modified after creation. Tuples are often used when you want to ensure that a collection of items remains constant and in a specific order.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dictionaries&lt;/strong&gt; are collections of key-value pairs (insertion-ordered since Python 3.7). The keys must be unique and hashable, and the values can be any type of data. Dictionaries can be modified after creation.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
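&lt;p&gt;One subtlety worth showing: since Python 3.7 the language guarantees that dictionaries preserve the order in which keys were inserted:&lt;/p&gt;

```python
# Keys come back in insertion order, not sorted order
d = {}
d["b"] = 2
d["a"] = 1
d["c"] = 3
print(list(d))  # ['b', 'a', 'c']
```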

&lt;h3&gt;
  
  
  &lt;strong&gt;When to use dictionaries over other data structures&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Python dictionaries provide a fast and efficient way to retrieve data by its key in average-case constant time, &lt;code&gt;O(1)&lt;/code&gt;. Dictionaries do not allow duplicate keys, so you can be sure that each key in a dictionary is unique. This makes them an excellent choice for storing large amounts of data in memory.&lt;/p&gt;

&lt;p&gt;When you need to access a specific element in a data structure, dictionaries are much faster than other data structures, such as lists. This is because the runtime for dictionary lookup operations is constant.&lt;/p&gt;

&lt;p&gt;Here is a simple comparison of the runtimes for list and dictionary lookup operations:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;List:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time# Create a list containing numbers from 0 to 9,999,999number_list = [number for number in range(10**7)]# Measure the time taken to check if '5' is in the liststart_time = time.time()if 5 in number_list: print("5 is in the list")list_runtime = time.time() - start_timeprint(f"\\nList runtime (checking '3' in the list): {list_runtime} seconds.")# Measure the time taken to find the number '9,000,000' in the liststart_time = time.time()for number in number_list: if number == 9000000: breaklist_runtime = time.time() - start_timeprint(f"\\nList runtime (finding '9,000,000' in the list): {list_runtime} seconds.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the code output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5xekx4ep5kea6j0hyjc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5xekx4ep5kea6j0hyjc7.png" alt="Comparison of the runtimes for list. Code output example" width="800" height="132"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Comparison of list lookup runtimes (code output)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Dictionary:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time# Create a dictionary with keys and valuesdictionary_data = {i: i * 2 for i in range(10**7)}# Measure the time taken to check if '5' is a key in the dictionarystart_time = time.time()if 5 in dictionary_data: print("Key '5' is in the dictionary.")dict_runtime = time.time() - start_timeprint( f"\\nDictionary runtime (retrieving the value for key '5'): {dict_runtime:.6f} seconds.")# Measure the time taken to retrieve the value for the key '9,000,000' from the dictionarystart_time = time.time()value = dictionary_data.get(9000000)dict_runtime = time.time() - start_timeprint( f"\\nDictionary runtime (retrieving the value for key '9,000,000'): {dict_runtime:.6f} seconds.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the code output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbx9p43vrodu28q9sczb4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbx9p43vrodu28q9sczb4.png" alt="Example of dictionary runtimes" width="800" height="126"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Example of dictionary runtimes&lt;/p&gt;

&lt;p&gt;Have you noticed that dictionaries perform much better than lists when looking up elements far from the start of the collection? The dictionary located the key almost instantly, while the list took around a second to perform the same operation.&lt;/p&gt;
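&lt;p&gt;The practical takeaway: if your code performs repeated membership tests against a large list, it can pay to convert the list to a set (or dict) once and query that instead. A minimal sketch (the variable names are illustrative, and it uses a smaller range than the snippets above to keep the example quick):&lt;/p&gt;

```python
# Build the lookup structure once: an O(n) conversion up front...
number_list = list(range(10**6))
number_set = set(number_list)

# ...then every membership test is O(1) on average,
# instead of an O(n) scan through the list.
print(900000 in number_set)  # True
print(10**6 in number_set)   # False
```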

&lt;h2&gt;
  
  
  &lt;strong&gt;Wrapping up&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;You learned in detail how to create, modify, and delete Python dictionaries, as well as some of the most commonly used dictionary methods and advanced Python dictionary operations. Reach for these methods whenever the opportunity arises, and build great projects with dictionaries while keeping best practices in mind.&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
      <category>data</category>
    </item>
    <item>
      <title>TypeScript utility types: when and how to use them</title>
      <dc:creator>Apify</dc:creator>
      <pubDate>Thu, 02 Nov 2023 23:00:00 +0000</pubDate>
      <link>https://forem.com/apify/typescript-utility-types-when-and-how-to-use-them-1fh2</link>
      <guid>https://forem.com/apify/typescript-utility-types-when-and-how-to-use-them-1fh2</guid>
      <description>&lt;p&gt;&lt;strong&gt;The&lt;/strong&gt; &lt;a href="https://docs.apify.com/sdk/js/docs/guides/type-script-actor" rel="noopener noreferrer"&gt;&lt;strong&gt;Apify SDK supports TypeScript&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;by covering public APIs with type declarations. This allows writing code with auto-completion for TypeScript and JavaScript code alike. This article was written to provide you with a deeper knowledge of TypeScript. But if you want to know more about how it compares with JavaScript for web scraping, you might like to read&lt;/strong&gt; &lt;a href="https://blog.apify.com/typescript-vs-javascript-crawler/" rel="noopener noreferrer"&gt;&lt;strong&gt;TypeScript vs. JavaScript: which to use for web scraping?&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is TypeScript?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;TypeScript is a superset of JavaScript that provides static type-checking, enabling you to catch type-related errors during development. One particularly useful feature of TypeScript is the ability to define and manipulate types effectively, helping you write code that is maintainable and reliable.&lt;/p&gt;

&lt;p&gt;Utility types in TypeScript play a significant role in this regard, as they enable you to create new types based on existing ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What are utility types?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Utility types are a set of built-in generic types in TypeScript that allow you to create new types by applying specific modifications and transformations to existing ones.&lt;/p&gt;

&lt;p&gt;Implementing utility types in your TypeScript project can make working with types more flexible, expressive, and reusable. Declaring your own custom utility types or using TypeScript's built-in utility types are the two approaches to adding utility types to your code base.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;TypeScript built-in utility types&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;TypeScript ships with a number of built-in utility types, so you can perform common type transformations without installing a library or writing a custom type first. Let's look at some of them:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1.&lt;/strong&gt; &lt;code&gt;Partial&amp;lt;Type&amp;gt;&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;When defined, the partial utility type in TypeScript turns all of a type's properties into optional fields. This allows you to modify the type's fields in part without TypeScript throwing errors.&lt;/p&gt;

&lt;p&gt;For example, say you have &lt;code&gt;User&lt;/code&gt; data in your application and want to update the user's information. You only want to update specific fields, not the whole object. Using the &lt;code&gt;Partial&lt;/code&gt; utility type, you can transform the data fields of the &lt;code&gt;User&lt;/code&gt; object from required to optional.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Description of the user datainterface User { id: number; firstName: string; lastName: string; email: string; bio: string;};// This is the user data fetchedconst userData: User = { id: 12345, firstName: "Jamin", lastName: "Doe", email: "hellojamin@test.com", bio: "Legendary Gamer and everything in between",};//Function to update the user infoconst updateUserInfo = (userId: number, updatedInfo: Partial&amp;lt;User&amp;gt;) =&amp;gt; { // Logic to update the user's info with the provided data // Using the partial type, all fields becomes optional updatedInfo.lastName; //(property) lastName?: string | undefined updatedInfo.email; //(property) email?: string | undefined};// Example usageconst userId = 12345;const updatedInfo: Partial&amp;lt;User&amp;gt; = { firstName: "John", bio: "Web Developer",};updateUserInfo(userId, updatedInfo);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the example above, the &lt;code&gt;Partial&lt;/code&gt; utility type converts all the &lt;code&gt;User&lt;/code&gt; object properties to optional fields.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2.&lt;/strong&gt; &lt;code&gt;Required&amp;lt;Type&amp;gt;&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;Required&lt;/code&gt; utility type is the opposite of the &lt;code&gt;Partial&lt;/code&gt; utility type. It transforms all the fields of your &lt;code&gt;Type&lt;/code&gt; into required fields. Imagine you have a data type of &lt;code&gt;UserRegistration&lt;/code&gt; with some optional fields, but you'd like to make all the fields required when using the data. You can achieve this using the &lt;code&gt;Required&lt;/code&gt; utility type.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;interface UserRegistration { username?: string; password?: string; email: string; fullName?: string;}const registerUser = (userData: Required&amp;lt;UserRegistration&amp;gt;) =&amp;gt; { // Logic to register the user with the provided data};// Example usageconst userData: Required&amp;lt;UserRegistration&amp;gt; = { email: "user@example.com",};// throws an error// Type '{ email: string; }' is missing the following properties from type 'Required&amp;lt;UserRegistration&amp;gt;': username, password, fullNameregisterUser(userData);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;3.&lt;/strong&gt; &lt;code&gt;Pick&amp;lt;Type, Keys&amp;gt;&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This utility type lets you construct a new type by selecting only the specified &lt;code&gt;Keys&lt;/code&gt; from a &lt;code&gt;Type&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let's say you have a product catalog in your application and would like to list available products without showing all the data properties of each product. You can create a simplified version of the original product object for the listing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Let's create a simplified version of the product for the product listingtype SimplifiedProduct = Pick&amp;lt;Product, "id" | "name" | "price" | "category"&amp;gt;;const product: Product = { id: "1001", name: "Laptop", description: "High-performance laptop", price: 1200, category: "Electronics", stock: 10 // ...};const simplifiedProduct: SimplifiedProduct = { id: product.id, name: product.name, price: product.price, category: product.category};console.log(simplifiedProduct);// Output: { id: "1001, name: "Laptop", price: 1200, category: "Electronics" }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;Pick&lt;/code&gt; utility type can be very useful when you have a very complex data object and only want to display a fraction of that data object to your users.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4.&lt;/strong&gt; &lt;code&gt;Omit&amp;lt;Type, Keys&amp;gt;&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;Omit&amp;lt;Type, Keys&amp;gt;&lt;/code&gt; type removes specified &lt;code&gt;Keys&lt;/code&gt; properties from the &lt;code&gt;Type&lt;/code&gt; and provides you with a &lt;code&gt;Type&lt;/code&gt; without those properties.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Let's create a simplified version of the product by excluding certain propertiestype SimplifiedProduct = Omit&amp;lt;Product, "description" | "stock"&amp;gt;;const product: Product = { id: "1001", name: "Laptop", description: "High-performance laptop", price: 1200, category: "Electronics", stock: 10 // ...};//description and stock are excluded from the product propertiesconst simplifiedProduct: SimplifiedProduct = { id: product.id, name: product.name, price: product.price, category: product.category};console.log(simplifiedProduct);// Output: { id: "1001, name: "Laptop", price: 1200, category: "Electronics" }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;5.&lt;/strong&gt; &lt;code&gt;ReturnType&amp;lt;Type&amp;gt;&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;ReturnType&lt;/code&gt; utility is used to get a function's return type. This enables you to extract and use the type that a function will return when invoked. It works by taking a function type as its parameter and producing the &lt;code&gt;Type&lt;/code&gt; of that function's returned value.&lt;/p&gt;

&lt;p&gt;Consider the &lt;code&gt;fetchDataFromApi&lt;/code&gt; example below and how the return type of the function was derived using the &lt;code&gt;ReturnType&lt;/code&gt; utility type.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type ApiResponse = { success: boolean; data: any; // Assuming data can be of any type};type ApiFetchFunction = () =&amp;gt; Promise&amp;lt;ApiResponse&amp;gt;;function fetchDataFromApi(endpoint: string): ApiFetchFunction { // Simulating fetching data from the API const fetchFunction: ApiFetchFunction = async () =&amp;gt; { const response = await fetch(endpoint); const data = await response.json(); return { success: true, data }; }; return fetchFunction;}// Usageconst fetchProductData = fetchDataFromApi('https://api.example.com/products');// Get the return type from the functiontype ProductDataResponse = ReturnType&amp;lt;typeof fetchProductData&amp;gt;;// Use the function's return typeasync function handleProductData() { const response: ProductDataResponse = await fetchProductData(); console.log('Product data:', response.data);}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;6.&lt;/strong&gt; &lt;code&gt;Awaited&amp;lt;Type&amp;gt;&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;Awaited&lt;/code&gt; type is used with asynchronous functions and operations to determine the data type that the function or operation resolves to. Consider the product example used previously. Imagine you need to fetch the product data from an API. The API service returns a &lt;code&gt;Promise&lt;/code&gt; with the product data. You want to handle this asynchronously and extract the type of the resolved data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// A function that simulates fetching product data from an APIconst fetchProductFromAPI = (): Promise&amp;lt;Product&amp;gt; =&amp;gt; { return new Promise(resolve =&amp;gt; { // Simulate an asynchronous API call setTimeout(() =&amp;gt; { resolve({ id: "1001", name: "Laptop", description: "High-performance laptop", price: 1200, category: "Electronics", stock: 10 }); }, 1000); });};// Use ReturnType to get the return type of the async function and use// Awaited to retrieve the type of the async callconst getProductData = async (): Promise&amp;lt;Awaited&amp;lt;ReturnType&amp;lt;typeof fetchProductFromAPI&amp;gt;&amp;gt;&amp;gt; =&amp;gt; { const productData = await fetchProductFromAPI(); return productData;};getProductData().then(data =&amp;gt; { console.log("Product data:", data); /* Output: Product data: { id: '1001', name: 'Laptop', description: 'High-performance laptop', price: 1200, category: 'Electronics', stock: 10 } */});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's interesting to note that the &lt;code&gt;Awaited&lt;/code&gt; utility transforms types recursively. So, no matter how deeply nested a &lt;code&gt;Promise&lt;/code&gt; is, &lt;code&gt;Awaited&lt;/code&gt; will always unwrap it to the innermost resolved value type. For example, the code below transforms the nested async type into a single data type.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type Data = Awaited&amp;lt;Promise&amp;lt;Promise&amp;lt;string&amp;gt;&amp;gt;&amp;gt;;//type Data = string
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;7.&lt;/strong&gt; &lt;code&gt;Record&amp;lt;Keys, Type&amp;gt;&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;Record&lt;/code&gt; utility type enables you to construct an object type with &lt;code&gt;Keys&lt;/code&gt; as its property keys and &lt;code&gt;Type&lt;/code&gt; as its property values. The &lt;code&gt;Keys&lt;/code&gt; passed to the &lt;code&gt;Record&lt;/code&gt; ensure that only those specific keys can be assigned values of &lt;code&gt;Type&lt;/code&gt;. This is particularly useful when you want to narrow down your records by only accepting specific keys.&lt;/p&gt;

&lt;p&gt;Let's use a real-world scenario to understand this better:&lt;/p&gt;

&lt;p&gt;You're creating a notification system for an application that notifies your users based on specific actions they take or responses to a request they make.&lt;/p&gt;

&lt;p&gt;You can use the &lt;code&gt;Record&lt;/code&gt; utility like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type NotificationTypes = 'error' | 'success' | 'warning';type IconTypes = 'errorIcon' | 'successIcon' | 'warningIcon';type IconColors = 'red' | 'green' | 'yellow';const notificationIcons: Record&amp;lt; NotificationTypes, { iconType: IconTypes; iconColor: IconColors }&amp;gt; = { error: { iconType: 'errorIcon', iconColor: 'red' }, success: { iconType: 'successIcon', iconColor: 'green' }, warning: { iconType: 'warningIcon', iconColor: 'yellow' }};console.log(notificationIcons.error)// OUTPUT: { iconType: 'errorIcon', iconColor: 'red' }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the example above, you specified the &lt;code&gt;Keys&lt;/code&gt; for the object to be a &lt;code&gt;Union&lt;/code&gt; type of either &lt;code&gt;error&lt;/code&gt;, &lt;code&gt;success&lt;/code&gt;, or &lt;code&gt;warning&lt;/code&gt; and assigned them to a property value of &lt;code&gt;{ iconType: IconTypes; iconColor: IconColors }&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;With this, you've created a constraint of records where &lt;code&gt;notificationIcons&lt;/code&gt; can only be accessed by one of the &lt;code&gt;NotificationTypes&lt;/code&gt;. Trying to access &lt;code&gt;notificationIcons&lt;/code&gt; with an unknown &lt;code&gt;NotificationTypes&lt;/code&gt; key will throw an error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;console.log(notificationIcons.completed)// Property 'completed' does not exist on type 'Record&amp;lt;NotificationTypes, { iconType: IconTypes; iconColor: IconColors; }&amp;gt;'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;8.&lt;/strong&gt; &lt;code&gt;Readonly&amp;lt;Type&amp;gt;&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;Readonly&lt;/code&gt; type transforms the object properties of a &lt;code&gt;Type&lt;/code&gt; to 'read-only' so its values cannot be reassigned after initialization.&lt;/p&gt;

&lt;p&gt;Example: Suppose you have a configuration object for a web application with various settings, and you want to ensure that once the configuration is set, it cannot be modified.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;interface AppConfig { apiUrl: string; maxRequestsPerMinute: number; analyticsEnabled: boolean;}const initialConfig: Readonly&amp;lt;AppConfig&amp;gt; = { apiUrl: '&amp;lt;https://api.example.com&amp;gt;', maxRequestsPerMinute: 1000, analyticsEnabled: true,};// Attempt to modify a property (will result in a TypeScript error)initialConfig.apiUrl = '&amp;lt;https://new-api.example.com&amp;gt;';// Error: Cannot assign to 'apiUrl' because it is a read-only property.function displayConfig(config: Readonly&amp;lt;AppConfig&amp;gt;) { console.log('API URL:', config.apiUrl); console.log('Max Requests Per Minute:', config.maxRequestsPerMinute); console.log('Analytics Enabled:', config.analyticsEnabled);}displayConfig(initialConfig);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;9.&lt;/strong&gt; &lt;code&gt;NonNullable&amp;lt;Type&amp;gt;&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;NonNullable&lt;/code&gt; type transforms a type by removing all &lt;code&gt;null&lt;/code&gt; and &lt;code&gt;undefined&lt;/code&gt; from the input &lt;code&gt;Type&lt;/code&gt; passed to it.&lt;/p&gt;

&lt;p&gt;Example: let's say you have a union type that accepts a &lt;code&gt;string&lt;/code&gt; or a &lt;code&gt;number&lt;/code&gt; as its value, but can also be &lt;code&gt;undefined&lt;/code&gt; or &lt;code&gt;null&lt;/code&gt;. When you don't want to accept &lt;code&gt;null&lt;/code&gt; or &lt;code&gt;undefined&lt;/code&gt; while reusing this type, you can strip out the &lt;code&gt;null&lt;/code&gt; and &lt;code&gt;undefined&lt;/code&gt; members using &lt;code&gt;NonNullable&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type UserID = string | number | null | undefined//Works fine as expectedconst userByIDString = "100"// error: Type 'undefined' is not assignable to type 'NonNullable&amp;lt;UserID&amp;gt;'const userByID: NonNullable&amp;lt;UserID&amp;gt; = undefined//Works fine as expectedconst userByIDNumber: NonNullable&amp;lt;UserID&amp;gt; = 102
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a list of all the available built-in utility types, check out the official TypeScript &lt;a href="https://www.typescriptlang.org/docs/handbook/utility-types.html" rel="noopener noreferrer"&gt;documentation.&lt;/a&gt;&lt;/p&gt;
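&lt;p&gt;Two more built-in utilities worth knowing are &lt;code&gt;Exclude&lt;/code&gt; and &lt;code&gt;Extract&lt;/code&gt;, which filter the members of a union type. Here is a minimal sketch (the &lt;code&gt;Status&lt;/code&gt; union is a hypothetical example, not one from this article):&lt;/p&gt;

```typescript
// A union mixing text statuses with numeric error codes (hypothetical example)
type Status = "active" | "inactive" | "banned" | 404 | 500;

// Exclude removes the union members assignable to the second argument...
type TextStatus = Exclude<Status, number>; // "active" | "inactive" | "banned"

// ...while Extract keeps only those members.
type ErrorCode = Extract<Status, number>; // 404 | 500

// The narrowed union is safe to use where only strings make sense:
const label = (s: TextStatus): string => s.toUpperCase();
console.log(label("active")); // "ACTIVE"
```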

&lt;h2&gt;
  
  
  &lt;strong&gt;Custom types in TypeScript&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Aside from the built-in utility types, TypeScript also offers the flexibility to create custom utility types to suit your needs. Let's create a custom type in TypeScript that you can use to transform other types. For this example, you'll create a utility type that accepts &lt;code&gt;Type&lt;/code&gt; as an object type and filters the object's keys by the &lt;code&gt;KeyType&lt;/code&gt; passed to the utility type.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Custom utility type declaration type FilterKeysByType&amp;lt;Type, KeyType&amp;gt; = { [key in keyof Type as Type[key] extends KeyType ? key: never]: Type[key];}// Usage exampleinterface Person { name: string; age: number; email: string; isAdmin: boolean;}type StringKeys = FilterKeysByType&amp;lt;Person, string&amp;gt;;//OUTPUT: { name: string, email: string }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's break down the custom utility line by line to understand it better:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;type FilterKeysByType&amp;lt;Type, KeyType&amp;gt;&lt;/code&gt; is the type definition, and it accepts two things: the &lt;code&gt;Type&lt;/code&gt; you want to transform and the &lt;code&gt;KeyType&lt;/code&gt; to filter by.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the second line, three major things are happening: &lt;code&gt;key in keyof Type&lt;/code&gt; iterates over every key of &lt;code&gt;Type&lt;/code&gt; (a mapped type); the &lt;code&gt;as Type[key] extends KeyType ? key : never&lt;/code&gt; clause remaps each key, keeping it when its value type is assignable to &lt;code&gt;KeyType&lt;/code&gt; and dropping it (via &lt;code&gt;never&lt;/code&gt;) otherwise; and &lt;code&gt;Type[key]&lt;/code&gt; assigns the original value type to every key that survives.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Custom types vs. utility types&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Should you create custom types over TypeScript's built-in utility types?&lt;/p&gt;

&lt;p&gt;That depends on the level of transformation and abstraction you're performing. If you want to perform complex transformations that go beyond what TypeScript's built-in features offer, you should consider creating your own custom types.&lt;/p&gt;

&lt;p&gt;Utility types are built into TypeScript, so they don't require any external libraries or extra lines of code to define and use. Being part of the language, they're also well-known and familiar to most developers.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Combining custom types with utility types&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Utilizing TypeScript's capabilities to create custom types lets you combine a custom type with a utility type to create a customized utility type that is tailored to fit the needs of your project.&lt;/p&gt;

&lt;p&gt;Let's explore this with an example. Say you want to create a custom type that transforms another type to make specific properties of that type optional. You can build such a custom type on top of TypeScript's existing utility types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type PartialBy&amp;lt;Type, Key extends keyof Type&amp;gt; = Omit&amp;lt;Type, Key&amp;gt; &amp;amp; Partial&amp;lt;Pick&amp;lt;Type, Key&amp;gt;&amp;gt;;// Usageinterface User {id: string;name: string;email: string;}const partialUser: PartialBy&amp;lt;User, 'email'&amp;gt; = {id: '123',name: 'John Doe',};// email is optional in partialUserconsole.log(partialUser);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, &lt;code&gt;PartialBy&lt;/code&gt; is a utility type that takes two parameters: &lt;code&gt;Type&lt;/code&gt;, which is the original type, and &lt;code&gt;Key&lt;/code&gt;, which represents the keys that you want to make partial. It uses &lt;code&gt;Omit&lt;/code&gt; to remove the specified keys from the original type and &lt;code&gt;Partial&amp;lt;Pick&amp;gt;&lt;/code&gt; to make those keys optional.&lt;/p&gt;

&lt;p&gt;With this, the &lt;code&gt;PartialBy&lt;/code&gt; custom utility can transform the &lt;code&gt;Key&lt;/code&gt; of any &lt;code&gt;Type&lt;/code&gt; passed to it.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Summing up: why you should use utility types&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Utility types in TypeScript can help you write sturdy, maintainable types. You can use them to make your type declarations more flexible, expressive, and reusable. They also give you the ability to combine them with custom types to create more powerful type declarations.&lt;/p&gt;

</description>
      <category>typescript</category>
      <category>programming</category>
      <category>automation</category>
    </item>
  </channel>
</rss>
