<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mohsin Ashraf</title>
    <description>The latest articles on Forem by Mohsin Ashraf (@mohsinashraf).</description>
    <link>https://forem.com/mohsinashraf</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F351515%2F28d786a8-8326-42ec-899b-9394fde8d34c.jpeg</url>
      <title>Forem: Mohsin Ashraf</title>
      <link>https://forem.com/mohsinashraf</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mohsinashraf"/>
    <language>en</language>
    <item>
      <title>Answer: How to get column and values dictionary in SQLAlchemy model?</title>
      <dc:creator>Mohsin Ashraf</dc:creator>
      <pubDate>Wed, 09 Feb 2022 08:58:10 +0000</pubDate>
      <link>https://forem.com/mohsinashraf/answer-how-to-get-column-and-values-dictionary-in-sqlalchemy-model-2dh6</link>
      <guid>https://forem.com/mohsinashraf/answer-how-to-get-column-and-values-dictionary-in-sqlalchemy-model-2dh6</guid>
      <description>&lt;div class="ltag__stackexchange--container"&gt;
  &lt;div class="ltag__stackexchange--title-container"&gt;
    
      &lt;div class="ltag__stackexchange--title"&gt;
        &lt;h1&gt;
          &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7Gn-iPj_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/stackoverflow-logo-b42691ae545e4810b105ee957979a853a696085e67e43ee14c5699cf3e890fb4.svg" alt=""&gt;
            &lt;a href="https://stackoverflow.com/questions/28746136/how-to-get-column-and-values-dictionary-in-sqlalchemy-model/71046554#71046554" rel="noopener noreferrer"&gt;
              &lt;span class="title-flare"&gt;answer&lt;/span&gt; re: How to get column and values dictionary in SQLAlchemy model?
            &lt;/a&gt;
        &lt;/h1&gt;
        &lt;div class="ltag__stackexchange--post-metadata"&gt;
          &lt;span&gt;Feb  9 '22&lt;/span&gt;
        &lt;/div&gt;
      &lt;/div&gt;
      &lt;a class="ltag__stackexchange--score-container" href="https://stackoverflow.com/questions/28746136/how-to-get-column-and-values-dictionary-in-sqlalchemy-model/71046554#71046554" rel="noopener noreferrer"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y9mJpuJP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/stackexchange-arrow-up-eff2e2849e67d156181d258e38802c0b57fa011f74164a7f97675ca3b6ab756b.svg" alt=""&gt;
        &lt;div class="ltag__stackexchange--score-number"&gt;
          0
        &lt;/div&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wif5Zq3z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/stackexchange-arrow-down-4349fac0dd932d284fab7e4dd9846f19a3710558efde0d2dfd05897f3eeb9aba.svg" alt=""&gt;
      &lt;/a&gt;
    
  &lt;/div&gt;
  &lt;div class="ltag__stackexchange--body"&gt;
    
&lt;p&gt;You can use the following method which is inspired by the answer &lt;a href="https://stackoverflow.com/users/771848/alecxe"&gt;alecxe&lt;/a&gt; gave.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def get_model_dict(model, row):
    if isinstance(row, model):
        columns = [x.name for x in list(model.__table__.columns)]
        return {x: getattr(row, x) for x in columns}
    else:
        raise ValueError(f"The provided row is not of type {model.__table__.name.title()}")
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now all you have to…&lt;/p&gt;
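As a rough, self-contained illustration of the pattern (the `User` class below is a stand-in that only mimics the `__table__` attributes the helper touches; it is not a real SQLAlchemy model):

```python
from types import SimpleNamespace

def get_model_dict(model, row):
    if isinstance(row, model):
        columns = [x.name for x in list(model.__table__.columns)]
        return {x: getattr(row, x) for x in columns}
    else:
        raise ValueError(f"The provided row is not of type {model.__table__.name.title()}")

# Stand-in mimicking just the parts of the SQLAlchemy interface the helper uses.
class User:
    __table__ = SimpleNamespace(
        name="users",
        columns=[SimpleNamespace(name="id"), SimpleNamespace(name="email")],
    )
    def __init__(self, id, email):
        self.id = id
        self.email = email

row = User(1, "a@example.com")
print(get_model_dict(User, row))  # {'id': 1, 'email': 'a@example.com'}
```

With a real SQLAlchemy model the call is the same: pass the model class and a row instance returned by a query.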
    
  &lt;/div&gt;
  &lt;div class="ltag__stackexchange--btn--container"&gt;
    
      &lt;a href="https://stackoverflow.com/questions/28746136/how-to-get-column-and-values-dictionary-in-sqlalchemy-model/71046554#71046554" rel="noopener noreferrer"&gt;Open Full Answer&lt;/a&gt;
    
  &lt;/div&gt;
&lt;/div&gt;


</description>
    </item>
    <item>
      <title>CLI upload for large files</title>
      <dc:creator>Mohsin Ashraf</dc:creator>
      <pubDate>Mon, 14 Dec 2020 23:08:21 +0000</pubDate>
      <link>https://forem.com/traindex/cli-upload-for-large-files-6i</link>
      <guid>https://forem.com/traindex/cli-upload-for-large-files-6i</guid>
      <description>&lt;p&gt;We deal with data every day as part of my work in the data science team. It starts by collecting data and analyzing it for potentially important features and baseline numbers. Then we do data preprocessing and cleaning. Finally, we feed the data into a machine learning algorithm for training.&lt;/p&gt;

&lt;p&gt;Once the training is complete, we test the model. We then serve via an API if the performance is good.&lt;/p&gt;

&lt;p&gt;In a &lt;a href="https://www.traindex.io/blog/"&gt;previous article&lt;/a&gt;, we talked about uploading large files using multipart upload via pre-signed URLs. We will take a step further now and discuss how to create a CLI tool for uploading large files to S3 using pre-signed URLs.&lt;/p&gt;

&lt;p&gt;This article comprises three parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create pre-signed URLs for multipart upload&lt;/li&gt;
&lt;li&gt;Upload all parts of the object&lt;/li&gt;
&lt;li&gt;Complete the upload&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Request for Multipart upload pre-signed URLs
&lt;/h1&gt;

&lt;p&gt;First of all, we have to request the pre-signed URLs from the AWS S3 bucket. The request returns a list of pre-signed URLs, one for each of the object’s parts, along with an upload_id, which is associated with the object whose parts are being uploaded. Let’s create the route for requesting pre-signed URLs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pathlib import Path
…
…

@app.route('/presigned', methods=['POST'])
def return_presigned():
    data = request.form.to_dict(flat=False)
    file_name = data['file_name'][0]
    file_size = int(data['file_size'][0])
    target_file = Path(file_name)
    max_size = 5 * 1024 * 1024
    upload_by = int(file_size / max_size) + 1
    bucket_name = "YOUR_BUCKET_NAME"
    key = file_name
    upload_id = s3util.start(bucket_name, key)
    urls = []
    for part in range(1, upload_by + 1):
        signed_url = s3util.create_presigned_url(part)
        urls.append(signed_url)
    return jsonify({
        'bucket_name': bucket_name,
        'key': key,
        'upload_id': upload_id,
        'file_size': file_size,
        'file_name': file_name,
        'max_size': max_size,
        'upload_by': upload_by,
        'urls': urls
    })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s go through the code. In this route (a Flask route), we read two fields from the request: file_name and file_size.&lt;br&gt;
file_name is used when creating URLs for the parts of the object, and file_size is used to work out how many parts (pre-signed URLs) to create.&lt;br&gt;
max_size determines each part’s maximum size; you can change it according to your needs.&lt;br&gt;
upload_by tells how many parts the object will be uploaded in.&lt;br&gt;
bucket_name is the bucket you want to upload the data to.&lt;br&gt;
upload_id is generated by the s3util.start method, which wraps S3’s create_multipart_upload call; we will discuss it shortly.&lt;br&gt;
After that, the pre-signed URLs are created in the for loop using the create_presigned_url utility method. Again, we will come back to it in a bit.&lt;br&gt;
Finally, we return the required data in JSON format.&lt;/p&gt;
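The part-count arithmetic can be sketched in isolation. Note that the hypothetical `count_parts` helper below uses `math.ceil`, which avoids allocating a spare empty part when the file size is an exact multiple of `max_size` (the `int(file_size / max_size) + 1` form in the route would create one extra URL in that case):

```python
import math

def count_parts(file_size, max_size=5 * 1024 * 1024):
    """Number of multipart-upload parts needed for a file of file_size bytes."""
    return max(1, math.ceil(file_size / max_size))

print(count_parts(12 * 1024 * 1024))  # 3 parts: 5 MB + 5 MB + 2 MB
print(count_parts(10 * 1024 * 1024))  # 2 parts: exact multiple, no empty part
```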

&lt;p&gt;Now, let’s talk about the S3 utility class. It encapsulates the multipart-upload calls so the code stays readable and manageable. Following is the code for the utility class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3
from botocore.exceptions import ClientError
from boto3 import Session


class S3MultipartUploadUtil:
    """
    AWS S3 Multipart Upload Uril
    """
    def __init__(self, session: Session):
        self.session = session
        self.s3 = session.client('s3')
        self.upload_id = None
        self.bucket_name = None
        self.key = None

    def start(self, bucket_name: str, key: str):
        """
        Start Multipart Upload
        :param bucket_name:
        :param key:
        :return:
        """
        self.bucket_name = bucket_name
        self.key = key
        res = self.s3.create_multipart_upload(Bucket=bucket_name, Key=key)
        self.upload_id = res['UploadId']
        logger.debug(f"Start multipart upload '{self.upload_id}'")
        return self.upload_id

    def create_presigned_url(self, part_no: int, expire: int=3600) -&amp;gt; str:
        """
        Create pre-signed URL for upload part.
        :param part_no:
        :param expire:
        :return:
        """
        signed_url = self.s3.generate_presigned_url(
            ClientMethod='upload_part',
            Params={'Bucket': self.bucket_name,
                    'Key': self.key,
                    'UploadId': self.upload_id,
                    'PartNumber': part_no},
            ExpiresIn=expire)
        logger.debug(f"Create presigned url for upload part '{signed_url}'")
        return signed_url

    def complete(self, parts, id, key, bucket_name):
        """
        Complete the multipart upload.
        `parts` is a list of dictionaries of the form
        [{'ETag': etag, 'PartNumber': 1}, {'ETag': etag, 'PartNumber': 2}, ...]
        where each `ETag` comes from the corresponding upload-part response header.
        :param parts: Sent part info.
        :return:
        """
        res = self.s3.complete_multipart_upload(
            Bucket=bucket_name,
            Key=key,
            MultipartUpload={
                'Parts': parts
            },
            UploadId=id
        )
        logger.debug(f"Complete multipart upload '{self.upload_id}'")
        logger.debug(res)
        self.upload_id = None
        self.bucket_name = None
        self.key = None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this class, I wrap the functionality of the S3 client to make it easy to use and less cluttered in the API file.&lt;/p&gt;

&lt;p&gt;Once you get the response from the API, it would look something like this:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LtRb_uwx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/z7v7suf92k708na4wyzd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LtRb_uwx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/z7v7suf92k708na4wyzd.png" alt="Code Snippet"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You would save this response to a JSON file in order to upload the data using the CLI.&lt;/p&gt;
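The saved file would have roughly the following shape; the values here are placeholders for illustration only, while the field names mirror the keys returned by the /presigned route above:

```json
{
  "bucket_name": "YOUR_BUCKET_NAME",
  "key": "dataset.csv",
  "upload_id": "EXAMPLE_UPLOAD_ID",
  "file_size": 12582912,
  "file_name": "dataset.csv",
  "max_size": 5242880,
  "upload_by": 3,
  "urls": ["https://...part1", "https://...part2", "https://...part3"]
}
```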
&lt;h1&gt;
  
  
  Upload all parts of the object
&lt;/h1&gt;

&lt;p&gt;Now let’s turn to the CLI code, which uses this JSON file. We assume the file is saved as presigned.json.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
import progressbar
from pathlib import Path

def main():
    data = eval(open('presigned.json').read())
    upload_by = data['upload_by']
    max_size = data['max_size']
    urls = data['urls']
    target_file = Path(data['file_name'])
    file_size = data['file_size']
    key = data['key']
    upload_id = data['upload_id']
    bucket_name = data['bucket_name']
    bar = progressbar.ProgressBar(maxval=file_size, \
        widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage()])
    json_object = dict()
    parts = []
    file_size_counter = 0
    with target_file.open('rb') as fin:
        bar.start()
        for num, url in enumerate(urls):
            part = num + 1
            file_data = fin.read(max_size)
            file_size_counter += len(file_data)
            res = requests.put(url, data=file_data)

            if res.status_code != 200:
                print (res.status_code)
                print ("Error while uploading your data.")
                return None
            bar.update(file_size_counter)
            etag = res.headers['ETag']
            parts.append((etag, part))
        bar.finish()
        json_object['parts'] = [
            {"ETag": eval(x), 'PartNumber': int(y)} for x, y in parts]
        json_object['upload_id'] = upload_id
        json_object['key'] = key
        json_object['bucket_name'] = bucket_name
    requests.post('https://YOUR_HOSTED_API/combine, json={'parts': json_object})
    print ("Dataset is uploaded successfully")

if __name__ == "__main__":
    main()    
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
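One detail worth calling out in the code above: S3 returns the ETag header wrapped in double quotes, so it needs a small clean-up before being sent back in the parts list. A minimal sketch of that step, using a sample MD5-style value for illustration:

```python
def clean_etag(raw_etag):
    """S3 wraps the ETag header value in double quotes; strip them."""
    return raw_etag.strip('"')

# Sample header value, as S3 would return it for an uploaded part.
parts = [(clean_etag('"d41d8cd98f00b204e9800998ecf8427e"'), 1)]
print([{"ETag": etag, "PartNumber": int(num)} for etag, num in parts])
```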



&lt;p&gt;The above code loads the file and gets all the required information, including upload_id, URLs, and others. I use Progressbar to show progress while uploading the file. The entire code is pretty much self-explanatory except for the following line of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;requests.post('https://YOUR_HOSTED_API/combine, json={'parts': json_object})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To understand this piece of code, we have to look at the final step of completing the upload.&lt;/p&gt;

&lt;h1&gt;
  
  
  Complete the upload
&lt;/h1&gt;

&lt;p&gt;We have uploaded all parts of the file, but these parts are not yet combined. To combine them, we need to tell S3 that we have finished uploading and that it can now assemble the parts. The request above calls the route below, which completes the multipart upload using the S3 utility class. It provides the necessary information about the file along with the upload_id, which tells S3 which uploaded parts belong to the same file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@app.route("/combine",methods=["POST"])
def combine():
    body = request.form
    body = body['parts']    
    session = Session()
    s3util = Presigned(session)
    parts = body['parts']
    id, key, bucket_name = body['upload_id'], body['key'], body['bucket_name']
    PARTS = [{"Etag": eval(x), 'PartNumber': int(y)} for x, y in parts]
    s3util.complete(PARTS, id, key, bucket_name)
    return Response(status_code=200)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is close to the minimum code required for such a CLI tool. You can deploy the API on a server that has the proper AWS roles for interacting with S3, so it can create the pre-signed URLs and complete the multipart upload. This way, you can make sure that no one has direct access to your S3 bucket; instead, they upload the data through pre-signed URLs, which is a secure way of uploading data.&lt;/p&gt;

</description>
      <category>cli</category>
      <category>upload</category>
      <category>large</category>
      <category>file</category>
    </item>
    <item>
      <title>How Traindex Leverages Keyword Search</title>
      <dc:creator>Mohsin Ashraf</dc:creator>
      <pubDate>Fri, 13 Nov 2020 17:50:23 +0000</pubDate>
      <link>https://forem.com/traindex/how-traindex-leverages-the-keyword-search-7fi</link>
      <guid>https://forem.com/traindex/how-traindex-leverages-the-keyword-search-7fi</guid>
      <description>&lt;p&gt;Traindex is a semantic search engine for corporate datasets to retrieve the most relevant results from a corpus. This search not only incorporates the meaning of the words but also includes contextual awareness. It enables a semantic search engine to outperform any keyword search engine; to find more, you can head out to our detailed &lt;a href="https://dev.to/traindex/how-is-semantic-search-different-from-keyword-search-578d"&gt;article&lt;/a&gt; about the difference between both these approaches.  &lt;/p&gt;

&lt;p&gt;Traindex’s performance is measured using a variety of benchmarks, ranging from automated algorithms to manual classification by experts. For instance, we use the &lt;a href="https://dev.to/traindex/benchmarking-of-textual-models-jaccard-similarity-1c3i"&gt;Jaccard Similarity&lt;/a&gt;, which counts how many words from the retrieved results match the query's keywords. The following graph illustrates the intuition behind this benchmark:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--f3R8oYIS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/otxfhw6pri8vw8js8b5k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--f3R8oYIS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/otxfhw6pri8vw8js8b5k.png" alt="Screenshot from 2020-11-13 22-48-23"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The blue bar represents the query's keywords, whereas the orange bars represent the matched keywords. The higher the bar, the more keywords in common.&lt;/p&gt;

&lt;p&gt;In general, the benchmarking queries contain 1200+ unique words. In our latest API release, we achieved a 47% average Jaccard Similarity score over the first twenty results for each query. This means we are implicitly performing a keyword search over the corpus: on average, Traindex retrieves 564 of the query's keywords in its results, which is far more than any keyword search engine can offer.&lt;/p&gt;
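A minimal sketch of Jaccard Similarity in its standard set form, shared unique words divided by total unique words (the sample sentences below are illustrative only, not benchmark data):

```python
def jaccard_similarity(query, document):
    """Jaccard Similarity: shared unique words divided by total unique words."""
    a = set(query.lower().split())
    b = set(document.lower().split())
    if not a.union(b):
        return 0.0
    return len(a.intersection(b)) / len(a.union(b))

score = jaccard_similarity("semantic search engine", "a search engine for patents")
print(round(score, 2))  # 0.33: 2 shared words out of 6 unique words
```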

&lt;p&gt;Moreover, the common words are not just random words: they are meticulously picked by an algorithm that decides which words best represent the query. The same algorithm also incorporates their semantic meaning during the search, which sometimes raises the Jaccard Similarity score to 90% or even 99%.&lt;/p&gt;

&lt;p&gt;When it comes to response time, Traindex is fairly quick. Imagine how much time it would take to perform a keyword search of 564 words over 8.5M+ documents: going through the entire corpus, matching the keywords, and surfacing the relevant results would require a lot of time and resources. Traindex, however, searches and ranks the results by semantic similarity rather than by the highest keyword match, as you can see from the figure above.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Traindex gives you the best of both worlds: keyword search and semantic search. Keyword search over millions of documents takes a long time, consumes too many resources, and produces many false positives. Traindex, on the other hand, lets you query with an entire document, even tens of thousands of words, and still the response time is quick and the results are relevant.&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>patents</category>
    </item>
    <item>
      <title>Summarizing Large Documents for Machine Learning Modeling</title>
      <dc:creator>Mohsin Ashraf</dc:creator>
      <pubDate>Thu, 08 Oct 2020 14:25:47 +0000</pubDate>
      <link>https://forem.com/traindex/summarizing-large-documents-for-machine-learning-modeling-4jhc</link>
      <guid>https://forem.com/traindex/summarizing-large-documents-for-machine-learning-modeling-4jhc</guid>
      <description>&lt;p&gt;Information is growing exponentially every passing day thanks to the internet. It has connected humanity from all the corners of the world. According to Forbes, 2.5 quintillion bytes of data are created every day, and the pace is accelerating. It includes all kinds of data: text, images, videos, and transactions, etc. Text data is the largest shareholder among these data. This text data can be a conversation between two persons, which can be of small sentences, or it can be intellectual property data, for example, patents, which can be up to millions of words.&lt;/p&gt;

&lt;p&gt;Handling smaller datasets with fairly small to medium-length documents is no longer a big challenge, thanks to the power of deep learning. The problem comes when you have to deal with large documents, ranging from a few thousand to millions of words, such as patents and research papers. Even state-of-the-art deep learning methods struggle to capture the long-term dependencies in these documents. Hence, such huge documents require special techniques.&lt;/p&gt;

&lt;p&gt;At Traindex, we are working with intellectual property and providing effective and efficient search solutions on patents. Patent analysts might want to know what other patents exist in the same domain when filing a new patent. They may want to find prior art to challenge a claim in an existing patent. There are numerous use cases that better patent search helps solve.&lt;/p&gt;

&lt;p&gt;We are dealing with millions of patents, each containing thousands of words, with some even reaching millions of words! Dealing with such a massive dataset of enormously large documents is a big challenge. These patents are not only lengthy but also intellectually dense, with long-term dependencies, so deep learning alone fails to capture their semantics properly. We have therefore developed preprocessing techniques that help us reduce the size of the documents while keeping the sense of each document pretty much the same.&lt;/p&gt;

&lt;p&gt;We use an extractive summarizer that summarizes a patent by first going through the whole document, identifying the important sentences, and then dropping the least important ones. The summarizer uses two measures to decide which sentences are essential: first, how many stopwords a sentence contains (which reduces its importance), and second, how many of the patent's important topics it touches, relative to the overall topics discussed in the patent (which increases its importance). We then use a simple threshold to decide which sentences to keep and which to drop. By changing the threshold, we can change the summary length so that the summary retains the most important information in the patent. The following figure illustrates this point.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Foyrpgfl1xqeiclriacj3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Foyrpgfl1xqeiclriacj3.png" alt="Extraction" width="720" height="504"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The image above shows the scores of the sentences in a patent. The horizontal red line is the importance threshold for keeping or discarding sentences. The blue bars below the threshold are the sentences we drop, and the result looks as follows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6n4sviu12iyg8nrzuzz5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6n4sviu12iyg8nrzuzz5.png" alt="Summarization" width="720" height="504"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These are the sentences we keep for the summarized patent. We can move the threshold line to get different summary lengths based on our needs. The overall process flow is shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8tp14hitqupeuomrcf75.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8tp14hitqupeuomrcf75.jpg" alt="Untitled Diagram-1" width="396" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have tested this approach, and it has improved the overall performance of our search index on patents. It also sidesteps a practical problem for deep learning algorithms, especially pre-trained models like Universal Sentence Encoder or BERT, which accept only a limited number of words per document; exceed that length and you will run into errors. You can apply this summarization technique with any embedding algorithm that limits the input document's length.&lt;/p&gt;
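As a toy sketch of the scoring-and-threshold idea described above: topic-word hits raise a sentence's score, stopword density lowers it, and a threshold decides what survives. The stopword list, topic words, weighting, and threshold below are all made up for illustration; they are not our production values.

```python
STOPWORDS = {"the", "a", "an", "of", "is", "and", "to", "in"}

def sentence_score(sentence, topic_words):
    """Score a sentence: topic-word hits raise it, stopword density lowers it."""
    words = sentence.lower().split()
    if not words:
        return 0.0
    topic_hits = sum(1 for w in words if w in topic_words)
    stop_ratio = sum(1 for w in words if w in STOPWORDS) / len(words)
    return topic_hits - stop_ratio

def summarize(sentences, topic_words, threshold=0.5):
    """Keep only the sentences whose score clears the threshold."""
    return [s for s in sentences if sentence_score(s, topic_words) >= threshold]

sentences = [
    "The valve regulates hydraulic pressure in the actuator",
    "It is to be understood that this is an example",
]
print(summarize(sentences, {"valve", "hydraulic", "pressure", "actuator"}))
# only the topic-heavy sentence survives
```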

</description>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>summarization</category>
      <category>modeling</category>
    </item>
    <item>
      <title>How Is Semantic Search Different From Keyword Search?</title>
      <dc:creator>Mohsin Ashraf</dc:creator>
      <pubDate>Wed, 23 Sep 2020 04:45:44 +0000</pubDate>
      <link>https://forem.com/traindex/how-is-semantic-search-different-from-keyword-search-578d</link>
      <guid>https://forem.com/traindex/how-is-semantic-search-different-from-keyword-search-578d</guid>
      <description>&lt;p&gt;With the exponential growth of information, finding the right information is like looking for a needle in a haystack. Bubbling the right information to the top of the search results is essential for efficiently working in the knowledge economy. Putting the best relevant results in the limited place of the first  page is what distinguishes an excellent search engine from a good search engine. This is the challenge we are solving at Traindex.&lt;/p&gt;

&lt;p&gt;One of the first challenges we are tackling is in the patent space. Patent analysts might want to know what other patents exist in the same domain as a new patent being filed. They may want to find prior art to challenge a claim in an existing patent. There are numerous use cases that a better patent search helps solve.&lt;/p&gt;

&lt;p&gt;We have experimented with numerous approaches to retrieving relevant information quickly and effectively. These approaches center on two fundamental techniques: keyword search and semantic search.&lt;/p&gt;

&lt;p&gt;In this article, we will take a deep dive into the difference between semantic search and keyword search, and which approach is better.&lt;/p&gt;

&lt;p&gt;A keyword search is a simple keyword lookup in a corpus of documents against a query. The system will retrieve all the documents from the database which have any keyword present in the query. We can set constraints of whether all words in the query should be present in the retrieved results or any single word in the document would be sufficient to bring it up.&lt;/p&gt;
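The any-vs-all constraint can be sketched in a few lines over a toy in-memory corpus (this is an illustrative sketch, not a real retrieval system):

```python
def keyword_search(query, documents, match_all=False):
    """Return documents containing the query keywords (all of them, or any)."""
    keywords = set(query.lower().split())
    results = []
    for doc in documents:
        words = set(doc.lower().split())
        matched = keywords.intersection(words)
        if (match_all and matched == keywords) or (not match_all and matched):
            results.append(doc)
    return results

docs = ["raining cats and dogs", "dogs bark loudly", "heavy rain expected"]
print(keyword_search("raining dogs", docs))                  # any keyword matches
print(keyword_search("raining dogs", docs, match_all=True))  # every keyword must match
```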

&lt;p&gt;One drawback of this approach is that the retrieval system does not care about the meaning of a keyword in the context of the document or the query; it simply brings back every document containing the keywords the user specified. Such a search can return irrelevant results (false positives). To see how keyword search works, take a look at the following diagram.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--udjheprR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/zaajxnze3d10wk87us62.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--udjheprR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/zaajxnze3d10wk87us62.png" alt="Screenshot from 2020-09-23 09-30-47"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above diagram, each small box shows the documents that contain a specified term, e.g. “A”. The diagram shows a user-entered query, “raining cats and dogs”, and how the system retrieves the documents matching the terms used: in this case, all the documents containing “raining”, “cats”, “and”, and “dogs”. But “raining cats and dogs” is an English phrase used to describe heavy rain. The system might surface some relevant results, but they would be few in number and possibly ranked arbitrarily (depending on the database structure). Moreover, each word in the query contributes independently, regardless of the meaning imposed by its neighbors. Scaling keyword search is also a problem: with millions of documents, it can slow the search engine’s response time. Keyword searches may also fail to retrieve related documents that don’t use the search term verbatim (false negatives). Under these conditions, researchers can miss pertinent information, and there is the danger of making business decisions based on a less-than-comprehensive set of search results.&lt;/p&gt;

&lt;p&gt;We use semantic search in Traindex. Unlike keyword search, semantic search takes into account the meaning of words according to their context. In semantic search, a latent vector representation of each document is learned during training, projecting the documents into a latent space. At inference time, the incoming query is converted into the same latent representation and projected into the space where the documents already live. The points nearest to the query are retrieved as the most similar documents.&lt;/p&gt;

&lt;p&gt;Take a look at the following illustration using a 3D latent space, although in practice these latent spaces can have hundreds to a few thousand dimensions. At Traindex, we use latent representations of 200 to 600 dimensions to better capture the documents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5juXFRKJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/q9gybua7yhlqn8m1gldu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5juXFRKJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/q9gybua7yhlqn8m1gldu.png" alt="Figure_1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The red dot represents the query a user entered, whereas the blue dots are the documents projected into the 3D latent space. From here there are a number of ways to locate similar documents: for example, the Euclidean distance between the points, or the cosine of the angle between them (known as cosine similarity), and the list of similarity metrics goes on.&lt;/p&gt;
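A minimal sketch of cosine similarity over such latent vectors; the 3D coordinates below are made up for illustration, not real document embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

query = [0.2, 0.9, 0.1]      # the red dot
doc_close = [0.25, 0.85, 0.15]
doc_far = [0.9, 0.1, 0.4]
print(cosine_similarity(query, doc_close))  # close to 1.0: a near neighbor
print(cosine_similarity(query, doc_far))    # much lower: a distant document
```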

&lt;p&gt;Now that we have a solid understanding of how both searches work, let's compare them side by side.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Keyword Search&lt;/th&gt;
&lt;th&gt;Semantic Search&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Synonyms could be neglected during the search&lt;/td&gt;
&lt;td&gt;Incorporates the meaning of words and hence comprehends synonyms as well&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Keywords must be picked carefully for the search&lt;/td&gt;
&lt;td&gt;The query is automatically enriched by the latent encoding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The retrieved information depends on keywords and page-ranking algorithms, which can produce spam results&lt;/td&gt;
&lt;td&gt;The retrieved information is independent of keywords and page-ranking algorithms, producing relevant results rather than irrelevant ones&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Semantic search seeks to improve search accuracy by understanding a searcher's intent through the contextual meaning of the words, and brings back the results the user intended to see.&lt;/p&gt;

&lt;p&gt;In patents, technical and domain-specific terms are heavily used; these might not appear in English dictionaries and hence can be missed by a patent analyst while they are searching for prior art. Semantic search in Traindex processes the whole patent as the input for (prior art and similarity) search and uses a vectorized representation of the text to find synonyms and matching terms in other patents, which is impossible with a keyword search.&lt;/p&gt;

&lt;p&gt;Moreover, keyword search is generally restricted to a fixed number of words, at most around 50, and its response time degrades as the number of keywords grows. In Traindex, you can search with any number of words, including whole patents of hundreds of thousands of words, and still get results in real time. Semantic search in Traindex converts the whole patent into a vector representation and matches the most similar documents in the database, overcoming the limits of keyword search.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>nlp</category>
    </item>
  </channel>
</rss>
