<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Tina</title>
    <description>The latest articles on Forem by Tina (@tinaward).</description>
    <link>https://forem.com/tinaward</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F121537%2Fabed79be-a85a-42ed-92ac-684b3a63429e.jpeg</url>
      <title>Forem: Tina</title>
      <link>https://forem.com/tinaward</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/tinaward"/>
    <language>en</language>
    <item>
      <title>Need your opinion</title>
      <dc:creator>Tina</dc:creator>
      <pubDate>Thu, 31 Jan 2019 06:21:14 +0000</pubDate>
      <link>https://forem.com/tinaward/need-your-oppinion-5blb</link>
      <guid>https://forem.com/tinaward/need-your-oppinion-5blb</guid>
      <description>

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1N7s4bZz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/4ivyn9oxc70b6if4rr1z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1N7s4bZz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/4ivyn9oxc70b6if4rr1z.png" alt="Oppinion"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Greetings like-minded people!:)
&lt;/h2&gt;

&lt;p&gt;I really need your opinion. I am thinking of splitting my blogs here and at &lt;a href="https://www.quora.com/profile/Tina-Ward-91/all_posts"&gt;https://www.quora.com/profile/Tina-Ward-91/all_posts&lt;/a&gt; into two parts: one about pure Data Science, and another about applied Python tools for DS. I would really appreciate your opinion. Thanks!&lt;/p&gt;

&lt;p&gt;PS: You can also read the article here: &lt;a href="https://wardtina.quora.com/How-to-Create-Unique-Image-of-Gambling-Topic-Using-Python"&gt;https://wardtina.quora.com/How-to-Create-Unique-Image-of-Gambling-Topic-Using-Python&lt;/a&gt;&lt;/p&gt;


</description>
      <category>python</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Parse in Gambling: How to Write Your Parser in 15 Minutes?</title>
      <dc:creator>Tina</dc:creator>
      <pubDate>Wed, 19 Dec 2018 09:37:15 +0000</pubDate>
      <link>https://forem.com/tinaward/parse-in-gambling-how-to-write-your-parser-in-15-minutes-4jlo</link>
      <guid>https://forem.com/tinaward/parse-in-gambling-how-to-write-your-parser-in-15-minutes-4jlo</guid>
      <description>&lt;h2&gt;
  
  
  Step 1 - Parsing: What? Why? How?
&lt;/h2&gt;

&lt;p&gt;Generally speaking, parsing is the sequential comparison of a sequence of words against the rules of a language. The concept of "language" is meant here in the widest sense: it may be a natural language (English, German, Japanese, etc.) used for communication between people, or a formal language, in particular any programming language. &lt;br&gt;
Website parsing is the sequential syntactic analysis of information posted on web pages. Note that this information is a hierarchical data set, structured using both human and programming languages. When creating a website, the developer inevitably faces the task of determining the optimal page structure. But where do you find an example of an optimal page? Don't reinvent the wheel in the early stages of automating the optimization process: it's enough to analyse your direct competitors, especially in a niche as saturated and competitive as gambling. There is a lot of such data, so a number of non-trivial tasks have to be solved to extract it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;collecting search engine results;&lt;/li&gt;
&lt;li&gt;handling the large volume of information available online, which is hardly possible for one person or even a team of analysts to process;&lt;/li&gt;
&lt;li&gt;providing frequent updates: information sometimes changes every minute, and keeping up with such a dynamic stream manually is hardly advisable; automating this process saves time on monitoring changes (for instance, in casino promotions) and lets you update your own site automatically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compared to a human, a parser program can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;quickly traverse thousands of web pages;&lt;/li&gt;
&lt;li&gt;neatly separate technical information from "human" content;&lt;/li&gt;
&lt;li&gt;unmistakably select what is needed and discard the superfluous;&lt;/li&gt;
&lt;li&gt;effectively pack the final data into the required form.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In most cases the result of parsing is a database or spreadsheet that is then subjected to additional processing. Currently, parsers are written in a wide range of programming languages, such as &lt;a href="https://www.python.org" rel="noopener noreferrer"&gt;Python&lt;/a&gt;, &lt;a href="https://cran.r-project.org" rel="noopener noreferrer"&gt;R&lt;/a&gt;, C++, Delphi, Perl, Ruby and PHP. I certainly choose Python as the most universal language with a simple syntax. That simplicity is part of what makes Python unique: it allows a large number of programmers to read someone else's Python code with no trouble.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1 P.S. - Ways to Improve
&lt;/h2&gt;

&lt;p&gt;If you want to improve your script in the future or write a smarter parser, you may find some useful resources here: &lt;a href="https://www.seleniumhq.org/download/" rel="noopener noreferrer"&gt;https://www.seleniumhq.org/download/&lt;/a&gt; and &lt;a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" rel="noopener noreferrer"&gt;https://www.crummy.com/software/BeautifulSoup/bs4/doc/&lt;/a&gt;.&lt;br&gt;
The result (whether a database or a spreadsheet) will certainly need further processing. However, those subsequent manipulations with the collected information are beyond the topic of parsing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2 - Software implementation
&lt;/h2&gt;

&lt;p&gt;First, let's discuss the algorithm. What do we want to get? The optimal structure of the document relative to the keyword. It's likely that the most successful structure belongs to the pages that rank at the top for the chosen keyword.&lt;br&gt;
Thus, the algorithm of the forthcoming work can be divided into several parts:&lt;br&gt;
1) Choose an example of the page to check the structure: &lt;a href="https://casino-now.co.uk/mobile-casino/" rel="noopener noreferrer"&gt;https://casino-now.co.uk/mobile-casino/&lt;/a&gt; &lt;br&gt;
2) Identify a keyword: &lt;strong&gt;mobile casino&lt;/strong&gt;&lt;br&gt;
3) Get a list of the most optimized competitors&lt;br&gt;
4) Get the structure of their pages and check the optimization parameters for the keyword&lt;/p&gt;

&lt;p&gt;To ensure the request is processed correctly, we have to programmatically wait until the page has finished loading. The helper function is:&lt;/p&gt;


&lt;div class="runkit-element"&gt;
  &lt;code&gt;
    
def readystate_complete(d):
    return d.execute_script("return document.readyState") == "complete"

  &lt;/code&gt;
&lt;/div&gt;
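&lt;p&gt;To see what this predicate does without launching a browser, you can exercise it against a stand-in driver object (a quick sketch; &lt;code&gt;FakeDriver&lt;/code&gt; is a hypothetical helper, not part of Selenium):&lt;/p&gt;

```python
def readystate_complete(d):
    return d.execute_script("return document.readyState") == "complete"

# Hypothetical stand-in that mimics the Selenium driver interface,
# always reporting the readiness state it was constructed with.
class FakeDriver:
    def __init__(self, state):
        self.state = state

    def execute_script(self, script):
        return self.state

print(readystate_complete(FakeDriver("complete")))  # True
print(readystate_complete(FakeDriver("loading")))   # False
```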
 

&lt;p&gt;Then, it’s necessary to determine the keyword for which we want to view the output and implement Selenium driver to simulate keyboard input: &lt;br&gt;
&lt;/p&gt;
&lt;div class="runkit-element"&gt;
  &lt;code&gt;
    
import time

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

mainKey = "mobile casino"

driver = webdriver.Firefox()
driver.get("http://www.google.com")

elem = driver.find_element_by_name("q")
elem.send_keys(mainKey)
elem.submit()

WebDriverWait(driver, 30).until(readystate_complete)
time.sleep(1)

htmltext = driver.page_source


  &lt;/code&gt;
&lt;/div&gt;
 &lt;br&gt;
As a result, the source code of the page shown below will be stored in the ‘htmltext’ variable:&lt;br&gt;
&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mrbptv5hg7e1rfa5ycy.png" alt="screen"&gt;&lt;br&gt;
Note that the robot icon in the screenshot means that the browser is currently under remote control - in our case, by Python.&lt;br&gt;
&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3o4rlv5k0ux2zcv5wuey.png" alt="screen"&gt;&lt;br&gt;
After the raw HTML text is obtained, it's time to process the pages for parsing. The easiest way is to inspect the code of the element you are interested in and then use regular expressions to isolate the information you need, forming a list of objects. For example, to collect the URLs of competitors: &lt;br&gt;
&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fprz4dpxd2t6ccbr0hlzz.png" alt="screen"&gt;&lt;br&gt;
Then let’s check the occurrence of the pattern of interest:&lt;br&gt;
&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jlfn5qvx7prtmtdshy1.png" alt="screen"&gt;&lt;br&gt;
And write a regular expression that looks like this:


&lt;div class="runkit-element"&gt;
  &lt;code&gt;
    
import re

pages = re.compile('(.*?)', re.DOTALL | re.IGNORECASE).findall(str(htmltext))

  &lt;/code&gt;
&lt;/div&gt;
 &lt;br&gt;
As a result, we get a list of the top 10 competitors' pages for the keyword of interest.&lt;br&gt;
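&lt;p&gt;A self-contained sketch of this extraction step (the sample text and the pattern here are illustrative assumptions, not the article's actual expression):&lt;/p&gt;

```python
import re

# Illustrative raw page source (an assumption, not real search output).
htmltext = ('result one href="https://casino-now.co.uk/mobile-casino/" '
            'result two href="https://example.com/mobile-casino-guide/"')

# Pull every quoted http(s) URL out of the raw text.
pages = re.findall(r'href="(https?://[^"]+)"', htmltext)
print(pages)
```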
The next stage is re-parsing, similar to the above, but for each URL in the list. The results are collected in a dataframe; the full code can be viewed on GitHub: &lt;a href="https://github.com/TinaWard/FirstStepForParsing/" rel="noopener noreferrer"&gt;https://github.com/TinaWard/FirstStepForParsing/&lt;/a&gt;.&lt;br&gt;
The following is only the snippet of code responsible for clearing the raw HTML document of tags and scripts. The result is the cleaned text of the site, which will be used to calculate keyword density.&lt;br&gt;
&lt;div class="runkit-element"&gt;
  &lt;code&gt;
    
from bs4 import BeautifulSoup

html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

  &lt;/code&gt;
&lt;/div&gt;
 &lt;br&gt;
The resulting text document is the basis on which the numerical characteristics used to evaluate competitors are determined - for example, keyword density.&lt;br&gt;
The final result is a file containing each competitor's page address, its HTML, the document structure and the calculated keyword density:&lt;br&gt;
&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flyvtncpfg65xkbbz665l.png" alt="screen"&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;This article has covered two key parts of parsing: automatic browser control using Selenium and raw HTML processing by means of Beautiful Soup.&lt;br&gt;
Build your websites on best practice! Good luck!&lt;br&gt;
Leave comments and propose topics you would like to know more about:  &lt;a href="mailto:tina.ward@mail.uk"&gt;tina.ward@mail.uk&lt;/a&gt;&lt;br&gt;
A word cloud of the article. Have fun!&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3f9yfudvssgv6maxjn90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3f9yfudvssgv6maxjn90.png" alt="screen"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>gambling</category>
    </item>
    <item>
      <title>Data Science - Does it work at all?</title>
      <dc:creator>Tina</dc:creator>
      <pubDate>Thu, 13 Dec 2018 11:10:55 +0000</pubDate>
      <link>https://forem.com/tinaward/data-science---does-it-work-at-all-3e2j</link>
      <guid>https://forem.com/tinaward/data-science---does-it-work-at-all-3e2j</guid>
      <description>&lt;h2&gt;
  
  
  First, explain what Data Science is!
&lt;/h2&gt;

&lt;p&gt;Generally speaking, Data Science is a set of disciplines from various fields responsible for analyzing data and finding the best solutions based on it. Previously, only mathematical statistics was used for this; later, machine learning and artificial intelligence were added, bringing in optimization and computer science as methods for analyzing data.&lt;/p&gt;

&lt;h2&gt;
  
  
  And what do scientists from this field do?
&lt;/h2&gt;

&lt;p&gt;First, programming, mathematical models and statistics. But not only. It is very important for them to understand what is happening in the subject area (for example, in financial processes, bioinformatics, banking or even in a computer game) in order to answer real questions: what risks a company faces, which sets of genes correspond to a certain disease, how to recognize fraudulent transactions, or what kind of behavior marks the players who should be banned.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.unomaha.edu%2Fcollege-of-arts-and-sciences%2Fmathematics%2F_files%2Fimages%2Fdatasciencecircle.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.unomaha.edu%2Fcollege-of-arts-and-sciences%2Fmathematics%2F_files%2Fimages%2Fdatasciencecircle.jpg" title="Flower Science" alt="data-science"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  And why is it even needed?
&lt;/h2&gt;

&lt;p&gt;First of all, the analysis of large data sets makes it possible to take decisions more efficiently. This was shown, for example, by the latest election campaigns in the USA: using algorithms over a data array, it is possible to capture the mood of the audience and target advertising messages more precisely (as Donald Trump's team apparently demonstrated during the election campaign).&lt;/p&gt;

&lt;p&gt;Data analysis can bring benefits in virtually any applied area where there is enough data. In medicine, for example, algorithms help diagnose diseases more accurately and prescribe a treatment plan. Personnel management can be improved if algorithms help to spot early signs of problems within a team.&lt;/p&gt;

&lt;h2&gt;
  
  
  And when did they start using it?
&lt;/h2&gt;

&lt;p&gt;Recently. With the growth of both data and computing power, it became possible to solve old problems more effectively. Many of the algorithms used today have been known for decades; they have simply become more relevant and more efficient. Machine learning algorithms require a huge amount of information. The recently achieved image recognition that outperforms humans, more accurate translators and better weather forecasts - all of this is like a space rocket for which a suitable fuel was finally found.&lt;/p&gt;

&lt;h2&gt;
  
  
  But all the same, people make decisions?
&lt;/h2&gt;

&lt;p&gt;Mostly yes, for now. But in general, with sufficient technical knowledge, it is already possible to automate simple decisions - those governed by clear, executable rules. For example, cybersecurity systems today run almost entirely on machine learning algorithms, deciding whether to send an email to spam or block a suspicious transaction. Of course, they do this on the basis of existing data.&lt;/p&gt;
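&lt;p&gt;A toy sketch of such a rule-based decision (purely illustrative - the word list and threshold are made up, and real systems learn such patterns from data rather than hard-coding them):&lt;/p&gt;

```python
# Hypothetical list of suspicious words and a cutoff threshold.
SUSPICIOUS = {"winner", "free", "prize", "urgent"}

def is_spam(message, threshold=2):
    # Flag the message when enough suspicious words appear in it.
    words = set(message.lower().split())
    return len(words.intersection(SUSPICIOUS)) >= threshold

print(is_spam("urgent free prize inside"))   # True
print(is_spam("meeting moved to thursday"))  # False
```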

&lt;p&gt;The next step in using Data Science is automating more complex decisions, or creating a smart assistant. Navigation apps already work something like this; you may also remember T9 on old phones, which learned from our phrases and adapted to them. Next comes the automation of chains of tasks, or even of entire professions.&lt;/p&gt;

</description>
      <category>datascience</category>
    </item>
  </channel>
</rss>
