Python

Python is a widely used general-purpose, high-level programming language. Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in many other languages. Python supports multiple programming paradigms, including object-oriented, imperative, procedural and functional styles. It features a dynamic type system and automatic memory management, and it has a large and comprehensive standard library, complemented by third-party packages such as NumPy, SciPy, Django and PyQuery.


hi,

I have a massive Python project without a README file. What is the best way to understand how to run that project?

many thanks.
0

Hello,
Back when C# came out, I did not know how to pronounce this combination of C and a # (sharp). Talking to a developer, he chuckled when I said "C pound."
And now, when reading about generic decorators, it is not written how to pronounce these:
*args, **kwargs
Might anyone have the phonetics /fəˈnediks/ for these two parameters:
*args, **kwargs

Thanks
0
I have a cheap Microsoft RFID reader from eBay. I am using 125 kHz chips. I have a Python program that writes the card number and date to a text file on a server. Groups of people scan their cards and it writes each number and date on a new line in this text file. Sometimes, however, it writes a string of numbers all on the same line with just one date at the end. Could the problem be that a delay should be written into the Python program between reading the card and writing it, say 100 ms?
Each entry should be on a new line and look like this:

0268340542, , , , 11:28:41, 2017-06-07

I get a lot like the above and then it does the following:

026835201202683520560268339729026833971602683537820268339746026835368202683500370268351993026833974102683397410268350159026835013402683375290268350089026835203902683501050268350105, , , , 09:25:48, 2017-06-12

So instead of recording each RFID card it puts them all into the one line with one , , ,  , time and date at the end.
Someone has suggested that it is parallel writing and to lock the file between writes but I don’t know how to do this.
The code is below:
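The original code is not reproduced here. As a minimal sketch of the locking suggestion, assuming the server is a POSIX system (the `fcntl` module is not available on Windows) and using a hypothetical helper name `append_scan`, each scan can be appended as one complete line while an exclusive lock is held:

```python
import fcntl
from datetime import datetime

def append_scan(path, card_number, when=None):
    """Append one scan as a single line, holding an exclusive lock while writing."""
    when = when or datetime.now()
    # strip() drops any stray whitespace or line ending delivered by the reader
    line = "{}, , , , {:%H:%M:%S}, {:%Y-%m-%d}\n".format(card_number.strip(), when, when)
    with open(path, "a") as log:
        fcntl.flock(log, fcntl.LOCK_EX)      # blocks until no other writer holds the lock
        try:
            log.write(line)
            log.flush()
        finally:
            fcntl.flock(log, fcntl.LOCK_UN)
    return line
```

Whether locking alone cures the problem depends on where the numbers get joined. If the reader delivers several card numbers before the program sees a line ending, the read side needs attention as well.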
0
Hello everybody,
i wanna know what's the use of "yield" in python and what differentiates  it from "return"
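A small sketch may help here (the function names are made up for illustration): `return` hands back one finished value and ends the call, while `yield` turns the function into a generator that produces one value at a time and pauses in between.

```python
def first_squares_list(n):
    """return hands back one finished value and ends the call."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result

def first_squares_gen(n):
    """yield hands back one value at a time and pauses until the next is requested."""
    for i in range(n):
        yield i * i

squares = first_squares_gen(3)   # nothing has run yet; this is a generator object
print(next(squares))             # 0
print(next(squares))             # 1
print(list(squares))             # [4], the remaining values
print(first_squares_list(3))     # [0, 1, 4]
```

The generator version never builds the whole list in memory, which matters when the sequence is large or unbounded.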
0
We are using ELK (Elasticsearch, Logstash, and Kibana) for our log management. Every time, I export the settings from the Kibana UI via Management--->Saved Objects--->Export Everything. For a demo, one can check out the URL: Kibana Demo

cURL
I want to automate this export process with some scripting on Linux. I tried cURL, but the header/payload data looks to be dynamic and might have to be updated frequently. I want to try some web scraping techniques with web automation tools like Selenium and Python.

Selenium Webdriver
I tried to record this export action with Selenium IDE, and from the IDE I exported the test plan/case into Python (kibana_python.py). When we click on the Export Everything button in Kibana, a Firefox window opens asking where to save. This action is not handled by Selenium, meaning file download is not supported by Selenium.

So, I'm looking for some scripting to export the JSON (Export Everything) file from the Kibana UI, and it must be headless as I would be scheduling this script on a Linux server. To be headless, I also tried PhantomJS, but even this doesn't support file download.

Simply, I just want a script to automate clicking the Export Everything button in Kibana and save the result to a file. Please share your thoughts or any ideas, I've been trying this for days...
0
Hi All,

I have written a short program to read a private Microsoft Message Queue and reformat the data to make it more legible and store it in a file with a filename identifying the part or assembly. This code is part of a system to keep people informed about engineering changes to SolidWorks PDM Works files.

The problem I am having is that the command I am using to read the queue (queue.Receive()) blocks.

I have been unable to set the timeout or remove the block. The code compiles -- whatever that means for an interpreted language (using Geany) -- and runs.

It processes all the information correctly and generates the files as designed to do until the queue is empty.  Then it hangs (blocks) and waits for another message to arrive in the queue.

I want it to process all the messages in the queue and then exit.

I will set a trigger to run the program again when the private queue receives more messages.

Related to this problem are the import command and missing attributes and methods. For example, I import queue, and queue.qsize(), queue.empty() and some others are missing. I saw that qsize and empty might be eliminated.

However, I looked at queue.py and could not find Receive. I was hoping to find the structure of Receive and figure out how to set the timeout and block.

No luck,

I  have attached the py file.

Thanks in advance for the help,

Andre'
PDM-Update-Reader.py
0
Hi.
I converted the DEX reader code sample from the Python example to Java, but I am failing on the master handshake (after slave, in master mode).
It fails on State 3 - Sending master key.
Does anyone have experience reading audit data from a DEX vending machine from Android over Bluetooth or a DEX cable?
File provided.
My task: allow the user to read DEX from a vending machine using a Bluetooth DEX device.
DexReader.java
0
So i'm trying to parse some xml with ElementTree, but it's got smileys in what seems to be UTF-16 decimal.
it's got this `&#55357;&#56835;` in it but says it's UTF-8 in the <?xml?> tag.


How do I decode UTF-16? Is that the right question to ask?
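For what it's worth, `&#55357;&#56835;` are the two halves of a UTF-16 surrogate pair (0xD83D and 0xDE03) written as separate decimal character references; together they stand for U+1F603. A minimal sketch, assuming the references are always decimal and that the helper name `fix_surrogate_refs` is free to use, is to merge such pairs before handing the text to ElementTree:

```python
import re
import xml.etree.ElementTree as ET

def fix_surrogate_refs(xml_text):
    """Merge decimal character references that form UTF-16 surrogate pairs."""
    def to_char(match):
        number = int(match.group(1))
        # Only surrogate halves are expanded; other references such as &#60; stay escaped.
        return chr(number) if 0xD800 <= number <= 0xDFFF else match.group(0)
    with_surrogates = re.sub(r"&#(\d+);", to_char, xml_text)
    # Round-tripping through UTF-16 joins each high/low pair into one code point.
    return with_surrogates.encode("utf-16", "surrogatepass").decode("utf-16")

root = ET.fromstring(fix_surrogate_refs("<msg>hi &#55357;&#56835;</msg>"))
print(root.text == "hi \U0001F603")   # True
```

Hexadecimal references (`&#xD83D;`) are not handled by this sketch, and a lone surrogate half would raise an error during the round trip.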
0
Update: Found my own solution, but would prefer a tal: solution if there is instead of javascript

I am running Plone 5, and have created two content types:
1. Purchase Order
2. Purchase Order Details.

I have created a purchase order custom page template which includes information about the selected Purchase Order (Title/Summary), and a table with the Purchase Order Details (Item Description/Item Summary/Quantity/Cost/SubTotal) as pictured below.  I am struggling with how to calculate the total of the purchase order in this scenario.  In the code below I only initialize a value of 0, but need to add the code to increase the total after every iteration of the purchase order details.

Purchase Order Content Type Page Template
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"
      xmlns:tal="http://xml.zope.org/namespaces/tal"
      xmlns:metal="http://xml.zope.org/namespaces/metal"
      xmlns:i18n="http://xml.zope.org/namespaces/i18n"
      lang="en"
      metal:use-macro="context/main_template/macros/master"
      i18n:domain="plone">

<metal:css fill-slot="style_slot">
<style type="text/css">
    <!-- Replace this with your views' custom CSS -->
</style>
</metal:css>

<metal:javascript fill-slot="javascript_head_slot">
<script type="text/javascript">
jQuery(function($) {
    // Replace this with your view's custom onLoad-jQuery-code.
});
</script>
</metal:javascript>

<body>

<metal:content-core fill-slot="content-core">
    <metal:content-core 


0
hi,

how do i install and run a python project from getlib on my mac computer?

thanks.
0

Hi everyone,

So I've spent the past week reading and trying to understand the Python language and, more importantly, the CherryPy web framework.
To be honest I'm getting nowhere quickly...

What I'm trying to do is simply build a frontend to a Raspberry Pi project I've been developing.
The CherryPy web framework seems perfect for this as it is small and contains its own web server that would suit my needs perfectly.

An admin user attaches to the Raspberry Pi AP and is immediately directed to a webpage (this I have solved quite easily).
The webpage is presented by the CherryPy web server.

I have an index.html which is returned from my CherryPy script.

import os, os.path, sys
import cherrypy

# Configuration file to access server over network and define ports
cherrypy.config.update("server.conf")

class menu(object):
    @cherrypy.expose
    def index(self):
        return open('index.html')

(Fairly simply until now I know)

What I would like to do is have buttons on my index.html that can be pressed and return os.system('mkdir boom') (obviously my system commands will be a little more than this; they will start and stop services).

The buttons will eventually be toggle switches, so going to the page will need to return the current status of the running process. Red if the process is not running and Green if it is running. I think I need to be interacting with jquery on my index.html page to achieve this...

This whole area is new to me and …
0
I'm currently trying to create a script that will search for specific data in an excel document and print out a range of cells.

For example, it would search for "color" in column A. If "color" was found in A4 it would print out the data in A4, B4, C4, D4, etc.

import openpyxl
wb = openpyxl.load_workbook('example.xlsx')
sheet = wb['Sheet1']  # get_sheet_by_name() is deprecated in current openpyxl

for rowOfCellObjects in sheet['A1':'C3']:
    for cellObj in rowOfCellObjects:
        print(cellObj.coordinate, cellObj.value)



This is the only code I have so far which is an excerpt from Al Sweigart. “Automate the Boring Stuff with Python: Practical Programming for Total Beginners.” iBooks. and the only thing it does is search for specific cell addresses and prints those.

Is this a feasible project? I'm very new to Python so I don't fully grasp its limitations.

Any help would be appreciated!
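This looks feasible. As a rough sketch of the search itself, assuming the rows are available as plain tuples (with openpyxl 2.6 or later they could come from `sheet.iter_rows(values_only=True)`) and using a hypothetical helper name `find_rows`:

```python
def find_rows(rows, needle, column=0):
    """Yield (row_number, row) for every row whose cell in `column` equals `needle`."""
    for number, row in enumerate(rows, start=1):
        if row[column] == needle:
            yield number, row

# Stand-in data for illustration; in the real script the rows would be read
# from the worksheet instead of being typed in.
rows = [
    ("size", "small", "medium", "large"),
    ("weight", 1, 2, 3),
    ("shape", "round", "square", None),
    ("color", "red", "green", "blue"),
]
for number, row in find_rows(rows, "color"):
    print(number, row)   # 4 ('color', 'red', 'green', 'blue')
```

Column 0 corresponds to column A, so a match in the fourth row prints the values of A4, B4, C4 and D4.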
0
*I was tasked with writing a script (suggested in python but can be anything that the Crontab will work with) but my programming skills are definitely lacking. In the past week I have been researching and trying to learn as much as I can and below is what I came up with but I could use some help from the EE community. Thanks!

I am trying to write a script in python that will run on our Crontab of our Linux VM server. The script is supposed to recursively go through all the files on the remote server directory and download the specified file type (.csv in this case) and then promptly delete whatever files successfully download from the server. I have the following code using pysftp but I am open to using other modules such as paramiko or lftp if it's easier. It just needs to be able to be run as a Cron job in our Crontab on the linux server.

Below is my current code, but it isn't working in terms of remotepath and localpath. I tried to tell it to download the file to a certain path but it's not going there. Also, I don't know the command to remove only the files it downloaded. In other words, files might already be on the local server, and if so they should NOT be downloaded, which means the file shouldn't be removed from the remote server (if it doesn't download, it shouldn't get removed).

For security purposes I have removed the server info. I did create a public/private key for ssh but unsure of how to implement it into my code.

import pysftp
import sys
import glob

srv 


0
I have a txt file containing data in following format:

abc 123 456
cde 45 32
efg 322 654
abc 445 856
cde 65 21
efg 147 384
abc 815 078
efg 843 286
and so on. How can I transpose it into the following format using Python:

abc 123 456 cde 45 32 efg 322 654
abc 445 856 cde 65 21 efg 147 384
abc 815 078 efg 843 286
Also, in case cde/efg is missing after abc, it should insert blank spaces instead, since it is a fixed-width file.
One more thing: abc will always be present, but sometimes the row starting with cde or efg will not be there.
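One possible approach, sketched under the assumption that every group gets a fixed column width of 12 characters (the real width is not stated) and with a made-up function name `regroup`: start a new record at each abc line and pad missing groups with blanks.

```python
def regroup(lines, keys=("abc", "cde", "efg"), width=12):
    """Join each run of lines that starts with keys[0] into one fixed-width record."""
    records = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue                      # skip blank lines
        if fields[0] == keys[0] or not records:
            records.append({})            # an "abc" line opens a new record
        records[-1][fields[0]] = " ".join(fields)
    # Missing keys become blank padding so every column keeps its width.
    return ["".join(rec.get(key, "").ljust(width) for key in keys) for rec in records]

sample = """abc 123 456
cde 45 32
efg 322 654
abc 445 856
cde 65 21
efg 147 384
abc 815 078
efg 843 286""".splitlines()

for record in regroup(sample):
    print(record)
```

In the third record the cde column is twelve blanks, so the efg group starts at the same position as in the other records.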
0
Hi,

I'm looking to use the GoogleFinance module
https://github.com/hongtaocai/googlefinance

to record the stock price every few minutes into a MySQL database

From the documentation I can see that it will require a list of stocks  - that I will keep in csv file

I'm having difficulty parsing the JSON that comes back in order to read it into a variable.
'TypeError: list indices must be integers, not str'

Can anyone direct me to some resources that might give me a steer on how best to achieve this?

I will look to create a table for each stock being tracked
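That TypeError usually means a list was indexed with a string. A small sketch with a hypothetical payload (the key names below are assumptions for illustration, not confirmed against the module's output) shows the difference between indexing the list and indexing the dictionaries inside it:

```python
import json

# Hypothetical payload: a JSON array with one object per stock.
payload = '[{"StockSymbol": "AAPL", "LastTradePrice": "129.09"}]'
quotes = json.loads(payload)      # a list, because the JSON text starts with "["

try:
    quotes["LastTradePrice"]      # indexing the list with a string
except TypeError as err:
    print(err)                    # the "list indices must be integers" error

for quote in quotes:              # each element is a dict
    print(quote["StockSymbol"], float(quote["LastTradePrice"]))
```

If the module already returns Python objects rather than JSON text, the `json.loads` step is unnecessary and only the loop applies.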
0
Hi,

I've been using Scapy for 2 months and it is a very good piece of software... I have been able to do everything else besides this one thing.

I want to send an RTS packet and receive a CTS.

--------------------ENVIRONMENT---------------------
$lsb_release
LSB Version:      core-9.20160110ubuntu0.2-amd64:core-9.20160110ubuntu0.2-noarch:security-9.20160110ubuntu0.2-amd64:security-9.20160110ubuntu0.2-noarch

$uname -a
Linux EDUARDO 4.4.0-81-generic #104-Ubuntu SMP Wed Jun 14 08:17:06 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

$ dpkg -l | grep scapy
ii  python-scapy                                                2.2.0-1                                       all          Packet ge...

$ python --version
Python 2.7.12

2 x Alfa Wireless AWUS036NH Cards

--------------------PROBLEM and TESTING Scenario---------------------

I want to be able to see the responses in Wireshark, by sending an RTS with Scapy on one Alfa card and capturing the RTS+CTS in Wireshark on the other.

How can this be achieved?

Eduardo

--------------------CODE---------------------
import datetime
import time  # time.sleep() is used in the loop below
from scapy.all import *
from scapy.all import Dot11,Dot11Beacon,Dot11Elt,RadioTap,sendp

ifacx="mon1"

addr1='60:a4:d0:21:fa:3c'
addr2='22:2B:22:23:22:22'
addr3='33:3B:43:33:33:33'

i=1
while 1:
    time.sleep(.100)
    i = i + 1

    #Send RTS
    Doto11 = Dot11(type=1,subtype=11,addr1=addr1,addr2=addr2,addr3=addr3,ID=0x99)
    pkt = RadioTap()/Doto11
    sendp(pkt,iface=ifacx,realtime=True)
 …
0

Technologically assisted systematic reviews in empirical medicine

Shruti Gupta, Illinois Institute of Technology




Abstract--Evidence-based medicine has become an important strategy in the health care domain and in policy making. In order to practice evidence-based medicine, it is important to have a clear overview of the current scientific consensus. These overviews are provided in systematic review articles, which summarise all evidence published on a certain topic (e.g., a treatment or diagnostic test). In order to write a systematic review, researchers have to conduct a search that retrieves all relevant documents. This is a difficult task, known in the Information Retrieval (IR) domain as the total recall problem. With medical libraries expanding rapidly, automating this process becomes of utmost importance. We investigate three techniques from the IR domain, using a custom-built search engine. We search through PubMed Central and use the Cochrane review library as a gold standard. Improving the search results by expert feedback seems especially promising, as it is an easy process that increases recall.


I. INTRODUCTION


This report concerns the applicability of data mining to evidence-based medicine[1], which means integrating individual clinical expertise with the best available external clinical evidence from systematic searches. The objective of data mining in medicine includes deriving valuable knowledge that provides new insight beyond conventional medical experience. Although data mining is useful for exploring hidden knowledge, its outcome usually counts only as low-grade evidence because of uncontrolled bias and confounding. In the practice of evidence-based medicine, experimental studies provide strong evidence, whereas traditional observational studies contribute little to strict clinical evidence. Hence a mined hypothesis must be refined through a validation process that evaluates the validity of the mining results, refines the mined knowledge, and integrates the knowledge obtained by several research groups.


A. General Setup:


1) The Library:


The PubMed Central library[2] is used and all documents were downloaded, and the meta fields (title, publish date, keywords), body, and abstract were extracted from the raw XML.


2) The Search Engine:

The Elasticsearch 5.2.2 engine[3] (running on Windows 10, 64 bit computer) was used to index and search through the documents.


1. Data:


1.1 Data Source


The data source is a development set of about 20 topics for Diagnostic Test Accuracy (DTA) reviews from the Cochrane library.


We perform our experiments and result analysis on this development set and the shared qrels file.

The qrels file may well contain PIDs that are not included in the results of the Boolean query. Researchers often search multiple databases with multiple search strategies, so all PIDs included or excluded at document level are also listed in the qrels file.


The following table gives further information on the dataset:


Topic ID    No. of PIDs    Topic ID    No. of PIDs
CD010438    3249           CD011984    8221
CD007427    1469           CD010409    43484
CD009593    15076          CD010771    316
CD011549    12704          CD009591    8082
CD011134    1952           CD008691    1322
CD011975    8227           CD010632    1508
CD009323    3857           CD007394    2542
CD009020    1576           CD009944    1225
CD011548    12706          CD008643    15078

1.2 Data Format:


(i) Development Dataset: Each of the 20 dataset files provided has the following format:


Topic: [The topic number/topic ID.]

Title: [The title of the review, describing what the query is concerned with.]

Query: [The Boolean query formulated by Cochrane experts to address the task.]

Pids: [List of related PubMed PIDs as returned by the query listed above.]


(ii) QRELS file format: The qrels file follows the TREC format:


TOPIC ITERATION DOCUMENT# RELEVANCY 


where TOPIC is the topic number, ITERATION in our case is a dummy field always zero and not used, DOCUMENT# is the PubMed document identification number (PID), and RELEVANCY is a binary code of 0 for not relevant and 1 for relevant. The order of documents in a qrels file is not indicative of relevance or degree of relevance. Only a binary indication of relevant (1) or non-relevant (0) is given. Documents not occurring in the qrels file were not judged by the human assessor and are assumed to be irrelevant in the evaluations.
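As an illustration of this format, the following sketch reads qrels lines into a mapping from topic to the set of relevant PIDs. The function name `load_qrels` and the sample PIDs are made up for the example; it is not the project's actual loader.

```python
from collections import defaultdict

def load_qrels(lines):
    """Map each topic to the set of PIDs judged relevant (RELEVANCY == 1)."""
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue                      # skip blank or malformed lines
        topic, _iteration, pid, relevancy = parts
        relevant[topic]                   # register the topic even if nothing is relevant
        if relevancy == "1":
            relevant[topic].add(pid)
    return dict(relevant)

sample = [
    "CD010438 0 1000001 1",   # made-up PIDs for illustration
    "CD010438 0 1000002 0",
    "CD010771 0 1000003 1",
]
qrels = load_qrels(sample)
print(qrels)
```

Documents that never appear in the mapping are treated as irrelevant during evaluation, in line with the convention described above.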


1.3 Programming Language: We are using Python 3 for our project implementation.


Packages: We are using the following packages to carry out our project: PubMed, NLTK, Pandas, NumPy, matplotlib and ElasticSearch.

              

2. Experiments:


2.1 Preliminary experiments and the approaches employed:


In order to write a systematic review, we need a system that first fetches all documents relevant to the query. This is a difficult task, known in the Information Retrieval (IR) domain as the total recall problem. The current approach to this problem in the field of systematic reviews can roughly be divided into two phases:


(i) Search phase: The goal of the search phase is to obtain all relevant documents. Although this is practically impossible, the search is optimized to return as many relevant documents as possible. This is done at the cost of obtaining many additional irrelevant documents.


(ii) Filtering phase: Because the search phase yields many irrelevant results, the relevant documents then have to be selected from them. The selection is done manually, and is usually divided into multiple stages (e.g., a quick scan based on the abstract followed by a full-text scan).


The approaches that we use in our experiments to implement the best search are as follows:

 

1) Elasticsearch for ranking the relevant documents: We create a virtual server on our Windows operating system in order to retrieve documents quickly, based on the Boolean search and on term frequency (tf) and inverse document frequency (idf) calculations.

2) Boolean search query expansion: We also tried to implement a Boolean query expansion, which helps in searching for the terms of the query in all the related PID documents.


The following lists some of the preliminary experiments performed to process the data given:


2.1.1 Preprocessing:


Preprocessing is performed on the development data in order to remove irrelevant literals. It consists of the following steps:


  • Tokenization: Removing non-alphanumeric characters, punctuation etc. (using regular expressions).
  • Normalization: Removing stop words (using the English stop-word list of the Natural Language Toolkit, NLTK).
  • Field separation: Separating the attributes of each data file, namely ‘Topic:’, ‘Title:’, ‘Query:’ and ‘Pids:’, so that later operations can process the ‘Query’ and fetch the data for all related ‘Pids’ of each ‘Topic’ and ‘Title’.
  • Newline removal: Replacing “\n” with “” wherever applicable.


In this way the data is made ready for further processing, and irrelevant content is removed from the given data files.
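The preprocessing steps can be sketched roughly as follows. This is an illustration only: the helper names are hypothetical, and the tiny `STOP_WORDS` set stands in for a full English stop-word list so that the example runs without extra packages.

```python
import re

# A tiny stand-in stop list (assumption); a real run would use a full English list.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "or", "the", "to"}
FIELD = re.compile(r"^(Topic|Title|Query|Pids):\s*(.*)$")

def parse_topic_file(text):
    """Split a topic file into its 'Topic', 'Title', 'Query' and 'Pids' parts."""
    fields, current = {}, None
    for line in text.splitlines():
        stripped = line.strip()
        match = FIELD.match(stripped)
        if match:
            current = match.group(1)
            fields[current] = match.group(2)
        elif current and stripped:
            # Continuation line: append it to the field that was opened last.
            fields[current] = (fields[current] + " " + stripped).strip()
    fields["Pids"] = fields.get("Pids", "").split()
    return fields

def tokenize(text):
    """Lowercase, keep alphanumeric runs only, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]
```

Joining continuation lines with a space rather than an empty string keeps query terms from running together; that is a choice made for this sketch.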


2.1.2. Fetching the data for each Topic Id from the PubMed library:


As previously stated, we use the PubMed library to fetch the data corresponding to the related Pids. All documents were downloaded, and the meta fields (i.e., title, publish date, keywords), body, and abstract were extracted from the library and stored as XML.


2.1.3. Loading the Data using Elastic Search engine:


The Elasticsearch 5.2.2 engine was used to index and search through the documents. The steps for loading the data are:

(a) Start the Elasticsearch server.

(b) Create an index for storing the data.

(c) Store the fetched data.

We store only the relevant data, which is extracted from the XML.

Handling problems with data loading: Fetching the data and loading it into Elasticsearch is a continuous process, and problems such as a stack overflow or a connection timeout can occur unexpectedly. We therefore have to keep track of what has been loaded and make sure that failures are handled until all required data is properly loaded on the server instance. To overcome this problem, we implemented a “buffer” in which we keep track of the Pids that are not yet loaded on the server. The process continues until the Pids buffer is empty.
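A minimal sketch of such a buffer is given below. The callback `load_one` is a hypothetical stand-in for fetching and indexing one document, and the `max_rounds` limit is an added assumption so that a permanently failing Pid cannot loop forever.

```python
from collections import deque

def load_with_buffer(pids, load_one, max_rounds=5):
    """Retry failed PIDs until the buffer is empty or max_rounds is reached."""
    buffer = deque(pids)
    for _ in range(max_rounds):
        if not buffer:
            break
        failed = deque()
        while buffer:
            pid = buffer.popleft()
            try:
                load_one(pid)             # hypothetical fetch-and-index callback
            except Exception:             # e.g. a connection timeout
                failed.append(pid)        # keep the PID for the next round
        buffer = failed
    return list(buffer)                   # PIDs that never loaded
```

An empty return value means every Pid was eventually loaded; anything left over can be logged and retried in a later run.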


2.2 Implementation and evaluation of the quality of the results and experiments:


(i) For implementing the search and ranking the documents by relevance, we found Elasticsearch to be a suitable approach in our case, since it finds the relevant documents and scores each document accordingly.

(ii) To further filter out irrelevant documents from the above ranking, we used clustering mechanisms: we tried the K-Means and Mean-shift algorithms for clustering the relevant documents found in step (i).


Evaluation methods and measures used for evaluating the accuracy and quality of our experiments and results:


1) The Cochrane Library Standard:


Evaluating a search engine requires a search for which the desired output is known. This evaluation set is known as the gold standard, and we used the Cochrane review library for this purpose. The Cochrane library consists of many clearly structured review articles. For our challenge problem statement, we were provided with a development dataset containing, for each topic, a query built by Cochrane experts and the related PIDs from the PubMed library.

 

Input: For each topic we are provided with:

  1. The Topic-ID;
  2. The title of the review, written by Cochrane experts;
  3. The Boolean query manually constructed by Cochrane experts;
  4. The set of PubMed Document Identifiers (PIDs) returned by running the query in MEDLINE.

The dataset focuses on Diagnostic Test Accuracy (DTA) review articles. Search in this area is generally considered the hardest, and a breakthrough in this field would likely be applicable to other areas as well.


2) Measures


  • Elastic search score

The Elasticsearch score depends on Term Frequency (TF) and Inverse Document Frequency (IDF). It is calculated by Elasticsearch for the given query.


There are several other metrics to quantify the quality of the retrieved results:

  • Recall: Recall expresses the proportion of relevant documents that are retrieved. This measure is also known as sensitivity in the systematic review domain.

\[ \text{recall} = \frac{\#\,\text{relevant documents retrieved}}{\#\,\text{all relevant documents}} \]

  • Precision: Precision expresses the proportion of the retrieved documents that are relevant.

\[ \text{precision} = \frac{\#\,\text{relevant documents retrieved}}{\#\,\text{all documents retrieved}} \]

  • F1: F1 is the harmonic mean of recall and precision.

\[ F_1 = \frac{2 \cdot \text{recall} \cdot \text{precision}}{\text{recall} + \text{precision}} \]

  • Fβ: In areas like systematic reviewing, recall may be more important than precision. The Fβ measure allows more weight to be put on one of the two. It is defined so that a β value of 10 means that recall is 10 times as important as precision.

\[ F_\beta = \frac{(1 + \beta^2) \cdot \text{recall} \cdot \text{precision}}{\text{recall} + \beta^2 \cdot \text{precision}} \]

In this report a value of β = 10 is used, as suggested for systematic review articles.
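To make these measures concrete, here is a small sketch that computes them for one made-up result list. The function names and document identifiers are illustrative only.

```python
def recall(retrieved, relevant):
    """Share of the relevant documents that were retrieved."""
    return len(set(retrieved) & set(relevant)) / len(relevant) if relevant else 0.0

def precision(retrieved, relevant):
    """Share of the retrieved documents that are relevant."""
    return len(set(retrieved) & set(relevant)) / len(retrieved) if retrieved else 0.0

def f_beta(p, r, beta=1.0):
    """F-beta from precision p and recall r; beta > 1 weights recall more heavily."""
    denominator = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denominator if denominator else 0.0

retrieved = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d5"}
p, r = precision(retrieved, relevant), recall(retrieved, relevant)
print(p, r)                      # 0.5 0.6666666666666666
print(f_beta(p, r))              # F1
print(f_beta(p, r, beta=10))     # F10, dominated by recall
```

With β = 10 the score lies very close to the recall value, which is the intended behaviour for systematic reviews.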

  • MAP: All the above measures evaluate search engine performance at a specific rank. They are therefore insufficient for evaluating overall performance across queries, as each query has different precision and Fβ values. The average precision is a measure of performance over the entire ranked list. It equals the area under the precision-recall plot. The average precision can in turn be averaged over queries, resulting in the Mean Average Precision (MAP).
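A common way to compute average precision, sketched here with made-up rankings, is to average the precision values at each rank where a relevant document appears. MAP is then the mean over queries.

```python
def average_precision(ranked, relevant):
    """Mean of precision@k taken at each rank k where a relevant document appears."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),   # AP = (1/1 + 2/3) / 2
    (["d2", "d1"], {"d1"}),               # AP = 1/2
]
print(mean_average_precision(runs))
```

Relevant documents that are never retrieved still count in the denominator, so missing them lowers the score.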

We calculated precision, recall and F scores at the rank of the last retrieved document. F1 appeared to be proportional to Fβ, and MAP appeared to be a good descriptor of our search performance.


2.3 Approaches selected and my own experiments:


We used the following experiments and approaches to search for the relevant documents.


1) Preprocessing, data fetching and data loading:

These have been described in Sections 2.1.1 to 2.1.3.


2) Ranking the documents using Elasticsearch: We use the Elasticsearch score function for ranking the documents. We pass the query as a parameter and rank each document by its Elasticsearch score. Elasticsearch uses Lucene’s practical scoring function, a similarity model based on Term Frequency (tf) and Inverse Document Frequency (idf) that also uses the Vector Space Model (vsm) for multi-term queries.

3) Filtering the documents by clustering and later by query expansion/mining:


(a) By Clustering:

[My Experiment]: I worked on performing and testing the clustering experiments below.

After the documents have been ranked by Elasticsearch using the given query, our next task is to filter them based on their abstracts and full text, keeping only the most relevant documents, which will give the best technical reviews for that query. Starting from the scored documents returned for the Boolean query, we tried different clustering methods in order to find the most relevant documents based on the ranking score generated in the first stage. We employed the following clustering methods to filter the documents. Why we chose these methods and their relevance are explained later:

(i) K-Means Clustering Method: k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.

(ii) Mean-shift Clustering Method: The mean-shift clustering algorithm assigns data points to clusters iteratively by shifting points towards the mode. The mode lies in the region with the highest density of data points, which is why the algorithm is also called a mode-seeking algorithm. For a given set of data points, the algorithm iteratively moves each data point towards the closest cluster centroid. The direction to the closest cluster centroid is determined by where most of the nearby points are. So in each iteration each data point moves closer to where the most points are, which is or will lead to the cluster center. When the algorithm stops, each point is assigned to a cluster.
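To illustrate the first method on ranking scores, here is a small one-dimensional k-means sketch in plain Python. It is not the implementation used in the experiments; the function name, the deterministic initialisation and the sample scores are assumptions made for the example.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Cluster numbers into k groups; returns (centroids, labels)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids over the sorted values.
    centroids = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins the cluster with the nearest mean.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, labels

scores = [9.1, 8.7, 8.9, 1.2, 0.9, 1.1]   # made-up ranking scores
centroids, labels = kmeans_1d(scores, k=2)
print(labels)   # [1, 1, 1, 0, 0, 0]
```

Documents in the high-score cluster would then be kept and the rest filtered out, which is one way to read the filtering step described above.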

This experiment was later studied further and compared with the other available methods for obtaining a better recall, and the clustering approach was eventually discarded.


(b) Query mining:


Each review article in the Cochrane library contains a detailed description of the query used to search through PubMed. Because the features of the PubMed search engine do not match completely with their implementation in Elasticsearch, some changes had to be made. The general approach was to adjust the query in such a way that recall was maintained, at the possible cost of decreasing precision.

• MeSH terms (a popular medical ontology) were omitted from the queries.

• Wildcards within literal strings were replaced with single terms (e.g. ”retinal nerve fib*” → ”retinal nerve” AND fib).

• Near-clauses were converted into AND clauses (e.g. fib* adj2 retinal → fib* AND retinal).
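Two of these rewrites can be sketched with regular expressions. This is an illustration under assumptions: the function name `adapt_query` is hypothetical, phrases are assumed to use straight double quotes, and near-clauses are assumed to be written as `adjN`. MeSH term removal is not covered.

```python
import re

def adapt_query(query):
    """Apply two of the rewrites described above to a PubMed-style query string."""
    # Near-clauses (adjN) become plain AND clauses.
    query = re.sub(r"\badj\d*\b", "AND", query, flags=re.IGNORECASE)

    # A quoted phrase ending in a wildcard word: keep the literal part quoted
    # and move the truncated word out as a separate term.
    def split_phrase(match):
        words = match.group(1).split()
        stem = words.pop().rstrip("*")
        return '"{}" AND {}'.format(" ".join(words), stem) if words else stem

    return re.sub(r'"([^"]*\*)"', split_phrase, query)

print(adapt_query('"retinal nerve fib*"'))   # "retinal nerve" AND fib
print(adapt_query("fib* adj2 retinal"))      # fib* AND retinal
```

Both rewrites loosen the query, which fits the stated aim of maintaining recall at the possible cost of precision.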


2.4 Summary of the test results and future analysis: 


3.1 Recent Experiments and Summary of the Results:

3.1.1. Performed Elasticsearch (Boolean query search) on an extended set of 14 data files:

In our last project report II, we performed our experiments on 4 of the 20 Topic Ids given in our dataset. This time we processed 10 more of the 20 Topic Id files, ran the Elasticsearch query on them, and obtained the ranking and scoring to find the relevant documents.



TOPIC ID        No. of PIDS    TOPIC ID        No. of PIDS
1. CD010438     3249           11. CD010771    316
2. CD011984     8221           12. CD009944    1225
3. CD007427     1469           13. CD008691    1322
4. CD009593     15076          14. CD010632    1508
5. CD011549     12704
6. CD011134     1952
7. CD011975     8227
8. CD009323     3857
9. CD009020     1576
10. CD011548    1706


Loaded more data files: More data was loaded onto the Elasticsearch server. This includes both fetching and loading, which together form a continuous process: data is fetched in XML format using the PubMed library and stored on the Elasticsearch server for further analysis.

3.1.2 Performed “Boolean Query Expansion” used in the Elastic Search:

As we know, a term in the medical dictionary can appear with various co-relations/synonyms, forms and diacritics/accents. We can handle this by adding the “most_fields” type to our Boolean query for ranking and scoring the documents using the Elasticsearch server. The functionality of this field can be described as below:

  • Stemmer: for example, use a stemmer to index jumps, jumping, and jumped as their root form: jump. Then it doesn’t matter if the user searches for jumped; we could still match documents containing jumping.
  • Synonyms: for example, index jump, leap, and hop as synonyms of each other.
  • Remove diacritics, or accents: for example, ésta, está, and esta would all be indexed without accents as esta.

We used the ‘or’ operator for searching long queries. We also tried the “minimum_should_match” parameter, which allows you to specify the number of terms that must match for a document to be considered relevant, but based on its results it was later discarded.
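A request body along these lines could express the query described above. It is a sketch under assumptions: the field names "title" and "abstract", the default size and the helper name build_query are illustrative and do not come from the report. Only multi_match, the most_fields type and the or operator are taken from the text.

```python
import json


def build_query(text, fields=("title", "abstract"), size=1000):
    """Build an Elasticsearch request body for a multi-field search.

    `fields` and `size` are assumed values chosen for illustration.
    """
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                "fields": list(fields),
                "type": "most_fields",  # combine the scores of all matching fields
                "operator": "or",       # any term of a long query may match
            }
        },
    }


# Example: search on the title field only.
print(json.dumps(build_query("retinal nerve fibre layer", fields=("title",)), indent=2))
```

The same helper covers the three search variants used later (title only, abstract only, title and abstract) by changing the fields argument.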

 3.1.3 Qrels File Splitting as per Topic Ids: For evaluating and testing relevant classification of the documents:  

To find each Topic Id and the relevant PIDs related to it, we split the given “qrels” file. The file contains the fields TOPIC, ITERATION, DOCUMENT# and RELEVANCY, where TOPIC is the topic number, ITERATION is a dummy field that is always zero in our case and is not used, DOCUMENT# is the PubMed document identification number (PID), and RELEVANCY is a binary code of 0 for not relevant and 1 for relevant. The “qrels” file is split first by TOPIC and then by RELEVANCY. All resulting files are stored in .csv format for further classification.
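The split described above might look roughly like the following. It is a sketch, not the script used in the report: it assumes whitespace-separated columns in the order TOPIC, ITERATION, DOCUMENT#, RELEVANCY, and the output file naming is hypothetical.

```python
import csv
from collections import defaultdict


def split_qrels(lines):
    """Group qrels rows by topic, then by relevancy (0 or 1)."""
    groups = defaultdict(lambda: defaultdict(list))
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, pid, relevancy = parts
        groups[topic][int(relevancy)].append(pid)
    return groups


def write_groups(groups, out_dir="."):
    """Write one CSV per (topic, relevancy) pair; the file names are an assumption."""
    for topic, by_relevancy in groups.items():
        for relevancy, pids in by_relevancy.items():
            path = f"{out_dir}/{topic}_rel{relevancy}.csv"
            with open(path, "w", newline="") as fh:
                writer = csv.writer(fh)
                writer.writerow(["topic", "pid", "relevancy"])
                writer.writerows((topic, pid, relevancy) for pid in pids)
```

A typical call would be split_qrels(open("qrels.txt")) followed by write_groups(...), where "qrels.txt" stands in for the actual file name.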

3.1.4 Improving the Search [My Experiment]:

Automatic filtering of irrelevant documents can be improved by refining the Boolean search query, but as the data grows day by day, writing a Boolean query becomes more complex as well, so the search engine itself needs to perform a better search. Many discoveries in IR have resulted in enormous improvements in search engines. We propose to improve document retrieval by refining the search engine. The goal is to make search more intuitive and to improve recall (leading to a less costly filtering phase).

  1. Search by Topic-Title: 


We performed a search by TITLE, as the title contains few words and is supposed to represent the most essential keywords. This search is performed on different parts of the database, namely title and abstract: only on the title, only on the abstract, and on both title and abstract. As a result, we get a ranked list of documents for each search.

  2. Search by Topic-Query:


 We performed a search by QUERY. This search is performed on different parts of the database, namely title and abstract: only on the title, only on the abstract, and on both title and abstract. As a result, we get a ranked list of documents for each search.

3.2 Analyzing the output parameters, i.e. Recall/Precision/F1-Score, from the improved search implementation above:

Our search returns a (ranked) list of documents, and there are several metrics to quantify the accuracy of the fit. The measures we used are recall, precision, F1 score and Fβ.

Results of these measures for our experiment on the 3 Topic Ids mentioned below are:

  1. Search with Topic Title with PID Title, PID Abstract and PID Title-Abstract:
Topic ID   Recall    Precision  F1 Score  F beta
CD011975   0.8497    0.1346     0.2325    0.80732
CD010771   0.875     0.156134   0.264984  0.83685
CD011134   0.8744    0.11339    0.200747  0.81993

Search by Topic TITLE on PIDS title field

Topic ID   Recall    Precision  F1 Score  F beta
CD011975   0.8691    0.137772   0.237843  0.825742
CD010771   0.8541    0.152416   0.258675  0.816926
CD011134   0.8511    0.110374   0.195408  0.798125

Search by Topic TITLE on PIDS abstract field


Topic ID   Recall    Precision  F1 Score  F beta
CD011975   0.9499    0.150576   0.25994   0.90248
CD010771   0.875     0.156134   0.26498   0.83685
CD011134   0.92093   0.119421   0.21142   0.86354

 Search by Topic TITLE on PIDS title-abstract field


  2. Search with Topic Query with PID Title, PID Abstract and PID Title-Abstract:
Topic ID   Recall    Precision  F1 Score  F beta
CD011975   0.9483    0.081562   0.15020   0.8580
CD010771   0.833333  0.168067   0.27972   0.8019
CD011134   0.930233  0.107181   0.19221   0.8645

          Search by Topic QUERY on PIDS title field


Topic ID   Recall    Precision  F1 Score  F beta
CD011975   0.8901    0.07656    0.14099   0.8054
CD010771   0.854167  0.172269   0.28671   0.8219
CD011134   0.851163  0.098071   0.17587   0.7910

        Search by Topic QUERY on PIDS abstract field


Topic ID   Recall    Precision  F1 Score  F beta
CD011975   0.9709    0.083507   0.153787  0.87848
CD010771   0.875     0.176471   0.293706  0.84200
CD011134   0.9302    0.107181   0.192215  0.86450

Search by Topic QUERY on PIDS title-abstract field
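
For reference, the measures in the tables above can be computed as in the sketch below, using the standard definitions. The report does not state the value of β. Back-computing from the reported precision and recall suggests β = 10, which should be read as an inference rather than a documented setting. The function names are illustrative.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved PIDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Row CD010771 of the first table: precision 0.156134, recall 0.875.
print(round(f_beta(0.156134, 0.875), 6))           # F1
print(round(f_beta(0.156134, 0.875, beta=10), 5))  # F-beta, assuming beta = 10
```

With such a large β the score is dominated by recall, which fits the systematic-review setting where missing a relevant document is costlier than reading an irrelevant one.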


Results Analysis and Summary :


In our experiments we processed 4 files with Topic Ids CD009944, CD010771, CD010632 and CD008691, containing 1225, 316, 1508 and 1322 PID documents respectively.


1) Ranking/Scoring using ElasticSearch result and score analysis: The graph below (Fig. 1) shows the Elasticsearch results of the Boolean search query for Topic Id CD010632. The ranking of its related PIDs is calculated from the score value (which indicates how relevant the occurrences of the queried terms in each document are).

Fig. 1. Ranking of relevant documents using Elasticsearch


2) Search analysis using Title, Abstract and Title-  Abstract (full text):

  1. Searching By Title:


Fig. 2. Performance measure using search by title

Analysis for Title related search on PID Title, PID Abstract, PID Title+Abstract:

Fig. 2 shows the performance for search using TITLE on title, abstract and full text (title-abstract). As expected, searching on full text improves recall compared with searching on title or abstract alone.

  2. Searching By Query:
Fig. 3. Performance measure using search by query
Analysis for Topic Query search plot on PID Title, PID Abstract and PID Title+Abstract:

Fig. 3 shows the performance for search using QUERY on title, abstract and full text (title-abstract). As expected, searching on full text improves recall compared with searching on title or abstract alone.


  3. Comparison of Title vs Query Recall:
Fig. 4. Performance measure using search by title - abstract

Analysis of the comparison of Title vs Query search on title vs abstract vs title+abstract (Full Text):

Fig. 4 shows the recall obtained after searching with TITLE and with QUERY. We observed that performance improves when searching with QUERY.





3) Comparison of Precision-Recall graph for Query Mining on Text vs Full-text:

Fig. 5 and Fig. 6 below show the performance of the Boolean query on abstract (text) and on full text (title and abstract). As expected, searching on full text improves recall compared with searching on abstract and meta only. However, the cost of this is decreased precision, meaning that more documents have to be examined in order to find the same number of relevant documents.

Fig. 5. Precision-Recall graph using Title 

Fig. 6. Precision-Recall graph using Query  

The graph in Fig. 7 shows the average precision-recall plot when all 14 data files are processed together. It indicates that searching on both title and abstract is the best search criterion, as recall is highest when the search is performed on the full text.

Fig. 7. Average Precision-Recall plot by using Boolean query 

The point with the largest value corresponds to the title-abstract search, followed by abstract only and then title only.

4) All Measures:

Fig. 8. Performance evaluation using all measures 

Fig. 8 shows the evaluation for all measures, averaged over all 14 files. The plot indicates higher recall when using full text rather than only title or only abstract. The MAP value is also lower for full-text search, and according to the Fβ measure, searching on abstract only is more desirable.
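
MAP is the mean, over topics, of the average precision of each ranked result list. A minimal sketch using the standard definition is shown here. The function names are illustrative, and the exact evaluation script used for the figures is not given in the report.

```python
def average_precision(ranked_pids, relevant):
    """Average precision of one ranked list against a set of relevant PIDs."""
    relevant = set(relevant)
    hits = 0
    precision_sum = 0.0
    for rank, pid in enumerate(ranked_pids, start=1):
        if pid in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: iterable of (ranked_pids, relevant) pairs, one per topic."""
    scores = [average_precision(ranked, relevant) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0
```

Because every relevant document that is ranked low pulls the average down, a full-text search that returns many more candidates can raise recall while lowering MAP, which is the pattern described above.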

5) Parameter Tuning using Query expansion: 

We described the motivation for query expansion earlier. While performing it, we recorded the effect of various query parameters on our output measures and found that the choice of parameters, together with some selective inclusions and exclusions, significantly affects our recall and precision curves and thus the overall accuracy. The parameters that we examined and use in our project for fine-tuning the output measures are listed in Appendix A of this report.

3.3 Error Analysis: 

The following approaches, which we tried in our experiments, did not work well enough:

  1. Clustering experiments using K-Means and Meanshift algorithms:

The graphs below show the K-Means and Meanshift clustering results for our one-dimensional data points.


Fig. 9. Meanshift clustering results of CD010632 (Yellow and Red indicate relatively higher relevancy of the documents)


Fig. 10. K-Means clustering results of CD010632 (Yellow being most relevant in this case)


In our previous Project Report II we used various clustering algorithms to filter out the more relevant documents from the ranking and scores generated by querying the Elasticsearch engine.

But our analysis of the clustering results shows that the number of relevant documents produced by that approach is not enough to carry out the recall calculations required in our experiments. So, we are dropping this approach from our project.
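
To make the discarded idea concrete, clustering one-dimensional relevance scores with K-Means can be sketched in plain Python as follows. This is a generic Lloyd-style implementation written for illustration and is not the code behind the figures. The initialisation scheme and the example scores are assumptions.

```python
def kmeans_1d(values, k, iterations=100):
    """Cluster 1-D values into k groups; returns (labels, centers)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers over the sorted values.
    step = max(k - 1, 1)
    centers = [ordered[(i * (len(ordered) - 1)) // step] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign every value to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Move every center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return labels, centers


scores = [0.1, 5.0, 0.2, 5.2]  # made-up Elasticsearch scores
labels, centers = kmeans_1d(scores, k=2)
print(labels)
```

Documents in the highest-scoring cluster would then be kept as the relevant candidates. As noted above, this kept too few relevant documents for the recall calculations.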

  2. Query expansion experiments:

While experimenting with various approaches to expand our query and make it return the most relevant search results, we tried the following and found them unsuitable for our case:


a) Boolean Query vs Best Match: 

Although we have noted how complex it is to construct a Boolean query in this very complex and sensitive domain, and although Best Match seems to be a good approach to the problem, the results obtained with it do not raise the overall recall and precision to the level achieved with the Boolean query. So we discard Best Match querying, which is still less well known in the field of information retrieval, and instead continue our search using the Boolean query technique, which returns significantly good recall when tuned with some parameters.


b) Other failed experiments on query expansion:

We tried to use “minimum_should_match” to prune documents that are less important or contain little information. For the implementation, we manually passed a value, which is useful for some documents but not all: for some documents, the query does not generate any output. So, for the final implementation we dropped this parameter from our Boolean search query.


4.   Conclusion:

Three improvements to document retrieval in systematic reviewing have been suggested. Effectiveness of each suggestion varies, but in general there seem to be many opportunities for improvement. 

Full-text, as opposed to abstract search increases recall, but does so at the cost of decreasing precision. 

Investigating best-match search is difficult, because an alternative query needs to be formulated. Out of the investigated options, the boolean query is the most effective.

Furthermore, our research suggests that tweaking the parameters may lead to even more improvement. Further research should investigate the optimal combination of settings.


Trends in MAP seemed to be consistent with trends observed at the query level. Precision and recall (at the final rank) also described important characteristics, although the exact rank at which they were measured (or the total number of documents retrieved) was needed in order to interpret them right. Fβ measures did not seem to describe the data well. Taken altogether MAP seems to be the best aggregated descriptor. However, a downside for the systematic review domain, is that we cannot increase the relative importance of recall compared with precision.

APPENDIX A

Parameter: Description
stopword: All stopwords such as ‘a’, ‘an’, ‘the’ etc. are ignored.
multi_match: The multi_match query builds on the match query to allow multi-field queries.
type: The most_fields type is most useful when querying multiple fields that contain the same text analyzed in different ways. For instance, the main field may contain synonyms, stemming and terms without diacritics. A second field may contain the original terms, and a third field might contain shingles. By combining scores from all three fields we can match as many documents as possible with the main field, but use the second and third fields to push the most similar results to the top of the list.
fields: It is often useful to index the same field in different ways for different purposes. This is the purpose of multi-fields.
operator: The ‘OR’ operator is used for searching long queries.
size: Defines the maximum number of documents to retrieve.


0
 

Author Comment

by:Shruti Gupta
Technology Assisted Reviews in Empirical Medicine.
0
I've recently decided to try and develop a script that will allow me to search for a specific word or value in a row and copy all of the data in the corresponding column, but it's proven to be too much for myself as I'm very new to Python.

If I could figure out how to search for cells within excel using Python I could go from there, but I haven't found any good resources to teach myself that skill.

Any help would be greatly appreciated.
0

I am in the process of moving legacy codes in c to Intel 64 (ksh) from Unix Tru64 (ksh).  
 I created a basic hello world application in c so I could setup the debugger.   INTEL has instructed me to use the gdb-ia debugger.   I don't know where any manuals are to read this for myself.   I have gotten a few errors along the way that have lead me to add the following lines to my ksh users .profile.  
 export LD_LIBRARY_PATH=/opt/intel/debugger_2017/libipt/intel64/lib
 export PYTHONHOME=/usr
 Currently I am getting an error:  $>gdb-ia ImportError: No module named site
 With every question I ask INTEL, it takes me a business day to get an answer.  So, I've had great luck with Experts Exchange in the past, so I'm back to get some direction.  
 My last correspondence with INTEL was:  
      "Looks like you forgot to source the environment."
        "#source /opt/intel/bin/compilervars.sh intel64"

 Where am I going to add this line of code?   The "#source" makes me think it would go into an include file somewhere.   Can anyone see through this and give me some advice???
 I also have another post out there for mapping Tru64 compiler options to INTEL.  

 Thanks everyone for your help!
0
For a program i have nearly ready i need some code what i can't find in VB, only in Python.

import serial
import operator

def ibisString( str ):
    bytestring = str+" "
    bits = chr(0x7f) + bytestring
    checksum = reduce(lambda x,y:chr((ord(x)^ord(y))&0xff),bits)
    message = bytestring+checksum
    output = map(lambda x:hex(ord(x)),message)
    bytes = bytearray(int(x, 16) for x in output)
    print output
    return bytes

ser = serial.Serial('/dev/ttyUSB0', 1200, parity='E', stopbits=2, timeout=1)
ser.write(ibisString('l300'))



Can anyone help me to convert this Python code to VB?
0
hi All,

kindly help me to write a shell or python script to delete the weblogic files. The files are rotated but need to be deleted which are 30 days old. the find command doesn't delete the files because the number of files are huge.

below is the  .logs format of the weblogic servers
/wls1034/Middleware/user_projects/domains/Domain/servers/server3/logs/

server3.log00585
 server3.log00586
 server3.log00587
server3.log00588
server3.log00589
server3.log00590
server3.log00603
server3.log00604

server3.out00035
server3.out00034
server3.out00036
server3.out00037
0
Hi all,

Need to write a python or shell script for email alert on memory utilisation of the server.
We use the glance tool to monitor the CPU and memory utilisation of our weblogic servers. If memory utilisation reaches 80%, it should alert us by mail so that we can check and restart if required.

below is the output of glance file
B0000A Glance C.04.70.000       04:56:38    x86_64                                                                     Current  Avg  High
------------------------------------------------------------------------------------------------------------------------------------------------------
CPU  Util   SU                                                                                                                       |  1%    3%   27%
Disk Util                                                                                                                            |  0%    0%    7%
Mem  Util   U                      UB     B                                                                                          | 26%   26%   26%
Swap Util                                                                                                                            |  0%    0%    0%
------------------------------------------------------------------------------------------------------------------------------------------------------
                                                                     PROCESS LIST                     …
0
Hello All,

I am new to python and coding overall. I am trying to make sense out of a particular line in the code below. This code will be used to see the files in my recycle bin. In  the second line, what are "curr, dirs, files"? are those fields, arguments of os.walk() function? I have noticed that if I change for instance "curr" for any other name, and change that parameter in the "path" line, the program still works. I am actually a little confused.

import os
rootrec = "C:\\$Recycle.Bin"
for curr, dirs, files in os.walk(rootrec):
    for f in files:
        path = "%s/%s" % (curr,f)
        print(path)

Any clarification will be helpful!
0
I am looking to replace the value between two spaces in a string with a backslash. Can someone assist with the syntax.

E.G. I want

RED 123456789 White

to become

RED\White

Thanks
0
