5000 PDFs on Your Computer: Knowledge Management Software



FRANK HOFFMANN



Situation:
- you have hundreds or thousands of mostly PDF files (and maybe fewer MS Word and image files), in various languages (English, Korean, etc.) and stored in various folders on your computer's hard drive, with no fast and easy access to the file or files you are looking for, because they are hard to get into a sensible order

Goals (I will concentrate on PDFs in this posting):
- find files/publications on your computer quickly and easily
- search for text (in both English and East Asian languages) within all files simultaneously, quickly and easily, either within all PDFs and other files on your computer or limited to a certain project folder and its sub-folders
- show search results in an informative way (ideally showing the text around search terms, allowing one-click access to the publication/s and the correct page or pages in the publication/s that were found)
- finally, let's see what knowledge management (KM) programs can do to gather texts, images, and information in general, and how they can serve to share and disseminate our research ... and see what seems really useful and what possibly just sounds 'cool'


Introduction:

Knowledge Management (KM) has existed as an academic discipline since 1991 and includes certain areas of business administration, information systems, and also library and information sciences. But lately KM is no longer limited to disciplines such as business administration. It is becoming an essential issue for the social sciences as well, for cultural studies, East Asian Studies--basically for all disciplines. The result--well, I am not too sure if it is the result or the method--is that technological tools are most essential in shaping how a new generation of students and researchers (and everybody else with Internet access) gathers knowledge, reads and interprets information, evaluates information, structures sources, and reproduces and disseminates knowledge. It seems that what is technologically possible within any given time frame brings major changes in how we go about conducting research. Libraries of books are converted into digital libraries, and this physical change, still under way for these main knowledge "containers," means changes that go far beyond issues of "tools."

On the surface much of this comes across like a democratization of knowledge, an assimilation of differences in the access to knowledge. On a purely structural level I now see an amazing synchronization of how knowledge is being digested and then structured and re-packaged at all levels of society. Your 12-year-old son's or daughter's iTunes "library" has the same structure as your "library" of PDF files, and the way the new KM software tools for students and researchers work was again strongly inspired and shaped by how YouTube or Facebook and other social media websites and tools work--we will come to that. Then again, in terms of "access" we still see very essential differences (with state-run universities in Europe now being far behind, hardly offering their students and faculty any access to academic databases).

Let me come to the practical part: What is KM software and how does it work? In this posting I will concentrate on how to deal with PDF files, as this is the most widely used digital format these days for published or semi-published research. KM software does exactly what Google and social media companies do: it assembles various, otherwise separate computer scripts into one big software program and then, in addition, adds online accessibility and server storage space to create a 'share point' (like e.g. YouTube, where you upload, search for, play, and download videos and interact with other users). KM software allows the user to gather bibliographic information, to up- and download information and full-text documents, to share documents, and to have some social interaction with other users. Unlike the older bibliographic tools (the most successful being EndNote), which are mostly desktop programs that import references and texts from online library catalogs to the user's computer (online to off-line), these new tools (a) try very hard to be two-directional and interactive (from the Web, on the Web, to the Web) and (b) usually come as a bundle of tools rather than one clearly defined desktop application: that is, they come as a Web browser add-on/extension/module, as a hosted website with various server-hosted tools, and also as a bundle of program tools (all under one name though) in your Mac's or Windows' program folder. Bundling a desktop application, online tools (on an Internet server), and some free or paid services into one product is certainly the current trend in all areas. From a technical perspective, this makes sense mostly through the development of Java as a programming language. Tools programmed in Java can easily be used (or easily be created) for cross-platform usage, on various OS, but also on both Internet servers and desktops. It is again Adobe that plays an important role here.
Programming tools like "Adobe Flash Builder" or RunRev's easy-to-use "LiveCode" (formerly "Runtime Revolution," this one is not Java based) are used to deploy cross-platform apps. These programming languages and tools allow anyone with some interest to put together freely available code "libraries" and "modules." The latter program even allows you to integrate code written in other languages, e.g. in C (for example a code library for searches, or another one for OCR conversion in Korean language, and so forth), and the created scripts can then simultaneously be deployed for Mac, Windows, Linux, iOS (iPhone), Android, the Web, and various Internet server OS. In more than one way I think it is really quite essential to understand the basics of how programming works these days, as it shapes our daily life in so many ways. In the same modular way that e.g. Christian or Buddhist iconography with its set of symbols shaped the minds of earlier generations, these modular tools work as patterns that shape the ways we experience the digital world now, also in terms of visual experiences! The general idea for KM programs is to create a software and service bundle that allows the typical student and researcher to streamline his or her workflow by being able to do all the necessary tasks from within one single application and service environment. But in addition, and unlike in an environment of industrial production of physical goods, to be commercially successful any software company will have to follow the exact same patterns, functions, and visual and communicative experiences and mechanisms that Google, YouTube, and Facebook have built up, since here the production worker (the researcher) is at the same time the consumer. We are therefore talking about individually perceived efficiency, pleasure, and stimulation, even conditioning.

Software that we learned to use in the past three decades is usually experienced in terms of "workflow." You click on a file name or icon, a document opens, you write and may highlight some text, do something in the program's menu to set it in bold or italics, you hit a keyboard combination to save the text, and so forth. But with the new KM programs the workflow is no longer that simple. When you get into such an environment you might at first be overwhelmed. This is not a desktop program anymore, and also not a website alone, nor just a service. This is more like a Baroque church or Disneyland, where it is very hard to tell what is real and functional and what is just illusion, dazzling and stimulating your senses: a synesthesia effect, where the stimulation of one sense automatically stimulates another sense simultaneously. Pavlovian conditioning also, where a conditioned stimulus signals a second stimulus (the unconditioned stimulus). You see certain icons, patterns, and setups, and being pre-conditioned by your Facebook, YouTube, etc. experiences, together with the promise to deliver easy access to knowledge, you receive enough stimulation to associate the software product you got yourself "into" with those other products that play on your urge to be cool, in charge, and up-to-date. ... maybe just kitchen psychology ... it just seems so very obvious.


Minimal vs. Maximal Approach:

After looking through all the programs, services, and options I have come to the conclusion that it makes sense to talk about a minimal and a maximal approach ... a do-it-yourself and an all-inclusive package. Let me explain, and my apologies to those reading this who are already familiar with the programs and options. The new generation of programs, maybe indeed best described as "Knowledge management programs & services" as I have it in the intro, are true all-inclusive packages. Not surprisingly, I found that they produce many of the same issues that travel discounters and their all-inclusive offers are known for. Still, right now it seems the future is theirs. Good, slick, efficient, and elegant programming, though, is something else in my eyes. It would mean making more use of the modular capabilities of the operating system (OS) itself and having various programs for various tasks, as not everyone might be convinced of or interested in using every function. Those smaller programs would then work together just like many other programs already do today (that is what I mean by making more use of the modular capabilities of the OS). For example, you can do a simple copy/paste of an Excel table into an MS Word file--no need for MS Word and Excel to be merged into the same application. Instead of a "big fat" all-inclusive program that dictates what you will eat for breakfast and that makes the user dependent on that software (as all his data are only accessible within that one program), a break-down into more applications would be desirable, in my opinion.

(A) Minimal Approach (Do-It-Yourself Tools):

Let me then start with the minimal do-it-yourself approach. This approach should achieve the first two goals listed above, and hopefully also the third, but no more.

PDF: A PDF file is only searchable if it has text. Roughly 30% of PDF files on my computers were created from scanned images though, which means 'pictures' of texts. As such they are not searchable. As you know, you can apply an OCR (Optical Character Recognition) treatment to such scanned text, and you have several options here: (a) convert them into text-only or formatted text files, (b) create a mix of the original scanned image with some textual corrections (possibly through machine intelligence scripts being applied), or (c) add a layer of text to the existing scanned picture files within the PDF; this layer of text is invisible, but you can still perform a word search then and copy text using a PDF Reader program. The only option that should interest us here, in my opinion, is option (c), to convert all image-based PDFs we have into searchable PDFs, without changing the layout of the original document. The last time I looked into OCR technology was 17 or 18 years ago. That technology was not all too impressive then, and it has not gotten that much better since. To ocr a document is still extremely slow, and there are still plenty of errors. I tried out four different technological frameworks and programs based on different libraries last week; none is really satisfactory. Yet, we need to ocr image-based PDFs in order to search their content.


(A) (Part 1): Selecting image-based PDFs:

Before applying a text layer (via OCR) to image-based PDFs, you need to find out which PDFs are indeed image-based. You can open every PDF and try to highlight some text, and if you are not able to do that and copy text, then it's an image-based PDF. But if you have 500, or 1000, or more PDFs, this will just take too much time. Here is what you can do instead (I found no smoother way to do it):

[Mac]
Download the free tool "EasyFind" (http://www.devontechnologies.com/download/products.html, in the Freeware section). Search all your PDF files on your HDD, or better, the PDFs in a specific folder and its sub-folders. Do the following search:
Search for "FontName". In the "Settings" tab, click the "Scan all files" option. And in the search screen, under "Search for:" select "File Contents". See the screenshot below:
[Screenshot: EasyFind.jpg]
(Those PDFs without OCR layers will *not* have the "FontName" tag in them. The search result will therefore only list all PDFs that are text-based or that already were ocr'd and have a text layer applied, in short the ones already searchable that you do *not* need to ocr.)
Now create a new FOLDER and then drag these PDFs to it:
--> Edit --> Select All, then simply drag them all from within that EasyFind window to the new folder you created. You can now start to ocr those PDFs left in the folder you were performing the search in.

[Windows]
Download the free tool "Agent Ransack" (http://www.mythicsoft.com/agentransack).
Open the program and search for "FontName" in the "Containing text" window (and in the OPTIONS tab leave everything unchecked!). You then do a "Select All" (of the results), and hit "DELETE," which will put the already searchable PDFs into the Recycle Bin. Do not forget them there!!! Move them from the Recycle Bin to another new FOLDER. The PDFs left in the folder you searched are the image-based ones that still need OCR.
[Screenshot: AgentRansack.jpg]
NOTE: Under Windows this search is unfortunately much much slower than on a Mac.
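
For those comfortable with a little scripting, the same "FontName" check can be sketched in a few lines of Python that run on both Mac and Windows. This is only a minimal sketch of the heuristic described above, not a replacement for the tools mentioned: the function names (has_font_marker, split_pdfs, move_to) are my own, and the script simply looks for the same string in the raw file contents. It shares the heuristic's weakness: a PDF that stores its font information in compressed form may be wrongly reported as image-based, so spot-check a few results before starting a long OCR run.

```python
from pathlib import Path
import shutil

# The same string the EasyFind / Agent Ransack search looks for.
MARKER = b"FontName"


def has_font_marker(pdf_path):
    """Return True if the raw bytes of the file contain the FontName marker."""
    return MARKER in Path(pdf_path).read_bytes()


def split_pdfs(folder):
    """Walk a folder and its sub-folders; return (searchable, image_based) lists.

    Note: the "*.pdf" pattern is case-sensitive on some systems, so files
    named "SOMETHING.PDF" may be skipped there.
    """
    searchable, image_based = [], []
    for pdf in sorted(Path(folder).rglob("*.pdf")):
        if has_font_marker(pdf):
            searchable.append(pdf)
        else:
            image_based.append(pdf)
    return searchable, image_based


def move_to(files, target):
    """Move files into the target folder (created if it does not exist).

    Assumption: file names are unique; two files with the same name from
    different sub-folders would collide in the target folder.
    """
    target = Path(target)
    target.mkdir(parents=True, exist_ok=True)
    for f in files:
        shutil.move(str(f), str(target / f.name))
```

A typical use would be to call split_pdfs on a project folder, move the searchable list to a new folder outside of it with move_to, and then ocr whatever is left behind.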


Three more notes:

"DEVONthink Pro Office" [Mac]--discussed in part (B) below--will clearly indicate if a PDF is indexed and is searchable or if it is image-based, and that program also quite easily allows you to select one category of PDFs and move them to another folder (to then apply OCR). Using DEVONthink is clearly the easiest solution *if* you use it anyway.

"Zotero" [Mac, Windows], another of the KM programs briefly discussed below, also shows if PDF documents are indexed (have TEXT components) or not. Unfortunately there seems to be no function to create a list by that criterion; it seems one has to click on each document to see if it is or is not indexed.

"Adobe Acrobat X Pro" [Mac, Windows] (see below) is supposed to detect text-based PDFs automatically when doing OCR batch processing, but at least for me this did not work; instead it added an additional unneeded text layer to these files as well, adding to the file size and costing time to process these files.


(A) (Part 2): OCR Processing of PDFs:

[Mac, Windows]
- "Adobe Acrobat X Pro" (current version 10.1.3, price $199, with educational discount $119); home page: http://www.adobe.com/products/acrobatpro.html

This seems the most advanced tool to ocr image-based PDFs into a searchable format. It comes with a built-in OCR plugin called "PaperCapture." You can initiate an OCR conversion by trying to perform a search in a non-searchable PDF: have the PDF document open, and in the main menu go to "Edit" / "Find"--enter anything and start the search. You see some sort of warning box; click "Ok" and then you get a window allowing you to choose various OCR options if you click on "Edit" (in that window): choose the document's main language (Korean, Japanese, and Chinese are also available!). As "PDF Output Style" you need to choose "Searchable Image (Exact)"--that is the option to add a layer of searchable but otherwise invisible text. Then click "Ok" to start the OCR conversion. Note that it even detects vertically printed Korean or Japanese texts without problems.
[Screenshot: AdobeXPro.jpg]
There will certainly be errors, but the text should always be good enough as the base to perform keyword searches.
The lower priced "Standard" edition of Adobe Acrobat also does this OCR job--for a comparison of features see here: http://adobe.com/products/acrobatstanda ... guide.html. The Standard edition, however, does not allow any batch processing. That means you will have to open up each PDF separately and do the OCR process. Please note that OCR processes take a *long* time, with Adobe easily half an hour or longer for a 300-page book or dissertation (with all other programs I tried even up to four times longer!), so, in that case you could not just have your computer process 20 or 50 documents while you go to bed, as you have to be there and monitor the process.
OCR batch processing is very well explained in a video at the Princeton U website: http://blogs.princeton.edu/etc/2012/05/ ... acrobat-x/
You can follow those instructions to set it up--just make sure to choose (other than in the video) "Searchable Image (Exact)" as output style and ENGLISH as language (or whatever main language your PDFs are in). You can then save this as a macro and next time just hit one button to process 20 or 30 or 50 PDFs in the same folder. Note that OCR processing will add 25 to 100% to the size of an image-based PDF.
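
To get a feel for what these numbers mean for a whole collection, they can be turned into a rough back-of-the-envelope estimate. The sketch below only uses the figures given in this posting (about half an hour per 300 pages with Adobe, up to four times longer with other programs, 25 to 100% added file size); the function names and the average page count per document are assumptions for illustration, and your own numbers will differ.

```python
def ocr_hours(num_pdfs, image_share, pages_per_pdf,
              minutes_per_300_pages=30.0, slowdown=1.0):
    """Rough OCR time in hours for a collection.

    num_pdfs              -- total number of PDFs
    image_share           -- fraction that is image-based (e.g. 0.30)
    pages_per_pdf         -- assumed average page count (a guess!)
    minutes_per_300_pages -- about 30 for Adobe, per the text above
    slowdown              -- 1.0 for Adobe, up to 4.0 for other programs
    """
    image_pdfs = num_pdfs * image_share
    minutes = image_pdfs * (pages_per_pdf / 300) * minutes_per_300_pages * slowdown
    return minutes / 60


def size_after_ocr(size_mb, growth=(0.25, 1.0)):
    """Return the (low, high) expected file size in MB after adding a text layer."""
    low, high = growth
    return size_mb * (1 + low), size_mb * (1 + high)


if __name__ == "__main__":
    # One overnight batch of 20 book-length PDFs with Adobe:
    print(ocr_hours(20, 1.0, 300), "hours")
    # The same batch with a program that is four times slower:
    print(ocr_hours(20, 1.0, 300, slowdown=4.0), "hours")
    # Expected size range for a 7 MB scan:
    print(size_after_ocr(7))
```

Even with generous assumptions, a batch of book-length scans fills a night, which is why the batch setup matters.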

Drawbacks:
- Adobe calls this a feature, I call it absurd: No matter what language you choose to ocr, some pages that have images or tables will be reversed, because Adobe Acrobat (in versions 9 and X) is trying to read the text (e.g. the caption to an image arranged vertically), so it wrongly "corrects" page orientation ("auto-rotation"), even if the OCR option is set to "Searchable Image (Exact)." It really should not at all alter the original, but it does. You can only--and that is very easy--later change the page orientation of such pages. To do that you will need to go through every page of every document you converted! Or you live with having documents that suddenly show a page or two or three in horizontal layout.
- If you have two Asian languages in one document only one will be recognized and converted. The setting to "Korean" includes Hanja but not Japanese Hiragana or simplified Chinese (PRC).

Recommendations:
- OCR just takes a lot of time, and it takes a lot of your computer's CPU ... your computer might slow down. Best, if you can, do it on a powerful *multi-processor* machine, ideally 8 or 16 processors. This really is a task where power makes a difference. Let your computer do this while you are away or in bed, it otherwise may slow down your normal work if you let it run in the background.
- Except for documents whose main language is e.g. Korean or Japanese, I suggest not setting the OCR language to Korean (or Japanese). For the typical English (or French or German) East Asian Studies academic text that includes terms and names in East Asian script, you will have to make a compromise. An English language academic work that includes Han'gŭl and Hanja will be ocr'd just fine if you set the language to "Korean." That is more problematic with French or German, as the accented letters or ß will be misread, so you had better set the main language for the OCR process to French or German (whatever applies).


Alternative Programs (to Adobe) to do OCR:

There are many such programs. Most of the new ones now use ABBYY FineReader libraries. Some of those programs are less expensive than Adobe and text accuracy is the same. Others, like the "ABBYY FineReader" editions for Windows and Mac themselves (http://finereader.abbyy.com) cost about the same. The problem I saw, though, is that the resulting PDF files were *huge*--I tried it with my 20-year-old M.A. thesis, 272 pages, and the file size after it was ocr'd had grown from 7 MB to 142 MB (!), while Adobe had just added 3 MB.

- "PDF OCR X" [Mac, Windows], at http://solutions.weblite.ca/pdfocrx/ ($30), is the only real alternative I could find to Adobe that makes some sense. Unlike Adobe Acrobat it is a very small and simple program, basically a visual shell built around PDFBox, an Open Source library. It is extremely simple to use and can do batch processing.
ADVANTAGE (!) (over Adobe): It is too simple to mess around with the original layout in image-based PDFs, no "auto-rotation" of pages.
Drawbacks: Like all non-Adobe OCR programs it is up to 4 x slower than Adobe, and for East Asian language PDFs it cannot add a text layer (which is what we want); it can only create separate text files. The developer told me that this is due to a bug in the mentioned PDFBox library that was never corrected.


(A) (Part 3): Performing Global Text Searches in PDFs & Accessing Pages Found:

To start with, there are major differences here between Mac OSX and all Windows OS. Mac OSX has a built-in index that comes with the operating system: every single word is indexed as soon as you type it or copy it to your Mac. This allows any sort of simple or complex searches in *any* text document at a stunningly fast speed (a few seconds to search the text of 200,000 or a million files on your hard drive).

[Mac]
As just mentioned, Mac OSX (10.4 and up) comes with a super fast search tool that Apple named "Spotlight." You find it in the upper right corner of your screen--click on the magnifier icon. Alternatively, you can also go to "File" / "Find" in the Finder to access it, then showing more search options. See the Wikipedia for a description of "Spotlight": http://en.wikipedia.org/wiki/Spotlight_(software).
To ONLY search PDF files, add "kind:pdf" in the search. For example, to search for "Yesurwŏn" I can do this (in the "Spotlight" search window):
   Yesurwŏn or 藝術院 or 예술원 kind:pdf
Boolean searches are also possible. If you access Spotlight through the Finder ("File" / "Find"), you can do the same through the menu.
[Screenshot: Spotlight.jpg]
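
The same Spotlight index can also be queried from a script, through the "mdfind" command-line tool that ships with Mac OSX. The following is a minimal sketch under a few assumptions: the function names (build_query, search_pdfs) are mine, the query uses Spotlight's raw query syntax with the attributes kMDItemContentType and kMDItemTextContent, and search terms must not contain double quotes. It only works on a Mac and only finds what Spotlight has already indexed.

```python
import subprocess


def build_query(terms):
    """Build a raw Spotlight query: PDF files whose text contains ANY of the terms.

    The trailing "cd" makes each comparison case- and diacritic-insensitive.
    """
    any_term = " || ".join(
        'kMDItemTextContent == "*{}*"cd'.format(term) for term in terms
    )
    return 'kMDItemContentType == "com.adobe.pdf" && ({})'.format(any_term)


def search_pdfs(terms, folder):
    """Run mdfind (Mac only), limited to one folder and its sub-folders.

    Returns the list of matching file paths.
    """
    cmd = ["mdfind", "-onlyin", folder, build_query(terms)]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return [line for line in result.stdout.splitlines() if line]


if __name__ == "__main__":
    # The search from the example above, limited to a hypothetical project folder.
    for path in search_pdfs(["Yesurwŏn", "藝術院", "예술원"], "/Users/me/Research"):
        print(path)
```

The "-onlyin" option does what the plain Spotlight window cannot easily do: it limits the search to one project folder and its sub-folders.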
Still, the problem with Spotlight is that the search interface is neither really well designed nor as functional as it could and should be. The newer Lion (10.7) OS is a little better there, but still not good. I therefore suggest using a tool that is 100% based on Spotlight technology, uses that as its motor, but provides us with a much nicer interface that adds some functions also (e.g. limiting a search to a certain project folder):

There are several such Spotlight-based search engines available: HoudahSpot, Tembo, DataLore, and others.
- "HoudahSpot" (http://www.houdah.com/houdahSpot/, $30) is the one that has most of what we want and need. I recommend this as a replacement for Spotlight.
[Screenshot: HoudahSpot_1.jpg]
It also has a nice preview function which is a much faster way of opening and looking at PDFs than Adobe's Reader (OSX Lion now also has that).
[Screenshot: HoudahSpot_2.jpg]
HoudahSpot does not automatically display the pages of documents where a search term is located as they appear in the PDF document--the only missing function I really would wish for. There is a Mac application that does exactly this, however, and every Mac user (on OSX) has it: "Preview" can be found here: /Applications/Preview
Searching with the "Preview" application will show you all occurrences of a searched term within a PDF document, will highlight the term and list all the pages that it appears on as clickable thumbnails. Unfortunately you can only search one document at a time. But back to HoudahSpot: the program has an option that is almost as good. If you activate "Text Preview" (main menu --> Window --> Show Text Preview), the searched word or term will be displayed within the unformatted surrounding TEXT or text layer of the PDF (see image below, here a search for "Satsuma"). The nice thing is that the same works for all other file types with text, e.g. MS Word, Excel, PowerPoint, plain text files, etc., and thus you can perform one search in all your texts, no matter if PDFs or other file types.
[Screenshot: HoudahSpot_3.jpg]

[Windows]
- "PDF-XChange Viewer" (http://www.tracker-software.com/product ... nge-viewer), a free tool, seems much nicer for both viewing and searching PDFs than Adobe's Reader or Pro edition. You can use it for global searches of all PDFs in a folder, but it is by no means as fast as Spotlight or Spotlight-based tools on a Mac. The more documents and the more pages, the longer a search takes (about the same speed as Acrobat's "Advanced Search" function, which can also be used to search multiple PDFs simultaneously). Even with the free version of "PDF-XChange Viewer" you can ocr image-based PDFs, and there are Korean, Japanese, and Chinese language packs downloadable as well. Using this as an OCR tool, though, takes longer than Adobe.
The very big advantage over the above described Mac tool, HoudahSpot, is that the results of searches will get you a listing of the highlighted search terms within the surrounding text, together with the page info for each quote, AND opening a found document with a mouse-click will get you to that page.
[Screenshot: PDF-XChange.jpg]
Isn't that wonderful! In the above screenshot you see a search for "globalization OR globalisation" in a folder with PDF documents, and "PDF-XChange Viewer" then lists all occurrences found in all documents, highlights the search term/s while displaying the entire line of text in which they appear, and when clicking on any of these in the list the PDF document opens immediately in the built-in viewer, at the correct page, with the term or terms again highlighted. Very neat!
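
The kind of result list described here (document, page, and the search term within its surrounding text) is not hard to reproduce in a small script, which may be of interest to Mac users who miss this function. The following is only a sketch of the idea, not how "PDF-XChange Viewer" works internally. The function names are mine, and reading the PDFs relies on the third-party "pypdf" library, which is an assumption: it must be installed separately, and like every search tool it only sees PDFs that already have a text layer.

```python
import re
from pathlib import Path


def find_in_pages(pages, terms, width=40):
    """Yield (document, page_number, snippet) for every occurrence of any term.

    pages -- iterable of (document_name, page_number, page_text)
    terms -- list of search terms, matched case-insensitively (an OR search)
    width -- number of characters of context shown on each side
    """
    if not terms:
        return
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    for doc, page_no, text in pages:
        flat = " ".join(text.split())  # collapse line breaks and runs of spaces
        for match in pattern.finditer(flat):
            start = max(0, match.start() - width)
            yield doc, page_no, flat[start:match.end() + width]


def pdf_pages(folder):
    """Yield (file_name, page_number, text) for all PDFs below a folder.

    Assumption: the third-party pypdf package is installed.
    """
    from pypdf import PdfReader

    for pdf in sorted(Path(folder).rglob("*.pdf")):
        reader = PdfReader(str(pdf))
        for number, page in enumerate(reader.pages, start=1):
            yield pdf.name, number, page.extract_text() or ""


if __name__ == "__main__":
    # Hypothetical project folder; adjust to your own setup.
    hits = find_in_pages(pdf_pages("Research"), ["globalization", "globalisation"])
    for doc, page_no, snippet in hits:
        print("{} p.{}: ...{}...".format(doc, page_no, snippet))
```

Like the Windows tool, this reads every page on every search, so it gets slower with each additional document; it trades Spotlight's speed for the page numbers and the context.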


My Personal Summary for the 'Minimal Approach':
---------------------------------------------------------

With little cost and effort I can raise the efficiency level of working with a large number of PDF documents on my personal computer and save a lot of time. The main task with this setup is to ocr the 25% to 35% of image-based PDFs I have now or will get later. Using batch processing I best let my computer do this when I am away or sleeping--still, it will take days. (Of course, I can also just leave those docs as non-searchable PDFs.) The major drawbacks differ between Mac and Windows OS: The Mac allows super fast searches, but there seems to be no program that would then display the results with the searched words or phrases in the context in which they appear in the PDFs and get me directly to the pages where searched terms or names were found; for Windows, on the other hand, exactly this is possible, but searches are rather slow, and they get slower with each additional page to be searched. I should also mention that it would be best to do the OCR processing on a new, fast, and powerful desktop computer.



(B) Maximal Approach (All-Inclusive Code Resorts, KM Software):

- EndNote
- DEVONthink Pro Office
- Zotero
- Mendeley
- Papers
- Qiqqa


[Mac, Windows]
- "EndNote" (http://www.endnote.com), $250, educational discounts available; many North American universities provide EndNote for free to their students and professors via volume licensing. EndNote does not quite belong to the new generation of knowledge management software. I mention it here because it tries to be, tries to compete with the new KM brands. It seems rather that Thomson Reuters, the maker of EndNote, feels some market pressure from these new KM software players to stretch itself in this direction, as is also demonstrated by its 2008 lawsuit (dismissed in 2009) making reverse-engineering accusations against "Zotero" (Open Source software), one of the two most popular KM tools to be discussed below. EndNote was for a long time strictly bibliographic software, which pushed its competitors out of the market--for good reasons, I think, because it simply is the better tool. I have used EndNote since 1995, and still appreciate it as the most useful tool to search and import references from hundreds of online sources. It is also very precise when it comes to bibliographic styles (to export data to text documents) and has all the search and selection capabilities one would wish for. It provides the widest range of import filters to libraries, ProQuest, and many other commercial bibliographic databases (over 4,000). EndNote has continuously adjusted its software to the changing digital environments, and that is fine. But its attempt to create more interactivity with the Internet with regard to sharing files and bibliographies (with the "EndNote Web" service) is in my eyes a flop, and using it to download and manage full-text articles is as well.

[Mac]
- "DEVONthink Pro Office" (http://www.devontechnologies.com/produc ... rview.html), $150, but 25% educational discounts available. And there also is a less expensive version: "DEVONthink Personal" (http://www.devontechnologies.com/produc ... sonal.html), $50, but 25% educational discounts available.
"DEVONthink" is mostly about organizing, converting, annotating (e.g. PDF documents or images), and searching documents in various file formats (PDF, MS Word, Excel, Email folders from Apple Mail and Microsoft Outlook, images, Websites and Web archives, etc.) already on your computer. I am certainly interested in all these functions. This is not mainly a bibliographic program then, and it is also not one that tries to help you collect full-text articles from various Internet sources. It still is a KM program though. For a listing of all its claimed functions, see here:
http://www.devontechnologies.com/produc ... rison.html
(Click on the "Capture" / "Edit" / "View" etc. tabs to see all.)
After trying out the full "Pro Office" version of it for about three hours I am still at a loss as to what I can actually do with it. Yes, I can search text in various file formats simultaneously, but I can also do that with the cheaper, more light-weight, and much less obstructive "HoudahSpot" (on the Mac) discussed above. Other than that I found nothing that I would really want to use. The program creates a growing database, and the built-in OCR for PDFs (again using ABBYY FineReader; it does not have Korean) creates huge files, 10 to 15 times the original size for non-indexed PDFs. The only function I found really useful is that it lists PDFs and shows whether they are indexed or not, and it allows you to easily sort them by that criterion and move selected files to a different folder (e.g. in order to then ocr them). It is also easy to just copy selected files to different folders after doing a search, something useful for specific research projects. Other functions I tried out simply did not work for me at all, and I got a little impatient with that program. Not a program I would have any use for as a Mac user, but maybe Windows users would, because global searches on Windows are so slow, unlike on Macs, and because this program does its own indexing of imported documents, which then allows fast searches. Problem is: there is no Windows version of it.


Fine! Now let me finally get to the real new generation of full-fledged knowledge management software programs and services: Zotero, Mendeley, Papers, and Qiqqa. Zotero is the oldest of these, first released in 2006, and it is the only Open Source project among those I am looking at. Mendeley is also free, but is not Open Source, was first released in 2008. These two, Zotero and Mendeley are very clearly the most popular ones in the growing list of KM software and service solutions; Mendeley seems to be the one that is making the race though. And since 2010 there is Qiqqa (slang for "quicker"), also freeware, and programmed by a single young developer, James Jardine, who at first developed it while working at his PhD thesis at U of Cambridge. As of today Qiqqa seems not yet as popular as Zotero and Mendeley, but I would give it a very good change to beat them both in no time (see below). Incidentally I stumbled over a posting by James Jardine, alias Jimme, from July 2010 at the competitor Zotero's forum (http://forums.zotero.org/discussion/133 ... -and-pdfs/, middle of page); and it is truly interesting to read what he wrote about Qiqqa (comparing it to Zotero and Mendeley), explaining what it is *not* trying to be--and today, less than two years later, Qiqqa is exactly all that! That is not meant as criticism. It only is fascinating to see the market forces at work, and the market forces are clearly defined by companies like Google, YouTube, file sharing service companies like Rapidshare, and a few others who usually combine desktop software or web browser add-ons with services and server space allocations with a cloud setup, thus financing their business operations through service fees and advertisements. In any case, other than e.g. a software such as EndNote, what they offer is no longer desktop software but a wide mix of software, free and paid services, and server storage space--all intermingled with each other. 
The entertainment aspect, as one might call it, that 'discounters' like YouTube have brought into the Internet landscape seems to overwhelm everything else. That, again, is anything but surprising, given that the largest customer base is now a no-name generation in their 20s and 30s whose coolness seems reduced to babble talk in half-liners and acronyms over Facebook and cell phones, while, on top of their own cell phone bills, being busy paying off the life-style debts of their parents' generation. So, if you are not part of the no-name generation and open up one of these knowledge management applications and feel slightly disoriented, then that is why. ... Almost forgot: Qiqqa, by the way, is only available for Windows, while the others mentioned also have Mac versions.


[Mac, Windows]
- "Zotero" (http://www.zotero.org), free Open Source program.
As mentioned in the beginning, these are not extensive reviews of all the programs. Zotero has many functions, and explaining and reviewing them all would take pages. You can look at Zotero's "Quick Start Guide" (https://www.zotero.org/support/quick_start_guide) for a good description of all the features. That page has a short 3-minute video at the top that will give you a first overview of what you can do with the program. But allow me also to quote the main description from its home page:
"Zotero (...) is a free, easy-to-use tool to help you collect, organize, cite, and share your research sources. (...) Zotero is the only research tool that automatically senses content (...). Zotero collects all your research in a single, searchable interface. You can add PDFs, images, audio and video files, snapshots of web pages, and really anything else. Zotero automatically indexes the full-text content of your library, enabling you to find exactly what you're looking for with just a few keystrokes. (...) Whether you need to create footnotes, endnotes, in-text citations, or bibliographies, Zotero will do all the dirty work for you. (...) Zotero automatically synchronizes your data across as many devices as you choose. (...) [Web based services:] Zotero groups can be private or public, open or closed. You decide. Create and join research groups to focus on any topic you choose. Each group can share its own research library, complete with files, bibliographic data, notes, and discussion threads."
Physically it comes both as a Web browser plugin and as a stand-alone desktop application; you only need to install one of the two. I recommend going with the Firefox plugin, as you would not have to open a separate program when doing some online research. Together with Mendeley and Qiqqa this is in my opinion one of the three serious attempts to create an all-inclusive KM tool, no doubt about that. My impression, though, is that the development of the program has been relatively slow and many problems have not been solved. As the oldest of the tools I looked at, it seems to be already struggling with the typical issues of having to be backward compatible and to run on more than just one OS--resulting in numerous compromises in functionality and dramatically slowing down the speed of development. Zotero works with plugins, e.g. for MS Word. That is another aged programming approach, and we see that it has the exact same problems that EndNote users have experienced over the years: each new version and update of MS Word or Open Office creates problems with these plugins, and OS upgrades sometimes do as well. There are just too many components of too many programs that all need to work together seamlessly. Then again, looking closer at the bibliographic functionality, I must say that EndNote is far better at that for serious scholars. In Zotero it takes more time to correct auto-imported bibliographic references (with or without attached full-text PDFs) than it does in EndNote, and the output has just too many problems, starting with missing item categories. So, for that part, Zotero is already useless for me.
Its advantage is that it handles all kinds of files. Files can be imported or just linked to. That allows you to e.g. index a digital image/photo collection, adding titles and/or descriptions to images with an MS Word-style text editing tool--so you can later search images by keyword or anything else in your description. The same applies to all other file types. Of course, all those 'notes' attached to images and all other files are stored in the Zotero database, not in any way physically attached to the files that they relate to. If you send these files to someone else he or she won't have any of these notes, nor would backups of those files alone have them. Zotero is also quite good at downloading references from hundreds of library catalogs (incl. WorldCat), and also full-text articles (as so-called 'attachments' to these references) and theses from databases such as JSTOR, Google Scholar, ProQuest, EBSCO, Project MUSE, and others. Full list here: http://www.zotero.org/support/translators. Of course, without an institutional subscription, e.g. by your university, only free resources are downloadable.
Zotero, unfortunately, provides only very limited preview capabilities for the files in its database: clicking on an image file does show it in the browser (if Zotero is installed as a plugin), but MS Word files open only in the external MS Word application. That is very disappointing, and not very up-to-date when looking at its competitors. As with its competitors, you can sign up for free online storage to share bibliographies and texts with other users, and you would have to pay to extend the storage space (fees are similar to average web hosting fees). Finally, a last observation: Zotero, unlike Mendeley or Qiqqa, is not able (not sure why) to search within text-based or indexed PDFs if those are password protected (to disallow e.g. copying or editing). The only way around this is to remove any protection, for which you can use the program "Wondershare PDF Password Remover" (http://www.wondershare.com/pro/pdf-pass ... mover.html), trial version limited to 3 pages; works great with *any* password protection: so much for the efficiency of 'password protection.' [EDIT 06/09/2012: For Windows the following free tool by Reezaa Media also works fine to remove limitations from PDFs: http://download.cnet.com/PDF-Password-R ... 65065.html]

Overall, I think Zotero is nicely unobtrusive to anybody's workflow, especially as a Firefox browser plugin (and here it is better than the others discussed here), useful for ordering files and making notes on files--and here again I personally am considering using it for image files. It is certainly a very big timesaver when searching for and finding publications in databases such as JSTOR or ProQuest that you want to download, and afterwards possibly copy into a specific project folder, and to have bibliographic references created automatically--references that you can then just push over to the documents you need them in. All that is done very nicely and works well as long as there are no further MS Word or Open Office updates. Since Zotero is a very popular Open Source project where many users are involved, we can expect it to be around for a long time. The major shortcomings are the limited preview options and also that searches do not give any further information about exactly where in a document and how often a certain searched word or name or phrase is located.


[Mac, Windows]
- "Mendeley" (http://www.mendeley.com), free, but unlike Zotero not an Open Source program. I am not quite sure if and how the Mendeley company makes a profit, maybe just by offering extended server storage space, but I doubt it. The general description of features is at its base quite similar to that of Zotero, discussed above. The Wikipedia entry summarizes, I quote:
"Mendeley is a freeware desktop and web program for managing and sharing research papers, discovering research data and collaborating online. It combines Mendeley Desktop, a PDF and reference management application (...) with Mendeley Web, an online social network for researchers. Mendeley requires the user to store all basic citation data on its servers - storing copies of documents is at the user's discretion. Upon registration, Mendeley provides the user with 1 GB [edit FH: only 500 MB now] of free web storage space, which is upgradeable at a cost."
For an (all too) brief summary of "Mendeley" see the video at:
http://www.mendeley.com/features/ - ("CLICK TO PLAY")
The program's home page displays a counter that currently lists 243 million (!) user documents, 1.7 million members, and over 158,000 research groups. That *is* impressive. However, many reviews comparing Zotero and Mendeley point out that the large majority of students and researchers using Mendeley (and participating in groups) are in the "hard sciences" ... medicine, biology, chemistry, etc., and accordingly the same is true for the majority of full-text papers. Zotero, on the other hand, is more oriented towards the social sciences. My own insights are until now far too limited to be able to judge how useful it can be to join one of the 'groups' at Mendeley and thus have easier access to bibliographies and full-text papers; it would likely take a few months of working with both to get a better idea of what makes sense here and what does not. Let me therefore just follow with a few notes about the technical setup.
To use Mendeley efficiently you have to register for a Web account with 500 MB space, which then allows you to join interest groups or to create one. Searching for "Korea" as keyword in that group area I got a list of a little under 1,000 registrants who either had Korea given as their physical location (in most cases) or among the keywords as an area of interest. Not too many! That also means that the profiles of all members (including yours) are visible to all others, just as in Facebook, with the very same mechanics and "social media" approach.
Mendeley_0.jpg
You see what each researcher reads, uploads, and publishes, and when, with full statistical analysis.
Mendeley_1.jpg
The program's "Web Importer" lets you import references and documents from over 30 academic databases with a single mouse click (see full list above).
Mendeley_2.jpg
As shown above, one can just click on the green tab "Save reference to library" to import the bibliographic data of such a single reference, but if that article or document is available as full-text PDF (either freely available, or if you are subscribed via your institution to e.g. ProQuest), then you can also import/download the entire document with just a click. As you may also note in the above example, the bibliographic details may not always be correct according to whatever bibliographic style you need--here "during" is capitalized in the title, although many bibliographic styles lowercase prepositions in titles. You will have to correct such errors manually; applying a bibliographic style to format references will not help you there. I should note that EndNote is less complicated and faster for doing such jobs, and you can more easily do it globally (in an entire library). What is further interesting, though, is that Mendeley will, when you click on any reference, also show you related publications--in the above screenshot, see under the header "Related research" and on the right under "Related Full-Text Papers." So, you can from the same screen import a whole bunch of references and PDFs of full-text articles, dissertations, etc. That seems truly useful, as you are likely to find some publication you would have otherwise overlooked. Yet, as already mentioned, Mendeley is more aimed towards the hard sciences at this time, and many databases, e.g. from Korea or Asia, are simply not connected. It would be great if Korean institutions like AKS and others could work with Mendeley to change this!
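For anyone who exports a reference list and wants to fix capitalized prepositions in titles in bulk rather than one by one, here is a minimal Python sketch. It is not a feature of Mendeley or EndNote; the function name, the short preposition list, and the sample title are all made up for illustration, and a real style guide has more rules than this (e.g. "to" as part of an infinitive).

```python
# Hypothetical helper, not part of any reference manager.
# The preposition list is a small illustrative sample, not a complete style rule.
PREPOSITIONS = {
    "during", "between", "among", "through", "under", "over",
    "with", "from", "into", "of", "in", "on", "at", "by", "for", "to",
}


def lowercase_prepositions(title: str) -> str:
    """Lowercase listed prepositions, except at the start or end of the
    title or directly after a colon (where a subtitle begins)."""
    words = title.split()
    fixed = []
    for i, word in enumerate(words):
        is_edge = i == 0 or i == len(words) - 1
        after_colon = i > 0 and words[i - 1].endswith(":")
        if word.lower() in PREPOSITIONS and not is_edge and not after_colon:
            fixed.append(word.lower())
        else:
            fixed.append(word)
    return " ".join(fixed)


# Made-up example title:
print(lowercase_prepositions("Typhus Epidemics During the Colonial Period"))
```

The sketch only touches words from the list and leaves everything else exactly as imported, so it cannot make a correctly capitalized proper noun worse.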
Mendeley_4.jpg
References where full-text PDFs are available can immediately be opened with the great built-in PDF viewer.
Mendeley_5.jpg
Performing a SEARCH within your library (searching all your PDFs), this PDF-viewer will nicely show you where in every text file a searched term or name appears and will highlight each appearance. This is *very* useful and efficient. (Above shows a search for "typhus.")
Mendeley_6.jpg
Mendeley also allows you to do your own highlighting in PDFs, and to attach notes and stickies. I myself am not used to working this way (yet). But I suppose this might just be something to get used to, and I imagine that a student starting to work this way will get really fast with it. All the notes and highlighting you do are then stored in Mendeley's database, *not* directly attached to the PDF file (the PDF file will not be altered in any way). That again means, and this seems important, that once we start using Mendeley we are tied to this software and can only hope it will survive in the long run, as we otherwise might lose our notes etc. (unless there will be some importer for some other program later, which might well be so).
Mendeley_7.jpg
Of course, you can also just 'push' any references from Mendeley (above image on the left) to an MS Word (above on the right) or any other type of document and have them formatted in whatever bibliographic style you need, here the Chicago Manual of Style.

There is much more to say about Mendeley and its many functions; it is a complex program. Let me also mention the main limitation: even though you can add MS Word, Excel, and image files to your Mendeley database, the built-in search function cannot search within the content of anything other than PDF files. This is disappointing! The above discussed "HoudahSpot" [Mac] and "DEVONthink Pro Office" [Mac], as well as the free "Agent Ransack" [Windows] search tools, for example, all find text within MS Word, Excel, PDF, HTML, and plain-ASCII text files (and on the Mac such a global text search can, even within a hundred thousand files, if necessary, be completed within a few seconds, making use of Spotlight technology).
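To illustrate what such a search across a project folder and its sub-folders amounts to, here is a rough Python sketch. It is an assumption-laden toy, not how HoudahSpot, DEVONthink, or Agent Ransack work: it reads only plain-text and HTML files (MS Word, Excel, and PDF files would need extra libraries), it builds no index, and the function name and the suffix list are hypothetical.

```python
from pathlib import Path

# Assumption: only file types that can be read as plain text are covered here.
TEXT_SUFFIXES = {".txt", ".html", ".htm"}


def search_folder(root, term):
    """Return (path, hit_count) pairs for every text file under root,
    including sub-folders, that contains term (case-insensitive)."""
    if not term:
        return []
    term_lower = term.lower()
    matches = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in TEXT_SUFFIXES:
            continue
        # errors="ignore" skips bytes that are not valid UTF-8 instead of failing
        text = path.read_text(encoding="utf-8", errors="ignore")
        count = text.lower().count(term_lower)
        if count:
            matches.append((path, count))
    return matches
```

Called as `search_folder("MyProject", "Korea")` (the folder name is a placeholder), it reads every matching file on each run, which is exactly why indexed tools such as Spotlight are so much faster on large collections.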


[Mac, Windows]
- "Papers" resp. "Papers2" (http://www.mekentosj.com/papers/), $79 (educational discounts available). I will keep this one short: "Papers" turned out to be the most obnoxious piece of software I looked at. It was programmed for the Mac; a Windows version was only released earlier this year. I only tested two Mac versions, the older version 1.9.7 from 2010, I believe, and the latest from last month (v. 2.2.10). The program has *some* of the functions that Mendeley has, but so many functions simply do not work at all, and what works is of little use, in my opinion. On a related forum many Mac users of "Papers" v.1.x who had upgraded the software to version 2.x reported that they afterwards downgraded again. Still, I found neither version helpful. (I will save myself the time of listing all of the many problem areas, malfunctions, displayed but non-implemented functions, and limitations.)


[Windows]
- "Qiqqa" (http://www.qiqqa.com/) is a free Windows-only tool; a Mac version is not being planned, according to the programmer. "Qiqqa" compares to and competes with Mendeley, has basically the same functions as Mendeley, but goes even beyond these (e.g. the "Brainstorming" feature and the integrated ocr-ing of image-based PDF files). I will not list all these here now, but strongly suggest taking some time to watch the 'Qiqqa Tutorial' video (32 minutes) at http://www.youtube.com/watch?v=kYa9KzpVvn8.

Or go through the 'Features' section at http://www.qiqqa.com/About/Features. The comparison chart (Qiqqa vs. EndNote, Zotero, and Mendeley), of course set up to show Qiqqa being superior, is also helpful to get an overview of functions: http://www.qiqqa.com/About/Compare

"Qiqqa" is a great example of streamlined programming coupled with great design, shaped to assist the workflow of most researchers; all this is true for most aspects of the program. But one of the recently implemented new features that makes it stand out and differentiates it from Zotero and even Mendeley, the built-in OCR function, seems to me at the same time the biggest problem. The concept is easy: the user imports whatever PDF he needs and does not pay any attention to whether it is a text-based or already indexed image-based PDF, or whether it is not indexed and thus not searchable, because "Qiqqa" will automatically ocr it to make it searchable. The OCR function can be disabled though (!), but if it is enabled, then the problems are: (a) as discussed before, the OCR process requires a lot of resources, will easily slow down your computer, may cause your machine to become excessively hot, and should ideally be performed on a powerful multi-processor machine (8 or ideally 16)--so, while you work in Qiqqa the program is always busy running OCR, as it takes a very (!) long time. Furthermore, and surprisingly so, Qiqqa also ocr's PDFs that already have a text layer, as it creates its own database and puts the texts into that database. That again means that although Qiqqa works endless hours running those OCR processes, the PDF files will in the end still not have text layers: you're stuck with Qiqqa then and its database. I find that very problematic and not a clean approach. (b) Qiqqa will create copies of all PDFs you import (that you already have on your HDD), then store these copies under different names deeply 'hidden' at C:\Users\[username]\AppData\Local\Quantisle\Qiqqa\Guest\documents\2\...
Even these copies are completely untouched, other than having been renamed to names like "YTY2HSF678KSTR455JYRB34KSY6FR2EA.pdf." The advantage of this is that you can move around, or also delete, the original PDF files you imported without affecting Qiqqa; the drawback is that you will require a lot of space to work with Qiqqa: not only are the files doubled, but the OCR process will also often require 10x the space an image-based PDF file takes. (c) Another drawback is that Qiqqa only works for PDF files, no other file types (no MS Word, no images, etc.). Given the particular way Qiqqa integrates OCR (until now it is the only piece of KM software that integrates it) I think it would be a wonderful tool to run on an external server, either in your own home, if you can allocate a PC as just a Qiqqa server for your needs and then have that run at all times and access it through your home network, or at universities and other institutions, there maybe with a cloud server setup--a single HP blade server (HP ProLiant BL685c G7, or similar) could likely do the job for an entire university.
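To put the space requirement from drawback (b) into numbers, here is a back-of-the-envelope sketch in Python. It simply adds up the three parts named above: the original files, Qiqqa's copies of them, and OCR data at roughly 10x the size of the image-based PDFs. The function name and the sample figures are hypothetical, and the 10x factor is only the rough observation from this posting, not a value documented by Qiqqa.

```python
def estimate_space_gb(library_gb, image_based_gb, ocr_factor=10.0):
    """Rough disk estimate in GB: originals + Qiqqa's copies + OCR data.

    library_gb     -- total size of all PDFs you import
    image_based_gb -- the part of that total made up of image-based PDFs
    ocr_factor     -- assumed OCR overhead per GB of image-based PDFs
    """
    if image_based_gb > library_gb:
        raise ValueError("image-based PDFs cannot exceed the whole library")
    originals = library_gb
    copies = library_gb  # Qiqqa stores its own renamed copy of every PDF
    ocr_data = ocr_factor * image_based_gb
    return originals + copies + ocr_data


# Hypothetical library: 10 GB of PDFs, of which 4 GB are image-based.
print(estimate_space_gb(10, 4))  # 10 + 10 + 40 = 60.0
```

If you delete the originals after importing, subtract the first term; the OCR part still dominates as soon as a larger share of the library is image-based.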
On the positive side, and Qiqqa has many really helpful features, the search function is even better than that of Mendeley.
Qiqqa.jpg
It shows every single instance the search has found, listing all the page numbers of a document on which a word, term, or name was found, and then highlights (in the built-in PDF viewer) all occurrences in the text and in addition displays the text (as ocr'd) at the bottom of each page again (see above screenshot). That function alone makes Qiqqa a highly effective tool for working with your PDF library. As mentioned before, it has pretty much all the other functions Mendeley has as well, including annotating PDFs, quoting references to MS Word, etc. For additional functions, mostly "Brainstorming," please see the video at its home page (a function I consider of more importance for those in the 'hard sciences' than for those in Asian studies).
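The kind of result list described above (page numbers, number of occurrences, and some surrounding text) can be sketched in a few lines of Python. This is only an illustration of the idea, not Qiqqa's actual implementation: it assumes the text of each page has already been extracted into a list of strings, and the function name, the `context` parameter, and the sample pages are made up.

```python
def find_hits(pages, term, context=30):
    """Return (page_number, occurrence_count, snippet) for each page that
    contains term. pages is a list of page texts; page numbers start at 1."""
    if not term:
        return []
    term_lower = term.lower()
    hits = []
    for number, text in enumerate(pages, start=1):
        lowered = text.lower()
        count = lowered.count(term_lower)
        if count == 0:
            continue
        # Snippet: some text around the first occurrence on this page
        first = lowered.find(term_lower)
        start = max(0, first - context)
        end = min(len(text), first + len(term) + context)
        hits.append((number, count, text[start:end]))
    return hits


# Made-up page texts standing in for an already extracted document:
sample_pages = [
    "Cholera was common.",
    "Typhus spread fast; typhus killed many.",
    "No mention here.",
]
for page, count, snippet in find_hits(sample_pages, "typhus"):
    print(page, count, snippet)
```

The search is case-insensitive, so "Typhus" and "typhus" both count; for the sample pages only page 2 is reported, with two occurrences.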


A final thought: Since more and more of the usual sources used for academic research, old ones and new ones, are available at any time on the Web, importing anything to one's personal computer will sooner or later become unnecessary. I think in the hard sciences we are almost there (computer sciences or medicine, for example). Everything is then a matter of access. Access is power, not knowledge, because knowledge is daily being "updated" and acquired knowledge becomes outdated faster and faster. Consequently, it is the access to the open as well as administered pools of collected knowledge that counts. Such a development is certainly 'softer' in Asian studies, but I think we will get there. We will import less and less to our physical desktop computers' hard drives, and more and more just use our PC as an access tool. It then means that programs and services such as Mendeley or Qiqqa might become more of a service tool (not a desktop application), and these and other KM software and service companies might merge with services such as ProQuest. That at least seems the direction.

Thanks for your attention.

Frank :geek:
 