I started working on the question answering system Q4R4, Question Answering for Revit API.
The first step is to import The Building Coder blog posts into Elasticsearch and experiment with full-text queries on them.
Furthermore, we are proud to present yet more enhancements to the revamped version of RevitLookup:
- Q4R4 sources and result presentation
- Importing tbc blog posts into Elasticsearch
- Listing and clearing the Elasticsearch tbc index
- Strip and clean up HTML for JSON document
- Q4R4 GitHub repo and import script
- RevitLookup bug fixes
- RevitLookup icons
One aspect of q4r4 is searching, and another is what results to present and how.
One useful approach that comes to mind might be:
Given a query, return the most relevant results separately from several different resource collections:
- The Revit API help file
- The Revit add-in developer guide
- The Revit SDK sample collection
- Revit API discussion forum threads
- The Building Coder blog posts
- Anonymised ADN case answers
- StackOverflow queries
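As a rough illustration of keeping the results separate per collection, here is a minimal sketch that builds one Elasticsearch search request per collection. The index names (apart from `tbc`), the `text` field name and the `build_searches` helper are assumptions made up for this example; nothing like this exists in q4r4 yet.

```python
import json

# Hypothetical index names, one per resource collection; only 'tbc' appears in this post.
COLLECTIONS = ["api_help", "dev_guide", "sdk_samples", "forum", "tbc"]

def build_searches(query, collections=COLLECTIONS, host="localhost:9200", size=3):
    """Return one (url, body) pair per collection so that the hits stay separate."""
    # Assumes each document stores its full text in a field named 'text'.
    body = json.dumps({"size": size, "query": {"match": {"text": query}}})
    return [("http://%s/%s/_search" % (host, name), body) for name in collections]

searches = build_searches("select all walls")
```

Each pair could then be posted to its own index and the top hits presented in one group per collection.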
As mentioned in the last post on q4r4, I should start off implementing a simple but intelligent search engine without worrying about machine learning or AI in any of its forms.
I am still reading about Elasticsearch and figuring out how to set up an experimental system to try this out.
I started with The Building Coder blog posts, since I have them all in handy text format, either HTML or Markdown, publicly accessible in the tbc GitHub repository.
I want to import all posts' full text into Elasticsearch.
A similar topic is discussed in having fun with Python and Elasticsearch, Part 1.
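To make the import step concrete, here is a minimal sketch of how one post might be wrapped in a JSON document and addressed to the `tbc` index. It is not the actual import script. The `title` and `text` field names and the `build_index_request` helper are assumptions, and the `_doc` endpoint assumes a recent Elasticsearch version; older versions expect a type name in that position.

```python
import json
import urllib.request

def build_index_request(post_id, title, text, host="localhost:9200", index="tbc"):
    """Build (but do not send) a request that stores one post as a JSON document."""
    doc = {"title": title, "text": text}  # illustrative field names
    url = "http://%s/%s/_doc/%d" % (host, index, post_id)
    return urllib.request.Request(
        url,
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

req = build_index_request(10, "Selecting all Walls", "post text goes here")
# urllib.request.urlopen(req) would send it to a running Elasticsearch node.
```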
For testing purposes, it is useful to be able to list all posts imported so far and delete the entire collection to clean up and retry; here are two curl commands to achieve that:
- List all posts:
curl -XGET 'localhost:9200/tbc/_search?pretty'
- Clear the index:
curl -XDELETE 'localhost:9200/tbc?pretty'
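The same two operations can be scripted from Python as well. This is a small sketch using only the standard library; the helper names are made up for illustration, and the requests are built but not sent.

```python
import urllib.request

BASE = "http://localhost:9200/tbc"  # same host and index as in the curl commands

def list_posts_request():
    """Equivalent of: curl -XGET 'localhost:9200/tbc/_search?pretty'"""
    return urllib.request.Request(BASE + "/_search?pretty", method="GET")

def clear_index_request():
    """Equivalent of: curl -XDELETE 'localhost:9200/tbc?pretty'"""
    return urllib.request.Request(BASE + "?pretty", method="DELETE")

# urllib.request.urlopen(list_posts_request()).read() would return the JSON listing.
```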
After reading the main blog post index file, I need to extract the text from the HTML contents and put it into a JSON document for Elasticsearch to imbibe.
Some useful hints for this are provided here:
I settled for a very simple HTML text extractor built around DumbWriter.
It initially wrote the text to standard output, but I was able to pass a file-like StringIO object into the DumbWriter constructor to intercept it.
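For readers on a current Python 3, where the `formatter` module that provided `DumbWriter` is no longer available, here is a comparable sketch using `html.parser` from the standard library instead. It is a swapped-in technique, not the code used for this import; the `TextExtractor` class and `strip_html` function are names invented for this example. It keeps the same idea of writing into a `StringIO` buffer rather than standard output.

```python
from html.parser import HTMLParser
from io import StringIO

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, skipping script and style."""

    def __init__(self):
        super().__init__()
        self.out = StringIO()  # file-like buffer instead of standard output
        self.skip = 0          # nesting depth inside script or style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.out.write(data)

def strip_html(html):
    """Return the plain text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.out.getvalue().split())
```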
On the first attempt, I successfully imported the first nine posts.
Post number 10, Selecting all Walls, failed with a UnicodeDecodeError error message:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 2595: invalid start byte
As it turned out, the offending file was stored in a Windows encoding. I converted it to UTF-8.
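Such a conversion can be done in a couple of lines of Python. The sketch below assumes the Windows encoding was cp1252, which the post does not state; the `to_utf8` helper is just for illustration.

```python
def to_utf8(raw, source_encoding="cp1252"):
    """Re-encode bytes from a Windows code page (assumed cp1252) to UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")

# Byte 0xa0 is a non-breaking space in cp1252 but not a valid UTF-8 start byte.
converted = to_utf8(b"all\xa0walls")
```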
Next, I went one step further and eliminated all non-ASCII characters by adding re.sub( r'[^\x00-\x7f]', r'', my_stringio.getvalue() ) to the result of stripping the HTML tags.
This will presumably corrupt some foreign names, expressions, and text passages. I would not expect those passages to be of any major importance for Revit API related queries anyway.
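Here is a tiny demonstration of that regular expression and the kind of corruption it causes; the sample string and the `strip_non_ascii` wrapper are made up for illustration.

```python
import re

def strip_non_ascii(text):
    """Drop every character outside the 7-bit ASCII range."""
    return re.sub(r'[^\x00-\x7f]', r'', text)

# Accented letters and the non-breaking space are simply removed.
cleaned = strip_non_ascii("Jérémy's wall\u00a0filter")
```

Note that the non-breaking space disappears without leaving an ordinary space behind, so the two neighbouring words are fused.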
I also added an assertion to ensure that the filenames listed in index.html really do exist.
A surprising number of errors were discovered and fixed in the process.
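A check of that kind might look roughly like this. It is a sketch with hypothetical helper names, assuming the filenames have already been extracted from the index.

```python
import os

def missing_files(filenames, root="."):
    """Return the listed filenames that do not exist under root."""
    return [f for f in filenames if not os.path.isfile(os.path.join(root, f))]

def check_index(filenames, root="."):
    """Fail loudly if the index lists a file that is not there."""
    missing = missing_files(filenames, root)
    assert not missing, "listed in index but not found: %s" % ", ".join(missing)
```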
Now I have successfully imported all The Building Coder blog posts into Elasticsearch.
Here is the script in its current state:
The next thing to do is to start experimenting with queries, and presumably with ways to optimise the resulting hits.
While I am fiddling with q4r4, the Revit API discussion forum and other Revit API related issues remain as vibrant as ever.
Some new enhancements were added to our irreplaceable Revit BIM database exploration tool RevitLookup.
In the last few weeks, it was significantly restructured to use Reflection and reduce code duplication:
- Reflection for cross-version compatibility
- Basic clean-up of the new version
- Restore access to extensible storage data
- Further enhancements
- Move CollectorExtElement field initialization to constructor, use LINQ extension methods instead of LINQ syntax
- Get types only from AppDomain.CurrentDomain.BaseDirectory, the Revit.exe directory path. I have a DLL with a name that contains the substring "revit". This library depends on another library in another location. I have an Assembly.Resolve event subscription to load dependencies correctly. In such a case this code fails, because it can't be aware of the correct paths to load referenced libraries.
- Fix bug in getting Application.Documents when more than one document is opened. The Close method must not be called: it successfully closes non-active documents and fails to get information about them.
Many thanks to Alexander for these improvements!
I integrated them into RevitLookup release 2017.0.0.19.
- Added and updated icon package
- Added icon for RevitLookup button in Revit UI
- Added icon to RevitLookup forms
- Revised icons for RevitLookup menu bar
- Added exception handling for System.ArgumentException if the assemblyLocation is null.
Many thanks to Ehsan for these improvements!
I integrated them into RevitLookup release 2017.0.0.20.
The most up-to-date version is always provided in the master branch of the RevitLookup GitHub repository.
If you would like to access any part of the functionality that was removed when switching to the Reflection based approach, please grab it from release 2017.0.0.13 or earlier.
I am also happy to restore any other code that was removed and that you would like preserved. Simply create a pull request for that, explain your need and motivation, and I will gladly merge it back again.