How to parse RTF files? #22

Open
ZedZipDev opened this issue Nov 12, 2021 · 10 comments

@ZedZipDev

RTF parsing was removed, and I cannot find a way to parse RTF files. Is it still possible?

@tonyqus
Member

tonyqus commented Nov 13, 2021

Yes, it was removed from the .NET Core version because I couldn't find a good RTF parsing library for .NET Core. Give me some time to find one.

@ZedZipDev
Author

Yes, I understand. The RTF format is "wild" ;-). By the way, I have used some code in my SQLCLR function to parse RTF: RTF in, pure text out. It works fine now, and I can share it.

@tonyqus
Member

tonyqus commented Nov 14, 2021

I'm reviewing RockNHawk's RTFTextParser code:
https://github.com/RockNHawk/Toxy.NetCore/blob/netcore/ToxyFramework/Parsers/RTFTextParser.cs

Do you think its ToHtml-based approach would meet your needs?

@ZedZipDev
Author

It is probably what I need, but I created a small test app and tried to parse the RTF files from your \testdata folder, and it fails:

        // Test method in my console app:
        static void TestParseRTFFromSample1()
        {
            // Alternative sample files: "htmlrtf1.rtf", "Simple text.rtf"
            string path = HelperClass.GetRTFPath("Blank.rtf");
            var parser = new RTFTextParser(new ParserContext(path));
            string result = parser.Parse(); // <--- error is here
            Console.WriteLine("Result:{0}", result);
        }

        // RTFTextParser.Parse, where the exception originates:
        public override string Parse()
        {
            using (var fs = new FileStream(Context.Path, FileMode.Open))
            {
                var html = Rtf.ToHtml(fs); // <--- throws
                return html;
            }
        }

The exception text:

System.TypeInitializationException
  HResult=0x80131534
  Message=The type initializer for 'RtfPipe.TextEncoding' threw an exception.
  Source=RtfPipe
  StackTrace:
   at RtfPipe.TextEncoding.get_RtfDefault()
   at RtfPipe.RtfStreamReader..ctor(Stream stream, Int32 bufferSize)
   at RtfPipe.RtfStreamReader..ctor(Stream stream)
   at RtfPipe.RtfSource.op_Implicit(Stream value)
   at Toxy.Parsers.RTFTextParser.Parse() in D:\MyProjects3\NET\Toxy.NetCore-netcore\ToxyFramework\Parsers\RTFTextParser.cs:line 20
   at ConsoleApp1.Program.TestParseRTFFromSample1() in D:\MyProjects3\NET\Toxy.NetCore-netcore\ConsoleApp1\Program.cs:line 21
   at ConsoleApp1.Program.Main(String[] args) in D:\MyProjects3\NET\Toxy.NetCore-netcore\ConsoleApp1\Program.cs:line 13

  This exception was originally thrown at this call stack:
    [External Code]

Inner Exception 1:
ArgumentException: 'Windows-1252' is not a supported encoding name. For information on defining a custom encoding, see the documentation for the Encoding.RegisterProvider method. (Parameter 'name')
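
The inner exception points at `Encoding.RegisterProvider`. On .NET Core, code-page encodings such as Windows-1252 are not available until a provider is registered. Below is a minimal sketch of a caller-side workaround, assuming `CodePagesEncodingProvider` is available (it ships with recent .NET versions; older targets may need the System.Text.Encoding.CodePages NuGet package). The `EncodingSetup` class name is hypothetical and not part of Toxy or RtfPipe.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of Toxy or RtfPipe.
public static class EncodingSetup
{
    public static Encoding GetWindows1252()
    {
        // Make code-page encodings (including Windows-1252) resolvable by name.
        // Registering the same provider more than once is harmless.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("Windows-1252");
    }
}

public static class Program
{
    public static void Main()
    {
        Encoding cp1252 = EncodingSetup.GetWindows1252();
        Console.WriteLine(cp1252.CodePage);
    }
}
```

In a real application the `RegisterProvider` call would normally run once at startup, before any RTF parser is constructed.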


@ZedZipDev
Author

ZedZipDev commented Nov 15, 2021

OK, it turns out this can be fixed either in the calling code or in the RtfPipe NuGet package.
And then it works.
But my idea is to have something that works like an IFilter: it should extract pure text. For example, if I see
Hello World
in Word, I'd like to receive the text
Hello World
but instead I receive something like this:
<div style="font-size:12pt;font-family:&quot;Times New Roman&quot;, serif;"><p style="text-align:justify;font-size:10.5pt;margin:0;"><br>Hello World</p></div>

SQL Server full-text search (FTS) will then index all of the markup words too, which is not correct.
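
As an illustration of the gap between the HTML output and the desired plain text, here is a naive sketch that strips tags from such HTML before indexing. It is an assumption-laden workaround, not what Toxy or RtfPipe does: the `HtmlToText` class is hypothetical, and a regex is not a real HTML parser, so it would mishandle scripts, comments, or a literal `>` inside attribute values.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Toxy or RtfPipe.
public static class HtmlToText
{
    public static string Strip(string html)
    {
        // Replace every tag with a space so adjacent words do not merge.
        string noTags = Regex.Replace(html, "<[^>]+>", " ");
        // Decode entities such as &amp; or &quot; left in the text.
        string decoded = WebUtility.HtmlDecode(noTags);
        // Collapse whitespace runs and trim the ends.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}

public static class Program
{
    public static void Main()
    {
        string html = @"<div style=""font-size:12pt;font-family:&quot;Times New Roman&quot;, serif;""><p style=""text-align:justify;font-size:10.5pt;margin:0;""><br>Hello World</p></div>";
        Console.WriteLine(HtmlToText.Strip(html)); // prints "Hello World"
    }
}
```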

@tonyqus
Member

tonyqus commented Nov 15, 2021

Your requirement makes sense. I also have concerns about RtfPipe: the extracted HTML is not what most users need.

@ZedZipDev
Author

By the way, I have tested this recommendation, and it works fine:

https://stackoverflow.com/questions/46119392/how-do-i-convert-an-rtf-string-to-a-markdown-string-and-back-c-net-core-or/54755138#54755138

It really does extract pure text from an RTF file.

@ZedZipDev
Author

Can you add this implementation (BracketPipe-like, see the previous link) to your framework?

@tonyqus
Member

tonyqus commented Nov 20, 2021

I still have some concerns about this method. It converts RTF to HTML and then converts the HTML to Markdown, so the result is still not plain text.

Instead, I'm investigating whether this post will work.

@tonyqus
Member

tonyqus commented Nov 20, 2021

Extraction of English text works, but something is wrong with the extraction of Far East characters (e.g. Chinese).

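
The thread does not identify the cause of the Far East problem. One possibility, stated here purely as an assumption, is that the extractor does not decode RTF's `\uN` Unicode escapes, which is how RTF writers commonly store non-ASCII text. The sketch below decodes such escapes in a fragment, assuming the default `\uc1` setting (one fallback character after each escape). The `RtfUnicode` class is hypothetical and is not the code under investigation.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Toxy or RtfPipe.
public static class RtfUnicode
{
    // \uN (N is a signed 16-bit decimal), an optional delimiter space, then
    // one optional fallback: either a \'hh hex escape or a single plain character.
    private static readonly Regex Escape =
        new Regex(@"\\u(-?\d+) ?(\\'[0-9a-fA-F]{2}|[^\\{}])?");

    public static string Decode(string rtfFragment)
    {
        return Escape.Replace(rtfFragment, m =>
        {
            int value = int.Parse(m.Groups[1].Value);
            if (value < 0) value += 65536; // negative values wrap into 0x8000..0xFFFF
            return ((char)value).ToString();
        });
    }
}

public static class Program
{
    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        // 20320 = U+4F60 and 22909 = U+597D; '?' is the fallback character.
        Console.WriteLine(RtfUnicode.Decode(@"\u20320?\u22909?"));
    }
}
```

Text stored as `\'hh` code-page bytes would need a different path: the bytes have to be decoded with the document's code page (for example 936 for Simplified Chinese), which on .NET Core also requires a registered code-pages provider.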
